# Apache Spark

Spark is a unified analytics engine for large-scale data processing. It provides
high-level APIs in Scala, Java, Python, and R, and an optimized engine that
supports general computation graphs for data analysis. It also supports a
rich set of higher-level tools including Spark SQL for SQL and DataFrames,
MLlib for machine learning, GraphX for graph processing,
and Structured Streaming for stream processing.

[![Jenkins Build](https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7-hive-2.3/badge/icon)](https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7-hive-2.3)
[![AppVeyor Build](https://img.shields.io/appveyor/ci/ApacheSoftwareFoundation/spark/master.svg?style=plastic&logo=appveyor)](https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark)
[![PySpark Coverage](https://img.shields.io/badge/dynamic/xml.svg?label=pyspark%20coverage&url=https%3A%2F%2Fspark-test.github.io%2Fpyspark-coverage-site&query=%2Fhtml%2Fbody%2Fdiv%5B1%5D%2Fdiv%2Fh1%2Fspan&colorB=brightgreen&style=plastic)](https://spark-test.github.io/pyspark-coverage-site)
## Online Documentation

You can find the latest Spark documentation, including a programming
guide, on the [project web page](https://spark.apache.org/documentation.html).
This README file only contains basic setup instructions.
## Building Spark

Spark is built using [Apache Maven](https://maven.apache.org/).
To build Spark and its example programs, run:

    ./build/mvn -DskipTests clean package

(You do not need to do this if you downloaded a pre-built package.)

More detailed documentation is available from the project site, at
["Building Spark"](https://spark.apache.org/docs/latest/building-spark.html).

For general development tips, including info on developing Spark using an IDE, see ["Useful Developer Tools"](https://spark.apache.org/developer-tools.html).
## Interactive Scala Shell

The easiest way to start using Spark is through the Scala shell:

    ./bin/spark-shell

Try the following command, which should return 1,000,000,000:

    scala> spark.range(1000 * 1000 * 1000).count()
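
As a further illustration, the shell can evaluate simple DataFrame transformations as well; this particular filter is a sketch of ours, not an excerpt from the Spark docs (`$` is available because the shell imports `spark.implicits._` automatically):

    scala> spark.range(100).filter($"id" % 2 === 0).count()  // counts the 50 even values among ids 0..99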
## Interactive Python Shell

Alternatively, if you prefer Python, you can use the Python shell:

    ./bin/pyspark

And run the following command, which should also return 1,000,000,000:

    >>> spark.range(1000 * 1000 * 1000).count()
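
The Python shell supports the full DataFrame API as well; the filter below is an illustrative sketch of ours rather than an excerpt from the Spark docs (`filter` accepts a SQL expression string):

    >>> spark.range(100).filter("id % 2 = 0").count()  # counts the 50 even values among ids 0..99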
## Example Programs

Spark also comes with several sample programs in the `examples` directory.
To run one of them, use `./bin/run-example <class> [params]`. For example:

    ./bin/run-example SparkPi

will run the Pi example locally.
You can set the MASTER environment variable when running examples to submit
examples to a cluster. This can be a mesos:// or spark:// URL,
"yarn" to run on YARN, "local" to run
locally with one thread, or "local[N]" to run locally with N threads. You
can also use an abbreviated class name if the class is in the `examples`
package. For instance:

    MASTER=spark://host:7077 ./bin/run-example SparkPi
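
Similarly, to run the same example locally with multiple threads (the thread count of four here is an arbitrary choice for illustration):

    MASTER="local[4]" ./bin/run-example SparkPi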
Many of the example programs print usage help if no params are given.
## Running Tests

Testing first requires [building Spark](#building-spark). Once Spark is built, tests
can be run using:

    ./dev/run-tests

Please see the guidance on how to
[run tests for a module, or individual tests](https://spark.apache.org/developer-tools.html#individual-tests).
There is also a Kubernetes integration test; see `resource-managers/kubernetes/integration-tests/README.md`.
## A Note About Hadoop Versions

Spark uses the Hadoop core library to talk to HDFS and other Hadoop-supported
storage systems. Because the protocols have changed in different versions of
Hadoop, you must build Spark against the same version that your cluster runs.

Please refer to the build documentation at
["Specifying the Hadoop Version and Enabling YARN"](https://spark.apache.org/docs/latest/building-spark.html#specifying-the-hadoop-version-and-enabling-yarn)
for detailed guidance on building for a particular distribution of Hadoop, including
building for particular Hive and Hive Thriftserver distributions.
## Configuration

Please refer to the [Configuration Guide](https://spark.apache.org/docs/latest/configuration.html)
in the online documentation for an overview on how to configure Spark.

## Contributing

Please review the [Contribution to Spark guide](https://spark.apache.org/contributing.html)
for information on how to get started contributing to the project.