# Apache Spark

Spark is a unified analytics engine for large-scale data processing. It provides
high-level APIs in Scala, Java, Python, and R, and an optimized engine that
supports general computation graphs for data analysis. It also supports a
rich set of higher-level tools including Spark SQL for SQL and DataFrames,
MLlib for machine learning, GraphX for graph processing,
and Structured Streaming for stream processing.

<https://spark.apache.org/>

[![Jenkins Build](https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7-hive-2.3/badge/icon)](https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7-hive-2.3)
[![AppVeyor Build](https://img.shields.io/appveyor/ci/ApacheSoftwareFoundation/spark/master.svg?style=plastic&logo=appveyor)](https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark)
[![PySpark Coverage](https://img.shields.io/badge/dynamic/xml.svg?label=pyspark%20coverage&url=https%3A%2F%2Fspark-test.github.io%2Fpyspark-coverage-site&query=%2Fhtml%2Fbody%2Fdiv%5B1%5D%2Fdiv%2Fh1%2Fspan&colorB=brightgreen&style=plastic)](https://spark-test.github.io/pyspark-coverage-site)

## Online Documentation

You can find the latest Spark documentation, including a programming
guide, on the [project web page](https://spark.apache.org/documentation.html).
This README file only contains basic setup instructions.

## Building Spark

Spark is built using [Apache Maven](https://maven.apache.org/).
To build Spark and its example programs, run:

    ./build/mvn -DskipTests clean package

(You do not need to do this if you downloaded a pre-built package.)

More detailed documentation is available from the project site, at
["Building Spark"](https://spark.apache.org/docs/latest/building-spark.html).

For general development tips, including info on developing Spark using an IDE, see ["Useful Developer Tools"](https://spark.apache.org/developer-tools.html).

## Interactive Scala Shell

The easiest way to start using Spark is through the Scala shell:

    ./bin/spark-shell

Try the following command, which should return 1,000,000,000:

    scala> spark.range(1000 * 1000 * 1000).count()
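
As a slightly richer smoke test, you can also chain a transformation in the
same shell; for example, counting the even numbers in a small range with the
Dataset API should return 50:

    scala> spark.range(100).filter("id % 2 = 0").count()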

## Interactive Python Shell

Alternatively, if you prefer Python, you can use the Python shell:

    ./bin/pyspark

And run the following command, which should also return 1,000,000,000:

    >>> spark.range(1000 * 1000 * 1000).count()
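
The same quick check works here; counting the even numbers in a small range
should return 50:

    >>> spark.range(100).filter("id % 2 == 0").count()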

## Example Programs

Spark also comes with several sample programs in the `examples` directory.
To run one of them, use `./bin/run-example <class> [params]`. For example:

    ./bin/run-example SparkPi

will run the Pi example locally.

You can set the MASTER environment variable when running examples to submit
examples to a cluster. This can be a mesos:// or spark:// URL,
"yarn" to run on YARN, "local" to run locally with one thread,
or "local[N]" to run locally with N threads. You can also use an
abbreviated class name if the class is in the `examples`
package. For instance:

    MASTER=spark://host:7077 ./bin/run-example SparkPi

Many of the example programs print usage help if no parameters are given.
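
As another illustration, the following runs SparkPi locally on four threads
and passes the number of partitions as a parameter (the trailing argument is
optional):

    MASTER=local[4] ./bin/run-example SparkPi 100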

## Running Tests

Testing first requires [building Spark](#building-spark). Once Spark is built, tests
can be run using:

    ./dev/run-tests

Please see the guidance on how to
[run tests for a module, or individual tests](https://spark.apache.org/developer-tools.html#individual-tests).
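
For example, a single Scala suite in one module can typically be run with
Maven along these lines (an illustrative invocation; see the page above for
the exact flags):

    ./build/mvn test -pl core -DwildcardSuites=org.apache.spark.SparkContextSuite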

There is also a Kubernetes integration test suite; see
`resource-managers/kubernetes/integration-tests/README.md` for how to run it.

## A Note About Hadoop Versions

Spark uses the Hadoop core library to talk to HDFS and other Hadoop-supported
storage systems. Because the protocols have changed in different versions of
Hadoop, you must build Spark against the same version that your cluster runs.

Please refer to the build documentation at
["Specifying the Hadoop Version and Enabling YARN"](https://spark.apache.org/docs/latest/building-spark.html#specifying-the-hadoop-version-and-enabling-yarn)
for detailed guidance on building for a particular distribution of Hadoop, including
building for particular Hive and Hive Thriftserver distributions.
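
For example, a build against a specific Hadoop version with YARN support
enabled might look like this (the available profiles and properties are
described on the page above):

    ./build/mvn -Pyarn -Dhadoop.version=2.7.4 -DskipTests clean package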

## Configuration

Please refer to the [Configuration Guide](https://spark.apache.org/docs/latest/configuration.html)
in the online documentation for an overview of how to configure Spark.
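
As a small illustration, individual settings can be passed on the command line
with `--conf`, or set in `conf/spark-defaults.conf`:

    ./bin/spark-shell --conf spark.executor.memory=4g

    # Equivalent entry in conf/spark-defaults.conf
    spark.executor.memory    4g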

## Contributing

Please review the [Contribution to Spark guide](https://spark.apache.org/contributing.html)
for information on how to get started contributing to the project.