# Apache Spark

Spark is a unified analytics engine for large-scale data processing. It provides
high-level APIs in Scala, Java, Python, and R, and an optimized engine that
supports general computation graphs for data analysis. It also supports a
rich set of higher-level tools including Spark SQL for SQL and DataFrames,
MLlib for machine learning, GraphX for graph processing,
and Structured Streaming for stream processing.

<https://spark.apache.org/>

[![Jenkins Build](https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7-hive-2.3/badge/icon)](https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7-hive-2.3)
[![AppVeyor Build](https://img.shields.io/appveyor/ci/ApacheSoftwareFoundation/spark/master.svg?style=plastic&logo=appveyor)](https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark)
[![PySpark Coverage](https://img.shields.io/badge/dynamic/xml.svg?label=pyspark%20coverage&url=https%3A%2F%2Fspark-test.github.io%2Fpyspark-coverage-site&query=%2Fhtml%2Fbody%2Fdiv%5B1%5D%2Fdiv%2Fh1%2Fspan&colorB=brightgreen&style=plastic)](https://spark-test.github.io/pyspark-coverage-site)

## Online Documentation

You can find the latest Spark documentation, including a programming
guide, on the [project web page](https://spark.apache.org/documentation.html).
This README file only contains basic setup instructions.

## Building Spark

Spark is built using [Apache Maven](https://maven.apache.org/).
To build Spark and its example programs, run:

    ./build/mvn -DskipTests clean package

(You do not need to do this if you downloaded a pre-built package.)

More detailed documentation is available from the project site, at
["Building Spark"](https://spark.apache.org/docs/latest/building-spark.html).

For general development tips, including info on developing Spark using an IDE, see ["Useful Developer Tools"](https://spark.apache.org/developer-tools.html).

## Interactive Scala Shell

The easiest way to start using Spark is through the Scala shell:

    ./bin/spark-shell

Try the following command, which should return 1,000,000,000:

    scala> spark.range(1000 * 1000 * 1000).count()
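
As a slightly larger sketch, assuming Spark's own `README.md` sits in the
directory the shell was launched from, you can load a text file as a Dataset
and count the lines that match a predicate:

    scala> val lines = spark.read.textFile("README.md")
    scala> lines.filter(line => line.contains("Spark")).count()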

## Interactive Python Shell

Alternatively, if you prefer Python, you can use the Python shell:

    ./bin/pyspark

And run the following command, which should also return 1,000,000,000:

    >>> spark.range(1000 * 1000 * 1000).count()
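
The same sketch in Python (again assuming a `README.md` in the launch
directory) uses the DataFrame API, where text files are read into a single
`value` column:

    >>> lines = spark.read.text("README.md")
    >>> lines.filter(lines.value.contains("Spark")).count()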

## Example Programs

Spark also comes with several sample programs in the `examples` directory.
To run one of them, use `./bin/run-example <class> [params]`. For example:

    ./bin/run-example SparkPi

will run the Pi example locally.

You can set the MASTER environment variable when running examples to submit
examples to a cluster. This can be a mesos:// or spark:// URL,
"yarn" to run on YARN, "local" to run locally with one thread, or
"local[N]" to run locally with N threads. You can also use an
abbreviated class name if the class is in the `examples` package.
For instance:

    MASTER=spark://host:7077 ./bin/run-example SparkPi

Many of the example programs print usage help if no params are given.
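
For example, SparkPi accepts an optional number of partitions as its first
argument; the value below is arbitrary:

    ./bin/run-example SparkPi 100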

## Running Tests

Testing first requires [building Spark](#building-spark). Once Spark is built, tests
can be run using:

    ./dev/run-tests

Please see the guidance on how to
[run tests for a module, or individual tests](https://spark.apache.org/developer-tools.html#individual-tests).
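
As a sketch, an individual Scala suite can be run through Maven by narrowing
what the ScalaTest plugin picks up (the suite named here is only an example;
see the page linked above for the authoritative options):

    ./build/mvn test -Dtest=none -DwildcardSuites=org.apache.spark.repl.ReplSuite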

There is also a Kubernetes integration test; see
`resource-managers/kubernetes/integration-tests/README.md`.

## A Note About Hadoop Versions

Spark uses the Hadoop core library to talk to HDFS and other Hadoop-supported
storage systems. Because the protocols have changed in different versions of
Hadoop, you must build Spark against the same version that your cluster runs.

Please refer to the build documentation at
["Specifying the Hadoop Version and Enabling YARN"](https://spark.apache.org/docs/latest/building-spark.html#specifying-the-hadoop-version-and-enabling-yarn)
for detailed guidance on building for a particular distribution of Hadoop, including
building for particular Hive and Hive Thriftserver distributions.
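
As a sketch, a build against a specific Hadoop version combines the
`hadoop.version` property with the YARN profile (the version number below is
only an example):

    ./build/mvn -Pyarn -Dhadoop.version=2.7.4 -DskipTests clean package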

## Configuration

Please refer to the [Configuration Guide](https://spark.apache.org/docs/latest/configuration.html)
in the online documentation for an overview of how to configure Spark.
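
For instance, individual properties can be passed to the shells and to
`spark-submit` with `--conf`; the property and value below are only an
illustration:

    ./bin/spark-shell --conf spark.executor.memory=2g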

## Contributing

Please review the [Contribution to Spark guide](https://spark.apache.org/contributing.html)
for information on how to get started contributing to the project.