---
layout: global
displayTitle: Spark Overview
title: Overview
description: Apache Spark SPARK_VERSION_SHORT documentation homepage
license: |
  Licensed to the Apache Software Foundation (ASF) under one or more
  contributor license agreements.  See the NOTICE file distributed with
  this work for additional information regarding copyright ownership.
  The ASF licenses this file to You under the Apache License, Version 2.0
  (the "License"); you may not use this file except in compliance with
  the License.  You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.
---

Apache Spark is a unified analytics engine for large-scale data processing.
It provides high-level APIs in Java, Scala, Python and R,
and an optimized engine that supports general execution graphs.
It also supports a rich set of higher-level tools including [Spark SQL](sql-programming-guide.html) for SQL and structured data processing, [MLlib](ml-guide.html) for machine learning, [GraphX](graphx-programming-guide.html) for graph processing, and [Structured Streaming](structured-streaming-programming-guide.html) for incremental computation and stream processing.
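
As a minimal illustration of these high-level APIs, here is a sketch using the Python API in local mode (assuming `pyspark` is installed; the same operations exist in Scala, Java, and R):

    from pyspark.sql import SparkSession

    # the SparkSession is the entry point to the DataFrame and SQL APIs
    spark = SparkSession.builder.appName("overview-sketch").getOrCreate()

    df = spark.range(1000)                        # a DataFrame with a single "id" column
    evens = df.filter(df["id"] % 2 == 0).count()  # a relational query; evaluates to 500
    spark.stop()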

# Security

Security in Spark is OFF by default, which can leave an unconfigured deployment vulnerable to attack.
Please see [Spark Security](security.html) before downloading and running Spark.

# Downloading

Get Spark from the [downloads page](https://spark.apache.org/downloads.html) of the project website. This documentation is for Spark version {{site.SPARK_VERSION}}. Spark uses Hadoop's client libraries for HDFS and YARN. Downloads are pre-packaged for a handful of popular Hadoop versions.
Users can also download a "Hadoop free" binary and run Spark with any Hadoop version
[by augmenting Spark's classpath](hadoop-provided.html).
Scala and Java users can include Spark in their projects using its Maven coordinates, and Python users can install Spark from PyPI.
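
For example, a Scala or Java build would reference coordinates of the following form (a sketch; `spark-core` is only one of the published artifacts, pick the modules your project actually uses):

    groupId: org.apache.spark
    artifactId: spark-core_{{site.SCALA_BINARY_VERSION}}
    version: {{site.SPARK_VERSION}}

while Python users can install the `pyspark` package from PyPI:

    pip install pyspark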

If you'd like to build Spark from source, visit [Building Spark](building-spark.html).

Spark runs on both Windows and UNIX-like systems (e.g. Linux, Mac OS), and it should run on any platform that runs a supported version of Java. This should include JVMs on x86_64 and ARM64. It's easy to run locally on one machine: all you need is to have `java` installed on your system `PATH`, or the `JAVA_HOME` environment variable pointing to a Java installation.
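
As a quick sanity check before running anything (a sketch for POSIX shells; the installation path below is just an example):

    # verify that a JVM is visible on the PATH
    java -version

    # ...or point Spark at a specific installation instead
    export JAVA_HOME=/usr/lib/jvm/java-11-openjdk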

Spark runs on Java 8/11, Scala 2.12, Python 2.7+/3.4+ and R 3.1+.
Support for Java 8 prior to version 8u92 is deprecated as of Spark 3.0.0.
Support for Python 2, and for Python 3 prior to version 3.6, is deprecated as of Spark 3.0.0.
Support for R prior to version 3.4 is deprecated as of Spark 3.0.0.
For the Scala API, Spark {{site.SPARK_VERSION}}
uses Scala {{site.SCALA_BINARY_VERSION}}. You will need to use a compatible Scala version
({{site.SCALA_BINARY_VERSION}}.x).

For Java 11, `-Dio.netty.tryReflectionSetAccessible=true` is additionally required for the Apache Arrow library. This prevents `java.lang.UnsupportedOperationException: sun.misc.Unsafe or java.nio.DirectByteBuffer.(long, int) not available` when Apache Arrow uses Netty internally.
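
One way to pass the flag is through the extra JVM options of the driver and executors at submit time (a sketch; `your-app.jar` is a placeholder):

    ./bin/spark-submit \
      --conf "spark.driver.extraJavaOptions=-Dio.netty.tryReflectionSetAccessible=true" \
      --conf "spark.executor.extraJavaOptions=-Dio.netty.tryReflectionSetAccessible=true" \
      your-app.jar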

# Running the Examples and Shell

Spark comes with several sample programs. Scala, Java, Python and R examples are in the
`examples/src/main` directory. To run one of the Java or Scala sample programs, use
`bin/run-example <class> [params]` in the top-level Spark directory. (Behind the scenes, this
invokes the more general
[`spark-submit` script](submitting-applications.html) for
launching applications.) For example:

    ./bin/run-example SparkPi 10

You can also run Spark interactively through a modified version of the Scala shell. This is a
great way to learn the framework.

    ./bin/spark-shell --master local[2]

The `--master` option specifies the
[master URL for a distributed cluster](submitting-applications.html#master-urls), or `local` to run
locally with one thread, or `local[N]` to run locally with N threads. You should start by using
`local` for testing. For a full list of options, run Spark shell with the `--help` option.
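
For instance, these are all valid `--master` values (the host and port below are placeholders):

    --master local               # one thread, no cluster
    --master local[4]            # four worker threads on the local machine
    --master spark://host:7077   # connect to a standalone cluster master
    --master yarn                # run on a Hadoop YARN cluster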
0077 
0078 Spark also provides a Python API. To run Spark interactively in a Python interpreter, use
0079 `bin/pyspark`:
0080 
0081     ./bin/pyspark --master local[2]
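
Inside the PySpark shell a `SparkSession` is already defined as `spark`, so a first session might look like this sketch:

    >>> spark.range(100).filter("id % 2 = 0").count()
    50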

Example applications are also provided in Python. For example:

    ./bin/spark-submit examples/src/main/python/pi.py 10

Spark has also provided an [R API](sparkr.html) since version 1.4 (only the DataFrame APIs are included).
To run Spark interactively in an R interpreter, use `bin/sparkR`:

    ./bin/sparkR --master local[2]

Example applications are also provided in R. For example:

    ./bin/spark-submit examples/src/main/r/dataframe.R

# Launching on a Cluster

The Spark [cluster mode overview](cluster-overview.html) explains the key concepts in running on a cluster.
Spark can run by itself, or on top of several existing cluster managers. It currently provides several
options for deployment (see the sketch after this list):

* [Standalone Deploy Mode](spark-standalone.html): the simplest way to deploy Spark on a private cluster
* [Apache Mesos](running-on-mesos.html)
* [Hadoop YARN](running-on-yarn.html)
* [Kubernetes](running-on-kubernetes.html)
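
As a rough sketch, the same application can be submitted to each of these managers by changing only the master URL (`app.py`, host names, and ports are placeholders):

    ./bin/spark-submit --master spark://host:7077 app.py        # standalone
    ./bin/spark-submit --master mesos://host:5050 app.py        # Mesos
    ./bin/spark-submit --master yarn app.py                     # YARN
    ./bin/spark-submit --master k8s://https://host:6443 app.py  # Kubernetes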

# Where to Go from Here

**Programming Guides:**

* [Quick Start](quick-start.html): a quick introduction to the Spark API; start here!
* [RDD Programming Guide](rdd-programming-guide.html): overview of Spark basics - RDDs (core but old API), accumulators, and broadcast variables
* [Spark SQL, Datasets, and DataFrames](sql-programming-guide.html): processing structured data with relational queries (newer API than RDDs)
* [Structured Streaming](structured-streaming-programming-guide.html): processing structured data streams with relational queries (using Datasets and DataFrames, newer API than DStreams)
* [Spark Streaming](streaming-programming-guide.html): processing data streams using DStreams (old API)
* [MLlib](ml-guide.html): applying machine learning algorithms
* [GraphX](graphx-programming-guide.html): processing graphs

**API Docs:**

* [Spark Scala API (Scaladoc)](api/scala/org/apache/spark/index.html)
* [Spark Java API (Javadoc)](api/java/index.html)
* [Spark Python API (Sphinx)](api/python/index.html)
* [Spark R API (Roxygen2)](api/R/index.html)
* [Spark SQL, Built-in Functions (MkDocs)](api/sql/index.html)

**Deployment Guides:**

* [Cluster Overview](cluster-overview.html): overview of concepts and components when running on a cluster
* [Submitting Applications](submitting-applications.html): packaging and deploying applications
* Deployment modes:
  * [Amazon EC2](https://github.com/amplab/spark-ec2): scripts that let you launch a cluster on EC2 in about 5 minutes
  * [Standalone Deploy Mode](spark-standalone.html): launch a standalone cluster quickly without a third-party cluster manager
  * [Mesos](running-on-mesos.html): deploy a private cluster using [Apache Mesos](https://mesos.apache.org)
  * [YARN](running-on-yarn.html): deploy Spark on top of Hadoop NextGen (YARN)
  * [Kubernetes](running-on-kubernetes.html): deploy Spark on top of Kubernetes

**Other Documents:**

* [Configuration](configuration.html): customize Spark via its configuration system
* [Monitoring](monitoring.html): track the behavior of your applications
* [Tuning Guide](tuning.html): best practices to optimize performance and memory use
* [Job Scheduling](job-scheduling.html): scheduling resources across and within Spark applications
* [Security](security.html): Spark security support
* [Hardware Provisioning](hardware-provisioning.html): recommendations for cluster hardware
* Integration with other storage systems:
  * [Cloud Infrastructures](cloud-integration.html)
  * [OpenStack Swift](storage-openstack-swift.html)
* [Migration Guide](migration-guide.html): migration guides for Spark components
* [Building Spark](building-spark.html): build Spark using the Maven system
* [Contributing to Spark](https://spark.apache.org/contributing.html)
* [Third Party Projects](https://spark.apache.org/third-party-projects.html): related third party Spark projects

**External Resources:**

* [Spark Homepage](https://spark.apache.org)
* [Spark Community](https://spark.apache.org/community.html) resources, including local meetups
* [StackOverflow tag `apache-spark`](http://stackoverflow.com/questions/tagged/apache-spark)
* [Mailing Lists](https://spark.apache.org/mailing-lists.html): ask questions about Spark here
* [AMP Camps](http://ampcamp.berkeley.edu/): a series of training camps at UC Berkeley that featured talks and
  exercises about Spark, Spark Streaming, Mesos, and more. [Videos](http://ampcamp.berkeley.edu/6/),
  [slides](http://ampcamp.berkeley.edu/6/) and [exercises](http://ampcamp.berkeley.edu/6/exercises/) are
  available online for free.
* [Code Examples](https://spark.apache.org/examples.html): more are also available in the `examples` subfolder of Spark ([Scala]({{site.SPARK_GITHUB_URL}}/tree/master/examples/src/main/scala/org/apache/spark/examples),
  [Java]({{site.SPARK_GITHUB_URL}}/tree/master/examples/src/main/java/org/apache/spark/examples),
  [Python]({{site.SPARK_GITHUB_URL}}/tree/master/examples/src/main/python),
  [R]({{site.SPARK_GITHUB_URL}}/tree/master/examples/src/main/r))