---
layout: global
title: Submitting Applications
license: |
  Licensed to the Apache Software Foundation (ASF) under one or more
  contributor license agreements.  See the NOTICE file distributed with
  this work for additional information regarding copyright ownership.
  The ASF licenses this file to You under the Apache License, Version 2.0
  (the "License"); you may not use this file except in compliance with
  the License.  You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.
---

The `spark-submit` script in Spark's `bin` directory is used to launch applications on a cluster.
It can use all of Spark's supported [cluster managers](cluster-overview.html#cluster-manager-types)
through a uniform interface so you don't have to configure your application specially for each one.

# Bundling Your Application's Dependencies
If your code depends on other projects, you will need to package them alongside
your application in order to distribute the code to a Spark cluster. To do this,
create an assembly jar (or "uber" jar) containing your code and its dependencies. Both
[sbt](https://github.com/sbt/sbt-assembly) and
[Maven](http://maven.apache.org/plugins/maven-shade-plugin/)
have assembly plugins. When creating assembly jars, list Spark and Hadoop
as `provided` dependencies; these need not be bundled since they are provided by
the cluster manager at runtime. Once you have an assembled jar you can call the `bin/spark-submit`
script as shown here while passing your jar.
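
For example, here is a minimal sketch of producing the assembly jar before submitting it; the project name, Scala version, and output paths are hypothetical and depend on how your build is configured:

{% highlight bash %}
# With sbt and the sbt-assembly plugin (the output path depends on your project
# name and Scala version; "my-app" is a placeholder):
sbt assembly
ls target/scala-2.12/my-app-assembly-1.0.jar

# Or with Maven and the maven-shade-plugin bound to the package phase
# (the exact jar name under target/ depends on your POM):
mvn -DskipTests package
ls target/*.jar
{% endhighlight %}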

For Python, you can use the `--py-files` argument of `spark-submit` to add `.py`, `.zip` or `.egg`
files to be distributed with your application. If you depend on multiple Python files we recommend
packaging them into a `.zip` or `.egg`.
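
For instance, a minimal sketch of bundling a local Python package into a `.zip` that can later be passed to `--py-files` (`mypackage/` is a placeholder for your own code):

{% highlight bash %}
# Bundle the package directory (containing __init__.py and your modules) into a
# single archive; pass the resulting deps.zip to spark-submit via --py-files.
zip -r deps.zip mypackage/
{% endhighlight %}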

# Launching Applications with spark-submit

Once a user application is bundled, it can be launched using the `bin/spark-submit` script.
This script takes care of setting up the classpath with Spark and its
dependencies, and works with all of the cluster managers and deploy modes that Spark supports:

{% highlight bash %}
./bin/spark-submit \
  --class <main-class> \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  --conf <key>=<value> \
  ... # other options
  <application-jar> \
  [application-arguments]
{% endhighlight %}

Some of the commonly used options are:

* `--class`: The entry point for your application (e.g. `org.apache.spark.examples.SparkPi`)
* `--master`: The [master URL](#master-urls) for the cluster (e.g. `spark://23.195.26.187:7077`)
* `--deploy-mode`: Whether to deploy your driver on the worker nodes (`cluster`) or locally as an external client (`client`) (default: `client`) <b> &#8224; </b>
* `--conf`: Arbitrary Spark configuration property in key=value format. For values that contain spaces, wrap "key=value" in quotes (as shown in the sketch after this list). Multiple configurations should be passed as separate arguments. (e.g. `--conf <key>=<value> --conf <key2>=<value2>`)
* `application-jar`: Path to a bundled jar including your application and all dependencies. The URL must be globally visible inside of your cluster, for instance, an `hdfs://` path or a `file://` path that is present on all nodes.
* `application-arguments`: Arguments passed to the main method of your main class, if any
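
As an illustration of the quoting rule, the sketch below passes one plain property and one whose value contains spaces; the property names are standard Spark settings, while the class and jar path are placeholders:

{% highlight bash %}
# A value without spaces needs no quoting; a value with spaces must keep the
# whole key=value pair inside one quoted argument.
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master local[4] \
  --conf spark.eventLog.enabled=false \
  --conf "spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
  /path/to/examples.jar \
  100
{% endhighlight %}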

<b>&#8224;</b> A common deployment strategy is to submit your application from a gateway machine
that is physically co-located with your worker machines (e.g. the master node in a standalone EC2 cluster).
In this setup, `client` mode is appropriate. In `client` mode, the driver is launched directly
within the `spark-submit` process, which acts as a *client* to the cluster. The input and
output of the application are attached to the console. Thus, this mode is especially suitable
for applications that involve the REPL (e.g. Spark shell).
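
For example, launching the interactive shell against a standalone cluster runs the driver in `client` mode on the machine where you invoke it (the master address is a placeholder):

{% highlight bash %}
# The shell's driver runs locally and connects to the cluster as a client.
./bin/spark-shell --master spark://207.184.161.138:7077
{% endhighlight %}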

Alternatively, if your application is submitted from a machine far from the worker machines (e.g.
locally on your laptop), it is common to use `cluster` mode to minimize network latency between
the driver and the executors. Currently, the standalone mode does not support cluster mode for Python
applications.

For Python applications, simply pass a `.py` file in the place of `<application-jar>` instead of a JAR,
and add Python `.zip`, `.egg` or `.py` files to the search path with `--py-files`.
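
For instance, a sketch that submits a Python application together with a dependency archive such as the `deps.zip` built earlier (the file names are placeholders):

{% highlight bash %}
# The main .py file takes the place of the application jar; extra Python code
# is shipped to the executors via --py-files.
./bin/spark-submit \
  --master spark://207.184.161.138:7077 \
  --py-files deps.zip \
  my_app.py \
  1000
{% endhighlight %}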

There are a few options available that are specific to the
[cluster manager](cluster-overview.html#cluster-manager-types) that is being used.
For example, with a [Spark standalone cluster](spark-standalone.html) with `cluster` deploy mode,
you can also specify `--supervise` to make sure that the driver is automatically restarted if it
fails with a non-zero exit code. To enumerate all such options available to `spark-submit`,
run it with `--help`. Here are a few examples of common options:

{% highlight bash %}
# Run application locally on 8 cores
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master local[8] \
  /path/to/examples.jar \
  100

# Run on a Spark standalone cluster in client deploy mode
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://207.184.161.138:7077 \
  --executor-memory 20G \
  --total-executor-cores 100 \
  /path/to/examples.jar \
  1000

# Run on a Spark standalone cluster in cluster deploy mode with supervise
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://207.184.161.138:7077 \
  --deploy-mode cluster \
  --supervise \
  --executor-memory 20G \
  --total-executor-cores 100 \
  /path/to/examples.jar \
  1000

# Run on a YARN cluster in cluster deploy mode
# (use --deploy-mode client instead for client mode)
export HADOOP_CONF_DIR=XXX
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn \
  --deploy-mode cluster \
  --executor-memory 20G \
  --num-executors 50 \
  /path/to/examples.jar \
  1000

# Run a Python application on a Spark standalone cluster
./bin/spark-submit \
  --master spark://207.184.161.138:7077 \
  examples/src/main/python/pi.py \
  1000

# Run on a Mesos cluster in cluster deploy mode with supervise
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master mesos://207.184.161.138:7077 \
  --deploy-mode cluster \
  --supervise \
  --executor-memory 20G \
  --total-executor-cores 100 \
  http://path/to/examples.jar \
  1000

# Run on a Kubernetes cluster in cluster deploy mode
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master k8s://xx.yy.zz.ww:443 \
  --deploy-mode cluster \
  --executor-memory 20G \
  --num-executors 50 \
  http://path/to/examples.jar \
  1000

{% endhighlight %}

# Master URLs

The master URL passed to Spark can be in one of the following formats:

<table class="table">
<tr><th>Master URL</th><th>Meaning</th></tr>
<tr><td> <code>local</code> </td><td> Run Spark locally with one worker thread (i.e. no parallelism at all). </td></tr>
<tr><td> <code>local[K]</code> </td><td> Run Spark locally with K worker threads (ideally, set this to the number of cores on your machine). </td></tr>
<tr><td> <code>local[K,F]</code> </td><td> Run Spark locally with K worker threads and F maxFailures (see <a href="configuration.html#scheduling">spark.task.maxFailures</a> for an explanation of this variable). </td></tr>
<tr><td> <code>local[*]</code> </td><td> Run Spark locally with as many worker threads as logical cores on your machine. </td></tr>
<tr><td> <code>local[*,F]</code> </td><td> Run Spark locally with as many worker threads as logical cores on your machine and F maxFailures. </td></tr>
<tr><td> <code>spark://HOST:PORT</code> </td><td> Connect to the given <a href="spark-standalone.html">Spark standalone
        cluster</a> master. The port must be whichever one your master is configured to use, which is 7077 by default.
</td></tr>
<tr><td> <code>spark://HOST1:PORT1,HOST2:PORT2</code> </td><td> Connect to the given <a href="spark-standalone.html#standby-masters-with-zookeeper">Spark standalone
        cluster with standby masters with ZooKeeper</a>. The list must contain all the master hosts in the high-availability cluster set up with ZooKeeper. The port must be whichever one each master is configured to use, which is 7077 by default.
</td></tr>
<tr><td> <code>mesos://HOST:PORT</code> </td><td> Connect to the given <a href="running-on-mesos.html">Mesos</a> cluster.
        The port must be whichever one your Mesos master is configured to use, which is 5050 by default.
        Or, for a Mesos cluster using ZooKeeper, use <code>mesos://zk://...</code>.
        To submit with <code>--deploy-mode cluster</code>, the HOST:PORT should be configured to connect to the <a href="running-on-mesos.html#cluster-mode">MesosClusterDispatcher</a>.
</td></tr>
<tr><td> <code>yarn</code> </td><td> Connect to a <a href="running-on-yarn.html">YARN</a> cluster in
        <code>client</code> or <code>cluster</code> mode depending on the value of <code>--deploy-mode</code>.
        The cluster location will be found based on the <code>HADOOP_CONF_DIR</code> or <code>YARN_CONF_DIR</code> variable.
</td></tr>
<tr><td> <code>k8s://HOST:PORT</code> </td><td> Connect to a <a href="running-on-kubernetes.html">Kubernetes</a> cluster in
        <code>cluster</code> mode. Client mode is currently unsupported and will be supported in future releases.
        The <code>HOST</code> and <code>PORT</code> refer to the <a href="https://kubernetes.io/docs/reference/generated/kube-apiserver/">Kubernetes API Server</a>.
        It connects using TLS by default. In order to force it to use an unsecured connection, you can use
        <code>k8s://http://HOST:PORT</code>.
</td></tr>
</table>
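
As an illustration, the same application could be launched with different master URLs; the host names below are placeholders, and the quotes around the bracketed form merely guard against accidental shell globbing in some shells:

{% highlight bash %}
# Local mode with 4 worker threads, giving up on a task after 2 failures (maxFailures=2)
./bin/spark-submit --master "local[4,2]" \
  --class org.apache.spark.examples.SparkPi /path/to/examples.jar 100

# Standalone cluster with two masters in a ZooKeeper high-availability setup
./bin/spark-submit --master spark://host1:7077,host2:7077 \
  --class org.apache.spark.examples.SparkPi /path/to/examples.jar 100
{% endhighlight %}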

# Loading Configuration from a File

The `spark-submit` script can load default [Spark configuration values](configuration.html) from a
properties file and pass them on to your application. By default, it will read options
from `conf/spark-defaults.conf` in the Spark directory. For more detail, see the section on
[loading default configurations](configuration.html#loading-default-configurations).

Loading default Spark configurations this way can obviate the need for certain flags to
`spark-submit`. For instance, if the `spark.master` property is set, you can safely omit the
`--master` flag from `spark-submit`. In general, configuration values explicitly set on a
`SparkConf` take the highest precedence, then flags passed to `spark-submit`, then values in the
defaults file.

If you are ever unclear where configuration options are coming from, you can print out fine-grained
debugging information by running `spark-submit` with the `--verbose` option.
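
For example, a sketch of a `conf/spark-defaults.conf` that supplies the master and executor memory so that the corresponding flags can be dropped from the command line (the values shown are placeholders):

{% highlight bash %}
# Hypothetical contents of conf/spark-defaults.conf (whitespace-separated pairs):
#   spark.master            spark://207.184.161.138:7077
#   spark.executor.memory   4g

# With those defaults in place, --master can be omitted; --verbose prints where
# each effective setting came from.
./bin/spark-submit \
  --verbose \
  --class org.apache.spark.examples.SparkPi \
  /path/to/examples.jar \
  100
{% endhighlight %}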

# Advanced Dependency Management
When using `spark-submit`, the application jar along with any jars included with the `--jars` option
will be automatically transferred to the cluster. URLs supplied after `--jars` must be separated by commas. That list is included in the driver and executor classpaths. Directory expansion does not work with `--jars`.

Spark uses the following URL schemes to allow different strategies for disseminating jars:

- **file:** - Absolute paths and `file:/` URIs are served by the driver's HTTP file server, and
  every executor pulls the file from the driver HTTP server.
- **hdfs:**, **http:**, **https:**, **ftp:** - these pull down files and JARs from the URI as expected
- **local:** - a URI starting with local:/ is expected to exist as a local file on each worker node. This
  means that no network IO will be incurred, and works well for large files/JARs that are pushed to each worker,
  or shared via NFS, GlusterFS, etc.
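
To illustrate the schemes, a sketch that passes extra jars with `--jars`, mixing a `local:` path already present on every node, an `hdfs:` path, and a plain local file (all paths are placeholders):

{% highlight bash %}
# local:      - already installed at this path on every worker, no copying needed
# hdfs:       - fetched from HDFS by the executors
# plain path  - treated as a file: URI and served from the driver's file server
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://207.184.161.138:7077 \
  --jars local:/opt/libs/preinstalled.jar,hdfs:///libs/shared.jar,/path/to/extra.jar \
  /path/to/examples.jar \
  100
{% endhighlight %}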

Note that JARs and files are copied to the working directory for each SparkContext on the executor nodes.
This can use up a significant amount of space over time and will need to be cleaned up. With YARN, cleanup
is handled automatically, and with Spark standalone, automatic cleanup can be configured with the
`spark.worker.cleanup.appDataTtl` property.
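
For standalone workers, one way to enable that cleanup is through `SPARK_WORKER_OPTS` in `conf/spark-env.sh` on each worker; the sketch below is an assumption about a typical setup, and the TTL value (one week in seconds) is arbitrary:

{% highlight bash %}
# conf/spark-env.sh on each standalone worker (example values only)
export SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true -Dspark.worker.cleanup.appDataTtl=604800"
{% endhighlight %}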

Users may also include any other dependencies by supplying a comma-delimited list of Maven coordinates
with `--packages`. All transitive dependencies will be handled when using this command. Additional
repositories (or resolvers in SBT) can be added in a comma-delimited fashion with the flag `--repositories`.
(Note that credentials for password-protected repositories can be supplied in some cases in the repository URI,
such as in `https://user:password@host/...`. Be careful when supplying credentials this way.)
These commands can be used with `pyspark`, `spark-shell`, and `spark-submit` to include Spark Packages.
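
For example, a sketch that pulls a package by its Maven coordinates and adds an extra repository; the coordinate and repository URL are illustrative placeholders:

{% highlight bash %}
# Coordinates follow the groupId:artifactId:version form; transitive
# dependencies are resolved automatically.
./bin/spark-shell \
  --packages org.apache.spark:spark-avro_2.12:3.0.0 \
  --repositories https://repos.example.com/maven2
{% endhighlight %}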

For Python, the equivalent `--py-files` option can be used to distribute `.egg`, `.zip` and `.py` libraries
to executors.

# More Information

Once you have deployed your application, the [cluster mode overview](cluster-overview.html) describes
the components involved in distributed execution, and how to monitor and debug applications.