0001 ---
0002 layout: global
0003 title: Running Spark on YARN
0004 license: |
0005 Licensed to the Apache Software Foundation (ASF) under one or more
0006 contributor license agreements. See the NOTICE file distributed with
0007 this work for additional information regarding copyright ownership.
0008 The ASF licenses this file to You under the Apache License, Version 2.0
0009 (the "License"); you may not use this file except in compliance with
0010 the License. You may obtain a copy of the License at
0011
0012 http://www.apache.org/licenses/LICENSE-2.0
0013
0014 Unless required by applicable law or agreed to in writing, software
0015 distributed under the License is distributed on an "AS IS" BASIS,
0016 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
0017 See the License for the specific language governing permissions and
0018 limitations under the License.
0019 ---
0020 * This will become a table of contents (this text will be scraped).
0021 {:toc}
0022
0023 Support for running on [YARN (Hadoop
0024 NextGen)](http://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/YARN.html)
0025 was added to Spark in version 0.6.0, and improved in subsequent releases.
0026
0027 # Security
0028
0029 Security in Spark is OFF by default. This could mean you are vulnerable to attack by default.
0030 Please see [Spark Security](security.html) and the specific security sections in this doc before running Spark.
0031
0032 # Launching Spark on YARN
0033
0034 Ensure that `HADOOP_CONF_DIR` or `YARN_CONF_DIR` points to the directory which contains the (client side) configuration files for the Hadoop cluster.
0035 These configs are used to write to HDFS and connect to the YARN ResourceManager. The
0036 configuration contained in this directory will be distributed to the YARN cluster so that all
0037 containers used by the application use the same configuration. If the configuration references
0038 Java system properties or environment variables not managed by YARN, they should also be set in the
0039 Spark application's configuration (driver, executors, and the AM when running in client mode).
0040
0041 There are two deploy modes that can be used to launch Spark applications on YARN. In `cluster` mode, the Spark driver runs inside an application master process which is managed by YARN on the cluster, and the client can go away after initiating the application. In `client` mode, the driver runs in the client process, and the application master is only used for requesting resources from YARN.
0042
0043 Unlike other cluster managers supported by Spark in which the master's address is specified in the `--master`
0044 parameter, in YARN mode the ResourceManager's address is picked up from the Hadoop configuration.
0045 Thus, the `--master` parameter is `yarn`.
0046
0047 To launch a Spark application in `cluster` mode:
0048
0049 $ ./bin/spark-submit --class path.to.your.Class --master yarn --deploy-mode cluster [options] <app jar> [app options]
0050
0051 For example:
0052
0053 $ ./bin/spark-submit --class org.apache.spark.examples.SparkPi \
0054 --master yarn \
0055 --deploy-mode cluster \
0056 --driver-memory 4g \
0057 --executor-memory 2g \
0058 --executor-cores 1 \
0059 --queue thequeue \
0060 examples/jars/spark-examples*.jar \
0061 10
0062
0063 The above starts a YARN client program which starts the default Application Master. Then SparkPi will be run as a child thread of Application Master. The client will periodically poll the Application Master for status updates and display them in the console. The client will exit once your application has finished running. Refer to the "Debugging your Application" section below for how to see driver and executor logs.
0064
0065 To launch a Spark application in `client` mode, do the same, but replace `cluster` with `client`. The following shows how you can run `spark-shell` in `client` mode:
0066
0067 $ ./bin/spark-shell --master yarn --deploy-mode client
0068
0069 ## Adding Other JARs
0070
0071 In `cluster` mode, the driver runs on a different machine than the client, so `SparkContext.addJar` won't work out of the box with files that are local to the client. To make files on the client available to `SparkContext.addJar`, include them with the `--jars` option in the launch command.
0072
0073 $ ./bin/spark-submit --class my.main.Class \
0074 --master yarn \
0075 --deploy-mode cluster \
0076 --jars my-other-jar.jar,my-other-other-jar.jar \
0077 my-main-jar.jar \
0078 app_arg1 app_arg2
0079
0080
0081 # Preparations
0082
0083 Running Spark on YARN requires a binary distribution of Spark which is built with YARN support.
0084 Binary distributions can be downloaded from the [downloads page](https://spark.apache.org/downloads.html) of the project website.
0085 To build Spark yourself, refer to [Building Spark](building-spark.html).
0086
0087 To make Spark runtime jars accessible from YARN side, you can specify `spark.yarn.archive` or `spark.yarn.jars`. For details please refer to [Spark Properties](running-on-yarn.html#spark-properties). If neither `spark.yarn.archive` nor `spark.yarn.jars` is specified, Spark will create a zip file with all jars under `$SPARK_HOME/jars` and upload it to the distributed cache.
0088
0089 # Configuration
0090
0091 Most of the configs are the same for Spark on YARN as for other deployment modes. See the [configuration page](configuration.html) for more information on those. These are configs that are specific to Spark on YARN.
0092
0093 # Debugging your Application
0094
0095 In YARN terminology, executors and application masters run inside "containers". YARN has two modes for handling container logs after an application has completed. If log aggregation is turned on (with the `yarn.log-aggregation-enable` config), container logs are copied to HDFS and deleted on the local machine. These logs can be viewed from anywhere on the cluster with the `yarn logs` command.
0096
0097 yarn logs -applicationId <app ID>
0098
0099 will print out the contents of all log files from all containers from the given application. You can also view the container log files directly in HDFS using the HDFS shell or API. The directory where they are located can be found by looking at your YARN configs (`yarn.nodemanager.remote-app-log-dir` and `yarn.nodemanager.remote-app-log-dir-suffix`). The logs are also available on the Spark Web UI under the Executors Tab. You need to have both the Spark history server and the MapReduce history server running and configure `yarn.log.server.url` in `yarn-site.xml` properly. The log URL on the Spark history server UI will redirect you to the MapReduce history server to show the aggregated logs.
0100
0101 When log aggregation isn't turned on, logs are retained locally on each machine under `YARN_APP_LOGS_DIR`, which is usually configured to `/tmp/logs` or `$HADOOP_HOME/logs/userlogs` depending on the Hadoop version and installation. Viewing logs for a container requires going to the host that contains them and looking in this directory. Subdirectories organize log files by application ID and container ID. The logs are also available on the Spark Web UI under the Executors Tab and doesn't require running the MapReduce history server.
0102
0103 To review per-container launch environment, increase `yarn.nodemanager.delete.debug-delay-sec` to a
0104 large value (e.g. `36000`), and then access the application cache through `yarn.nodemanager.local-dirs`
0105 on the nodes on which containers are launched. This directory contains the launch script, JARs, and
0106 all environment variables used for launching each container. This process is useful for debugging
0107 classpath problems in particular. (Note that enabling this requires admin privileges on cluster
0108 settings and a restart of all node managers. Thus, this is not applicable to hosted clusters).
0109
0110 To use a custom log4j configuration for the application master or executors, here are the options:
0111
0112 - upload a custom `log4j.properties` using `spark-submit`, by adding it to the `--files` list of files
0113 to be uploaded with the application.
0114 - add `-Dlog4j.configuration=<location of configuration file>` to `spark.driver.extraJavaOptions`
0115 (for the driver) or `spark.executor.extraJavaOptions` (for executors). Note that if using a file,
0116 the `file:` protocol should be explicitly provided, and the file needs to exist locally on all
0117 the nodes.
0118 - update the `$SPARK_CONF_DIR/log4j.properties` file and it will be automatically uploaded along
0119 with the other configurations. Note that other 2 options has higher priority than this option if
0120 multiple options are specified.
0121
0122 Note that for the first option, both executors and the application master will share the same
0123 log4j configuration, which may cause issues when they run on the same node (e.g. trying to write
0124 to the same log file).
0125
0126 If you need a reference to the proper location to put log files in the YARN so that YARN can properly display and aggregate them, use `spark.yarn.app.container.log.dir` in your `log4j.properties`. For example, `log4j.appender.file_appender.File=${spark.yarn.app.container.log.dir}/spark.log`. For streaming applications, configuring `RollingFileAppender` and setting file location to YARN's log directory will avoid disk overflow caused by large log files, and logs can be accessed using YARN's log utility.
0127
0128 To use a custom metrics.properties for the application master and executors, update the `$SPARK_CONF_DIR/metrics.properties` file. It will automatically be uploaded with other configurations, so you don't need to specify it manually with `--files`.
0129
0130 #### Spark Properties
0131
0132 <table class="table">
0133 <tr><th>Property Name</th><th>Default</th><th>Meaning</th><th>Since Version</th></tr>
0134 <tr>
0135 <td><code>spark.yarn.am.memory</code></td>
0136 <td><code>512m</code></td>
0137 <td>
0138 Amount of memory to use for the YARN Application Master in client mode, in the same format as JVM memory strings (e.g. <code>512m</code>, <code>2g</code>).
0139 In cluster mode, use <code>spark.driver.memory</code> instead.
0140 <p/>
0141 Use lower-case suffixes, e.g. <code>k</code>, <code>m</code>, <code>g</code>, <code>t</code>, and <code>p</code>, for kibi-, mebi-, gibi-, tebi-, and pebibytes, respectively.
0142 </td>
0143 <td>1.3.0</td>
0144 </tr>
0145 <tr>
0146 <td><code>spark.yarn.am.resource.{resource-type}.amount</code></td>
0147 <td><code>(none)</code></td>
0148 <td>
0149 Amount of resource to use for the YARN Application Master in client mode.
0150 In cluster mode, use <code>spark.yarn.driver.resource.<resource-type>.amount</code> instead.
0151 Please note that this feature can be used only with YARN 3.0+
0152 For reference, see YARN Resource Model documentation: https://hadoop.apache.org/docs/r3.0.1/hadoop-yarn/hadoop-yarn-site/ResourceModel.html
0153 <p/>
0154 Example:
0155 To request GPU resources from YARN, use: <code>spark.yarn.am.resource.yarn.io/gpu.amount</code>
0156 </td>
0157 <td>3.0.0</td>
0158 </tr>
0159 <tr>
0160 <td><code>spark.yarn.driver.resource.{resource-type}.amount</code></td>
0161 <td><code>(none)</code></td>
0162 <td>
0163 Amount of resource to use for the YARN Application Master in cluster mode.
0164 Please note that this feature can be used only with YARN 3.0+
0165 For reference, see YARN Resource Model documentation: https://hadoop.apache.org/docs/r3.0.1/hadoop-yarn/hadoop-yarn-site/ResourceModel.html
0166 <p/>
0167 Example:
0168 To request GPU resources from YARN, use: <code>spark.yarn.driver.resource.yarn.io/gpu.amount</code>
0169 </td>
0170 <td>3.0.0</td>
0171 </tr>
0172 <tr>
0173 <td><code>spark.yarn.executor.resource.{resource-type}.amount</code></td>
0174 <td><code>(none)</code></td>
0175 <td>
0176 Amount of resource to use per executor process.
0177 Please note that this feature can be used only with YARN 3.0+
0178 For reference, see YARN Resource Model documentation: https://hadoop.apache.org/docs/r3.0.1/hadoop-yarn/hadoop-yarn-site/ResourceModel.html
0179 <p/>
0180 Example:
0181 To request GPU resources from YARN, use: <code>spark.yarn.executor.resource.yarn.io/gpu.amount</code>
0182 </td>
0183 <td>3.0.0</td>
0184 </tr>
0185 <tr>
0186 <td><code>spark.yarn.am.cores</code></td>
0187 <td><code>1</code></td>
0188 <td>
0189 Number of cores to use for the YARN Application Master in client mode.
0190 In cluster mode, use <code>spark.driver.cores</code> instead.
0191 </td>
0192 <td>1.3.0</td>
0193 </tr>
0194 <tr>
0195 <td><code>spark.yarn.am.waitTime</code></td>
0196 <td><code>100s</code></td>
0197 <td>
0198 Only used in <code>cluster</code> mode. Time for the YARN Application Master to wait for the
0199 SparkContext to be initialized.
0200 </td>
0201 <td>1.3.0</td>
0202 </tr>
0203 <tr>
0204 <td><code>spark.yarn.submit.file.replication</code></td>
0205 <td>The default HDFS replication (usually <code>3</code>)</td>
0206 <td>
0207 HDFS replication level for the files uploaded into HDFS for the application. These include things like the Spark jar, the app jar, and any distributed cache files/archives.
0208 </td>
0209 <td>0.8.1</td>
0210 </tr>
0211 <tr>
0212 <td><code>spark.yarn.stagingDir</code></td>
0213 <td>Current user's home directory in the filesystem</td>
0214 <td>
0215 Staging directory used while submitting applications.
0216 </td>
0217 <td>2.0.0</td>
0218 </tr>
0219 <tr>
0220 <td><code>spark.yarn.preserve.staging.files</code></td>
0221 <td><code>false</code></td>
0222 <td>
0223 Set to <code>true</code> to preserve the staged files (Spark jar, app jar, distributed cache files) at the end of the job rather than delete them.
0224 </td>
0225 <td>1.1.0</td>
0226 </tr>
0227 <tr>
0228 <td><code>spark.yarn.scheduler.heartbeat.interval-ms</code></td>
0229 <td><code>3000</code></td>
0230 <td>
0231 The interval in ms in which the Spark application master heartbeats into the YARN ResourceManager.
0232 The value is capped at half the value of YARN's configuration for the expiry interval, i.e.
0233 <code>yarn.am.liveness-monitor.expiry-interval-ms</code>.
0234 </td>
0235 <td>0.8.1</td>
0236 </tr>
0237 <tr>
0238 <td><code>spark.yarn.scheduler.initial-allocation.interval</code></td>
0239 <td><code>200ms</code></td>
0240 <td>
0241 The initial interval in which the Spark application master eagerly heartbeats to the YARN ResourceManager
0242 when there are pending container allocation requests. It should be no larger than
0243 <code>spark.yarn.scheduler.heartbeat.interval-ms</code>. The allocation interval will doubled on
0244 successive eager heartbeats if pending containers still exist, until
0245 <code>spark.yarn.scheduler.heartbeat.interval-ms</code> is reached.
0246 </td>
0247 <td>1.4.0</td>
0248 </tr>
0249 <tr>
0250 <td><code>spark.yarn.max.executor.failures</code></td>
0251 <td>numExecutors * 2, with minimum of 3</td>
0252 <td>
0253 The maximum number of executor failures before failing the application.
0254 </td>
0255 <td>1.0.0</td>
0256 </tr>
0257 <tr>
0258 <td><code>spark.yarn.historyServer.address</code></td>
0259 <td>(none)</td>
0260 <td>
0261 The address of the Spark history server, e.g. <code>host.com:18080</code>. The address should not contain a scheme (<code>http://</code>). Defaults to not being set since the history server is an optional service. This address is given to the YARN ResourceManager when the Spark application finishes to link the application from the ResourceManager UI to the Spark history server UI.
0262 For this property, YARN properties can be used as variables, and these are substituted by Spark at runtime. For example, if the Spark history server runs on the same node as the YARN ResourceManager, it can be set to <code>${hadoopconf-yarn.resourcemanager.hostname}:18080</code>.
0263 </td>
0264 <td>1.0.0</td>
0265 </tr>
0266 <tr>
0267 <td><code>spark.yarn.dist.archives</code></td>
0268 <td>(none)</td>
0269 <td>
0270 Comma separated list of archives to be extracted into the working directory of each executor.
0271 </td>
0272 <td>1.0.0</td>
0273 </tr>
0274 <tr>
0275 <td><code>spark.yarn.dist.files</code></td>
0276 <td>(none)</td>
0277 <td>
0278 Comma-separated list of files to be placed in the working directory of each executor.
0279 </td>
0280 <td>1.0.0</td>
0281 </tr>
0282 <tr>
0283 <td><code>spark.yarn.dist.jars</code></td>
0284 <td>(none)</td>
0285 <td>
0286 Comma-separated list of jars to be placed in the working directory of each executor.
0287 </td>
0288 <td>2.0.0</td>
0289 </tr>
0290 <tr>
0291 <td><code>spark.yarn.dist.forceDownloadSchemes</code></td>
0292 <td><code>(none)</code></td>
0293 <td>
0294 Comma-separated list of schemes for which resources will be downloaded to the local disk prior to
0295 being added to YARN's distributed cache. For use in cases where the YARN service does not
0296 support schemes that are supported by Spark, like http, https and ftp, or jars required to be in the
0297 local YARN client's classpath. Wildcard '*' is denoted to download resources for all the schemes.
0298 </td>
0299 <td>2.3.0</td>
0300 </tr>
0301 <tr>
0302 <td><code>spark.executor.instances</code></td>
0303 <td><code>2</code></td>
0304 <td>
0305 The number of executors for static allocation. With <code>spark.dynamicAllocation.enabled</code>, the initial set of executors will be at least this large.
0306 </td>
0307 <td>1.0.0</td>
0308 </tr>
0309 <tr>
0310 <td><code>spark.yarn.am.memoryOverhead</code></td>
0311 <td>AM memory * 0.10, with minimum of 384 </td>
0312 <td>
0313 Same as <code>spark.driver.memoryOverhead</code>, but for the YARN Application Master in client mode.
0314 </td>
0315 <td>1.3.0</td>
0316 </tr>
0317 <tr>
0318 <td><code>spark.yarn.queue</code></td>
0319 <td><code>default</code></td>
0320 <td>
0321 The name of the YARN queue to which the application is submitted.
0322 </td>
0323 <td>1.0.0</td>
0324 </tr>
0325 <tr>
0326 <td><code>spark.yarn.jars</code></td>
0327 <td>(none)</td>
0328 <td>
0329 List of libraries containing Spark code to distribute to YARN containers.
0330 By default, Spark on YARN will use Spark jars installed locally, but the Spark jars can also be
0331 in a world-readable location on HDFS. This allows YARN to cache it on nodes so that it doesn't
0332 need to be distributed each time an application runs. To point to jars on HDFS, for example,
0333 set this configuration to <code>hdfs:///some/path</code>. Globs are allowed.
0334 </td>
0335 <td>2.0.0</td>
0336 </tr>
0337 <tr>
0338 <td><code>spark.yarn.archive</code></td>
0339 <td>(none)</td>
0340 <td>
0341 An archive containing needed Spark jars for distribution to the YARN cache. If set, this
0342 configuration replaces <code>spark.yarn.jars</code> and the archive is used in all the
0343 application's containers. The archive should contain jar files in its root directory.
0344 Like with the previous option, the archive can also be hosted on HDFS to speed up file
0345 distribution.
0346 </td>
0347 <td>2.0.0</td>
0348 </tr>
0349 <tr>
0350 <td><code>spark.yarn.appMasterEnv.[EnvironmentVariableName]</code></td>
0351 <td>(none)</td>
0352 <td>
0353 Add the environment variable specified by <code>EnvironmentVariableName</code> to the
0354 Application Master process launched on YARN. The user can specify multiple of
0355 these and to set multiple environment variables. In <code>cluster</code> mode this controls
0356 the environment of the Spark driver and in <code>client</code> mode it only controls
0357 the environment of the executor launcher.
0358 </td>
0359 <td>1.1.0</td>
0360 </tr>
0361 <tr>
0362 <td><code>spark.yarn.containerLauncherMaxThreads</code></td>
0363 <td><code>25</code></td>
0364 <td>
0365 The maximum number of threads to use in the YARN Application Master for launching executor containers.
0366 </td>
0367 <td>1.2.0</td>
0368 </tr>
0369 <tr>
0370 <td><code>spark.yarn.am.extraJavaOptions</code></td>
0371 <td>(none)</td>
0372 <td>
0373 A string of extra JVM options to pass to the YARN Application Master in client mode.
0374 In cluster mode, use <code>spark.driver.extraJavaOptions</code> instead. Note that it is illegal
0375 to set maximum heap size (-Xmx) settings with this option. Maximum heap size settings can be set
0376 with <code>spark.yarn.am.memory</code>
0377 </td>
0378 <td>1.3.0</td>
0379 </tr>
0380 <tr>
0381 <td><code>spark.yarn.am.extraLibraryPath</code></td>
0382 <td>(none)</td>
0383 <td>
0384 Set a special library path to use when launching the YARN Application Master in client mode.
0385 </td>
0386 <td>1.4.0</td>
0387 </tr>
0388 <tr>
0389 <td><code>spark.yarn.populateHadoopClasspath</code></td>
0390 <td>true</td>
0391 <td>
0392 Whether to populate Hadoop classpath from <code>yarn.application.classpath</code> and
0393 <code>mapreduce.application.classpath</code> Note that if this is set to <code>false</code>,
0394 it requires a <code>with-Hadoop</code> Spark distribution that bundles Hadoop runtime or
0395 user has to provide a Hadoop installation separately.
0396 </td>
0397 <td>2.4.6</td>
0398 </tr>
0399 <tr>
0400 <td><code>spark.yarn.maxAppAttempts</code></td>
0401 <td><code>yarn.resourcemanager.am.max-attempts</code> in YARN</td>
0402 <td>
0403 The maximum number of attempts that will be made to submit the application.
0404 It should be no larger than the global number of max attempts in the YARN configuration.
0405 </td>
0406 <td>1.3.0</td>
0407 </tr>
0408 <tr>
0409 <td><code>spark.yarn.am.attemptFailuresValidityInterval</code></td>
0410 <td>(none)</td>
0411 <td>
0412 Defines the validity interval for AM failure tracking.
0413 If the AM has been running for at least the defined interval, the AM failure count will be reset.
0414 This feature is not enabled if not configured.
0415 </td>
0416 <td>1.6.0</td>
0417 </tr>
0418 <tr>
0419 <td><code>spark.yarn.executor.failuresValidityInterval</code></td>
0420 <td>(none)</td>
0421 <td>
0422 Defines the validity interval for executor failure tracking.
0423 Executor failures which are older than the validity interval will be ignored.
0424 </td>
0425 <td>2.0.0</td>
0426 </tr>
0427 <tr>
0428 <td><code>spark.yarn.submit.waitAppCompletion</code></td>
0429 <td><code>true</code></td>
0430 <td>
0431 In YARN cluster mode, controls whether the client waits to exit until the application completes.
0432 If set to <code>true</code>, the client process will stay alive reporting the application's status.
0433 Otherwise, the client process will exit after submission.
0434 </td>
0435 <td>1.4.0</td>
0436 </tr>
0437 <tr>
0438 <td><code>spark.yarn.am.nodeLabelExpression</code></td>
0439 <td>(none)</td>
0440 <td>
0441 A YARN node label expression that restricts the set of nodes AM will be scheduled on.
0442 Only versions of YARN greater than or equal to 2.6 support node label expressions, so when
0443 running against earlier versions, this property will be ignored.
0444 </td>
0445 <td>1.6.0</td>
0446 </tr>
0447 <tr>
0448 <td><code>spark.yarn.executor.nodeLabelExpression</code></td>
0449 <td>(none)</td>
0450 <td>
0451 A YARN node label expression that restricts the set of nodes executors will be scheduled on.
0452 Only versions of YARN greater than or equal to 2.6 support node label expressions, so when
0453 running against earlier versions, this property will be ignored.
0454 </td>
0455 <td>1.4.0</td>
0456 </tr>
0457 <tr>
0458 <td><code>spark.yarn.tags</code></td>
0459 <td>(none)</td>
0460 <td>
0461 Comma-separated list of strings to pass through as YARN application tags appearing
0462 in YARN ApplicationReports, which can be used for filtering when querying YARN apps.
0463 </td>
0464 <td>1.5.0</td>
0465 </tr>
0466 <tr>
0467 <td><code>spark.yarn.priority</code></td>
0468 <td>(none)</td>
0469 <td>
0470 Application priority for YARN to define pending applications ordering policy, those with higher
0471 integer value have a better opportunity to be activated. Currently, YARN only supports application
0472 priority when using FIFO ordering policy.
0473 </td>
0474 <td>3.0.0</td>
0475 </tr>
0476 <tr>
0477 <td><code>spark.yarn.config.gatewayPath</code></td>
0478 <td>(none)</td>
0479 <td>
0480 A path that is valid on the gateway host (the host where a Spark application is started) but may
0481 differ for paths for the same resource in other nodes in the cluster. Coupled with
0482 <code>spark.yarn.config.replacementPath</code>, this is used to support clusters with
0483 heterogeneous configurations, so that Spark can correctly launch remote processes.
0484 <p/>
0485 The replacement path normally will contain a reference to some environment variable exported by
0486 YARN (and, thus, visible to Spark containers).
0487 <p/>
0488 For example, if the gateway node has Hadoop libraries installed on <code>/disk1/hadoop</code>, and
0489 the location of the Hadoop install is exported by YARN as the <code>HADOOP_HOME</code>
0490 environment variable, setting this value to <code>/disk1/hadoop</code> and the replacement path to
0491 <code>$HADOOP_HOME</code> will make sure that paths used to launch remote processes properly
0492 reference the local YARN configuration.
0493 </td>
0494 <td>1.5.0</td>
0495 </tr>
0496 <tr>
0497 <td><code>spark.yarn.config.replacementPath</code></td>
0498 <td>(none)</td>
0499 <td>
0500 See <code>spark.yarn.config.gatewayPath</code>.
0501 </td>
0502 <td>1.5.0</td>
0503 </tr>
0504 <tr>
0505 <td><code>spark.yarn.rolledLog.includePattern</code></td>
0506 <td>(none)</td>
0507 <td>
0508 Java Regex to filter the log files which match the defined include pattern
0509 and those log files will be aggregated in a rolling fashion.
0510 This will be used with YARN's rolling log aggregation, to enable this feature in YARN side
0511 <code>yarn.nodemanager.log-aggregation.roll-monitoring-interval-seconds</code> should be
0512 configured in yarn-site.xml. The Spark log4j appender needs be changed to use
0513 FileAppender or another appender that can handle the files being removed while it is running. Based
0514 on the file name configured in the log4j configuration (like spark.log), the user should set the
0515 regex (spark*) to include all the log files that need to be aggregated.
0516 </td>
0517 <td>2.0.0</td>
0518 </tr>
0519 <tr>
0520 <td><code>spark.yarn.rolledLog.excludePattern</code></td>
0521 <td>(none)</td>
0522 <td>
0523 Java Regex to filter the log files which match the defined exclude pattern
0524 and those log files will not be aggregated in a rolling fashion. If the log file
0525 name matches both the include and the exclude pattern, this file will be excluded eventually.
0526 </td>
0527 <td>2.0.0</td>
0528 </tr>
0529 <tr>
0530 <td><code>spark.yarn.blacklist.executor.launch.blacklisting.enabled</code></td>
0531 <td>false</td>
0532 <td>
0533 Flag to enable blacklisting of nodes having YARN resource allocation problems.
0534 The error limit for blacklisting can be configured by
0535 <code>spark.blacklist.application.maxFailedExecutorsPerNode</code>.
0536 </td>
0537 <td>2.4.0</td>
0538 </tr>
0539 <tr>
0540 <td><code>spark.yarn.exclude.nodes</code></td>
0541 <td>(none)</td>
0542 <td>
0543 Comma-separated list of YARN node names which are excluded from resource allocation.
0544 </td>
0545 <td>3.0.0</td>
0546 </tr>
0547 <tr>
0548 <td><code>spark.yarn.metrics.namespace</code></td>
0549 <td>(none)</td>
0550 <td>
0551 The root namespace for AM metrics reporting.
0552 If it is not set then the YARN application ID is used.
0553 </td>
0554 <td>2.4.0</td>
0555 </tr>
0556 </table>
0557
0558 #### Available patterns for SHS custom executor log URL
0559
0560 <table class="table">
0561 <tr><th>Pattern</th><th>Meaning</th></tr>
0562 <tr>
0563 <td>{{HTTP_SCHEME}}</td>
0564 <td>`http://` or `https://` according to YARN HTTP policy. (Configured via `yarn.http.policy`)</td>
0565 </tr>
0566 <tr>
0567 <td>{{NM_HOST}}</td>
0568 <td>The "host" of node where container was run.</td>
0569 </tr>
0570 <tr>
0571 <td>{{NM_PORT}}</td>
0572 <td>The "port" of node manager where container was run.</td>
0573 </tr>
0574 <tr>
0575 <td>{{NM_HTTP_PORT}}</td>
0576 <td>The "port" of node manager's http server where container was run.</td>
0577 </tr>
0578 <tr>
0579 <td>{{NM_HTTP_ADDRESS}}</td>
0580 <td>Http URI of the node on which the container is allocated.</td>
0581 </tr>
0582 <tr>
0583 <td>{{CLUSTER_ID}}</td>
0584 <td>The cluster ID of Resource Manager. (Configured via `yarn.resourcemanager.cluster-id`)</td>
0585 </tr>
0586 <tr>
0587 <td>{{CONTAINER_ID}}</td>
0588 <td>The ID of container.</td>
0589 </tr>
0590 <tr>
0591 <td>{{USER}}</td>
0592 <td>'SPARK_USER' on system environment.</td>
0593 </tr>
0594 <tr>
0595 <td>{{FILE_NAME}}</td>
0596 <td>`stdout`, `stderr`.</td>
0597 </tr>
0598 </table>
0599
0600 For example, suppose you would like to point log url link to Job History Server directly instead of let NodeManager http server redirects it, you can configure `spark.history.custom.executor.log.url` as below:
0601
0602 `{{HTTP_SCHEME}}<JHS_HOST>:<JHS_PORT>/jobhistory/logs/{{NM_HOST}}:{{NM_PORT}}/{{CONTAINER_ID}}/{{CONTAINER_ID}}/{{USER}}/{{FILE_NAME}}?start=-4096`
0603
0604 NOTE: you need to replace `<JHS_POST>` and `<JHS_PORT>` with actual value.
0605
0606 # Resource Allocation and Configuration Overview
0607
0608 Please make sure to have read the Custom Resource Scheduling and Configuration Overview section on the [configuration page](configuration.html). This section only talks about the YARN specific aspects of resource scheduling.
0609
0610 YARN needs to be configured to support any resources the user wants to use with Spark. Resource scheduling on YARN was added in YARN 3.1.0. See the YARN documentation for more information on configuring resources and properly setting up isolation. Ideally the resources are setup isolated so that an executor can only see the resources it was allocated. If you do not have isolation enabled, the user is responsible for creating a discovery script that ensures the resource is not shared between executors.
0611
0612 YARN currently supports any user defined resource type but has built in types for GPU (<code>yarn.io/gpu</code>) and FPGA (<code>yarn.io/fpga</code>). For that reason, if you are using either of those resources, Spark can translate your request for spark resources into YARN resources and you only have to specify the <code>spark.{driver/executor}.resource.</code> configs. If you are using a resource other then FPGA or GPU, the user is responsible for specifying the configs for both YARN (<code>spark.yarn.{driver/executor}.resource.</code>) and Spark (<code>spark.{driver/executor}.resource.</code>).
0613
0614 For example, the user wants to request 2 GPUs for each executor. The user can just specify <code>spark.executor.resource.gpu.amount=2</code> and Spark will handle requesting <code>yarn.io/gpu</code> resource type from YARN.
0615
0616 If the user has a user defined YARN resource, lets call it `acceleratorX` then the user must specify <code>spark.yarn.executor.resource.acceleratorX.amount=2</code> and <code>spark.executor.resource.acceleratorX.amount=2</code>.
0617
0618 YARN does not tell Spark the addresses of the resources allocated to each container. For that reason, the user must specify a discovery script that gets run by the executor on startup to discover what resources are available to that executor. You can find an example scripts in `examples/src/main/scripts/getGpusResources.sh`. The script must have execute permissions set and the user should setup permissions to not allow malicious users to modify it. The script should write to STDOUT a JSON string in the format of the ResourceInformation class. This has the resource name and an array of resource addresses available to just that executor.
0619
0620 # Important notes
0621
0622 - Whether core requests are honored in scheduling decisions depends on which scheduler is in use and how it is configured.
0623 - In `cluster` mode, the local directories used by the Spark executors and the Spark driver will be the local directories configured for YARN (Hadoop YARN config `yarn.nodemanager.local-dirs`). If the user specifies `spark.local.dir`, it will be ignored. In `client` mode, the Spark executors will use the local directories configured for YARN while the Spark driver will use those defined in `spark.local.dir`. This is because the Spark driver does not run on the YARN cluster in `client` mode, only the Spark executors do.
0624 - The `--files` and `--archives` options support specifying file names with the # similar to Hadoop. For example, you can specify: `--files localtest.txt#appSees.txt` and this will upload the file you have locally named `localtest.txt` into HDFS but this will be linked to by the name `appSees.txt`, and your application should use the name as `appSees.txt` to reference it when running on YARN.
0625 - The `--jars` option allows the `SparkContext.addJar` function to work if you are using it with local files and running in `cluster` mode. It does not need to be used if you are using it with HDFS, HTTP, HTTPS, or FTP files.
0626
0627 # Kerberos
0628
0629 Standard Kerberos support in Spark is covered in the [Security](security.html#kerberos) page.
0630
0631 In YARN mode, when accessing Hadoop file systems, aside from the default file system in the hadoop
0632 configuration, Spark will also automatically obtain delegation tokens for the service hosting the
0633 staging directory of the Spark application.
0634
0635 ## YARN-specific Kerberos Configuration
0636
0637 <table class="table">
0638 <tr><th>Property Name</th><th>Default</th><th>Meaning</th><th>Since Version</th></tr>
0639 <tr>
0640 <td><code>spark.kerberos.keytab</code></td>
0641 <td>(none)</td>
0642 <td>
0643 The full path to the file that contains the keytab for the principal specified above. This keytab
0644 will be copied to the node running the YARN Application Master via the YARN Distributed Cache, and
0645 will be used for renewing the login tickets and the delegation tokens periodically. Equivalent to
0646 the <code>--keytab</code> command line argument.
0647
0648 <br /> (Works also with the "local" master.)
0649 </td>
0650 <td>3.0.0</td>
0651 </tr>
0652 <tr>
0653 <td><code>spark.kerberos.principal</code></td>
0654 <td>(none)</td>
0655 <td>
0656 Principal to be used to login to KDC, while running on secure clusters. Equivalent to the
0657 <code>--principal</code> command line argument.
0658
0659 <br /> (Works also with the "local" master.)
0660 </td>
0661 <td>3.0.0</td>
0662 </tr>
0663 <tr>
0664 <td><code>spark.yarn.kerberos.relogin.period</code></td>
0665 <td>1m</td>
0666 <td>
0667 How often to check whether the kerberos TGT should be renewed. This should be set to a value
0668 that is shorter than the TGT renewal period (or the TGT lifetime if TGT renewal is not enabled).
0669 The default value should be enough for most deployments.
0670 </td>
0671 <td>2.3.0</td>
0672 </tr>
0673 </table>
0674
0675 ## Troubleshooting Kerberos
0676
0677 Debugging Hadoop/Kerberos problems can be "difficult". One useful technique is to
0678 enable extra logging of Kerberos operations in Hadoop by setting the `HADOOP_JAAS_DEBUG`
0679 environment variable.
0680
0681 ```bash
0682 export HADOOP_JAAS_DEBUG=true
0683 ```
0684
0685 The JDK classes can be configured to enable extra logging of their Kerberos and
0686 SPNEGO/REST authentication via the system properties `sun.security.krb5.debug`
0687 and `sun.security.spnego.debug=true`
0688
0689 ```
0690 -Dsun.security.krb5.debug=true -Dsun.security.spnego.debug=true
0691 ```
0692
0693 All these options can be enabled in the Application Master:
0694
0695 ```
0696 spark.yarn.appMasterEnv.HADOOP_JAAS_DEBUG true
0697 spark.yarn.am.extraJavaOptions -Dsun.security.krb5.debug=true -Dsun.security.spnego.debug=true
0698 ```
0699
0700 Finally, if the log level for `org.apache.spark.deploy.yarn.Client` is set to `DEBUG`, the log
0701 will include a list of all tokens obtained, and their expiry details
0702
0703
0704 # Configuring the External Shuffle Service
0705
0706 To start the Spark Shuffle Service on each `NodeManager` in your YARN cluster, follow these
0707 instructions:
0708
0709 1. Build Spark with the [YARN profile](building-spark.html). Skip this step if you are using a
0710 pre-packaged distribution.
0711 1. Locate the `spark-<version>-yarn-shuffle.jar`. This should be under
0712 `$SPARK_HOME/common/network-yarn/target/scala-<version>` if you are building Spark yourself, and under
0713 `yarn` if you are using a distribution.
0714 1. Add this jar to the classpath of all `NodeManager`s in your cluster.
0715 1. In the `yarn-site.xml` on each node, add `spark_shuffle` to `yarn.nodemanager.aux-services`,
0716 then set `yarn.nodemanager.aux-services.spark_shuffle.class` to
0717 `org.apache.spark.network.yarn.YarnShuffleService`.
0718 1. Increase `NodeManager's` heap size by setting `YARN_HEAPSIZE` (1000 by default) in `etc/hadoop/yarn-env.sh`
0719 to avoid garbage collection issues during shuffle.
0720 1. Restart all `NodeManager`s in your cluster.
0721
0722 The following extra configuration options are available when the shuffle service is running on YARN:
0723
0724 <table class="table">
0725 <tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr>
0726 <tr>
0727 <td><code>spark.yarn.shuffle.stopOnFailure</code></td>
0728 <td><code>false</code></td>
0729 <td>
0730 Whether to stop the NodeManager when there's a failure in the Spark Shuffle Service's
0731 initialization. This prevents application failures caused by running containers on
0732 NodeManagers where the Spark Shuffle Service is not running.
0733 </td>
0734 </tr>
0735 </table>
0736
0737 # Launching your application with Apache Oozie
0738
0739 Apache Oozie can launch Spark applications as part of a workflow.
0740 In a secure cluster, the launched application will need the relevant tokens to access the cluster's
0741 services. If Spark is launched with a keytab, this is automatic.
0742 However, if Spark is to be launched without a keytab, the responsibility for setting up security
0743 must be handed over to Oozie.
0744
0745 The details of configuring Oozie for secure clusters and obtaining
0746 credentials for a job can be found on the [Oozie web site](http://oozie.apache.org/)
0747 in the "Authentication" section of the specific release's documentation.
0748
0749 For Spark applications, the Oozie workflow must be set up for Oozie to request all tokens which
0750 the application needs, including:
0751
0752 - The YARN resource manager.
0753 - The local Hadoop filesystem.
0754 - Any remote Hadoop filesystems used as a source or destination of I/O.
0755 - Hive —if used.
0756 - HBase —if used.
0757 - The YARN timeline server, if the application interacts with this.
0758
0759 To avoid Spark attempting —and then failing— to obtain Hive, HBase and remote HDFS tokens,
0760 the Spark configuration must be set to disable token collection for the services.
0761
0762 The Spark configuration must include the lines:
0763
0764 ```
0765 spark.security.credentials.hive.enabled false
0766 spark.security.credentials.hbase.enabled false
0767 ```
0768
0769 The configuration option `spark.kerberos.access.hadoopFileSystems` must be unset.
0770
0771 # Using the Spark History Server to replace the Spark Web UI
0772
0773 It is possible to use the Spark History Server application page as the tracking URL for running
0774 applications when the application UI is disabled. This may be desirable on secure clusters, or to
0775 reduce the memory usage of the Spark driver. To set up tracking through the Spark History Server,
0776 do the following:
0777
0778 - On the application side, set <code>spark.yarn.historyServer.allowTracking=true</code> in Spark's
0779 configuration. This will tell Spark to use the history server's URL as the tracking URL if
0780 the application's UI is disabled.
0781 - On the Spark History Server, add <code>org.apache.spark.deploy.yarn.YarnProxyRedirectFilter</code>
0782 to the list of filters in the <code>spark.ui.filters</code> configuration.
0783
0784 Be aware that the history server information may not be up-to-date with the application's state.