the-tree/docs/security.md

0001 ---
0002 layout: global
0003 displayTitle: Spark Security
0004 title: Security
0005 license: |
0006   Licensed to the Apache Software Foundation (ASF) under one or more
0007   contributor license agreements.  See the NOTICE file distributed with
0008   this work for additional information regarding copyright ownership.
0009   The ASF licenses this file to You under the Apache License, Version 2.0
0010   (the "License"); you may not use this file except in compliance with
0011   the License.  You may obtain a copy of the License at
0012
0013      http://www.apache.org/licenses/LICENSE-2.0
0014
0015   Unless required by applicable law or agreed to in writing, software
0016   distributed under the License is distributed on an "AS IS" BASIS,
0017   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
0018   See the License for the specific language governing permissions and
0019   limitations under the License.
0020 ---
0021 * This will become a table of contents (this text will be scraped).
0022 {:toc}
0023
0024 # Spark Security: Things You Need To Know
0025
Security in Spark is OFF by default. This could mean you are vulnerable to attack by default.
Spark supports multiple deployment types and each one supports different levels of security. Not
all deployment types will be secure in all environments and none are secure by default. Be
sure to evaluate your environment, what Spark supports, and take the appropriate measures to secure
your Spark deployment.
0031
Security concerns come in many forms, and Spark does not necessarily protect against all of them.
Listed below are some of the things Spark supports. Also check the deployment
documentation for the type of deployment you are using for deployment-specific settings. Assume that
Spark does not support anything that is not documented.
0036
0037 # Spark RPC (Communication protocol between Spark processes)
0038
0039 ## Authentication
0040
0041 Spark currently supports authentication for RPC channels using a shared secret. Authentication can
0042 be turned on by setting the `spark.authenticate` configuration parameter.
0043
0044 The exact mechanism used to generate and distribute the shared secret is deployment-specific. Unless
0045 specified below, the secret must be defined by setting the `spark.authenticate.secret` config
0046 option. The same secret is shared by all Spark applications and daemons in that case, which limits
0047 the security of these deployments, especially on multi-tenant clusters.
0048
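As a rough illustration of the manual case, the sketch below generates a random secret and renders the two properties named above as `spark-defaults.conf` lines. The helper name `render_auth_conf` and the 32-byte secret length are assumptions made for this example, not something Spark prescribes.

```python
import secrets


def render_auth_conf(secret: str) -> str:
    """Render spark-defaults.conf lines that turn on RPC authentication."""
    if not secret:
        raise ValueError("the shared secret must not be empty")
    lines = [
        "spark.authenticate true",
        f"spark.authenticate.secret {secret}",
    ]
    return "\n".join(lines) + "\n"


# 32 random bytes, hex-encoded (64 characters). The length is an assumption
# made for this sketch, not a requirement stated by Spark.
shared_secret = secrets.token_hex(32)
conf_text = render_auth_conf(shared_secret)

# Print the rendered lines with the secret masked so it does not end up in logs.
print(conf_text.replace(shared_secret, "<redacted>"), end="")
```

Because every application and daemon shares this one value, the rendered file should be readable only by the accounts that run Spark.
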
The REST Submission Server and the MesosClusterDispatcher do not support authentication. You should
ensure that all network access to the REST API and MesosClusterDispatcher (ports 6066 and 7077
respectively by default) is restricted to hosts that are trusted to submit jobs.
0052
0053 ### YARN
0054
0055 For Spark on [YARN](running-on-yarn.html), Spark will automatically handle generating and
0056 distributing the shared secret. Each application will use a unique shared secret. In
0057 the case of YARN, this feature relies on YARN RPC encryption being enabled for the distribution of
0058 secrets to be secure.
0059
0060 ### Kubernetes
0061
0062 On Kubernetes, Spark will also automatically generate an authentication secret unique to each
0063 application. The secret is propagated to executor pods using environment variables. This means
0064 that any user that can list pods in the namespace where the Spark application is running can
0065 also see their authentication secret. Access control rules should be properly set up by the
0066 Kubernetes admin to ensure that Spark authentication is secure.
0067
0068 <table class="table">
0069 <tr><th>Property Name</th><th>Default</th><th>Meaning</th><th>Since Version</th></tr>
0070 <tr>
0071   <td><code>spark.authenticate</code></td>
0072   <td>false</td>
0073   <td>Whether Spark authenticates its internal connections.</td>
0074   <td>1.0.0</td>
0075 </tr>
0076 <tr>
0077   <td><code>spark.authenticate.secret</code></td>
0078   <td>None</td>
0079   <td>
    The secret key used for authentication. See above for when this configuration should be set.
0081   </td>
0082   <td>1.0.0</td>
0083 </tr>
0084 </table>
0085
0086 Alternatively, one can mount authentication secrets using files and Kubernetes secrets that
0087 the user mounts into their pods.
0088
0089 <table class="table">
0090 <tr><th>Property Name</th><th>Default</th><th>Meaning</th><th>Since Version</th></tr>
0091 <tr>
0092   <td><code>spark.authenticate.secret.file</code></td>
0093   <td>None</td>
0094   <td>
0095     Path pointing to the secret key to use for securing connections. Ensure that the
0096     contents of the file have been securely generated. This file is loaded on both the driver
0097     and the executors unless other settings override this (see below).
0098   </td>
0099   <td>3.0.0</td>
0100 </tr>
0101 <tr>
0102   <td><code>spark.authenticate.secret.driver.file</code></td>
0103   <td>The value of <code>spark.authenticate.secret.file</code></td>
0104   <td>
    When specified, overrides the location that the Spark driver reads to load the secret.
    Useful in client mode, when the location of the secret file may differ in the pod versus
    the node the driver is running in. When this is specified,
    <code>spark.authenticate.secret.executor.file</code> must be specified so that the driver
    and the executors can both use files to load the secret key. Ensure that the contents of the file
    on the driver are identical to the contents of the file on the executors.
0111   </td>
0112   <td>3.0.0</td>
0113 </tr>
0114 <tr>
0115   <td><code>spark.authenticate.secret.executor.file</code></td>
0116   <td>The value of <code>spark.authenticate.secret.file</code></td>
0117   <td>
0118     When specified, overrides the location that the Spark executors read to load the secret.
0119     Useful in client mode, when the location of the secret file may differ in the pod versus
0120     the node the driver is running in. When this is specified,
0121     <code>spark.authenticate.secret.driver.file</code> must be specified so that the driver
    and the executors can both use files to load the secret key. Ensure that the contents of the file
    on the driver are identical to the contents of the file on the executors.
0124   </td>
0125   <td>3.0.0</td>
0126 </tr>
0127 </table>
0128
Note that when using files, Spark will not mount these files into the containers for you. It is up
to you to ensure that the secret files are deployed securely into your containers and that the driver's
secret file agrees with the executors' secret file.
0132
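One possible way to prepare such a file is sketched below: it creates the secret file with owner-only permissions and builds the matching properties. The helper names and both paths are hypothetical, and mounting the file into executor pods is still left to you.

```python
import os
import secrets
import stat
import tempfile


def write_secret_file(path: str, secret: str) -> None:
    """Create `path` with owner-only permissions, then write the secret."""
    flags = os.O_WRONLY | os.O_CREAT | os.O_EXCL  # fail if the file already exists
    fd = os.open(path, flags, 0o600)
    with os.fdopen(fd, "w") as handle:
        handle.write(secret)


def secret_file_conf(driver_path: str, executor_path: str) -> dict:
    """Properties pointing the driver and the executors at their secret files."""
    return {
        "spark.authenticate": "true",
        "spark.authenticate.secret.driver.file": driver_path,
        "spark.authenticate.secret.executor.file": executor_path,
    }


workdir = tempfile.mkdtemp()
driver_file = os.path.join(workdir, "driver-secret")  # hypothetical location
secret = secrets.token_hex(32)
write_secret_file(driver_file, secret)

# Permission bits of the new file; group and others should have no access.
mode = stat.S_IMODE(os.stat(driver_file).st_mode)

# /mnt/secrets/spark-secret is a hypothetical mount path inside executor pods.
conf = secret_file_conf(driver_file, "/mnt/secrets/spark-secret")
```
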
0133 ## Encryption
0134
0135 Spark supports AES-based encryption for RPC connections. For encryption to be enabled, RPC
0136 authentication must also be enabled and properly configured. AES encryption uses the
0137 [Apache Commons Crypto](https://commons.apache.org/proper/commons-crypto/) library, and Spark's
0138 configuration system allows access to that library's configuration for advanced users.
0139
0140 There is also support for SASL-based encryption, although it should be considered deprecated. It
0141 is still required when talking to shuffle services from Spark versions older than 2.2.0.
0142
0143 The following table describes the different options available for configuring this feature.
0144
0145 <table class="table">
0146 <tr><th>Property Name</th><th>Default</th><th>Meaning</th><th>Since Version</th></tr>
0147 <tr>
0148   <td><code>spark.network.crypto.enabled</code></td>
0149   <td>false</td>
0150   <td>
0151     Enable AES-based RPC encryption, including the new authentication protocol added in 2.2.0.
0152   </td>
0153   <td>2.2.0</td>
0154 </tr>
0155 <tr>
0156   <td><code>spark.network.crypto.keyLength</code></td>
0157   <td>128</td>
0158   <td>
0159     The length in bits of the encryption key to generate. Valid values are 128, 192 and 256.
0160   </td>
0161   <td>2.2.0</td>
0162 </tr>
0163 <tr>
0164   <td><code>spark.network.crypto.keyFactoryAlgorithm</code></td>
0165   <td>PBKDF2WithHmacSHA1</td>
0166   <td>
0167     The key factory algorithm to use when generating encryption keys. Should be one of the
0168     algorithms supported by the javax.crypto.SecretKeyFactory class in the JRE being used.
0169   </td>
0170   <td>2.2.0</td>
0171 </tr>
0172 <tr>
0173   <td><code>spark.network.crypto.config.*</code></td>
0174   <td>None</td>
0175   <td>
0176     Configuration values for the commons-crypto library, such as which cipher implementations to
0177     use. The config name should be the name of commons-crypto configuration without the
0178     <code>commons.crypto</code> prefix.
0179   </td>
0180   <td>2.2.0</td>
0181 </tr>
0182 <tr>
0183   <td><code>spark.network.crypto.saslFallback</code></td>
0184   <td>true</td>
0185   <td>
0186     Whether to fall back to SASL authentication if authentication fails using Spark's internal
0187     mechanism. This is useful when the application is connecting to old shuffle services that
0188     do not support the internal Spark authentication protocol. On the shuffle service side,
0189     disabling this feature will block older clients from authenticating.
0190   </td>
0191   <td>2.2.0</td>
0192 </tr>
0193 <tr>
0194   <td><code>spark.authenticate.enableSaslEncryption</code></td>
0195   <td>false</td>
0196   <td>
0197     Enable SASL-based encrypted communication.
0198   </td>
0199   <td>2.2.0</td>
0200 </tr>
0201 <tr>
0202   <td><code>spark.network.sasl.serverAlwaysEncrypt</code></td>
0203   <td>false</td>
0204   <td>
0205     Disable unencrypted connections for ports using SASL authentication. This will deny connections
0206     from clients that have authentication enabled, but do not request SASL-based encryption.
0207   </td>
0208   <td>1.4.0</td>
0209 </tr>
0210 </table>
0211
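The sketch below collects the AES-related options from the table into a dictionary and rejects key lengths the table does not list. The function name `rpc_encryption_conf` is an assumption for this example; how the resulting properties reach Spark (a properties file, `--conf` flags, or `SparkConf`) is up to the deployment.

```python
VALID_KEY_LENGTHS = (128, 192, 256)


def rpc_encryption_conf(key_length: int = 128, sasl_fallback: bool = True) -> dict:
    """Build properties that enable AES-based RPC encryption."""
    if key_length not in VALID_KEY_LENGTHS:
        raise ValueError(f"key length must be one of {VALID_KEY_LENGTHS}")
    return {
        # Encryption only works when RPC authentication is enabled as well.
        "spark.authenticate": "true",
        "spark.network.crypto.enabled": "true",
        "spark.network.crypto.keyLength": str(key_length),
        # Set to false once no pre-2.2.0 shuffle services remain.
        "spark.network.crypto.saslFallback": str(sasl_fallback).lower(),
    }


conf = rpc_encryption_conf(key_length=256, sasl_fallback=False)
for key, value in conf.items():
    print(f"{key} {value}")
```
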
0212
0213 # Local Storage Encryption
0214
0215 Spark supports encrypting temporary data written to local disks. This covers shuffle files, shuffle
0216 spills and data blocks stored on disk (for both caching and broadcast variables). It does not cover
0217 encrypting output data generated by applications with APIs such as `saveAsHadoopFile` or
0218 `saveAsTable`. It also may not cover temporary files created explicitly by the user.
0219
0220 The following settings cover enabling encryption for data written to disk:
0221
0222 <table class="table">
0223 <tr><th>Property Name</th><th>Default</th><th>Meaning</th><th>Since Version</th></tr>
0224 <tr>
0225   <td><code>spark.io.encryption.enabled</code></td>
0226   <td>false</td>
0227   <td>
0228     Enable local disk I/O encryption. Currently supported by all modes except Mesos. It's strongly
0229     recommended that RPC encryption be enabled when using this feature.
0230   </td>
0231   <td>2.1.0</td>
0232 </tr>
0233 <tr>
0234   <td><code>spark.io.encryption.keySizeBits</code></td>
0235   <td>128</td>
0236   <td>
0237     IO encryption key size in bits. Supported values are 128, 192 and 256.
0238   </td>
0239   <td>2.1.0</td>
0240 </tr>
0241 <tr>
0242   <td><code>spark.io.encryption.keygen.algorithm</code></td>
0243   <td>HmacSHA1</td>
0244   <td>
0245     The algorithm to use when generating the IO encryption key. The supported algorithms are
0246     described in the KeyGenerator section of the Java Cryptography Architecture Standard Algorithm
0247     Name Documentation.
0248   </td>
0249   <td>2.1.0</td>
0250 </tr>
0251 <tr>
0252   <td><code>spark.io.encryption.commons.config.*</code></td>
0253   <td>None</td>
0254   <td>
0255     Configuration values for the commons-crypto library, such as which cipher implementations to
0256     use. The config name should be the name of commons-crypto configuration without the
0257     <code>commons.crypto</code> prefix.
0258   </td>
0259   <td>2.1.0</td>
0260 </tr>
0261 </table>
0262
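As an illustration, the settings above could be turned into `--conf` arguments for `spark-submit` as follows. The helper names are hypothetical. The sketch also switches on RPC authentication and encryption, following the recommendation in the table; distributing the authentication secret remains deployment-specific.

```python
def io_encryption_conf(key_size_bits: int = 128) -> dict:
    """Build properties that enable local disk I/O encryption."""
    if key_size_bits not in (128, 192, 256):
        raise ValueError("key size must be 128, 192 or 256 bits")
    return {
        "spark.io.encryption.enabled": "true",
        "spark.io.encryption.keySizeBits": str(key_size_bits),
        # RPC encryption is strongly recommended alongside disk encryption,
        # and it in turn requires RPC authentication.
        "spark.authenticate": "true",
        "spark.network.crypto.enabled": "true",
    }


def to_submit_args(conf: dict) -> list:
    """Turn a property dictionary into spark-submit `--conf key=value` pairs."""
    args = []
    for key in sorted(conf):
        args.extend(["--conf", f"{key}={conf[key]}"])
    return args


submit_args = to_submit_args(io_encryption_conf(256))
print(" ".join(submit_args))
```
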
0263
0264 # Web UI
0265
0266 ## Authentication and Authorization
0267
0268 Enabling authentication for the Web UIs is done using [javax servlet filters](https://docs.oracle.com/javaee/6/api/javax/servlet/Filter.html).
0269 You will need a filter that implements the authentication method you want to deploy. Spark does not
0270 provide any built-in authentication filters.
0271
0272 Spark also supports access control to the UI when an authentication filter is present. Each
0273 application can be configured with its own separate access control lists (ACLs). Spark
0274 differentiates between "view" permissions (who is allowed to see the application's UI), and "modify"
0275 permissions (who can do things like kill jobs in a running application).
0276
0277 ACLs can be configured for either users or groups. Configuration entries accept comma-separated
0278 lists as input, meaning multiple users or groups can be given the desired privileges. This can be
0279 used if you run on a shared cluster and have a set of administrators or developers who need to
monitor applications they may not have started themselves. A wildcard (`*`) added to a specific ACL
means that all users will have the respective privilege. By default, only the user submitting the
application is added to the ACLs.
0283
0284 Group membership is established by using a configurable group mapping provider. The mapper is
0285 configured using the <code>spark.user.groups.mapping</code> config option, described in the table
0286 below.
0287
0288 The following options control the authentication of Web UIs:
0289
0290 <table class="table">
0291 <tr><th>Property Name</th><th>Default</th><th>Meaning</th><th>Since Version</th></tr>
0292 <tr>
0293   <td><code>spark.ui.filters</code></td>
0294   <td>None</td>
0295   <td>
0296     See the <a href="configuration.html#spark-ui">Spark UI</a> configuration for how to configure
0297     filters.
0298   </td>
0299   <td>1.0.0</td>
0300 </tr>
0301 <tr>
0302   <td><code>spark.acls.enable</code></td>
0303   <td>false</td>
0304   <td>
0305     Whether UI ACLs should be enabled. If enabled, this checks to see if the user has access
0306     permissions to view or modify the application. Note this requires the user to be authenticated,
0307     so if no authentication filter is installed, this option does not do anything.
0308   </td>
0309   <td>1.1.0</td>
0310 </tr>
0311 <tr>
0312   <td><code>spark.admin.acls</code></td>
0313   <td>None</td>
0314   <td>
0315     Comma-separated list of users that have view and modify access to the Spark application.
0316   </td>
0317   <td>1.1.0</td>
0318 </tr>
0319 <tr>
0320   <td><code>spark.admin.acls.groups</code></td>
0321   <td>None</td>
0322   <td>
0323     Comma-separated list of groups that have view and modify access to the Spark application.
0324   </td>
0325   <td>2.0.0</td>
0326 </tr>
0327 <tr>
0328   <td><code>spark.modify.acls</code></td>
0329   <td>None</td>
0330   <td>
0331     Comma-separated list of users that have modify access to the Spark application.
0332   </td>
0333   <td>1.1.0</td>
0334 </tr>
0335 <tr>
0336   <td><code>spark.modify.acls.groups</code></td>
0337   <td>None</td>
0338   <td>
0339     Comma-separated list of groups that have modify access to the Spark application.
0340   </td>
0341   <td>2.0.0</td>
0342 </tr>
0343 <tr>
0344   <td><code>spark.ui.view.acls</code></td>
0345   <td>None</td>
0346   <td>
0347     Comma-separated list of users that have view access to the Spark application.
0348   </td>
0349   <td>1.0.0</td>
0350 </tr>
0351 <tr>
0352   <td><code>spark.ui.view.acls.groups</code></td>
0353   <td>None</td>
0354   <td>
0355     Comma-separated list of groups that have view access to the Spark application.
0356   </td>
0357   <td>2.0.0</td>
0358 </tr>
0359 <tr>
0360   <td><code>spark.user.groups.mapping</code></td>
0361   <td><code>org.apache.spark.security.ShellBasedGroupsMappingProvider</code></td>
0362   <td>
0363     The list of groups for a user is determined by a group mapping service defined by the trait
0364     <code>org.apache.spark.security.GroupMappingServiceProvider</code>, which can be configured by
0365     this property.
0366
0367     <br />By default, a Unix shell-based implementation is used, which collects this information
0368     from the host OS.
0369
0370     <br /><em>Note:</em> This implementation supports only Unix/Linux-based environments.
0371     Windows environment is currently <b>not</b> supported. However, a new platform/protocol can
0372     be supported by implementing the trait mentioned above.
0373   </td>
0374   <td>2.0.0</td>
0375 </tr>
0376 </table>
0377
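The ACL rules described above can be modelled in a few lines. This is a simplified sketch for reasoning about configurations, not Spark's implementation: the `AppAcls` class and both functions are hypothetical, and the model assumes that view and modify lists are evaluated independently of each other.

```python
from dataclasses import dataclass, field

WILDCARD = "*"


@dataclass
class AppAcls:
    owner: str  # the user who submitted the application
    admin_users: set = field(default_factory=set)
    admin_groups: set = field(default_factory=set)
    view_users: set = field(default_factory=set)
    view_groups: set = field(default_factory=set)
    modify_users: set = field(default_factory=set)
    modify_groups: set = field(default_factory=set)


def _allowed(user, groups, users_acl, groups_acl):
    """True if the user, or one of the user's groups, appears in the ACL."""
    if WILDCARD in users_acl or WILDCARD in groups_acl:
        return True
    return user in users_acl or bool(set(groups) & groups_acl)


def can_view(acls, user, groups=()):
    if user == acls.owner:
        return True
    return _allowed(user, groups,
                    acls.admin_users | acls.view_users,
                    acls.admin_groups | acls.view_groups)


def can_modify(acls, user, groups=()):
    if user == acls.owner:
        return True
    return _allowed(user, groups,
                    acls.admin_users | acls.modify_users,
                    acls.admin_groups | acls.modify_groups)


# Hypothetical users and groups, for illustration only.
acls = AppAcls(owner="alice", admin_users={"root"},
               view_groups={"analysts"}, modify_users={"bob"})
print(can_view(acls, "carol", ["analysts"]), can_modify(acls, "carol", ["analysts"]))
```

In this model an admin entry grants both privileges, while a user listed only under the modify ACL cannot view the UI unless a view ACL also names them.
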
0378 On YARN, the view and modify ACLs are provided to the YARN service when submitting applications, and
0379 control who has the respective privileges via YARN interfaces.
0380
0381 ## Spark History Server ACLs
0382
0383 Authentication for the SHS Web UI is enabled the same way as for regular applications, using
0384 servlet filters.
0385
0386 To enable authorization in the SHS, a few extra options are used:
0387
0388 <table class="table">
0389 <tr><th>Property Name</th><th>Default</th><th>Meaning</th><th>Since Version</th></tr>
0390 <tr>
0391   <td><code>spark.history.ui.acls.enable</code></td>
0392   <td>false</td>
0393   <td>
0394     Specifies whether ACLs should be checked to authorize users viewing the applications in
0395     the history server. If enabled, access control checks are performed regardless of what the
0396     individual applications had set for <code>spark.ui.acls.enable</code>. The application owner
0397     will always have authorization to view their own application and any users specified via
0398     <code>spark.ui.view.acls</code> and groups specified via <code>spark.ui.view.acls.groups</code>
0399     when the application was run will also have authorization to view that application.
0400     If disabled, no access control checks are made for any application UIs available through
0401     the history server.
0402   </td>
0403   <td>1.0.1</td>
0404 </tr>
0405 <tr>
0406   <td><code>spark.history.ui.admin.acls</code></td>
0407   <td>None</td>
0408   <td>
    Comma-separated list of users that have view access to all the Spark applications in the
    history server.
0411   </td>
0412   <td>2.1.1</td>
0413 </tr>
0414 <tr>
0415   <td><code>spark.history.ui.admin.acls.groups</code></td>
0416   <td>None</td>
0417   <td>
    Comma-separated list of groups that have view access to all the Spark applications in the
    history server.
0420   </td>
0421   <td>2.1.1</td>
0422 </tr>
0423 </table>
0424
The SHS uses the same options to configure the group mapping provider as regular applications.
In this case, the group mapping provider will apply to all UIs served by the SHS, and individual
application configurations will be ignored.
0428
0429 ## SSL Configuration
0430
0431 Configuration for SSL is organized hierarchically. The user can configure the default SSL settings
0432 which will be used for all the supported communication protocols unless they are overwritten by
0433 protocol-specific settings. This way the user can easily provide the common settings for all the
0434 protocols without disabling the ability to configure each one individually. The following table
0435 describes the SSL configuration namespaces:
0436
0437 <table class="table">
0438   <tr>
0439     <th>Config Namespace</th>
0440     <th>Component</th>
0441   </tr>
0442   <tr>
0443     <td><code>spark.ssl</code></td>
0444     <td>
0445       The default SSL configuration. These values will apply to all namespaces below, unless
0446       explicitly overridden at the namespace level.
0447     </td>
0448   </tr>
0449   <tr>
0450     <td><code>spark.ssl.ui</code></td>
0451     <td>Spark application Web UI</td>
0452   </tr>
0453   <tr>
0454     <td><code>spark.ssl.standalone</code></td>
0455     <td>Standalone Master / Worker Web UI</td>
0456   </tr>
0457   <tr>
0458     <td><code>spark.ssl.historyServer</code></td>
0459     <td>History Server Web UI</td>
0460   </tr>
0461 </table>
0462
0463 The full breakdown of available SSL options can be found below. The `${ns}` placeholder should be
0464 replaced with one of the above namespaces.
0465
0466 <table class="table">
0467 <tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr>
0468   <tr>
0469     <td><code>${ns}.enabled</code></td>
0470     <td>false</td>
    <td>Enables SSL. When enabled, <code>${ns}.protocol</code> is required.</td>
0472   </tr>
0473   <tr>
0474     <td><code>${ns}.port</code></td>
0475     <td>None</td>
0476     <td>
      The port on which the SSL service will listen.
0478
0479       <br />The port must be defined within a specific namespace configuration. The default
0480       namespace is ignored when reading this configuration.
0481
0482       <br />When not set, the SSL port will be derived from the non-SSL port for the
0483       same service. A value of "0" will make the service bind to an ephemeral port.
0484     </td>
0485   </tr>
0486   <tr>
0487     <td><code>${ns}.enabledAlgorithms</code></td>
0488     <td>None</td>
0489     <td>
      A comma-separated list of ciphers. The specified ciphers must be supported by the JVM.

      <br />The reference list of cipher suites can be found in the "JSSE Cipher Suite Names" section
0493       of the Java security guide. The list for Java 8 can be found at
0494       <a href="https://docs.oracle.com/javase/8/docs/technotes/guides/security/StandardNames.html#ciphersuites">this</a>
0495       page.
0496
0497       <br />Note: If not set, the default cipher suite for the JRE will be used.
0498     </td>
0499   </tr>
0500   <tr>
0501     <td><code>${ns}.keyPassword</code></td>
0502     <td>None</td>
0503     <td>
0504       The password to the private key in the key store.
0505     </td>
0506   </tr>
0507   <tr>
0508     <td><code>${ns}.keyStore</code></td>
0509     <td>None</td>
0510     <td>
0511       Path to the key store file. The path can be absolute or relative to the directory in which the
0512       process is started.
0513     </td>
0514   </tr>
0515   <tr>
0516     <td><code>${ns}.keyStorePassword</code></td>
0517     <td>None</td>
0518     <td>Password to the key store.</td>
0519   </tr>
0520   <tr>
0521     <td><code>${ns}.keyStoreType</code></td>
0522     <td>JKS</td>
0523     <td>The type of the key store.</td>
0524   </tr>
0525   <tr>
0526     <td><code>${ns}.protocol</code></td>
0527     <td>None</td>
0528     <td>
      TLS protocol to use. The protocol must be supported by the JVM.
0530
0531       <br />The reference list of protocols can be found in the "Additional JSSE Standard Names"
0532       section of the Java security guide. For Java 8, the list can be found at
0533       <a href="https://docs.oracle.com/javase/8/docs/technotes/guides/security/StandardNames.html#jssenames">this</a>
0534       page.
0535     </td>
0536   </tr>
0537   <tr>
0538     <td><code>${ns}.needClientAuth</code></td>
0539     <td>false</td>
0540     <td>Whether to require client authentication.</td>
0541   </tr>
0542   <tr>
0543     <td><code>${ns}.trustStore</code></td>
0544     <td>None</td>
0545     <td>
0546       Path to the trust store file. The path can be absolute or relative to the directory in which
0547       the process is started.
0548     </td>
0549   </tr>
0550   <tr>
0551     <td><code>${ns}.trustStorePassword</code></td>
0552     <td>None</td>
0553     <td>Password for the trust store.</td>
0554   </tr>
0555   <tr>
0556     <td><code>${ns}.trustStoreType</code></td>
0557     <td>JKS</td>
0558     <td>The type of the trust store.</td>
0559   </tr>
0560 </table>
0561
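The hierarchical lookup described above can be sketched as a small resolver: a namespace-specific value wins, otherwise the `spark.ssl` default applies, except for `port`, which is never inherited. The function `resolve_ssl_option` is a hypothetical helper for illustration and is meant to be called with one of the component namespaces, not with `spark.ssl` itself.

```python
DEFAULT_NS = "spark.ssl"
NAMESPACE_ONLY = {"port"}  # options that ignore the default namespace


def resolve_ssl_option(conf: dict, ns: str, option: str, default=None):
    """Look up `${ns}.<option>`, falling back to `spark.ssl.<option>`."""
    specific = f"{ns}.{option}"
    if specific in conf:
        return conf[specific]
    if option in NAMESPACE_ONLY:
        return default
    return conf.get(f"{DEFAULT_NS}.{option}", default)


# Hypothetical configuration: a shared key store, overridden for the UI.
conf = {
    "spark.ssl.enabled": "true",
    "spark.ssl.keyStore": "shared.jks",
    "spark.ssl.ui.keyStore": "ui.jks",
    "spark.ssl.port": "9999",  # ignored: ports must be set per namespace
    "spark.ssl.historyServer.port": "18480",
}

print(resolve_ssl_option(conf, "spark.ssl.ui", "keyStore"))
print(resolve_ssl_option(conf, "spark.ssl.historyServer", "keyStore"))
print(resolve_ssl_option(conf, "spark.ssl.ui", "port"))
```
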
0562 Spark also supports retrieving `${ns}.keyPassword`, `${ns}.keyStorePassword` and `${ns}.trustStorePassword` from
0563 [Hadoop Credential Providers](https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/CredentialProviderAPI.html).
Users can store a password in a credential file and make it accessible to different components, for example:
0565
0566 ```
0567 hadoop credential create spark.ssl.keyPassword -value password \
0568     -provider jceks://hdfs@nn1.example.com:9001/user/backup/ssl.jceks
0569 ```
0570
0571 To configure the location of the credential provider, set the `hadoop.security.credential.provider.path`
0572 config option in the Hadoop configuration used by Spark, like:
0573
0574 ```
0575   <property>
0576     <name>hadoop.security.credential.provider.path</name>
0577     <value>jceks://hdfs@nn1.example.com:9001/user/backup/ssl.jceks</value>
0578   </property>
0579 ```
0580
Alternatively, set it through SparkConf: `spark.hadoop.hadoop.security.credential.provider.path=jceks://hdfs@nn1.example.com:9001/user/backup/ssl.jceks`.
0582
0583 ## Preparing the key stores
0584
Key stores can be generated with the `keytool` program. The reference documentation for this tool for
Java 8 is [here](https://docs.oracle.com/javase/8/docs/technotes/tools/unix/keytool.html).
The most basic steps to configure the key stores and the trust store for a Spark Standalone
deployment are as follows:
0589
0590 * Generate a key pair for each node
0591 * Export the public key of the key pair to a file on each node
0592 * Import all exported public keys into a single trust store
0593 * Distribute the trust store to the cluster nodes
0594
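The first three steps above can be sketched as `keytool` invocations. The code below only builds the argument lists and does not run them. Node names, file names, key parameters and the `STORE_PASS` environment variable are assumptions for this example; the `:env` modifier makes `keytool` read the password from that variable instead of the command line.

```python
def keytool_commands(nodes, trust_store="truststore.jks", pass_env="STORE_PASS"):
    """Return keytool argument lists: generate, export, then import."""
    store_pass = ["-storepass:env", pass_env]
    generate, export, imports = [], [], []
    for node in nodes:
        key_store = f"{node}-keystore.jks"  # hypothetical file naming
        cert_file = f"{node}.cer"
        # Step 1: generate a key pair for the node.
        generate.append(["keytool", "-genkeypair", "-alias", node,
                         "-keyalg", "RSA", "-keysize", "2048",
                         "-dname", f"CN={node}", "-keystore", key_store,
                         *store_pass, "-keypass:env", pass_env])
        # Step 2: export the certificate that carries the public key.
        export.append(["keytool", "-exportcert", "-alias", node,
                       "-file", cert_file, "-keystore", key_store, *store_pass])
        # Step 3: import every exported certificate into one trust store.
        imports.append(["keytool", "-importcert", "-noprompt", "-alias", node,
                        "-file", cert_file, "-keystore", trust_store, *store_pass])
    return generate + export + imports


commands = keytool_commands(["node1", "node2"])
for command in commands:
    print(" ".join(command))
```

The resulting trust store still has to be copied to each node by whatever mechanism the cluster already uses for distributing files.
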
0595 ### YARN mode
0596
0597 To provide a local trust store or key store file to drivers running in cluster mode, they can be
0598 distributed with the application using the `--files` command line argument (or the equivalent
`spark.files` configuration). The files will be placed in the driver's working directory, so the TLS
0600 configuration should just reference the file name with no absolute path.
0601
0602 Distributing local key stores this way may require the files to be staged in HDFS (or other similar
0603 distributed file system used by the cluster), so it's recommended that the underlying file system be
0604 configured with security in mind (e.g. by enabling authentication and wire encryption).
0605
0606 ### Standalone mode
0607
0608 The user needs to provide key stores and configuration options for master and workers. They have to
0609 be set by attaching appropriate Java system properties in `SPARK_MASTER_OPTS` and in
0610 `SPARK_WORKER_OPTS` environment variables, or just in `SPARK_DAEMON_JAVA_OPTS`.
0611
0612 The user may allow the executors to use the SSL settings inherited from the worker process. That
0613 can be accomplished by setting `spark.ssl.useNodeLocalConf` to `true`. In that case, the settings
0614 provided by the user on the client side are not used.
0615
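A minimal sketch of how such system properties might be assembled for `SPARK_MASTER_OPTS` follows. The helper `daemon_java_opts` and the key store path are hypothetical, and passwords are left out on purpose so they do not appear in this example.

```python
def daemon_java_opts(props: dict) -> str:
    """Render properties as space-separated -Dkey=value Java options."""
    for key, value in props.items():
        if any(ch.isspace() for ch in f"{key}{value}"):
            # Whitespace would split the option when the variable is expanded.
            raise ValueError(f"whitespace is not supported in {key!r}")
    return " ".join(f"-D{key}={value}" for key, value in props.items())


master_ssl = {
    "spark.ssl.enabled": "true",
    "spark.ssl.keyStore": "/etc/spark/master-keystore.jks",  # hypothetical path
    "spark.ssl.protocol": "TLSv1.2",
}
master_opts = daemon_java_opts(master_ssl)
print(f'SPARK_MASTER_OPTS="{master_opts}"')
```
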
0616 ### Mesos mode
0617
Mesos 1.3.0 and newer supports `Secrets` primitives as both file-based and environment-based
secrets. Spark allows the specification of file-based and environment-variable-based secrets with
`spark.mesos.driver.secret.filenames` and `spark.mesos.driver.secret.envkeys`, respectively.
0621
Depending on the secret store backend, secrets can be passed by reference or by value with the
`spark.mesos.driver.secret.names` and `spark.mesos.driver.secret.values` configuration properties,
respectively.
0625
0626 Reference type secrets are served by the secret store and referred to by name, for example
0627 `/mysecret`. Value type secrets are passed on the command line and translated into their
0628 appropriate files or environment variables.
0629
0630 ## HTTP Security Headers
0631
0632 Apache Spark can be configured to include HTTP headers to aid in preventing Cross Site Scripting
0633 (XSS), Cross-Frame Scripting (XFS), MIME-Sniffing, and also to enforce HTTP Strict Transport
0634 Security.
0635
0636 <table class="table">
0637 <tr><th>Property Name</th><th>Default</th><th>Meaning</th><th>Since Version</th></tr>
0638 <tr>
0639   <td><code>spark.ui.xXssProtection</code></td>
0640   <td><code>1; mode=block</code></td>
0641   <td>
0642     Value for HTTP X-XSS-Protection response header. You can choose appropriate value
0643     from below:
0644     <ul>
0645       <li><code>0</code> (Disables XSS filtering)</li>
0646       <li><code>1</code> (Enables XSS filtering. If a cross-site scripting attack is detected,
0647         the browser will sanitize the page.)</li>
0648       <li><code>1; mode=block</code> (Enables XSS filtering. The browser will prevent rendering
0649         of the page if an attack is detected.)</li>
0650     </ul>
0651   </td>
0652   <td>2.3.0</td>
0653 </tr>
0654 <tr>
0655   <td><code>spark.ui.xContentTypeOptions.enabled</code></td>
0656   <td><code>true</code></td>
0657   <td>
0658     When enabled, X-Content-Type-Options HTTP response header will be set to "nosniff".
0659   </td>
0660   <td>2.3.0</td>
0661 </tr>
0662 <tr>
0663   <td><code>spark.ui.strictTransportSecurity</code></td>
0664   <td>None</td>
0665   <td>
0666     Value for HTTP Strict Transport Security (HSTS) Response Header. You can choose appropriate
0667     value from below and set <code>expire-time</code> accordingly. This option is only used when
0668     SSL/TLS is enabled.
0669     <ul>
0670       <li><code>max-age=&lt;expire-time&gt;</code></li>
0671       <li><code>max-age=&lt;expire-time&gt;; includeSubDomains</code></li>
0672       <li><code>max-age=&lt;expire-time&gt;; preload</code></li>
0673     </ul>
0674   </td>
0675   <td>2.3.0</td>
0676 </tr>
0677 </table>
0678
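To make the value formats concrete, the sketch below assembles the three properties from the table and builds the HSTS value from an expire time in seconds. The helper names are hypothetical, and only the three HSTS forms listed in the table are accepted.

```python
ALLOWED_DIRECTIVES = (None, "includeSubDomains", "preload")


def hsts_value(expire_time: int, directive=None) -> str:
    """Build a Strict-Transport-Security value in one of the listed forms."""
    if expire_time < 0:
        raise ValueError("expire-time must not be negative")
    if directive not in ALLOWED_DIRECTIVES:
        raise ValueError(f"unsupported directive: {directive!r}")
    value = f"max-age={expire_time}"
    return value if directive is None else f"{value}; {directive}"


def security_header_conf(expire_time: int, directive=None) -> dict:
    """Properties for the HTTP security headers described above."""
    return {
        "spark.ui.xXssProtection": "1; mode=block",
        "spark.ui.xContentTypeOptions.enabled": "true",
        # Only used when SSL/TLS is enabled.
        "spark.ui.strictTransportSecurity": hsts_value(expire_time, directive),
    }


# One year in seconds; the duration is an arbitrary choice for this example.
headers = security_header_conf(31536000, "includeSubDomains")
print(headers["spark.ui.strictTransportSecurity"])
```
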
0679
0680 # Configuring Ports for Network Security
0681
0682 Generally speaking, a Spark cluster and its services are not deployed on the public internet.
0683 They are generally private services, and should only be accessible within the network of the
0684 organization that deploys Spark. Access to the hosts and ports used by Spark services should
0685 be limited to origin hosts that need to access the services.
0686
0687 Below are the primary ports that Spark uses for its communication and how to
0688 configure those ports.
0689
0690 ## Standalone mode only
0691
0692 <table class="table">
0693   <tr>
0694     <th>From</th><th>To</th><th>Default Port</th><th>Purpose</th><th>Configuration
0695     Setting</th><th>Notes</th>
0696   </tr>
0697   <tr>
0698     <td>Browser</td>
0699     <td>Standalone Master</td>
0700     <td>8080</td>
0701     <td>Web UI</td>
0702     <td><code>spark.master.ui.port /<br> SPARK_MASTER_WEBUI_PORT</code></td>
0703     <td>Jetty-based. Standalone mode only.</td>
0704   </tr>
0705   <tr>
0706     <td>Browser</td>
0707     <td>Standalone Worker</td>
0708     <td>8081</td>
0709     <td>Web UI</td>
0710     <td><code>spark.worker.ui.port /<br> SPARK_WORKER_WEBUI_PORT</code></td>
0711     <td>Jetty-based. Standalone mode only.</td>
0712   </tr>
0713   <tr>
0714     <td>Driver /<br> Standalone Worker</td>
0715     <td>Standalone Master</td>
0716     <td>7077</td>
0717     <td>Submit job to cluster /<br> Join cluster</td>
0718     <td><code>SPARK_MASTER_PORT</code></td>
0719     <td>Set to "0" to choose a port randomly. Standalone mode only.</td>
0720   </tr>
0721   <tr>
0722     <td>External Service</td>
0723     <td>Standalone Master</td>
0724     <td>6066</td>
0725     <td>Submit job to cluster via REST API</td>
0726     <td><code>spark.master.rest.port</code></td>
0727     <td>Use <code>spark.master.rest.enabled</code> to enable/disable this service. Standalone mode only.</td>
0728   </tr>
0729   <tr>
0730     <td>Standalone Master</td>
0731     <td>Standalone Worker</td>
0732     <td>(random)</td>
0733     <td>Schedule executors</td>
0734     <td><code>SPARK_WORKER_PORT</code></td>
0735     <td>Set to "0" to choose a port randomly. Standalone mode only.</td>
0736   </tr>
0737 </table>
0738
0739 ## All cluster managers
0740
0741 <table class="table">
0742   <tr>
0743     <th>From</th><th>To</th><th>Default Port</th><th>Purpose</th><th>Configuration
0744     Setting</th><th>Notes</th>
0745   </tr>
0746   <tr>
0747     <td>Browser</td>
0748     <td>Application</td>
0749     <td>4040</td>
0750     <td>Web UI</td>
0751     <td><code>spark.ui.port</code></td>
0752     <td>Jetty-based</td>
0753   </tr>
0754   <tr>
0755     <td>Browser</td>
0756     <td>History Server</td>
0757     <td>18080</td>
0758     <td>Web UI</td>
0759     <td><code>spark.history.ui.port</code></td>
0760     <td>Jetty-based</td>
0761   </tr>
0762   <tr>
0763     <td>Executor /<br> Standalone Master</td>
0764     <td>Driver</td>
0765     <td>(random)</td>
0766     <td>Connect to application /<br> Notify executor state changes</td>
0767     <td><code>spark.driver.port</code></td>
0768     <td>Set to "0" to choose a port randomly.</td>
0769   </tr>
0770   <tr>
0771     <td>Executor / Driver</td>
0772     <td>Executor / Driver</td>
0773     <td>(random)</td>
0774     <td>Block Manager port</td>
0775     <td><code>spark.blockManager.port</code></td>
0776     <td>Raw socket via ServerSocketChannel</td>
0777   </tr>
0778 </table>
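
The driver and block manager ports are random by default, which makes them hard to allow through a firewall. As a rough sketch, the properties from the table above could be pinned in a properties file. The temporary file location and the port numbers `40000` and `40010` are assumptions chosen only for illustration.

```bash
# Write a properties fragment that pins the normally random ports.
# The file location and the port numbers are illustrative assumptions.
conf_file="$(mktemp)"
cat > "$conf_file" <<'EOF'
spark.driver.port        40000
spark.blockManager.port  40010
spark.ui.port            4040
EOF

# Show the result so it can be reviewed before use.
cat "$conf_file"
```

A file like this could then be passed to `spark-submit` with `--properties-file`, or its entries merged into `conf/spark-defaults.conf`.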
0779
0780
0781 # Kerberos
0782
Spark supports submitting applications in environments that use Kerberos for authentication.
In most cases, Spark relies on the credentials of the currently logged-in user when authenticating
to Kerberos-aware services. Such credentials can be obtained by logging in to the configured KDC
with tools like `kinit`.
0787
0788 When talking to Hadoop-based services, Spark needs to obtain delegation tokens so that non-local
0789 processes can authenticate. Spark ships with support for HDFS and other Hadoop file systems, Hive
0790 and HBase.
0791
When using a Hadoop filesystem (such as HDFS or WebHDFS), Spark will acquire the relevant tokens
for the service hosting the user's home directory.
0794
An HBase token will be obtained if HBase is in the application's classpath, and the HBase
configuration has Kerberos authentication turned on (`hbase.security.authentication=kerberos`).
0797
0798 Similarly, a Hive token will be obtained if Hive is in the classpath, and the configuration includes
0799 URIs for remote metastore services (`hive.metastore.uris` is not empty).
0800
0801 If an application needs to interact with other secure Hadoop filesystems, their URIs need to be
0802 explicitly provided to Spark at launch time. This is done by listing them in the
0803 `spark.kerberos.access.hadoopFileSystems` property, described in the configuration section below.
0804
0805 Spark also supports custom delegation token providers using the Java Services
0806 mechanism (see `java.util.ServiceLoader`). Implementations of
0807 `org.apache.spark.security.HadoopDelegationTokenProvider` can be made available to Spark
0808 by listing their names in the corresponding file in the jar's `META-INF/services` directory.
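
As an illustration of the `java.util.ServiceLoader` convention described above, the sketch below creates the registration file for a provider. The class name `com.example.security.MyTokenProvider` and the temporary directory standing in for a build's resources root are hypothetical.

```bash
# Hypothetical implementation class; substitute the provider you actually wrote.
provider_class="com.example.security.MyTokenProvider"
# Stand-in for the resources root of your build (for example src/main/resources).
resources_dir="$(mktemp -d)"

# ServiceLoader looks for a file named after the service interface ...
services_file="${resources_dir}/META-INF/services/org.apache.spark.security.HadoopDelegationTokenProvider"
mkdir -p "$(dirname "$services_file")"
# ... whose lines are the fully qualified names of the implementations.
printf '%s\n' "$provider_class" > "$services_file"

cat "$services_file"
```

The file must end up inside the jar that contains the provider class, so that both are on the application's classpath together.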
0809
Delegation tokens are currently supported only in YARN and Mesos modes. Consult the
deployment-specific page for more information.
0812
The following options provide finer-grained control for this feature:
0814
0815 <table class="table">
0816 <tr><th>Property Name</th><th>Default</th><th>Meaning</th><th>Since Version</th></tr>
0817 <tr>
0818   <td><code>spark.security.credentials.${service}.enabled</code></td>
0819   <td><code>true</code></td>
0820   <td>
0821     Controls whether to obtain credentials for services when security is enabled.
0822     By default, credentials for all supported services are retrieved when those services are
0823     configured, but it's possible to disable that behavior if it somehow conflicts with the
0824     application being run.
0825   </td>
0826   <td>2.3.0</td>
0827 </tr>
0828 <tr>
0829   <td><code>spark.kerberos.access.hadoopFileSystems</code></td>
0830   <td>(none)</td>
0831   <td>
0832     A comma-separated list of secure Hadoop filesystems your Spark application is going to access. For
0833     example, <code>spark.kerberos.access.hadoopFileSystems=hdfs://nn1.com:8032,hdfs://nn2.com:8032,
0834     webhdfs://nn3.com:50070</code>. The Spark application must have access to the filesystems listed
0835     and Kerberos must be properly configured to be able to access them (either in the same realm
0836     or in a trusted realm). Spark acquires security tokens for each of the filesystems so that
0837     the Spark application can access those remote Hadoop filesystems.
0838   </td>
0839   <td>3.0.0</td>
0840 </tr>
0841 </table>
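
To make the two properties above concrete, here is a small sketch that assembles the corresponding `--conf` flags. The NameNode URIs are hypothetical, and disabling the `hbase` provider is only an example of the `${service}` placeholder. The script prints the flags instead of invoking `spark-submit`, so it can be run anywhere.

```bash
# Hypothetical NameNode URIs; replace them with the filesystems the job reads.
fs_list=""
for fs in "hdfs://nn1.example.com:8020" "webhdfs://nn2.example.com:9870"; do
  # Append with a comma separator, except before the first entry.
  fs_list="${fs_list:+${fs_list},}${fs}"
done

# Print the flags rather than running spark-submit, so the result can be inspected.
printf '%s\n' \
  "--conf spark.kerberos.access.hadoopFileSystems=${fs_list}" \
  "--conf spark.security.credentials.hbase.enabled=false"
```

The printed flags can be appended to a `spark-submit` command line once the URIs match your cluster.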
0842
0843 ## Long-Running Applications
0844
Long-running applications may run into issues if their run time exceeds the maximum delegation
token lifetime configured in the services they need to access.

Spark supports automatically creating new tokens for these applications. This feature is not
available everywhere: it is only implemented on YARN and Kubernetes (both client and cluster
modes), and on Mesos when using client mode.

There are two ways to enable this functionality.
0853
0854 ### Using a Keytab
0855
0856 By providing Spark with a principal and keytab (e.g. using `spark-submit` with `--principal`
0857 and `--keytab` parameters), the application will maintain a valid Kerberos login that can be
0858 used to retrieve delegation tokens indefinitely.
0859
0860 Note that when using a keytab in cluster mode, it will be copied over to the machine running the
0861 Spark driver. In the case of YARN, this means using HDFS as a staging area for the keytab, so it's
0862 strongly recommended that both YARN and HDFS be secured with encryption, at least.
0863
### Using a Ticket Cache
0865
0866 By setting `spark.kerberos.renewal.credentials` to `ccache` in Spark's configuration, the local
0867 Kerberos ticket cache will be used for authentication. Spark will keep the ticket renewed during its
0868 renewable life, but after it expires a new ticket needs to be acquired (e.g. by running `kinit`).
0869
0870 It's up to the user to maintain an updated ticket cache that Spark can use.
0871
0872 The location of the ticket cache can be customized by setting the `KRB5CCNAME` environment
0873 variable.
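
The ticket cache setup might look like the sketch below. The cache path `/tmp/krb5cc_spark_app` is a hypothetical location, and the `kinit` step is shown only as a comment because it needs a reachable KDC.

```bash
# Hypothetical cache location; kinit and Spark must both see the same value.
export KRB5CCNAME="/tmp/krb5cc_spark_app"

# Obtain (or later refresh) the ticket before it expires, for example:
#   kinit <principal>

# Tell Spark to renew from the ticket cache instead of a keytab.
renewal_conf="spark.kerberos.renewal.credentials=ccache"
printf '%s\n' "--conf ${renewal_conf}"
```

The printed flag is meant to be added to the `spark-submit` command that launches the application from the same shell environment.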
0874
0875 ## Secure Interaction with Kubernetes
0876
As noted above, when talking to Hadoop-based services behind Kerberos, Spark needs to obtain delegation tokens
so that non-local processes can authenticate. In Kubernetes, these delegation tokens are stored in Secrets that are
shared by the Driver and its Executors.

Whichever way you submit a Kerberos job, you must either define the `HADOOP_CONF_DIR` environment variable or set
`spark.kubernetes.hadoop.configMapName`.

It is also important to note that the KDC needs to be visible from inside the containers.

If you wish to use a remote directory that contains the Hadoop configuration files, set
`spark.kubernetes.hadoop.configMapName` to a pre-existing ConfigMap.

There are three ways of submitting a Kerberos job:
0888
1. Submitting after running `kinit`, which stores a TGT in the local ticket cache:
```bash
# Obtain a TGT from the keytab first; spark-submit then reads it from the ticket cache.
/usr/bin/kinit -kt <keytab_file> <username>@<krb5 realm>
/opt/spark/bin/spark-submit \
    --deploy-mode cluster \
    --class org.apache.spark.examples.HdfsTest \
    --master k8s://<KUBERNETES_MASTER_ENDPOINT> \
    --conf spark.executor.instances=1 \
    --conf spark.app.name=spark-hdfs \
    --conf spark.kubernetes.container.image=spark:latest \
    --conf spark.kubernetes.kerberos.krb5.path=/etc/krb5.conf \
    local:///opt/spark/examples/jars/spark-examples_<VERSION>.jar \
    <HDFS_FILE_LOCATION>
```

2. Submitting with a local keytab and principal:
```bash
# Pass the keytab and principal directly; no prior kinit is needed.
/opt/spark/bin/spark-submit \
    --deploy-mode cluster \
    --class org.apache.spark.examples.HdfsTest \
    --master k8s://<KUBERNETES_MASTER_ENDPOINT> \
    --conf spark.executor.instances=1 \
    --conf spark.app.name=spark-hdfs \
    --conf spark.kubernetes.container.image=spark:latest \
    --conf spark.kerberos.keytab=<KEYTAB_FILE> \
    --conf spark.kerberos.principal=<PRINCIPAL> \
    --conf spark.kubernetes.kerberos.krb5.path=/etc/krb5.conf \
    local:///opt/spark/examples/jars/spark-examples_<VERSION>.jar \
    <HDFS_FILE_LOCATION>
```
0918
3. Submitting with pre-populated secrets that contain the delegation token and already exist within the namespace:
```bash
# Reference an existing Secret (and the key inside it) that holds the delegation token.
/opt/spark/bin/spark-submit \
    --deploy-mode cluster \
    --class org.apache.spark.examples.HdfsTest \
    --master k8s://<KUBERNETES_MASTER_ENDPOINT> \
    --conf spark.executor.instances=1 \
    --conf spark.app.name=spark-hdfs \
    --conf spark.kubernetes.container.image=spark:latest \
    --conf spark.kubernetes.kerberos.tokenSecret.name=<SECRET_TOKEN_NAME> \
    --conf spark.kubernetes.kerberos.tokenSecret.itemKey=<SECRET_ITEM_KEY> \
    --conf spark.kubernetes.kerberos.krb5.path=/etc/krb5.conf \
    local:///opt/spark/examples/jars/spark-examples_<VERSION>.jar \
    <HDFS_FILE_LOCATION>
```
0934
3b. Submitting as in (3), but specifying a pre-created krb5 ConfigMap and a pre-created `HADOOP_CONF_DIR` ConfigMap:
```bash
# Same as (3), but the Hadoop and krb5 configuration come from existing ConfigMaps.
/opt/spark/bin/spark-submit \
    --deploy-mode cluster \
    --class org.apache.spark.examples.HdfsTest \
    --master k8s://<KUBERNETES_MASTER_ENDPOINT> \
    --conf spark.executor.instances=1 \
    --conf spark.app.name=spark-hdfs \
    --conf spark.kubernetes.container.image=spark:latest \
    --conf spark.kubernetes.kerberos.tokenSecret.name=<SECRET_TOKEN_NAME> \
    --conf spark.kubernetes.kerberos.tokenSecret.itemKey=<SECRET_ITEM_KEY> \
    --conf spark.kubernetes.hadoop.configMapName=<HCONF_CONFIG_MAP_NAME> \
    --conf spark.kubernetes.kerberos.krb5.configMapName=<KRB_CONFIG_MAP_NAME> \
    local:///opt/spark/examples/jars/spark-examples_<VERSION>.jar \
    <HDFS_FILE_LOCATION>
```

# Event Logging
0952
If your applications are using event logging, the directory where the event logs go
(`spark.eventLog.dir`) should be manually created with proper permissions. To secure the log files,
the directory permissions should be set to `drwxrwxrwt` (mode `1777`: writable by everyone, with the
sticky bit set). The owner and group of the directory should correspond to the super user who is
running the Spark History Server.
0957
0958 This will allow all users to write to the directory but will prevent unprivileged users from
0959 reading, removing or renaming a file unless they own it. The event log files will be created by
0960 Spark with permissions such that only the user and group have read and write access.
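
As a minimal sketch of the permissions described above, the commands below create a world-writable directory with the sticky bit set. A local temporary path stands in for `spark.eventLog.dir` here; that path is an assumption made so the example can run without a cluster.

```bash
# Local stand-in for spark.eventLog.dir; the path is an assumption for this sketch.
event_log_dir="$(mktemp -d)/spark-events"
mkdir -p "$event_log_dir"

# 1777 = read/write/execute for everyone plus the sticky bit (shown by ls as a trailing "t").
chmod 1777 "$event_log_dir"

# Inspect the resulting mode string.
ls -ld "$event_log_dir"
```

On HDFS, the equivalent steps would be `hdfs dfs -mkdir -p` followed by `hdfs dfs -chmod 1777` on the event log path.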
0961
0962 # Persisting driver logs in client mode
0963
If your applications persist driver logs in client mode by enabling `spark.driver.log.persistToDfs.enabled`,
the directory where the driver logs go (`spark.driver.log.dfsDir`) should be manually created with proper
permissions. To secure the log files, the directory permissions should be set to `drwxrwxrwt` (mode `1777`:
writable by everyone, with the sticky bit set). The owner and group of the directory should correspond to the
super user who is running the Spark History Server.
0968
0969 This will allow all users to write to the directory but will prevent unprivileged users from
0970 reading, removing or renaming a file unless they own it. The driver log files will be created by
0971 Spark with permissions such that only the user and group have read and write access.
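
The two properties named in this section could be set together in a properties file, as in the sketch below. The DFS path `hdfs:///user/spark/driverLogs` and the temporary file location are hypothetical.

```bash
# Hypothetical DFS location; point it at a directory created with the permissions above.
driver_log_conf="$(mktemp)"
cat > "$driver_log_conf" <<'EOF'
spark.driver.log.persistToDfs.enabled  true
spark.driver.log.dfsDir                hdfs:///user/spark/driverLogs
EOF

# Show the fragment so it can be reviewed or merged into an existing configuration.
cat "$driver_log_conf"
```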