---
layout: global
title: Building Spark
redirect_from: "building-with-maven.html"
license: |
  Licensed to the Apache Software Foundation (ASF) under one or more
  contributor license agreements.  See the NOTICE file distributed with
  this work for additional information regarding copyright ownership.
  The ASF licenses this file to You under the Apache License, Version 2.0
  (the "License"); you may not use this file except in compliance with
  the License.  You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.
---

* This will become a table of contents (this text will be scraped).
{:toc}

# Building Apache Spark

## Apache Maven

The Maven-based build is the build of reference for Apache Spark.
Building Spark using Maven requires Maven 3.6.3 and Java 8.
Spark requires Scala 2.12; support for Scala 2.11 was removed in Spark 3.0.0.

### Setting up Maven's Memory Usage

You'll need to configure Maven to use more memory than usual by setting `MAVEN_OPTS`:

    export MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=1g"

(The `ReservedCodeCacheSize` setting is optional but recommended.)
If you don't add these parameters to `MAVEN_OPTS`, you may see errors and warnings like the following:

    [INFO] Compiling 203 Scala sources and 9 Java sources to /Users/me/Development/spark/core/target/scala-{{site.SCALA_BINARY_VERSION}}/classes...
    [ERROR] Java heap space -> [Help 1]

You can fix these problems by setting the `MAVEN_OPTS` variable as discussed before.

**Note:**

* If using `build/mvn` with no `MAVEN_OPTS` set, the script will automatically add the above options to the `MAVEN_OPTS` environment variable.
* The `test` phase of the Spark build will automatically add these options to `MAVEN_OPTS`, even when not using `build/mvn`.

### build/mvn

Spark now comes packaged with a self-contained Maven installation, located under the `build/` directory, to ease building and deploying Spark from source. This script will automatically download and set up all necessary build requirements ([Maven](https://maven.apache.org/), [Scala](https://www.scala-lang.org/), and [Zinc](https://github.com/typesafehub/zinc)) locally within the `build/` directory itself. It honors any `mvn` binary that is already present, but will pull down its own copy of Scala and Zinc regardless, to ensure the proper version requirements are met. `build/mvn` execution acts as a pass-through to the `mvn` call, allowing an easy transition from previous build methods. As an example, one can build a version of Spark as follows:

    ./build/mvn -DskipTests clean package

Other build examples can be found below.

## Building a Runnable Distribution

To create a Spark distribution like those distributed by the
[Spark Downloads](https://spark.apache.org/downloads.html) page, and that is laid out so as
to be runnable, use `./dev/make-distribution.sh` in the project root directory. It can be configured
with Maven profile settings and so on like the direct Maven build. Example:

    ./dev/make-distribution.sh --name custom-spark --pip --r --tgz -Psparkr -Phive -Phive-thriftserver -Pmesos -Pyarn -Pkubernetes

This will build a Spark distribution along with the Python pip and R packages. For more information on usage, run `./dev/make-distribution.sh --help`.

## Specifying the Hadoop Version and Enabling YARN

You can specify the exact version of Hadoop to compile against through the `hadoop.version` property.

You can enable the `yarn` profile and optionally set the `yarn.version` property if it is different
from `hadoop.version`.

Example:

    ./build/mvn -Pyarn -Dhadoop.version=2.8.5 -DskipTests clean package
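
If your cluster's YARN version differs from the Hadoop version you compile against, you can set both properties together. A minimal sketch, assuming illustrative version numbers:

    ./build/mvn -Pyarn -Dhadoop.version=2.8.5 -Dyarn.version=2.8.5 -DskipTests clean package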

## Building With Hive and JDBC Support

To enable Hive integration for Spark SQL along with its JDBC server and CLI,
add the `-Phive` and `-Phive-thriftserver` profiles to your existing build options.
By default Spark will build with Hive 2.3.7.

    # With Hive 2.3.7 support
    ./build/mvn -Pyarn -Phive -Phive-thriftserver -DskipTests clean package

## Packaging without Hadoop Dependencies for YARN

The assembly directory produced by `mvn package` will, by default, include all of Spark's
dependencies, including Hadoop and some of its ecosystem projects. On YARN deployments, this
causes multiple versions of these to appear on executor classpaths: the version packaged in
the Spark assembly and the version on each node, included with `yarn.application.classpath`.
The `hadoop-provided` profile builds the assembly without including Hadoop-ecosystem projects,
like ZooKeeper and Hadoop itself.
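
For example, a build that leaves Hadoop out of the assembly might look like the following sketch; combine `-Phadoop-provided` with whichever other profiles you need:

    ./build/mvn -Phadoop-provided -Pyarn -DskipTests clean package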

## Building with Mesos support

    ./build/mvn -Pmesos -DskipTests clean package

## Building with Kubernetes support

    ./build/mvn -Pkubernetes -DskipTests clean package

## Building submodules individually

It's possible to build Spark submodules using the `mvn -pl` option.

For instance, you can build the Spark Streaming module using:

    ./build/mvn -pl :spark-streaming_{{site.SCALA_BINARY_VERSION}} clean install

where `spark-streaming_{{site.SCALA_BINARY_VERSION}}` is the `artifactId` as defined in the `streaming/pom.xml` file.

## Continuous Compilation

We use the scala-maven-plugin, which supports incremental and continuous compilation. For example,

    ./build/mvn scala:cc

should run continuous compilation (i.e. wait for changes). However, this has not been tested
extensively. A couple of gotchas to note:

* it only scans the paths `src/main` and `src/test` (see
[docs](https://davidb.github.io/scala-maven-plugin/example_cc.html)), so it will only work
from within certain submodules that have that structure.

* you'll typically need to run `mvn install` from the project root for compilation within
specific submodules to work; this is because submodules that depend on other submodules do so via
the `spark-parent` module.

Thus, the full flow for running continuous compilation of the `core` submodule may look more like:

    $ ./build/mvn install
    $ cd core
    $ ../build/mvn scala:cc

## Building with SBT

Maven is the official build tool recommended for packaging Spark, and is the *build of reference*.
But SBT is supported for day-to-day development since it can provide much faster iterative
compilation. More advanced developers may wish to use SBT.

The SBT build is derived from the Maven POM files, and so the same Maven profiles and variables
can be set to control the SBT build. For example:

    ./build/sbt package
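
Because the Maven profiles carry over, you can enable them on the SBT command line as well. A sketch, assuming the same Hive and YARN profiles used in the Maven examples above:

    ./build/sbt -Pyarn -Phive -Phive-thriftserver package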

To avoid the overhead of launching sbt each time you need to re-compile, you can launch sbt
in interactive mode by running `build/sbt`, and then run all build commands at the command
prompt.
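
For instance, an interactive session might look like the following sketch; the commands after the `>` prompt are ordinary sbt commands:

    $ ./build/sbt
    > compile
    > package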

### Setting up SBT's Memory Usage

Configure the JVM options for SBT in `.jvmopts` at the project root, for example:

    -Xmx2g
    -XX:ReservedCodeCacheSize=1g

For the meanings of these two options, please carefully read the [Setting up Maven's Memory Usage section](https://spark.apache.org/docs/latest/building-spark.html#setting-up-mavens-memory-usage).

## Speeding up Compilation

Developers who compile Spark frequently may want to speed up compilation; e.g., by using Zinc
(for developers who build with Maven) or by avoiding re-compilation of the assembly JAR (for
developers who build with SBT). For more information about how to do this, refer to the
[Useful Developer Tools page](https://spark.apache.org/developer-tools.html#reducing-build-times).

## Encrypted Filesystems

When building on an encrypted filesystem (if your home directory is encrypted, for example), the Spark build might fail with a "Filename too long" error. As a workaround, add the following to the configuration args of the `scala-maven-plugin` in the project `pom.xml`:

    <arg>-Xmax-classfile-name</arg>
    <arg>128</arg>

and in `project/SparkBuild.scala` add:

    scalacOptions in Compile ++= Seq("-Xmax-classfile-name", "128"),

to the `sharedSettings` val. See also [this PR](https://github.com/apache/spark/pull/2883/files) if you are unsure of where to add these lines.

## IntelliJ IDEA or Eclipse

For help in setting up IntelliJ IDEA or Eclipse for Spark development, and troubleshooting, refer to the
[Useful Developer Tools page](https://spark.apache.org/developer-tools.html).


# Running Tests

Tests are run by default via the [ScalaTest Maven plugin](http://www.scalatest.org/user_guide/using_the_scalatest_maven_plugin).
Note that tests should not be run as root or an admin user.

The following is an example of a command to run the tests:

    ./build/mvn test

## Testing with SBT

The following is an example of a command to run the tests:

    ./build/sbt test

## Running Individual Tests

For information about how to run individual tests, refer to the
[Useful Developer Tools page](https://spark.apache.org/developer-tools.html#running-individual-tests).

## PySpark pip installable

If you are building Spark for use in a Python environment and you wish to pip install it, you will first need to build the Spark JARs as described above. Then you can construct an sdist package that is pip installable:

    cd python; python setup.py sdist

**Note:** Due to packaging requirements you cannot pip install directly from the Python directory; you must first build the sdist package as described above.

Alternatively, you can also run `make-distribution.sh` with the `--pip` option.
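
For example, reusing the options from the distribution example above (the `--name` value is arbitrary):

    ./dev/make-distribution.sh --name custom-spark --pip --tgz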

## PySpark Tests with Maven or SBT

If you are building PySpark and wish to run the PySpark tests, you will need to build Spark with Hive support.

    ./build/mvn -DskipTests clean package -Phive
    ./python/run-tests

If you are building PySpark with SBT and wish to run the PySpark tests, you will need to build Spark with Hive support and also build the test components:

    ./build/sbt -Phive clean package
    ./build/sbt test:compile
    ./python/run-tests

The run-tests script can also be limited to a specific Python version or a specific module:

    ./python/run-tests --python-executables=python --modules=pyspark-sql

## Running R Tests

To run the SparkR tests you will need to install the [knitr](https://cran.r-project.org/package=knitr), [rmarkdown](https://cran.r-project.org/package=rmarkdown), [testthat](https://cran.r-project.org/package=testthat), [e1071](https://cran.r-project.org/package=e1071) and [survival](https://cran.r-project.org/package=survival) packages first:

    Rscript -e "install.packages(c('knitr', 'rmarkdown', 'devtools', 'testthat', 'e1071', 'survival'), repos='https://cloud.r-project.org/')"

You can run just the SparkR tests using the command:

    ./R/run-tests.sh

## Running Docker-based Integration Test Suites

In order to run Docker integration tests, you have to install the `docker` engine on your box.
The instructions for installation can be found at [the Docker site](https://docs.docker.com/engine/installation/).
Once installed, the `docker` service needs to be started, if not already running.
On Linux, this can be done by `sudo service docker start`.

    ./build/mvn install -DskipTests
    ./build/mvn test -Pdocker-integration-tests -pl :spark-docker-integration-tests_{{site.SCALA_BINARY_VERSION}}

or

    ./build/sbt docker-integration-tests/test

## Change Scala Version

When other versions of Scala like 2.13 are supported, it will be possible to build for that version.
Change the major Scala version using (e.g. 2.13):

    ./dev/change-scala-version.sh 2.13

For Maven, please enable the profile (e.g. 2.13):

    ./build/mvn -Pscala-2.13 compile

For SBT, specify a complete scala version using (e.g. 2.13.0):

    ./build/sbt -Dscala.version=2.13.0

Otherwise, the sbt-pom-reader plugin will use the `scala.version` specified in the spark-parent pom.

## Running Jenkins tests with GitHub Enterprise

To run tests with Jenkins:

    ./dev/run-tests-jenkins

If you use an individual repository or a repository on GitHub Enterprise, export the environment variables below before running the above command.

### Related environment variables

<table class="table">
<tr><th>Variable Name</th><th>Default</th><th>Meaning</th></tr>
<tr>
  <td><code>SPARK_PROJECT_URL</code></td>
  <td>https://github.com/apache/spark</td>
  <td>
    The Spark project URL of GitHub Enterprise.
  </td>
</tr>
<tr>
  <td><code>GITHUB_API_BASE</code></td>
  <td>https://api.github.com/repos/apache/spark</td>
  <td>
    The Spark project API server URL of GitHub Enterprise.
  </td>
</tr>
</table>
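
For example, when targeting a GitHub Enterprise installation, you might export values along these lines before invoking the script; the host name below is only a placeholder, and the API path assumes the usual GitHub Enterprise layout:

    export SPARK_PROJECT_URL=https://mygithub.example.com/apache/spark
    export GITHUB_API_BASE=https://mygithub.example.com/api/v3/repos/apache/spark
    ./dev/run-tests-jenkins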