---
layout: global
title: Building Spark
redirect_from: "building-with-maven.html"
license: |
  Licensed to the Apache Software Foundation (ASF) under one or more
  contributor license agreements.  See the NOTICE file distributed with
  this work for additional information regarding copyright ownership.
  The ASF licenses this file to You under the Apache License, Version 2.0
  (the "License"); you may not use this file except in compliance with
  the License.  You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.
---

* This will become a table of contents (this text will be scraped).
{:toc}

# Building Apache Spark

## Apache Maven

The Maven-based build is the build of reference for Apache Spark.
Building Spark using Maven requires Maven 3.6.3 and Java 8.
Spark requires Scala 2.12; support for Scala 2.11 was removed in Spark 3.0.0.

### Setting up Maven's Memory Usage

You'll need to configure Maven to use more memory than usual by setting `MAVEN_OPTS`:

    export MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=1g"

(The `ReservedCodeCacheSize` setting is optional but recommended.)
If you don't add these parameters to `MAVEN_OPTS`, you may see errors and warnings like the following:

    [INFO] Compiling 203 Scala sources and 9 Java sources to /Users/me/Development/spark/core/target/scala-{{site.SCALA_BINARY_VERSION}}/classes...
    [ERROR] Java heap space -> [Help 1]

You can fix these problems by setting the `MAVEN_OPTS` variable as discussed before.

**Note:**

* If using `build/mvn` with no `MAVEN_OPTS` set, the script will automatically add the above options to the `MAVEN_OPTS` environment variable.
* The `test` phase of the Spark build will automatically add these options to `MAVEN_OPTS`, even when not using `build/mvn`.

### build/mvn

Spark now comes packaged with a self-contained Maven installation, located under the `build/` directory, to ease building and deploying Spark from source. This script will automatically download and set up all necessary build requirements ([Maven](https://maven.apache.org/), [Scala](https://www.scala-lang.org/), and [Zinc](https://github.com/typesafehub/zinc)) locally within the `build/` directory itself. It honors any `mvn` binary that is already present, but will pull down its own copy of Scala and Zinc regardless, to ensure the proper version requirements are met. `build/mvn` execution acts as a pass-through to the `mvn` call, allowing an easy transition from previous build methods. As an example, one can build a version of Spark as follows:

    ./build/mvn -DskipTests clean package

Other build examples can be found below.

## Building a Runnable Distribution

To create a Spark distribution like those distributed by the
[Spark Downloads](https://spark.apache.org/downloads.html) page, and that is laid out so as
to be runnable, use `./dev/make-distribution.sh` in the project root directory. It can be configured
with Maven profile settings and so on like the direct Maven build. Example:

    ./dev/make-distribution.sh --name custom-spark --pip --r --tgz -Psparkr -Phive -Phive-thriftserver -Pmesos -Pyarn -Pkubernetes

This will build a Spark distribution along with the Python pip and R packages. For more information on usage, run `./dev/make-distribution.sh --help`.

## Specifying the Hadoop Version and Enabling YARN

You can specify the exact version of Hadoop to compile against through the `hadoop.version` property.

You can enable the `yarn` profile and optionally set the `yarn.version` property if it is different
from `hadoop.version`.

Example:

    ./build/mvn -Pyarn -Dhadoop.version=2.8.5 -DskipTests clean package
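
If your cluster's YARN version differs from the Hadoop version you compile against, you can set both properties together. A minimal sketch, assuming illustrative version numbers:

    ./build/mvn -Pyarn -Dhadoop.version=2.8.5 -Dyarn.version=2.8.5 -DskipTests clean package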

## Building With Hive and JDBC Support

To enable Hive integration for Spark SQL along with its JDBC server and CLI,
add the `-Phive` and `-Phive-thriftserver` profiles to your existing build options.
By default Spark will build with Hive 2.3.7.

    # With Hive 2.3.7 support
    ./build/mvn -Pyarn -Phive -Phive-thriftserver -DskipTests clean package

## Packaging without Hadoop Dependencies for YARN

The assembly directory produced by `mvn package` will, by default, include all of Spark's
dependencies, including Hadoop and some of its ecosystem projects. On YARN deployments, this
causes multiple versions of these to appear on executor classpaths: the version packaged in
the Spark assembly and the version on each node, included with `yarn.application.classpath`.
The `hadoop-provided` profile builds the assembly without including Hadoop-ecosystem projects,
like ZooKeeper and Hadoop itself.
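
For example, a build that leaves Hadoop out of the assembly might look like the following sketch; combine `-Phadoop-provided` with whichever other profiles you need:

    ./build/mvn -Phadoop-provided -Pyarn -DskipTests clean package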

## Building with Mesos support

    ./build/mvn -Pmesos -DskipTests clean package

## Building with Kubernetes support

    ./build/mvn -Pkubernetes -DskipTests clean package

## Building submodules individually

It's possible to build Spark submodules using the `mvn -pl` option.

For instance, you can build the Spark Streaming module using:

    ./build/mvn -pl :spark-streaming_{{site.SCALA_BINARY_VERSION}} clean install

where `spark-streaming_{{site.SCALA_BINARY_VERSION}}` is the `artifactId` as defined in the `streaming/pom.xml` file.

## Continuous Compilation

We use the scala-maven-plugin, which supports incremental and continuous compilation. For example,

    ./build/mvn scala:cc

should run continuous compilation (i.e. wait for changes). However, this has not been tested
extensively. A couple of gotchas to note:

* it only scans the paths `src/main` and `src/test` (see
[docs](https://davidb.github.io/scala-maven-plugin/example_cc.html)), so it will only work
from within certain submodules that have that structure.

* you'll typically need to run `mvn install` from the project root for compilation within
specific submodules to work; this is because submodules that depend on other submodules do so via
the `spark-parent` module.

Thus, the full flow for running continuous compilation of the `core` submodule may look more like:

    $ ./build/mvn install
    $ cd core
    $ ../build/mvn scala:cc

## Building with SBT

Maven is the official build tool recommended for packaging Spark, and is the *build of reference*.
But SBT is supported for day-to-day development since it can provide much faster iterative
compilation. More advanced developers may wish to use SBT.

The SBT build is derived from the Maven POM files, and so the same Maven profiles and variables
can be set to control the SBT build. For example:

    ./build/sbt package
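
Because the Maven profiles carry over, you can enable them on the SBT command line as well. A sketch, assuming the same Hive and YARN profiles used in the Maven examples above:

    ./build/sbt -Pyarn -Phive -Phive-thriftserver package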

To avoid the overhead of launching sbt each time you need to re-compile, you can launch sbt
in interactive mode by running `build/sbt`, and then run all build commands at the command
prompt.
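
For instance, an interactive session might look like the following sketch; the commands after the `>` prompt are ordinary sbt commands:

    $ ./build/sbt
    > compile
    > package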

### Setting up SBT's Memory Usage

Configure the JVM options for SBT in `.jvmopts` at the project root, for example:

    -Xmx2g
    -XX:ReservedCodeCacheSize=1g

For the meanings of these two options, please carefully read the [Setting up Maven's Memory Usage section](https://spark.apache.org/docs/latest/building-spark.html#setting-up-mavens-memory-usage).

## Speeding up Compilation

Developers who compile Spark frequently may want to speed up compilation; e.g., by using Zinc
(for developers who build with Maven) or by avoiding re-compilation of the assembly JAR (for
developers who build with SBT). For more information about how to do this, refer to the
[Useful Developer Tools page](https://spark.apache.org/developer-tools.html#reducing-build-times).

## Encrypted Filesystems

When building on an encrypted filesystem (if your home directory is encrypted, for example), the Spark build might fail with a "Filename too long" error. As a workaround, add the following to the configuration args of the `scala-maven-plugin` in the project `pom.xml`:

    <arg>-Xmax-classfile-name</arg>
    <arg>128</arg>

and in `project/SparkBuild.scala` add:

    scalacOptions in Compile ++= Seq("-Xmax-classfile-name", "128"),

to the `sharedSettings` val. See also [this PR](https://github.com/apache/spark/pull/2883/files) if you are unsure of where to add these lines.

## IntelliJ IDEA or Eclipse

For help in setting up IntelliJ IDEA or Eclipse for Spark development, and troubleshooting, refer to the
[Useful Developer Tools page](https://spark.apache.org/developer-tools.html).


# Running Tests

Tests are run by default via the [ScalaTest Maven plugin](http://www.scalatest.org/user_guide/using_the_scalatest_maven_plugin).
Note that tests should not be run as root or an admin user.

The following is an example of a command to run the tests:

    ./build/mvn test

## Testing with SBT

The following is an example of a command to run the tests:

    ./build/sbt test

## Running Individual Tests

For information about how to run individual tests, refer to the
[Useful Developer Tools page](https://spark.apache.org/developer-tools.html#running-individual-tests).

## PySpark pip installable

If you are building Spark for use in a Python environment and you wish to pip install it, you will first need to build the Spark JARs as described above. Then you can construct an sdist package that is pip installable:

    cd python; python setup.py sdist

**Note:** Due to packaging requirements you cannot pip install directly from the Python directory; you must first build the sdist package as described above.

Alternatively, you can also run `make-distribution.sh` with the `--pip` option.
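
For example, reusing the options from the distribution example above (the `--name` value is arbitrary):

    ./dev/make-distribution.sh --name custom-spark --pip --tgz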

## PySpark Tests with Maven or SBT

If you are building PySpark and wish to run the PySpark tests, you will need to build Spark with Hive support.

    ./build/mvn -DskipTests clean package -Phive
    ./python/run-tests

If you are building PySpark with SBT and wish to run the PySpark tests, you will need to build Spark with Hive support and also build the test components:

    ./build/sbt -Phive clean package
    ./build/sbt test:compile
    ./python/run-tests

The run-tests script can also be limited to a specific Python version or a specific module:

    ./python/run-tests --python-executables=python --modules=pyspark-sql

## Running R Tests

To run the SparkR tests you will need to install the [knitr](https://cran.r-project.org/package=knitr), [rmarkdown](https://cran.r-project.org/package=rmarkdown), [testthat](https://cran.r-project.org/package=testthat), [e1071](https://cran.r-project.org/package=e1071) and [survival](https://cran.r-project.org/package=survival) packages first:

    Rscript -e "install.packages(c('knitr', 'rmarkdown', 'devtools', 'testthat', 'e1071', 'survival'), repos='https://cloud.r-project.org/')"

You can run just the SparkR tests using the command:

    ./R/run-tests.sh

## Running Docker-based Integration Test Suites

In order to run Docker integration tests, you have to install the `docker` engine on your box.
The instructions for installation can be found at [the Docker site](https://docs.docker.com/engine/installation/).
Once installed, the `docker` service needs to be started, if not already running.
On Linux, this can be done by `sudo service docker start`.

    ./build/mvn install -DskipTests
    ./build/mvn test -Pdocker-integration-tests -pl :spark-docker-integration-tests_{{site.SCALA_BINARY_VERSION}}

or

    ./build/sbt docker-integration-tests/test

## Change Scala Version

When other versions of Scala like 2.13 are supported, it will be possible to build for that version.
Change the major Scala version using (e.g. 2.13):

    ./dev/change-scala-version.sh 2.13

For Maven, please enable the profile (e.g. 2.13):

    ./build/mvn -Pscala-2.13 compile

For SBT, specify a complete scala version using (e.g. 2.13.0):

    ./build/sbt -Dscala.version=2.13.0

Otherwise, the sbt-pom-reader plugin will use the `scala.version` specified in the spark-parent pom.

## Running Jenkins tests with GitHub Enterprise

To run tests with Jenkins:

    ./dev/run-tests-jenkins

If you use an individual repository or a repository on GitHub Enterprise, export the environment variables below before running the above command.

### Related environment variables

<table class="table">
<tr><th>Variable Name</th><th>Default</th><th>Meaning</th></tr>
<tr>
  <td><code>SPARK_PROJECT_URL</code></td>
  <td>https://github.com/apache/spark</td>
  <td>
    The Spark project URL of GitHub Enterprise.
  </td>
</tr>
<tr>
  <td><code>GITHUB_API_BASE</code></td>
  <td>https://api.github.com/repos/apache/spark</td>
  <td>
    The Spark project API server URL of GitHub Enterprise.
  </td>
</tr>
</table>
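
For example, when targeting a GitHub Enterprise installation, you might export values along these lines before invoking the script; the host name below is only a placeholder, and the API path assumes the usual GitHub Enterprise layout:

    export SPARK_PROJECT_URL=https://mygithub.example.com/apache/spark
    export GITHUB_API_BASE=https://mygithub.example.com/api/v3/repos/apache/spark
    ./dev/run-tests-jenkins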