0001 ---
0002 layout: global
0003 displayTitle: Using Spark's "Hadoop Free" Build
0004 title: Using Spark's "Hadoop Free" Build
0005 license: |
0006 Licensed to the Apache Software Foundation (ASF) under one or more
0007 contributor license agreements. See the NOTICE file distributed with
0008 this work for additional information regarding copyright ownership.
0009 The ASF licenses this file to You under the Apache License, Version 2.0
0010 (the "License"); you may not use this file except in compliance with
0011 the License. You may obtain a copy of the License at
0012
0013 http://www.apache.org/licenses/LICENSE-2.0
0014
0015 Unless required by applicable law or agreed to in writing, software
0016 distributed under the License is distributed on an "AS IS" BASIS,
0017 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
0018 See the License for the specific language governing permissions and
0019 limitations under the License.
0020 ---
0021
Spark uses Hadoop client libraries for HDFS and YARN. Starting with version 1.4, Spark packages "Hadoop free" builds that let you more easily connect a single Spark binary to any Hadoop version. To use these builds, you need to modify `SPARK_DIST_CLASSPATH` to include Hadoop's package jars. The most convenient place to do this is by adding an entry in `conf/spark-env.sh`.
0023
0024 This page describes how to connect Spark to Hadoop for different types of distributions.
0025
0026 # Apache Hadoop
For Apache distributions, you can use Hadoop's `classpath` command. For instance:
0028
0029 {% highlight bash %}
0030 ### in conf/spark-env.sh ###
0031
0032 # If 'hadoop' binary is on your PATH
0033 export SPARK_DIST_CLASSPATH=$(hadoop classpath)
0034
0035 # With explicit path to 'hadoop' binary
0036 export SPARK_DIST_CLASSPATH=$(/path/to/hadoop/bin/hadoop classpath)
0037
0038 # Passing a Hadoop configuration directory
0039 export SPARK_DIST_CLASSPATH=$(hadoop --config /path/to/configs classpath)
0040
0041 {% endhighlight %}
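
If Spark still fails to find Hadoop classes after this change, a quick sanity check (a sketch, assuming you run it from the Spark installation directory and `conf/spark-env.sh` contains one of the entries above) is to source the file and inspect the expanded value:

{% highlight bash %}
# Illustrative sanity check: source the env file and confirm that
# SPARK_DIST_CLASSPATH expanded to a non-empty list of Hadoop paths.
source conf/spark-env.sh
echo "$SPARK_DIST_CLASSPATH"
{% endhighlight %}

The output should be a colon-separated list of Hadoop configuration and jar directories; an empty value usually means the `hadoop` binary was not found on the `PATH` used to evaluate the file.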
0042
0043 # Hadoop Free Build Setup for Spark on Kubernetes
To run the Hadoop free build of Spark on Kubernetes, the executor image must contain the appropriate version of the Hadoop binaries and have the correct `SPARK_DIST_CLASSPATH` value set. See the example below for the relevant changes needed in the executor Dockerfile:
0045
0046 {% highlight bash %}
0047 ### Set environment variables in the executor dockerfile ###
0048
0049 ENV SPARK_HOME="/opt/spark"
0050 ENV HADOOP_HOME="/opt/hadoop"
0051 ENV PATH="$SPARK_HOME/bin:$HADOOP_HOME/bin:$PATH"
0052 ...
0053
# Copy your target Hadoop binaries to the executor Hadoop home
0055
0056 COPY /opt/hadoop3 $HADOOP_HOME
0057 ...
0058
# Copy and use the Spark-provided entrypoint.sh. It sets SPARK_DIST_CLASSPATH
# using the hadoop binary in $HADOOP_HOME and starts the executor. If you
# choose to customize the value of SPARK_DIST_CLASSPATH here, that value is
# retained by entrypoint.sh.
0060
0061 ENTRYPOINT [ "/opt/entrypoint.sh" ]
0062 ...
0063 {% endhighlight %}
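
Building and using the image then follows the standard Spark-on-Kubernetes flow. The commands below are a hypothetical sketch: the image name and registry are placeholders, the Hadoop directory is assumed to be available to the Docker build, and the Kubernetes API server address is left as a placeholder you must fill in.

{% highlight bash %}
# Hypothetical build-and-push sketch; adjust names, tags, and paths
# for your own registry and build context.
docker build -t my-registry/spark-executor-hadoop-free:latest .
docker push my-registry/spark-executor-hadoop-free:latest

# Point spark-submit at the custom executor image
# (placeholder API server address).
./bin/spark-submit \
  --master k8s://https://<k8s-apiserver>:6443 \
  --conf spark.kubernetes.container.image=my-registry/spark-executor-hadoop-free:latest \
  ...
{% endhighlight %}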