---
layout: global
displayTitle: Using Spark's "Hadoop Free" Build
title: Using Spark's "Hadoop Free" Build
license: |
  Licensed to the Apache Software Foundation (ASF) under one or more
  contributor license agreements.  See the NOTICE file distributed with
  this work for additional information regarding copyright ownership.
  The ASF licenses this file to You under the Apache License, Version 2.0
  (the "License"); you may not use this file except in compliance with
  the License.  You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.
---

Spark uses Hadoop client libraries for HDFS and YARN. Starting in version 1.4, the project packages "Hadoop free" builds that let you more easily connect a single Spark binary to any Hadoop version. To use these builds, you need to modify `SPARK_DIST_CLASSPATH` to include Hadoop's package jars. The most convenient place to do this is by adding an entry in `conf/spark-env.sh`.

This page describes how to connect Spark to Hadoop for different types of distributions.

# Apache Hadoop
For Apache distributions, you can use Hadoop's `classpath` command. For instance:

{% highlight bash %}
### in conf/spark-env.sh ###

# If 'hadoop' binary is on your PATH
export SPARK_DIST_CLASSPATH=$(hadoop classpath)

# With explicit path to 'hadoop' binary
export SPARK_DIST_CLASSPATH=$(/path/to/hadoop/bin/hadoop classpath)

# Passing a Hadoop configuration directory
export SPARK_DIST_CLASSPATH=$(hadoop --config /path/to/configs classpath)
{% endhighlight %}
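Because `SPARK_DIST_CLASSPATH` is just a `:`-separated list of paths, you can inspect how the entry composes before wiring it into `conf/spark-env.sh`. A minimal, self-contained sketch, where the stub value stands in for whatever `hadoop classpath` actually prints on your system (the paths below are illustrative assumptions, not real install locations):

```shell
# Stub standing in for the output of 'hadoop classpath' (assumption: a real
# install prints a similar ':'-separated list, often with wildcard jar entries).
hadoop_cp='/opt/hadoop/etc/hadoop:/opt/hadoop/share/hadoop/common/lib/*:/opt/hadoop/share/hadoop/common/*'
export SPARK_DIST_CLASSPATH="$hadoop_cp"

# List each entry on its own line to verify what Spark's launcher will see.
# (Quoting matters: it keeps the shell from expanding the '*' globs here.)
echo "$SPARK_DIST_CLASSPATH" | tr ':' '\n'
```

On a machine with a real Hadoop install, you would replace the stub with `$(hadoop classpath)` as shown above.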

# Hadoop Free Build Setup for Spark on Kubernetes
To run the Hadoop free build of Spark on Kubernetes, the executor image must have the appropriate version of the Hadoop binaries and the correct `SPARK_DIST_CLASSPATH` value set. See the example below for the relevant changes needed in the executor Dockerfile:
{% highlight bash %}
### Set environment variables in the executor Dockerfile ###

ENV SPARK_HOME="/opt/spark"
ENV HADOOP_HOME="/opt/hadoop"
ENV PATH="$SPARK_HOME/bin:$HADOOP_HOME/bin:$PATH"
...

# Copy your target Hadoop binaries to the executor Hadoop home
COPY /opt/hadoop3 $HADOOP_HOME
...

# Copy and use the Spark-provided entrypoint.sh. It sets SPARK_DIST_CLASSPATH
# using the hadoop binary in $HADOOP_HOME and starts the executor. If you
# choose to customize the value of SPARK_DIST_CLASSPATH here, the value will
# be retained in entrypoint.sh.
ENTRYPOINT [ "/opt/entrypoint.sh" ]
...
{% endhighlight %}
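Once the image is built and pushed, it can be referenced at submission time through the standard Kubernetes configuration properties. A hedged sketch of the invocation, in which the API server address, registry, image tag, and jar version are placeholders rather than values from this guide:

```shell
# Hypothetical submission using the custom executor image built above.
# <k8s-apiserver-host>, <port>, <registry>, <tag>, and <version> are
# placeholders you must fill in for your own cluster and build.
./bin/spark-submit \
  --master k8s://https://<k8s-apiserver-host>:<port> \
  --deploy-mode cluster \
  --conf spark.kubernetes.container.image=<registry>/spark-hadoop:<tag> \
  --class org.apache.spark.examples.SparkPi \
  local:///opt/spark/examples/jars/spark-examples_<version>.jar
```

Because the entrypoint script derives `SPARK_DIST_CLASSPATH` from the Hadoop binaries baked into the image, no extra classpath configuration is needed on the submission side.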