---
layout: global
title: Accessing OpenStack Swift from Spark
license: |
  Licensed to the Apache Software Foundation (ASF) under one or more
  contributor license agreements.  See the NOTICE file distributed with
  this work for additional information regarding copyright ownership.
  The ASF licenses this file to You under the Apache License, Version 2.0
  (the "License"); you may not use this file except in compliance with
  the License.  You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.
---

Spark's support for Hadoop InputFormat allows it to process data in OpenStack Swift using the
same URI formats as in Hadoop. You can specify a path in Swift as input through a
URI of the form <code>swift://container.PROVIDER/path</code>. You will also need to set your
Swift security credentials, either through <code>core-site.xml</code> or via
<code>SparkContext.hadoopConfiguration</code>.
The current Swift driver requires Swift to use the Keystone authentication method, or
its Rackspace-specific predecessor.

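For instance, a minimal sketch of reading from Swift in Scala (the container name
<code>mycontainer</code>, the object path, and the provider name <code>SparkTest</code> are
hypothetical; <code>sc</code> is an existing <code>SparkContext</code>):

{% highlight scala %}
// Hypothetical example: read a text file stored in the Swift container
// "mycontainer" through the provider configured as "SparkTest".
val data = sc.textFile("swift://mycontainer.SparkTest/data.txt")
println(data.count())
{% endhighlight %}
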
# Configuring Swift for Better Data Locality

Although not mandatory, it is recommended to configure Swift's proxy server with the
<code>list_endpoints</code> middleware to get better data locality. More information is
[available here](https://github.com/openstack/swift/blob/master/swift/common/middleware/list_endpoints.py).

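A minimal sketch of the relevant <code>proxy-server.conf</code> fragment (the exact pipeline
contents depend on your deployment; the filter name and egg entry point follow the upstream
middleware linked above):

{% highlight ini %}
# Sketch: add list_endpoints to your existing proxy pipeline
# (other middleware elided; adapt to your deployment).
[pipeline:main]
pipeline = catch_errors cache authtoken keystoneauth list_endpoints proxy-server

[filter:list_endpoints]
use = egg:swift#list_endpoints
# The listing path defaults to /endpoints/, which matches the
# fs.swift.service.PROVIDER.auth.endpoint.prefix value used below.
{% endhighlight %}
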
# Dependencies

The Spark application should include the <code>hadoop-openstack</code> dependency, which can
be done by including the `hadoop-cloud` module for the specific version of Spark used.
For example, for Maven support, add the following to the <code>pom.xml</code> file:

{% highlight xml %}
<dependencyManagement>
  ...
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>hadoop-cloud_2.12</artifactId>
    <version>${spark.version}</version>
  </dependency>
  ...
</dependencyManagement>
{% endhighlight %}

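Equivalently, a sketch of the same dependency for sbt builds (the version value is a
placeholder; match it to the Spark version in use):

{% highlight scala %}
// build.sbt (sketch): the same dependency expressed for sbt.
// The version is a hypothetical placeholder; use your Spark version.
val sparkVersion = "3.0.0"
libraryDependencies += "org.apache.spark" %% "hadoop-cloud" % sparkVersion
{% endhighlight %}
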
# Configuration Parameters

Create <code>core-site.xml</code> and place it inside Spark's <code>conf</code> directory.
The main parameters to configure are the authentication parameters required by Keystone.

The following table lists the Keystone configuration parameters. <code>PROVIDER</code> can be
any (alphanumeric) name.

<table class="table">
<tr><th>Property Name</th><th>Meaning</th><th>Required</th></tr>
<tr>
  <td><code>fs.swift.service.PROVIDER.auth.url</code></td>
  <td>Keystone Authentication URL</td>
  <td>Mandatory</td>
</tr>
<tr>
  <td><code>fs.swift.service.PROVIDER.auth.endpoint.prefix</code></td>
  <td>Keystone endpoints prefix</td>
  <td>Optional</td>
</tr>
<tr>
  <td><code>fs.swift.service.PROVIDER.tenant</code></td>
  <td>Tenant</td>
  <td>Mandatory</td>
</tr>
<tr>
  <td><code>fs.swift.service.PROVIDER.username</code></td>
  <td>Username</td>
  <td>Mandatory</td>
</tr>
<tr>
  <td><code>fs.swift.service.PROVIDER.password</code></td>
  <td>Password</td>
  <td>Mandatory</td>
</tr>
<tr>
  <td><code>fs.swift.service.PROVIDER.http.port</code></td>
  <td>HTTP port</td>
  <td>Mandatory</td>
</tr>
<tr>
  <td><code>fs.swift.service.PROVIDER.region</code></td>
  <td>Keystone region</td>
  <td>Mandatory</td>
</tr>
<tr>
  <td><code>fs.swift.service.PROVIDER.public</code></td>
  <td>Indicates whether to use the public (off cloud) or private (in cloud; no transfer fees) endpoints</td>
  <td>Mandatory</td>
</tr>
</table>

For example, assume <code>PROVIDER=SparkTest</code> and Keystone contains user <code>tester</code> with password <code>testing</code>
defined for tenant <code>test</code>. Then <code>core-site.xml</code> should include:

{% highlight xml %}
<configuration>
  <property>
    <name>fs.swift.service.SparkTest.auth.url</name>
    <value>http://127.0.0.1:5000/v2.0/tokens</value>
  </property>
  <property>
    <name>fs.swift.service.SparkTest.auth.endpoint.prefix</name>
    <value>endpoints</value>
  </property>
  <property>
    <name>fs.swift.service.SparkTest.http.port</name>
    <value>8080</value>
  </property>
  <property>
    <name>fs.swift.service.SparkTest.region</name>
    <value>RegionOne</value>
  </property>
  <property>
    <name>fs.swift.service.SparkTest.public</name>
    <value>true</value>
  </property>
  <property>
    <name>fs.swift.service.SparkTest.tenant</name>
    <value>test</value>
  </property>
  <property>
    <name>fs.swift.service.SparkTest.username</name>
    <value>tester</value>
  </property>
  <property>
    <name>fs.swift.service.SparkTest.password</name>
    <value>testing</value>
  </property>
</configuration>
{% endhighlight %}

Notice that
<code>fs.swift.service.PROVIDER.tenant</code>,
<code>fs.swift.service.PROVIDER.username</code>, and
<code>fs.swift.service.PROVIDER.password</code> contain sensitive information, so keeping them in
<code>core-site.xml</code> is not always a good approach.
We suggest keeping these parameters in <code>core-site.xml</code> only for testing purposes, when running Spark
via <code>spark-shell</code>.
For job submissions, they should be provided via <code>sparkContext.hadoopConfiguration</code>, as in the sketch below.
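
For example, a minimal sketch of setting these values programmatically (property names as in
the table above; the credential values are the hypothetical ones from the example configuration):

{% highlight scala %}
// Hypothetical sketch: supply Swift credentials at job-submission time
// through the Hadoop configuration instead of core-site.xml.
// "sc" is an existing SparkContext; values match the example above.
val hadoopConf = sc.hadoopConfiguration
hadoopConf.set("fs.swift.service.SparkTest.auth.url", "http://127.0.0.1:5000/v2.0/tokens")
hadoopConf.set("fs.swift.service.SparkTest.tenant", "test")
hadoopConf.set("fs.swift.service.SparkTest.username", "tester")
hadoopConf.set("fs.swift.service.SparkTest.password", "testing")
hadoopConf.set("fs.swift.service.SparkTest.http.port", "8080")
{% endhighlight %}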