---
layout: global
title: Data sources
displayTitle: Data sources
license: |
  Licensed to the Apache Software Foundation (ASF) under one or more
  contributor license agreements.  See the NOTICE file distributed with
  this work for additional information regarding copyright ownership.
  The ASF licenses this file to You under the Apache License, Version 2.0
  (the "License"); you may not use this file except in compliance with
  the License.  You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.
---

In this section, we introduce how to use data sources in ML to load data.
Besides general data sources such as Parquet, CSV, JSON and JDBC, we also provide some data sources that are specific to ML.
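
These general data sources are read through the standard `DataFrameReader` API. A minimal sketch in Scala, assuming an active `spark` session; the file paths and JDBC connection details below are placeholders, not files shipped with Spark:

{% highlight scala %}
// Sketch only: loading data with the general Spark SQL data sources.
val parquetDF = spark.read.parquet("path/to/data.parquet")
val csvDF = spark.read.option("header", "true").option("inferSchema", "true").csv("path/to/data.csv")
val jsonDF = spark.read.json("path/to/data.json")

// JDBC needs connection details; these values are placeholders.
val jdbcDF = spark.read.format("jdbc")
  .option("url", "jdbc:postgresql://host:5432/db")
  .option("dbtable", "schema.table")
  .option("user", "username")
  .option("password", "password")
  .load()
{% endhighlight %}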

**Table of Contents**

* This will become a table of contents (this text will be scraped).
{:toc}

## Image data source

This image data source is used to load image files from a directory. It can load compressed images (jpeg, png, etc.) into a raw image representation via `ImageIO` in the Java library.
The loaded DataFrame has one `StructType` column: "image", containing image data stored with the image schema.
The schema of the `image` column is:
 - origin: `StringType` (represents the file path of the image)
 - height: `IntegerType` (height of the image)
 - width: `IntegerType` (width of the image)
 - nChannels: `IntegerType` (number of image channels)
 - mode: `IntegerType` (OpenCV-compatible type)
 - data: `BinaryType` (image bytes in OpenCV-compatible order: row-wise BGR in most cases)

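The nested structure can be confirmed with `printSchema()`. A minimal sketch in Scala, assuming an active `spark` session; the commented output may vary slightly across Spark versions:

{% highlight scala %}
// Sketch only: print the nested "image" struct of a loaded image DataFrame.
val df = spark.read.format("image").load("data/mllib/images/origin/kittens")
df.printSchema()
// root
//  |-- image: struct (nullable = true)
//  |    |-- origin: string (nullable = true)
//  |    |-- height: integer (nullable = true)
//  |    |-- width: integer (nullable = true)
//  |    |-- nChannels: integer (nullable = true)
//  |    |-- mode: integer (nullable = true)
//  |    |-- data: binary (nullable = true)
{% endhighlight %}
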
<div class="codetabs">
<div data-lang="scala" markdown="1">
[`ImageDataSource`](api/scala/org/apache/spark/ml/source/image/ImageDataSource.html)
implements a Spark SQL data source API for loading image data as a DataFrame.

{% highlight scala %}
scala> val df = spark.read.format("image").option("dropInvalid", true).load("data/mllib/images/origin/kittens")
df: org.apache.spark.sql.DataFrame = [image: struct<origin: string, height: int ... 4 more fields>]

scala> df.select("image.origin", "image.width", "image.height").show(truncate=false)
+-----------------------------------------------------------------------+-----+------+
|origin                                                                 |width|height|
+-----------------------------------------------------------------------+-----+------+
|file:///spark/data/mllib/images/origin/kittens/54893.jpg               |300  |311   |
|file:///spark/data/mllib/images/origin/kittens/DP802813.jpg            |199  |313   |
|file:///spark/data/mllib/images/origin/kittens/29.5.a_b_EGDP022204.jpg |300  |200   |
|file:///spark/data/mllib/images/origin/kittens/DP153539.jpg            |300  |296   |
+-----------------------------------------------------------------------+-----+------+
{% endhighlight %}
</div>

<div data-lang="java" markdown="1">
[`ImageDataSource`](api/java/org/apache/spark/ml/source/image/ImageDataSource.html)
implements a Spark SQL data source API for loading image data as a DataFrame.

{% highlight java %}
Dataset<Row> imagesDF = spark.read().format("image").option("dropInvalid", true).load("data/mllib/images/origin/kittens");
imagesDF.select("image.origin", "image.width", "image.height").show(false);
/*
Will output:
+-----------------------------------------------------------------------+-----+------+
|origin                                                                 |width|height|
+-----------------------------------------------------------------------+-----+------+
|file:///spark/data/mllib/images/origin/kittens/54893.jpg               |300  |311   |
|file:///spark/data/mllib/images/origin/kittens/DP802813.jpg            |199  |313   |
|file:///spark/data/mllib/images/origin/kittens/29.5.a_b_EGDP022204.jpg |300  |200   |
|file:///spark/data/mllib/images/origin/kittens/DP153539.jpg            |300  |296   |
+-----------------------------------------------------------------------+-----+------+
*/
{% endhighlight %}
</div>

<div data-lang="python" markdown="1">
In PySpark, we provide a Spark SQL data source API for loading image data as a DataFrame.

{% highlight python %}
>>> df = spark.read.format("image").option("dropInvalid", True).load("data/mllib/images/origin/kittens")
>>> df.select("image.origin", "image.width", "image.height").show(truncate=False)
+-----------------------------------------------------------------------+-----+------+
|origin                                                                 |width|height|
+-----------------------------------------------------------------------+-----+------+
|file:///spark/data/mllib/images/origin/kittens/54893.jpg               |300  |311   |
|file:///spark/data/mllib/images/origin/kittens/DP802813.jpg            |199  |313   |
|file:///spark/data/mllib/images/origin/kittens/29.5.a_b_EGDP022204.jpg |300  |200   |
|file:///spark/data/mllib/images/origin/kittens/DP153539.jpg            |300  |296   |
+-----------------------------------------------------------------------+-----+------+
{% endhighlight %}
</div>

<div data-lang="r" markdown="1">
In SparkR, we provide a Spark SQL data source API for loading image data as a DataFrame.

{% highlight r %}
> df = read.df("data/mllib/images/origin/kittens", "image")
> head(select(df, df$image.origin, df$image.width, df$image.height))

1               file:///spark/data/mllib/images/origin/kittens/54893.jpg
2            file:///spark/data/mllib/images/origin/kittens/DP802813.jpg
3 file:///spark/data/mllib/images/origin/kittens/29.5.a_b_EGDP022204.jpg
4            file:///spark/data/mllib/images/origin/kittens/DP153539.jpg
  width height
1   300    311
2   199    313
3   300    200
4   300    296

{% endhighlight %}
</div>

</div>

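
Since "image" is an ordinary struct column, its nested fields can be used directly in DataFrame operations; the `dropInvalid` option used above drops files that cannot be decoded as images. A minimal sketch in Scala, assuming an active `spark` session:

{% highlight scala %}
import org.apache.spark.sql.functions.col

// Sketch only: filter on a nested image field like any other struct column.
val images = spark.read.format("image")
  .option("dropInvalid", true)          // skip files that cannot be decoded
  .load("data/mllib/images/origin/kittens")

// Count the images that are wider than 250 pixels.
val wideCount = images.filter(col("image.width") > 250).count()
{% endhighlight %}
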
## LIBSVM data source

This `LIBSVM` data source is used to load files in 'libsvm' format from a directory.
The loaded DataFrame has two columns: `label` containing labels stored as doubles and `features` containing feature vectors stored as Vectors.
The schemas of the columns are:
 - label: `DoubleType` (represents the instance label)
 - features: `VectorUDT` (represents the feature vector)

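
Each line of a 'libsvm' file holds a label followed by sparse `index:value` pairs with one-based indices. The examples below pass the `numFeatures` option to fix the vector size up front; if it is omitted, the data source determines the number of features itself, at the cost of an additional pass over the data. A minimal sketch in Scala, assuming an active `spark` session:

{% highlight scala %}
// Sketch only: "numFeatures" fixes the vector size; without it the data source
// infers the dimensionality with one extra pass over the input files.
val withHint = spark.read.format("libsvm")
  .option("numFeatures", "780")
  .load("data/mllib/sample_libsvm_data.txt")

val inferred = spark.read.format("libsvm")
  .load("data/mllib/sample_libsvm_data.txt")
{% endhighlight %}
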
<div class="codetabs">
<div data-lang="scala" markdown="1">
[`LibSVMDataSource`](api/scala/org/apache/spark/ml/source/libsvm/LibSVMDataSource.html)
implements a Spark SQL data source API for loading `LIBSVM` data as a DataFrame.

{% highlight scala %}
scala> val df = spark.read.format("libsvm").option("numFeatures", "780").load("data/mllib/sample_libsvm_data.txt")
df: org.apache.spark.sql.DataFrame = [label: double, features: vector]

scala> df.show(10)
+-----+--------------------+
|label|            features|
+-----+--------------------+
|  0.0|(780,[127,128,129...|
|  1.0|(780,[158,159,160...|
|  1.0|(780,[124,125,126...|
|  1.0|(780,[152,153,154...|
|  1.0|(780,[151,152,153...|
|  0.0|(780,[129,130,131...|
|  1.0|(780,[158,159,160...|
|  1.0|(780,[99,100,101,...|
|  0.0|(780,[154,155,156...|
|  0.0|(780,[127,128,129...|
+-----+--------------------+
only showing top 10 rows
{% endhighlight %}
</div>

<div data-lang="java" markdown="1">
[`LibSVMDataSource`](api/java/org/apache/spark/ml/source/libsvm/LibSVMDataSource.html)
implements a Spark SQL data source API for loading `LIBSVM` data as a DataFrame.

{% highlight java %}
Dataset<Row> df = spark.read().format("libsvm").option("numFeatures", "780").load("data/mllib/sample_libsvm_data.txt");
df.show(10);
/*
Will output:
+-----+--------------------+
|label|            features|
+-----+--------------------+
|  0.0|(780,[127,128,129...|
|  1.0|(780,[158,159,160...|
|  1.0|(780,[124,125,126...|
|  1.0|(780,[152,153,154...|
|  1.0|(780,[151,152,153...|
|  0.0|(780,[129,130,131...|
|  1.0|(780,[158,159,160...|
|  1.0|(780,[99,100,101,...|
|  0.0|(780,[154,155,156...|
|  0.0|(780,[127,128,129...|
+-----+--------------------+
only showing top 10 rows
*/
{% endhighlight %}
</div>

<div data-lang="python" markdown="1">
In PySpark, we provide a Spark SQL data source API for loading `LIBSVM` data as a DataFrame.

{% highlight python %}
>>> df = spark.read.format("libsvm").option("numFeatures", "780").load("data/mllib/sample_libsvm_data.txt")
>>> df.show(10)
+-----+--------------------+
|label|            features|
+-----+--------------------+
|  0.0|(780,[127,128,129...|
|  1.0|(780,[158,159,160...|
|  1.0|(780,[124,125,126...|
|  1.0|(780,[152,153,154...|
|  1.0|(780,[151,152,153...|
|  0.0|(780,[129,130,131...|
|  1.0|(780,[158,159,160...|
|  1.0|(780,[99,100,101,...|
|  0.0|(780,[154,155,156...|
|  0.0|(780,[127,128,129...|
+-----+--------------------+
only showing top 10 rows
{% endhighlight %}
</div>

<div data-lang="r" markdown="1">
In SparkR, we provide a Spark SQL data source API for loading `LIBSVM` data as a DataFrame.

{% highlight r %}
> df = read.df("data/mllib/sample_libsvm_data.txt", "libsvm")
> head(select(df, df$label, df$features), 10)

   label                      features
1      0 <environment: 0x7fe6d35366e8>
2      1 <environment: 0x7fe6d353bf78>
3      1 <environment: 0x7fe6d3541840>
4      1 <environment: 0x7fe6d3545108>
5      1 <environment: 0x7fe6d354c8e0>
6      0 <environment: 0x7fe6d35501a8>
7      1 <environment: 0x7fe6d3555a70>
8      1 <environment: 0x7fe6d3559338>
9      0 <environment: 0x7fe6d355cc00>
10     0 <environment: 0x7fe6d35643d8>

{% endhighlight %}
</div>

</div>
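
Once loaded, the `label` and `features` columns are in the layout that the `spark.ml` estimators expect, so the DataFrame can be passed to them directly. A minimal sketch in Scala, assuming an active `spark` session and the sample file shipped with Spark:

{% highlight scala %}
import org.apache.spark.ml.classification.LogisticRegression

// Sketch only: feed the label/features columns produced by the libsvm source
// straight into an ML estimator.
val training = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

val lr = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(0.01)

val model = lr.fit(training)
{% endhighlight %}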