---
layout: global
title: Data sources
displayTitle: Data sources
license: |
  Licensed to the Apache Software Foundation (ASF) under one or more
  contributor license agreements. See the NOTICE file distributed with
  this work for additional information regarding copyright ownership.
  The ASF licenses this file to You under the Apache License, Version 2.0
  (the "License"); you may not use this file except in compliance with
  the License. You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.
---

In this section, we introduce how to use data sources in ML to load data.
Besides general data sources such as Parquet, CSV, JSON, and JDBC, we also provide some data sources specifically for ML.

**Table of Contents**

* This will become a table of contents (this text will be scraped).
{:toc}

## Image data source

This image data source is used to load image files from a directory. It can load compressed images (jpeg, png, etc.) into a raw image representation via Java's `ImageIO` library.
The loaded DataFrame has one `StructType` column, "image", containing image data stored as the image schema.
The schema of the `image` column is:
- origin: `StringType` (represents the file path of the image)
- height: `IntegerType` (height of the image)
- width: `IntegerType` (width of the image)
- nChannels: `IntegerType` (number of image channels)
- mode: `IntegerType` (OpenCV-compatible type)
- data: `BinaryType` (image bytes in OpenCV-compatible order: row-wise BGR in most cases)

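Because the `data` field stores pixels row-wise in OpenCV channel order, the pixel at `(row, col)` begins at byte offset `(row * width + col) * nChannels`. The following pure-Python sketch illustrates this layout on a made-up 2x2 BGR image; the `pixel_at` helper and the sample bytes are illustrative only, not part of Spark's API.

```python
def pixel_at(data, width, n_channels, row, col):
    """Return the (B, G, R, ...) channel tuple for one pixel from the raw bytes."""
    offset = (row * width + col) * n_channels
    return tuple(data[offset:offset + n_channels])

# A toy 2x2 image with 3 channels per pixel (B, G, R), 12 bytes total.
height, width, n_channels = 2, 2, 3
data = bytes([
    255, 0, 0,     0, 255, 0,      # row 0: blue pixel, green pixel
    0, 0, 255,     255, 255, 255,  # row 1: red pixel, white pixel
])

assert pixel_at(data, width, n_channels, 0, 0) == (255, 0, 0)  # blue in BGR order
assert pixel_at(data, width, n_channels, 1, 0) == (0, 0, 255)  # red in BGR order
```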
<div class="codetabs">
<div data-lang="scala" markdown="1">
[`ImageDataSource`](api/scala/org/apache/spark/ml/source/image/ImageDataSource.html)
implements a Spark SQL data source API for loading image data as a DataFrame.

{% highlight scala %}
scala> val df = spark.read.format("image").option("dropInvalid", true).load("data/mllib/images/origin/kittens")
df: org.apache.spark.sql.DataFrame = [image: struct<origin: string, height: int ... 4 more fields>]

scala> df.select("image.origin", "image.width", "image.height").show(truncate=false)
+-----------------------------------------------------------------------+-----+------+
|origin                                                                 |width|height|
+-----------------------------------------------------------------------+-----+------+
|file:///spark/data/mllib/images/origin/kittens/54893.jpg               |300  |311   |
|file:///spark/data/mllib/images/origin/kittens/DP802813.jpg            |199  |313   |
|file:///spark/data/mllib/images/origin/kittens/29.5.a_b_EGDP022204.jpg |300  |200   |
|file:///spark/data/mllib/images/origin/kittens/DP153539.jpg            |300  |296   |
+-----------------------------------------------------------------------+-----+------+
{% endhighlight %}
</div>

<div data-lang="java" markdown="1">
[`ImageDataSource`](api/java/org/apache/spark/ml/source/image/ImageDataSource.html)
implements a Spark SQL data source API for loading image data as a DataFrame.

{% highlight java %}
Dataset<Row> imagesDF = spark.read().format("image").option("dropInvalid", true).load("data/mllib/images/origin/kittens");
imagesDF.select("image.origin", "image.width", "image.height").show(false);
/*
Will output:
+-----------------------------------------------------------------------+-----+------+
|origin                                                                 |width|height|
+-----------------------------------------------------------------------+-----+------+
|file:///spark/data/mllib/images/origin/kittens/54893.jpg               |300  |311   |
|file:///spark/data/mllib/images/origin/kittens/DP802813.jpg            |199  |313   |
|file:///spark/data/mllib/images/origin/kittens/29.5.a_b_EGDP022204.jpg |300  |200   |
|file:///spark/data/mllib/images/origin/kittens/DP153539.jpg            |300  |296   |
+-----------------------------------------------------------------------+-----+------+
*/
{% endhighlight %}
</div>

<div data-lang="python" markdown="1">
In PySpark, we provide a Spark SQL data source API for loading image data as a DataFrame.

{% highlight python %}
>>> df = spark.read.format("image").option("dropInvalid", True).load("data/mllib/images/origin/kittens")
>>> df.select("image.origin", "image.width", "image.height").show(truncate=False)
+-----------------------------------------------------------------------+-----+------+
|origin                                                                 |width|height|
+-----------------------------------------------------------------------+-----+------+
|file:///spark/data/mllib/images/origin/kittens/54893.jpg               |300  |311   |
|file:///spark/data/mllib/images/origin/kittens/DP802813.jpg            |199  |313   |
|file:///spark/data/mllib/images/origin/kittens/29.5.a_b_EGDP022204.jpg |300  |200   |
|file:///spark/data/mllib/images/origin/kittens/DP153539.jpg            |300  |296   |
+-----------------------------------------------------------------------+-----+------+
{% endhighlight %}
</div>

<div data-lang="r" markdown="1">
In SparkR, we provide a Spark SQL data source API for loading image data as a DataFrame.

{% highlight r %}
> df = read.df("data/mllib/images/origin/kittens", "image")
> head(select(df, df$image.origin, df$image.width, df$image.height))

1 file:///spark/data/mllib/images/origin/kittens/54893.jpg
2 file:///spark/data/mllib/images/origin/kittens/DP802813.jpg
3 file:///spark/data/mllib/images/origin/kittens/29.5.a_b_EGDP022204.jpg
4 file:///spark/data/mllib/images/origin/kittens/DP153539.jpg
  width height
1   300    311
2   199    313
3   300    200
4   300    296

{% endhighlight %}
</div>


</div>


## LIBSVM data source

This `LIBSVM` data source is used to load files in `libsvm` format from a directory.
The loaded DataFrame has two columns: `label`, containing labels stored as doubles, and `features`, containing feature vectors stored as vectors.
The schemas of the columns are:
- label: `DoubleType` (represents the instance label)
- features: `VectorUDT` (represents the feature vector)

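Each line of a `libsvm` file has the form `label index1:value1 index2:value2 ...`, with one-based, ascending feature indices. As a minimal pure-Python sketch of the format (the `parse_libsvm_line` helper and the sample line are illustrative only, not part of Spark's API):

```python
def parse_libsvm_line(line):
    """Parse one libsvm-format line into (label, {index: value})."""
    label, *pairs = line.split()
    features = {}
    for pair in pairs:
        index, value = pair.split(":")
        features[int(index)] = float(value)
    return float(label), features

# A made-up line in the same shape as the sample data used below.
label, features = parse_libsvm_line("1.0 128:51.0 129:159.0 130:253.0")
assert label == 1.0
assert features[129] == 159.0
```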
<div class="codetabs">
<div data-lang="scala" markdown="1">
[`LibSVMDataSource`](api/scala/org/apache/spark/ml/source/libsvm/LibSVMDataSource.html)
implements a Spark SQL data source API for loading `LIBSVM` data as a DataFrame.

{% highlight scala %}
scala> val df = spark.read.format("libsvm").option("numFeatures", "780").load("data/mllib/sample_libsvm_data.txt")
df: org.apache.spark.sql.DataFrame = [label: double, features: vector]

scala> df.show(10)
+-----+--------------------+
|label|            features|
+-----+--------------------+
|  0.0|(780,[127,128,129...|
|  1.0|(780,[158,159,160...|
|  1.0|(780,[124,125,126...|
|  1.0|(780,[152,153,154...|
|  1.0|(780,[151,152,153...|
|  0.0|(780,[129,130,131...|
|  1.0|(780,[158,159,160...|
|  1.0|(780,[99,100,101,...|
|  0.0|(780,[154,155,156...|
|  0.0|(780,[127,128,129...|
+-----+--------------------+
only showing top 10 rows
{% endhighlight %}
</div>

<div data-lang="java" markdown="1">
[`LibSVMDataSource`](api/java/org/apache/spark/ml/source/libsvm/LibSVMDataSource.html)
implements a Spark SQL data source API for loading `LIBSVM` data as a DataFrame.

{% highlight java %}
Dataset<Row> df = spark.read().format("libsvm").option("numFeatures", "780").load("data/mllib/sample_libsvm_data.txt");
df.show(10);
/*
Will output:
+-----+--------------------+
|label|            features|
+-----+--------------------+
|  0.0|(780,[127,128,129...|
|  1.0|(780,[158,159,160...|
|  1.0|(780,[124,125,126...|
|  1.0|(780,[152,153,154...|
|  1.0|(780,[151,152,153...|
|  0.0|(780,[129,130,131...|
|  1.0|(780,[158,159,160...|
|  1.0|(780,[99,100,101,...|
|  0.0|(780,[154,155,156...|
|  0.0|(780,[127,128,129...|
+-----+--------------------+
only showing top 10 rows
*/
{% endhighlight %}
</div>

<div data-lang="python" markdown="1">
In PySpark, we provide a Spark SQL data source API for loading `LIBSVM` data as a DataFrame.

{% highlight python %}
>>> df = spark.read.format("libsvm").option("numFeatures", "780").load("data/mllib/sample_libsvm_data.txt")
>>> df.show(10)
+-----+--------------------+
|label|            features|
+-----+--------------------+
|  0.0|(780,[127,128,129...|
|  1.0|(780,[158,159,160...|
|  1.0|(780,[124,125,126...|
|  1.0|(780,[152,153,154...|
|  1.0|(780,[151,152,153...|
|  0.0|(780,[129,130,131...|
|  1.0|(780,[158,159,160...|
|  1.0|(780,[99,100,101,...|
|  0.0|(780,[154,155,156...|
|  0.0|(780,[127,128,129...|
+-----+--------------------+
only showing top 10 rows
{% endhighlight %}
</div>

<div data-lang="r" markdown="1">
In SparkR, we provide a Spark SQL data source API for loading `LIBSVM` data as a DataFrame.

{% highlight r %}
> df = read.df("data/mllib/sample_libsvm_data.txt", "libsvm")
> head(select(df, df$label, df$features), 10)

   label                      features
1      0 <environment: 0x7fe6d35366e8>
2      1 <environment: 0x7fe6d353bf78>
3      1 <environment: 0x7fe6d3541840>
4      1 <environment: 0x7fe6d3545108>
5      1 <environment: 0x7fe6d354c8e0>
6      0 <environment: 0x7fe6d35501a8>
7      1 <environment: 0x7fe6d3555a70>
8      1 <environment: 0x7fe6d3559338>
9      0 <environment: 0x7fe6d355cc00>
10     0 <environment: 0x7fe6d35643d8>

{% endhighlight %}
</div>


</div>
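The `features` column in the outputs above prints in Spark's sparse-vector notation, `(size, [indices...], [values...])`. As a hedged pure-Python sketch of how such a representation expands to a dense vector (the `to_dense` helper is illustrative only; in PySpark, `SparseVector.toArray` plays this role):

```python
def to_dense(size, indices, values):
    """Expand a (size, indices, values) sparse vector into a dense list."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense

# A toy sparse vector of size 5 with two non-zero entries.
dense = to_dense(5, [1, 3], [2.0, 4.5])
assert dense == [0.0, 2.0, 0.0, 4.5, 0.0]
```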