---
layout: global
title: Generic Load/Save Functions
displayTitle: Generic Load/Save Functions
license: |
  Licensed to the Apache Software Foundation (ASF) under one or more
  contributor license agreements.  See the NOTICE file distributed with
  this work for additional information regarding copyright ownership.
  The ASF licenses this file to You under the Apache License, Version 2.0
  (the "License"); you may not use this file except in compliance with
  the License.  You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.
---

* Table of contents
{:toc}


In the simplest form, the default data source (`parquet` unless otherwise configured by
`spark.sql.sources.default`) will be used for all operations.


<div class="codetabs">
<div data-lang="scala"  markdown="1">
{% include_example generic_load_save_functions scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %}
</div>

<div data-lang="java"  markdown="1">
{% include_example generic_load_save_functions java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %}
</div>

<div data-lang="python"  markdown="1">

{% include_example generic_load_save_functions python/sql/datasource.py %}
</div>

<div data-lang="r"  markdown="1">

{% include_example generic_load_save_functions r/RSparkSQLExample.R %}

</div>
</div>
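
As a minimal Scala sketch of such a generic load and save (assuming an existing `spark` session; the file paths below are only illustrative):

{% highlight scala %}
// Read with the default data source (parquet unless spark.sql.sources.default says otherwise)
val usersDF = spark.read.load("examples/src/main/resources/users.parquet")

// Write a projection back out, again using the default data source
usersDF.select("name", "favorite_color").write.save("namesAndFavColors.parquet")
{% endhighlight %}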

### Manually Specifying Options

You can also manually specify the data source that will be used along with any extra options
that you would like to pass to the data source. Data sources are specified by their fully qualified
name (i.e., `org.apache.spark.sql.parquet`), but for built-in sources you can also use their short
names (`json`, `parquet`, `jdbc`, `orc`, `libsvm`, `csv`, `text`). DataFrames loaded from any data
source type can be converted into other types using this syntax.

Please refer to the API documentation for the available options of built-in sources, for example,
`org.apache.spark.sql.DataFrameReader` and `org.apache.spark.sql.DataFrameWriter`. The
options documented there should be applicable through non-Scala Spark APIs (e.g. PySpark)
as well. For other formats, refer to the API documentation of the particular format.

To load a JSON file you can use:

<div class="codetabs">
<div data-lang="scala"  markdown="1">
{% include_example manual_load_options scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %}
</div>

<div data-lang="java"  markdown="1">
{% include_example manual_load_options java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %}
</div>

<div data-lang="python"  markdown="1">
{% include_example manual_load_options python/sql/datasource.py %}
</div>

<div data-lang="r"  markdown="1">
{% include_example manual_load_options r/RSparkSQLExample.R %}
</div>
</div>
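
A rough Scala sketch of explicitly naming the source (the path and output name are illustrative):

{% highlight scala %}
// Pick the json source explicitly instead of relying on the default
val peopleDF = spark.read.format("json").load("examples/src/main/resources/people.json")

// The result can be written out with a different built-in source
peopleDF.select("name", "age").write.format("parquet").save("namesAndAges.parquet")
{% endhighlight %}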

To load a CSV file you can use:

<div class="codetabs">
<div data-lang="scala"  markdown="1">
{% include_example manual_load_options_csv scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %}
</div>

<div data-lang="java"  markdown="1">
{% include_example manual_load_options_csv java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %}
</div>

<div data-lang="python"  markdown="1">
{% include_example manual_load_options_csv python/sql/datasource.py %}
</div>

<div data-lang="r"  markdown="1">
{% include_example manual_load_options_csv r/RSparkSQLExample.R %}

</div>
</div>
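
Sketched in Scala with a few common CSV reader options (the option values and path are only examples):

{% highlight scala %}
val peopleDFCsv = spark.read.format("csv")
  .option("sep", ";")            // field delimiter
  .option("inferSchema", "true") // infer column types from the data
  .option("header", "true")      // first line contains column names
  .load("examples/src/main/resources/people.csv")
{% endhighlight %}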

The extra options are also used during write operations.
For example, you can control bloom filters and dictionary encodings for ORC data sources.
The following ORC example will create a bloom filter and use dictionary encoding only for `favorite_color`.
For Parquet, there exists `parquet.enable.dictionary`, too.
To find more detailed information about the extra ORC/Parquet options,
visit the official Apache ORC/Parquet websites.

<div class="codetabs">

<div data-lang="scala"  markdown="1">
{% include_example manual_save_options_orc scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %}
</div>

<div data-lang="java"  markdown="1">
{% include_example manual_save_options_orc java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %}
</div>

<div data-lang="python"  markdown="1">
{% include_example manual_save_options_orc python/sql/datasource.py %}
</div>

<div data-lang="r"  markdown="1">
{% include_example manual_save_options_orc r/RSparkSQLExample.R %}
</div>

<div data-lang="SQL"  markdown="1">

{% highlight sql %}
CREATE TABLE users_with_options (
  name STRING,
  favorite_color STRING,
  favorite_numbers array<integer>
) USING ORC
OPTIONS (
  orc.bloom.filter.columns 'favorite_color',
  orc.dictionary.key.threshold '1.0',
  orc.column.encoding.direct 'name'
)
{% endhighlight %}

</div>

</div>
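
In the DataFrame API the same ORC options can be passed on the writer. A brief Scala sketch, reusing the `usersDF` from the first sketch (the output path is illustrative):

{% highlight scala %}
usersDF.write.format("orc")
  .option("orc.bloom.filter.columns", "favorite_color")
  .option("orc.dictionary.key.threshold", "1.0")
  .option("orc.column.encoding.direct", "name")
  .save("users_with_options.orc")
{% endhighlight %}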

### Run SQL on files directly

Instead of using the read API to load a file into a DataFrame and querying it, you can also query that
file directly with SQL.

<div class="codetabs">
<div data-lang="scala"  markdown="1">
{% include_example direct_sql scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %}
</div>

<div data-lang="java"  markdown="1">
{% include_example direct_sql java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %}
</div>

<div data-lang="python"  markdown="1">
{% include_example direct_sql python/sql/datasource.py %}
</div>

<div data-lang="r"  markdown="1">
{% include_example direct_sql r/RSparkSQLExample.R %}

</div>
</div>
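
For instance, a Scala sketch of querying a Parquet file in place (the path is illustrative):

{% highlight scala %}
// The file format prefix and backtick-quoted path identify the file as a table
val sqlDF = spark.sql("SELECT * FROM parquet.`examples/src/main/resources/users.parquet`")
{% endhighlight %}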

### Save Modes

Save operations can optionally take a `SaveMode`, which specifies how to handle existing data if
present. It is important to realize that these save modes do not utilize any locking and are not
atomic. Additionally, when performing an `Overwrite`, the data will be deleted before writing out the
new data.

<table class="table">
<tr><th>Scala/Java</th><th>Any Language</th><th>Meaning</th></tr>
<tr>
  <td><code>SaveMode.ErrorIfExists</code> (default)</td>
  <td><code>"error"</code> or <code>"errorifexists"</code> (default)</td>
  <td>
    When saving a DataFrame to a data source, if data already exists,
    an exception is expected to be thrown.
  </td>
</tr>
<tr>
  <td><code>SaveMode.Append</code></td>
  <td><code>"append"</code></td>
  <td>
    When saving a DataFrame to a data source, if data/table already exists,
    contents of the DataFrame are expected to be appended to existing data.
  </td>
</tr>
<tr>
  <td><code>SaveMode.Overwrite</code></td>
  <td><code>"overwrite"</code></td>
  <td>
    Overwrite mode means that when saving a DataFrame to a data source,
    if data/table already exists, existing data is expected to be overwritten by the contents of
    the DataFrame.
  </td>
</tr>
<tr>
  <td><code>SaveMode.Ignore</code></td>
  <td><code>"ignore"</code></td>
  <td>
    Ignore mode means that when saving a DataFrame to a data source, if data already exists,
    the save operation is expected not to save the contents of the DataFrame and not to
    change the existing data. This is similar to a <code>CREATE TABLE IF NOT EXISTS</code> in SQL.
  </td>
</tr>
</table>
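
As a sketch, the mode can be set either with the `SaveMode` enum (Scala/Java) or with its string form; the DataFrame and output path below are illustrative:

{% highlight scala %}
import org.apache.spark.sql.SaveMode

// Enum form
peopleDF.write.mode(SaveMode.Overwrite).parquet("people_output")

// Equivalent string form, usable from any language binding
peopleDF.write.mode("overwrite").parquet("people_output")
{% endhighlight %}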

### Saving to Persistent Tables

`DataFrames` can also be saved as persistent tables into the Hive metastore using the `saveAsTable`
command. Notice that an existing Hive deployment is not necessary to use this feature. Spark will create a
default local Hive metastore (using Derby) for you. Unlike the `createOrReplaceTempView` command,
`saveAsTable` will materialize the contents of the DataFrame and create a pointer to the data in the
Hive metastore. Persistent tables will still exist even after your Spark program has restarted, as
long as you maintain your connection to the same metastore. A DataFrame for a persistent table can
be created by calling the `table` method on a `SparkSession` with the name of the table.

For file-based data sources, e.g. text, parquet, json, etc., you can specify a custom table path via the
`path` option, e.g. `df.write.option("path", "/some/path").saveAsTable("t")`. When the table is dropped,
the custom table path will not be removed and the table data will still be there. If no custom table path is
specified, Spark will write data to a default table path under the warehouse directory. When the table is
dropped, the default table path will be removed too.
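
A brief Scala sketch of both variants (table names and path are illustrative):

{% highlight scala %}
// Managed table: data lands under the warehouse directory and is removed when the table is dropped
peopleDF.write.saveAsTable("people_managed")

// Table with a custom path: the data stays at that path even after the table is dropped
peopleDF.write.option("path", "/some/path").saveAsTable("people_external")

// Read a persistent table back as a DataFrame
val people = spark.table("people_managed")
{% endhighlight %}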

Starting from Spark 2.1, persistent datasource tables have per-partition metadata stored in the Hive metastore. This brings several benefits:

- Since the metastore can return only necessary partitions for a query, discovering all the partitions on the first query to the table is no longer needed.
- Hive DDLs such as `ALTER TABLE PARTITION ... SET LOCATION` are now available for tables created with the Datasource API.

Note that partition information is not gathered by default when creating external datasource tables (those with a `path` option). To sync the partition information in the metastore, you can invoke `MSCK REPAIR TABLE`.
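
For example (table name illustrative), the repair can be issued through `spark.sql`:

{% highlight scala %}
// Sync partition metadata for a table whose partition directories were added outside Spark
spark.sql("MSCK REPAIR TABLE people_external")
{% endhighlight %}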

### Bucketing, Sorting and Partitioning

For file-based data sources, it is also possible to bucket and sort or partition the output.
Bucketing and sorting are applicable only to persistent tables:

<div class="codetabs">

<div data-lang="scala"  markdown="1">
{% include_example write_sorting_and_bucketing scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %}
</div>

<div data-lang="java"  markdown="1">
{% include_example write_sorting_and_bucketing java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %}
</div>

<div data-lang="python"  markdown="1">
{% include_example write_sorting_and_bucketing python/sql/datasource.py %}
</div>

<div data-lang="SQL"  markdown="1">

{% highlight sql %}

CREATE TABLE users_bucketed_by_name(
  name STRING,
  favorite_color STRING,
  favorite_numbers array<integer>
) USING parquet
CLUSTERED BY(name) INTO 42 BUCKETS;

{% endhighlight %}

</div>

</div>
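
A Scala sketch of the same bucketed table through the DataFrame API (reusing the illustrative `usersDF`):

{% highlight scala %}
usersDF.write
  .bucketBy(42, "name")
  .saveAsTable("users_bucketed_by_name")
{% endhighlight %}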

Partitioning, in contrast, can be used with both `save` and `saveAsTable` when using the Dataset APIs.


<div class="codetabs">

<div data-lang="scala"  markdown="1">
{% include_example write_partitioning scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %}
</div>

<div data-lang="java"  markdown="1">
{% include_example write_partitioning java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %}
</div>

<div data-lang="python"  markdown="1">
{% include_example write_partitioning python/sql/datasource.py %}
</div>

<div data-lang="SQL"  markdown="1">

{% highlight sql %}

CREATE TABLE users_by_favorite_color(
  name STRING,
  favorite_color STRING,
  favorite_numbers array<integer>
) USING csv PARTITIONED BY(favorite_color);

{% endhighlight %}

</div>

</div>
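
Sketched in Scala with a plain `save` (output path illustrative):

{% highlight scala %}
usersDF.write
  .partitionBy("favorite_color")
  .format("parquet")
  .save("namesPartByColor.parquet")
{% endhighlight %}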

It is possible to use both partitioning and bucketing for a single table:

<div class="codetabs">

<div data-lang="scala"  markdown="1">
{% include_example write_partition_and_bucket scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %}
</div>

<div data-lang="java"  markdown="1">
{% include_example write_partition_and_bucket java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %}
</div>

<div data-lang="python"  markdown="1">
{% include_example write_partition_and_bucket python/sql/datasource.py %}
</div>

<div data-lang="SQL"  markdown="1">

{% highlight sql %}

CREATE TABLE users_bucketed_and_partitioned(
  name STRING,
  favorite_color STRING,
  favorite_numbers array<integer>
) USING parquet
PARTITIONED BY (favorite_color)
CLUSTERED BY(name) SORTED BY (favorite_numbers) INTO 42 BUCKETS;

{% endhighlight %}

</div>

</div>
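
A combined Scala sketch (names illustrative); `saveAsTable` is required here because bucketing only applies to persistent tables:

{% highlight scala %}
usersDF.write
  .partitionBy("favorite_color")
  .bucketBy(42, "name")
  .saveAsTable("users_partitioned_bucketed")
{% endhighlight %}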

`partitionBy` creates a directory structure as described in the [Partition Discovery](sql-data-sources-parquet.html#partition-discovery) section.
Thus, it has limited applicability to columns with high cardinality. In contrast, `bucketBy` distributes
data across a fixed number of buckets and can be used when the number of unique values is unbounded.