---
layout: global
title: Hive Tables
displayTitle: Hive Tables
license: |
  Licensed to the Apache Software Foundation (ASF) under one or more
  contributor license agreements.  See the NOTICE file distributed with
  this work for additional information regarding copyright ownership.
  The ASF licenses this file to You under the Apache License, Version 2.0
  (the "License"); you may not use this file except in compliance with
  the License.  You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.
---

* Table of contents
{:toc}

Spark SQL also supports reading and writing data stored in [Apache Hive](http://hive.apache.org/).
However, since Hive has a large number of dependencies, these dependencies are not included in the
default Spark distribution. If Hive dependencies can be found on the classpath, Spark will load them
automatically. Note that these Hive dependencies must also be present on all of the worker nodes, as
they will need access to the Hive serialization and deserialization libraries (SerDes) in order to
access data stored in Hive.

Configuration of Hive is done by placing your `hive-site.xml`, `core-site.xml` (for security configuration),
and `hdfs-site.xml` (for HDFS configuration) files in `conf/`.

When working with Hive, one must instantiate `SparkSession` with Hive support, including
connectivity to a persistent Hive metastore, support for Hive SerDes, and Hive user-defined functions.
Users who do not have an existing Hive deployment can still enable Hive support. When not configured
by `hive-site.xml`, the context automatically creates `metastore_db` in the current directory and
creates a directory configured by `spark.sql.warehouse.dir`, which defaults to the directory
`spark-warehouse` in the current directory in which the Spark application is started. Note that
the `hive.metastore.warehouse.dir` property in `hive-site.xml` has been deprecated since Spark 2.0.0.
Instead, use `spark.sql.warehouse.dir` to specify the default location of databases in the warehouse.
You may need to grant write privileges to the user who starts the Spark application.
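
For illustration, the warehouse location and Hive support can be set when the session is built.
The following is a minimal Scala sketch; the application name and warehouse path are placeholders,
and the complete, runnable examples for each language follow in the tabs below.

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch: build a Hive-enabled session and point the warehouse at an
// explicit location. The path below is a placeholder, not a recommendation.
val spark = SparkSession
  .builder()
  .appName("HiveWarehouseSketch")
  .config("spark.sql.warehouse.dir", "/user/hive/spark-warehouse")
  .enableHiveSupport()
  .getOrCreate()

// Tables created without an explicit LOCATION are stored under the warehouse directory.
spark.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING) USING hive")
```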

<div class="codetabs">

<div data-lang="scala"  markdown="1">
{% include_example spark_hive scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala %}
</div>

<div data-lang="java"  markdown="1">
{% include_example spark_hive java/org/apache/spark/examples/sql/hive/JavaSparkHiveExample.java %}
</div>

<div data-lang="python"  markdown="1">
{% include_example spark_hive python/sql/hive.py %}
</div>

<div data-lang="r"  markdown="1">

When working with Hive, one must instantiate `SparkSession` with Hive support. This
adds support for finding tables in the MetaStore and writing queries using HiveQL.

{% include_example spark_hive r/RSparkSQLExample.R %}

</div>
</div>

### Specifying storage format for Hive tables

When you create a Hive table, you need to define how this table should read/write data from/to the file system,
i.e. the "input format" and "output format". You also need to define how this table should deserialize the data
to rows, or serialize rows to data, i.e. the "serde". The following options can be used to specify the storage
format ("serde", "input format", "output format"), e.g. `CREATE TABLE src(id int) USING hive OPTIONS(fileFormat 'parquet')`.
By default, we will read the table files as plain text. Note that the Hive storage handler is not supported yet when
creating tables; you can create a table using a storage handler on the Hive side and use Spark SQL to read it.

<table class="table">
  <tr><th>Property Name</th><th>Meaning</th></tr>
  <tr>
    <td><code>fileFormat</code></td>
    <td>
      A fileFormat is a kind of package of storage format specifications, including "serde", "input format" and
      "output format". Currently we support 6 fileFormats: 'sequencefile', 'rcfile', 'orc', 'parquet', 'textfile' and 'avro'.
    </td>
  </tr>

  <tr>
    <td><code>inputFormat, outputFormat</code></td>
    <td>
      These two options specify the name of a corresponding <code>InputFormat</code> and <code>OutputFormat</code> class as a string literal,
      e.g. <code>org.apache.hadoop.hive.ql.io.orc.OrcInputFormat</code>. These two options must appear as a pair, and you cannot
      specify them if you have already specified the <code>fileFormat</code> option.
    </td>
  </tr>

  <tr>
    <td><code>serde</code></td>
    <td>
      This option specifies the name of a serde class. When the <code>fileFormat</code> option is specified, do not specify this option
      if the given <code>fileFormat</code> already includes the serde information. Currently "sequencefile", "textfile" and "rcfile"
      don't include the serde information, so you can use this option with these three fileFormats.
    </td>
  </tr>

  <tr>
    <td><code>fieldDelim, escapeDelim, collectionDelim, mapkeyDelim, lineDelim</code></td>
    <td>
      These options can only be used with the "textfile" fileFormat. They define how to read delimited files into rows.
    </td>
  </tr>
</table>

All other properties defined with `OPTIONS` will be regarded as Hive serde properties.
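
As an illustration of these options, the sketch below creates two tables with different storage
formats. It assumes an existing Hive-enabled `SparkSession` named `spark`; the table names and the
field delimiter are arbitrary examples.

```scala
// Table stored as Parquet; the serde, input format and output format are implied by fileFormat.
spark.sql("CREATE TABLE parquet_src(id INT, name STRING) USING hive OPTIONS(fileFormat 'parquet')")

// Delimited text table; the delimiter options apply only to the 'textfile' fileFormat.
spark.sql("CREATE TABLE text_src(id INT, name STRING) USING hive " +
  "OPTIONS(fileFormat 'textfile', fieldDelim ',')")
```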

### Interacting with Different Versions of Hive Metastore

One of the most important pieces of Spark SQL's Hive support is interaction with the Hive metastore,
which enables Spark SQL to access metadata of Hive tables. Starting from Spark 1.4.0, a single binary
build of Spark SQL can be used to query different versions of Hive metastores, using the configuration described below.
Note that, independent of the version of Hive that is being used to talk to the metastore, Spark SQL is
compiled against the built-in Hive and uses those classes for internal execution (serdes, UDFs, UDAFs, etc.).

The following options can be used to configure the version of Hive that is used to retrieve metadata:

<table class="table">
  <tr><th>Property Name</th><th>Default</th><th>Meaning</th><th>Since Version</th></tr>
  <tr>
    <td><code>spark.sql.hive.metastore.version</code></td>
    <td><code>2.3.7</code></td>
    <td>
      Version of the Hive metastore. Available
      options are <code>0.12.0</code> through <code>2.3.7</code> and <code>3.0.0</code> through <code>3.1.2</code>.
    </td>
    <td>1.4.0</td>
  </tr>
  <tr>
    <td><code>spark.sql.hive.metastore.jars</code></td>
    <td><code>builtin</code></td>
    <td>
      Location of the jars that should be used to instantiate the HiveMetastoreClient. This
      property can be one of three options:
      <ol>
        <li><code>builtin</code><br/>
        Use Hive 2.3.7, which is bundled with the Spark assembly when <code>-Phive</code> is
        enabled. When this option is chosen, <code>spark.sql.hive.metastore.version</code> must be
        either <code>2.3.7</code> or not defined.</li>
        <li><code>maven</code><br/>
        Use Hive jars of the specified version downloaded from Maven repositories. This configuration
        is not generally recommended for production deployments.</li>
        <li>A classpath in the standard format for the JVM. This classpath must include all of Hive
        and its dependencies, including the correct version of Hadoop. These jars only need to be
        present on the driver, but if you are running in YARN cluster mode then you must ensure
        they are packaged with your application.</li>
      </ol>
    </td>
    <td>1.4.0</td>
  </tr>
  <tr>
    <td><code>spark.sql.hive.metastore.sharedPrefixes</code></td>
    <td><code>com.mysql.jdbc,<br/>org.postgresql,<br/>com.microsoft.sqlserver,<br/>oracle.jdbc</code></td>
    <td>
      <p>
        A comma-separated list of class prefixes that should be loaded using the classloader that is
        shared between Spark SQL and a specific version of Hive. An example of classes that should
        be shared is JDBC drivers that are needed to talk to the metastore. Other classes that need
        to be shared are those that interact with classes that are already shared. For example,
        custom appenders that are used by log4j.
      </p>
    </td>
    <td>1.4.0</td>
  </tr>
  <tr>
    <td><code>spark.sql.hive.metastore.barrierPrefixes</code></td>
    <td><code>(empty)</code></td>
    <td>
      <p>
        A comma-separated list of class prefixes that should explicitly be reloaded for each version
        of Hive that Spark SQL is communicating with. For example, Hive UDFs that are declared in a
        prefix that typically would be shared (i.e. <code>org.apache.spark.*</code>).
      </p>
    </td>
    <td>1.4.0</td>
  </tr>
</table>
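
These properties are typically supplied in `spark-defaults.conf` or with `--conf` at submit time. The
sketch below shows the equivalent settings applied when the session is built; the version and the use
of `maven` here are placeholders rather than recommendations, so match them to your own metastore.

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch: talk to a Hive metastore using client jars resolved from Maven.
// The metastore version below is a placeholder; set it to your actual metastore version.
val spark = SparkSession
  .builder()
  .appName("HiveMetastoreSketch")
  .config("spark.sql.hive.metastore.version", "2.3.7")
  .config("spark.sql.hive.metastore.jars", "maven")
  .enableHiveSupport()
  .getOrCreate()
```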