---
layout: global
title: Getting Started
displayTitle: Getting Started
license: |
  Licensed to the Apache Software Foundation (ASF) under one or more
  contributor license agreements. See the NOTICE file distributed with
  this work for additional information regarding copyright ownership.
  The ASF licenses this file to You under the Apache License, Version 2.0
  (the "License"); you may not use this file except in compliance with
  the License. You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.
---

* Table of contents
{:toc}

## Starting Point: SparkSession

<div class="codetabs">
<div data-lang="scala" markdown="1">

The entry point into all functionality in Spark is the [`SparkSession`](api/scala/org/apache/spark/sql/SparkSession.html) class. To create a basic `SparkSession`, just use `SparkSession.builder()`:

{% include_example init_session scala/org/apache/spark/examples/sql/SparkSQLExample.scala %}
</div>

<div data-lang="java" markdown="1">

The entry point into all functionality in Spark is the [`SparkSession`](api/java/index.html#org.apache.spark.sql.SparkSession) class. To create a basic `SparkSession`, just use `SparkSession.builder()`:

{% include_example init_session java/org/apache/spark/examples/sql/JavaSparkSQLExample.java %}
</div>

<div data-lang="python" markdown="1">

The entry point into all functionality in Spark is the [`SparkSession`](api/python/pyspark.sql.html#pyspark.sql.SparkSession) class. To create a basic `SparkSession`, just use `SparkSession.builder`:

{% include_example init_session python/sql/basic.py %}
</div>

<div data-lang="r" markdown="1">

The entry point into all functionality in Spark is the [`SparkSession`](api/R/sparkR.session.html) class. To initialize a basic `SparkSession`, just call `sparkR.session()`:

{% include_example init_session r/RSparkSQLExample.R %}

Note that when invoked for the first time, `sparkR.session()` initializes a global `SparkSession` singleton instance and always returns a reference to this instance on successive invocations. In this way, users only need to initialize the `SparkSession` once; SparkR functions like `read.df` can then access this global instance implicitly, so users don't need to pass the `SparkSession` instance around.
</div>
</div>

`SparkSession` in Spark 2.0 provides built-in support for Hive features, including the ability to
write queries using HiveQL, access to Hive UDFs, and the ability to read data from Hive tables.
To use these features, you do not need to have an existing Hive setup.

## Creating DataFrames

<div class="codetabs">
<div data-lang="scala" markdown="1">
With a `SparkSession`, applications can create DataFrames from an [existing `RDD`](#interoperating-with-rdds),
from a Hive table, or from [Spark data sources](sql-data-sources.html).

As an example, the following creates a DataFrame based on the content of a JSON file:

{% include_example create_df scala/org/apache/spark/examples/sql/SparkSQLExample.scala %}
</div>

<div data-lang="java" markdown="1">
With a `SparkSession`, applications can create DataFrames from an [existing `RDD`](#interoperating-with-rdds),
from a Hive table, or from [Spark data sources](sql-data-sources.html).

As an example, the following creates a DataFrame based on the content of a JSON file:

{% include_example create_df java/org/apache/spark/examples/sql/JavaSparkSQLExample.java %}
</div>

<div data-lang="python" markdown="1">
With a `SparkSession`, applications can create DataFrames from an [existing `RDD`](#interoperating-with-rdds),
from a Hive table, or from [Spark data sources](sql-data-sources.html).

As an example, the following creates a DataFrame based on the content of a JSON file:

{% include_example create_df python/sql/basic.py %}
</div>

<div data-lang="r" markdown="1">
With a `SparkSession`, applications can create DataFrames from a local R data.frame,
from a Hive table, or from [Spark data sources](sql-data-sources.html).

As an example, the following creates a DataFrame based on the content of a JSON file:

{% include_example create_df r/RSparkSQLExample.R %}

</div>
</div>


## Untyped Dataset Operations (aka DataFrame Operations)

DataFrames provide a domain-specific language for structured data manipulation in [Scala](api/scala/org/apache/spark/sql/Dataset.html), [Java](api/java/index.html?org/apache/spark/sql/Dataset.html), [Python](api/python/pyspark.sql.html#pyspark.sql.DataFrame) and [R](api/R/SparkDataFrame.html).

As mentioned above, in Spark 2.0, DataFrames are just Datasets of `Row`s in the Scala and Java APIs. These operations are also referred to as "untyped transformations", in contrast to the "typed transformations" that come with strongly typed Scala/Java Datasets.

Here we include some basic examples of structured data processing using Datasets:

<div class="codetabs">
<div data-lang="scala" markdown="1">
{% include_example untyped_ops scala/org/apache/spark/examples/sql/SparkSQLExample.scala %}

For a complete list of the types of operations that can be performed on a Dataset, refer to the [API Documentation](api/scala/org/apache/spark/sql/Dataset.html).

In addition to simple column references and expressions, Datasets also have a rich library of functions including string manipulation, date arithmetic, common math operations and more. The complete list is available in the [DataFrame Function Reference](api/scala/org/apache/spark/sql/functions$.html).
</div>

<div data-lang="java" markdown="1">

{% include_example untyped_ops java/org/apache/spark/examples/sql/JavaSparkSQLExample.java %}

For a complete list of the types of operations that can be performed on a Dataset, refer to the [API Documentation](api/java/org/apache/spark/sql/Dataset.html).

In addition to simple column references and expressions, Datasets also have a rich library of functions including string manipulation, date arithmetic, common math operations and more. The complete list is available in the [DataFrame Function Reference](api/java/org/apache/spark/sql/functions.html).
</div>

<div data-lang="python" markdown="1">
In Python, it's possible to access a DataFrame's columns either by attribute
(`df.age`) or by indexing (`df['age']`). While the former is convenient for
interactive data exploration, users are highly encouraged to use the
latter form, which is future-proof and won't break with column names that
are also attributes on the DataFrame class.

{% include_example untyped_ops python/sql/basic.py %}
For a complete list of the types of operations that can be performed on a DataFrame, refer to the [API Documentation](api/python/pyspark.sql.html#pyspark.sql.DataFrame).

In addition to simple column references and expressions, DataFrames also have a rich library of functions including string manipulation, date arithmetic, common math operations and more. The complete list is available in the [DataFrame Function Reference](api/python/pyspark.sql.html#module-pyspark.sql.functions).

</div>

<div data-lang="r" markdown="1">

{% include_example untyped_ops r/RSparkSQLExample.R %}

For a complete list of the types of operations that can be performed on a DataFrame, refer to the [API Documentation](api/R/index.html).

In addition to simple column references and expressions, DataFrames also have a rich library of functions including string manipulation, date arithmetic, common math operations and more. The complete list is available in the [DataFrame Function Reference](api/R/SparkDataFrame.html).

</div>

</div>

## Running SQL Queries Programmatically

<div class="codetabs">
<div data-lang="scala" markdown="1">
The `sql` function on a `SparkSession` enables applications to run SQL queries programmatically and returns the result as a `DataFrame`.

{% include_example run_sql scala/org/apache/spark/examples/sql/SparkSQLExample.scala %}
</div>

<div data-lang="java" markdown="1">
The `sql` function on a `SparkSession` enables applications to run SQL queries programmatically and returns the result as a `Dataset<Row>`.

{% include_example run_sql java/org/apache/spark/examples/sql/JavaSparkSQLExample.java %}
</div>

<div data-lang="python" markdown="1">
The `sql` function on a `SparkSession` enables applications to run SQL queries programmatically and returns the result as a `DataFrame`.

{% include_example run_sql python/sql/basic.py %}
</div>

<div data-lang="r" markdown="1">
The `sql` function enables applications to run SQL queries programmatically and returns the result as a `SparkDataFrame`.

{% include_example run_sql r/RSparkSQLExample.R %}

</div>
</div>


## Global Temporary View

Temporary views in Spark SQL are session-scoped and will disappear if the session that creates
them terminates. If you want a temporary view that is shared among all sessions and kept alive
until the Spark application terminates, you can create a global temporary view. Global temporary
views are tied to a system-preserved database `global_temp`, and we must use the qualified name to
refer to them, e.g. `SELECT * FROM global_temp.view1`.

<div class="codetabs">
<div data-lang="scala" markdown="1">
{% include_example global_temp_view scala/org/apache/spark/examples/sql/SparkSQLExample.scala %}
</div>

<div data-lang="java" markdown="1">
{% include_example global_temp_view java/org/apache/spark/examples/sql/JavaSparkSQLExample.java %}
</div>

<div data-lang="python" markdown="1">
{% include_example global_temp_view python/sql/basic.py %}
</div>

<div data-lang="SQL" markdown="1">

{% highlight sql %}

CREATE GLOBAL TEMPORARY VIEW temp_view AS SELECT a + 1, b * 2 FROM tbl

SELECT * FROM global_temp.temp_view

{% endhighlight %}

</div>
</div>


## Creating Datasets

Datasets are similar to RDDs; however, instead of using Java serialization or Kryo, they use
a specialized [Encoder](api/scala/org/apache/spark/sql/Encoder.html) to serialize the objects
for processing or transmitting over the network. While both encoders and standard serialization are
responsible for turning an object into bytes, encoders are code that is generated dynamically and uses a format
that allows Spark to perform many operations like filtering, sorting and hashing without deserializing
the bytes back into an object.

<div class="codetabs">
<div data-lang="scala" markdown="1">
{% include_example create_ds scala/org/apache/spark/examples/sql/SparkSQLExample.scala %}
</div>

<div data-lang="java" markdown="1">
{% include_example create_ds java/org/apache/spark/examples/sql/JavaSparkSQLExample.java %}
</div>
</div>

## Interoperating with RDDs

Spark SQL supports two different methods for converting existing RDDs into Datasets. The first
method uses reflection to infer the schema of an RDD that contains specific types of objects. This
reflection-based approach leads to more concise code and works well when you already know the schema
while writing your Spark application.

The second method for creating Datasets is through a programmatic interface that allows you to
construct a schema and then apply it to an existing RDD. While this method is more verbose, it allows
you to construct Datasets when the columns and their types are not known until runtime.

### Inferring the Schema Using Reflection
<div class="codetabs">

<div data-lang="scala" markdown="1">

The Scala interface for Spark SQL supports automatically converting an RDD containing case classes
to a DataFrame. The case class
defines the schema of the table. The names of the arguments to the case class are read using
reflection and become the names of the columns. Case classes can also be nested or contain complex
types such as `Seq`s or `Array`s. This RDD can be implicitly converted to a DataFrame and then be
registered as a table. Tables can be used in subsequent SQL statements.

{% include_example schema_inferring scala/org/apache/spark/examples/sql/SparkSQLExample.scala %}
</div>

<div data-lang="java" markdown="1">

Spark SQL supports automatically converting an RDD of
[JavaBeans](http://stackoverflow.com/questions/3295496/what-is-a-javabean-exactly) into a DataFrame.
The `BeanInfo`, obtained using reflection, defines the schema of the table. Currently, Spark SQL
does not support JavaBeans that contain `Map` field(s). Nested JavaBeans and `List` or `Array`
fields are supported though. You can create a JavaBean by creating a class that implements
Serializable and has getters and setters for all of its fields.

{% include_example schema_inferring java/org/apache/spark/examples/sql/JavaSparkSQLExample.java %}
</div>

<div data-lang="python" markdown="1">

Spark SQL can convert an RDD of `Row` objects to a DataFrame, inferring the datatypes. Rows are
constructed by passing a list of key/value pairs as kwargs to the `Row` class. The keys define the
column names of the table, and the types are inferred by sampling the whole dataset, similar to the
inference that is performed on JSON files.

{% include_example schema_inferring python/sql/basic.py %}
</div>

</div>

### Programmatically Specifying the Schema

<div class="codetabs">

<div data-lang="scala" markdown="1">

When case classes cannot be defined ahead of time (for example,
the structure of records is encoded in a string, or a text dataset will be parsed
and fields will be projected differently for different users),
a `DataFrame` can be created programmatically with three steps.

1. Create an RDD of `Row`s from the original RDD;
2. Create the schema represented by a `StructType` matching the structure of
`Row`s in the RDD created in step 1;
3. Apply the schema to the RDD of `Row`s via the `createDataFrame` method provided
by `SparkSession`.

For example:

{% include_example programmatic_schema scala/org/apache/spark/examples/sql/SparkSQLExample.scala %}
</div>

<div data-lang="java" markdown="1">

When JavaBean classes cannot be defined ahead of time (for example,
the structure of records is encoded in a string, or a text dataset will be parsed and
fields will be projected differently for different users),
a `Dataset<Row>` can be created programmatically with three steps.

1. Create an RDD of `Row`s from the original RDD;
2. Create the schema represented by a `StructType` matching the structure of
`Row`s in the RDD created in step 1;
3. Apply the schema to the RDD of `Row`s via the `createDataFrame` method provided
by `SparkSession`.

For example:

{% include_example programmatic_schema java/org/apache/spark/examples/sql/JavaSparkSQLExample.java %}
</div>

<div data-lang="python" markdown="1">

When a dictionary of kwargs cannot be defined ahead of time (for example,
the structure of records is encoded in a string, or a text dataset will be parsed and
fields will be projected differently for different users),
a `DataFrame` can be created programmatically with three steps.

1. Create an RDD of tuples or lists from the original RDD;
2. Create the schema represented by a `StructType` matching the structure of
tuples or lists in the RDD created in step 1;
3. Apply the schema to the RDD via the `createDataFrame` method provided by `SparkSession`.

For example:

{% include_example programmatic_schema python/sql/basic.py %}
</div>

</div>

## Scalar Functions

Scalar functions are functions that return a single value per row, as opposed to aggregation functions, which return a value for a group of rows. Spark SQL supports a variety of [Built-in Scalar Functions](sql-ref-functions.html#scalar-functions). It also supports [User Defined Scalar Functions](sql-ref-functions-udf-scalar.html).

## Aggregate Functions

Aggregate functions are functions that return a single value for a group of rows. The [Built-in Aggregation Functions](sql-ref-functions-builtin.html#aggregate-functions) provide common aggregations such as `count()`, `countDistinct()`, `avg()`, `max()`, `min()`, etc.
Users are not limited to the predefined aggregate functions and can create their own. For more details
about user-defined aggregate functions, please refer to the documentation of
[User Defined Aggregate Functions](sql-ref-functions-udf-aggregate.html).