---
layout: global
displayTitle: Spark SQL, DataFrames and Datasets Guide
title: Spark SQL and DataFrames
license: |
  Licensed to the Apache Software Foundation (ASF) under one or more
  contributor license agreements.  See the NOTICE file distributed with
  this work for additional information regarding copyright ownership.
  The ASF licenses this file to You under the Apache License, Version 2.0
  (the "License"); you may not use this file except in compliance with
  the License.  You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.
---

Spark SQL is a Spark module for structured data processing. Unlike the basic Spark RDD API, the interfaces provided
by Spark SQL give Spark more information about the structure of both the data and the computation being performed. Internally,
Spark SQL uses this extra information to perform additional optimizations. There are several ways to
interact with Spark SQL, including SQL and the Dataset API. When computing a result,
the same execution engine is used, independent of which API/language you are using to express the
computation. This unification means that developers can easily switch back and forth between
different APIs based on which provides the most natural way to express a given transformation.
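
As a minimal sketch of this unification (assuming a local `SparkSession` and a small, hypothetical `people` dataset that is not part of the guide's sample data), the same filter can be written once in SQL and once with the Dataset API; both forms are planned and executed by the same engine:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical local session for illustration only
val spark = SparkSession.builder()
  .appName("unified-engine-sketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Hypothetical sample data, registered as a temporary view for SQL access
val people = Seq(("Alice", 34), ("Bob", 19)).toDF("name", "age")
people.createOrReplaceTempView("people")

// The same computation expressed two ways; both run on the same engine
val viaSql = spark.sql("SELECT name FROM people WHERE age > 21")
val viaApi = people.filter($"age" > 21).select("name")
```

Calling `.explain()` on either result in a shell should show essentially the same physical plan.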

All of the examples on this page use sample data included in the Spark distribution and can be run in
the `spark-shell`, `pyspark` shell, or `sparkR` shell.

## SQL

One use of Spark SQL is to execute SQL queries.
Spark SQL can also be used to read data from an existing Hive installation. For more on how to
configure this feature, please refer to the [Hive Tables](sql-data-sources-hive-tables.html) section. When running
SQL from within another programming language, the results will be returned as a [Dataset/DataFrame](#datasets-and-dataframes).
You can also interact with the SQL interface using the [command-line](sql-distributed-sql-engine.html#running-the-spark-sql-cli)
or over [JDBC/ODBC](sql-distributed-sql-engine.html#running-the-thrift-jdbcodbc-server).
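
A short sketch of that round trip, reusing the hypothetical `people` view registered above: SQL issued from Scala with `spark.sql` comes back as a DataFrame, which can then be inspected or transformed like any other:

```scala
// spark.sql returns a DataFrame (a Dataset[Row]) rather than printing results
val teenagers = spark.sql("SELECT name, age FROM people WHERE age BETWEEN 13 AND 19")

teenagers.printSchema()  // inspect the schema of the result
teenagers.show()         // materialize and display the rows
```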

## Datasets and DataFrames

A Dataset is a distributed collection of data.
Dataset is a new interface added in Spark 1.6 that provides the benefits of RDDs (strong
typing, ability to use powerful lambda functions) with the benefits of Spark SQL's optimized
execution engine. A Dataset can be [constructed](sql-getting-started.html#creating-datasets) from JVM objects and then
manipulated using functional transformations (`map`, `flatMap`, `filter`, etc.).
The Dataset API is available in [Scala][scala-datasets] and
[Java][java-datasets]. Python does not have support for the Dataset API, but due to Python's dynamic nature,
many of the benefits of the Dataset API are already available (i.e., you can naturally access the fields of a row
by name, `row.columnName`). The case for R is similar.
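
As an illustrative sketch in Scala, reusing the `spark` session and `spark.implicits._` from the earlier example (the `Person` case class and sample values here are assumptions for illustration), a Dataset can be built from JVM objects and then transformed with typed lambdas:

```scala
import org.apache.spark.sql.Dataset

// Hypothetical domain class; encoders for case classes come from spark.implicits._
case class Person(name: String, age: Long)

// Construct a Dataset directly from JVM objects...
val ds: Dataset[Person] = Seq(Person("Alice", 34), Person("Bob", 19)).toDS()

// ...and manipulate it with strongly typed functional transformations
val adultNames: Dataset[String] = ds.filter(p => p.age >= 21).map(p => p.name)
```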

A DataFrame is a *Dataset* organized into named columns. It is conceptually
equivalent to a table in a relational database or a data frame in R/Python, but with richer
optimizations under the hood. DataFrames can be constructed from a wide array of [sources](sql-data-sources.html) such
as: structured data files, tables in Hive, external databases, or existing RDDs.
The DataFrame API is available in Scala,
Java, [Python](api/python/pyspark.sql.html#pyspark.sql.DataFrame), and [R](api/R/index.html).
In Scala and Java, a DataFrame is represented by a Dataset of `Row`s.
In [the Scala API][scala-datasets], `DataFrame` is simply a type alias of `Dataset[Row]`,
while in the [Java API][java-datasets], users need to use `Dataset<Row>` to represent a `DataFrame`.
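
A small sketch of that alias (again assuming the `spark` session and implicits from the earlier examples): in Scala the two types are interchangeable, and `Row` fields are accessed generically by name or position rather than through a typed class:

```scala
import org.apache.spark.sql.{DataFrame, Dataset, Row}

val df: DataFrame = Seq(("Alice", 34L)).toDF("name", "age")

// Compiles because DataFrame is defined as a type alias for Dataset[Row]
val sameData: Dataset[Row] = df

// Row fields are untyped at compile time and accessed generically
df.collect().foreach(row => println(row.getAs[String]("name")))
```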

[scala-datasets]: api/scala/org/apache/spark/sql/Dataset.html
[java-datasets]: api/java/index.html?org/apache/spark/sql/Dataset.html

Throughout this document, we will often refer to Scala/Java Datasets of `Row`s as DataFrames.