---
layout: global
title: Getting Started
displayTitle: Getting Started
license: |
  Licensed to the Apache Software Foundation (ASF) under one or more
  contributor license agreements.  See the NOTICE file distributed with
  this work for additional information regarding copyright ownership.
  The ASF licenses this file to You under the Apache License, Version 2.0
  (the "License"); you may not use this file except in compliance with
  the License.  You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.
---

* Table of contents
{:toc}

## Starting Point: SparkSession

<div class="codetabs">
<div data-lang="scala"  markdown="1">

The entry point into all functionality in Spark is the [`SparkSession`](api/scala/org/apache/spark/sql/SparkSession.html) class. To create a basic `SparkSession`, just use `SparkSession.builder()`:

{% include_example init_session scala/org/apache/spark/examples/sql/SparkSQLExample.scala %}
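
In outline, the builder pattern looks like the following sketch (the app name and config key are placeholders, not required values):

{% highlight scala %}
import org.apache.spark.sql.SparkSession

// Build a session, or reuse the active one if this JVM already started it.
val spark = SparkSession
  .builder()
  .appName("Spark SQL basic example")                 // placeholder app name
  .config("spark.some.config.option", "some-value")   // placeholder config key
  .getOrCreate()

// For implicit conversions like converting RDDs and Seqs to DataFrames
import spark.implicits._
{% endhighlight %}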
</div>

<div data-lang="java" markdown="1">

The entry point into all functionality in Spark is the [`SparkSession`](api/java/index.html#org.apache.spark.sql.SparkSession) class. To create a basic `SparkSession`, just use `SparkSession.builder()`:

{% include_example init_session java/org/apache/spark/examples/sql/JavaSparkSQLExample.java %}
</div>

<div data-lang="python"  markdown="1">

The entry point into all functionality in Spark is the [`SparkSession`](api/python/pyspark.sql.html#pyspark.sql.SparkSession) class. To create a basic `SparkSession`, just use `SparkSession.builder`:

{% include_example init_session python/sql/basic.py %}
</div>

<div data-lang="r"  markdown="1">

The entry point into all functionality in Spark is the [`SparkSession`](api/R/sparkR.session.html) class. To initialize a basic `SparkSession`, just call `sparkR.session()`:

{% include_example init_session r/RSparkSQLExample.R %}

Note that when invoked for the first time, `sparkR.session()` initializes a global `SparkSession` singleton instance, and always returns a reference to this instance for successive invocations. In this way, users only need to initialize the `SparkSession` once, then SparkR functions like `read.df` will be able to access this global instance implicitly, and users don't need to pass the `SparkSession` instance around.
</div>
</div>

`SparkSession` in Spark 2.0 provides built-in support for Hive features, including the ability to
write queries using HiveQL, access Hive UDFs, and read data from Hive tables.
To use these features, you do not need to have an existing Hive setup.
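
Hive support is enabled on the builder; a minimal sketch, assuming the `spark-hive` module is on the classpath (the warehouse directory is an illustrative value):

{% highlight scala %}
import org.apache.spark.sql.SparkSession

// enableHiveSupport() requires the Hive classes (spark-hive) to be available;
// no running Hive installation is needed.
val spark = SparkSession
  .builder()
  .appName("Hive-enabled example")                            // illustrative name
  .config("spark.sql.warehouse.dir", "/tmp/spark-warehouse")  // illustrative location
  .enableHiveSupport()
  .getOrCreate()

spark.sql("SHOW TABLES").show()
{% endhighlight %}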

## Creating DataFrames

<div class="codetabs">
<div data-lang="scala"  markdown="1">
With a `SparkSession`, applications can create DataFrames from an [existing `RDD`](#interoperating-with-rdds),
from a Hive table, or from [Spark data sources](sql-data-sources.html).

As an example, the following creates a DataFrame based on the content of a JSON file:

{% include_example create_df scala/org/apache/spark/examples/sql/SparkSQLExample.scala %}
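
In sketch form, reading a JSON file looks like this (the path assumes the sample data shipped with the Spark distribution; substitute your own file):

{% highlight scala %}
// spark is an existing SparkSession
val df = spark.read.json("examples/src/main/resources/people.json")

// Print the inferred schema and display the contents
df.printSchema()
df.show()
{% endhighlight %}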
</div>

<div data-lang="java" markdown="1">
With a `SparkSession`, applications can create DataFrames from an [existing `RDD`](#interoperating-with-rdds),
from a Hive table, or from [Spark data sources](sql-data-sources.html).

As an example, the following creates a DataFrame based on the content of a JSON file:

{% include_example create_df java/org/apache/spark/examples/sql/JavaSparkSQLExample.java %}
</div>

<div data-lang="python"  markdown="1">
With a `SparkSession`, applications can create DataFrames from an [existing `RDD`](#interoperating-with-rdds),
from a Hive table, or from [Spark data sources](sql-data-sources.html).

As an example, the following creates a DataFrame based on the content of a JSON file:

{% include_example create_df python/sql/basic.py %}
</div>

<div data-lang="r"  markdown="1">
With a `SparkSession`, applications can create DataFrames from a local R data.frame,
from a Hive table, or from [Spark data sources](sql-data-sources.html).

As an example, the following creates a DataFrame based on the content of a JSON file:

{% include_example create_df r/RSparkSQLExample.R %}

</div>
</div>

## Untyped Dataset Operations (aka DataFrame Operations)

DataFrames provide a domain-specific language for structured data manipulation in [Scala](api/scala/org/apache/spark/sql/Dataset.html), [Java](api/java/index.html?org/apache/spark/sql/Dataset.html), [Python](api/python/pyspark.sql.html#pyspark.sql.DataFrame) and [R](api/R/SparkDataFrame.html).

As mentioned above, in Spark 2.0, DataFrames are just Datasets of `Row`s in the Scala and Java APIs. These operations are also referred to as "untyped transformations", in contrast to the "typed transformations" that come with strongly typed Scala/Java Datasets.

Here we include some basic examples of structured data processing using Datasets:

<div class="codetabs">
<div data-lang="scala"  markdown="1">
{% include_example untyped_ops scala/org/apache/spark/examples/sql/SparkSQLExample.scala %}

For a complete list of the types of operations that can be performed on a Dataset, refer to the [API Documentation](api/scala/org/apache/spark/sql/Dataset.html).

In addition to simple column references and expressions, Datasets also have a rich library of functions including string manipulation, date arithmetic, common math operations and more. The complete list is available in the [DataFrame Function Reference](api/scala/org/apache/spark/sql/functions$.html).
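
As a rough sketch of this style (the column names are illustrative, and the `$`-syntax assumes `import spark.implicits._` is in scope):

{% highlight scala %}
import org.apache.spark.sql.functions._

// Assume df has columns "name" (string) and "age" (numeric) -- illustrative schema
df.printSchema()

// Select a column and an expression on a column
df.select($"name", $"age" + 1).show()

// Filter, then group and aggregate
df.filter($"age" > 21)
  .groupBy("age")
  .count()
  .show()

// Use a function from the built-in library
df.select(upper($"name").alias("name_upper")).show()
{% endhighlight %}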
</div>

<div data-lang="java" markdown="1">

{% include_example untyped_ops java/org/apache/spark/examples/sql/JavaSparkSQLExample.java %}

For a complete list of the types of operations that can be performed on a Dataset, refer to the [API Documentation](api/java/org/apache/spark/sql/Dataset.html).

In addition to simple column references and expressions, Datasets also have a rich library of functions including string manipulation, date arithmetic, common math operations and more. The complete list is available in the [DataFrame Function Reference](api/java/org/apache/spark/sql/functions.html).
</div>

<div data-lang="python"  markdown="1">
In Python, it's possible to access a DataFrame's columns either by attribute
(`df.age`) or by indexing (`df['age']`). While the former is convenient for
interactive data exploration, users are highly encouraged to use the
latter form, which is future-proof and won't break with column names that
are also attributes on the DataFrame class.

{% include_example untyped_ops python/sql/basic.py %}
For a complete list of the types of operations that can be performed on a DataFrame, refer to the [API Documentation](api/python/pyspark.sql.html#pyspark.sql.DataFrame).

In addition to simple column references and expressions, DataFrames also have a rich library of functions including string manipulation, date arithmetic, common math operations and more. The complete list is available in the [DataFrame Function Reference](api/python/pyspark.sql.html#module-pyspark.sql.functions).

</div>

<div data-lang="r"  markdown="1">

{% include_example untyped_ops r/RSparkSQLExample.R %}

For a complete list of the types of operations that can be performed on a DataFrame, refer to the [API Documentation](api/R/index.html).

In addition to simple column references and expressions, DataFrames also have a rich library of functions including string manipulation, date arithmetic, common math operations and more. The complete list is available in the [DataFrame Function Reference](api/R/SparkDataFrame.html).

</div>

</div>

## Running SQL Queries Programmatically

<div class="codetabs">
<div data-lang="scala"  markdown="1">
The `sql` function on a `SparkSession` enables applications to run SQL queries programmatically and returns the result as a `DataFrame`.

{% include_example run_sql scala/org/apache/spark/examples/sql/SparkSQLExample.scala %}
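
In outline (the view name is illustrative):

{% highlight scala %}
// Register the DataFrame as a SQL temporary view, then query it
df.createOrReplaceTempView("people")

val sqlDF = spark.sql("SELECT * FROM people")
sqlDF.show()
{% endhighlight %}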
</div>

<div data-lang="java" markdown="1">
The `sql` function on a `SparkSession` enables applications to run SQL queries programmatically and returns the result as a `Dataset<Row>`.

{% include_example run_sql java/org/apache/spark/examples/sql/JavaSparkSQLExample.java %}
</div>

<div data-lang="python"  markdown="1">
The `sql` function on a `SparkSession` enables applications to run SQL queries programmatically and returns the result as a `DataFrame`.

{% include_example run_sql python/sql/basic.py %}
</div>

<div data-lang="r"  markdown="1">
The `sql` function enables applications to run SQL queries programmatically and returns the result as a `SparkDataFrame`.

{% include_example run_sql r/RSparkSQLExample.R %}

</div>
</div>

## Global Temporary View

Temporary views in Spark SQL are session-scoped and will disappear if the session that creates them
terminates. If you want a temporary view that is shared among all sessions and kept alive
until the Spark application terminates, you can create a global temporary view. A global temporary
view is tied to a system-preserved database `global_temp`, and you must use the qualified name to
refer to it, e.g. `SELECT * FROM global_temp.view1`.

<div class="codetabs">
<div data-lang="scala"  markdown="1">
{% include_example global_temp_view scala/org/apache/spark/examples/sql/SparkSQLExample.scala %}
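
Roughly, the Scala usage looks like this (the view name is illustrative):

{% highlight scala %}
// Register the DataFrame as a global temporary view
df.createGlobalTempView("people")

// Global temporary views live in the system-preserved database `global_temp`
spark.sql("SELECT * FROM global_temp.people").show()

// and remain visible across sessions of the same application
spark.newSession().sql("SELECT * FROM global_temp.people").show()
{% endhighlight %}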
</div>

<div data-lang="java" markdown="1">
{% include_example global_temp_view java/org/apache/spark/examples/sql/JavaSparkSQLExample.java %}
</div>

<div data-lang="python"  markdown="1">
{% include_example global_temp_view python/sql/basic.py %}
</div>

<div data-lang="SQL"  markdown="1">

{% highlight sql %}

CREATE GLOBAL TEMPORARY VIEW temp_view AS SELECT a + 1, b * 2 FROM tbl

SELECT * FROM global_temp.temp_view

{% endhighlight %}

</div>
</div>

## Creating Datasets

Datasets are similar to RDDs; however, instead of using Java serialization or Kryo, they use
a specialized [Encoder](api/scala/org/apache/spark/sql/Encoder.html) to serialize the objects
for processing or transmitting over the network. While both encoders and standard serialization are
responsible for turning an object into bytes, encoders are code generated dynamically and use a format
that allows Spark to perform many operations like filtering, sorting and hashing without deserializing
the bytes back into an object.

<div class="codetabs">
<div data-lang="scala"  markdown="1">
{% include_example create_ds scala/org/apache/spark/examples/sql/SparkSQLExample.scala %}
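
In sketch form (the case class, values, and file path are illustrative):

{% highlight scala %}
case class Person(name: String, age: Long)

// Encoders for case classes and common types come from spark.implicits._
import spark.implicits._

val caseClassDS = Seq(Person("Andy", 32)).toDS()
caseClassDS.show()

val primitiveDS = Seq(1, 2, 3).toDS()
primitiveDS.map(_ + 1).collect()  // Array(2, 3, 4)

// A DataFrame can be converted to a Dataset by mapping it to a class
val peopleDS = spark.read.json("examples/src/main/resources/people.json").as[Person]
peopleDS.show()
{% endhighlight %}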
</div>

<div data-lang="java" markdown="1">
{% include_example create_ds java/org/apache/spark/examples/sql/JavaSparkSQLExample.java %}
</div>
</div>

## Interoperating with RDDs

Spark SQL supports two different methods for converting existing RDDs into Datasets. The first
method uses reflection to infer the schema of an RDD that contains specific types of objects. This
reflection-based approach leads to more concise code and works well when you already know the schema
while writing your Spark application.

The second method for creating Datasets is through a programmatic interface that allows you to
construct a schema and then apply it to an existing RDD. While this method is more verbose, it allows
you to construct Datasets when the columns and their types are not known until runtime.

### Inferring the Schema Using Reflection
<div class="codetabs">

<div data-lang="scala"  markdown="1">

The Scala interface for Spark SQL supports automatically converting an RDD containing case classes
to a DataFrame. The case class
defines the schema of the table. The names of the arguments to the case class are read using
reflection and become the names of the columns. Case classes can also be nested or contain complex
types such as `Seq`s or `Array`s. This RDD can be implicitly converted to a DataFrame and then be
registered as a table. Tables can be used in subsequent SQL statements.

{% include_example schema_inferring scala/org/apache/spark/examples/sql/SparkSQLExample.scala %}
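
A condensed sketch of the pattern (the file path and the comma-separated `name,age` line format are assumptions for this illustration):

{% highlight scala %}
case class Person(name: String, age: Long)

import spark.implicits._

// Create an RDD of Person objects from a text file and convert it to a DataFrame
val peopleDF = spark.sparkContext
  .textFile("examples/src/main/resources/people.txt")
  .map(_.split(","))
  .map(attributes => Person(attributes(0), attributes(1).trim.toLong))
  .toDF()

// Register the DataFrame as a temporary view and query it with SQL
peopleDF.createOrReplaceTempView("people")
spark.sql("SELECT name, age FROM people WHERE age BETWEEN 13 AND 19").show()
{% endhighlight %}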
</div>

<div data-lang="java"  markdown="1">

Spark SQL supports automatically converting an RDD of
[JavaBeans](http://stackoverflow.com/questions/3295496/what-is-a-javabean-exactly) into a DataFrame.
The `BeanInfo`, obtained using reflection, defines the schema of the table. Currently, Spark SQL
does not support JavaBeans that contain `Map` field(s). Nested JavaBeans and `List` or `Array`
fields are supported though. You can create a JavaBean by creating a class that implements
Serializable and has getters and setters for all of its fields.

{% include_example schema_inferring java/org/apache/spark/examples/sql/JavaSparkSQLExample.java %}
</div>

<div data-lang="python"  markdown="1">

Spark SQL can convert an RDD of Row objects to a DataFrame, inferring the datatypes. Rows are constructed by passing a list of
key/value pairs as kwargs to the Row class. The keys of this list define the column names of the table,
and the types are inferred by sampling the whole dataset, similar to the inference that is performed on JSON files.

{% include_example schema_inferring python/sql/basic.py %}
</div>

</div>

### Programmatically Specifying the Schema

<div class="codetabs">

<div data-lang="scala"  markdown="1">

When case classes cannot be defined ahead of time (for example,
the structure of records is encoded in a string, or a text dataset will be parsed
and fields will be projected differently for different users),
a `DataFrame` can be created programmatically with three steps.

1. Create an RDD of `Row`s from the original RDD;
2. Create the schema represented by a `StructType` matching the structure of
`Row`s in the RDD created in Step 1;
3. Apply the schema to the RDD of `Row`s via the `createDataFrame` method provided
by `SparkSession`.

For example:

{% include_example programmatic_schema scala/org/apache/spark/examples/sql/SparkSQLExample.scala %}
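
A compact sketch of the three steps (the field names and in-memory input are illustrative):

{% highlight scala %}
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// Step 1: an RDD of Rows (built here from an in-memory collection for brevity)
val rowRDD = spark.sparkContext
  .parallelize(Seq(("Michael", 29), ("Andy", 30)))
  .map { case (name, age) => Row(name, age) }

// Step 2: the schema, expressed as a StructType
val schema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("age", IntegerType, nullable = true)))

// Step 3: apply the schema to the RDD of Rows
val peopleDF = spark.createDataFrame(rowRDD, schema)
peopleDF.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 29").show()
{% endhighlight %}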
</div>

<div data-lang="java"  markdown="1">

When JavaBean classes cannot be defined ahead of time (for example,
the structure of records is encoded in a string, or a text dataset will be parsed and
fields will be projected differently for different users),
a `Dataset<Row>` can be created programmatically with three steps.

1. Create an RDD of `Row`s from the original RDD;
2. Create the schema represented by a `StructType` matching the structure of
`Row`s in the RDD created in Step 1;
3. Apply the schema to the RDD of `Row`s via the `createDataFrame` method provided
by `SparkSession`.

For example:

{% include_example programmatic_schema java/org/apache/spark/examples/sql/JavaSparkSQLExample.java %}
</div>

<div data-lang="python"  markdown="1">

When a dictionary of kwargs cannot be defined ahead of time (for example,
the structure of records is encoded in a string, or a text dataset will be parsed and
fields will be projected differently for different users),
a `DataFrame` can be created programmatically with three steps.

1. Create an RDD of tuples or lists from the original RDD;
2. Create the schema represented by a `StructType` matching the structure of
tuples or lists in the RDD created in Step 1;
3. Apply the schema to the RDD via the `createDataFrame` method provided by `SparkSession`.

For example:

{% include_example programmatic_schema python/sql/basic.py %}
</div>

</div>

## Scalar Functions

Scalar functions are functions that return a single value per row, as opposed to aggregation functions, which return a value for a group of rows. Spark SQL supports a variety of [Built-in Scalar Functions](sql-ref-functions.html#scalar-functions). It also supports [User Defined Scalar Functions](sql-ref-functions-udf-scalar.html).
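
As an illustration in Scala (the column names and the UDF logic are made up for this sketch; the `$`-syntax assumes `import spark.implicits._`):

{% highlight scala %}
import org.apache.spark.sql.functions._

// Built-in scalar functions operate row by row
df.select(upper($"name"), length($"name")).show()

// A user-defined scalar function, usable from both the DataFrame API and SQL
val plusOne = udf((x: Int) => x + 1)
spark.udf.register("plus_one", plusOne)

df.select(plusOne($"age")).show()
spark.sql("SELECT plus_one(5)").show()
{% endhighlight %}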

## Aggregate Functions

Aggregate functions are functions that return a single value for a group of rows. The [Built-in Aggregation Functions](sql-ref-functions-builtin.html#aggregate-functions) provide common aggregations such as `count()`, `countDistinct()`, `avg()`, `max()`, `min()`, etc.
Users are not limited to the predefined aggregate functions and can create their own. For more details
about user-defined aggregate functions, please refer to the documentation of
[User Defined Aggregate Functions](sql-ref-functions-udf-aggregate.html).
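
For a feel of the DataFrame-side equivalents (column names are illustrative; the `$`-syntax assumes `import spark.implicits._`):

{% highlight scala %}
import org.apache.spark.sql.functions._

// Aggregate over the whole DataFrame
df.agg(avg($"age"), max($"age"), countDistinct($"name")).show()

// Aggregate per group
df.groupBy($"department")
  .agg(count($"name").alias("n"), avg($"salary").alias("avg_salary"))
  .show()
{% endhighlight %}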