0001 ---
0002 layout: global
0003 title: CREATE DATASOURCE TABLE
0004 displayTitle: CREATE DATASOURCE TABLE
0005 license: |
0006 Licensed to the Apache Software Foundation (ASF) under one or more
0007 contributor license agreements. See the NOTICE file distributed with
0008 this work for additional information regarding copyright ownership.
0009 The ASF licenses this file to You under the Apache License, Version 2.0
0010 (the "License"); you may not use this file except in compliance with
0011 the License. You may obtain a copy of the License at
0012
0013 http://www.apache.org/licenses/LICENSE-2.0
0014
0015 Unless required by applicable law or agreed to in writing, software
0016 distributed under the License is distributed on an "AS IS" BASIS,
0017 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
0018 See the License for the specific language governing permissions and
0019 limitations under the License.
0020 ---
0021
0022 ### Description
0023
0024 The `CREATE TABLE` statement defines a new table using a Data Source.
0025
0026 ### Syntax
0027
0028 ```sql
0029 CREATE TABLE [ IF NOT EXISTS ] table_identifier
0030 [ ( col_name1 col_type1 [ COMMENT col_comment1 ], ... ) ]
0031 USING data_source
0032 [ OPTIONS ( key1=val1, key2=val2, ... ) ]
0033 [ PARTITIONED BY ( col_name1, col_name2, ... ) ]
0034 [ CLUSTERED BY ( col_name3, col_name4, ... )
0035 [ SORTED BY ( col_name [ ASC | DESC ], ... ) ]
0036 INTO num_buckets BUCKETS ]
0037 [ LOCATION path ]
0038 [ COMMENT table_comment ]
0039 [ TBLPROPERTIES ( key1=val1, key2=val2, ... ) ]
0040 [ AS select_statement ]
0041 ```
0042
0043 Note that, the clauses between the USING clause and the AS SELECT clause can come in
0044 as any order. For example, you can write COMMENT table_comment after TBLPROPERTIES.
0045
0046 ### Parameters
0047
0048 * **table_identifier**
0049
0050 Specifies a table name, which may be optionally qualified with a database name.
0051
0052 **Syntax:** `[ database_name. ] table_name`
0053
0054 * **USING data_source**
0055
0056 Data Source is the input format used to create the table. Data source can be CSV, TXT, ORC, JDBC, PARQUET, etc.
0057
0058 * **PARTITIONED BY**
0059
0060 Partitions are created on the table, based on the columns specified.
0061
0062 * **CLUSTERED BY**
0063
0064 Partitions created on the table will be bucketed into fixed buckets based on the column specified for bucketing.
0065
0066 **NOTE:** Bucketing is an optimization technique that uses buckets (and bucketing columns) to determine data partitioning and avoid data shuffle.
0067
0068 * **SORTED BY**
0069
0070 Determines the order in which the data is stored in buckets. Default is Ascending order.
0071
0072 * **LOCATION**
0073
0074 Path to the directory where table data is stored, which could be a path on distributed storage like HDFS, etc.
0075
0076 * **COMMENT**
0077
0078 A string literal to describe the table.
0079
0080 * **TBLPROPERTIES**
0081
0082 A list of key-value pairs that is used to tag the table definition.
0083
0084 * **AS select_statement**
0085
0086 The table is populated using the data from the select statement.
0087
0088 ### Data Source Interaction
0089
0090 A Data Source table acts like a pointer to the underlying data source. For example, you can create
0091 a table "foo" in Spark which points to a table "bar" in MySQL using JDBC Data Source. When you
0092 read/write table "foo", you actually read/write table "bar".
0093
0094 In general CREATE TABLE is creating a "pointer", and you need to make sure it points to something
0095 existing. An exception is file source such as parquet, json. If you don't specify the LOCATION,
0096 Spark will create a default table location for you.
0097
0098 For CREATE TABLE AS SELECT, Spark will overwrite the underlying data source with the data of the
0099 input query, to make sure the table gets created contains exactly the same data as the input query.
0100
0101 ### Examples
0102
0103 ```sql
0104
0105 --Use data source
0106 CREATE TABLE student (id INT, name STRING, age INT) USING CSV;
0107
0108 --Use data from another table
0109 CREATE TABLE student_copy USING CSV
0110 AS SELECT * FROM student;
0111
0112 --Omit the USING clause, which uses the default data source (parquet by default)
0113 CREATE TABLE student (id INT, name STRING, age INT);
0114
0115 --Specify table comment and properties
0116 CREATE TABLE student (id INT, name STRING, age INT) USING CSV
0117 COMMENT 'this is a comment'
0118 TBLPROPERTIES ('foo'='bar');
0119
0120 --Specify table comment and properties with different clauses order
0121 CREATE TABLE student (id INT, name STRING, age INT) USING CSV
0122 TBLPROPERTIES ('foo'='bar')
0123 COMMENT 'this is a comment';
0124
0125 --Create partitioned and bucketed table
0126 CREATE TABLE student (id INT, name STRING, age INT)
0127 USING CSV
0128 PARTITIONED BY (age)
0129 CLUSTERED BY (Id) INTO 4 buckets;
0130 ```
0131
0132 ### Related Statements
0133
0134 * [CREATE TABLE USING HIVE FORMAT](sql-ref-syntax-ddl-create-table-hiveformat.html)
0135 * [CREATE TABLE LIKE](sql-ref-syntax-ddl-create-table-like.html)