---
layout: global
title: Data Types
displayTitle: Data Types
license: |
  Licensed to the Apache Software Foundation (ASF) under one or more
  contributor license agreements.  See the NOTICE file distributed with
  this work for additional information regarding copyright ownership.
  The ASF licenses this file to You under the Apache License, Version 2.0
  (the "License"); you may not use this file except in compliance with
  the License.  You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.
---

### Supported Data Types

Spark SQL and DataFrames support the following data types:

* Numeric types
  - `ByteType`: Represents 1-byte signed integer numbers.
  The range of numbers is from `-128` to `127`.
  - `ShortType`: Represents 2-byte signed integer numbers.
  The range of numbers is from `-32768` to `32767`.
  - `IntegerType`: Represents 4-byte signed integer numbers.
  The range of numbers is from `-2147483648` to `2147483647`.
  - `LongType`: Represents 8-byte signed integer numbers.
  The range of numbers is from `-9223372036854775808` to `9223372036854775807`.
  - `FloatType`: Represents 4-byte single-precision floating point numbers.
  - `DoubleType`: Represents 8-byte double-precision floating point numbers.
  - `DecimalType`: Represents arbitrary-precision signed decimal numbers. Backed internally by `java.math.BigDecimal`. A `BigDecimal` consists of an arbitrary precision integer unscaled value and a 32-bit integer scale.
* String type
  - `StringType`: Represents character string values.
* Binary type
  - `BinaryType`: Represents byte sequence values.
* Boolean type
  - `BooleanType`: Represents boolean values.
* Datetime type
  - `TimestampType`: Represents values comprising fields year, month, day,
  hour, minute, and second, with the session local time-zone. The timestamp value represents an
  absolute point in time.
  - `DateType`: Represents values comprising fields year, month and day, without a
  time-zone.
* Complex types
  - `ArrayType(elementType, containsNull)`: Represents values comprising a sequence of
  elements with the type of `elementType`. `containsNull` is used to indicate if
  elements in an `ArrayType` value can have `null` values.
  - `MapType(keyType, valueType, valueContainsNull)`:
  Represents values comprising a set of key-value pairs. The data type of keys is
  described by `keyType` and the data type of values is described by `valueType`.
  For a `MapType` value, keys are not allowed to have `null` values. `valueContainsNull`
  is used to indicate if values of a `MapType` value can have `null` values.
  - `StructType(fields)`: Represents values with the structure described by
  a sequence of `StructField`s (`fields`).
    * `StructField(name, dataType, nullable)`: Represents a field in a `StructType`.
    The name of a field is indicated by `name`. The data type of a field is indicated
    by `dataType`. `nullable` is used to indicate if values of this field can have
    `null` values. A schema combining these complex types is sketched after this list.

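As a rough illustration of how the complex types compose, the sketch below builds a nested schema in Scala. The column names (`name`, `scores`, `attributes`, `address`) are invented for this example and are not part of Spark's API; the type constructors themselves are the ones described above.

{% highlight scala %}
import org.apache.spark.sql.types._

// A hypothetical nested schema; field names are illustrative only.
val addressType = StructType(Seq(
  StructField("street", StringType, nullable = true),
  StructField("zip", StringType, nullable = false)))

val schema = StructType(Seq(
  // A non-nullable string column.
  StructField("name", StringType, nullable = false),
  // An array of doubles; containsNull = false means elements cannot be null.
  StructField("scores", ArrayType(DoubleType, containsNull = false), nullable = true),
  // A map from string keys to long values; keys can never be null,
  // valueContainsNull = true allows null values.
  StructField("attributes", MapType(StringType, LongType, valueContainsNull = true), nullable = true),
  // A nested struct.
  StructField("address", addressType, nullable = true)))
{% endhighlight %}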
<div class="codetabs">
<div data-lang="scala"  markdown="1">

All data types of Spark SQL are located in the package `org.apache.spark.sql.types`.
You can access them by doing

{% include_example data_types scala/org/apache/spark/examples/sql/SparkSQLExample.scala %}

|Data type|Value type in Scala|API to access or create a data type|
|---------|-------------------|-----------------------------------|
|**ByteType**|Byte|ByteType|
|**ShortType**|Short|ShortType|
|**IntegerType**|Int|IntegerType|
|**LongType**|Long|LongType|
|**FloatType**|Float|FloatType|
|**DoubleType**|Double|DoubleType|
|**DecimalType**|java.math.BigDecimal|DecimalType|
|**StringType**|String|StringType|
|**BinaryType**|Array[Byte]|BinaryType|
|**BooleanType**|Boolean|BooleanType|
|**TimestampType**|java.sql.Timestamp|TimestampType|
|**DateType**|java.sql.Date|DateType|
|**ArrayType**|scala.collection.Seq|ArrayType(*elementType*, [*containsNull*])<br/>**Note:** The default value of *containsNull* is true.|
|**MapType**|scala.collection.Map|MapType(*keyType*, *valueType*, [*valueContainsNull*])<br/>**Note:** The default value of *valueContainsNull* is true.|
|**StructType**|org.apache.spark.sql.Row|StructType(*fields*)<br/>**Note:** *fields* is a Seq of StructFields. Also, two fields with the same name are not allowed.|
|**StructField**|The value type in Scala of the data type of this field (For example, Int for a StructField with the data type IntegerType)|StructField(*name*, *dataType*, [*nullable*])<br/>**Note:** The default value of *nullable* is true.|

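As a small, non-authoritative complement to the table, the sketch below shows the "API to access or create a data type" column in use; the precision and scale passed to `DecimalType` are illustrative choices, not defaults.

{% highlight scala %}
import org.apache.spark.sql.types._

// Simple types are singleton objects; parameterized types are constructed explicitly.
val decimal = DecimalType(10, 2)            // precision = 10, scale = 2 (illustrative values)
val ints    = ArrayType(IntegerType)        // containsNull defaults to true
val lookup  = MapType(StringType, LongType) // valueContainsNull defaults to true
val field   = StructField("id", LongType)   // nullable defaults to true
{% endhighlight %}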
</div>

<div data-lang="java" markdown="1">

All data types of Spark SQL are located in the package
`org.apache.spark.sql.types`. To access or create a data type,
please use factory methods provided in
`org.apache.spark.sql.types.DataTypes`.

|Data type|Value type in Java|API to access or create a data type|
|---------|------------------|-----------------------------------|
|**ByteType**|byte or Byte|DataTypes.ByteType|
|**ShortType**|short or Short|DataTypes.ShortType|
|**IntegerType**|int or Integer|DataTypes.IntegerType|
|**LongType**|long or Long|DataTypes.LongType|
|**FloatType**|float or Float|DataTypes.FloatType|
|**DoubleType**|double or Double|DataTypes.DoubleType|
|**DecimalType**|java.math.BigDecimal|DataTypes.createDecimalType()<br/>DataTypes.createDecimalType(*precision*, *scale*)|
|**StringType**|String|DataTypes.StringType|
|**BinaryType**|byte[]|DataTypes.BinaryType|
|**BooleanType**|boolean or Boolean|DataTypes.BooleanType|
|**TimestampType**|java.sql.Timestamp|DataTypes.TimestampType|
|**DateType**|java.sql.Date|DataTypes.DateType|
|**ArrayType**|java.util.List|DataTypes.createArrayType(*elementType*)<br/>**Note:** The value of *containsNull* will be true.<br/>DataTypes.createArrayType(*elementType*, *containsNull*)|
|**MapType**|java.util.Map|DataTypes.createMapType(*keyType*, *valueType*)<br/>**Note:** The value of *valueContainsNull* will be true.<br/>DataTypes.createMapType(*keyType*, *valueType*, *valueContainsNull*)|
|**StructType**|org.apache.spark.sql.Row|DataTypes.createStructType(*fields*)<br/>**Note:** *fields* is a List or an array of StructFields. Also, two fields with the same name are not allowed.|
|**StructField**|The value type in Java of the data type of this field (For example, int for a StructField with the data type IntegerType)|DataTypes.createStructField(*name*, *dataType*, *nullable*)|

</div>

<div data-lang="python"  markdown="1">

All data types of Spark SQL are located in the package `pyspark.sql.types`.
You can access them by doing
{% highlight python %}
from pyspark.sql.types import *
{% endhighlight %}

|Data type|Value type in Python|API to access or create a data type|
|---------|--------------------|-----------------------------------|
|**ByteType**|int or long<br/>**Note:** Numbers will be converted to 1-byte signed integer numbers at runtime. Please make sure that numbers are within the range of -128 to 127.|ByteType()|
|**ShortType**|int or long<br/>**Note:** Numbers will be converted to 2-byte signed integer numbers at runtime. Please make sure that numbers are within the range of -32768 to 32767.|ShortType()|
|**IntegerType**|int or long|IntegerType()|
|**LongType**|long<br/>**Note:** Numbers will be converted to 8-byte signed integer numbers at runtime. Please make sure that numbers are within the range of -9223372036854775808 to 9223372036854775807. Otherwise, please convert data to decimal.Decimal and use DecimalType.|LongType()|
|**FloatType**|float<br/>**Note:** Numbers will be converted to 4-byte single-precision floating point numbers at runtime.|FloatType()|
|**DoubleType**|float|DoubleType()|
|**DecimalType**|decimal.Decimal|DecimalType()|
|**StringType**|string|StringType()|
|**BinaryType**|bytearray|BinaryType()|
|**BooleanType**|bool|BooleanType()|
|**TimestampType**|datetime.datetime|TimestampType()|
|**DateType**|datetime.date|DateType()|
|**ArrayType**|list, tuple, or array|ArrayType(*elementType*, [*containsNull*])<br/>**Note:** The default value of *containsNull* is True.|
|**MapType**|dict|MapType(*keyType*, *valueType*, [*valueContainsNull*])<br/>**Note:** The default value of *valueContainsNull* is True.|
|**StructType**|list or tuple|StructType(*fields*)<br/>**Note:** *fields* is a list of StructFields. Also, two fields with the same name are not allowed.|
|**StructField**|The value type in Python of the data type of this field<br/>(For example, int for a StructField with the data type IntegerType)|StructField(*name*, *dataType*, [*nullable*])<br/>**Note:** The default value of *nullable* is True.|

</div>

<div data-lang="r"  markdown="1">

|Data type|Value type in R|API to access or create a data type|
|---------|---------------|-----------------------------------|
|**ByteType**|integer <br/>**Note:** Numbers will be converted to 1-byte signed integer numbers at runtime.  Please make sure that numbers are within the range of -128 to 127.|"byte"|
|**ShortType**|integer <br/>**Note:** Numbers will be converted to 2-byte signed integer numbers at runtime.  Please make sure that numbers are within the range of -32768 to 32767.|"short"|
|**IntegerType**|integer|"integer"|
|**LongType**|integer <br/>**Note:** Numbers will be converted to 8-byte signed integer numbers at runtime.  Please make sure that numbers are within the range of -9223372036854775808 to 9223372036854775807.  Otherwise, please convert data to decimal.Decimal and use DecimalType.|"long"|
|**FloatType**|numeric <br/>**Note:** Numbers will be converted to 4-byte single-precision floating point numbers at runtime.|"float"|
|**DoubleType**|numeric|"double"|
|**DecimalType**|Not supported|Not supported|
|**StringType**|character|"string"|
|**BinaryType**|raw|"binary"|
|**BooleanType**|logical|"bool"|
|**TimestampType**|POSIXct|"timestamp"|
|**DateType**|Date|"date"|
|**ArrayType**|vector or list|list(type="array", elementType=*elementType*, containsNull=[*containsNull*])<br/>**Note:** The default value of *containsNull* is TRUE.|
|**MapType**|environment|list(type="map", keyType=*keyType*, valueType=*valueType*, valueContainsNull=[*valueContainsNull*])<br/>**Note:** The default value of *valueContainsNull* is TRUE.|
|**StructType**|named list|list(type="struct", fields=*fields*)<br/>**Note:** *fields* is a Seq of StructFields. Also, two fields with the same name are not allowed.|
|**StructField**|The value type in R of the data type of this field (For example, integer for a StructField with the data type IntegerType)|list(name=*name*, type=*dataType*, nullable=[*nullable*])<br/>**Note:** The default value of *nullable* is TRUE.|

</div>

<div data-lang="SQL"  markdown="1">

The following table shows the type names as well as aliases used in the Spark SQL parser for each data type.

|Data type|SQL name|
|---------|--------|
|**BooleanType**|BOOLEAN|
|**ByteType**|BYTE, TINYINT|
|**ShortType**|SHORT, SMALLINT|
|**IntegerType**|INT, INTEGER|
|**LongType**|LONG, BIGINT|
|**FloatType**|FLOAT, REAL|
|**DoubleType**|DOUBLE|
|**DateType**|DATE|
|**TimestampType**|TIMESTAMP|
|**StringType**|STRING|
|**BinaryType**|BINARY|
|**DecimalType**|DECIMAL, DEC, NUMERIC|
|**CalendarIntervalType**|INTERVAL|
|**ArrayType**|ARRAY<element_type>|
|**StructType**|STRUCT<field1_name: field1_type, field2_name: field2_type, ...>|
|**MapType**|MAP<key_type, value_type>|

</div>
</div>

### Floating Point Special Values

Spark SQL supports several special floating point values in a case-insensitive manner:

 * Inf/+Inf/Infinity/+Infinity: positive infinity
   * `FloatType`: equivalent to Scala `Float.PositiveInfinity`.
   * `DoubleType`: equivalent to Scala `Double.PositiveInfinity`.
 * -Inf/-Infinity: negative infinity
   * `FloatType`: equivalent to Scala `Float.NegativeInfinity`.
   * `DoubleType`: equivalent to Scala `Double.NegativeInfinity`.
 * NaN: not a number
   * `FloatType`: equivalent to Scala `Float.NaN`.
   * `DoubleType`: equivalent to Scala `Double.NaN`.

#### Positive/Negative Infinity Semantics

There is special handling for positive and negative infinity. They have the following semantics:

 * Positive infinity multiplied by any positive value returns positive infinity.
 * Negative infinity multiplied by any positive value returns negative infinity.
 * Positive infinity multiplied by any negative value returns negative infinity.
 * Negative infinity multiplied by any negative value returns positive infinity.
 * Positive/negative infinity multiplied by 0 returns NaN.
 * Positive/negative infinity is equal to itself.
 * In aggregations, all positive infinity values are grouped together. Similarly, all negative infinity values are grouped together.
 * Positive infinity and negative infinity are treated as normal values in join keys.
 * Positive infinity sorts lower than NaN and higher than any other value.
 * Negative infinity sorts lower than any other value.

#### NaN Semantics

There is special handling for not-a-number (NaN) when dealing with `float` or `double` types;
it does not exactly match standard floating point semantics.
Specifically:

 * NaN = NaN returns true.
 * In aggregations, all NaN values are grouped together.
 * NaN is treated as a normal value in join keys.
 * NaN values go last when in ascending order, larger than any other numeric value.

#### Examples

```sql
SELECT double('infinity') AS col;
+--------+
|     col|
+--------+
|Infinity|
+--------+

SELECT float('-inf') AS col;
+---------+
|      col|
+---------+
|-Infinity|
+---------+

SELECT float('NaN') AS col;
+---+
|col|
+---+
|NaN|
+---+

SELECT double('infinity') * 0 AS col;
+---+
|col|
+---+
|NaN|
+---+

SELECT double('-infinity') * (-1234567) AS col;
+--------+
|     col|
+--------+
|Infinity|
+--------+

SELECT double('infinity') < double('NaN') AS col;
+----+
| col|
+----+
|true|
+----+

SELECT double('NaN') = double('NaN') AS col;
+----+
| col|
+----+
|true|
+----+

SELECT double('inf') = double('infinity') AS col;
+----+
| col|
+----+
|true|
+----+

CREATE TABLE test (c1 int, c2 double);
INSERT INTO test VALUES (1, double('infinity'));
INSERT INTO test VALUES (2, double('infinity'));
INSERT INTO test VALUES (3, double('inf'));
INSERT INTO test VALUES (4, double('-inf'));
INSERT INTO test VALUES (5, double('NaN'));
INSERT INTO test VALUES (6, double('NaN'));
INSERT INTO test VALUES (7, double('-infinity'));
SELECT COUNT(*), c2 FROM test GROUP BY c2;
+---------+---------+
| count(1)|       c2|
+---------+---------+
|        2|      NaN|
|        2|-Infinity|
|        3| Infinity|
+---------+---------+
```