0001 ---
0002 layout: global
0003 title: NULL Semantics
0004 displayTitle: NULL Semantics
0005 license: |
0006 Licensed to the Apache Software Foundation (ASF) under one or more
0007 contributor license agreements. See the NOTICE file distributed with
0008 this work for additional information regarding copyright ownership.
0009 The ASF licenses this file to You under the Apache License, Version 2.0
0010 (the "License"); you may not use this file except in compliance with
0011 the License. You may obtain a copy of the License at
0012
0013 http://www.apache.org/licenses/LICENSE-2.0
0014
0015 Unless required by applicable law or agreed to in writing, software
0016 distributed under the License is distributed on an "AS IS" BASIS,
0017 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
0018 See the License for the specific language governing permissions and
0019 limitations under the License.
0020 ---
0021
0022 ### Description
0023
0024 A table consists of a set of rows and each row contains a set of columns.
0025 A column is associated with a data type and represents
0026 a specific attribute of an entity (for example, `age` is a column of an
0027 entity called `person`). Sometimes, the value of a column
0028 specific to a row is not known at the time the row comes into existence.
0029 In `SQL`, such values are represented as `NULL`. This section details the
0030 semantics of `NULL` values handling in various operators, expressions and
0031 other `SQL` constructs.
0032
0033 1. [Null handling in comparison operators](#comp-operators)
0034 2. [Null handling in Logical operators](#logical-operators)
0035 3. [Null handling in Expressions](#expressions)
0036 1. [Null handling in null-intolerant expressions](#null-intolerant)
0037 2. [Null handling Expressions that can process null value operands](#can-process-null)
0038 3. [Null handling in built-in aggregate expressions](#built-in-aggregate)
0039 4. [Null handling in WHERE, HAVING and JOIN conditions](#condition-expressions)
0040 5. [Null handling in GROUP BY and DISTINCT](#aggregate-operator)
0041 6. [Null handling in ORDER BY](#order-by)
0042 7. [Null handling in UNION, INTERSECT, EXCEPT](#set-operators)
0043 8. [Null handling in EXISTS and NOT EXISTS subquery](#exists-not-exists)
0044 9. [Null handling in IN and NOT IN subquery](#in-not-in)
0045
0046 <style type="text/css">
0047 .tsclass {font-size:12px;color:#333333;width:40%;border-width: 2px;border-color: #729ea5;border-collapse: collapse;}
0048 .tsclass th {text-align: left;}
0049 </style>
0050
0051 The following illustrates the schema layout and data of a table named `person`. The data contains `NULL` values in
0052 the `age` column and this table will be used in various examples in the sections below.<br/>
0053 **<u>TABLE: person</u>**
0054
0055 |Id |Name|Age|
0056 |---|----|---|
0057 |100|Joe|30|
0058 |200|Marry|NULL|
0059 |300|Mike|18|
0060 |400|Fred|50|
0061 |500|Albert|NULL|
0062 |600|Michelle|30|
0063 |700|Dan|50|
0064
0065 ### Comparison Operators <a name="comp-operators"></a>
0066
0067 Apache spark supports the standard comparison operators such as '>', '>=', '=', '<' and '<='.
0068 The result of these operators is unknown or `NULL` when one of the operands or both the operands are
0069 unknown or `NULL`. In order to compare the `NULL` values for equality, Spark provides a null-safe
0070 equal operator ('<=>'), which returns `False` when one of the operand is `NULL` and returns 'True` when
0071 both the operands are `NULL`. The following table illustrates the behaviour of comparison operators when
0072 one or both operands are `NULL`:
0073
0074 |Left Operand|Right Operand|> |>= |= |< |<= |<=>|
0075 |------------|--------------|---|---|---|---|---|---|
0076 |NULL|Any value|NULL|NULL|NULL|NULL|NULL|False|
0077 |Any value|NULL|NULL|NULL|NULL|NULL|NULL|False|
0078 |NULL|NULL|NULL|NULL|NULL|NULL|NULL|True|
0079
0080 ### Examples
0081
0082 ```sql
0083 -- Normal comparison operators return `NULL` when one of the operand is `NULL`.
0084 SELECT 5 > null AS expression_output;
0085 +-----------------+
0086 |expression_output|
0087 +-----------------+
0088 | null|
0089 +-----------------+
0090
0091 -- Normal comparison operators return `NULL` when both the operands are `NULL`.
0092 SELECT null = null AS expression_output;
0093 +-----------------+
0094 |expression_output|
0095 +-----------------+
0096 | null|
0097 +-----------------+
0098
0099 -- Null-safe equal operator return `False` when one of the operand is `NULL`
0100 SELECT 5 <=> null AS expression_output;
0101 +-----------------+
0102 |expression_output|
0103 +-----------------+
0104 | false|
0105 +-----------------+
0106
0107 -- Null-safe equal operator return `True` when one of the operand is `NULL`
0108 SELECT NULL <=> NULL;
0109 +-----------------+
0110 |expression_output|
0111 +-----------------+
0112 | true|
0113 +-----------------+
0114 ```
0115
0116 ### Logical Operators <a name="logical-operators"></a>
0117
0118 Spark supports standard logical operators such as `AND`, `OR` and `NOT`. These operators take `Boolean` expressions
0119 as the arguments and return a `Boolean` value.
0120
0121 The following tables illustrate the behavior of logical operators when one or both operands are `NULL`.
0122
0123 |Left Operand|Right Operand|OR |AND|
0124 |------------|-------------|---|---|
0125 |True|NULL|True|NULL|
0126 |False|NULL|NULL|False|
0127 |NULL|True|True|NULL|
0128 |NULL|False|NULL|NULL|
0129 |NULL|NULL|NULL|NULL|
0130
0131 |operand|NOT|
0132 |-------|---|
0133 |NULL|NULL|
0134
0135 ### Examples
0136
0137 ```sql
0138 -- Normal comparison operators return `NULL` when one of the operands is `NULL`.
0139 SELECT (true OR null) AS expression_output;
0140 +-----------------+
0141 |expression_output|
0142 +-----------------+
0143 | true|
0144 +-----------------+
0145
0146 -- Normal comparison operators return `NULL` when both the operands are `NULL`.
0147 SELECT (null OR false) AS expression_output
0148 +-----------------+
0149 |expression_output|
0150 +-----------------+
0151 | null|
0152 +-----------------+
0153
0154 -- Null-safe equal operator returns `False` when one of the operands is `NULL`
0155 SELECT NOT(null) AS expression_output;
0156 +-----------------+
0157 |expression_output|
0158 +-----------------+
0159 | null|
0160 +-----------------+
0161 ```
0162
0163 ### Expressions <a name="expressions"></a>
0164
0165 The comparison operators and logical operators are treated as expressions in
0166 Spark. Other than these two kinds of expressions, Spark supports other form of
0167 expressions such as function expressions, cast expressions, etc. The expressions
0168 in Spark can be broadly classified as :
0169 - Null intolerant expressions
0170 - Expressions that can process `NULL` value operands
0171 - The result of these expressions depends on the expression itself.
0172
0173 #### Null Intolerant Expressions <a name="null-intolerant"></a>
0174
0175 Null intolerant expressions return `NULL` when one or more arguments of
0176 expression are `NULL` and most of the expressions fall in this category.
0177
0178 ##### Examples
0179
0180 ```sql
0181 SELECT concat('John', null) AS expression_output;
0182 +-----------------+
0183 |expression_output|
0184 +-----------------+
0185 | null|
0186 +-----------------+
0187
0188 SELECT positive(null) AS expression_output;
0189 +-----------------+
0190 |expression_output|
0191 +-----------------+
0192 | null|
0193 +-----------------+
0194
0195 SELECT to_date(null) AS expression_output;
0196 +-----------------+
0197 |expression_output|
0198 +-----------------+
0199 | null|
0200 +-----------------+
0201 ```
0202
0203 #### Expressions That Can Process Null Value Operands <a name="can-process-null"></a>
0204
0205 This class of expressions are designed to handle `NULL` values. The result of the
0206 expressions depends on the expression itself. As an example, function expression `isnull`
0207 returns a `true` on null input and `false` on non null input where as function `coalesce`
0208 returns the first non `NULL` value in its list of operands. However, `coalesce` returns
0209 `NULL` when all its operands are `NULL`. Below is an incomplete list of expressions of this category.
0210 - COALESCE
0211 - NULLIF
0212 - IFNULL
0213 - NVL
0214 - NVL2
0215 - ISNAN
0216 - NANVL
0217 - ISNULL
0218 - ISNOTNULL
0219 - ATLEASTNNONNULLS
0220 - IN
0221
0222 ##### Examples
0223
0224 ```sql
0225 SELECT isnull(null) AS expression_output;
0226 +-----------------+
0227 |expression_output|
0228 +-----------------+
0229 | true|
0230 +-----------------+
0231
0232 -- Returns the first occurrence of non `NULL` value.
0233 SELECT coalesce(null, null, 3, null) AS expression_output;
0234 +-----------------+
0235 |expression_output|
0236 +-----------------+
0237 | 3|
0238 +-----------------+
0239
0240 -- Returns `NULL` as all its operands are `NULL`.
0241 SELECT coalesce(null, null, null, null) AS expression_output;
0242 +-----------------+
0243 |expression_output|
0244 +-----------------+
0245 | null|
0246 +-----------------+
0247
0248 SELECT isnan(null) AS expression_output;
0249 +-----------------+
0250 |expression_output|
0251 +-----------------+
0252 | false|
0253 +-----------------+
0254 ```
0255
0256 #### Builtin Aggregate Expressions <a name="built-in-aggregate"></a>
0257
0258 Aggregate functions compute a single result by processing a set of input rows. Below are
0259 the rules of how `NULL` values are handled by aggregate functions.
0260 - `NULL` values are ignored from processing by all the aggregate functions.
0261 - Only exception to this rule is COUNT(*) function.
0262 - Some aggregate functions return `NULL` when all input values are `NULL` or the input data set
0263 is empty.<br/> The list of these functions is:
0264 - MAX
0265 - MIN
0266 - SUM
0267 - AVG
0268 - EVERY
0269 - ANY
0270 - SOME
0271
0272 #### Examples
0273
0274 ```sql
0275 -- `count(*)` does not skip `NULL` values.
0276 SELECT count(*) FROM person;
0277 +--------+
0278 |count(1)|
0279 +--------+
0280 | 7|
0281 +--------+
0282
0283 -- `NULL` values in column `age` are skipped from processing.
0284 SELECT count(age) FROM person;
0285 +----------+
0286 |count(age)|
0287 +----------+
0288 | 5|
0289 +----------+
0290
0291 -- `count(*)` on an empty input set returns 0. This is unlike the other
0292 -- aggregate functions, such as `max`, which return `NULL`.
0293 SELECT count(*) FROM person where 1 = 0;
0294 +--------+
0295 |count(1)|
0296 +--------+
0297 | 0|
0298 +--------+
0299
0300 -- `NULL` values are excluded from computation of maximum value.
0301 SELECT max(age) FROM person;
0302 +--------+
0303 |max(age)|
0304 +--------+
0305 | 50|
0306 +--------+
0307
0308 -- `max` returns `NULL` on an empty input set.
0309 SELECT max(age) FROM person where 1 = 0;
0310 +--------+
0311 |max(age)|
0312 +--------+
0313 | null|
0314 +--------+
0315 ```
0316
0317 ### Condition Expressions in WHERE, HAVING and JOIN Clauses <a name="condition-expressions"></a>
0318
0319 `WHERE`, `HAVING` operators filter rows based on the user specified condition.
0320 A `JOIN` operator is used to combine rows from two tables based on a join condition.
0321 For all the three operators, a condition expression is a boolean expression and can return
0322 <code>True, False or Unknown (NULL)</code>. They are "satisfied" if the result of the condition is `True`.
0323
0324 #### Examples
0325
0326 ```sql
0327 -- Persons whose age is unknown (`NULL`) are filtered out from the result set.
0328 SELECT * FROM person WHERE age > 0;
0329 +--------+---+
0330 | name|age|
0331 +--------+---+
0332 |Michelle| 30|
0333 | Fred| 50|
0334 | Mike| 18|
0335 | Dan| 50|
0336 | Joe| 30|
0337 +--------+---+
0338
0339 -- `IS NULL` expression is used in disjunction to select the persons
0340 -- with unknown (`NULL`) records.
0341 SELECT * FROM person WHERE age > 0 OR age IS NULL;
0342 +--------+----+
0343 | name| age|
0344 +--------+----+
0345 | Albert|null|
0346 |Michelle| 30|
0347 | Fred| 50|
0348 | Mike| 18|
0349 | Dan| 50|
0350 | Marry|null|
0351 | Joe| 30|
0352 +--------+----+
0353
0354 -- Person with unknown(`NULL`) ages are skipped from processing.
0355 SELECT * FROM person GROUP BY age HAVING max(age) > 18;
0356 +---+--------+
0357 |age|count(1)|
0358 +---+--------+
0359 | 50| 2|
0360 | 30| 2|
0361 +---+--------+
0362
0363 -- A self join case with a join condition `p1.age = p2.age AND p1.name = p2.name`.
0364 -- The persons with unknown age (`NULL`) are filtered out by the join operator.
0365 SELECT * FROM person p1, person p2
0366 WHERE p1.age = p2.age
0367 AND p1.name = p2.name;
0368 +--------+---+--------+---+
0369 | name|age| name|age|
0370 +--------+---+--------+---+
0371 |Michelle| 30|Michelle| 30|
0372 | Fred| 50| Fred| 50|
0373 | Mike| 18| Mike| 18|
0374 | Dan| 50| Dan| 50|
0375 | Joe| 30| Joe| 30|
0376 +--------+---+--------+---+
0377
0378 -- The age column from both legs of join are compared using null-safe equal which
0379 -- is why the persons with unknown age (`NULL`) are qualified by the join.
0380 SELECT * FROM person p1, person p2
0381 WHERE p1.age <=> p2.age
0382 AND p1.name = p2.name;
0383 +--------+----+--------+----+
0384 | name| age| name| age|
0385 +--------+----+--------+----+
0386 | Albert|null| Albert|null|
0387 |Michelle| 30|Michelle| 30|
0388 | Fred| 50| Fred| 50|
0389 | Mike| 18| Mike| 18|
0390 | Dan| 50| Dan| 50|
0391 | Marry|null| Marry|null|
0392 | Joe| 30| Joe| 30|
0393 +--------+----+--------+----+
0394 ```
0395
0396 ### Aggregate Operator (GROUP BY, DISTINCT) <a name="aggregate-operator"></a>
0397
0398 As discussed in the previous section [comparison operator](sql-ref-null-semantics.html#comparison-operators),
0399 two `NULL` values are not equal. However, for the purpose of grouping and distinct processing, the two or more
0400 values with `NULL data`are grouped together into the same bucket. This behaviour is conformant with SQL
0401 standard and with other enterprise database management systems.
0402
0403 #### Examples
0404
0405 ```sql
0406 -- `NULL` values are put in one bucket in `GROUP BY` processing.
0407 SELECT age, count(*) FROM person GROUP BY age;
0408 +----+--------+
0409 | age|count(1)|
0410 +----+--------+
0411 |null| 2|
0412 | 50| 2|
0413 | 30| 2|
0414 | 18| 1|
0415 +----+--------+
0416
0417 -- All `NULL` ages are considered one distinct value in `DISTINCT` processing.
0418 SELECT DISTINCT age FROM person;
0419 +----+
0420 | age|
0421 +----+
0422 |null|
0423 | 50|
0424 | 30|
0425 | 18|
0426 +----+
0427 ```
0428
0429 ### Sort Operator (ORDER BY Clause) <a name="order-by"></a>
0430
0431 Spark SQL supports null ordering specification in `ORDER BY` clause. Spark processes the `ORDER BY` clause by
0432 placing all the `NULL` values at first or at last depending on the null ordering specification. By default, all
0433 the `NULL` values are placed at first.
0434
0435 #### Examples
0436
0437 ```sql
0438 -- `NULL` values are shown at first and other values
0439 -- are sorted in ascending way.
0440 SELECT age, name FROM person ORDER BY age;
0441 +----+--------+
0442 | age| name|
0443 +----+--------+
0444 |null| Marry|
0445 |null| Albert|
0446 | 18| Mike|
0447 | 30|Michelle|
0448 | 30| Joe|
0449 | 50| Fred|
0450 | 50| Dan|
0451 +----+--------+
0452
0453 -- Column values other than `NULL` are sorted in ascending
0454 -- way and `NULL` values are shown at the last.
0455 SELECT age, name FROM person ORDER BY age NULLS LAST;
0456 +----+--------+
0457 | age| name|
0458 +----+--------+
0459 | 18| Mike|
0460 | 30|Michelle|
0461 | 30| Joe|
0462 | 50| Dan|
0463 | 50| Fred|
0464 |null| Marry|
0465 |null| Albert|
0466 +----+--------+
0467
0468 -- Columns other than `NULL` values are sorted in descending
0469 -- and `NULL` values are shown at the last.
0470 SELECT age, name FROM person ORDER BY age DESC NULLS LAST;
0471 +----+--------+
0472 | age| name|
0473 +----+--------+
0474 | 50| Fred|
0475 | 50| Dan|
0476 | 30|Michelle|
0477 | 30| Joe|
0478 | 18| Mike|
0479 |null| Marry|
0480 |null| Albert|
0481 +----+--------+
0482 ```
0483
0484 ### Set Operators (UNION, INTERSECT, EXCEPT) <a name="set-operators"></a>
0485
0486 `NULL` values are compared in a null-safe manner for equality in the context of
0487 set operations. That means when comparing rows, two `NULL` values are considered
0488 equal unlike the regular `EqualTo`(`=`) operator.
0489
0490 #### Examples
0491
0492 ```sql
0493 CREATE VIEW unknown_age SELECT * FROM person WHERE age IS NULL;
0494
0495 -- Only common rows between two legs of `INTERSECT` are in the
0496 -- result set. The comparison between columns of the row are done
0497 -- in a null-safe manner.
0498 SELECT name, age FROM person
0499 INTERSECT
0500 SELECT name, age from unknown_age;
0501 +------+----+
0502 | name| age|
0503 +------+----+
0504 |Albert|null|
0505 | Marry|null|
0506 +------+----+
0507
0508 -- `NULL` values from two legs of the `EXCEPT` are not in output.
0509 -- This basically shows that the comparison happens in a null-safe manner.
0510 SELECT age, name FROM person
0511 EXCEPT
0512 SELECT age FROM unknown_age;
0513 +---+--------+
0514 |age| name|
0515 +---+--------+
0516 | 30| Joe|
0517 | 50| Fred|
0518 | 30|Michelle|
0519 | 18| Mike|
0520 | 50| Dan|
0521 +---+--------+
0522
0523 -- Performs `UNION` operation between two sets of data.
0524 -- The comparison between columns of the row ae done in
0525 -- null-safe manner.
0526 SELECT name, age FROM person
0527 UNION
0528 SELECT name, age FROM unknown_age;
0529 +--------+----+
0530 | name| age|
0531 +--------+----+
0532 | Albert|null|
0533 | Joe| 30|
0534 |Michelle| 30|
0535 | Marry|null|
0536 | Fred| 50|
0537 | Mike| 18|
0538 | Dan| 50|
0539 +--------+----+
0540 ```
0541
0542 ### EXISTS/NOT EXISTS Subquery <a name="exists-not-exists"></a>
0543
0544 In Spark, EXISTS and NOT EXISTS expressions are allowed inside a WHERE clause.
0545 These are boolean expressions which return either `TRUE` or
0546 `FALSE`. In other words, EXISTS is a membership condition and returns `TRUE`
0547 when the subquery it refers to returns one or more rows. Similarly, NOT EXISTS
0548 is a non-membership condition and returns TRUE when no rows or zero rows are
0549 returned from the subquery.
0550
0551 These two expressions are not affected by presence of NULL in the result of
0552 the subquery. They are normally faster because they can be converted to
0553 semijoins / anti-semijoins without special provisions for null awareness.
0554
0555 #### Examples
0556
0557 ```sql
0558 -- Even if subquery produces rows with `NULL` values, the `EXISTS` expression
0559 -- evaluates to `TRUE` as the subquery produces 1 row.
0560 SELECT * FROM person WHERE EXISTS (SELECT null);
0561 +--------+----+
0562 | name| age|
0563 +--------+----+
0564 | Albert|null|
0565 |Michelle| 30|
0566 | Fred| 50|
0567 | Mike| 18|
0568 | Dan| 50|
0569 | Marry|null|
0570 | Joe| 30|
0571 +--------+----+
0572
0573 -- `NOT EXISTS` expression returns `FALSE`. It returns `TRUE` only when
0574 -- subquery produces no rows. In this case, it returns 1 row.
0575 SELECT * FROM person WHERE NOT EXISTS (SELECT null);
0576 +----+---+
0577 |name|age|
0578 +----+---+
0579 +----+---+
0580
0581 -- `NOT EXISTS` expression returns `TRUE`.
0582 SELECT * FROM person WHERE NOT EXISTS (SELECT 1 WHERE 1 = 0);
0583 +--------+----+
0584 | name| age|
0585 +--------+----+
0586 | Albert|null|
0587 |Michelle| 30|
0588 | Fred| 50|
0589 | Mike| 18|
0590 | Dan| 50|
0591 | Marry|null|
0592 | Joe| 30|
0593 +--------+----+
0594 ```
0595
0596 ### IN/NOT IN Subquery <a name="in-not-in"></a>
0597
0598 In Spark, `IN` and `NOT IN` expressions are allowed inside a WHERE clause of
0599 a query. Unlike the `EXISTS` expression, `IN` expression can return a `TRUE`,
0600 `FALSE` or `UNKNOWN (NULL)` value. Conceptually a `IN` expression is semantically
0601 equivalent to a set of equality condition separated by a disjunctive operator (`OR`).
0602 For example, c1 IN (1, 2, 3) is semantically equivalent to `(C1 = 1 OR c1 = 2 OR c1 = 3)`.
0603
0604 As far as handling `NULL` values are concerned, the semantics can be deduced from
0605 the `NULL` value handling in comparison operators(`=`) and logical operators(`OR`).
0606 To summarize, below are the rules for computing the result of an `IN` expression.
0607
0608 - TRUE is returned when the non-NULL value in question is found in the list
0609 - FALSE is returned when the non-NULL value is not found in the list and the
0610 list does not contain NULL values
0611 - UNKNOWN is returned when the value is `NULL`, or the non-NULL value is not found in the list
0612 and the list contains at least one `NULL` value
0613
0614 NOT IN always returns UNKNOWN when the list contains `NULL`, regardless of the input value.
0615 This is because IN returns UNKNOWN if the value is not in the list containing `NULL`,
0616 and because NOT UNKNOWN is again UNKNOWN.
0617
0618 #### Examples
0619
0620 ```sql
0621 -- The subquery has only `NULL` value in its result set. Therefore,
0622 -- the result of `IN` predicate is UNKNOWN.
0623 SELECT * FROM person WHERE age IN (SELECT null);
0624 +----+---+
0625 |name|age|
0626 +----+---+
0627 +----+---+
0628
0629 -- The subquery has `NULL` value in the result set as well as a valid
0630 -- value `50`. Rows with age = 50 are returned.
0631 SELECT * FROM person
0632 WHERE age IN (SELECT age FROM VALUES (50), (null) sub(age));
0633 +----+---+
0634 |name|age|
0635 +----+---+
0636 |Fred| 50|
0637 | Dan| 50|
0638 +----+---+
0639
0640 -- Since subquery has `NULL` value in the result set, the `NOT IN`
0641 -- predicate would return UNKNOWN. Hence, no rows are
0642 -- qualified for this query.
0643 SELECT * FROM person
0644 WHERE age NOT IN (SELECT age FROM VALUES (50), (null) sub(age));
0645 +----+---+
0646 |name|age|
0647 +----+---+
0648 +----+---+
0649 ```