Back to home page

OSCL-LXR

 
 

    


0001 ---
0002 layout: global
0003 title: NULL Semantics
0004 displayTitle: NULL Semantics
0005 license: |
0006   Licensed to the Apache Software Foundation (ASF) under one or more
0007   contributor license agreements.  See the NOTICE file distributed with
0008   this work for additional information regarding copyright ownership.
0009   The ASF licenses this file to You under the Apache License, Version 2.0
0010   (the "License"); you may not use this file except in compliance with
0011   the License.  You may obtain a copy of the License at
0012  
0013      http://www.apache.org/licenses/LICENSE-2.0
0014  
0015   Unless required by applicable law or agreed to in writing, software
0016   distributed under the License is distributed on an "AS IS" BASIS,
0017   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
0018   See the License for the specific language governing permissions and
0019   limitations under the License.
0020 ---
0021 
0022 ### Description
0023 
0024 A table consists of a set of rows and each row contains a set of columns.
0025 A column is associated with a data type and represents
0026 a specific attribute of an entity (for example, `age` is a column of an
0027 entity called `person`). Sometimes, the value of a column
0028 specific to a row is not known at the time the row comes into existence.
0029 In `SQL`, such values are represented as `NULL`. This section details the
0030 semantics of `NULL` values handling in various operators, expressions and
0031 other `SQL` constructs.
0032 
0033 1. [Null handling in comparison operators](#comp-operators)
0034 2. [Null handling in Logical operators](#logical-operators)
0035 3. [Null handling in Expressions](#expressions)
0036       1. [Null handling in null-intolerant expressions](#null-intolerant)
0037       2. [Null handling Expressions that can process null value operands](#can-process-null)
0038       3. [Null handling in built-in aggregate expressions](#built-in-aggregate)
0039 4. [Null handling in WHERE, HAVING and JOIN conditions](#condition-expressions)
0040 5. [Null handling in GROUP BY and DISTINCT](#aggregate-operator)
0041 6. [Null handling in ORDER BY](#order-by)
0042 7. [Null handling in UNION, INTERSECT, EXCEPT](#set-operators)
0043 8. [Null handling in EXISTS and NOT EXISTS subquery](#exists-not-exists)
0044 9. [Null handling in IN and NOT IN subquery](#in-not-in)
0045 
0046 <style type="text/css">
0047 .tsclass {font-size:12px;color:#333333;width:40%;border-width: 2px;border-color: #729ea5;border-collapse: collapse;}
0048 .tsclass th {text-align: left;}
0049 </style>
0050 
0051 The following illustrates the schema layout and data of a table named `person`. The data contains `NULL` values in
0052 the `age` column and this table will be used in various examples in the sections below.<br/>
0053 **<u>TABLE: person</u>**
0054 
0055 |Id |Name|Age|
0056 |---|----|---|
0057 |100|Joe|30|
0058 |200|Marry|NULL|
0059 |300|Mike|18|
0060 |400|Fred|50|
0061 |500|Albert|NULL|
0062 |600|Michelle|30|
0063 |700|Dan|50|
0064 
0065 ### Comparison Operators <a name="comp-operators"></a>
0066 
0067 Apache spark supports the standard comparison operators such as '>', '>=', '=', '<' and '<='.
0068 The result of these operators is unknown or `NULL` when one of the operands or both the operands are
0069 unknown or `NULL`. In order to compare the `NULL` values for equality, Spark provides a null-safe
0070 equal operator ('<=>'), which returns `False` when one of the operand is `NULL` and returns 'True` when
0071 both the operands are `NULL`. The following table illustrates the behaviour of comparison operators when
0072 one or both operands are `NULL`:
0073 
0074 |Left Operand|Right  Operand|>  |>= |=  |<  |<= |<=>|
0075 |------------|--------------|---|---|---|---|---|---|
0076 |NULL|Any value|NULL|NULL|NULL|NULL|NULL|False|
0077 |Any value|NULL|NULL|NULL|NULL|NULL|NULL|False|
0078 |NULL|NULL|NULL|NULL|NULL|NULL|NULL|True|
0079 
0080 ### Examples
0081 
0082 ```sql
0083 -- Normal comparison operators return `NULL` when one of the operand is `NULL`.
0084 SELECT 5 > null AS expression_output;
0085 +-----------------+
0086 |expression_output|
0087 +-----------------+
0088 |             null|
0089 +-----------------+
0090 
0091 -- Normal comparison operators return `NULL` when both the operands are `NULL`.
0092 SELECT null = null AS expression_output;
0093 +-----------------+
0094 |expression_output|
0095 +-----------------+
0096 |             null|
0097 +-----------------+
0098 
0099 -- Null-safe equal operator return `False` when one of the operand is `NULL`
0100 SELECT 5 <=> null AS expression_output;
0101 +-----------------+
0102 |expression_output|
0103 +-----------------+
0104 |            false|
0105 +-----------------+
0106 
0107 -- Null-safe equal operator return `True` when one of the operand is `NULL`
0108 SELECT NULL <=> NULL;
0109 +-----------------+
0110 |expression_output|
0111 +-----------------+
0112 |             true|
0113 +-----------------+
0114 ```
0115 
0116 ### Logical Operators <a name="logical-operators"></a>
0117 
0118 Spark supports standard logical operators such as `AND`, `OR` and `NOT`. These operators take `Boolean` expressions
0119 as the arguments and return a `Boolean` value.  
0120 
0121 The following tables illustrate the behavior of logical operators when one or both operands are `NULL`.
0122 
0123 |Left Operand|Right Operand|OR |AND|
0124 |------------|-------------|---|---|
0125 |True|NULL|True|NULL|
0126 |False|NULL|NULL|False|
0127 |NULL|True|True|NULL|
0128 |NULL|False|NULL|NULL|
0129 |NULL|NULL|NULL|NULL|
0130 
0131 |operand|NOT|
0132 |-------|---|
0133 |NULL|NULL|
0134 
0135 ### Examples
0136 
0137 ```sql
0138 -- Normal comparison operators return `NULL` when one of the operands is `NULL`.
0139 SELECT (true OR null) AS expression_output;
0140 +-----------------+
0141 |expression_output|
0142 +-----------------+
0143 |             true|
0144 +-----------------+
0145 
0146 -- Normal comparison operators return `NULL` when both the operands are `NULL`.
0147 SELECT (null OR false) AS expression_output
0148 +-----------------+
0149 |expression_output|
0150 +-----------------+
0151 |             null|
0152 +-----------------+
0153 
0154 -- Null-safe equal operator returns `False` when one of the operands is `NULL`
0155 SELECT NOT(null) AS expression_output;
0156 +-----------------+
0157 |expression_output|
0158 +-----------------+
0159 |             null|
0160 +-----------------+
0161 ```
0162 
0163 ### Expressions <a name="expressions"></a>
0164 
0165 The comparison operators and logical operators are treated as expressions in
0166 Spark. Other than these two kinds of expressions, Spark supports other form of
0167 expressions such as function expressions, cast expressions, etc. The expressions
0168 in Spark can be broadly classified as :
0169 - Null intolerant expressions
0170 - Expressions that can process `NULL` value operands
0171   - The result of these expressions depends on the expression itself.
0172 
0173 #### Null Intolerant Expressions <a name="null-intolerant"></a>
0174 
0175 Null intolerant expressions return `NULL` when one or more arguments of 
0176 expression are `NULL` and most of the expressions fall in this category.
0177 
0178 ##### Examples
0179 
0180 ```sql
0181 SELECT concat('John', null) AS expression_output;
0182 +-----------------+
0183 |expression_output|
0184 +-----------------+
0185 |             null|
0186 +-----------------+
0187 
0188 SELECT positive(null) AS expression_output;
0189 +-----------------+
0190 |expression_output|
0191 +-----------------+
0192 |             null|
0193 +-----------------+
0194 
0195 SELECT to_date(null) AS expression_output;
0196 +-----------------+
0197 |expression_output|
0198 +-----------------+
0199 |             null|
0200 +-----------------+
0201 ```
0202 
0203 #### Expressions That Can Process Null Value Operands <a name="can-process-null"></a>
0204 
0205 This class of expressions are designed to handle `NULL` values. The result of the 
0206 expressions depends on the expression itself. As an example, function expression `isnull`
0207 returns a `true` on null input and `false` on non null input where as function `coalesce`
0208 returns the first non `NULL` value in its list of operands. However, `coalesce` returns
0209 `NULL` when all its operands are `NULL`. Below is an incomplete list of expressions of this category.
0210   - COALESCE
0211   - NULLIF
0212   - IFNULL
0213   - NVL
0214   - NVL2
0215   - ISNAN
0216   - NANVL
0217   - ISNULL
0218   - ISNOTNULL
0219   - ATLEASTNNONNULLS 
0220   - IN
0221 
0222 ##### Examples
0223 
0224 ```sql
0225 SELECT isnull(null) AS expression_output;
0226 +-----------------+
0227 |expression_output|
0228 +-----------------+
0229 |             true|
0230 +-----------------+
0231 
0232 -- Returns the first occurrence of non `NULL` value.
0233 SELECT coalesce(null, null, 3, null) AS expression_output;
0234 +-----------------+
0235 |expression_output|
0236 +-----------------+
0237 |                3|
0238 +-----------------+
0239 
0240 -- Returns `NULL` as all its operands are `NULL`. 
0241 SELECT coalesce(null, null, null, null) AS expression_output;
0242 +-----------------+
0243 |expression_output|
0244 +-----------------+
0245 |             null|
0246 +-----------------+
0247 
0248 SELECT isnan(null) AS expression_output;
0249 +-----------------+
0250 |expression_output|
0251 +-----------------+
0252 |            false|
0253 +-----------------+
0254 ```
0255 
0256 #### Builtin Aggregate Expressions <a name="built-in-aggregate"></a>
0257 
0258 Aggregate functions compute a single result by processing a set of input rows. Below are
0259 the rules of how `NULL` values are handled by aggregate functions.
0260 - `NULL` values are ignored from processing by all the aggregate functions.
0261   - Only exception to this rule is COUNT(*) function.
0262 - Some aggregate functions return `NULL` when all input values are `NULL` or the input data set
0263   is empty.<br/> The list of these functions is:
0264   - MAX
0265   - MIN
0266   - SUM
0267   - AVG
0268   - EVERY
0269   - ANY
0270   - SOME
0271    
0272 #### Examples
0273 
0274 ```sql
0275 -- `count(*)` does not skip `NULL` values.
0276 SELECT count(*) FROM person;
0277 +--------+
0278 |count(1)|
0279 +--------+
0280 |       7|
0281 +--------+
0282 
0283 -- `NULL` values in column `age` are skipped from processing.
0284 SELECT count(age) FROM person;
0285 +----------+
0286 |count(age)|
0287 +----------+
0288 |         5|
0289 +----------+
0290 
0291 -- `count(*)` on an empty input set returns 0. This is unlike the other
0292 -- aggregate functions, such as `max`, which return `NULL`.
0293 SELECT count(*) FROM person where 1 = 0;
0294 +--------+
0295 |count(1)|
0296 +--------+
0297 |       0|
0298 +--------+
0299 
0300 -- `NULL` values are excluded from computation of maximum value.
0301 SELECT max(age) FROM person;
0302 +--------+
0303 |max(age)|
0304 +--------+
0305 |      50|
0306 +--------+
0307 
0308 -- `max` returns `NULL` on an empty input set.
0309 SELECT max(age) FROM person where 1 = 0;
0310 +--------+
0311 |max(age)|
0312 +--------+
0313 |    null|
0314 +--------+
0315 ```
0316 
0317 ### Condition Expressions in WHERE, HAVING and JOIN Clauses <a name="condition-expressions"></a>
0318 
0319 `WHERE`, `HAVING` operators filter rows based on the user specified condition.
0320 A `JOIN` operator is used to combine rows from two tables based on a join condition.
0321 For all the three operators, a condition expression is a boolean expression and can return
0322  <code>True, False or Unknown (NULL)</code>. They are "satisfied" if the result of the condition is `True`.
0323 
0324 #### Examples
0325 
0326 ```sql
0327 -- Persons whose age is unknown (`NULL`) are filtered out from the result set.
0328 SELECT * FROM person WHERE age > 0;
0329 +--------+---+
0330 |    name|age|
0331 +--------+---+
0332 |Michelle| 30|
0333 |    Fred| 50|
0334 |    Mike| 18|
0335 |     Dan| 50|
0336 |     Joe| 30|
0337 +--------+---+
0338 
0339 -- `IS NULL` expression is used in disjunction to select the persons
0340 -- with unknown (`NULL`) records.
0341 SELECT * FROM person WHERE age > 0 OR age IS NULL;
0342 +--------+----+
0343 |    name| age|
0344 +--------+----+
0345 |  Albert|null|
0346 |Michelle|  30|
0347 |    Fred|  50|
0348 |    Mike|  18|
0349 |     Dan|  50|
0350 |   Marry|null|
0351 |     Joe|  30|
0352 +--------+----+
0353 
0354 -- Person with unknown(`NULL`) ages are skipped from processing.
0355 SELECT * FROM person GROUP BY age HAVING max(age) > 18;
0356 +---+--------+
0357 |age|count(1)|
0358 +---+--------+
0359 | 50|       2|
0360 | 30|       2|
0361 +---+--------+
0362 
0363 -- A self join case with a join condition `p1.age = p2.age AND p1.name = p2.name`.
0364 -- The persons with unknown age (`NULL`) are filtered out by the join operator.
0365 SELECT * FROM person p1, person p2
0366     WHERE p1.age = p2.age
0367     AND p1.name = p2.name;
0368 +--------+---+--------+---+
0369 |    name|age|    name|age|
0370 +--------+---+--------+---+
0371 |Michelle| 30|Michelle| 30|
0372 |    Fred| 50|    Fred| 50|
0373 |    Mike| 18|    Mike| 18|
0374 |     Dan| 50|     Dan| 50|
0375 |     Joe| 30|     Joe| 30|
0376 +--------+---+--------+---+
0377 
0378 -- The age column from both legs of join are compared using null-safe equal which
0379 -- is why the persons with unknown age (`NULL`) are qualified by the join.
0380 SELECT * FROM person p1, person p2
0381     WHERE p1.age <=> p2.age
0382     AND p1.name = p2.name;
0383 +--------+----+--------+----+
0384 |    name| age|    name| age|
0385 +--------+----+--------+----+
0386 |  Albert|null|  Albert|null|
0387 |Michelle|  30|Michelle|  30|
0388 |    Fred|  50|    Fred|  50|
0389 |    Mike|  18|    Mike|  18|
0390 |     Dan|  50|     Dan|  50|
0391 |   Marry|null|   Marry|null|
0392 |     Joe|  30|     Joe|  30|
0393 +--------+----+--------+----+
0394 ```
0395 
0396 ### Aggregate Operator (GROUP BY, DISTINCT) <a name="aggregate-operator"></a>
0397 
0398 As discussed in the previous section [comparison operator](sql-ref-null-semantics.html#comparison-operators),
0399 two `NULL` values are not equal. However, for the purpose of grouping and distinct processing, the two or more
0400 values with `NULL data`are grouped together into the same bucket. This behaviour is conformant with SQL
0401 standard and with other enterprise database management systems.
0402 
0403 #### Examples
0404 
0405 ```sql
0406 -- `NULL` values are put in one bucket in `GROUP BY` processing.
0407 SELECT age, count(*) FROM person GROUP BY age;
0408 +----+--------+
0409 | age|count(1)|
0410 +----+--------+
0411 |null|       2|
0412 |  50|       2|
0413 |  30|       2|
0414 |  18|       1|
0415 +----+--------+
0416 
0417 -- All `NULL` ages are considered one distinct value in `DISTINCT` processing.
0418 SELECT DISTINCT age FROM person;
0419 +----+
0420 | age|
0421 +----+
0422 |null|
0423 |  50|
0424 |  30|
0425 |  18|
0426 +----+
0427 ```
0428 
0429 ### Sort Operator (ORDER BY Clause) <a name="order-by"></a>
0430 
0431 Spark SQL supports null ordering specification in `ORDER BY` clause. Spark processes the `ORDER BY` clause by
0432 placing all the `NULL` values at first or at last depending on the null ordering specification. By default, all
0433 the `NULL` values are placed at first.
0434 
0435 #### Examples
0436 
0437 ```sql
0438 -- `NULL` values are shown at first and other values
0439 -- are sorted in ascending way.
0440 SELECT age, name FROM person ORDER BY age;
0441 +----+--------+
0442 | age|    name|
0443 +----+--------+
0444 |null|   Marry|
0445 |null|  Albert|
0446 |  18|    Mike|
0447 |  30|Michelle|
0448 |  30|     Joe|
0449 |  50|    Fred|
0450 |  50|     Dan|
0451 +----+--------+
0452 
0453 -- Column values other than `NULL` are sorted in ascending
0454 -- way and `NULL` values are shown at the last.
0455 SELECT age, name FROM person ORDER BY age NULLS LAST;
0456 +----+--------+
0457 | age|    name|
0458 +----+--------+
0459 |  18|    Mike|
0460 |  30|Michelle|
0461 |  30|     Joe|
0462 |  50|     Dan|
0463 |  50|    Fred|
0464 |null|   Marry|
0465 |null|  Albert|
0466 +----+--------+
0467 
0468 -- Columns other than `NULL` values are sorted in descending
0469 -- and `NULL` values are shown at the last.
0470 SELECT age, name FROM person ORDER BY age DESC NULLS LAST;
0471 +----+--------+
0472 | age|    name|
0473 +----+--------+
0474 |  50|    Fred|
0475 |  50|     Dan|
0476 |  30|Michelle|
0477 |  30|     Joe|
0478 |  18|    Mike|
0479 |null|   Marry|
0480 |null|  Albert|
0481 +----+--------+
0482 ```
0483 
0484 ### Set Operators (UNION, INTERSECT, EXCEPT) <a name="set-operators"></a>
0485 
0486 `NULL` values are compared in a null-safe manner for equality in the context of
0487 set operations. That means when comparing rows, two `NULL` values are considered 
0488 equal unlike the regular `EqualTo`(`=`) operator.
0489 
0490 #### Examples
0491 
0492 ```sql
0493 CREATE VIEW unknown_age SELECT * FROM person WHERE age IS NULL;
0494 
0495 -- Only common rows between two legs of `INTERSECT` are in the 
0496 -- result set. The comparison between columns of the row are done
0497 -- in a null-safe manner.
0498 SELECT name, age FROM person
0499     INTERSECT
0500     SELECT name, age from unknown_age;
0501 +------+----+
0502 |  name| age|
0503 +------+----+
0504 |Albert|null|
0505 | Marry|null|
0506 +------+----+
0507 
0508 -- `NULL` values from two legs of the `EXCEPT` are not in output. 
0509 -- This basically shows that the comparison happens in a null-safe manner.
0510 SELECT age, name FROM person
0511     EXCEPT
0512     SELECT age FROM unknown_age;
0513 +---+--------+
0514 |age|    name|
0515 +---+--------+
0516 | 30|     Joe|
0517 | 50|    Fred|
0518 | 30|Michelle|
0519 | 18|    Mike|
0520 | 50|     Dan|
0521 +---+--------+
0522 
0523 -- Performs `UNION` operation between two sets of data. 
0524 -- The comparison between columns of the row ae done in
0525 -- null-safe manner.
0526 SELECT name, age FROM person
0527     UNION 
0528     SELECT name, age FROM unknown_age;
0529 +--------+----+
0530 |    name| age|
0531 +--------+----+
0532 |  Albert|null|
0533 |     Joe|  30|
0534 |Michelle|  30|
0535 |   Marry|null|
0536 |    Fred|  50|
0537 |    Mike|  18|
0538 |     Dan|  50|
0539 +--------+----+
0540 ```
0541 
0542 ### EXISTS/NOT EXISTS Subquery <a name="exists-not-exists"></a>
0543 
0544 In Spark, EXISTS and NOT EXISTS expressions are allowed inside a WHERE clause. 
0545 These are boolean expressions which return either `TRUE` or
0546 `FALSE`. In other words, EXISTS is a membership condition and returns `TRUE`
0547 when the subquery it refers to returns one or more rows. Similarly, NOT EXISTS
0548 is a non-membership condition and returns TRUE when no rows or zero rows are
0549 returned from the subquery.
0550 
0551 These two expressions are not affected by presence of NULL in the result of
0552 the subquery. They are normally faster because they can be converted to
0553 semijoins / anti-semijoins without special provisions for null awareness.
0554 
0555 #### Examples
0556 
0557 ```sql
0558 -- Even if subquery produces rows with `NULL` values, the `EXISTS` expression
0559 -- evaluates to `TRUE` as the subquery produces 1 row.
0560 SELECT * FROM person WHERE EXISTS (SELECT null);
0561 +--------+----+
0562 |    name| age|
0563 +--------+----+
0564 |  Albert|null|
0565 |Michelle|  30|
0566 |    Fred|  50|
0567 |    Mike|  18|
0568 |     Dan|  50|
0569 |   Marry|null|
0570 |     Joe|  30|
0571 +--------+----+
0572 
0573 -- `NOT EXISTS` expression returns `FALSE`. It returns `TRUE` only when
0574 -- subquery produces no rows. In this case, it returns 1 row.
0575 SELECT * FROM person WHERE NOT EXISTS (SELECT null);
0576 +----+---+
0577 |name|age|
0578 +----+---+
0579 +----+---+
0580 
0581 -- `NOT EXISTS` expression returns `TRUE`.
0582 SELECT * FROM person WHERE NOT EXISTS (SELECT 1 WHERE 1 = 0);
0583 +--------+----+
0584 |    name| age|
0585 +--------+----+
0586 |  Albert|null|
0587 |Michelle|  30|
0588 |    Fred|  50|
0589 |    Mike|  18|
0590 |     Dan|  50|
0591 |   Marry|null|
0592 |     Joe|  30|
0593 +--------+----+
0594 ```
0595 
0596 ### IN/NOT IN Subquery <a name="in-not-in"></a>
0597 
0598 In Spark, `IN` and `NOT IN` expressions are allowed inside a WHERE clause of
0599 a query. Unlike the `EXISTS` expression, `IN` expression can return a `TRUE`,
0600 `FALSE` or `UNKNOWN (NULL)` value. Conceptually a `IN` expression is semantically
0601 equivalent to a set of equality condition separated by a disjunctive operator (`OR`).
0602 For example, c1 IN (1, 2, 3) is semantically equivalent to `(C1 = 1 OR c1 = 2 OR c1 = 3)`.
0603 
0604 As far as handling `NULL` values are concerned, the semantics can be deduced from
0605 the `NULL` value handling in comparison operators(`=`) and logical operators(`OR`).
0606 To summarize, below are the rules for computing the result of an `IN` expression.
0607 
0608 - TRUE is returned when the non-NULL value in question is found in the list
0609 - FALSE is returned when the non-NULL value is not found in the list and the
0610   list does not contain NULL values
0611 - UNKNOWN is returned when the value is `NULL`, or the non-NULL value is not found in the list
0612   and the list contains at least one `NULL` value
0613 
0614 NOT IN always returns UNKNOWN when the list contains `NULL`, regardless of the input value.
0615 This is because IN returns UNKNOWN if the value is not in the list containing `NULL`,
0616 and because NOT UNKNOWN is again UNKNOWN.
0617  
0618 #### Examples
0619 
0620 ```sql
0621 -- The subquery has only `NULL` value in its result set. Therefore,
0622 -- the result of `IN` predicate is UNKNOWN.
0623 SELECT * FROM person WHERE age IN (SELECT null);
0624 +----+---+
0625 |name|age|
0626 +----+---+
0627 +----+---+
0628 
0629 -- The subquery has `NULL` value in the result set as well as a valid 
0630 -- value `50`. Rows with age = 50 are returned. 
0631 SELECT * FROM person
0632     WHERE age IN (SELECT age FROM VALUES (50), (null) sub(age));
0633 +----+---+
0634 |name|age|
0635 +----+---+
0636 |Fred| 50|
0637 | Dan| 50|
0638 +----+---+
0639 
0640 -- Since subquery has `NULL` value in the result set, the `NOT IN`
0641 -- predicate would return UNKNOWN. Hence, no rows are
0642 -- qualified for this query.
0643 SELECT * FROM person
0644     WHERE age NOT IN (SELECT age FROM VALUES (50), (null) sub(age));
0645 +----+---+
0646 |name|age|
0647 +----+---+
0648 +----+---+
0649 ```