the-tree/docs/sql-ref-syntax-qry-select-hints.md

0001 ---
0002 layout: global
0003 title: Hints
0004 displayTitle: Hints
0005 license: |
0006   Licensed to the Apache Software Foundation (ASF) under one or more
0007   contributor license agreements.  See the NOTICE file distributed with
0008   this work for additional information regarding copyright ownership.
0009   The ASF licenses this file to You under the Apache License, Version 2.0
0010   (the "License"); you may not use this file except in compliance with
0011   the License.  You may obtain a copy of the License at
0012
0013      http://www.apache.org/licenses/LICENSE-2.0
0014
0015   Unless required by applicable law or agreed to in writing, software
0016   distributed under the License is distributed on an "AS IS" BASIS,
0017   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
0018   See the License for the specific language governing permissions and
0019   limitations under the License.
0020 ---
0021
0022 ### Description
0023
0024 Hints give users a way to suggest how Spark SQL to use specific approaches to generate its execution plan.
0025
0026 ### Syntax
0027
0028 ```sql
0029 /*+ hint [ , ... ] */
0030 ```
0031
0032 ### Partitioning Hints
0033
0034 Partitioning hints allow users to suggest a partitioning stragety that Spark should follow. `COALESCE`, `REPARTITION`,
0035 and `REPARTITION_BY_RANGE` hints are supported and are equivalent to `coalesce`, `repartition`, and
0036 `repartitionByRange` [Dataset APIs](api/scala/org/apache/spark/sql/Dataset.html), respectively. These hints give users
0037 a way to tune performance and control the number of output files in Spark SQL. When multiple partitioning hints are
0038 specified, multiple nodes are inserted into the logical plan, but the leftmost hint is picked by the optimizer.
0039
0040 #### Partitioning Hints Types
0041
0042 * **COALESCE**
0043
0044   The `COALESCE` hint can be used to reduce the number of partitions to the specified number of partitions. It takes a partition number as a parameter.
0045
0046 * **REPARTITION**
0047
0048   The `REPARTITION` hint can be used to repartition to the specified number of partitions using the specified partitioning expressions. It takes a partition number, column names, or both as parameters.
0049
0050 * **REPARTITION_BY_RANGE**
0051
0052   The `REPARTITION_BY_RANGE` hint can be used to repartition to the specified number of partitions using the specified partitioning expressions. It takes column names and an optional partition number as parameters.
0053
0054 #### Examples
0055
0056 ```sql
0057 SELECT /*+ COALESCE(3) */ * FROM t;
0058
0059 SELECT /*+ REPARTITION(3) */ * FROM t;
0060
0061 SELECT /*+ REPARTITION(c) */ * FROM t;
0062
0063 SELECT /*+ REPARTITION(3, c) */ * FROM t;
0064
0065 SELECT /*+ REPARTITION_BY_RANGE(c) */ * FROM t;
0066
0067 SELECT /*+ REPARTITION_BY_RANGE(3, c) */ * FROM t;
0068
0069 -- multiple partitioning hints
0070 EXPLAIN EXTENDED SELECT /*+ REPARTITION(100), COALESCE(500), REPARTITION_BY_RANGE(3, c) */ * FROM t;
0071 == Parsed Logical Plan ==
0072 'UnresolvedHint REPARTITION, [100]
0073 +- 'UnresolvedHint COALESCE, [500]
0074    +- 'UnresolvedHint REPARTITION_BY_RANGE, [3, 'c]
0075       +- 'Project [*]
0076          +- 'UnresolvedRelation [t]
0077
0078 == Analyzed Logical Plan ==
0079 name: string, c: int
0080 Repartition 100, true
0081 +- Repartition 500, false
0082    +- RepartitionByExpression [c#30 ASC NULLS FIRST], 3
0083       +- Project [name#29, c#30]
0084          +- SubqueryAlias spark_catalog.default.t
0085             +- Relation[name#29,c#30] parquet
0086
0087 == Optimized Logical Plan ==
0088 Repartition 100, true
0089 +- Relation[name#29,c#30] parquet
0090
0091 == Physical Plan ==
0092 Exchange RoundRobinPartitioning(100), false, [id=#121]
0093 +- *(1) ColumnarToRow
0094    +- FileScan parquet default.t[name#29,c#30] Batched: true, DataFilters: [], Format: Parquet,
0095       Location: CatalogFileIndex[file:/spark/spark-warehouse/t], PartitionFilters: [],
0096       PushedFilters: [], ReadSchema: struct<name:string>
0097 ```
0098
0099 ### Join Hints
0100
0101 Join hints allow users to suggest the join strategy that Spark should use. Prior to Spark 3.0, only the `BROADCAST` Join Hint was supported. `MERGE`, `SHUFFLE_HASH` and `SHUFFLE_REPLICATE_NL` Joint Hints support was added in 3.0. When different join strategy hints are specified on both sides of a join, Spark prioritizes hints in the following order: `BROADCAST` over `MERGE` over `SHUFFLE_HASH` over `SHUFFLE_REPLICATE_NL`. When both sides are specified with the `BROADCAST` hint or the `SHUFFLE_HASH` hint, Spark will pick the build side based on the join type and the sizes of the relations. Since a given strategy may not support all join types, Spark is not guaranteed to use the join strategy suggested by the hint.
0102
0103 #### Join Hints Types
0104
0105 * **BROADCAST**
0106
0107     Suggests that Spark use broadcast join. The join side with the hint will be broadcast regardless of `autoBroadcastJoinThreshold`. If both sides of the join have the broadcast hints, the one with the smaller size (based on stats) will be broadcast. The aliases for `BROADCAST` are `BROADCASTJOIN` and `MAPJOIN`.
0108
0109 * **MERGE**
0110
0111     Suggests that Spark use shuffle sort merge join. The aliases for `MERGE` are `SHUFFLE_MERGE` and `MERGEJOIN`.
0112
0113 * **SHUFFLE_HASH**
0114
0115     Suggests that Spark use shuffle hash join. If both sides have the shuffle hash hints, Spark chooses the smaller side (based on stats) as the build side.
0116
0117 * **SHUFFLE_REPLICATE_NL**
0118
0119     Suggests that Spark use shuffle-and-replicate nested loop join.
0120
0121 #### Examples
0122
0123 ```sql
0124 -- Join Hints for broadcast join
0125 SELECT /*+ BROADCAST(t1) */ * FROM t1 INNER JOIN t2 ON t1.key = t2.key;
0126 SELECT /*+ BROADCASTJOIN (t1) */ * FROM t1 left JOIN t2 ON t1.key = t2.key;
0127 SELECT /*+ MAPJOIN(t2) */ * FROM t1 right JOIN t2 ON t1.key = t2.key;
0128
0129 -- Join Hints for shuffle sort merge join
0130 SELECT /*+ SHUFFLE_MERGE(t1) */ * FROM t1 INNER JOIN t2 ON t1.key = t2.key;
0131 SELECT /*+ MERGEJOIN(t2) */ * FROM t1 INNER JOIN t2 ON t1.key = t2.key;
0132 SELECT /*+ MERGE(t1) */ * FROM t1 INNER JOIN t2 ON t1.key = t2.key;
0133
0134 -- Join Hints for shuffle hash join
0135 SELECT /*+ SHUFFLE_HASH(t1) */ * FROM t1 INNER JOIN t2 ON t1.key = t2.key;
0136
0137 -- Join Hints for shuffle-and-replicate nested loop join
0138 SELECT /*+ SHUFFLE_REPLICATE_NL(t1) */ * FROM t1 INNER JOIN t2 ON t1.key = t2.key;
0139
0140 -- When different join strategy hints are specified on both sides of a join, Spark
0141 -- prioritizes the BROADCAST hint over the MERGE hint over the SHUFFLE_HASH hint
0142 -- over the SHUFFLE_REPLICATE_NL hint.
0143 -- Spark will issue Warning in the following example
0144 -- org.apache.spark.sql.catalyst.analysis.HintErrorLogger: Hint (strategy=merge)
0145 -- is overridden by another hint and will not take effect.
0146 SELECT /*+ BROADCAST(t1), MERGE(t1, t2) */ * FROM t1 INNER JOIN t2 ON t1.key = t2.key;
0147 ```
0148
0149 ### Related Statements
0150
0151 * [JOIN](sql-ref-syntax-qry-select-join.html)
0152 * [SELECT](sql-ref-syntax-qry-select.html)