Back to home page

OSCL-LXR

 
 

    


0001 ---
0002 layout: global
0003 title: Hints
0004 displayTitle: Hints
0005 license: |
0006   Licensed to the Apache Software Foundation (ASF) under one or more
0007   contributor license agreements.  See the NOTICE file distributed with
0008   this work for additional information regarding copyright ownership.
0009   The ASF licenses this file to You under the Apache License, Version 2.0
0010   (the "License"); you may not use this file except in compliance with
0011   the License.  You may obtain a copy of the License at
0012 
0013      http://www.apache.org/licenses/LICENSE-2.0
0014 
0015   Unless required by applicable law or agreed to in writing, software
0016   distributed under the License is distributed on an "AS IS" BASIS,
0017   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
0018   See the License for the specific language governing permissions and
0019   limitations under the License.
0020 ---
0021 
0022 ### Description
0023 
0024 Hints give users a way to suggest how Spark SQL to use specific approaches to generate its execution plan.
0025 
0026 ### Syntax
0027 
0028 ```sql
0029 /*+ hint [ , ... ] */
0030 ```
0031 
0032 ### Partitioning Hints
0033 
0034 Partitioning hints allow users to suggest a partitioning stragety that Spark should follow. `COALESCE`, `REPARTITION`,
0035 and `REPARTITION_BY_RANGE` hints are supported and are equivalent to `coalesce`, `repartition`, and
0036 `repartitionByRange` [Dataset APIs](api/scala/org/apache/spark/sql/Dataset.html), respectively. These hints give users
0037 a way to tune performance and control the number of output files in Spark SQL. When multiple partitioning hints are
0038 specified, multiple nodes are inserted into the logical plan, but the leftmost hint is picked by the optimizer.
0039 
0040 #### Partitioning Hints Types
0041 
0042 * **COALESCE**
0043 
0044   The `COALESCE` hint can be used to reduce the number of partitions to the specified number of partitions. It takes a partition number as a parameter.
0045 
0046 * **REPARTITION**
0047 
0048   The `REPARTITION` hint can be used to repartition to the specified number of partitions using the specified partitioning expressions. It takes a partition number, column names, or both as parameters.
0049 
0050 * **REPARTITION_BY_RANGE**
0051 
0052   The `REPARTITION_BY_RANGE` hint can be used to repartition to the specified number of partitions using the specified partitioning expressions. It takes column names and an optional partition number as parameters.
0053 
0054 #### Examples
0055 
0056 ```sql
0057 SELECT /*+ COALESCE(3) */ * FROM t;
0058 
0059 SELECT /*+ REPARTITION(3) */ * FROM t;
0060 
0061 SELECT /*+ REPARTITION(c) */ * FROM t;
0062 
0063 SELECT /*+ REPARTITION(3, c) */ * FROM t;
0064 
0065 SELECT /*+ REPARTITION_BY_RANGE(c) */ * FROM t;
0066 
0067 SELECT /*+ REPARTITION_BY_RANGE(3, c) */ * FROM t;
0068 
0069 -- multiple partitioning hints
0070 EXPLAIN EXTENDED SELECT /*+ REPARTITION(100), COALESCE(500), REPARTITION_BY_RANGE(3, c) */ * FROM t;
0071 == Parsed Logical Plan ==
0072 'UnresolvedHint REPARTITION, [100]
0073 +- 'UnresolvedHint COALESCE, [500]
0074    +- 'UnresolvedHint REPARTITION_BY_RANGE, [3, 'c]
0075       +- 'Project [*]
0076          +- 'UnresolvedRelation [t]
0077 
0078 == Analyzed Logical Plan ==
0079 name: string, c: int
0080 Repartition 100, true
0081 +- Repartition 500, false
0082    +- RepartitionByExpression [c#30 ASC NULLS FIRST], 3
0083       +- Project [name#29, c#30]
0084          +- SubqueryAlias spark_catalog.default.t
0085             +- Relation[name#29,c#30] parquet
0086 
0087 == Optimized Logical Plan ==
0088 Repartition 100, true
0089 +- Relation[name#29,c#30] parquet
0090 
0091 == Physical Plan ==
0092 Exchange RoundRobinPartitioning(100), false, [id=#121]
0093 +- *(1) ColumnarToRow
0094    +- FileScan parquet default.t[name#29,c#30] Batched: true, DataFilters: [], Format: Parquet,
0095       Location: CatalogFileIndex[file:/spark/spark-warehouse/t], PartitionFilters: [],
0096       PushedFilters: [], ReadSchema: struct<name:string>
0097 ```
0098 
0099 ### Join Hints
0100 
0101 Join hints allow users to suggest the join strategy that Spark should use. Prior to Spark 3.0, only the `BROADCAST` Join Hint was supported. `MERGE`, `SHUFFLE_HASH` and `SHUFFLE_REPLICATE_NL` Joint Hints support was added in 3.0. When different join strategy hints are specified on both sides of a join, Spark prioritizes hints in the following order: `BROADCAST` over `MERGE` over `SHUFFLE_HASH` over `SHUFFLE_REPLICATE_NL`. When both sides are specified with the `BROADCAST` hint or the `SHUFFLE_HASH` hint, Spark will pick the build side based on the join type and the sizes of the relations. Since a given strategy may not support all join types, Spark is not guaranteed to use the join strategy suggested by the hint.
0102 
0103 #### Join Hints Types
0104 
0105 * **BROADCAST**
0106 
0107     Suggests that Spark use broadcast join. The join side with the hint will be broadcast regardless of `autoBroadcastJoinThreshold`. If both sides of the join have the broadcast hints, the one with the smaller size (based on stats) will be broadcast. The aliases for `BROADCAST` are `BROADCASTJOIN` and `MAPJOIN`.
0108 
0109 * **MERGE**
0110 
0111     Suggests that Spark use shuffle sort merge join. The aliases for `MERGE` are `SHUFFLE_MERGE` and `MERGEJOIN`.
0112 
0113 * **SHUFFLE_HASH**
0114 
0115     Suggests that Spark use shuffle hash join. If both sides have the shuffle hash hints, Spark chooses the smaller side (based on stats) as the build side.
0116 
0117 * **SHUFFLE_REPLICATE_NL**
0118 
0119     Suggests that Spark use shuffle-and-replicate nested loop join.
0120 
0121 #### Examples
0122 
0123 ```sql
0124 -- Join Hints for broadcast join
0125 SELECT /*+ BROADCAST(t1) */ * FROM t1 INNER JOIN t2 ON t1.key = t2.key;
0126 SELECT /*+ BROADCASTJOIN (t1) */ * FROM t1 left JOIN t2 ON t1.key = t2.key;
0127 SELECT /*+ MAPJOIN(t2) */ * FROM t1 right JOIN t2 ON t1.key = t2.key;
0128 
0129 -- Join Hints for shuffle sort merge join
0130 SELECT /*+ SHUFFLE_MERGE(t1) */ * FROM t1 INNER JOIN t2 ON t1.key = t2.key;
0131 SELECT /*+ MERGEJOIN(t2) */ * FROM t1 INNER JOIN t2 ON t1.key = t2.key;
0132 SELECT /*+ MERGE(t1) */ * FROM t1 INNER JOIN t2 ON t1.key = t2.key;
0133 
0134 -- Join Hints for shuffle hash join
0135 SELECT /*+ SHUFFLE_HASH(t1) */ * FROM t1 INNER JOIN t2 ON t1.key = t2.key;
0136 
0137 -- Join Hints for shuffle-and-replicate nested loop join
0138 SELECT /*+ SHUFFLE_REPLICATE_NL(t1) */ * FROM t1 INNER JOIN t2 ON t1.key = t2.key;
0139 
0140 -- When different join strategy hints are specified on both sides of a join, Spark
0141 -- prioritizes the BROADCAST hint over the MERGE hint over the SHUFFLE_HASH hint
0142 -- over the SHUFFLE_REPLICATE_NL hint.
0143 -- Spark will issue Warning in the following example
0144 -- org.apache.spark.sql.catalyst.analysis.HintErrorLogger: Hint (strategy=merge)
0145 -- is overridden by another hint and will not take effect.
0146 SELECT /*+ BROADCAST(t1), MERGE(t1, t2) */ * FROM t1 INNER JOIN t2 ON t1.key = t2.key;
0147 ```
0148 
0149 ### Related Statements
0150 
0151 * [JOIN](sql-ref-syntax-qry-select-join.html)
0152 * [SELECT](sql-ref-syntax-qry-select.html)