the-tree/docs/mllib-frequent-pattern-mining.md

0001 ---
0002 layout: global
0003 title: Frequent Pattern Mining - RDD-based API
0004 displayTitle: Frequent Pattern Mining - RDD-based API
0005 license: |
0006   Licensed to the Apache Software Foundation (ASF) under one or more
0007   contributor license agreements.  See the NOTICE file distributed with
0008   this work for additional information regarding copyright ownership.
0009   The ASF licenses this file to You under the Apache License, Version 2.0
0010   (the "License"); you may not use this file except in compliance with
0011   the License.  You may obtain a copy of the License at
0012
0013      http://www.apache.org/licenses/LICENSE-2.0
0014
0015   Unless required by applicable law or agreed to in writing, software
0016   distributed under the License is distributed on an "AS IS" BASIS,
0017   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
0018   See the License for the specific language governing permissions and
0019   limitations under the License.
0020 ---
0021
0022 Mining frequent items, itemsets, subsequences, or other substructures is usually among the
0023 first steps to analyze a large-scale dataset, which has been an active research topic in
0024 data mining for years.
0025 We refer users to Wikipedia's [association rule learning](http://en.wikipedia.org/wiki/Association_rule_learning)
0026 for more information.
0027 `spark.mllib` provides a parallel implementation of FP-growth,
0028 a popular algorithm to mining frequent itemsets.
0029
0030 ## FP-growth
0031
0032 The FP-growth algorithm is described in the paper
0033 [Han et al., Mining frequent patterns without candidate generation](https://doi.org/10.1145/335191.335372),
0034 where "FP" stands for frequent pattern.
0035 Given a dataset of transactions, the first step of FP-growth is to calculate item frequencies and identify frequent items.
0036 Different from [Apriori-like](http://en.wikipedia.org/wiki/Apriori_algorithm) algorithms designed for the same purpose,
0037 the second step of FP-growth uses a suffix tree (FP-tree) structure to encode transactions without generating candidate sets
0038 explicitly, which are usually expensive to generate.
0039 After the second step, the frequent itemsets can be extracted from the FP-tree.
0040 In `spark.mllib`, we implemented a parallel version of FP-growth called PFP,
0041 as described in [Li et al., PFP: Parallel FP-growth for query recommendation](https://doi.org/10.1145/1454008.1454027).
0042 PFP distributes the work of growing FP-trees based on the suffixes of transactions,
0043 and hence more scalable than a single-machine implementation.
0044 We refer users to the papers for more details.
0045
0046 `spark.mllib`'s FP-growth implementation takes the following (hyper-)parameters:
0047
0048 * `minSupport`: the minimum support for an itemset to be identified as frequent.
0049   For example, if an item appears 3 out of 5 transactions, it has a support of 3/5=0.6.
0050 * `numPartitions`: the number of partitions used to distribute the work.
0051
0052 **Examples**
0053
0054 <div class="codetabs">
0055 <div data-lang="scala" markdown="1">
0056
0057 [`FPGrowth`](api/scala/org/apache/spark/mllib/fpm/FPGrowth.html) implements the
0058 FP-growth algorithm.
0059 It takes an `RDD` of transactions, where each transaction is an `Array` of items of a generic type.
0060 Calling `FPGrowth.run` with transactions returns an
0061 [`FPGrowthModel`](api/scala/org/apache/spark/mllib/fpm/FPGrowthModel.html)
0062 that stores the frequent itemsets with their frequencies.  The following
0063 example illustrates how to mine frequent itemsets and association rules
0064 (see [Association
0065 Rules](mllib-frequent-pattern-mining.html#association-rules) for
0066 details) from `transactions`.
0067
0068 Refer to the [`FPGrowth` Scala docs](api/scala/org/apache/spark/mllib/fpm/FPGrowth.html) for details on the API.
0069
0070 {% include_example scala/org/apache/spark/examples/mllib/SimpleFPGrowth.scala %}
0071
0072 </div>
0073
0074 <div data-lang="java" markdown="1">
0075
0076 [`FPGrowth`](api/java/org/apache/spark/mllib/fpm/FPGrowth.html) implements the
0077 FP-growth algorithm.
0078 It takes a `JavaRDD` of transactions, where each transaction is an `Iterable` of items of a generic type.
0079 Calling `FPGrowth.run` with transactions returns an
0080 [`FPGrowthModel`](api/java/org/apache/spark/mllib/fpm/FPGrowthModel.html)
0081 that stores the frequent itemsets with their frequencies.  The following
0082 example illustrates how to mine frequent itemsets and association rules
0083 (see [Association
0084 Rules](mllib-frequent-pattern-mining.html#association-rules) for
0085 details) from `transactions`.
0086
0087 Refer to the [`FPGrowth` Java docs](api/java/org/apache/spark/mllib/fpm/FPGrowth.html) for details on the API.
0088
0089 {% include_example java/org/apache/spark/examples/mllib/JavaSimpleFPGrowth.java %}
0090
0091 </div>
0092
0093 <div data-lang="python" markdown="1">
0094
0095 [`FPGrowth`](api/python/pyspark.mllib.html#pyspark.mllib.fpm.FPGrowth) implements the
0096 FP-growth algorithm.
0097 It takes an `RDD` of transactions, where each transaction is an `List` of items of a generic type.
0098 Calling `FPGrowth.train` with transactions returns an
0099 [`FPGrowthModel`](api/python/pyspark.mllib.html#pyspark.mllib.fpm.FPGrowthModel)
0100 that stores the frequent itemsets with their frequencies.
0101
0102 Refer to the [`FPGrowth` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.fpm.FPGrowth) for more details on the API.
0103
0104 {% include_example python/mllib/fpgrowth_example.py %}
0105
0106 </div>
0107
0108 </div>
0109
0110 ## Association Rules
0111
0112 <div class="codetabs">
0113 <div data-lang="scala" markdown="1">
0114 [AssociationRules](api/scala/org/apache/spark/mllib/fpm/AssociationRules.html)
0115 implements a parallel rule generation algorithm for constructing rules
0116 that have a single item as the consequent.
0117
0118 Refer to the [`AssociationRules` Scala docs](api/java/org/apache/spark/mllib/fpm/AssociationRules.html) for details on the API.
0119
0120 {% include_example scala/org/apache/spark/examples/mllib/AssociationRulesExample.scala %}
0121
0122 </div>
0123
0124 <div data-lang="java" markdown="1">
0125 [AssociationRules](api/java/org/apache/spark/mllib/fpm/AssociationRules.html)
0126 implements a parallel rule generation algorithm for constructing rules
0127 that have a single item as the consequent.
0128
0129 Refer to the [`AssociationRules` Java docs](api/java/org/apache/spark/mllib/fpm/AssociationRules.html) for details on the API.
0130
0131 {% include_example java/org/apache/spark/examples/mllib/JavaAssociationRulesExample.java %}
0132
0133 </div>
0134 </div>
0135
0136 ## PrefixSpan
0137
0138 PrefixSpan is a sequential pattern mining algorithm described in
0139 [Pei et al., Mining Sequential Patterns by Pattern-Growth: The
0140 PrefixSpan Approach](https://doi.org/10.1109%2FTKDE.2004.77). We refer
0141 the reader to the referenced paper for formalizing the sequential
0142 pattern mining problem.
0143
0144 `spark.mllib`'s PrefixSpan implementation takes the following parameters:
0145
0146 * `minSupport`: the minimum support required to be considered a frequent
0147   sequential pattern.
0148 * `maxPatternLength`: the maximum length of a frequent sequential
0149   pattern. Any frequent pattern exceeding this length will not be
0150   included in the results.
0151 * `maxLocalProjDBSize`: the maximum number of items allowed in a
0152   prefix-projected database before local iterative processing of the
0153   projected database begins. This parameter should be tuned with respect
0154   to the size of your executors.
0155
0156 **Examples**
0157
0158 The following example illustrates PrefixSpan running on the sequences
0159 (using same notation as Pei et al):
0160
0161 ~~~
0162   <(12)3>
0163   <1(32)(12)>
0164   <(12)5>
0165   <6>
0166 ~~~
0167
0168 <div class="codetabs">
0169 <div data-lang="scala" markdown="1">
0170
0171 [`PrefixSpan`](api/scala/org/apache/spark/mllib/fpm/PrefixSpan.html) implements the
0172 PrefixSpan algorithm.
0173 Calling `PrefixSpan.run` returns a
0174 [`PrefixSpanModel`](api/scala/org/apache/spark/mllib/fpm/PrefixSpanModel.html)
0175 that stores the frequent sequences with their frequencies.
0176
0177 Refer to the [`PrefixSpan` Scala docs](api/scala/org/apache/spark/mllib/fpm/PrefixSpan.html) and [`PrefixSpanModel` Scala docs](api/scala/org/apache/spark/mllib/fpm/PrefixSpanModel.html) for details on the API.
0178
0179 {% include_example scala/org/apache/spark/examples/mllib/PrefixSpanExample.scala %}
0180
0181 </div>
0182
0183 <div data-lang="java" markdown="1">
0184
0185 [`PrefixSpan`](api/java/org/apache/spark/mllib/fpm/PrefixSpan.html) implements the
0186 PrefixSpan algorithm.
0187 Calling `PrefixSpan.run` returns a
0188 [`PrefixSpanModel`](api/java/org/apache/spark/mllib/fpm/PrefixSpanModel.html)
0189 that stores the frequent sequences with their frequencies.
0190
0191 Refer to the [`PrefixSpan` Java docs](api/java/org/apache/spark/mllib/fpm/PrefixSpan.html) and [`PrefixSpanModel` Java docs](api/java/org/apache/spark/mllib/fpm/PrefixSpanModel.html) for details on the API.
0192
0193 {% include_example java/org/apache/spark/examples/mllib/JavaPrefixSpanExample.java %}
0194
0195 </div>
0196 </div>
0197