Back to home page

OSCL-LXR

 
 

    


0001 ---
0002 layout: global
0003 title: Frequent Pattern Mining - RDD-based API
0004 displayTitle: Frequent Pattern Mining - RDD-based API
0005 license: |
0006   Licensed to the Apache Software Foundation (ASF) under one or more
0007   contributor license agreements.  See the NOTICE file distributed with
0008   this work for additional information regarding copyright ownership.
0009   The ASF licenses this file to You under the Apache License, Version 2.0
0010   (the "License"); you may not use this file except in compliance with
0011   the License.  You may obtain a copy of the License at
0012  
0013      http://www.apache.org/licenses/LICENSE-2.0
0014  
0015   Unless required by applicable law or agreed to in writing, software
0016   distributed under the License is distributed on an "AS IS" BASIS,
0017   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
0018   See the License for the specific language governing permissions and
0019   limitations under the License.
0020 ---
0021 
0022 Mining frequent items, itemsets, subsequences, or other substructures is usually among the
0023 first steps to analyze a large-scale dataset, which has been an active research topic in
0024 data mining for years.
0025 We refer users to Wikipedia's [association rule learning](http://en.wikipedia.org/wiki/Association_rule_learning)
0026 for more information.
0027 `spark.mllib` provides a parallel implementation of FP-growth,
0028 a popular algorithm to mining frequent itemsets.
0029 
0030 ## FP-growth
0031 
0032 The FP-growth algorithm is described in the paper
0033 [Han et al., Mining frequent patterns without candidate generation](https://doi.org/10.1145/335191.335372),
0034 where "FP" stands for frequent pattern.
0035 Given a dataset of transactions, the first step of FP-growth is to calculate item frequencies and identify frequent items.
0036 Different from [Apriori-like](http://en.wikipedia.org/wiki/Apriori_algorithm) algorithms designed for the same purpose,
0037 the second step of FP-growth uses a suffix tree (FP-tree) structure to encode transactions without generating candidate sets
0038 explicitly, which are usually expensive to generate.
0039 After the second step, the frequent itemsets can be extracted from the FP-tree.
0040 In `spark.mllib`, we implemented a parallel version of FP-growth called PFP,
0041 as described in [Li et al., PFP: Parallel FP-growth for query recommendation](https://doi.org/10.1145/1454008.1454027).
0042 PFP distributes the work of growing FP-trees based on the suffixes of transactions,
0043 and hence more scalable than a single-machine implementation.
0044 We refer users to the papers for more details.
0045 
0046 `spark.mllib`'s FP-growth implementation takes the following (hyper-)parameters:
0047 
0048 * `minSupport`: the minimum support for an itemset to be identified as frequent.
0049   For example, if an item appears 3 out of 5 transactions, it has a support of 3/5=0.6.
0050 * `numPartitions`: the number of partitions used to distribute the work.
0051 
0052 **Examples**
0053 
0054 <div class="codetabs">
0055 <div data-lang="scala" markdown="1">
0056 
0057 [`FPGrowth`](api/scala/org/apache/spark/mllib/fpm/FPGrowth.html) implements the
0058 FP-growth algorithm.
0059 It takes an `RDD` of transactions, where each transaction is an `Array` of items of a generic type.
0060 Calling `FPGrowth.run` with transactions returns an
0061 [`FPGrowthModel`](api/scala/org/apache/spark/mllib/fpm/FPGrowthModel.html)
0062 that stores the frequent itemsets with their frequencies.  The following
0063 example illustrates how to mine frequent itemsets and association rules
0064 (see [Association
0065 Rules](mllib-frequent-pattern-mining.html#association-rules) for
0066 details) from `transactions`.
0067 
0068 Refer to the [`FPGrowth` Scala docs](api/scala/org/apache/spark/mllib/fpm/FPGrowth.html) for details on the API.
0069 
0070 {% include_example scala/org/apache/spark/examples/mllib/SimpleFPGrowth.scala %}
0071 
0072 </div>
0073 
0074 <div data-lang="java" markdown="1">
0075 
0076 [`FPGrowth`](api/java/org/apache/spark/mllib/fpm/FPGrowth.html) implements the
0077 FP-growth algorithm.
0078 It takes a `JavaRDD` of transactions, where each transaction is an `Iterable` of items of a generic type.
0079 Calling `FPGrowth.run` with transactions returns an
0080 [`FPGrowthModel`](api/java/org/apache/spark/mllib/fpm/FPGrowthModel.html)
0081 that stores the frequent itemsets with their frequencies.  The following
0082 example illustrates how to mine frequent itemsets and association rules
0083 (see [Association
0084 Rules](mllib-frequent-pattern-mining.html#association-rules) for
0085 details) from `transactions`.
0086 
0087 Refer to the [`FPGrowth` Java docs](api/java/org/apache/spark/mllib/fpm/FPGrowth.html) for details on the API.
0088 
0089 {% include_example java/org/apache/spark/examples/mllib/JavaSimpleFPGrowth.java %}
0090 
0091 </div>
0092 
0093 <div data-lang="python" markdown="1">
0094 
0095 [`FPGrowth`](api/python/pyspark.mllib.html#pyspark.mllib.fpm.FPGrowth) implements the
0096 FP-growth algorithm.
0097 It takes an `RDD` of transactions, where each transaction is an `List` of items of a generic type.
0098 Calling `FPGrowth.train` with transactions returns an
0099 [`FPGrowthModel`](api/python/pyspark.mllib.html#pyspark.mllib.fpm.FPGrowthModel)
0100 that stores the frequent itemsets with their frequencies.
0101 
0102 Refer to the [`FPGrowth` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.fpm.FPGrowth) for more details on the API.
0103 
0104 {% include_example python/mllib/fpgrowth_example.py %}
0105 
0106 </div>
0107 
0108 </div>
0109 
0110 ## Association Rules
0111 
0112 <div class="codetabs">
0113 <div data-lang="scala" markdown="1">
0114 [AssociationRules](api/scala/org/apache/spark/mllib/fpm/AssociationRules.html)
0115 implements a parallel rule generation algorithm for constructing rules
0116 that have a single item as the consequent.
0117 
0118 Refer to the [`AssociationRules` Scala docs](api/java/org/apache/spark/mllib/fpm/AssociationRules.html) for details on the API.
0119 
0120 {% include_example scala/org/apache/spark/examples/mllib/AssociationRulesExample.scala %}
0121 
0122 </div>
0123 
0124 <div data-lang="java" markdown="1">
0125 [AssociationRules](api/java/org/apache/spark/mllib/fpm/AssociationRules.html)
0126 implements a parallel rule generation algorithm for constructing rules
0127 that have a single item as the consequent.
0128 
0129 Refer to the [`AssociationRules` Java docs](api/java/org/apache/spark/mllib/fpm/AssociationRules.html) for details on the API.
0130 
0131 {% include_example java/org/apache/spark/examples/mllib/JavaAssociationRulesExample.java %}
0132 
0133 </div>
0134 </div>
0135 
0136 ## PrefixSpan
0137 
0138 PrefixSpan is a sequential pattern mining algorithm described in
0139 [Pei et al., Mining Sequential Patterns by Pattern-Growth: The
0140 PrefixSpan Approach](https://doi.org/10.1109%2FTKDE.2004.77). We refer
0141 the reader to the referenced paper for formalizing the sequential
0142 pattern mining problem.
0143 
0144 `spark.mllib`'s PrefixSpan implementation takes the following parameters:
0145 
0146 * `minSupport`: the minimum support required to be considered a frequent
0147   sequential pattern.
0148 * `maxPatternLength`: the maximum length of a frequent sequential
0149   pattern. Any frequent pattern exceeding this length will not be
0150   included in the results.
0151 * `maxLocalProjDBSize`: the maximum number of items allowed in a
0152   prefix-projected database before local iterative processing of the
0153   projected database begins. This parameter should be tuned with respect
0154   to the size of your executors.
0155 
0156 **Examples**
0157 
0158 The following example illustrates PrefixSpan running on the sequences
0159 (using same notation as Pei et al):
0160 
0161 ~~~
0162   <(12)3>
0163   <1(32)(12)>
0164   <(12)5>
0165   <6>
0166 ~~~
0167 
0168 <div class="codetabs">
0169 <div data-lang="scala" markdown="1">
0170 
0171 [`PrefixSpan`](api/scala/org/apache/spark/mllib/fpm/PrefixSpan.html) implements the
0172 PrefixSpan algorithm.
0173 Calling `PrefixSpan.run` returns a
0174 [`PrefixSpanModel`](api/scala/org/apache/spark/mllib/fpm/PrefixSpanModel.html)
0175 that stores the frequent sequences with their frequencies.
0176 
0177 Refer to the [`PrefixSpan` Scala docs](api/scala/org/apache/spark/mllib/fpm/PrefixSpan.html) and [`PrefixSpanModel` Scala docs](api/scala/org/apache/spark/mllib/fpm/PrefixSpanModel.html) for details on the API.
0178 
0179 {% include_example scala/org/apache/spark/examples/mllib/PrefixSpanExample.scala %}
0180 
0181 </div>
0182 
0183 <div data-lang="java" markdown="1">
0184 
0185 [`PrefixSpan`](api/java/org/apache/spark/mllib/fpm/PrefixSpan.html) implements the
0186 PrefixSpan algorithm.
0187 Calling `PrefixSpan.run` returns a
0188 [`PrefixSpanModel`](api/java/org/apache/spark/mllib/fpm/PrefixSpanModel.html)
0189 that stores the frequent sequences with their frequencies.
0190 
0191 Refer to the [`PrefixSpan` Java docs](api/java/org/apache/spark/mllib/fpm/PrefixSpan.html) and [`PrefixSpanModel` Java docs](api/java/org/apache/spark/mllib/fpm/PrefixSpanModel.html) for details on the API.
0192 
0193 {% include_example java/org/apache/spark/examples/mllib/JavaPrefixSpanExample.java %}
0194 
0195 </div>
0196 </div>
0197