the-tree/docs/ml-clustering.md

0001 ---
0002 layout: global
0003 title: Clustering
0004 displayTitle: Clustering
0005 license: |
0006   Licensed to the Apache Software Foundation (ASF) under one or more
0007   contributor license agreements.  See the NOTICE file distributed with
0008   this work for additional information regarding copyright ownership.
0009   The ASF licenses this file to You under the Apache License, Version 2.0
0010   (the "License"); you may not use this file except in compliance with
0011   the License.  You may obtain a copy of the License at
0012
0013      http://www.apache.org/licenses/LICENSE-2.0
0014
0015   Unless required by applicable law or agreed to in writing, software
0016   distributed under the License is distributed on an "AS IS" BASIS,
0017   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
0018   See the License for the specific language governing permissions and
0019   limitations under the License.
0020 ---
0021
0022 This page describes clustering algorithms in MLlib.
0023 The [guide for clustering in the RDD-based API](mllib-clustering.html) also has relevant information
0024 about these algorithms.
0025
0026 **Table of Contents**
0027
0028 * This will become a table of contents (this text will be scraped).
0029 {:toc}
0030
0031 ## K-means
0032
0033 [k-means](http://en.wikipedia.org/wiki/K-means_clustering) is one of the
0034 most commonly used clustering algorithms. It partitions the data points into a
0035 predefined number of clusters. The MLlib implementation includes a parallelized
0036 variant of the [k-means++](http://en.wikipedia.org/wiki/K-means%2B%2B) method
0037 called [k-means||](http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf).
0038
0039 `KMeans` is implemented as an `Estimator` and generates a `KMeansModel` as the base model.
0040
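To make the idea of partitioning points into a predefined number of clusters concrete, here is a minimal pure-Python sketch of Lloyd's iterations, the basic k-means loop. It is an illustration only and not MLlib's implementation: it runs on a single machine, it takes hand-picked initial centers instead of running k-means||, and the function name `lloyd_kmeans` is hypothetical.

```python
import math


def lloyd_kmeans(points, centers, max_iter=20):
    """Run Lloyd's iterations from the given initial centers.

    Returns (centers, assignments), where assignments[i] is the index of
    the center closest to points[i].
    """
    centers = [tuple(c) for c in centers]
    assignments = []
    for _ in range(max_iter):
        # Assignment step: attach every point to its nearest center.
        assignments = [
            min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        # Update step: move each center to the mean of its points.
        new_centers = []
        for j, old in enumerate(centers):
            members = [p for p, a in zip(points, assignments) if a == j]
            if not members:
                new_centers.append(old)  # keep the center of an empty cluster
                continue
            new_centers.append(
                tuple(sum(m[d] for m in members) / len(members) for d in range(len(old)))
            )
        if new_centers == centers:
            break  # converged: no center moved
        centers = new_centers
    return centers, assignments


# Two well-separated groups of 2-D points (made-up data).
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
centers, assignments = lloyd_kmeans(points, centers=[(0.0, 0.0), (10.0, 10.0)])
```

In this sketch, `assignments` plays the role of the `prediction` output column: one integer cluster index per input point.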
0041 ### Input Columns
0042
0043 <table class="table">
0044   <thead>
0045     <tr>
0046       <th align="left">Param name</th>
0047       <th align="left">Type(s)</th>
0048       <th align="left">Default</th>
0049       <th align="left">Description</th>
0050     </tr>
0051   </thead>
0052   <tbody>
0053     <tr>
0054       <td>featuresCol</td>
0055       <td>Vector</td>
0056       <td>"features"</td>
0057       <td>Feature vector</td>
0058     </tr>
0059   </tbody>
0060 </table>
0061
0062 ### Output Columns
0063
0064 <table class="table">
0065   <thead>
0066     <tr>
0067       <th align="left">Param name</th>
0068       <th align="left">Type(s)</th>
0069       <th align="left">Default</th>
0070       <th align="left">Description</th>
0071     </tr>
0072   </thead>
0073   <tbody>
0074     <tr>
0075       <td>predictionCol</td>
0076       <td>Int</td>
0077       <td>"prediction"</td>
0078       <td>Index of the predicted cluster center</td>
0079     </tr>
0080   </tbody>
0081 </table>
0082
0083 **Examples**
0084
0085 <div class="codetabs">
0086
0087 <div data-lang="scala" markdown="1">
0088 Refer to the [Scala API docs](api/scala/org/apache/spark/ml/clustering/KMeans.html) for more details.
0089
0090 {% include_example scala/org/apache/spark/examples/ml/KMeansExample.scala %}
0091 </div>
0092
0093 <div data-lang="java" markdown="1">
0094 Refer to the [Java API docs](api/java/org/apache/spark/ml/clustering/KMeans.html) for more details.
0095
0096 {% include_example java/org/apache/spark/examples/ml/JavaKMeansExample.java %}
0097 </div>
0098
0099 <div data-lang="python" markdown="1">
0100 Refer to the [Python API docs](api/python/pyspark.ml.html#pyspark.ml.clustering.KMeans) for more details.
0101
0102 {% include_example python/ml/kmeans_example.py %}
0103 </div>
0104
0105 <div data-lang="r" markdown="1">
0106
0107 Refer to the [R API docs](api/R/spark.kmeans.html) for more details.
0108
0109 {% include_example r/ml/kmeans.R %}
0110 </div>
0111
0112 </div>
0113
0114 ## Latent Dirichlet allocation (LDA)
0115
0116 `LDA` is implemented as an `Estimator` that supports both `EMLDAOptimizer` and `OnlineLDAOptimizer`,
0117 and generates an `LDAModel` as the base model. Expert users may cast an `LDAModel` generated by
0118 `EMLDAOptimizer` to a `DistributedLDAModel` if needed.
0119
0120 **Examples**
0121
0122 <div class="codetabs">
0123
0124 <div data-lang="scala" markdown="1">
0125
0126 Refer to the [Scala API docs](api/scala/org/apache/spark/ml/clustering/LDA.html) for more details.
0127
0128 {% include_example scala/org/apache/spark/examples/ml/LDAExample.scala %}
0129 </div>
0130
0131 <div data-lang="java" markdown="1">
0132
0133 Refer to the [Java API docs](api/java/org/apache/spark/ml/clustering/LDA.html) for more details.
0134
0135 {% include_example java/org/apache/spark/examples/ml/JavaLDAExample.java %}
0136 </div>
0137
0138 <div data-lang="python" markdown="1">
0139
0140 Refer to the [Python API docs](api/python/pyspark.ml.html#pyspark.ml.clustering.LDA) for more details.
0141
0142 {% include_example python/ml/lda_example.py %}
0143 </div>
0144
0145 <div data-lang="r" markdown="1">
0146
0147 Refer to the [R API docs](api/R/spark.lda.html) for more details.
0148
0149 {% include_example r/ml/lda.R %}
0150 </div>
0151
0152 </div>
0153
0154 ## Bisecting k-means
0155
0156 Bisecting k-means is a kind of [hierarchical clustering](https://en.wikipedia.org/wiki/Hierarchical_clustering) using a
0157 divisive (or "top-down") approach: all observations start in one cluster, and splits are performed recursively as one
0158 moves down the hierarchy.
0159
0160 Bisecting k-means can often be much faster than regular k-means, but it will generally produce a different clustering.
0161
0162 `BisectingKMeans` is implemented as an `Estimator` and generates a `BisectingKMeansModel` as the base model.
0163
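The divisive, top-down procedure can be sketched in a few lines of pure Python. This is an illustration under simplifying assumptions, not the `BisectingKMeans` implementation: the data are one-dimensional, the cluster with the largest sum of squared errors is chosen for the next split, and each split is a 2-means run seeded with the minimum and maximum of the cluster. All helper names are hypothetical.

```python
def sse(cluster):
    """Sum of squared distances of the values to their mean."""
    mean = sum(cluster) / len(cluster)
    return sum((x - mean) ** 2 for x in cluster)


def two_means(cluster, max_iter=20):
    """Split a list of 1-D values into two groups with 2-means."""
    lo, hi = min(cluster), max(cluster)  # deterministic seeds
    left, right = [], []
    for _ in range(max_iter):
        left = [x for x in cluster if abs(x - lo) <= abs(x - hi)]
        right = [x for x in cluster if abs(x - lo) > abs(x - hi)]
        if not right:
            break  # all values are identical: nothing to split
        new_lo, new_hi = sum(left) / len(left), sum(right) / len(right)
        if (new_lo, new_hi) == (lo, hi):
            break  # converged: neither center moved
        lo, hi = new_lo, new_hi
    return left, right


def bisecting_kmeans(values, k):
    """Start with a single cluster and bisect until there are k clusters."""
    clusters = [list(values)]
    while len(clusters) < k:
        # Assumption of this sketch: split the cluster with the largest error.
        target = max(clusters, key=sse)
        left, right = two_means(target)
        if not right:
            break  # the remaining clusters cannot be split further
        clusters.remove(target)
        clusters.extend([left, right])
    return sorted(sorted(c) for c in clusters)


# Made-up data with three visible groups.
clusters = bisecting_kmeans([1.0, 1.5, 2.0, 10.0, 10.5, 11.0, 30.0, 30.5, 31.0], k=3)
```

The first split separates the values around 30 from the rest, and the second split divides the remaining cluster, which mirrors how all observations start in one cluster and are split recursively.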
0164 **Examples**
0165
0166 <div class="codetabs">
0167
0168 <div data-lang="scala" markdown="1">
0169 Refer to the [Scala API docs](api/scala/org/apache/spark/ml/clustering/BisectingKMeans.html) for more details.
0170
0171 {% include_example scala/org/apache/spark/examples/ml/BisectingKMeansExample.scala %}
0172 </div>
0173
0174 <div data-lang="java" markdown="1">
0175 Refer to the [Java API docs](api/java/org/apache/spark/ml/clustering/BisectingKMeans.html) for more details.
0176
0177 {% include_example java/org/apache/spark/examples/ml/JavaBisectingKMeansExample.java %}
0178 </div>
0179
0180 <div data-lang="python" markdown="1">
0181 Refer to the [Python API docs](api/python/pyspark.ml.html#pyspark.ml.clustering.BisectingKMeans) for more details.
0182
0183 {% include_example python/ml/bisecting_k_means_example.py %}
0184 </div>
0185
0186 <div data-lang="r" markdown="1">
0187
0188 Refer to the [R API docs](api/R/spark.bisectingKmeans.html) for more details.
0189
0190 {% include_example r/ml/bisectingKmeans.R %}
0191 </div>
0192 </div>
0193
0194 ## Gaussian Mixture Model (GMM)
0195
0196 A [Gaussian Mixture Model](http://en.wikipedia.org/wiki/Mixture_model#Multivariate_Gaussian_mixture_model)
0197 represents a composite distribution whereby points are drawn from one of *k* Gaussian sub-distributions,
0198 each with its own probability. The `spark.ml` implementation uses the
0199 [expectation-maximization](http://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm)
0200 algorithm to induce the maximum-likelihood model given a set of samples.
0201
0202 `GaussianMixture` is implemented as an `Estimator` and generates a `GaussianMixtureModel` as the base
0203 model.
0204
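As a rough illustration of expectation-maximization for a mixture of *k* Gaussians, the following pure-Python sketch fits a one-dimensional mixture. It is not the `GaussianMixture` implementation: it assumes scalar data, explicit initial parameters, a fixed number of iterations, and a small variance floor, and the names `gaussian_pdf` and `em_gmm_1d` are hypothetical.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a univariate Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def em_gmm_1d(xs, means, variances, weights, n_iter=50, min_var=1e-6):
    """Fit a 1-D Gaussian mixture with EM, starting from the given parameters."""
    k = len(means)
    means, variances, weights = list(means), list(variances), list(weights)
    resp = []
    for _ in range(n_iter):
        # E-step: responsibility of each component for each sample.
        resp = []
        for x in xs:
            dens = [weights[j] * gaussian_pdf(x, means[j], variances[j]) for j in range(k)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means and variances from the responsibilities.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(xs)
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj
            variances[j] = max(var, min_var)  # floor to avoid a collapsed component
    return weights, means, variances, resp


# Made-up samples drawn around 0 and around 5.
xs = [-0.2, 0.0, 0.2, 4.8, 5.0, 5.2]
weights, means, variances, resp = em_gmm_1d(
    xs, means=[1.0, 4.0], variances=[1.0, 1.0], weights=[0.5, 0.5]
)
# Hard assignment: index of the most probable component for each sample.
hard_labels = [max(range(len(r)), key=lambda j: r[j]) for r in resp]
```

Each row of `resp` corresponds to what the `probability` output column holds (one probability per cluster), and `hard_labels` corresponds to the `prediction` column.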
0205 ### Input Columns
0206
0207 <table class="table">
0208   <thead>
0209     <tr>
0210       <th align="left">Param name</th>
0211       <th align="left">Type(s)</th>
0212       <th align="left">Default</th>
0213       <th align="left">Description</th>
0214     </tr>
0215   </thead>
0216   <tbody>
0217     <tr>
0218       <td>featuresCol</td>
0219       <td>Vector</td>
0220       <td>"features"</td>
0221       <td>Feature vector</td>
0222     </tr>
0223   </tbody>
0224 </table>
0225
0226 ### Output Columns
0227
0228 <table class="table">
0229   <thead>
0230     <tr>
0231       <th align="left">Param name</th>
0232       <th align="left">Type(s)</th>
0233       <th align="left">Default</th>
0234       <th align="left">Description</th>
0235     </tr>
0236   </thead>
0237   <tbody>
0238     <tr>
0239       <td>predictionCol</td>
0240       <td>Int</td>
0241       <td>"prediction"</td>
0242       <td>Index of the predicted cluster</td>
0243     </tr>
0244     <tr>
0245       <td>probabilityCol</td>
0246       <td>Vector</td>
0247       <td>"probability"</td>
0248       <td>Probability of each cluster</td>
0249     </tr>
0250   </tbody>
0251 </table>
0252
0253 **Examples**
0254
0255 <div class="codetabs">
0256
0257 <div data-lang="scala" markdown="1">
0258 Refer to the [Scala API docs](api/scala/org/apache/spark/ml/clustering/GaussianMixture.html) for more details.
0259
0260 {% include_example scala/org/apache/spark/examples/ml/GaussianMixtureExample.scala %}
0261 </div>
0262
0263 <div data-lang="java" markdown="1">
0264 Refer to the [Java API docs](api/java/org/apache/spark/ml/clustering/GaussianMixture.html) for more details.
0265
0266 {% include_example java/org/apache/spark/examples/ml/JavaGaussianMixtureExample.java %}
0267 </div>
0268
0269 <div data-lang="python" markdown="1">
0270 Refer to the [Python API docs](api/python/pyspark.ml.html#pyspark.ml.clustering.GaussianMixture) for more details.
0271
0272 {% include_example python/ml/gaussian_mixture_example.py %}
0273 </div>
0274
0275 <div data-lang="r" markdown="1">
0276
0277 Refer to the [R API docs](api/R/spark.gaussianMixture.html) for more details.
0278
0279 {% include_example r/ml/gaussianMixture.R %}
0280 </div>
0281
0282 </div>
0283
0284 ## Power Iteration Clustering (PIC)
0285
0286 Power Iteration Clustering (PIC) is a scalable graph clustering algorithm
0287 developed by [Lin and Cohen](http://www.cs.cmu.edu/~frank/papers/icml2010-pic-final.pdf).
0288 From the abstract: PIC finds a very low-dimensional embedding of a dataset
0289 using truncated power iteration on a normalized pair-wise similarity matrix of the data.
0290
0291 The `spark.ml` `PowerIterationClustering` implementation takes the following parameters:
0292
0293 * `k`: the number of clusters to create
0294 * `initMode`: the initialization algorithm
0295 * `maxIter`: the maximum number of iterations
0296 * `srcCol`: the name of the input column for source vertex IDs
0297 * `dstCol`: the name of the input column for destination vertex IDs
0298 * `weightCol`: the name of the weight column
0299
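The idea of truncated power iteration on a normalized similarity matrix can be sketched in pure Python. This is a toy under stated assumptions, not the `PowerIterationClustering` implementation: the graph is small enough for a dense matrix, every vertex has at least one edge, the starting vector is passed in explicitly, and the one-dimensional embedding is clustered with a simple k-means seeded at evenly spaced values. The function names are hypothetical; the `(src, dst, weight)` triples only mirror the roles of `srcCol`, `dstCol`, and `weightCol`.

```python
def power_iteration_embedding(n, edges, init, max_iter=10):
    """Truncated power iteration on the row-normalized affinity matrix.

    edges holds (src, dst, weight) triples of an undirected graph with
    vertex IDs 0..n-1; every vertex is assumed to have at least one edge.
    """
    # Build a dense symmetric affinity matrix from the edge list.
    w = [[0.0] * n for _ in range(n)]
    for src, dst, weight in edges:
        w[src][dst] = weight
        w[dst][src] = weight
    # Row-normalize so that every row sums to 1.
    w_norm = [[x / sum(row) for x in row] for row in w]

    total = sum(abs(x) for x in init)
    v = [x / total for x in init]
    for _ in range(max_iter):
        v = [sum(w_norm[i][j] * v[j] for j in range(n)) for i in range(n)]
        total = sum(abs(x) for x in v)
        v = [x / total for x in v]  # keep the vector at unit L1 norm
    return v


def kmeans_1d(values, k, max_iter=20):
    """Cluster scalar values with k-means seeded at evenly spaced centers."""
    lo, hi = min(values), max(values)
    centers = [lo + (hi - lo) * j / (k - 1) for j in range(k)] if k > 1 else [lo]
    labels = [0] * len(values)
    for _ in range(max_iter):
        labels = [min(range(k), key=lambda j: abs(x - centers[j])) for x in values]
        for j in range(k):
            members = [x for x, label in zip(values, labels) if label == j]
            if members:
                centers[j] = sum(members) / len(members)
    return labels


# Made-up graph: two tightly connected triangles joined by one weak edge.
edges = [
    (0, 1, 1.0), (0, 2, 1.0), (1, 2, 1.0),  # first group
    (3, 4, 1.0), (3, 5, 1.0), (4, 5, 1.0),  # second group
    (2, 3, 0.01),                           # weak link between the groups
]
embedding = power_iteration_embedding(6, edges, init=[0.0, 0.1, 0.2, 0.8, 0.9, 1.0])
labels = kmeans_1d(embedding, k=2)
```

Stopping after a few iterations is the point of the truncation: values inside each tightly connected group become nearly equal quickly, while the difference between the groups fades much more slowly, so the embedding still separates them.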
0300 **Examples**
0301
0302 <div class="codetabs">
0303
0304 <div data-lang="scala" markdown="1">
0305 Refer to the [Scala API docs](api/scala/org/apache/spark/ml/clustering/PowerIterationClustering.html) for more details.
0306
0307 {% include_example scala/org/apache/spark/examples/ml/PowerIterationClusteringExample.scala %}
0308 </div>
0309
0310 <div data-lang="java" markdown="1">
0311 Refer to the [Java API docs](api/java/org/apache/spark/ml/clustering/PowerIterationClustering.html) for more details.
0312
0313 {% include_example java/org/apache/spark/examples/ml/JavaPowerIterationClusteringExample.java %}
0314 </div>
0315
0316 <div data-lang="python" markdown="1">
0317 Refer to the [Python API docs](api/python/pyspark.ml.html#pyspark.ml.clustering.PowerIterationClustering) for more details.
0318
0319 {% include_example python/ml/power_iteration_clustering_example.py %}
0320 </div>
0321
0322 <div data-lang="r" markdown="1">
0323
0324 Refer to the [R API docs](api/R/spark.powerIterationClustering.html) for more details.
0325
0326 {% include_example r/ml/powerIterationClustering.R %}
0327 </div>
0328
0329 </div>