---
layout: global
title: Clustering
displayTitle: Clustering
license: |
  Licensed to the Apache Software Foundation (ASF) under one or more
  contributor license agreements.  See the NOTICE file distributed with
  this work for additional information regarding copyright ownership.
  The ASF licenses this file to You under the Apache License, Version 2.0
  (the "License"); you may not use this file except in compliance with
  the License.  You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.
---

This page describes clustering algorithms in MLlib.
The [guide for clustering in the RDD-based API](mllib-clustering.html) also has relevant information
about these algorithms.

**Table of Contents**

* This will become a table of contents (this text will be scraped).
{:toc}

## K-means

[k-means](http://en.wikipedia.org/wiki/K-means_clustering) is one of the
most commonly used clustering algorithms. It partitions the data points into a
predefined number of clusters. The MLlib implementation includes a parallelized
variant of the [k-means++](http://en.wikipedia.org/wiki/K-means%2B%2B) method
called [k-means||](http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf).

`KMeans` is implemented as an `Estimator` and generates a `KMeansModel` as the base model.
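
To make the assign-and-update loop behind k-means concrete, here is a minimal pure-Python sketch of the classic Lloyd iteration. It is an illustration only, not MLlib's implementation: it runs on a single machine, picks the starting centers by plain random sampling rather than k-means||, and the names `squared_distance`, `closest`, and `kmeans` are hypothetical helpers introduced for this example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def closest(point, centers):
    """Index of the center nearest to the given point."""
    return min(range(len(centers)), key=lambda i: squared_distance(point, centers[i]))


def kmeans(points, k, max_iter=20, seed=1):
    """Lloyd's iteration with random initial centers (assumption: not k-means||)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(max_iter):
        # Assignment step: group every point with its nearest center.
        groups = [[] for _ in range(k)]
        for p in points:
            groups[closest(p, centers)].append(p)
        # Update step: move each center to the mean of its group.
        # An empty group keeps its previous center.
        new_centers = [
            tuple(sum(coord) / len(g) for coord in zip(*g)) if g else centers[i]
            for i, g in enumerate(groups)
        ]
        if new_centers == centers:
            break  # converged: no center moved
        centers = new_centers
    return centers


points = [(0.0, 0.0), (0.1, 0.1), (0.2, 0.2), (9.0, 9.0), (9.1, 9.1), (9.2, 9.2)]
centers = kmeans(points, k=2)
labels = [closest(p, centers) for p in points]
```

Here `labels` plays the role of the `prediction` column: one integer cluster index per input row.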

### Input Columns

<table class="table">
  <thead>
    <tr>
      <th align="left">Param name</th>
      <th align="left">Type(s)</th>
      <th align="left">Default</th>
      <th align="left">Description</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>featuresCol</td>
      <td>Vector</td>
      <td>"features"</td>
      <td>Feature vector</td>
    </tr>
  </tbody>
</table>

### Output Columns

<table class="table">
  <thead>
    <tr>
      <th align="left">Param name</th>
      <th align="left">Type(s)</th>
      <th align="left">Default</th>
      <th align="left">Description</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>predictionCol</td>
      <td>Int</td>
      <td>"prediction"</td>
      <td>Index of the predicted cluster</td>
    </tr>
  </tbody>
</table>

**Examples**

<div class="codetabs">

<div data-lang="scala" markdown="1">
Refer to the [Scala API docs](api/scala/org/apache/spark/ml/clustering/KMeans.html) for more details.

{% include_example scala/org/apache/spark/examples/ml/KMeansExample.scala %}
</div>

<div data-lang="java" markdown="1">
Refer to the [Java API docs](api/java/org/apache/spark/ml/clustering/KMeans.html) for more details.

{% include_example java/org/apache/spark/examples/ml/JavaKMeansExample.java %}
</div>

<div data-lang="python" markdown="1">
Refer to the [Python API docs](api/python/pyspark.ml.html#pyspark.ml.clustering.KMeans) for more details.

{% include_example python/ml/kmeans_example.py %}
</div>

<div data-lang="r" markdown="1">

Refer to the [R API docs](api/R/spark.kmeans.html) for more details.

{% include_example r/ml/kmeans.R %}
</div>

</div>

## Latent Dirichlet allocation (LDA)

`LDA` is implemented as an `Estimator` that supports both `EMLDAOptimizer` and `OnlineLDAOptimizer`,
and generates an `LDAModel` as the base model. Expert users may cast an `LDAModel` generated by
`EMLDAOptimizer` to a `DistributedLDAModel` if needed.

**Examples**

<div class="codetabs">

<div data-lang="scala" markdown="1">

Refer to the [Scala API docs](api/scala/org/apache/spark/ml/clustering/LDA.html) for more details.

{% include_example scala/org/apache/spark/examples/ml/LDAExample.scala %}
</div>

<div data-lang="java" markdown="1">

Refer to the [Java API docs](api/java/org/apache/spark/ml/clustering/LDA.html) for more details.

{% include_example java/org/apache/spark/examples/ml/JavaLDAExample.java %}
</div>

<div data-lang="python" markdown="1">

Refer to the [Python API docs](api/python/pyspark.ml.html#pyspark.ml.clustering.LDA) for more details.

{% include_example python/ml/lda_example.py %}
</div>

<div data-lang="r" markdown="1">

Refer to the [R API docs](api/R/spark.lda.html) for more details.

{% include_example r/ml/lda.R %}
</div>

</div>

## Bisecting k-means

Bisecting k-means is a kind of [hierarchical clustering](https://en.wikipedia.org/wiki/Hierarchical_clustering) using a
divisive (or "top-down") approach: all observations start in one cluster, and splits are performed recursively as one
moves down the hierarchy.

Bisecting k-means can often be much faster than regular k-means, but it will generally produce a different clustering.

`BisectingKMeans` is implemented as an `Estimator` and generates a `BisectingKMeansModel` as the base model.
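
The divisive procedure described above can be sketched in a few lines of pure Python on one-dimensional data. This is a simplified illustration under stated assumptions, not the `BisectingKMeans` implementation: the helper names are hypothetical, each split is a 2-means run seeded at the cluster's minimum and maximum, and the rule "split the cluster with the largest sum of squared errors next" is a choice made for this sketch.

```python
def mean(values):
    return sum(values) / len(values)


def sse(values):
    """Sum of squared distances from the cluster mean."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def split_in_two(values, max_iter=20):
    """Split one cluster with 2-means, seeded at its min and max (sketch assumption)."""
    lo, hi = min(values), max(values)
    for _ in range(max_iter):
        left = [v for v in values if abs(v - lo) <= abs(v - hi)]
        right = [v for v in values if abs(v - lo) > abs(v - hi)]
        new_lo, new_hi = mean(left), mean(right)
        if (new_lo, new_hi) == (lo, hi):
            break  # centers stopped moving
        lo, hi = new_lo, new_hi
    return left, right


def bisecting_kmeans(values, k):
    """Start with one cluster and bisect until there are k clusters."""
    clusters = [list(values)]
    while len(clusters) < k:
        # Only clusters with at least two distinct values can be split.
        candidates = [c for c in clusters if len(set(c)) > 1]
        if not candidates:
            break
        target = max(candidates, key=sse)  # sketch assumption: largest SSE first
        clusters.remove(target)
        clusters.extend(split_in_two(target))
    return clusters


values = [1.0, 1.2, 0.8, 5.0, 5.2, 9.0, 9.1, 8.9]
clusters = bisecting_kmeans(values, k=3)
```

Because each split only looks at the points of one existing cluster, the result forms a hierarchy, which is also why it generally differs from a flat k-means run with the same `k`.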

**Examples**

<div class="codetabs">

<div data-lang="scala" markdown="1">
Refer to the [Scala API docs](api/scala/org/apache/spark/ml/clustering/BisectingKMeans.html) for more details.

{% include_example scala/org/apache/spark/examples/ml/BisectingKMeansExample.scala %}
</div>

<div data-lang="java" markdown="1">
Refer to the [Java API docs](api/java/org/apache/spark/ml/clustering/BisectingKMeans.html) for more details.

{% include_example java/org/apache/spark/examples/ml/JavaBisectingKMeansExample.java %}
</div>

<div data-lang="python" markdown="1">
Refer to the [Python API docs](api/python/pyspark.ml.html#pyspark.ml.clustering.BisectingKMeans) for more details.

{% include_example python/ml/bisecting_k_means_example.py %}
</div>

<div data-lang="r" markdown="1">

Refer to the [R API docs](api/R/spark.bisectingKmeans.html) for more details.

{% include_example r/ml/bisectingKmeans.R %}
</div>
</div>

## Gaussian Mixture Model (GMM)

A [Gaussian Mixture Model](http://en.wikipedia.org/wiki/Mixture_model#Multivariate_Gaussian_mixture_model)
represents a composite distribution in which each point is drawn from one of *k* Gaussian sub-distributions,
each with its own probability. The `spark.ml` implementation uses the
[expectation-maximization](http://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm)
algorithm to fit the maximum-likelihood model to a set of samples.

`GaussianMixture` is implemented as an `Estimator` and generates a `GaussianMixtureModel` as the base
model.

### Input Columns

<table class="table">
  <thead>
    <tr>
      <th align="left">Param name</th>
      <th align="left">Type(s)</th>
      <th align="left">Default</th>
      <th align="left">Description</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>featuresCol</td>
      <td>Vector</td>
      <td>"features"</td>
      <td>Feature vector</td>
    </tr>
  </tbody>
</table>

### Output Columns

<table class="table">
  <thead>
    <tr>
      <th align="left">Param name</th>
      <th align="left">Type(s)</th>
      <th align="left">Default</th>
      <th align="left">Description</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>predictionCol</td>
      <td>Int</td>
      <td>"prediction"</td>
      <td>Index of the predicted cluster</td>
    </tr>
    <tr>
      <td>probabilityCol</td>
      <td>Vector</td>
      <td>"probability"</td>
      <td>Probability of each cluster</td>
    </tr>
  </tbody>
</table>
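
To show how the two output columns relate, the following pure-Python sketch computes the per-cluster membership probabilities of a single point under a one-dimensional mixture, then takes the most probable component as the hard assignment. The mixture parameters are made up, the helper names are hypothetical, and treating the prediction as the arg-max of the probability vector is an assumption of this sketch rather than a statement about `GaussianMixtureModel` internals.

```python
import math


def gaussian_pdf(x, mean, variance):
    """Density of a univariate Gaussian at x."""
    return math.exp(-((x - mean) ** 2) / (2 * variance)) / math.sqrt(2 * math.pi * variance)


def membership_probabilities(x, weights, means, variances):
    """Posterior probability of each component given x (the EM E-step for one point)."""
    joint = [w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
    total = sum(joint)
    return [j / total for j in joint]


# Hypothetical two-component mixture: equal weights, unit variances.
weights = [0.5, 0.5]
means = [0.0, 5.0]
variances = [1.0, 1.0]

# Analogue of the "probability" column for the point x = 1.0.
probability = membership_probabilities(1.0, weights, means, variances)
# Analogue of the "prediction" column: index of the most probable component.
prediction = max(range(len(probability)), key=probability.__getitem__)
```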

**Examples**

<div class="codetabs">

<div data-lang="scala" markdown="1">
Refer to the [Scala API docs](api/scala/org/apache/spark/ml/clustering/GaussianMixture.html) for more details.

{% include_example scala/org/apache/spark/examples/ml/GaussianMixtureExample.scala %}
</div>

<div data-lang="java" markdown="1">
Refer to the [Java API docs](api/java/org/apache/spark/ml/clustering/GaussianMixture.html) for more details.

{% include_example java/org/apache/spark/examples/ml/JavaGaussianMixtureExample.java %}
</div>

<div data-lang="python" markdown="1">
Refer to the [Python API docs](api/python/pyspark.ml.html#pyspark.ml.clustering.GaussianMixture) for more details.

{% include_example python/ml/gaussian_mixture_example.py %}
</div>

<div data-lang="r" markdown="1">

Refer to the [R API docs](api/R/spark.gaussianMixture.html) for more details.

{% include_example r/ml/gaussianMixture.R %}
</div>

</div>

## Power Iteration Clustering (PIC)

Power Iteration Clustering (PIC) is a scalable graph clustering algorithm
developed by [Lin and Cohen](http://www.cs.cmu.edu/~frank/papers/icml2010-pic-final.pdf).
From the abstract: PIC finds a very low-dimensional embedding of a dataset
using truncated power iteration on a normalized pair-wise similarity matrix of the data.

The `spark.ml` `PowerIterationClustering` implementation takes the following parameters:

* `k`: the number of clusters to create
* `initMode`: the initialization algorithm, either `random` or `degree`
* `maxIter`: the maximum number of iterations
* `srcCol`: the name of the input column for source vertex IDs
* `dstCol`: the name of the input column for destination vertex IDs
* `weightCol`: the name of the input column for edge weights (similarities)
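
As a rough illustration of what "truncated power iteration on a normalized pair-wise similarity matrix" means, the pure-Python sketch below builds a small affinity matrix from `(src, dst, weight)` triples, row-normalizes it, and repeatedly multiplies a vector by it for a fixed number of steps. It is a single-machine sketch under stated assumptions, not the `spark.ml` implementation: the function name is hypothetical, the starting vector is proportional to vertex degree, every vertex is assumed to have at least one edge, and the final step of clustering the embedding (k-means in the paper) is omitted.

```python
def power_iteration_embedding(edges, num_vertices, max_iter=10):
    """One-dimensional PIC-style embedding from (src, dst, weight) triples."""
    # Symmetric affinity matrix A built from the edge list.
    affinity = [[0.0] * num_vertices for _ in range(num_vertices)]
    for src, dst, weight in edges:
        affinity[src][dst] = weight
        affinity[dst][src] = weight

    # Row-normalize: W = D^-1 A (assumes every vertex has a positive degree).
    degrees = [sum(row) for row in affinity]
    normalized = [[a / d for a in row] for row, d in zip(affinity, degrees)]

    # Start from the degree vector scaled to sum to 1 (sketch assumption).
    total = sum(degrees)
    v = [d / total for d in degrees]

    # Truncated power iteration: v <- W v / ||W v||_1, stopped after max_iter steps.
    for _ in range(max_iter):
        next_v = [sum(w * x for w, x in zip(row, v)) for row in normalized]
        norm = sum(abs(x) for x in next_v)
        v = [x / norm for x in next_v]
    return v


# Two triangles joined by one weak edge; columns are (src, dst, weight).
edges = [
    (0, 1, 1.0), (0, 2, 1.0), (1, 2, 1.0),
    (3, 4, 2.0), (3, 5, 2.0), (4, 5, 2.0),
    (2, 3, 0.1),
]
embedding = power_iteration_embedding(edges, num_vertices=6)
```

Stopping early matters: run to convergence, the vector would become constant on a connected graph, whereas after a few steps the entries of vertices in the same tightly connected group are already close together while the two groups remain apart.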

**Examples**

<div class="codetabs">

<div data-lang="scala" markdown="1">
Refer to the [Scala API docs](api/scala/org/apache/spark/ml/clustering/PowerIterationClustering.html) for more details.

{% include_example scala/org/apache/spark/examples/ml/PowerIterationClusteringExample.scala %}
</div>

<div data-lang="java" markdown="1">
Refer to the [Java API docs](api/java/org/apache/spark/ml/clustering/PowerIterationClustering.html) for more details.

{% include_example java/org/apache/spark/examples/ml/JavaPowerIterationClusteringExample.java %}
</div>

<div data-lang="python" markdown="1">
Refer to the [Python API docs](api/python/pyspark.ml.html#pyspark.ml.clustering.PowerIterationClustering) for more details.

{% include_example python/ml/power_iteration_clustering_example.py %}
</div>

<div data-lang="r" markdown="1">

Refer to the [R API docs](api/R/spark.powerIterationClustering.html) for more details.

{% include_example r/ml/powerIterationClustering.R %}
</div>

</div>