---
layout: global
title: Dimensionality Reduction - RDD-based API
displayTitle: Dimensionality Reduction - RDD-based API
license: |
  Licensed to the Apache Software Foundation (ASF) under one or more
  contributor license agreements. See the NOTICE file distributed with
  this work for additional information regarding copyright ownership.
  The ASF licenses this file to You under the Apache License, Version 2.0
  (the "License"); you may not use this file except in compliance with
  the License. You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.
---

* Table of contents
{:toc}

[Dimensionality reduction](http://en.wikipedia.org/wiki/Dimensionality_reduction) is the process
of reducing the number of variables under consideration.
It can be used to extract latent features from raw and noisy features
or to compress data while preserving its structure.
`spark.mllib` provides support for dimensionality reduction on the <a href="mllib-data-types.html#rowmatrix">RowMatrix</a> class.

## Singular value decomposition (SVD)

[Singular value decomposition (SVD)](http://en.wikipedia.org/wiki/Singular_value_decomposition)
factorizes a matrix into three matrices: $U$, $\Sigma$, and $V$ such that

`\[
A = U \Sigma V^T,
\]`

where

* $U$ is an orthonormal matrix, whose columns are called left singular vectors,
* $\Sigma$ is a diagonal matrix with non-negative diagonals in descending order,
whose diagonals are called singular values,
* $V$ is an orthonormal matrix, whose columns are called right singular vectors.

For large matrices, we usually do not need the complete factorization but only the top singular
values and their associated singular vectors. This can save storage, reduce noise,
and recover the low-rank structure of the matrix.

If we keep the top $k$ singular values, then the dimensions of the resulting low-rank matrices will be:

* `$U$`: `$m \times k$`,
* `$\Sigma$`: `$k \times k$`,
* `$V$`: `$n \times k$`.

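These shapes can be checked with a quick NumPy sketch of the underlying math (this illustrates the factorization itself, not the `spark.mllib` API; the matrix and sizes are arbitrary):

```python
import numpy as np

m, n, k = 6, 4, 2
rng = np.random.default_rng(0)
A = rng.standard_normal((m, n))

# Full thin SVD: A = U * diag(s) * Vt, singular values in descending order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the top k singular values and their singular vectors.
U_k = U[:, :k]            # m x k
Sigma_k = np.diag(s[:k])  # k x k
V_k = Vt[:k, :].T         # n x k

# A_k is the best rank-k approximation of A in Frobenius norm.
A_k = U_k @ Sigma_k @ V_k.T
print(U_k.shape, Sigma_k.shape, V_k.shape)  # (6, 2) (2, 2) (4, 2)
```

Storing `U_k`, `Sigma_k`, and `V_k` takes $O((m + n)k)$ numbers instead of the $O(mn)$ needed for `A`, which is where the storage savings come from.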
### Performance
We assume $n$ is smaller than $m$. The singular values and the right singular vectors are derived
from the eigenvalues and the eigenvectors of the Gramian matrix $A^T A$. The matrix $U$,
which stores the left singular vectors, is computed via matrix multiplication as
$U = A (V S^{-1})$, if requested by the user via the `computeU` parameter.
The actual method to use is determined automatically based on the computational cost:

* If $n$ is small ($n < 100$) or $k$ is large compared with $n$ ($k > n / 2$), we compute the Gramian matrix
first and then compute its top eigenvalues and eigenvectors locally on the driver.
This requires a single pass with $O(n^2)$ storage on each executor and on the driver, and
$O(n^2 k)$ time on the driver.
* Otherwise, we compute $(A^T A) v$ in a distributed way and send it to
<a href="http://www.caam.rice.edu/software/ARPACK/">ARPACK</a> to
compute $(A^T A)$'s top eigenvalues and eigenvectors on the driver node. This requires $O(k)$
passes, $O(n)$ storage on each executor, and $O(n k)$ storage on the driver.

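The Gramian route described above can be sketched in NumPy for a matrix that fits in memory (in `spark.mllib` the Gramian computation and the final multiplication are distributed; this is only an illustration of the math):

```python
import numpy as np

m, n, k = 8, 3, 2
rng = np.random.default_rng(1)
A = rng.standard_normal((m, n))

# Eigendecomposition of the Gramian A^T A: the eigenvalues are the squared
# singular values and the eigenvectors are the right singular vectors.
gram = A.T @ A
eigvals, eigvecs = np.linalg.eigh(gram)  # returned in ascending order
order = np.argsort(eigvals)[::-1][:k]    # indices of the top k eigenvalues
s = np.sqrt(eigvals[order])              # top k singular values
V = eigvecs[:, order]                    # right singular vectors, n x k

# Left singular vectors via U = A (V S^{-1}), as in the text above.
U = A @ (V / s)                          # m x k, columns orthonormal

# Sanity check against a direct SVD.
U_ref, s_ref, _ = np.linalg.svd(A, full_matrices=False)
assert np.allclose(s, s_ref[:k])
```

Note that this only needs the $n \times n$ Gramian on the driver, never the full $m \times n$ matrix, which is why it scales to tall matrices with modest $n$.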
### SVD Example

`spark.mllib` provides SVD functionality for row-oriented matrices, provided by the
<a href="mllib-data-types.html#rowmatrix">RowMatrix</a> class.

<div class="codetabs">
<div data-lang="scala" markdown="1">
Refer to the [`SingularValueDecomposition` Scala docs](api/scala/org/apache/spark/mllib/linalg/SingularValueDecomposition.html) for details on the API.

{% include_example scala/org/apache/spark/examples/mllib/SVDExample.scala %}

The same code applies to `IndexedRowMatrix` if `U` is defined as an
`IndexedRowMatrix`.
</div>
<div data-lang="java" markdown="1">
Refer to the [`SingularValueDecomposition` Java docs](api/java/org/apache/spark/mllib/linalg/SingularValueDecomposition.html) for details on the API.

{% include_example java/org/apache/spark/examples/mllib/JavaSVDExample.java %}

The same code applies to `IndexedRowMatrix` if `U` is defined as an
`IndexedRowMatrix`.
</div>
<div data-lang="python" markdown="1">
Refer to the [`SingularValueDecomposition` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.linalg.distributed.SingularValueDecomposition) for details on the API.

{% include_example python/mllib/svd_example.py %}

The same code applies to `IndexedRowMatrix` if `U` is defined as an
`IndexedRowMatrix`.
</div>
</div>

## Principal component analysis (PCA)

[Principal component analysis (PCA)](http://en.wikipedia.org/wiki/Principal_component_analysis) is a
statistical method to find a rotation such that the first coordinate has the largest variance
possible, and each succeeding coordinate, in turn, has the largest variance possible. The columns of
the rotation matrix are called principal components. PCA is used widely in dimensionality reduction.

`spark.mllib` supports PCA for tall-and-skinny matrices stored in row-oriented format and for any Vectors.

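The math behind PCA can be sketched in NumPy: center the columns, take the top right singular vectors of the centered matrix as the principal components, and project the rows onto them (this illustrates the rotation described above, not the Spark API; the data and sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((100, 5))
k = 2

# Center each column; the top right singular vectors of the centered
# matrix are the principal components (columns of the rotation matrix).
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
components = Vt[:k].T          # 5 x 2 rotation matrix

# Project the rows into the k-dimensional principal subspace.
projected = X @ components
print(projected.shape)  # (100, 2)
```

By construction, the variance of the first projected coordinate of the centered data is the largest achievable by any unit direction, the second is the largest among directions orthogonal to the first, and so on.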
<div class="codetabs">
<div data-lang="scala" markdown="1">

The following code demonstrates how to compute principal components on a `RowMatrix`
and use them to project the vectors into a low-dimensional space.

Refer to the [`RowMatrix` Scala docs](api/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.html) for details on the API.

{% include_example scala/org/apache/spark/examples/mllib/PCAOnRowMatrixExample.scala %}

The following code demonstrates how to compute principal components on source vectors
and use them to project the vectors into a low-dimensional space while keeping associated labels:

Refer to the [`PCA` Scala docs](api/scala/org/apache/spark/mllib/feature/PCA.html) for details on the API.

{% include_example scala/org/apache/spark/examples/mllib/PCAOnSourceVectorExample.scala %}

</div>

<div data-lang="java" markdown="1">

The following code demonstrates how to compute principal components on a `RowMatrix`
and use them to project the vectors into a low-dimensional space.

Refer to the [`RowMatrix` Java docs](api/java/org/apache/spark/mllib/linalg/distributed/RowMatrix.html) for details on the API.

{% include_example java/org/apache/spark/examples/mllib/JavaPCAExample.java %}

</div>

<div data-lang="python" markdown="1">

The following code demonstrates how to compute principal components on a `RowMatrix`
and use them to project the vectors into a low-dimensional space.

Refer to the [`RowMatrix` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.linalg.distributed.RowMatrix) for details on the API.

{% include_example python/mllib/pca_rowmatrix_example.py %}

</div>
</div>