---
layout: global
title: Data Types - RDD-based API
displayTitle: Data Types - RDD-based API
license: |
  Licensed to the Apache Software Foundation (ASF) under one or more
  contributor license agreements.  See the NOTICE file distributed with
  this work for additional information regarding copyright ownership.
  The ASF licenses this file to You under the Apache License, Version 2.0
  (the "License"); you may not use this file except in compliance with
  the License.  You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.
---

* Table of contents
{:toc}

MLlib supports local vectors and matrices stored on a single machine,
as well as distributed matrices backed by one or more RDDs.
Local vectors and local matrices are simple data models
that serve as public interfaces. The underlying linear algebra operations are provided by
[Breeze](http://www.scalanlp.org/).
A training example used in supervised learning is called a "labeled point" in MLlib.

## Local vector

A local vector has integer-typed and 0-based indices and double-typed values, stored on a single
machine.  MLlib supports two types of local vectors: dense and sparse.  A dense vector is backed by
a double array representing its entry values, while a sparse vector is backed by two parallel
arrays: indices and values.  For example, a vector `(1.0, 0.0, 3.0)` can be represented in dense
format as `[1.0, 0.0, 3.0]` or in sparse format as `(3, [0, 2], [1.0, 3.0])`, where `3` is the size
of the vector.

<div class="codetabs">
<div data-lang="scala" markdown="1">

The base class of local vectors is
[`Vector`](api/scala/org/apache/spark/mllib/linalg/Vector.html), and we provide two
implementations: [`DenseVector`](api/scala/org/apache/spark/mllib/linalg/DenseVector.html) and
[`SparseVector`](api/scala/org/apache/spark/mllib/linalg/SparseVector.html).  We recommend
using the factory methods implemented in
[`Vectors`](api/scala/org/apache/spark/mllib/linalg/Vectors$.html) to create local vectors.

Refer to the [`Vector` Scala docs](api/scala/org/apache/spark/mllib/linalg/Vector.html) and [`Vectors` Scala docs](api/scala/org/apache/spark/mllib/linalg/Vectors$.html) for details on the API.

{% highlight scala %}
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Create a dense vector (1.0, 0.0, 3.0).
val dv: Vector = Vectors.dense(1.0, 0.0, 3.0)
// Create a sparse vector (1.0, 0.0, 3.0) by specifying its indices and values corresponding to nonzero entries.
val sv1: Vector = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))
// Create a sparse vector (1.0, 0.0, 3.0) by specifying its nonzero entries.
val sv2: Vector = Vectors.sparse(3, Seq((0, 1.0), (2, 3.0)))
{% endhighlight %}

***Note:***
Scala imports `scala.collection.immutable.Vector` by default, so you have to import
`org.apache.spark.mllib.linalg.Vector` explicitly to use MLlib's `Vector`.

</div>

<div data-lang="java" markdown="1">

The base class of local vectors is
[`Vector`](api/java/org/apache/spark/mllib/linalg/Vector.html), and we provide two
implementations: [`DenseVector`](api/java/org/apache/spark/mllib/linalg/DenseVector.html) and
[`SparseVector`](api/java/org/apache/spark/mllib/linalg/SparseVector.html).  We recommend
using the factory methods implemented in
[`Vectors`](api/java/org/apache/spark/mllib/linalg/Vectors.html) to create local vectors.

Refer to the [`Vector` Java docs](api/java/org/apache/spark/mllib/linalg/Vector.html) and [`Vectors` Java docs](api/java/org/apache/spark/mllib/linalg/Vectors.html) for details on the API.

{% highlight java %}
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;

// Create a dense vector (1.0, 0.0, 3.0).
Vector dv = Vectors.dense(1.0, 0.0, 3.0);
// Create a sparse vector (1.0, 0.0, 3.0) by specifying its indices and values corresponding to nonzero entries.
Vector sv = Vectors.sparse(3, new int[] {0, 2}, new double[] {1.0, 3.0});
{% endhighlight %}
</div>

<div data-lang="python" markdown="1">
MLlib recognizes the following types as dense vectors:

* NumPy's [`array`](http://docs.scipy.org/doc/numpy/reference/generated/numpy.array.html)
* Python's list, e.g., `[1, 2, 3]`

and the following as sparse vectors:

* MLlib's [`SparseVector`](api/python/pyspark.mllib.html#pyspark.mllib.linalg.SparseVector)
* SciPy's
  [`csc_matrix`](http://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csc_matrix.html#scipy.sparse.csc_matrix)
  with a single column

We recommend using NumPy arrays over lists for efficiency, and using the factory methods implemented
in [`Vectors`](api/python/pyspark.mllib.html#pyspark.mllib.linalg.Vectors) to create sparse vectors.

Refer to the [`Vectors` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.linalg.Vectors) for more details on the API.

{% highlight python %}
import numpy as np
import scipy.sparse as sps
from pyspark.mllib.linalg import Vectors

# Use a NumPy array as a dense vector.
dv1 = np.array([1.0, 0.0, 3.0])
# Use a Python list as a dense vector.
dv2 = [1.0, 0.0, 3.0]
# Create a SparseVector.
sv1 = Vectors.sparse(3, [0, 2], [1.0, 3.0])
# Use a single-column SciPy csc_matrix as a sparse vector.
sv2 = sps.csc_matrix((np.array([1.0, 3.0]), np.array([0, 2]), np.array([0, 2])), shape=(3, 1))
{% endhighlight %}

</div>
</div>

## Labeled point

A labeled point is a local vector, either dense or sparse, associated with a label/response.
In MLlib, labeled points are used in supervised learning algorithms.
We use a double to store a label, so we can use labeled points in both regression and classification.
For binary classification, a label should be either `0` (negative) or `1` (positive).
For multiclass classification, labels should be class indices starting from zero: `0, 1, 2, ...`.

<div class="codetabs">

<div data-lang="scala" markdown="1">

A labeled point is represented by the case class
[`LabeledPoint`](api/scala/org/apache/spark/mllib/regression/LabeledPoint.html).

Refer to the [`LabeledPoint` Scala docs](api/scala/org/apache/spark/mllib/regression/LabeledPoint.html) for details on the API.

{% highlight scala %}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Create a labeled point with a positive label and a dense feature vector.
val pos = LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0))

// Create a labeled point with a negative label and a sparse feature vector.
val neg = LabeledPoint(0.0, Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0)))
{% endhighlight %}
</div>

<div data-lang="java" markdown="1">

A labeled point is represented by
[`LabeledPoint`](api/java/org/apache/spark/mllib/regression/LabeledPoint.html).

Refer to the [`LabeledPoint` Java docs](api/java/org/apache/spark/mllib/regression/LabeledPoint.html) for details on the API.

{% highlight java %}
import org.apache.spark.mllib.linalg.Vectors;
import org.apache.spark.mllib.regression.LabeledPoint;

// Create a labeled point with a positive label and a dense feature vector.
LabeledPoint pos = new LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0));

// Create a labeled point with a negative label and a sparse feature vector.
LabeledPoint neg = new LabeledPoint(0.0, Vectors.sparse(3, new int[] {0, 2}, new double[] {1.0, 3.0}));
{% endhighlight %}
</div>

<div data-lang="python" markdown="1">

A labeled point is represented by
[`LabeledPoint`](api/python/pyspark.mllib.html#pyspark.mllib.regression.LabeledPoint).

Refer to the [`LabeledPoint` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.regression.LabeledPoint) for more details on the API.

{% highlight python %}
from pyspark.mllib.linalg import SparseVector
from pyspark.mllib.regression import LabeledPoint

# Create a labeled point with a positive label and a dense feature vector.
pos = LabeledPoint(1.0, [1.0, 0.0, 3.0])

# Create a labeled point with a negative label and a sparse feature vector.
neg = LabeledPoint(0.0, SparseVector(3, [0, 2], [1.0, 3.0]))
{% endhighlight %}
</div>
</div>

***Sparse data***

It is very common in practice to have sparse training data.  MLlib supports reading training
examples stored in `LIBSVM` format, which is the default format used by
[`LIBSVM`](http://www.csie.ntu.edu.tw/~cjlin/libsvm/) and
[`LIBLINEAR`](http://www.csie.ntu.edu.tw/~cjlin/liblinear/).  It is a text format in which each line
represents a labeled sparse feature vector using the following format:

~~~
label index1:value1 index2:value2 ...
~~~

where the indices are one-based and in ascending order.
After loading, the feature indices are converted to zero-based.
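
For example, a labeled point with label `1.0` and the sparse feature vector `(3, [0, 2], [1.0, 3.0])` from above corresponds to the following line (note the shift to one-based indices):

~~~
1.0 1:1.0 3:3.0
~~~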

<div class="codetabs">
<div data-lang="scala" markdown="1">

[`MLUtils.loadLibSVMFile`](api/scala/org/apache/spark/mllib/util/MLUtils$.html) reads training
examples stored in LIBSVM format.

Refer to the [`MLUtils` Scala docs](api/scala/org/apache/spark/mllib/util/MLUtils$.html) for details on the API.

{% highlight scala %}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.rdd.RDD

val examples: RDD[LabeledPoint] = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
{% endhighlight %}
</div>

<div data-lang="java" markdown="1">
[`MLUtils.loadLibSVMFile`](api/java/org/apache/spark/mllib/util/MLUtils.html) reads training
examples stored in LIBSVM format.

Refer to the [`MLUtils` Java docs](api/java/org/apache/spark/mllib/util/MLUtils.html) for details on the API.

{% highlight java %}
import org.apache.spark.mllib.regression.LabeledPoint;
import org.apache.spark.mllib.util.MLUtils;
import org.apache.spark.api.java.JavaRDD;

JavaRDD<LabeledPoint> examples =
  MLUtils.loadLibSVMFile(jsc.sc(), "data/mllib/sample_libsvm_data.txt").toJavaRDD();
{% endhighlight %}
</div>

<div data-lang="python" markdown="1">
[`MLUtils.loadLibSVMFile`](api/python/pyspark.mllib.html#pyspark.mllib.util.MLUtils) reads training
examples stored in LIBSVM format.

Refer to the [`MLUtils` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.util.MLUtils) for more details on the API.

{% highlight python %}
from pyspark.mllib.util import MLUtils

examples = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
{% endhighlight %}
</div>
</div>

## Local matrix

A local matrix has integer-typed row and column indices and double-typed values, stored on a single
machine.  MLlib supports dense matrices, whose entry values are stored in a single double array in
column-major order, and sparse matrices, whose non-zero entry values are stored in the Compressed Sparse
Column (CSC) format in column-major order.  For example, the following dense matrix `\[ \begin{pmatrix}
1.0 & 2.0 \\
3.0 & 4.0 \\
5.0 & 6.0
\end{pmatrix}
\]`
is stored in a one-dimensional array `[1.0, 3.0, 5.0, 2.0, 4.0, 6.0]` with the matrix size `(3, 2)`.
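
As a quick sketch of the CSC layout (this is the encoding taken by the sparse factory methods shown below): `values` holds the non-zero entries column by column, `rowIndices(k)` is the row of `values(k)`, and `colPtrs(j)` marks where column `j` starts in `values`. For instance, the 3-by-2 sparse matrix with rows `(9.0, 0.0)`, `(0.0, 8.0)`, `(0.0, 6.0)` used in the examples below is encoded as:

{% highlight scala %}
import org.apache.spark.mllib.linalg.{Matrix, Matrices}

val colPtrs = Array(0, 1, 3)       // column j spans positions colPtrs(j) until colPtrs(j + 1)
val rowIndices = Array(0, 2, 1)    // row index of each stored value
val values = Array(9.0, 6.0, 8.0)  // non-zero entries, column by column
val sm: Matrix = Matrices.sparse(3, 2, colPtrs, rowIndices, values)
{% endhighlight %}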

<div class="codetabs">
<div data-lang="scala" markdown="1">

The base class of local matrices is
[`Matrix`](api/scala/org/apache/spark/mllib/linalg/Matrix.html), and we provide two
implementations: [`DenseMatrix`](api/scala/org/apache/spark/mllib/linalg/DenseMatrix.html)
and [`SparseMatrix`](api/scala/org/apache/spark/mllib/linalg/SparseMatrix.html).
We recommend using the factory methods implemented
in [`Matrices`](api/scala/org/apache/spark/mllib/linalg/Matrices$.html) to create local
matrices. Remember, local matrices in MLlib are stored in column-major order.

Refer to the [`Matrix` Scala docs](api/scala/org/apache/spark/mllib/linalg/Matrix.html) and [`Matrices` Scala docs](api/scala/org/apache/spark/mllib/linalg/Matrices$.html) for details on the API.

{% highlight scala %}
import org.apache.spark.mllib.linalg.{Matrix, Matrices}

// Create a dense matrix ((1.0, 2.0), (3.0, 4.0), (5.0, 6.0))
val dm: Matrix = Matrices.dense(3, 2, Array(1.0, 3.0, 5.0, 2.0, 4.0, 6.0))

// Create a sparse matrix ((9.0, 0.0), (0.0, 8.0), (0.0, 6.0))
val sm: Matrix = Matrices.sparse(3, 2, Array(0, 1, 3), Array(0, 2, 1), Array(9, 6, 8))
{% endhighlight %}
</div>

<div data-lang="java" markdown="1">

The base class of local matrices is
[`Matrix`](api/java/org/apache/spark/mllib/linalg/Matrix.html), and we provide two
implementations: [`DenseMatrix`](api/java/org/apache/spark/mllib/linalg/DenseMatrix.html)
and [`SparseMatrix`](api/java/org/apache/spark/mllib/linalg/SparseMatrix.html).
We recommend using the factory methods implemented
in [`Matrices`](api/java/org/apache/spark/mllib/linalg/Matrices.html) to create local
matrices. Remember, local matrices in MLlib are stored in column-major order.

Refer to the [`Matrix` Java docs](api/java/org/apache/spark/mllib/linalg/Matrix.html) and [`Matrices` Java docs](api/java/org/apache/spark/mllib/linalg/Matrices.html) for details on the API.

{% highlight java %}
import org.apache.spark.mllib.linalg.Matrix;
import org.apache.spark.mllib.linalg.Matrices;

// Create a dense matrix ((1.0, 2.0), (3.0, 4.0), (5.0, 6.0))
Matrix dm = Matrices.dense(3, 2, new double[] {1.0, 3.0, 5.0, 2.0, 4.0, 6.0});

// Create a sparse matrix ((9.0, 0.0), (0.0, 8.0), (0.0, 6.0))
Matrix sm = Matrices.sparse(3, 2, new int[] {0, 1, 3}, new int[] {0, 2, 1}, new double[] {9, 6, 8});
{% endhighlight %}
</div>

<div data-lang="python" markdown="1">

The base class of local matrices is
[`Matrix`](api/python/pyspark.mllib.html#pyspark.mllib.linalg.Matrix), and we provide two
implementations: [`DenseMatrix`](api/python/pyspark.mllib.html#pyspark.mllib.linalg.DenseMatrix)
and [`SparseMatrix`](api/python/pyspark.mllib.html#pyspark.mllib.linalg.SparseMatrix).
We recommend using the factory methods implemented
in [`Matrices`](api/python/pyspark.mllib.html#pyspark.mllib.linalg.Matrices) to create local
matrices. Remember, local matrices in MLlib are stored in column-major order.

Refer to the [`Matrix` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.linalg.Matrix) and [`Matrices` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.linalg.Matrices) for more details on the API.

{% highlight python %}
from pyspark.mllib.linalg import Matrix, Matrices

# Create a dense matrix ((1.0, 2.0), (3.0, 4.0), (5.0, 6.0))
dm2 = Matrices.dense(3, 2, [1, 3, 5, 2, 4, 6])

# Create a sparse matrix ((9.0, 0.0), (0.0, 8.0), (0.0, 6.0))
sm = Matrices.sparse(3, 2, [0, 1, 3], [0, 2, 1], [9, 6, 8])
{% endhighlight %}
</div>

</div>

## Distributed matrix

A distributed matrix has long-typed row and column indices and double-typed values, stored
across one or more RDDs.  It is very important to choose the right format to store large
distributed matrices.  Converting a distributed matrix to a different format may require a
global shuffle, which is quite expensive.  Four types of distributed matrices have been implemented
so far.

The basic type is called `RowMatrix`. A `RowMatrix` is a row-oriented distributed
matrix without meaningful row indices, e.g., a collection of feature vectors.
It is backed by an RDD of its rows, where each row is a local vector.
We assume that the number of columns of a `RowMatrix` is small enough that a single
row can reasonably be communicated to the driver and stored and operated on by a single node.
An `IndexedRowMatrix` is similar to a `RowMatrix` but with row indices,
which can be used for identifying rows and executing joins.
A `CoordinateMatrix` is a distributed matrix stored in [coordinate list (COO)](https://en.wikipedia.org/wiki/Sparse_matrix#Coordinate_list_.28COO.29) format,
backed by an RDD of its entries.
A `BlockMatrix` is a distributed matrix backed by an RDD of `MatrixBlock`s,
where a `MatrixBlock` is a tuple of `((Int, Int), Matrix)`.

***Note***

The underlying RDDs of a distributed matrix must be deterministic, because we cache the matrix size.
In general, the use of non-deterministic RDDs can lead to errors.

### RowMatrix

A `RowMatrix` is a row-oriented distributed matrix without meaningful row indices, backed by an RDD
of its rows, where each row is a local vector.
Since each row is represented by a local vector, the number of columns is
limited by the integer range, but it should be much smaller in practice.

<div class="codetabs">
<div data-lang="scala" markdown="1">

A [`RowMatrix`](api/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.html) can be
created from an `RDD[Vector]` instance.  Then we can compute its column summary statistics and decompositions.
[QR decomposition](https://en.wikipedia.org/wiki/QR_decomposition) is of the form A = QR where Q is an orthogonal matrix and R is an upper triangular matrix.
For [singular value decomposition (SVD)](https://en.wikipedia.org/wiki/Singular_value_decomposition) and [principal component analysis (PCA)](https://en.wikipedia.org/wiki/Principal_component_analysis), please refer to [Dimensionality reduction](mllib-dimensionality-reduction.html).

Refer to the [`RowMatrix` Scala docs](api/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.html) for details on the API.

{% highlight scala %}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.rdd.RDD

val rows: RDD[Vector] = ... // an RDD of local vectors
// Create a RowMatrix from an RDD[Vector].
val mat: RowMatrix = new RowMatrix(rows)

// Get its size.
val m = mat.numRows()
val n = mat.numCols()

// QR decomposition
val qrResult = mat.tallSkinnyQR(true)
{% endhighlight %}
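
The returned `QRDecomposition` bundles the two factors; a minimal sketch of reading them back from the result above:

{% highlight scala %}
// Q comes back as a distributed RowMatrix and R as a local upper triangular Matrix.
val q: RowMatrix = qrResult.Q
val r = qrResult.R
{% endhighlight %}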
</div>

<div data-lang="java" markdown="1">

A [`RowMatrix`](api/java/org/apache/spark/mllib/linalg/distributed/RowMatrix.html) can be
created from a `JavaRDD<Vector>` instance.  Then we can compute its column summary statistics.

Refer to the [`RowMatrix` Java docs](api/java/org/apache/spark/mllib/linalg/distributed/RowMatrix.html) for details on the API.

{% highlight java %}
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.linalg.Matrix;
import org.apache.spark.mllib.linalg.QRDecomposition;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.distributed.RowMatrix;

JavaRDD<Vector> rows = ... // a JavaRDD of local vectors
// Create a RowMatrix from a JavaRDD<Vector>.
RowMatrix mat = new RowMatrix(rows.rdd());

// Get its size.
long m = mat.numRows();
long n = mat.numCols();

// QR decomposition
QRDecomposition<RowMatrix, Matrix> result = mat.tallSkinnyQR(true);
{% endhighlight %}
</div>

<div data-lang="python" markdown="1">

A [`RowMatrix`](api/python/pyspark.mllib.html#pyspark.mllib.linalg.distributed.RowMatrix) can be
created from an `RDD` of vectors.

Refer to the [`RowMatrix` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.linalg.distributed.RowMatrix) for more details on the API.

{% highlight python %}
from pyspark.mllib.linalg.distributed import RowMatrix

# Create an RDD of vectors.
rows = sc.parallelize([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])

# Create a RowMatrix from an RDD of vectors.
mat = RowMatrix(rows)

# Get its size.
m = mat.numRows()  # 4
n = mat.numCols()  # 3

# Get the rows as an RDD of vectors again.
rowsRDD = mat.rows
{% endhighlight %}
</div>

</div>

### IndexedRowMatrix

An `IndexedRowMatrix` is similar to a `RowMatrix` but with meaningful row indices.  It is backed by
an RDD of indexed rows, so that each row is represented by its index (long-typed) and a local
vector.

<div class="codetabs">
<div data-lang="scala" markdown="1">

An
[`IndexedRowMatrix`](api/scala/org/apache/spark/mllib/linalg/distributed/IndexedRowMatrix.html)
can be created from an `RDD[IndexedRow]` instance, where
[`IndexedRow`](api/scala/org/apache/spark/mllib/linalg/distributed/IndexedRow.html) is a
wrapper over `(Long, Vector)`.  An `IndexedRowMatrix` can be converted to a `RowMatrix` by dropping
its row indices.

Refer to the [`IndexedRowMatrix` Scala docs](api/scala/org/apache/spark/mllib/linalg/distributed/IndexedRowMatrix.html) for details on the API.

{% highlight scala %}
import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix, RowMatrix}
import org.apache.spark.rdd.RDD

val rows: RDD[IndexedRow] = ... // an RDD of indexed rows
// Create an IndexedRowMatrix from an RDD[IndexedRow].
val mat: IndexedRowMatrix = new IndexedRowMatrix(rows)

// Get its size.
val m = mat.numRows()
val n = mat.numCols()

// Drop its row indices.
val rowMat: RowMatrix = mat.toRowMatrix()
{% endhighlight %}
</div>

<div data-lang="java" markdown="1">

An
[`IndexedRowMatrix`](api/java/org/apache/spark/mllib/linalg/distributed/IndexedRowMatrix.html)
can be created from a `JavaRDD<IndexedRow>` instance, where
[`IndexedRow`](api/java/org/apache/spark/mllib/linalg/distributed/IndexedRow.html) is a
wrapper over `(long, Vector)`.  An `IndexedRowMatrix` can be converted to a `RowMatrix` by dropping
its row indices.

Refer to the [`IndexedRowMatrix` Java docs](api/java/org/apache/spark/mllib/linalg/distributed/IndexedRowMatrix.html) for details on the API.

{% highlight java %}
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.linalg.distributed.IndexedRow;
import org.apache.spark.mllib.linalg.distributed.IndexedRowMatrix;
import org.apache.spark.mllib.linalg.distributed.RowMatrix;

JavaRDD<IndexedRow> rows = ... // a JavaRDD of indexed rows
// Create an IndexedRowMatrix from a JavaRDD<IndexedRow>.
IndexedRowMatrix mat = new IndexedRowMatrix(rows.rdd());

// Get its size.
long m = mat.numRows();
long n = mat.numCols();

// Drop its row indices.
RowMatrix rowMat = mat.toRowMatrix();
{% endhighlight %}
</div>

<div data-lang="python" markdown="1">

An [`IndexedRowMatrix`](api/python/pyspark.mllib.html#pyspark.mllib.linalg.distributed.IndexedRowMatrix)
can be created from an `RDD` of `IndexedRow`s, where
[`IndexedRow`](api/python/pyspark.mllib.html#pyspark.mllib.linalg.distributed.IndexedRow) is a
wrapper over `(long, vector)`.  An `IndexedRowMatrix` can be converted to a `RowMatrix` by dropping
its row indices.

Refer to the [`IndexedRowMatrix` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.linalg.distributed.IndexedRowMatrix) for more details on the API.

{% highlight python %}
from pyspark.mllib.linalg.distributed import IndexedRow, IndexedRowMatrix

# Create an RDD of indexed rows.
#   - This can be done explicitly with the IndexedRow class:
indexedRows = sc.parallelize([IndexedRow(0, [1, 2, 3]),
                              IndexedRow(1, [4, 5, 6]),
                              IndexedRow(2, [7, 8, 9]),
                              IndexedRow(3, [10, 11, 12])])
#   - or by using (long, vector) tuples:
indexedRows = sc.parallelize([(0, [1, 2, 3]), (1, [4, 5, 6]),
                              (2, [7, 8, 9]), (3, [10, 11, 12])])

# Create an IndexedRowMatrix from an RDD of IndexedRows.
mat = IndexedRowMatrix(indexedRows)

# Get its size.
m = mat.numRows()  # 4
n = mat.numCols()  # 3

# Get the rows as an RDD of IndexedRows.
rowsRDD = mat.rows

# Convert to a RowMatrix by dropping the row indices.
rowMat = mat.toRowMatrix()
{% endhighlight %}
</div>

</div>

### CoordinateMatrix

A `CoordinateMatrix` is a distributed matrix backed by an RDD of its entries.  Each entry is a tuple
of `(i: Long, j: Long, value: Double)`, where `i` is the row index, `j` is the column index, and
`value` is the entry value.  A `CoordinateMatrix` should be used only when both
dimensions of the matrix are huge and the matrix is very sparse.

<div class="codetabs">
<div data-lang="scala" markdown="1">

A
[`CoordinateMatrix`](api/scala/org/apache/spark/mllib/linalg/distributed/CoordinateMatrix.html)
can be created from an `RDD[MatrixEntry]` instance, where
[`MatrixEntry`](api/scala/org/apache/spark/mllib/linalg/distributed/MatrixEntry.html) is a
wrapper over `(Long, Long, Double)`.  A `CoordinateMatrix` can be converted to an `IndexedRowMatrix`
with sparse rows by calling `toIndexedRowMatrix`.  Other computations for
`CoordinateMatrix` are not currently supported.

Refer to the [`CoordinateMatrix` Scala docs](api/scala/org/apache/spark/mllib/linalg/distributed/CoordinateMatrix.html) for details on the API.

{% highlight scala %}
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}
import org.apache.spark.rdd.RDD

val entries: RDD[MatrixEntry] = ... // an RDD of matrix entries
// Create a CoordinateMatrix from an RDD[MatrixEntry].
val mat: CoordinateMatrix = new CoordinateMatrix(entries)

// Get its size.
val m = mat.numRows()
val n = mat.numCols()

// Convert it to an IndexedRowMatrix whose rows are sparse vectors.
val indexedRowMatrix = mat.toIndexedRowMatrix()
{% endhighlight %}
</div>

<div data-lang="java" markdown="1">

A
[`CoordinateMatrix`](api/java/org/apache/spark/mllib/linalg/distributed/CoordinateMatrix.html)
can be created from a `JavaRDD<MatrixEntry>` instance, where
[`MatrixEntry`](api/java/org/apache/spark/mllib/linalg/distributed/MatrixEntry.html) is a
wrapper over `(long, long, double)`.  A `CoordinateMatrix` can be converted to an `IndexedRowMatrix`
with sparse rows by calling `toIndexedRowMatrix`.  Other computations for
`CoordinateMatrix` are not currently supported.

Refer to the [`CoordinateMatrix` Java docs](api/java/org/apache/spark/mllib/linalg/distributed/CoordinateMatrix.html) for details on the API.

{% highlight java %}
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.linalg.distributed.CoordinateMatrix;
import org.apache.spark.mllib.linalg.distributed.IndexedRowMatrix;
import org.apache.spark.mllib.linalg.distributed.MatrixEntry;

JavaRDD<MatrixEntry> entries = ... // a JavaRDD of matrix entries
// Create a CoordinateMatrix from a JavaRDD<MatrixEntry>.
CoordinateMatrix mat = new CoordinateMatrix(entries.rdd());

// Get its size.
long m = mat.numRows();
long n = mat.numCols();

// Convert it to an IndexedRowMatrix whose rows are sparse vectors.
IndexedRowMatrix indexedRowMatrix = mat.toIndexedRowMatrix();
{% endhighlight %}
</div>

<div data-lang="python" markdown="1">

A [`CoordinateMatrix`](api/python/pyspark.mllib.html#pyspark.mllib.linalg.distributed.CoordinateMatrix)
can be created from an `RDD` of `MatrixEntry` entries, where
[`MatrixEntry`](api/python/pyspark.mllib.html#pyspark.mllib.linalg.distributed.MatrixEntry) is a
wrapper over `(long, long, float)`.  A `CoordinateMatrix` can be converted to a `RowMatrix` by
calling `toRowMatrix`, or to an `IndexedRowMatrix` with sparse rows by calling `toIndexedRowMatrix`.

Refer to the [`CoordinateMatrix` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.linalg.distributed.CoordinateMatrix) for more details on the API.

{% highlight python %}
from pyspark.mllib.linalg.distributed import CoordinateMatrix, MatrixEntry

# Create an RDD of coordinate entries.
#   - This can be done explicitly with the MatrixEntry class:
entries = sc.parallelize([MatrixEntry(0, 0, 1.2), MatrixEntry(1, 0, 2.1), MatrixEntry(2, 1, 3.7)])
#   - or using (long, long, float) tuples:
entries = sc.parallelize([(0, 0, 1.2), (1, 0, 2.1), (2, 1, 3.7)])

# Create a CoordinateMatrix from an RDD of MatrixEntries.
mat = CoordinateMatrix(entries)

# Get its size.
m = mat.numRows()  # 3
n = mat.numCols()  # 2

# Get the entries as an RDD of MatrixEntries.
entriesRDD = mat.entries

# Convert to a RowMatrix.
rowMat = mat.toRowMatrix()

# Convert to an IndexedRowMatrix.
indexedRowMat = mat.toIndexedRowMatrix()

# Convert to a BlockMatrix.
blockMat = mat.toBlockMatrix()
{% endhighlight %}
</div>

</div>

### BlockMatrix

A `BlockMatrix` is a distributed matrix backed by an RDD of `MatrixBlock`s, where a `MatrixBlock` is
a tuple of `((Int, Int), Matrix)`: the `(Int, Int)` is the index of the block, and `Matrix` is
the sub-matrix at the given index with size `rowsPerBlock` x `colsPerBlock`.
`BlockMatrix` supports methods such as `add` and `multiply` with another `BlockMatrix`.
`BlockMatrix` also has a helper function `validate` which can be used to check whether the
`BlockMatrix` is set up properly.

<div class="codetabs">
<div data-lang="scala" markdown="1">

A [`BlockMatrix`](api/scala/org/apache/spark/mllib/linalg/distributed/BlockMatrix.html) can be
most easily created from an `IndexedRowMatrix` or `CoordinateMatrix` by calling `toBlockMatrix`.
`toBlockMatrix` creates blocks of size 1024 x 1024 by default.
Users may change the block size by supplying the values through `toBlockMatrix(rowsPerBlock, colsPerBlock)`.

Refer to the [`BlockMatrix` Scala docs](api/scala/org/apache/spark/mllib/linalg/distributed/BlockMatrix.html) for details on the API.

{% highlight scala %}
import org.apache.spark.mllib.linalg.distributed.{BlockMatrix, CoordinateMatrix, MatrixEntry}
import org.apache.spark.rdd.RDD

val entries: RDD[MatrixEntry] = ... // an RDD of (i, j, v) matrix entries
// Create a CoordinateMatrix from an RDD[MatrixEntry].
val coordMat: CoordinateMatrix = new CoordinateMatrix(entries)
// Transform the CoordinateMatrix to a BlockMatrix.
val matA: BlockMatrix = coordMat.toBlockMatrix().cache()

// Validate whether the BlockMatrix is set up properly. Throws an Exception when it is not valid.
// Nothing happens if it is valid.
matA.validate()

// Calculate A^T A.
val ata = matA.transpose.multiply(matA)
{% endhighlight %}
</div>

<div data-lang="java" markdown="1">

A [`BlockMatrix`](api/java/org/apache/spark/mllib/linalg/distributed/BlockMatrix.html) can be
most easily created from an `IndexedRowMatrix` or `CoordinateMatrix` by calling `toBlockMatrix`.
`toBlockMatrix` creates blocks of size 1024 x 1024 by default.
Users may change the block size by supplying the values through `toBlockMatrix(rowsPerBlock, colsPerBlock)`.

Refer to the [`BlockMatrix` Java docs](api/java/org/apache/spark/mllib/linalg/distributed/BlockMatrix.html) for details on the API.

{% highlight java %}
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.linalg.distributed.BlockMatrix;
import org.apache.spark.mllib.linalg.distributed.CoordinateMatrix;
import org.apache.spark.mllib.linalg.distributed.MatrixEntry;

JavaRDD<MatrixEntry> entries = ... // a JavaRDD of (i, j, v) matrix entries
// Create a CoordinateMatrix from a JavaRDD<MatrixEntry>.
CoordinateMatrix coordMat = new CoordinateMatrix(entries.rdd());
// Transform the CoordinateMatrix to a BlockMatrix.
BlockMatrix matA = coordMat.toBlockMatrix().cache();

// Validate whether the BlockMatrix is set up properly. Throws an Exception when it is not valid.
// Nothing happens if it is valid.
matA.validate();

// Calculate A^T A.
BlockMatrix ata = matA.transpose().multiply(matA);
{% endhighlight %}
</div>

<div data-lang="python" markdown="1">

A [`BlockMatrix`](api/python/pyspark.mllib.html#pyspark.mllib.linalg.distributed.BlockMatrix)
can be created from an `RDD` of sub-matrix blocks, where a sub-matrix block is a
`((blockRowIndex, blockColIndex), sub-matrix)` tuple.

Refer to the [`BlockMatrix` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.linalg.distributed.BlockMatrix) for more details on the API.

{% highlight python %}
from pyspark.mllib.linalg import Matrices
from pyspark.mllib.linalg.distributed import BlockMatrix

# Create an RDD of sub-matrix blocks.
blocks = sc.parallelize([((0, 0), Matrices.dense(3, 2, [1, 2, 3, 4, 5, 6])),
                         ((1, 0), Matrices.dense(3, 2, [7, 8, 9, 10, 11, 12]))])

# Create a BlockMatrix from an RDD of sub-matrix blocks.
mat = BlockMatrix(blocks, 3, 2)

# Get its size.
m = mat.numRows()  # 6
n = mat.numCols()  # 2

# Get the blocks as an RDD of sub-matrix blocks.
blocksRDD = mat.blocks

# Convert to a LocalMatrix.
localMat = mat.toLocalMatrix()

# Convert to an IndexedRowMatrix.
indexedRowMat = mat.toIndexedRowMatrix()

# Convert to a CoordinateMatrix.
coordinateMat = mat.toCoordinateMatrix()
{% endhighlight %}
</div>
</div>