---
layout: global
title: Data Types - RDD-based API
displayTitle: Data Types - RDD-based API
license: |
  Licensed to the Apache Software Foundation (ASF) under one or more
  contributor license agreements.  See the NOTICE file distributed with
  this work for additional information regarding copyright ownership.
  The ASF licenses this file to You under the Apache License, Version 2.0
  (the "License"); you may not use this file except in compliance with
  the License.  You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.
---

* Table of contents
{:toc}

MLlib supports local vectors and matrices stored on a single machine,
as well as distributed matrices backed by one or more RDDs.
Local vectors and local matrices are simple data models
that serve as public interfaces. The underlying linear algebra operations are provided by
[Breeze](http://www.scalanlp.org/).
A training example used in supervised learning is called a "labeled point" in MLlib.

## Local vector

A local vector has integer-typed and 0-based indices and double-typed values, stored on a single
machine.  MLlib supports two types of local vectors: dense and sparse.  A dense vector is backed by
a double array representing its entry values, while a sparse vector is backed by two parallel
arrays: indices and values.  For example, a vector `(1.0, 0.0, 3.0)` can be represented in dense
format as `[1.0, 0.0, 3.0]` or in sparse format as `(3, [0, 2], [1.0, 3.0])`, where `3` is the size
of the vector.

<div class="codetabs">
<div data-lang="scala" markdown="1">

The base class of local vectors is
[`Vector`](api/scala/org/apache/spark/mllib/linalg/Vector.html), and we provide two
implementations: [`DenseVector`](api/scala/org/apache/spark/mllib/linalg/DenseVector.html) and
[`SparseVector`](api/scala/org/apache/spark/mllib/linalg/SparseVector.html).  We recommend
using the factory methods implemented in
[`Vectors`](api/scala/org/apache/spark/mllib/linalg/Vectors$.html) to create local vectors.

Refer to the [`Vector` Scala docs](api/scala/org/apache/spark/mllib/linalg/Vector.html) and [`Vectors` Scala docs](api/scala/org/apache/spark/mllib/linalg/Vectors$.html) for details on the API.

{% highlight scala %}
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Create a dense vector (1.0, 0.0, 3.0).
val dv: Vector = Vectors.dense(1.0, 0.0, 3.0)
// Create a sparse vector (1.0, 0.0, 3.0) by specifying its indices and values corresponding to nonzero entries.
val sv1: Vector = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))
// Create a sparse vector (1.0, 0.0, 3.0) by specifying its nonzero entries.
val sv2: Vector = Vectors.sparse(3, Seq((0, 1.0), (2, 3.0)))
{% endhighlight %}

***Note:***
Scala imports `scala.collection.immutable.Vector` by default, so you have to import
`org.apache.spark.mllib.linalg.Vector` explicitly to use MLlib's `Vector`.

</div>

<div data-lang="java" markdown="1">

The base class of local vectors is
[`Vector`](api/java/org/apache/spark/mllib/linalg/Vector.html), and we provide two
implementations: [`DenseVector`](api/java/org/apache/spark/mllib/linalg/DenseVector.html) and
[`SparseVector`](api/java/org/apache/spark/mllib/linalg/SparseVector.html).  We recommend
using the factory methods implemented in
[`Vectors`](api/java/org/apache/spark/mllib/linalg/Vectors.html) to create local vectors.

Refer to the [`Vector` Java docs](api/java/org/apache/spark/mllib/linalg/Vector.html) and [`Vectors` Java docs](api/java/org/apache/spark/mllib/linalg/Vectors.html) for details on the API.

{% highlight java %}
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;

// Create a dense vector (1.0, 0.0, 3.0).
Vector dv = Vectors.dense(1.0, 0.0, 3.0);
// Create a sparse vector (1.0, 0.0, 3.0) by specifying its indices and values corresponding to nonzero entries.
Vector sv = Vectors.sparse(3, new int[] {0, 2}, new double[] {1.0, 3.0});
{% endhighlight %}
</div>

<div data-lang="python" markdown="1">
MLlib recognizes the following types as dense vectors:

* NumPy's [`array`](http://docs.scipy.org/doc/numpy/reference/generated/numpy.array.html)
* Python's list, e.g., `[1, 2, 3]`

and the following as sparse vectors:

* MLlib's [`SparseVector`](api/python/pyspark.mllib.html#pyspark.mllib.linalg.SparseVector)
* SciPy's
  [`csc_matrix`](http://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csc_matrix.html#scipy.sparse.csc_matrix)
  with a single column

We recommend using NumPy arrays over lists for efficiency, and using the factory methods implemented
in [`Vectors`](api/python/pyspark.mllib.html#pyspark.mllib.linalg.Vectors) to create sparse vectors.

Refer to the [`Vectors` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.linalg.Vectors) for more details on the API.

{% highlight python %}
import numpy as np
import scipy.sparse as sps
from pyspark.mllib.linalg import Vectors

# Use a NumPy array as a dense vector.
dv1 = np.array([1.0, 0.0, 3.0])
# Use a Python list as a dense vector.
dv2 = [1.0, 0.0, 3.0]
# Create a SparseVector.
sv1 = Vectors.sparse(3, [0, 2], [1.0, 3.0])
# Use a single-column SciPy csc_matrix as a sparse vector.
sv2 = sps.csc_matrix((np.array([1.0, 3.0]), np.array([0, 2]), np.array([0, 2])), shape=(3, 1))
{% endhighlight %}

</div>
</div>

## Labeled point

A labeled point is a local vector, either dense or sparse, associated with a label/response.
In MLlib, labeled points are used in supervised learning algorithms.
We use a double to store a label, so we can use labeled points in both regression and classification.
For binary classification, a label should be either `0` (negative) or `1` (positive).
For multiclass classification, labels should be class indices starting from zero: `0, 1, 2, ...`.

<div class="codetabs">

<div data-lang="scala" markdown="1">

A labeled point is represented by the case class
[`LabeledPoint`](api/scala/org/apache/spark/mllib/regression/LabeledPoint.html).

Refer to the [`LabeledPoint` Scala docs](api/scala/org/apache/spark/mllib/regression/LabeledPoint.html) for details on the API.

{% highlight scala %}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Create a labeled point with a positive label and a dense feature vector.
val pos = LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0))

// Create a labeled point with a negative label and a sparse feature vector.
val neg = LabeledPoint(0.0, Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0)))
{% endhighlight %}
</div>

<div data-lang="java" markdown="1">

A labeled point is represented by
[`LabeledPoint`](api/java/org/apache/spark/mllib/regression/LabeledPoint.html).

Refer to the [`LabeledPoint` Java docs](api/java/org/apache/spark/mllib/regression/LabeledPoint.html) for details on the API.

{% highlight java %}
import org.apache.spark.mllib.linalg.Vectors;
import org.apache.spark.mllib.regression.LabeledPoint;

// Create a labeled point with a positive label and a dense feature vector.
LabeledPoint pos = new LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0));

// Create a labeled point with a negative label and a sparse feature vector.
LabeledPoint neg = new LabeledPoint(0.0, Vectors.sparse(3, new int[] {0, 2}, new double[] {1.0, 3.0}));
{% endhighlight %}
</div>

<div data-lang="python" markdown="1">

A labeled point is represented by
[`LabeledPoint`](api/python/pyspark.mllib.html#pyspark.mllib.regression.LabeledPoint).

Refer to the [`LabeledPoint` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.regression.LabeledPoint) for more details on the API.

{% highlight python %}
from pyspark.mllib.linalg import SparseVector
from pyspark.mllib.regression import LabeledPoint

# Create a labeled point with a positive label and a dense feature vector.
pos = LabeledPoint(1.0, [1.0, 0.0, 3.0])

# Create a labeled point with a negative label and a sparse feature vector.
neg = LabeledPoint(0.0, SparseVector(3, [0, 2], [1.0, 3.0]))
{% endhighlight %}
</div>
</div>

***Sparse data***

It is very common in practice to have sparse training data.  MLlib supports reading training
examples stored in `LIBSVM` format, which is the default format used by
[`LIBSVM`](http://www.csie.ntu.edu.tw/~cjlin/libsvm/) and
[`LIBLINEAR`](http://www.csie.ntu.edu.tw/~cjlin/liblinear/).  It is a text format in which each line
represents a labeled sparse feature vector using the following format:

~~~
label index1:value1 index2:value2 ...
~~~

where the indices are one-based and in ascending order.
After loading, the feature indices are converted to zero-based.
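
For example, a labeled point with label `1.0` and the sparse feature vector `(3, [0, 2], [1.0, 3.0])` from above corresponds to the following line (note the shift to one-based indices):

~~~
1.0 1:1.0 3:3.0
~~~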

<div class="codetabs">
<div data-lang="scala" markdown="1">

[`MLUtils.loadLibSVMFile`](api/scala/org/apache/spark/mllib/util/MLUtils$.html) reads training
examples stored in LIBSVM format.

Refer to the [`MLUtils` Scala docs](api/scala/org/apache/spark/mllib/util/MLUtils$.html) for details on the API.

{% highlight scala %}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.rdd.RDD

val examples: RDD[LabeledPoint] = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
{% endhighlight %}
</div>

<div data-lang="java" markdown="1">
[`MLUtils.loadLibSVMFile`](api/java/org/apache/spark/mllib/util/MLUtils.html) reads training
examples stored in LIBSVM format.

Refer to the [`MLUtils` Java docs](api/java/org/apache/spark/mllib/util/MLUtils.html) for details on the API.

{% highlight java %}
import org.apache.spark.mllib.regression.LabeledPoint;
import org.apache.spark.mllib.util.MLUtils;
import org.apache.spark.api.java.JavaRDD;

JavaRDD<LabeledPoint> examples =
  MLUtils.loadLibSVMFile(jsc.sc(), "data/mllib/sample_libsvm_data.txt").toJavaRDD();
{% endhighlight %}
</div>

<div data-lang="python" markdown="1">
[`MLUtils.loadLibSVMFile`](api/python/pyspark.mllib.html#pyspark.mllib.util.MLUtils) reads training
examples stored in LIBSVM format.

Refer to the [`MLUtils` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.util.MLUtils) for more details on the API.

{% highlight python %}
from pyspark.mllib.util import MLUtils

examples = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
{% endhighlight %}
</div>
</div>

## Local matrix

A local matrix has integer-typed row and column indices and double-typed values, stored on a single
machine.  MLlib supports dense matrices, whose entry values are stored in a single double array in
column-major order, and sparse matrices, whose non-zero entry values are stored in the Compressed Sparse
Column (CSC) format in column-major order.  For example, the following dense matrix `\[ \begin{pmatrix}
1.0 & 2.0 \\
3.0 & 4.0 \\
5.0 & 6.0
\end{pmatrix}
\]`
is stored in a one-dimensional array `[1.0, 3.0, 5.0, 2.0, 4.0, 6.0]` with the matrix size `(3, 2)`.
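
As a quick sketch of the CSC layout (this is the encoding taken by the sparse factory methods shown below): `values` holds the non-zero entries column by column, `rowIndices(k)` is the row of `values(k)`, and `colPtrs(j)` marks where column `j` starts in `values`. For instance, the 3-by-2 sparse matrix with rows `(9.0, 0.0)`, `(0.0, 8.0)`, `(0.0, 6.0)` used in the examples below is encoded as:

{% highlight scala %}
import org.apache.spark.mllib.linalg.{Matrix, Matrices}

val colPtrs = Array(0, 1, 3)       // column j spans positions colPtrs(j) until colPtrs(j + 1)
val rowIndices = Array(0, 2, 1)    // row index of each stored value
val values = Array(9.0, 6.0, 8.0)  // non-zero entries, column by column
val sm: Matrix = Matrices.sparse(3, 2, colPtrs, rowIndices, values)
{% endhighlight %}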

<div class="codetabs">
<div data-lang="scala" markdown="1">

The base class of local matrices is
[`Matrix`](api/scala/org/apache/spark/mllib/linalg/Matrix.html), and we provide two
implementations: [`DenseMatrix`](api/scala/org/apache/spark/mllib/linalg/DenseMatrix.html)
and [`SparseMatrix`](api/scala/org/apache/spark/mllib/linalg/SparseMatrix.html).
We recommend using the factory methods implemented
in [`Matrices`](api/scala/org/apache/spark/mllib/linalg/Matrices$.html) to create local
matrices. Remember, local matrices in MLlib are stored in column-major order.

Refer to the [`Matrix` Scala docs](api/scala/org/apache/spark/mllib/linalg/Matrix.html) and [`Matrices` Scala docs](api/scala/org/apache/spark/mllib/linalg/Matrices$.html) for details on the API.

{% highlight scala %}
import org.apache.spark.mllib.linalg.{Matrix, Matrices}

// Create a dense matrix ((1.0, 2.0), (3.0, 4.0), (5.0, 6.0))
val dm: Matrix = Matrices.dense(3, 2, Array(1.0, 3.0, 5.0, 2.0, 4.0, 6.0))

// Create a sparse matrix ((9.0, 0.0), (0.0, 8.0), (0.0, 6.0))
val sm: Matrix = Matrices.sparse(3, 2, Array(0, 1, 3), Array(0, 2, 1), Array(9, 6, 8))
{% endhighlight %}
</div>

<div data-lang="java" markdown="1">

The base class of local matrices is
[`Matrix`](api/java/org/apache/spark/mllib/linalg/Matrix.html), and we provide two
implementations: [`DenseMatrix`](api/java/org/apache/spark/mllib/linalg/DenseMatrix.html)
and [`SparseMatrix`](api/java/org/apache/spark/mllib/linalg/SparseMatrix.html).
We recommend using the factory methods implemented
in [`Matrices`](api/java/org/apache/spark/mllib/linalg/Matrices.html) to create local
matrices. Remember, local matrices in MLlib are stored in column-major order.

Refer to the [`Matrix` Java docs](api/java/org/apache/spark/mllib/linalg/Matrix.html) and [`Matrices` Java docs](api/java/org/apache/spark/mllib/linalg/Matrices.html) for details on the API.

{% highlight java %}
import org.apache.spark.mllib.linalg.Matrix;
import org.apache.spark.mllib.linalg.Matrices;

// Create a dense matrix ((1.0, 2.0), (3.0, 4.0), (5.0, 6.0))
Matrix dm = Matrices.dense(3, 2, new double[] {1.0, 3.0, 5.0, 2.0, 4.0, 6.0});

// Create a sparse matrix ((9.0, 0.0), (0.0, 8.0), (0.0, 6.0))
Matrix sm = Matrices.sparse(3, 2, new int[] {0, 1, 3}, new int[] {0, 2, 1}, new double[] {9, 6, 8});
{% endhighlight %}
</div>

<div data-lang="python" markdown="1">

The base class of local matrices is
[`Matrix`](api/python/pyspark.mllib.html#pyspark.mllib.linalg.Matrix), and we provide two
implementations: [`DenseMatrix`](api/python/pyspark.mllib.html#pyspark.mllib.linalg.DenseMatrix)
and [`SparseMatrix`](api/python/pyspark.mllib.html#pyspark.mllib.linalg.SparseMatrix).
We recommend using the factory methods implemented
in [`Matrices`](api/python/pyspark.mllib.html#pyspark.mllib.linalg.Matrices) to create local
matrices. Remember, local matrices in MLlib are stored in column-major order.

Refer to the [`Matrix` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.linalg.Matrix) and [`Matrices` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.linalg.Matrices) for more details on the API.

{% highlight python %}
from pyspark.mllib.linalg import Matrix, Matrices

# Create a dense matrix ((1.0, 2.0), (3.0, 4.0), (5.0, 6.0))
dm2 = Matrices.dense(3, 2, [1, 3, 5, 2, 4, 6])

# Create a sparse matrix ((9.0, 0.0), (0.0, 8.0), (0.0, 6.0))
sm = Matrices.sparse(3, 2, [0, 1, 3], [0, 2, 1], [9, 6, 8])
{% endhighlight %}
</div>

</div>

## Distributed matrix

A distributed matrix has long-typed row and column indices and double-typed values, stored
across one or more RDDs.  It is very important to choose the right format to store large
distributed matrices.  Converting a distributed matrix to a different format may require a
global shuffle, which is quite expensive.  Four types of distributed matrices have been implemented
so far.

The basic type is called `RowMatrix`. A `RowMatrix` is a row-oriented distributed
matrix without meaningful row indices, e.g., a collection of feature vectors.
It is backed by an RDD of its rows, where each row is a local vector.
We assume that the number of columns of a `RowMatrix` is small enough that a single
row can reasonably be communicated to the driver and stored and operated on by a single node.
An `IndexedRowMatrix` is similar to a `RowMatrix` but with row indices,
which can be used for identifying rows and executing joins.
A `CoordinateMatrix` is a distributed matrix stored in [coordinate list (COO)](https://en.wikipedia.org/wiki/Sparse_matrix#Coordinate_list_.28COO.29) format,
backed by an RDD of its entries.
A `BlockMatrix` is a distributed matrix backed by an RDD of `MatrixBlock`s,
where a `MatrixBlock` is a tuple of `((Int, Int), Matrix)`.

***Note***

The underlying RDDs of a distributed matrix must be deterministic, because we cache the matrix size.
In general, the use of non-deterministic RDDs can lead to errors.

### RowMatrix

A `RowMatrix` is a row-oriented distributed matrix without meaningful row indices, backed by an RDD
of its rows, where each row is a local vector.
Since each row is represented by a local vector, the number of columns is
limited by the integer range, but it should be much smaller in practice.

<div class="codetabs">
<div data-lang="scala" markdown="1">

A [`RowMatrix`](api/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.html) can be
created from an `RDD[Vector]` instance.  Then we can compute its column summary statistics and decompositions.
[QR decomposition](https://en.wikipedia.org/wiki/QR_decomposition) is of the form A = QR where Q is an orthogonal matrix and R is an upper triangular matrix.
For [singular value decomposition (SVD)](https://en.wikipedia.org/wiki/Singular_value_decomposition) and [principal component analysis (PCA)](https://en.wikipedia.org/wiki/Principal_component_analysis), please refer to [Dimensionality reduction](mllib-dimensionality-reduction.html).

Refer to the [`RowMatrix` Scala docs](api/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.html) for details on the API.

{% highlight scala %}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.rdd.RDD

val rows: RDD[Vector] = ... // an RDD of local vectors
// Create a RowMatrix from an RDD[Vector].
val mat: RowMatrix = new RowMatrix(rows)

// Get its size.
val m = mat.numRows()
val n = mat.numCols()

// QR decomposition
val qrResult = mat.tallSkinnyQR(true)
{% endhighlight %}
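
The returned `QRDecomposition` bundles the two factors; a minimal sketch of reading them back from the result above:

{% highlight scala %}
// Q comes back as a distributed RowMatrix and R as a local upper triangular Matrix.
val q: RowMatrix = qrResult.Q
val r = qrResult.R
{% endhighlight %}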
</div>

<div data-lang="java" markdown="1">

A [`RowMatrix`](api/java/org/apache/spark/mllib/linalg/distributed/RowMatrix.html) can be
created from a `JavaRDD<Vector>` instance.  Then we can compute its column summary statistics.

Refer to the [`RowMatrix` Java docs](api/java/org/apache/spark/mllib/linalg/distributed/RowMatrix.html) for details on the API.

{% highlight java %}
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.linalg.Matrix;
import org.apache.spark.mllib.linalg.QRDecomposition;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.distributed.RowMatrix;

JavaRDD<Vector> rows = ... // a JavaRDD of local vectors
// Create a RowMatrix from a JavaRDD<Vector>.
RowMatrix mat = new RowMatrix(rows.rdd());

// Get its size.
long m = mat.numRows();
long n = mat.numCols();

// QR decomposition
QRDecomposition<RowMatrix, Matrix> result = mat.tallSkinnyQR(true);
{% endhighlight %}
</div>

<div data-lang="python" markdown="1">

A [`RowMatrix`](api/python/pyspark.mllib.html#pyspark.mllib.linalg.distributed.RowMatrix) can be
created from an `RDD` of vectors.

Refer to the [`RowMatrix` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.linalg.distributed.RowMatrix) for more details on the API.

{% highlight python %}
from pyspark.mllib.linalg.distributed import RowMatrix

# Create an RDD of vectors.
rows = sc.parallelize([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])

# Create a RowMatrix from an RDD of vectors.
mat = RowMatrix(rows)

# Get its size.
m = mat.numRows()  # 4
n = mat.numCols()  # 3

# Get the rows as an RDD of vectors again.
rowsRDD = mat.rows
{% endhighlight %}
</div>

</div>

### IndexedRowMatrix

An `IndexedRowMatrix` is similar to a `RowMatrix` but with meaningful row indices.  It is backed by
an RDD of indexed rows, so that each row is represented by its index (long-typed) and a local
vector.

<div class="codetabs">
<div data-lang="scala" markdown="1">

An
[`IndexedRowMatrix`](api/scala/org/apache/spark/mllib/linalg/distributed/IndexedRowMatrix.html)
can be created from an `RDD[IndexedRow]` instance, where
[`IndexedRow`](api/scala/org/apache/spark/mllib/linalg/distributed/IndexedRow.html) is a
wrapper over `(Long, Vector)`.  An `IndexedRowMatrix` can be converted to a `RowMatrix` by dropping
its row indices.

Refer to the [`IndexedRowMatrix` Scala docs](api/scala/org/apache/spark/mllib/linalg/distributed/IndexedRowMatrix.html) for details on the API.

{% highlight scala %}
import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix, RowMatrix}
import org.apache.spark.rdd.RDD

val rows: RDD[IndexedRow] = ... // an RDD of indexed rows
// Create an IndexedRowMatrix from an RDD[IndexedRow].
val mat: IndexedRowMatrix = new IndexedRowMatrix(rows)

// Get its size.
val m = mat.numRows()
val n = mat.numCols()

// Drop its row indices.
val rowMat: RowMatrix = mat.toRowMatrix()
{% endhighlight %}
</div>

<div data-lang="java" markdown="1">

An
[`IndexedRowMatrix`](api/java/org/apache/spark/mllib/linalg/distributed/IndexedRowMatrix.html)
can be created from a `JavaRDD<IndexedRow>` instance, where
[`IndexedRow`](api/java/org/apache/spark/mllib/linalg/distributed/IndexedRow.html) is a
wrapper over `(long, Vector)`.  An `IndexedRowMatrix` can be converted to a `RowMatrix` by dropping
its row indices.

Refer to the [`IndexedRowMatrix` Java docs](api/java/org/apache/spark/mllib/linalg/distributed/IndexedRowMatrix.html) for details on the API.

{% highlight java %}
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.linalg.distributed.IndexedRow;
import org.apache.spark.mllib.linalg.distributed.IndexedRowMatrix;
import org.apache.spark.mllib.linalg.distributed.RowMatrix;

JavaRDD<IndexedRow> rows = ... // a JavaRDD of indexed rows
// Create an IndexedRowMatrix from a JavaRDD<IndexedRow>.
IndexedRowMatrix mat = new IndexedRowMatrix(rows.rdd());

// Get its size.
long m = mat.numRows();
long n = mat.numCols();

// Drop its row indices.
RowMatrix rowMat = mat.toRowMatrix();
{% endhighlight %}
</div>

<div data-lang="python" markdown="1">

An [`IndexedRowMatrix`](api/python/pyspark.mllib.html#pyspark.mllib.linalg.distributed.IndexedRowMatrix)
can be created from an `RDD` of `IndexedRow`s, where
[`IndexedRow`](api/python/pyspark.mllib.html#pyspark.mllib.linalg.distributed.IndexedRow) is a
wrapper over `(long, vector)`.  An `IndexedRowMatrix` can be converted to a `RowMatrix` by dropping
its row indices.

Refer to the [`IndexedRowMatrix` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.linalg.distributed.IndexedRowMatrix) for more details on the API.

{% highlight python %}
from pyspark.mllib.linalg.distributed import IndexedRow, IndexedRowMatrix

# Create an RDD of indexed rows.
#   - This can be done explicitly with the IndexedRow class:
indexedRows = sc.parallelize([IndexedRow(0, [1, 2, 3]),
                              IndexedRow(1, [4, 5, 6]),
                              IndexedRow(2, [7, 8, 9]),
                              IndexedRow(3, [10, 11, 12])])
#   - or by using (long, vector) tuples:
indexedRows = sc.parallelize([(0, [1, 2, 3]), (1, [4, 5, 6]),
                              (2, [7, 8, 9]), (3, [10, 11, 12])])

# Create an IndexedRowMatrix from an RDD of IndexedRows.
mat = IndexedRowMatrix(indexedRows)

# Get its size.
m = mat.numRows()  # 4
n = mat.numCols()  # 3

# Get the rows as an RDD of IndexedRows.
rowsRDD = mat.rows

# Convert to a RowMatrix by dropping the row indices.
rowMat = mat.toRowMatrix()
{% endhighlight %}
</div>

</div>

### CoordinateMatrix

A `CoordinateMatrix` is a distributed matrix backed by an RDD of its entries.  Each entry is a tuple
of `(i: Long, j: Long, value: Double)`, where `i` is the row index, `j` is the column index, and
`value` is the entry value.  A `CoordinateMatrix` should be used only when both
dimensions of the matrix are huge and the matrix is very sparse.

<div class="codetabs">
<div data-lang="scala" markdown="1">

A
[`CoordinateMatrix`](api/scala/org/apache/spark/mllib/linalg/distributed/CoordinateMatrix.html)
can be created from an `RDD[MatrixEntry]` instance, where
[`MatrixEntry`](api/scala/org/apache/spark/mllib/linalg/distributed/MatrixEntry.html) is a
wrapper over `(Long, Long, Double)`.  A `CoordinateMatrix` can be converted to an `IndexedRowMatrix`
with sparse rows by calling `toIndexedRowMatrix`.  Other computations for
`CoordinateMatrix` are not currently supported.

Refer to the [`CoordinateMatrix` Scala docs](api/scala/org/apache/spark/mllib/linalg/distributed/CoordinateMatrix.html) for details on the API.

{% highlight scala %}
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}
import org.apache.spark.rdd.RDD

val entries: RDD[MatrixEntry] = ... // an RDD of matrix entries
// Create a CoordinateMatrix from an RDD[MatrixEntry].
val mat: CoordinateMatrix = new CoordinateMatrix(entries)

// Get its size.
val m = mat.numRows()
val n = mat.numCols()

// Convert it to an IndexedRowMatrix whose rows are sparse vectors.
val indexedRowMatrix = mat.toIndexedRowMatrix()
{% endhighlight %}
</div>

<div data-lang="java" markdown="1">

A
[`CoordinateMatrix`](api/java/org/apache/spark/mllib/linalg/distributed/CoordinateMatrix.html)
can be created from a `JavaRDD<MatrixEntry>` instance, where
[`MatrixEntry`](api/java/org/apache/spark/mllib/linalg/distributed/MatrixEntry.html) is a
wrapper over `(long, long, double)`.  A `CoordinateMatrix` can be converted to an `IndexedRowMatrix`
with sparse rows by calling `toIndexedRowMatrix`.  Other computations for
`CoordinateMatrix` are not currently supported.

Refer to the [`CoordinateMatrix` Java docs](api/java/org/apache/spark/mllib/linalg/distributed/CoordinateMatrix.html) for details on the API.

{% highlight java %}
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.linalg.distributed.CoordinateMatrix;
import org.apache.spark.mllib.linalg.distributed.IndexedRowMatrix;
import org.apache.spark.mllib.linalg.distributed.MatrixEntry;

JavaRDD<MatrixEntry> entries = ... // a JavaRDD of matrix entries
// Create a CoordinateMatrix from a JavaRDD<MatrixEntry>.
CoordinateMatrix mat = new CoordinateMatrix(entries.rdd());

// Get its size.
long m = mat.numRows();
long n = mat.numCols();

// Convert it to an IndexedRowMatrix whose rows are sparse vectors.
IndexedRowMatrix indexedRowMatrix = mat.toIndexedRowMatrix();
{% endhighlight %}
</div>

<div data-lang="python" markdown="1">

A [`CoordinateMatrix`](api/python/pyspark.mllib.html#pyspark.mllib.linalg.distributed.CoordinateMatrix)
can be created from an `RDD` of `MatrixEntry` entries, where
[`MatrixEntry`](api/python/pyspark.mllib.html#pyspark.mllib.linalg.distributed.MatrixEntry) is a
wrapper over `(long, long, float)`.  A `CoordinateMatrix` can be converted to a `RowMatrix` by
calling `toRowMatrix`, or to an `IndexedRowMatrix` with sparse rows by calling `toIndexedRowMatrix`.

Refer to the [`CoordinateMatrix` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.linalg.distributed.CoordinateMatrix) for more details on the API.

{% highlight python %}
from pyspark.mllib.linalg.distributed import CoordinateMatrix, MatrixEntry

# Create an RDD of coordinate entries.
#   - This can be done explicitly with the MatrixEntry class:
entries = sc.parallelize([MatrixEntry(0, 0, 1.2), MatrixEntry(1, 0, 2.1), MatrixEntry(2, 1, 3.7)])
#   - or using (long, long, float) tuples:
entries = sc.parallelize([(0, 0, 1.2), (1, 0, 2.1), (2, 1, 3.7)])

# Create a CoordinateMatrix from an RDD of MatrixEntries.
mat = CoordinateMatrix(entries)

# Get its size.
m = mat.numRows()  # 3
n = mat.numCols()  # 2

# Get the entries as an RDD of MatrixEntries.
entriesRDD = mat.entries

# Convert to a RowMatrix.
rowMat = mat.toRowMatrix()

# Convert to an IndexedRowMatrix.
indexedRowMat = mat.toIndexedRowMatrix()

# Convert to a BlockMatrix.
blockMat = mat.toBlockMatrix()
{% endhighlight %}
</div>

</div>

### BlockMatrix

A `BlockMatrix` is a distributed matrix backed by an RDD of `MatrixBlock`s, where a `MatrixBlock` is
a tuple of `((Int, Int), Matrix)`: the `(Int, Int)` is the index of the block, and `Matrix` is
the sub-matrix at the given index with size `rowsPerBlock` x `colsPerBlock`.
`BlockMatrix` supports methods such as `add` and `multiply` with another `BlockMatrix`.
`BlockMatrix` also has a helper function `validate` which can be used to check whether the
`BlockMatrix` is set up properly.

<div class="codetabs">
<div data-lang="scala" markdown="1">

A [`BlockMatrix`](api/scala/org/apache/spark/mllib/linalg/distributed/BlockMatrix.html) can be
most easily created from an `IndexedRowMatrix` or `CoordinateMatrix` by calling `toBlockMatrix`.
`toBlockMatrix` creates blocks of size 1024 x 1024 by default.
Users may change the block size by supplying the values through `toBlockMatrix(rowsPerBlock, colsPerBlock)`.

Refer to the [`BlockMatrix` Scala docs](api/scala/org/apache/spark/mllib/linalg/distributed/BlockMatrix.html) for details on the API.

{% highlight scala %}
import org.apache.spark.mllib.linalg.distributed.{BlockMatrix, CoordinateMatrix, MatrixEntry}
import org.apache.spark.rdd.RDD

val entries: RDD[MatrixEntry] = ... // an RDD of (i, j, v) matrix entries
// Create a CoordinateMatrix from an RDD[MatrixEntry].
val coordMat: CoordinateMatrix = new CoordinateMatrix(entries)
// Transform the CoordinateMatrix to a BlockMatrix.
val matA: BlockMatrix = coordMat.toBlockMatrix().cache()

// Validate whether the BlockMatrix is set up properly. Throws an Exception when it is not valid.
// Nothing happens if it is valid.
matA.validate()

// Calculate A^T A.
val ata = matA.transpose.multiply(matA)
{% endhighlight %}
</div>

<div data-lang="java" markdown="1">

A [`BlockMatrix`](api/java/org/apache/spark/mllib/linalg/distributed/BlockMatrix.html) can be
most easily created from an `IndexedRowMatrix` or `CoordinateMatrix` by calling `toBlockMatrix`.
`toBlockMatrix` creates blocks of size 1024 x 1024 by default.
Users may change the block size by supplying the values through `toBlockMatrix(rowsPerBlock, colsPerBlock)`.

Refer to the [`BlockMatrix` Java docs](api/java/org/apache/spark/mllib/linalg/distributed/BlockMatrix.html) for details on the API.

{% highlight java %}
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.linalg.distributed.BlockMatrix;
import org.apache.spark.mllib.linalg.distributed.CoordinateMatrix;
import org.apache.spark.mllib.linalg.distributed.MatrixEntry;

JavaRDD<MatrixEntry> entries = ... // a JavaRDD of (i, j, v) matrix entries
// Create a CoordinateMatrix from a JavaRDD<MatrixEntry>.
CoordinateMatrix coordMat = new CoordinateMatrix(entries.rdd());
// Transform the CoordinateMatrix to a BlockMatrix.
BlockMatrix matA = coordMat.toBlockMatrix().cache();

// Validate whether the BlockMatrix is set up properly. Throws an Exception when it is not valid.
// Nothing happens if it is valid.
matA.validate();

// Calculate A^T A.
BlockMatrix ata = matA.transpose().multiply(matA);
{% endhighlight %}
</div>

<div data-lang="python" markdown="1">

A [`BlockMatrix`](api/python/pyspark.mllib.html#pyspark.mllib.linalg.distributed.BlockMatrix)
can be created from an `RDD` of sub-matrix blocks, where a sub-matrix block is a
`((blockRowIndex, blockColIndex), sub-matrix)` tuple.

Refer to the [`BlockMatrix` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.linalg.distributed.BlockMatrix) for more details on the API.

{% highlight python %}
from pyspark.mllib.linalg import Matrices
from pyspark.mllib.linalg.distributed import BlockMatrix

# Create an RDD of sub-matrix blocks.
blocks = sc.parallelize([((0, 0), Matrices.dense(3, 2, [1, 2, 3, 4, 5, 6])),
                         ((1, 0), Matrices.dense(3, 2, [7, 8, 9, 10, 11, 12]))])

# Create a BlockMatrix from an RDD of sub-matrix blocks.
mat = BlockMatrix(blocks, 3, 2)

# Get its size.
m = mat.numRows()  # 6
n = mat.numCols()  # 2

# Get the blocks as an RDD of sub-matrix blocks.
blocksRDD = mat.blocks

# Convert to a LocalMatrix.
localMat = mat.toLocalMatrix()

# Convert to an IndexedRowMatrix.
indexedRowMat = mat.toIndexedRowMatrix()

# Convert to a CoordinateMatrix.
coordinateMat = mat.toCoordinateMatrix()
{% endhighlight %}
</div>
</div>