---
layout: global
title: Linear Methods - RDD-based API
displayTitle: Linear Methods - RDD-based API
license: |
  Licensed to the Apache Software Foundation (ASF) under one or more
  contributor license agreements.  See the NOTICE file distributed with
  this work for additional information regarding copyright ownership.
  The ASF licenses this file to You under the Apache License, Version 2.0
  (the "License"); you may not use this file except in compliance with
  the License.  You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.
---

* Table of contents
{:toc}


`\[
\newcommand{\R}{\mathbb{R}}
\newcommand{\E}{\mathbb{E}}
\newcommand{\x}{\mathbf{x}}
\newcommand{\y}{\mathbf{y}}
\newcommand{\wv}{\mathbf{w}}
\newcommand{\av}{\mathbf{\alpha}}
\newcommand{\bv}{\mathbf{b}}
\newcommand{\N}{\mathbb{N}}
\newcommand{\id}{\mathbf{I}}
\newcommand{\ind}{\mathbf{1}}
\newcommand{\0}{\mathbf{0}}
\newcommand{\unit}{\mathbf{e}}
\newcommand{\one}{\mathbf{1}}
\newcommand{\zero}{\mathbf{0}}
\]`

## Mathematical formulation

Many standard *machine learning* methods can be formulated as a convex optimization problem, i.e.
the task of finding a minimizer of a convex function `$f$` that depends on a variable vector
`$\wv$` (called `weights` in the code), which has `$d$` entries.
Formally, we can write this as the optimization problem `$\min_{\wv \in\R^d} \; f(\wv)$`, where
the objective function is of the form
`\begin{equation}
    f(\wv) := \lambda\, R(\wv) +
    \frac1n \sum_{i=1}^n L(\wv;\x_i,y_i)
    \label{eq:regPrimal}
    \ .
\end{equation}`
Here the vectors `$\x_i\in\R^d$` are the training data examples, for `$1\le i\le n$`, and
`$y_i\in\R$` are their corresponding labels, which we want to predict.
We call the method *linear* if $L(\wv; \x, y)$ can be expressed as a function of $\wv^T \x$ and $y$.
Several of `spark.mllib`'s classification and regression algorithms fall into this category,
and are discussed here.

The objective function `$f$` has two parts:
the regularizer that controls the complexity of the model,
and the loss that measures the error of the model on the training data.
The loss function `$L(\wv;.)$` is typically a convex function in `$\wv$`.  The
fixed regularization parameter `$\lambda \ge 0$` (`regParam` in the code)
defines the trade-off between the two goals of minimizing the loss (i.e.,
training error) and minimizing model complexity (i.e., to avoid overfitting).
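
As a concrete instance of `$\eqref{eq:regPrimal}$`, taking the L2 regularizer and the logistic loss (both defined in the tables below) yields the objective of L2-regularized logistic regression:
`\[
f(\wv) = \frac{\lambda}{2} \|\wv\|_2^2 + \frac1n \sum_{i=1}^n \log\left(1+\exp(-y_i \wv^T \x_i)\right).
\]`
Changing the loss or the regularizer in this template produces the other linear methods discussed below.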

### Loss functions

The following table summarizes the loss functions and their gradients or sub-gradients for the
methods `spark.mllib` supports:

<table class="table">
  <thead>
    <tr><th></th><th>loss function $L(\wv; \x, y)$</th><th>gradient or sub-gradient</th></tr>
  </thead>
  <tbody>
    <tr>
      <td>hinge loss</td><td>$\max \{0, 1-y \wv^T \x \}, \quad y \in \{-1, +1\}$</td>
      <td>$\begin{cases}-y \cdot \x &amp; \text{if $y \wv^T \x &lt;1$}, \\ 0 &amp;
\text{otherwise}.\end{cases}$</td>
    </tr>
    <tr>
      <td>logistic loss</td><td>$\log(1+\exp( -y \wv^T \x)), \quad y \in \{-1, +1\}$</td>
      <td>$-y \left(1-\frac1{1+\exp(-y \wv^T \x)} \right) \cdot \x$</td>
    </tr>
    <tr>
      <td>squared loss</td><td>$\frac{1}{2} (\wv^T \x - y)^2, \quad y \in \R$</td>
      <td>$(\wv^T \x - y) \cdot \x$</td>
    </tr>
  </tbody>
</table>

Note that, in the mathematical formulation above, a binary label $y$ is denoted as either
$+1$ (positive) or $-1$ (negative), which is convenient for the formulation.
*However*, the negative label is represented by $0$ in `spark.mllib` instead of $-1$, to be consistent with
multiclass labeling.
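
For instance, a minimal sketch of constructing binary training examples that follow this 0/1 label convention (the feature values are purely illustrative):

{% highlight scala %}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// A positive example is labeled 1.0 and a negative example 0.0;
// the feature values below are arbitrary placeholders.
val positive = LabeledPoint(1.0, Vectors.dense(0.5, 1.2, -0.3))
val negative = LabeledPoint(0.0, Vectors.dense(-1.0, 0.1, 0.7))
{% endhighlight %}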

### Regularizers

The purpose of the
[regularizer](http://en.wikipedia.org/wiki/Regularization_(mathematics)) is to
encourage simple models and avoid overfitting.  We support the following
regularizers in `spark.mllib`:

<table class="table">
  <thead>
    <tr><th></th><th>regularizer $R(\wv)$</th><th>gradient or sub-gradient</th></tr>
  </thead>
  <tbody>
    <tr>
      <td>zero (unregularized)</td><td>0</td><td>$\0$</td>
    </tr>
    <tr>
      <td>L2</td><td>$\frac{1}{2}\|\wv\|_2^2$</td><td>$\wv$</td>
    </tr>
    <tr>
      <td>L1</td><td>$\|\wv\|_1$</td><td>$\mathrm{sign}(\wv)$</td>
    </tr>
    <tr>
      <td>elastic net</td><td>$\alpha \|\wv\|_1 + (1-\alpha)\frac{1}{2}\|\wv\|_2^2$</td><td>$\alpha \mathrm{sign}(\wv) + (1-\alpha) \wv$</td>
    </tr>
  </tbody>
</table>

Here `$\mathrm{sign}(\wv)$` is the vector consisting of the signs (`$\pm1$`) of all the entries
of `$\wv$`.

L2-regularized problems are generally easier to solve than L1-regularized ones because the L2 penalty is smooth.
However, L1 regularization can promote sparsity in the weights, leading to smaller and more interpretable models; such sparse models can also be useful for feature selection.
[Elastic net](http://en.wikipedia.org/wiki/Elastic_net_regularization) is a combination of L1 and L2 regularization. It is not recommended to train models without any regularization,
especially when the number of training examples is small.

### Optimization

Under the hood, linear methods use convex optimization methods to optimize the objective functions.
`spark.mllib` uses two methods, SGD and L-BFGS, described in the [optimization section](mllib-optimization.html).
Currently, most algorithm APIs support Stochastic Gradient Descent (SGD), and a few support L-BFGS.
Refer to [this optimization section](mllib-optimization.html#Choosing-an-Optimization-Method) for guidelines on choosing between optimization methods.

## Classification

[Classification](http://en.wikipedia.org/wiki/Statistical_classification) aims to divide items into
categories.
The most common classification type is
[binary classification](http://en.wikipedia.org/wiki/Binary_classification), where there are two
categories, usually named positive and negative.
If there are more than two categories, it is called
[multiclass classification](http://en.wikipedia.org/wiki/Multiclass_classification).
`spark.mllib` supports two linear methods for classification: linear Support Vector Machines (SVMs)
and logistic regression.
Linear SVMs support only binary classification, while logistic regression supports both binary and
multiclass classification problems.
For both methods, `spark.mllib` supports L1 and L2 regularized variants.
The training data set is represented by an RDD of [LabeledPoint](mllib-data-types.html#labeled-point) in MLlib,
where labels are class indices starting from zero: $0, 1, 2, \ldots$.
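
For example, such an RDD is commonly obtained by loading a dataset in LIBSVM format; the path below points to the sample dataset shipped with Spark, and an existing `SparkContext` named `sc` is assumed:

{% highlight scala %}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.rdd.RDD

// Load training data in LIBSVM format as an RDD[LabeledPoint];
// the labels in the file are class indices 0, 1, 2, ...
val data: RDD[LabeledPoint] =
  MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
{% endhighlight %}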

### Linear Support Vector Machines (SVMs)

The [linear SVM](http://en.wikipedia.org/wiki/Support_vector_machine#Linear_SVM)
is a standard method for large-scale classification tasks. It is a linear method as described above in equation `$\eqref{eq:regPrimal}$`, with the loss function in the formulation given by the hinge loss:

`\[
L(\wv;\x,y) := \max \{0, 1-y \wv^T \x \}.
\]`
By default, linear SVMs are trained with L2 regularization.
We also support L1 regularization as an alternative. In this case,
the problem becomes a [linear program](http://en.wikipedia.org/wiki/Linear_programming).

The linear SVM algorithm outputs an SVM model. Given a new data point,
denoted by $\x$, the model makes predictions based on the value of $\wv^T \x$.
By default, if $\wv^T \x \geq 0$ then the outcome is positive, and negative
otherwise.
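
The decision threshold can be adjusted or removed on a trained model. A minimal sketch, assuming `model` is a trained `SVMModel` (as produced in the examples below) and `testPoint` is a feature `Vector`:

{% highlight scala %}
// Lower the decision threshold so that more points are classified as positive.
model.setThreshold(-0.5)

// Remove the threshold entirely: predict() then returns the raw margin w^T x
// instead of a 0/1 class label, which is useful for ranking or custom thresholds.
model.clearThreshold()
val rawScore = model.predict(testPoint)
{% endhighlight %}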

**Examples**

<div class="codetabs">

<div data-lang="scala" markdown="1">
The following code snippet illustrates how to load a sample dataset, execute a
training algorithm on this training data using a static method in the algorithm
object, and make predictions with the resulting model to compute the training
error.

Refer to the [`SVMWithSGD` Scala docs](api/scala/org/apache/spark/mllib/classification/SVMWithSGD.html) and [`SVMModel` Scala docs](api/scala/org/apache/spark/mllib/classification/SVMModel.html) for details on the API.

{% include_example scala/org/apache/spark/examples/mllib/SVMWithSGDExample.scala %}

The `SVMWithSGD.train()` method by default performs L2 regularization with the
regularization parameter set to 1.0. If we want to configure this algorithm, we
can customize `SVMWithSGD` further by creating a new object directly and
calling setter methods. All other `spark.mllib` algorithms support customization in
this way as well. For example, the following code produces an L1 regularized
variant of SVMs with regularization parameter set to 0.1, and runs the training
algorithm for 200 iterations.

{% highlight scala %}
import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.optimization.L1Updater

// Configure the optimizer directly: 200 iterations, L1 regularization with regParam 0.1.
val svmAlg = new SVMWithSGD()
svmAlg.optimizer
  .setNumIterations(200)
  .setRegParam(0.1)
  .setUpdater(new L1Updater)
val modelL1 = svmAlg.run(training)
{% endhighlight %}

</div>

<div data-lang="java" markdown="1">
All of MLlib's methods use Java-friendly types, so you can import and call them there the same
way you do in Scala. The only caveat is that the methods take Scala RDD objects, while the
Spark Java API uses a separate `JavaRDD` class. You can convert a Java RDD to a Scala one by
calling `.rdd()` on your `JavaRDD` object. A self-contained application example
that is equivalent to the provided example in Scala is given below:

Refer to the [`SVMWithSGD` Java docs](api/java/org/apache/spark/mllib/classification/SVMWithSGD.html) and [`SVMModel` Java docs](api/java/org/apache/spark/mllib/classification/SVMModel.html) for details on the API.

{% include_example java/org/apache/spark/examples/mllib/JavaSVMWithSGDExample.java %}

The `SVMWithSGD.train()` method by default performs L2 regularization with the
regularization parameter set to 1.0. If we want to configure this algorithm, we
can customize `SVMWithSGD` further by creating a new object directly and
calling setter methods. All other `spark.mllib` algorithms support customization in
this way as well. For example, the following code produces an L1 regularized
variant of SVMs with regularization parameter set to 0.1, and runs the training
algorithm for 200 iterations.

{% highlight java %}
import org.apache.spark.mllib.classification.SVMModel;
import org.apache.spark.mllib.classification.SVMWithSGD;
import org.apache.spark.mllib.optimization.L1Updater;

// Configure the optimizer directly: 200 iterations, L1 regularization with regParam 0.1.
SVMWithSGD svmAlg = new SVMWithSGD();
svmAlg.optimizer()
  .setNumIterations(200)
  .setRegParam(0.1)
  .setUpdater(new L1Updater());
SVMModel modelL1 = svmAlg.run(training.rdd());
{% endhighlight %}

In order to run the above application, follow the instructions
provided in the [Self-Contained
Applications](quick-start.html#self-contained-applications) section of the Spark
quick-start guide. Be sure to also include *spark-mllib* in your build file as
a dependency.
</div>

<div data-lang="python" markdown="1">
The following example shows how to load a sample dataset, build an SVM model,
and make predictions with the resulting model to compute the training error.

Refer to the [`SVMWithSGD` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.classification.SVMWithSGD) and [`SVMModel` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.classification.SVMModel) for more details on the API.

{% include_example python/mllib/svm_with_sgd_example.py %}
</div>
</div>

### Logistic regression

[Logistic regression](http://en.wikipedia.org/wiki/Logistic_regression) is widely used to predict a
binary response. It is a linear method as described above in equation `$\eqref{eq:regPrimal}$`,
with the loss function in the formulation given by the logistic loss:
`\[
L(\wv;\x,y) :=  \log(1+\exp( -y \wv^T \x)).
\]`

For binary classification problems, the algorithm outputs a binary logistic regression model.
Given a new data point, denoted by $\x$, the model makes predictions by
applying the logistic function
`\[
\mathrm{f}(z) = \frac{1}{1 + e^{-z}}
\]`
where $z = \wv^T \x$.
By default, if $\mathrm{f}(\wv^T \x) > 0.5$, the outcome is positive, and negative
otherwise. Unlike linear SVMs, the raw output of the logistic regression
model, $\mathrm{f}(z)$, has a probabilistic interpretation (i.e., the probability
that $\x$ is positive).
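
If the probability itself is needed rather than the 0/1 prediction, the threshold on a trained binary model can be cleared so that `predict` returns the raw probability. A minimal sketch, assuming `model` is a trained binary `LogisticRegressionModel` and `features` is a `Vector`:

{% highlight scala %}
// With the default threshold of 0.5, predict() returns the class label 0.0 or 1.0.
val label = model.predict(features)

// After clearing the threshold, predict() returns f(w^T x), i.e. the estimated
// probability that the example is positive (binary models only).
model.clearThreshold()
val probability = model.predict(features)
{% endhighlight %}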

Binary logistic regression can be generalized into
[multinomial logistic regression](http://en.wikipedia.org/wiki/Multinomial_logistic_regression) to
train and predict multiclass classification problems.
For example, for $K$ possible outcomes, one of the outcomes can be chosen as a "pivot", and the
other $K - 1$ outcomes can be separately regressed against the pivot outcome.
In `spark.mllib`, the first class $0$ is chosen as the "pivot" class.
See Section 4.4 of
[The Elements of Statistical Learning](http://statweb.stanford.edu/~tibs/ElemStatLearn/) for
references.
Here is a
[detailed mathematical derivation](http://www.slideshare.net/dbtsai/2014-0620-mlor-36132297).

For multiclass classification problems, the algorithm will output a multinomial logistic regression
model, which contains $K - 1$ binary logistic regression models regressed against the first class.
Given a new data point, $K - 1$ models will be run, and the class with the largest probability will be
chosen as the predicted class.

We implemented two algorithms to solve logistic regression: mini-batch gradient descent and L-BFGS.
We recommend L-BFGS over mini-batch gradient descent for faster convergence.
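
For example, a sketch of configuring L-BFGS-based multinomial logistic regression for a dataset with 10 classes (the class count is illustrative, and `training` is assumed to be an `RDD[LabeledPoint]` with labels 0 through 9):

{% highlight scala %}
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS

// Train a multinomial logistic regression model with K = 10 classes.
// Internally, K - 1 binary models are fit against the pivot class 0.
val model = new LogisticRegressionWithLBFGS()
  .setNumClasses(10)
  .run(training)
{% endhighlight %}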

**Examples**

<div class="codetabs">

<div data-lang="scala" markdown="1">
The following code illustrates how to load a sample multiclass dataset, split it into train and
test, and use
[LogisticRegressionWithLBFGS](api/scala/org/apache/spark/mllib/classification/LogisticRegressionWithLBFGS.html)
to fit a logistic regression model.
Then the model is evaluated against the test dataset and saved to disk.

Refer to the [`LogisticRegressionWithLBFGS` Scala docs](api/scala/org/apache/spark/mllib/classification/LogisticRegressionWithLBFGS.html) and [`LogisticRegressionModel` Scala docs](api/scala/org/apache/spark/mllib/classification/LogisticRegressionModel.html) for details on the API.

{% include_example scala/org/apache/spark/examples/mllib/LogisticRegressionWithLBFGSExample.scala %}

</div>

<div data-lang="java" markdown="1">
The following code illustrates how to load a sample multiclass dataset, split it into train and
test, and use
[LogisticRegressionWithLBFGS](api/java/org/apache/spark/mllib/classification/LogisticRegressionWithLBFGS.html)
to fit a logistic regression model.
Then the model is evaluated against the test dataset and saved to disk.

Refer to the [`LogisticRegressionWithLBFGS` Java docs](api/java/org/apache/spark/mllib/classification/LogisticRegressionWithLBFGS.html) and [`LogisticRegressionModel` Java docs](api/java/org/apache/spark/mllib/classification/LogisticRegressionModel.html) for details on the API.

{% include_example java/org/apache/spark/examples/mllib/JavaLogisticRegressionWithLBFGSExample.java %}
</div>

<div data-lang="python" markdown="1">
The following example shows how to load a sample dataset, build a logistic regression model,
and make predictions with the resulting model to compute the training error.

Note that the Python API does not yet support multiclass classification and model save/load but
will in the future.

Refer to the [`LogisticRegressionWithLBFGS` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.classification.LogisticRegressionWithLBFGS) and [`LogisticRegressionModel` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.classification.LogisticRegressionModel) for more details on the API.

{% include_example python/mllib/logistic_regression_with_lbfgs_example.py %}
</div>
</div>

## Regression

### Linear least squares, Lasso, and ridge regression

Linear least squares is the most common formulation for regression problems.
It is a linear method as described above in equation `$\eqref{eq:regPrimal}$`, with the loss
function in the formulation given by the squared loss:
`\[
L(\wv;\x,y) :=  \frac{1}{2} (\wv^T \x - y)^2.
\]`

Various related regression methods are derived by using different types of regularization:
[*ordinary least squares*](http://en.wikipedia.org/wiki/Ordinary_least_squares) or
[*linear least squares*](http://en.wikipedia.org/wiki/Linear_least_squares_(mathematics)) uses
no regularization; [*ridge regression*](http://en.wikipedia.org/wiki/Ridge_regression) uses L2
regularization; and [*Lasso*](http://en.wikipedia.org/wiki/Lasso_(statistics)) uses L1
regularization.  For all of these models, the average loss or training error, $\frac{1}{n} \sum_{i=1}^n (\wv^T \x_i - y_i)^2$, is
known as the [mean squared error](http://en.wikipedia.org/wiki/Mean_squared_error).
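
As an illustration, the training mean squared error can be computed from a fitted model as follows. This is a sketch that assumes `model` is a trained `LinearRegressionModel` and `parsedData` is the training `RDD[LabeledPoint]`:

{% highlight scala %}
// Predict on the training data and pair each true label with its prediction.
val valuesAndPreds = parsedData.map { point =>
  (point.label, model.predict(point.features))
}

// Average the squared differences to obtain the mean squared error.
val MSE = valuesAndPreds.map { case (v, p) => math.pow(v - p, 2) }.mean()
println(s"training Mean Squared Error = $MSE")
{% endhighlight %}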

### Streaming linear regression

When data arrive in a streaming fashion, it is useful to fit regression models online,
updating the parameters of the model as new data arrive. `spark.mllib` currently supports
streaming linear regression using ordinary least squares. The fitting is similar
to that performed offline, except fitting occurs on each batch of data, so that
the model continually updates to reflect the data from the stream.

**Examples**

The following example demonstrates how to load training and testing data from two different
input streams of text files, parse the streams as labeled points, fit a linear regression model
online to the first stream, and make predictions on the second stream.

<div class="codetabs">

<div data-lang="scala" markdown="1">

First, we import the necessary classes for parsing our input data and creating the model.

Then we make input streams for training and testing data. We assume a StreamingContext `ssc`
has already been created; see the [Spark Streaming Programming Guide](streaming-programming-guide.html#initializing)
for more info. For this example, we use labeled points in training and testing streams,
but in practice you will likely want to use unlabeled vectors for test data.

We create our model by initializing the weights to zero, register the streams for training and
testing, and then start the job, as sketched below. Printing predictions alongside true labels lets us easily see the
result.
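
A minimal sketch of these steps, assuming `trainingData` and `testData` are `DStream[LabeledPoint]`s parsed from the two input streams and that the feature vectors have three dimensions (matching the text-file format described below):

{% highlight scala %}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.StreamingLinearRegressionWithSGD

// Initialize the weights to zero (one weight per feature) and wire up the streams.
val model = new StreamingLinearRegressionWithSGD()
  .setInitialWeights(Vectors.zeros(3))

model.trainOn(trainingData)
model.predictOnValues(testData.map(lp => (lp.label, lp.features))).print()

ssc.start()
ssc.awaitTermination()
{% endhighlight %}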

Finally, we can save text files with data to the training or testing folders.
Each line should be a data point formatted as `(y,[x1,x2,x3])` where `y` is the label
and `x1,x2,x3` are the features. Anytime a text file is placed in `args(0)`
the model will update. Anytime a text file is placed in `args(1)` you will see predictions.
As you feed more data to the training directory, the predictions
will get better!

Here is a complete example:
{% include_example scala/org/apache/spark/examples/mllib/StreamingLinearRegressionExample.scala %}

</div>

<div data-lang="python" markdown="1">

First, we import the necessary classes for parsing our input data and creating the model.

Then we make input streams for training and testing data. We assume a StreamingContext `ssc`
has already been created; see the [Spark Streaming Programming Guide](streaming-programming-guide.html#initializing)
for more info. For this example, we use labeled points in training and testing streams,
but in practice you will likely want to use unlabeled vectors for test data.

We create our model by initializing the weights to zero.

Now we register the streams for training and testing and start the job.

We can now save text files with data to the training or testing folders.
Each line should be a data point formatted as `(y,[x1,x2,x3])` where `y` is the label
and `x1,x2,x3` are the features. Anytime a text file is placed in `sys.argv[1]`
the model will update. Anytime a text file is placed in `sys.argv[2]` you will see predictions.
As you feed more data to the training directory, the predictions
will get better!

Here is a complete example:
{% include_example python/mllib/streaming_linear_regression_example.py %}

</div>

</div>


## Implementation (developer)

Behind the scenes, `spark.mllib` implements a simple distributed version of stochastic gradient descent
(SGD), building on the underlying gradient descent primitive (as described in the <a
href="mllib-optimization.html">optimization</a> section).  All provided algorithms take as input a
regularization parameter (`regParam`) along with various parameters associated with stochastic
gradient descent (`stepSize`, `numIterations`, `miniBatchFraction`).  For each of them, we support
all three possible regularizations (none, L1, or L2).

For logistic regression, an [L-BFGS](api/scala/org/apache/spark/mllib/optimization/LBFGS.html)
version is implemented in [LogisticRegressionWithLBFGS](api/scala/org/apache/spark/mllib/classification/LogisticRegressionWithLBFGS.html). This
version supports both binary and multinomial logistic regression, while the SGD version supports only
binary logistic regression. However, the L-BFGS version does not support L1 regularization, whereas the SGD version
does. When L1 regularization is not required, the L-BFGS version is strongly
recommended since it converges faster and more accurately than SGD by approximating the
inverse Hessian matrix using a quasi-Newton method.
0448 
0449 Algorithms are all implemented in Scala:
0450 
0451 * [SVMWithSGD](api/scala/org/apache/spark/mllib/classification/SVMWithSGD.html)
0452 * [LogisticRegressionWithLBFGS](api/scala/org/apache/spark/mllib/classification/LogisticRegressionWithLBFGS.html)
0453 * [LogisticRegressionWithSGD](api/scala/org/apache/spark/mllib/classification/LogisticRegressionWithSGD.html)
0454 * [LinearRegressionWithSGD](api/scala/org/apache/spark/mllib/regression/LinearRegressionWithSGD.html)
0455 * [RidgeRegressionWithSGD](api/scala/org/apache/spark/mllib/regression/RidgeRegressionWithSGD.html)
0456 * [LassoWithSGD](api/scala/org/apache/spark/mllib/regression/LassoWithSGD.html)
0457