---
layout: global
title: Evaluation Metrics - RDD-based API
displayTitle: Evaluation Metrics - RDD-based API
license: |
  Licensed to the Apache Software Foundation (ASF) under one or more
  contributor license agreements.  See the NOTICE file distributed with
  this work for additional information regarding copyright ownership.
  The ASF licenses this file to You under the Apache License, Version 2.0
  (the "License"); you may not use this file except in compliance with
  the License.  You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.
---

* Table of contents
{:toc}

`spark.mllib` comes with a number of machine learning algorithms that can be used to learn from and make predictions
on data. When these algorithms are applied to build machine learning models, there is a need to evaluate the performance
of the model on some criteria, which depends on the application and its requirements. `spark.mllib` also provides a
suite of metrics for evaluating the performance of machine learning models.

Specific machine learning algorithms fall under broader types of machine learning applications like classification,
regression, clustering, etc. Each of these types has well-established metrics for performance evaluation, and the
metrics that are currently available in `spark.mllib` are detailed in this section.

## Classification model evaluation

While there are many different types of classification algorithms, the evaluation of classification models follows
similar principles across all of them. In a [supervised classification problem](https://en.wikipedia.org/wiki/Statistical_classification),
there exists a true output and a model-generated predicted output for each data point. For this reason, the result for
each data point can be assigned to one of four categories:

* True Positive (TP) - label is positive and prediction is also positive
* True Negative (TN) - label is negative and prediction is also negative
* False Positive (FP) - label is negative but prediction is positive
* False Negative (FN) - label is positive but prediction is negative

These four numbers are the building blocks for most classifier evaluation metrics. A fundamental point when considering
classifier evaluation is that pure accuracy (i.e. was the prediction correct or incorrect) is not generally a good metric,
because a dataset may be highly unbalanced. For example, if a model is designed to predict fraud from
a dataset where 95% of the data points are _not fraud_ and 5% of the data points are _fraud_, then a naive classifier
that predicts _not fraud_, regardless of input, will be 95% accurate. For this reason, metrics like
[precision and recall](https://en.wikipedia.org/wiki/Precision_and_recall) are typically used because they take into
account the *type* of error. In most applications there is some desired balance between precision and recall, which can
be captured by combining the two into a single metric, called the [F-measure](https://en.wikipedia.org/wiki/F1_score).

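
As a rough illustration of why accuracy alone can mislead on an unbalanced dataset like the fraud example above, the
following sketch computes accuracy, precision, recall, and the F-measure from a set of hypothetical confusion counts
(the counts are invented for illustration; they do not come from any real model):

```scala
// Hypothetical confusion counts for a weak fraud classifier on a 95% / 5% dataset
// (1000 transactions, 50 of which are fraud). These numbers are illustrative only.
val (tp, fp, fn, tn) = (10.0, 5.0, 40.0, 945.0)

val accuracy  = (tp + tn) / (tp + tn + fp + fn)  // 0.955: looks good despite missing most fraud
val precision = tp / (tp + fp)                   // ~0.67
val recall    = tp / (tp + fn)                   // 0.2: only 1 in 5 fraud cases is caught
val beta      = 1.0
val fMeasure  = (1 + beta * beta) * precision * recall / (beta * beta * precision + recall)
```
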
### Binary classification

[Binary classifiers](https://en.wikipedia.org/wiki/Binary_classification) are used to separate the elements of a given
dataset into one of two possible groups (e.g. fraud or not fraud); binary classification is a special case of multiclass
classification. Most binary classification metrics can be generalized to multiclass classification metrics.

#### Threshold tuning

It is important to understand that many classification models actually output a "score" (often a probability) for
each class, where a higher score indicates higher likelihood. In the binary case, the model may output a probability for
each class: $P(Y=1|X)$ and $P(Y=0|X)$. Instead of simply taking the higher probability, in some cases the model might
need to be tuned so that it only predicts a class when the probability is very high (e.g. only block a
credit card transaction if the model predicts fraud with >90% probability). Therefore, there is a prediction *threshold*
which determines what the predicted class will be based on the probabilities that the model outputs.

Tuning the prediction threshold will change the precision and recall of the model and is an important part of model
optimization. In order to visualize how precision, recall, and other metrics change as a function of the threshold, it is
common practice to plot competing metrics against one another, parameterized by threshold. A P-R curve plots (precision,
recall) points for different threshold values, while a
[receiver operating characteristic](https://en.wikipedia.org/wiki/Receiver_operating_characteristic), or ROC, curve
plots (recall, false positive rate) points.

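
As a rough sketch of what threshold tuning can look like with `BinaryClassificationMetrics`, the snippet below picks
the threshold that maximizes the F-measure. It assumes a trained `LogisticRegressionModel` named `model` and a test set
`test` of `LabeledPoint`s already exist (both names are placeholders, not part of the API):

```scala
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

// Assumes `model: LogisticRegressionModel` and `test: RDD[LabeledPoint]` are already defined.
model.clearThreshold()  // make predict() return raw scores rather than 0/1 labels

val scoreAndLabels = test.map(point => (model.predict(point.features), point.label))
val metrics = new BinaryClassificationMetrics(scoreAndLabels)

// fMeasureByThreshold() yields (threshold, F-measure) pairs; keep the best threshold.
val bestThreshold = metrics
  .fMeasureByThreshold()
  .reduce((a, b) => if (a._2 > b._2) a else b)
  ._1
model.setThreshold(bestThreshold)
```
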
**Available metrics**

<table class="table">
  <thead>
    <tr><th>Metric</th><th>Definition</th></tr>
  </thead>
  <tbody>
    <tr>
      <td>Precision (Positive Predictive Value)</td>
      <td>$PPV=\frac{TP}{TP + FP}$</td>
    </tr>
    <tr>
      <td>Recall (True Positive Rate)</td>
      <td>$TPR=\frac{TP}{P}=\frac{TP}{TP + FN}$</td>
    </tr>
    <tr>
      <td>F-measure</td>
      <td>$F(\beta) = \left(1 + \beta^2\right) \cdot \left(\frac{PPV \cdot TPR}
          {\beta^2 \cdot PPV + TPR}\right)$</td>
    </tr>
    <tr>
      <td>Receiver Operating Characteristic (ROC)</td>
      <td>$FPR(T)=\int^\infty_{T} P_0(T)\,dT \\ TPR(T)=\int^\infty_{T} P_1(T)\,dT$</td>
    </tr>
    <tr>
      <td>Area Under ROC Curve</td>
      <td>$AUROC=\int^1_{0} \frac{TP}{P} d\left(\frac{FP}{N}\right)$</td>
    </tr>
    <tr>
      <td>Area Under Precision-Recall Curve</td>
      <td>$AUPRC=\int^1_{0} \frac{TP}{TP+FP} d\left(\frac{TP}{P}\right)$</td>
    </tr>
  </tbody>
</table>


**Examples**

<div class="codetabs">
The following code snippets illustrate how to load a sample dataset, train a binary classification algorithm on the
data, and evaluate the performance of the algorithm by several binary evaluation metrics.

<div data-lang="scala" markdown="1">
Refer to the [`LogisticRegressionWithLBFGS` Scala docs](api/scala/org/apache/spark/mllib/classification/LogisticRegressionWithLBFGS.html) and [`BinaryClassificationMetrics` Scala docs](api/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.html) for details on the API.

{% include_example scala/org/apache/spark/examples/mllib/BinaryClassificationMetricsExample.scala %}

</div>

<div data-lang="java" markdown="1">
Refer to the [`LogisticRegressionModel` Java docs](api/java/org/apache/spark/mllib/classification/LogisticRegressionModel.html) and [`LogisticRegressionWithLBFGS` Java docs](api/java/org/apache/spark/mllib/classification/LogisticRegressionWithLBFGS.html) for details on the API.

{% include_example java/org/apache/spark/examples/mllib/JavaBinaryClassificationMetricsExample.java %}

</div>

<div data-lang="python" markdown="1">
Refer to the [`BinaryClassificationMetrics` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.evaluation.BinaryClassificationMetrics) and [`LogisticRegressionWithLBFGS` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.classification.LogisticRegressionWithLBFGS) for more details on the API.

{% include_example python/mllib/binary_classification_metrics_example.py %}
</div>
</div>


### Multiclass classification

A [multiclass classification](https://en.wikipedia.org/wiki/Multiclass_classification) problem is one in which there
are $M \gt 2$ possible labels for each data point (the case where $M=2$ is the binary
classification problem). For example, classifying handwriting samples as the digits 0 to 9 is a problem with 10 possible classes.

For multiclass metrics, the notion of positives and negatives is slightly different. Predictions and labels can still
be positive or negative, but they must be considered in the context of a particular class. Each label and prediction
take on the value of one of the multiple classes and so they are said to be positive for their particular class and negative
for all other classes. So, a true positive occurs whenever the prediction and the label match, while a true negative
occurs when neither the prediction nor the label take on the value of a given class. By this convention, there can be
multiple true negatives for a given data sample. The extension of false negatives and false positives from the former
definitions of positive and negative labels is straightforward.

#### Label based metrics

As opposed to binary classification, where there are only two possible labels, multiclass classification problems have many
possible labels, so the concept of label-based metrics is introduced. Accuracy measures precision across all
labels - the number of times any class was predicted correctly (true positives) normalized by the number of data
points. Precision by label considers only one class, and measures the number of times a specific label was predicted
correctly normalized by the number of times that label appears in the output.

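
A minimal sketch of this distinction, using `MulticlassMetrics` on a handful of hypothetical (prediction, label) pairs
(the data and the variable `sc`, an active `SparkContext`, are assumptions for illustration):

```scala
import org.apache.spark.mllib.evaluation.MulticlassMetrics

// Hypothetical (prediction, label) pairs over the classes 0.0, 1.0 and 2.0;
// `sc` is assumed to be an active SparkContext.
val predictionAndLabels = sc.parallelize(Seq(
  (0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (0.0, 1.0), (2.0, 0.0), (1.0, 1.0)))

val metrics = new MulticlassMetrics(predictionAndLabels)
println(metrics.accuracy)        // fraction of all points predicted correctly
println(metrics.precision(1.0))  // precision restricted to label 1.0
println(metrics.recall(1.0))     // recall restricted to label 1.0
println(metrics.confusionMatrix) // counts of (label, prediction) combinations
```
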
**Available metrics**

Define the class, or label, set as

$$L = \{\ell_0, \ell_1, \ldots, \ell_{M-1} \} $$

The true output vector $\mathbf{y}$ consists of $N$ elements

$$\mathbf{y}_0, \mathbf{y}_1, \ldots, \mathbf{y}_{N-1} \in L $$

A multiclass prediction algorithm generates a prediction vector $\hat{\mathbf{y}}$ of $N$ elements

$$\hat{\mathbf{y}}_0, \hat{\mathbf{y}}_1, \ldots, \hat{\mathbf{y}}_{N-1} \in L $$

For this section, a modified delta function $\hat{\delta}(x)$ will prove useful

$$\hat{\delta}(x) = \begin{cases}1 & \text{if $x = 0$}, \\ 0 & \text{otherwise}.\end{cases}$$

<table class="table">
  <thead>
    <tr><th>Metric</th><th>Definition</th></tr>
  </thead>
  <tbody>
    <tr>
      <td>Confusion Matrix</td>
      <td>
        $C_{ij} = \sum_{k=0}^{N-1} \hat{\delta}(\mathbf{y}_k-\ell_i) \cdot \hat{\delta}(\hat{\mathbf{y}}_k - \ell_j)\\ \\
         \left( \begin{array}{ccc}
         \sum_{k=0}^{N-1} \hat{\delta}(\mathbf{y}_k-\ell_0) \cdot \hat{\delta}(\hat{\mathbf{y}}_k - \ell_0) & \ldots &
         \sum_{k=0}^{N-1} \hat{\delta}(\mathbf{y}_k-\ell_0) \cdot \hat{\delta}(\hat{\mathbf{y}}_k - \ell_{M-1}) \\
         \vdots & \ddots & \vdots \\
         \sum_{k=0}^{N-1} \hat{\delta}(\mathbf{y}_k-\ell_{M-1}) \cdot \hat{\delta}(\hat{\mathbf{y}}_k - \ell_0) & \ldots &
         \sum_{k=0}^{N-1} \hat{\delta}(\mathbf{y}_k-\ell_{M-1}) \cdot \hat{\delta}(\hat{\mathbf{y}}_k - \ell_{M-1})
         \end{array} \right)$
      </td>
    </tr>
    <tr>
      <td>Accuracy</td>
      <td>$ACC = \frac{TP}{TP + FP} = \frac{1}{N}\sum_{i=0}^{N-1} \hat{\delta}\left(\hat{\mathbf{y}}_i -
        \mathbf{y}_i\right)$</td>
    </tr>
    <tr>
      <td>Precision by label</td>
      <td>$PPV(\ell) = \frac{TP}{TP + FP} =
          \frac{\sum_{i=0}^{N-1} \hat{\delta}(\hat{\mathbf{y}}_i - \ell) \cdot \hat{\delta}(\mathbf{y}_i - \ell)}
          {\sum_{i=0}^{N-1} \hat{\delta}(\hat{\mathbf{y}}_i - \ell)}$</td>
    </tr>
    <tr>
      <td>Recall by label</td>
      <td>$TPR(\ell)=\frac{TP}{P} =
          \frac{\sum_{i=0}^{N-1} \hat{\delta}(\hat{\mathbf{y}}_i - \ell) \cdot \hat{\delta}(\mathbf{y}_i - \ell)}
          {\sum_{i=0}^{N-1} \hat{\delta}(\mathbf{y}_i - \ell)}$</td>
    </tr>
    <tr>
      <td>F-measure by label</td>
      <td>$F(\beta, \ell) = \left(1 + \beta^2\right) \cdot \left(\frac{PPV(\ell) \cdot TPR(\ell)}
          {\beta^2 \cdot PPV(\ell) + TPR(\ell)}\right)$</td>
    </tr>
    <tr>
      <td>Weighted precision</td>
      <td>$PPV_{w}= \frac{1}{N} \sum\nolimits_{\ell \in L} PPV(\ell)
          \cdot \sum_{i=0}^{N-1} \hat{\delta}(\mathbf{y}_i-\ell)$</td>
    </tr>
    <tr>
      <td>Weighted recall</td>
      <td>$TPR_{w}= \frac{1}{N} \sum\nolimits_{\ell \in L} TPR(\ell)
          \cdot \sum_{i=0}^{N-1} \hat{\delta}(\mathbf{y}_i-\ell)$</td>
    </tr>
    <tr>
      <td>Weighted F-measure</td>
      <td>$F_{w}(\beta)= \frac{1}{N} \sum\nolimits_{\ell \in L} F(\beta, \ell)
          \cdot \sum_{i=0}^{N-1} \hat{\delta}(\mathbf{y}_i-\ell)$</td>
    </tr>
  </tbody>
</table>

**Examples**

<div class="codetabs">
The following code snippets illustrate how to load a sample dataset, train a multiclass classification algorithm on
the data, and evaluate the performance of the algorithm by several multiclass classification evaluation metrics.

<div data-lang="scala" markdown="1">
Refer to the [`MulticlassMetrics` Scala docs](api/scala/org/apache/spark/mllib/evaluation/MulticlassMetrics.html) for details on the API.

{% include_example scala/org/apache/spark/examples/mllib/MulticlassMetricsExample.scala %}

</div>

<div data-lang="java" markdown="1">
Refer to the [`MulticlassMetrics` Java docs](api/java/org/apache/spark/mllib/evaluation/MulticlassMetrics.html) for details on the API.

{% include_example java/org/apache/spark/examples/mllib/JavaMulticlassClassificationMetricsExample.java %}

</div>

<div data-lang="python" markdown="1">
Refer to the [`MulticlassMetrics` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.evaluation.MulticlassMetrics) for more details on the API.

{% include_example python/mllib/multi_class_metrics_example.py %}

</div>
</div>

### Multilabel classification

A [multilabel classification](https://en.wikipedia.org/wiki/Multi-label_classification) problem involves mapping
each sample in a dataset to a set of class labels. In this type of classification problem, the labels are not
mutually exclusive. For example, when classifying a set of news articles into topics, a single article might be both
science and politics.

Because the labels are not mutually exclusive, the predictions and true labels are now vectors of label *sets*, rather
than vectors of labels. Multilabel metrics, therefore, extend the fundamental ideas of precision, recall, etc. to
operations on sets. For example, a true positive for a given class now occurs when that class exists in the predicted
set and it exists in the true label set, for a specific data point.

**Available metrics**

Here we define a set $D$ of $N$ documents

$$D = \left\{d_0, d_1, ..., d_{N-1}\right\}$$

Define $L_0, L_1, ..., L_{N-1}$ to be a family of label sets and $P_0, P_1, ..., P_{N-1}$
to be a family of prediction sets where $L_i$ and $P_i$ are the label set and prediction set, respectively, that
correspond to document $d_i$.

The set of all unique labels is given by

$$L = \bigcup_{k=0}^{N-1} L_k$$

The following definition of an indicator function $I_A(x)$ on a set $A$ will be necessary

$$I_A(x) = \begin{cases}1 & \text{if $x \in A$}, \\ 0 & \text{otherwise}.\end{cases}$$

<table class="table">
  <thead>
    <tr><th>Metric</th><th>Definition</th></tr>
  </thead>
  <tbody>
    <tr>
      <td>Precision</td><td>$\frac{1}{N} \sum_{i=0}^{N-1} \frac{\left|P_i \cap L_i\right|}{\left|P_i\right|}$</td>
    </tr>
    <tr>
      <td>Recall</td><td>$\frac{1}{N} \sum_{i=0}^{N-1} \frac{\left|L_i \cap P_i\right|}{\left|L_i\right|}$</td>
    </tr>
    <tr>
      <td>Accuracy</td>
      <td>
        $\frac{1}{N} \sum_{i=0}^{N - 1} \frac{\left|L_i \cap P_i \right|}
        {\left|L_i\right| + \left|P_i\right| - \left|L_i \cap P_i \right|}$
      </td>
    </tr>
    <tr>
      <td>Precision by label</td><td>$PPV(\ell)=\frac{TP}{TP + FP}=
          \frac{\sum_{i=0}^{N-1} I_{P_i}(\ell) \cdot I_{L_i}(\ell)}
          {\sum_{i=0}^{N-1} I_{P_i}(\ell)}$</td>
    </tr>
    <tr>
      <td>Recall by label</td><td>$TPR(\ell)=\frac{TP}{P}=
          \frac{\sum_{i=0}^{N-1} I_{P_i}(\ell) \cdot I_{L_i}(\ell)}
          {\sum_{i=0}^{N-1} I_{L_i}(\ell)}$</td>
    </tr>
    <tr>
      <td>F1-measure by label</td><td>$F1(\ell) = 2
                            \cdot \left(\frac{PPV(\ell) \cdot TPR(\ell)}
                            {PPV(\ell) + TPR(\ell)}\right)$</td>
    </tr>
    <tr>
      <td>Hamming Loss</td>
      <td>
        $\frac{1}{N \cdot \left|L\right|} \sum_{i=0}^{N - 1} \left|L_i\right| + \left|P_i\right| - 2\left|L_i
          \cap P_i\right|$
      </td>
    </tr>
    <tr>
      <td>Subset Accuracy</td>
      <td>$\frac{1}{N} \sum_{i=0}^{N-1} I_{\{L_i\}}(P_i)$</td>
    </tr>
    <tr>
      <td>F1 Measure</td>
      <td>$\frac{1}{N} \sum_{i=0}^{N-1} 2 \frac{\left|P_i \cap L_i\right|}{\left|P_i\right| + \left|L_i\right|}$</td>
    </tr>
    <tr>
      <td>Micro precision</td>
      <td>$\frac{TP}{TP + FP}=\frac{\sum_{i=0}^{N-1} \left|P_i \cap L_i\right|}
          {\sum_{i=0}^{N-1} \left|P_i \cap L_i\right| + \sum_{i=0}^{N-1} \left|P_i - L_i\right|}$</td>
    </tr>
    <tr>
      <td>Micro recall</td>
      <td>$\frac{TP}{TP + FN}=\frac{\sum_{i=0}^{N-1} \left|P_i \cap L_i\right|}
        {\sum_{i=0}^{N-1} \left|P_i \cap L_i\right| + \sum_{i=0}^{N-1} \left|L_i - P_i\right|}$</td>
    </tr>
    <tr>
      <td>Micro F1 Measure</td>
      <td>
        $2 \cdot \frac{TP}{2 \cdot TP + FP + FN}=2 \cdot \frac{\sum_{i=0}^{N-1} \left|P_i \cap L_i\right|}{2 \cdot
        \sum_{i=0}^{N-1} \left|P_i \cap L_i\right| + \sum_{i=0}^{N-1} \left|L_i - P_i\right| + \sum_{i=0}^{N-1}
        \left|P_i - L_i\right|}$
      </td>
    </tr>
  </tbody>
</table>

**Examples**

The following code snippets illustrate how to evaluate the performance of a multilabel classifier. The examples
use the synthetic prediction and label data for multilabel classification shown below.

Document predictions:

* doc 0 - predict 0, 1 - class 0, 2
* doc 1 - predict 0, 2 - class 0, 1
* doc 2 - predict none - class 0
* doc 3 - predict 2 - class 2
* doc 4 - predict 2, 0 - class 2, 0
* doc 5 - predict 0, 1, 2 - class 0, 1
* doc 6 - predict 1 - class 1, 2

Predicted classes:

* class 0 - doc 0, 1, 4, 5 (total 4)
* class 1 - doc 0, 5, 6 (total 3)
* class 2 - doc 1, 3, 4, 5 (total 4)

True classes:

* class 0 - doc 0, 1, 2, 4, 5 (total 5)
* class 1 - doc 1, 5, 6 (total 3)
* class 2 - doc 0, 3, 4, 6 (total 4)

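
For orientation, here is a minimal sketch of one way the data above could be encoded and fed to `MultilabelMetrics`
(the array encoding and the active `SparkContext` `sc` are assumptions for illustration; the full examples in the tabs
below follow the same pattern):

```scala
import org.apache.spark.mllib.evaluation.MultilabelMetrics
import org.apache.spark.rdd.RDD

// One (prediction set, label set) pair per document, encoded as arrays of class ids;
// `sc` is assumed to be an active SparkContext.
val predictionAndLabels: RDD[(Array[Double], Array[Double])] = sc.parallelize(Seq(
  (Array(0.0, 1.0), Array(0.0, 2.0)),
  (Array(0.0, 2.0), Array(0.0, 1.0)),
  (Array.empty[Double], Array(0.0)),
  (Array(2.0), Array(2.0)),
  (Array(2.0, 0.0), Array(2.0, 0.0)),
  (Array(0.0, 1.0, 2.0), Array(0.0, 1.0)),
  (Array(1.0), Array(1.0, 2.0))))

val metrics = new MultilabelMetrics(predictionAndLabels)
println(metrics.precision)       // document-averaged precision
println(metrics.recall(0.0))     // recall restricted to class 0
println(metrics.hammingLoss)
println(metrics.subsetAccuracy)  // fraction of documents whose predicted set matches exactly
```
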
<div class="codetabs">

<div data-lang="scala" markdown="1">
Refer to the [`MultilabelMetrics` Scala docs](api/scala/org/apache/spark/mllib/evaluation/MultilabelMetrics.html) for details on the API.

{% include_example scala/org/apache/spark/examples/mllib/MultiLabelMetricsExample.scala %}

</div>

<div data-lang="java" markdown="1">
Refer to the [`MultilabelMetrics` Java docs](api/java/org/apache/spark/mllib/evaluation/MultilabelMetrics.html) for details on the API.

{% include_example java/org/apache/spark/examples/mllib/JavaMultiLabelClassificationMetricsExample.java %}

</div>

<div data-lang="python" markdown="1">
Refer to the [`MultilabelMetrics` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.evaluation.MultilabelMetrics) for more details on the API.

{% include_example python/mllib/multi_label_metrics_example.py %}

</div>
</div>

### Ranking systems

The role of a ranking algorithm (often thought of as a [recommender system](https://en.wikipedia.org/wiki/Recommender_system))
is to return to the user a set of relevant items or documents based on some training data. The definition of relevance
may vary and is usually application specific. Ranking system metrics aim to quantify the effectiveness of these
rankings or recommendations in various contexts. Some metrics compare a set of recommended documents to a ground truth
set of relevant documents, while other metrics may incorporate numerical ratings explicitly.

**Available metrics**

A ranking system usually deals with a set of $M$ users

$$U = \left\{u_0, u_1, ..., u_{M-1}\right\}$$

Each user ($u_i$) has a set of $N_i$ ground truth relevant documents

$$D_i = \left\{d_0, d_1, ..., d_{N_i-1}\right\}$$

and a list of $Q_i$ recommended documents, in order of decreasing relevance

$$R_i = \left[r_0, r_1, ..., r_{Q_i-1}\right]$$

The goal of the ranking system is to produce the most relevant set of documents for each user. The relevance of the
sets and the effectiveness of the algorithms can be measured using the metrics listed below.

It is necessary to define a function which, provided a recommended document and a set of ground truth relevant
documents, returns a relevance score for the recommended document.

$$rel_D(r) = \begin{cases}1 & \text{if $r \in D$}, \\ 0 & \text{otherwise}.\end{cases}$$

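
As a quick orientation before the formal definitions below, the following sketch shows how per-user (recommendations,
ground truth) pairs map onto `RankingMetrics`; the document ids and the active `SparkContext` `sc` are made up for
illustration:

```scala
import org.apache.spark.mllib.evaluation.RankingMetrics

// One (recommended documents, ground-truth relevant documents) pair per user,
// with recommendations ordered by decreasing predicted relevance.
// Document ids and `sc` (an active SparkContext) are assumptions for illustration.
val predictionAndLabels = sc.parallelize(Seq(
  (Array(1, 6, 2, 7, 8), Array(1, 2, 3, 4, 5)),
  (Array(4, 1, 5, 6, 2), Array(1, 2, 3))))

val metrics = new RankingMetrics(predictionAndLabels)
println(metrics.precisionAt(5))        // precision at k = 5
println(metrics.meanAveragePrecision)  // MAP
println(metrics.ndcgAt(5))             // NDCG at k = 5
```
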
<table class="table">
  <thead>
    <tr><th>Metric</th><th>Definition</th><th>Notes</th></tr>
  </thead>
  <tbody>
    <tr>
      <td>
        Precision at k
      </td>
      <td>
        $p(k)=\frac{1}{M} \sum_{i=0}^{M-1} {\frac{1}{k} \sum_{j=0}^{\text{min}(Q_i, k) - 1} rel_{D_i}(R_i(j))}$
      </td>
      <td>
        <a href="https://en.wikipedia.org/wiki/Evaluation_measures_(information_retrieval)#Precision_at_K">Precision at k</a> is a measure of
         how many of the first k recommended documents are in the set of true relevant documents averaged across all
         users. In this metric, the order of the recommendations is not taken into account.
      </td>
    </tr>
    <tr>
      <td>Mean Average Precision</td>
      <td>
        $MAP=\frac{1}{M} \sum_{i=0}^{M-1} {\frac{1}{N_i} \sum_{j=0}^{Q_i-1} \frac{rel_{D_i}(R_i(j))}{j + 1}}$
      </td>
      <td>
        <a href="https://en.wikipedia.org/wiki/Evaluation_measures_(information_retrieval)#Mean_average_precision">MAP</a> is a measure of how
         many of the recommended documents are in the set of true relevant documents, where the
        order of the recommendations is taken into account (i.e. relevant documents that appear lower in the ranking
        contribute less, so mistakes near the top of the list are penalized more heavily).
      </td>
    </tr>
    <tr>
      <td>Normalized Discounted Cumulative Gain</td>
      <td>
        $NDCG(k)=\frac{1}{M} \sum_{i=0}^{M-1} {\frac{1}{IDCG(D_i, k)}\sum_{j=0}^{n-1}
          \frac{rel_{D_i}(R_i(j))}{\text{log}(j+2)}} \\
        \text{Where} \\
        \hspace{5 mm} n = \text{min}\left(\text{max}\left(Q_i, N_i\right),k\right) \\
        \hspace{5 mm} IDCG(D, k) = \sum_{j=0}^{\text{min}(\left|D\right|, k) - 1} \frac{1}{\text{log}(j+2)}$
      </td>
      <td>
        <a href="https://en.wikipedia.org/wiki/Discounted_cumulative_gain#Normalized_DCG">NDCG at k</a> is a
        measure of how many of the first k recommended documents are in the set of true relevant documents averaged
        across all users. In contrast to precision at k, this metric takes into account the order of the recommendations
        (documents are assumed to be in order of decreasing relevance).
      </td>
    </tr>
  </tbody>
</table>

**Examples**

The following code snippets illustrate how to load a sample dataset, train an alternating least squares recommendation
model on the data, and evaluate the performance of the recommender by several ranking metrics. A brief summary of the
methodology is provided below.

MovieLens ratings are on a scale of 1-5:

 * 5: Must see
 * 4: Will enjoy
 * 3: It's okay
 * 2: Fairly bad
 * 1: Awful

So we should not recommend a movie if the predicted rating is less than 3.
To map ratings to confidence scores, we use:

 * 5 -> 2.5
 * 4 -> 1.5
 * 3 -> 0.5
 * 2 -> -0.5
 * 1 -> -1.5.

This mapping means unobserved entries are generally between It's okay and Fairly bad. The semantics of 0 in this
expanded world of non-positive weights are "the same as never having interacted at all."

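
In code, the mapping above is simply a shift of each rating by 2.5; a minimal sketch, assuming `ratings` is an
existing `RDD[Rating]` of raw MovieLens ratings:

```scala
import org.apache.spark.mllib.recommendation.Rating

// Shift each 1-5 rating to the confidence score described above (rating - 2.5);
// `ratings` is assumed to be an existing RDD[Rating] of raw MovieLens ratings.
val scaledRatings = ratings.map(r => Rating(r.user, r.product, r.rating - 2.5))
```
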
<div class="codetabs">

<div data-lang="scala" markdown="1">
Refer to the [`RegressionMetrics` Scala docs](api/scala/org/apache/spark/mllib/evaluation/RegressionMetrics.html) and [`RankingMetrics` Scala docs](api/scala/org/apache/spark/mllib/evaluation/RankingMetrics.html) for details on the API.

{% include_example scala/org/apache/spark/examples/mllib/RankingMetricsExample.scala %}

</div>

<div data-lang="java" markdown="1">
Refer to the [`RegressionMetrics` Java docs](api/java/org/apache/spark/mllib/evaluation/RegressionMetrics.html) and [`RankingMetrics` Java docs](api/java/org/apache/spark/mllib/evaluation/RankingMetrics.html) for details on the API.

{% include_example java/org/apache/spark/examples/mllib/JavaRankingMetricsExample.java %}

</div>

<div data-lang="python" markdown="1">
Refer to the [`RegressionMetrics` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.evaluation.RegressionMetrics) and [`RankingMetrics` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.evaluation.RankingMetrics) for more details on the API.

{% include_example python/mllib/ranking_metrics_example.py %}

</div>
</div>

## Regression model evaluation

[Regression analysis](https://en.wikipedia.org/wiki/Regression_analysis) is used when predicting a continuous output
variable from a number of independent variables.

**Available metrics**

<table class="table">
  <thead>
    <tr><th>Metric</th><th>Definition</th></tr>
  </thead>
  <tbody>
    <tr>
      <td>Mean Squared Error (MSE)</td>
      <td>$MSE = \frac{\sum_{i=0}^{N-1} (\mathbf{y}_i - \hat{\mathbf{y}}_i)^2}{N}$</td>
    </tr>
    <tr>
      <td>Root Mean Squared Error (RMSE)</td>
      <td>$RMSE = \sqrt{\frac{\sum_{i=0}^{N-1} (\mathbf{y}_i - \hat{\mathbf{y}}_i)^2}{N}}$</td>
    </tr>
    <tr>
      <td>Mean Absolute Error (MAE)</td>
      <td>$MAE=\frac{1}{N}\sum_{i=0}^{N-1} \left|\mathbf{y}_i - \hat{\mathbf{y}}_i\right|$</td>
    </tr>
    <tr>
      <td>Coefficient of Determination $(R^2)$</td>
      <td>$R^2=1 - \frac{N \cdot MSE}{\text{VAR}(\mathbf{y}) \cdot (N-1)}=1-\frac{\sum_{i=0}^{N-1}
        (\mathbf{y}_i - \hat{\mathbf{y}}_i)^2}{\sum_{i=0}^{N-1}(\mathbf{y}_i-\bar{\mathbf{y}})^2}$</td>
    </tr>
    <tr>
      <td>Explained Variance</td>
      <td>$1 - \frac{\text{VAR}(\mathbf{y} - \mathbf{\hat{y}})}{\text{VAR}(\mathbf{y})}$</td>
    </tr>
  </tbody>
</table>
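
For reference, a minimal sketch of computing these metrics with `RegressionMetrics` on a few hypothetical
(prediction, observation) pairs (the numbers and the active `SparkContext` `sc` are made up for illustration):

```scala
import org.apache.spark.mllib.evaluation.RegressionMetrics

// Hypothetical (prediction, observation) pairs; `sc` is assumed to be an active SparkContext.
val predictionAndObservations = sc.parallelize(Seq(
  (2.5, 3.0), (0.0, -0.5), (2.1, 2.0), (7.8, 8.0)))

val metrics = new RegressionMetrics(predictionAndObservations)
println(metrics.meanSquaredError)      // MSE
println(metrics.rootMeanSquaredError)  // RMSE
println(metrics.meanAbsoluteError)     // MAE
println(metrics.r2)                    // coefficient of determination
println(metrics.explainedVariance)
```
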