---
layout: global
title: Evaluation Metrics - RDD-based API
displayTitle: Evaluation Metrics - RDD-based API
license: |
  Licensed to the Apache Software Foundation (ASF) under one or more
  contributor license agreements.  See the NOTICE file distributed with
  this work for additional information regarding copyright ownership.
  The ASF licenses this file to You under the Apache License, Version 2.0
  (the "License"); you may not use this file except in compliance with
  the License.  You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.
---

* Table of contents
{:toc}

`spark.mllib` comes with a number of machine learning algorithms that can be used to learn from and make predictions
on data. When these algorithms are applied to build machine learning models, there is a need to evaluate the performance
of the model on some criteria, which depends on the application and its requirements. `spark.mllib` also provides a
suite of metrics for evaluating the performance of machine learning models.

Specific machine learning algorithms fall under broader types of machine learning applications like classification,
regression, clustering, etc. Each of these types has well-established metrics for performance evaluation, and the
metrics that are currently available in `spark.mllib` are detailed in this section.

## Classification model evaluation

While there are many different types of classification algorithms, the evaluation of classification models follows
similar principles across all of them. In a [supervised classification problem](https://en.wikipedia.org/wiki/Statistical_classification),
there exists a true output and a model-generated predicted output for each data point. For this reason, the result for
each data point can be assigned to one of four categories:

* True Positive (TP) - label is positive and prediction is also positive
* True Negative (TN) - label is negative and prediction is also negative
* False Positive (FP) - label is negative but prediction is positive
* False Negative (FN) - label is positive but prediction is negative

These four numbers are the building blocks for most classifier evaluation metrics. A fundamental point when considering
classifier evaluation is that pure accuracy (i.e. was the prediction correct or incorrect) is not generally a good metric,
because a dataset may be highly unbalanced. For example, if a model is designed to predict fraud from
a dataset where 95% of the data points are _not fraud_ and 5% of the data points are _fraud_, then a naive classifier
that predicts _not fraud_, regardless of input, will be 95% accurate. For this reason, metrics like
[precision and recall](https://en.wikipedia.org/wiki/Precision_and_recall) are typically used because they take into
account the *type* of error. In most applications there is some desired balance between precision and recall, which can
be captured by combining the two into a single metric, called the [F-measure](https://en.wikipedia.org/wiki/F1_score).

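
As a rough illustration of why accuracy alone can mislead on an unbalanced dataset like the fraud example above, the
following sketch computes accuracy, precision, recall, and the F-measure from a set of hypothetical confusion counts
(the counts are invented for illustration; they do not come from any real model):

```scala
// Hypothetical confusion counts for a weak fraud classifier on a 95% / 5% dataset
// (1000 transactions, 50 of which are fraud). These numbers are illustrative only.
val (tp, fp, fn, tn) = (10.0, 5.0, 40.0, 945.0)

val accuracy  = (tp + tn) / (tp + tn + fp + fn)  // 0.955: looks good despite missing most fraud
val precision = tp / (tp + fp)                   // ~0.67
val recall    = tp / (tp + fn)                   // 0.2: only 1 in 5 fraud cases is caught
val beta      = 1.0
val fMeasure  = (1 + beta * beta) * precision * recall / (beta * beta * precision + recall)
```
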
### Binary classification

[Binary classifiers](https://en.wikipedia.org/wiki/Binary_classification) are used to separate the elements of a given
dataset into one of two possible groups (e.g. fraud or not fraud); binary classification is a special case of multiclass
classification. Most binary classification metrics can be generalized to multiclass classification metrics.

#### Threshold tuning

It is important to understand that many classification models actually output a "score" (often a probability) for
each class, where a higher score indicates higher likelihood. In the binary case, the model may output a probability for
each class: $P(Y=1|X)$ and $P(Y=0|X)$. Instead of simply taking the higher probability, in some cases the model might
need to be tuned so that it only predicts a class when the probability is very high (e.g. only block a
credit card transaction if the model predicts fraud with >90% probability). Therefore, there is a prediction *threshold*
which determines what the predicted class will be based on the probabilities that the model outputs.

Tuning the prediction threshold will change the precision and recall of the model and is an important part of model
optimization. In order to visualize how precision, recall, and other metrics change as a function of the threshold, it is
common practice to plot competing metrics against one another, parameterized by threshold. A P-R curve plots (precision,
recall) points for different threshold values, while a
[receiver operating characteristic](https://en.wikipedia.org/wiki/Receiver_operating_characteristic), or ROC, curve
plots (recall, false positive rate) points.

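
As a rough sketch of what threshold tuning can look like with `BinaryClassificationMetrics`, the snippet below picks
the threshold that maximizes the F-measure. It assumes a trained `LogisticRegressionModel` named `model` and a test set
`test` of `LabeledPoint`s already exist (both names are placeholders, not part of the API):

```scala
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

// Assumes `model: LogisticRegressionModel` and `test: RDD[LabeledPoint]` are already defined.
model.clearThreshold()  // make predict() return raw scores rather than 0/1 labels

val scoreAndLabels = test.map(point => (model.predict(point.features), point.label))
val metrics = new BinaryClassificationMetrics(scoreAndLabels)

// fMeasureByThreshold() yields (threshold, F-measure) pairs; keep the best threshold.
val bestThreshold = metrics
  .fMeasureByThreshold()
  .reduce((a, b) => if (a._2 > b._2) a else b)
  ._1
model.setThreshold(bestThreshold)
```
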
**Available metrics**

<table class="table">
  <thead>
    <tr><th>Metric</th><th>Definition</th></tr>
  </thead>
  <tbody>
    <tr>
      <td>Precision (Positive Predictive Value)</td>
      <td>$PPV=\frac{TP}{TP + FP}$</td>
    </tr>
    <tr>
      <td>Recall (True Positive Rate)</td>
      <td>$TPR=\frac{TP}{P}=\frac{TP}{TP + FN}$</td>
    </tr>
    <tr>
      <td>F-measure</td>
      <td>$F(\beta) = \left(1 + \beta^2\right) \cdot \left(\frac{PPV \cdot TPR}
          {\beta^2 \cdot PPV + TPR}\right)$</td>
    </tr>
    <tr>
      <td>Receiver Operating Characteristic (ROC)</td>
      <td>$FPR(T)=\int^\infty_{T} P_0(T)\,dT \\ TPR(T)=\int^\infty_{T} P_1(T)\,dT$</td>
    </tr>
    <tr>
      <td>Area Under ROC Curve</td>
      <td>$AUROC=\int^1_{0} \frac{TP}{P} d\left(\frac{FP}{N}\right)$</td>
    </tr>
    <tr>
      <td>Area Under Precision-Recall Curve</td>
      <td>$AUPRC=\int^1_{0} \frac{TP}{TP+FP} d\left(\frac{TP}{P}\right)$</td>
    </tr>
  </tbody>
</table>


**Examples**

<div class="codetabs">
The following code snippets illustrate how to load a sample dataset, train a binary classification algorithm on the
data, and evaluate the performance of the algorithm by several binary evaluation metrics.

<div data-lang="scala" markdown="1">
Refer to the [`LogisticRegressionWithLBFGS` Scala docs](api/scala/org/apache/spark/mllib/classification/LogisticRegressionWithLBFGS.html) and [`BinaryClassificationMetrics` Scala docs](api/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.html) for details on the API.

{% include_example scala/org/apache/spark/examples/mllib/BinaryClassificationMetricsExample.scala %}

</div>

<div data-lang="java" markdown="1">
Refer to the [`LogisticRegressionModel` Java docs](api/java/org/apache/spark/mllib/classification/LogisticRegressionModel.html) and [`LogisticRegressionWithLBFGS` Java docs](api/java/org/apache/spark/mllib/classification/LogisticRegressionWithLBFGS.html) for details on the API.

{% include_example java/org/apache/spark/examples/mllib/JavaBinaryClassificationMetricsExample.java %}

</div>

<div data-lang="python" markdown="1">
Refer to the [`BinaryClassificationMetrics` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.evaluation.BinaryClassificationMetrics) and [`LogisticRegressionWithLBFGS` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.classification.LogisticRegressionWithLBFGS) for more details on the API.

{% include_example python/mllib/binary_classification_metrics_example.py %}
</div>
</div>


### Multiclass classification

A [multiclass classification](https://en.wikipedia.org/wiki/Multiclass_classification) problem is one in which there
are $M \gt 2$ possible labels for each data point (the case where $M=2$ is the binary
classification problem). For example, classifying handwriting samples as the digits 0 to 9 is a problem with 10 possible classes.

For multiclass metrics, the notion of positives and negatives is slightly different. Predictions and labels can still
be positive or negative, but they must be considered in the context of a particular class. Each label and prediction
take on the value of one of the multiple classes and so they are said to be positive for their particular class and negative
for all other classes. So, a true positive occurs whenever the prediction and the label match, while a true negative
occurs when neither the prediction nor the label take on the value of a given class. By this convention, there can be
multiple true negatives for a given data sample. The extension of false negatives and false positives from the former
definitions of positive and negative labels is straightforward.

#### Label based metrics

As opposed to binary classification, where there are only two possible labels, multiclass classification problems have many
possible labels, so the concept of label-based metrics is introduced. Accuracy measures precision across all
labels - the number of times any class was predicted correctly (true positives) normalized by the number of data
points. Precision by label considers only one class, and measures the number of times a specific label was predicted
correctly normalized by the number of times that label appears in the output.

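
A minimal sketch of this distinction, using `MulticlassMetrics` on a handful of hypothetical (prediction, label) pairs
(the data and the variable `sc`, an active `SparkContext`, are assumptions for illustration):

```scala
import org.apache.spark.mllib.evaluation.MulticlassMetrics

// Hypothetical (prediction, label) pairs over the classes 0.0, 1.0 and 2.0;
// `sc` is assumed to be an active SparkContext.
val predictionAndLabels = sc.parallelize(Seq(
  (0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (0.0, 1.0), (2.0, 0.0), (1.0, 1.0)))

val metrics = new MulticlassMetrics(predictionAndLabels)
println(metrics.accuracy)        // fraction of all points predicted correctly
println(metrics.precision(1.0))  // precision restricted to label 1.0
println(metrics.recall(1.0))     // recall restricted to label 1.0
println(metrics.confusionMatrix) // counts of (label, prediction) combinations
```
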
**Available metrics**

Define the class, or label, set as

$$L = \{\ell_0, \ell_1, \ldots, \ell_{M-1} \} $$

The true output vector $\mathbf{y}$ consists of $N$ elements

$$\mathbf{y}_0, \mathbf{y}_1, \ldots, \mathbf{y}_{N-1} \in L $$

A multiclass prediction algorithm generates a prediction vector $\hat{\mathbf{y}}$ of $N$ elements

$$\hat{\mathbf{y}}_0, \hat{\mathbf{y}}_1, \ldots, \hat{\mathbf{y}}_{N-1} \in L $$

For this section, a modified delta function $\hat{\delta}(x)$ will prove useful

$$\hat{\delta}(x) = \begin{cases}1 & \text{if $x = 0$}, \\ 0 & \text{otherwise}.\end{cases}$$

<table class="table">
  <thead>
    <tr><th>Metric</th><th>Definition</th></tr>
  </thead>
  <tbody>
    <tr>
      <td>Confusion Matrix</td>
      <td>
        $C_{ij} = \sum_{k=0}^{N-1} \hat{\delta}(\mathbf{y}_k-\ell_i) \cdot \hat{\delta}(\hat{\mathbf{y}}_k - \ell_j)\\ \\
         \left( \begin{array}{ccc}
         \sum_{k=0}^{N-1} \hat{\delta}(\mathbf{y}_k-\ell_0) \cdot \hat{\delta}(\hat{\mathbf{y}}_k - \ell_0) & \ldots &
         \sum_{k=0}^{N-1} \hat{\delta}(\mathbf{y}_k-\ell_0) \cdot \hat{\delta}(\hat{\mathbf{y}}_k - \ell_{M-1}) \\
         \vdots & \ddots & \vdots \\
         \sum_{k=0}^{N-1} \hat{\delta}(\mathbf{y}_k-\ell_{M-1}) \cdot \hat{\delta}(\hat{\mathbf{y}}_k - \ell_0) & \ldots &
         \sum_{k=0}^{N-1} \hat{\delta}(\mathbf{y}_k-\ell_{M-1}) \cdot \hat{\delta}(\hat{\mathbf{y}}_k - \ell_{M-1})
         \end{array} \right)$
      </td>
    </tr>
    <tr>
      <td>Accuracy</td>
      <td>$ACC = \frac{TP}{TP + FP} = \frac{1}{N}\sum_{i=0}^{N-1} \hat{\delta}\left(\hat{\mathbf{y}}_i -
        \mathbf{y}_i\right)$</td>
    </tr>
    <tr>
      <td>Precision by label</td>
      <td>$PPV(\ell) = \frac{TP}{TP + FP} =
          \frac{\sum_{i=0}^{N-1} \hat{\delta}(\hat{\mathbf{y}}_i - \ell) \cdot \hat{\delta}(\mathbf{y}_i - \ell)}
          {\sum_{i=0}^{N-1} \hat{\delta}(\hat{\mathbf{y}}_i - \ell)}$</td>
    </tr>
    <tr>
      <td>Recall by label</td>
      <td>$TPR(\ell)=\frac{TP}{P} =
          \frac{\sum_{i=0}^{N-1} \hat{\delta}(\hat{\mathbf{y}}_i - \ell) \cdot \hat{\delta}(\mathbf{y}_i - \ell)}
          {\sum_{i=0}^{N-1} \hat{\delta}(\mathbf{y}_i - \ell)}$</td>
    </tr>
    <tr>
      <td>F-measure by label</td>
      <td>$F(\beta, \ell) = \left(1 + \beta^2\right) \cdot \left(\frac{PPV(\ell) \cdot TPR(\ell)}
          {\beta^2 \cdot PPV(\ell) + TPR(\ell)}\right)$</td>
    </tr>
    <tr>
      <td>Weighted precision</td>
      <td>$PPV_{w}= \frac{1}{N} \sum\nolimits_{\ell \in L} PPV(\ell)
          \cdot \sum_{i=0}^{N-1} \hat{\delta}(\mathbf{y}_i-\ell)$</td>
    </tr>
    <tr>
      <td>Weighted recall</td>
      <td>$TPR_{w}= \frac{1}{N} \sum\nolimits_{\ell \in L} TPR(\ell)
          \cdot \sum_{i=0}^{N-1} \hat{\delta}(\mathbf{y}_i-\ell)$</td>
    </tr>
    <tr>
      <td>Weighted F-measure</td>
      <td>$F_{w}(\beta)= \frac{1}{N} \sum\nolimits_{\ell \in L} F(\beta, \ell)
          \cdot \sum_{i=0}^{N-1} \hat{\delta}(\mathbf{y}_i-\ell)$</td>
    </tr>
  </tbody>
</table>

**Examples**

<div class="codetabs">
The following code snippets illustrate how to load a sample dataset, train a multiclass classification algorithm on
the data, and evaluate the performance of the algorithm by several multiclass classification evaluation metrics.

<div data-lang="scala" markdown="1">
Refer to the [`MulticlassMetrics` Scala docs](api/scala/org/apache/spark/mllib/evaluation/MulticlassMetrics.html) for details on the API.

{% include_example scala/org/apache/spark/examples/mllib/MulticlassMetricsExample.scala %}

</div>

<div data-lang="java" markdown="1">
Refer to the [`MulticlassMetrics` Java docs](api/java/org/apache/spark/mllib/evaluation/MulticlassMetrics.html) for details on the API.

{% include_example java/org/apache/spark/examples/mllib/JavaMulticlassClassificationMetricsExample.java %}

</div>

<div data-lang="python" markdown="1">
Refer to the [`MulticlassMetrics` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.evaluation.MulticlassMetrics) for more details on the API.

{% include_example python/mllib/multi_class_metrics_example.py %}

</div>
</div>

### Multilabel classification

A [multilabel classification](https://en.wikipedia.org/wiki/Multi-label_classification) problem involves mapping
each sample in a dataset to a set of class labels. In this type of classification problem, the labels are not
mutually exclusive. For example, when classifying a set of news articles into topics, a single article might be both
science and politics.

Because the labels are not mutually exclusive, the predictions and true labels are now vectors of label *sets*, rather
than vectors of labels. Multilabel metrics, therefore, extend the fundamental ideas of precision, recall, etc. to
operations on sets. For example, a true positive for a given class now occurs when that class exists in the predicted
set and it exists in the true label set, for a specific data point.

**Available metrics**

Here we define a set $D$ of $N$ documents

$$D = \left\{d_0, d_1, ..., d_{N-1}\right\}$$

Define $L_0, L_1, ..., L_{N-1}$ to be a family of label sets and $P_0, P_1, ..., P_{N-1}$
to be a family of prediction sets where $L_i$ and $P_i$ are the label set and prediction set, respectively, that
correspond to document $d_i$.

The set of all unique labels is given by

$$L = \bigcup_{k=0}^{N-1} L_k$$

The following definition of an indicator function $I_A(x)$ on a set $A$ will be necessary

$$I_A(x) = \begin{cases}1 & \text{if $x \in A$}, \\ 0 & \text{otherwise}.\end{cases}$$

<table class="table">
  <thead>
    <tr><th>Metric</th><th>Definition</th></tr>
  </thead>
  <tbody>
    <tr>
      <td>Precision</td><td>$\frac{1}{N} \sum_{i=0}^{N-1} \frac{\left|P_i \cap L_i\right|}{\left|P_i\right|}$</td>
    </tr>
    <tr>
      <td>Recall</td><td>$\frac{1}{N} \sum_{i=0}^{N-1} \frac{\left|L_i \cap P_i\right|}{\left|L_i\right|}$</td>
    </tr>
    <tr>
      <td>Accuracy</td>
      <td>
        $\frac{1}{N} \sum_{i=0}^{N - 1} \frac{\left|L_i \cap P_i \right|}
        {\left|L_i\right| + \left|P_i\right| - \left|L_i \cap P_i \right|}$
      </td>
    </tr>
    <tr>
      <td>Precision by label</td><td>$PPV(\ell)=\frac{TP}{TP + FP}=
          \frac{\sum_{i=0}^{N-1} I_{P_i}(\ell) \cdot I_{L_i}(\ell)}
          {\sum_{i=0}^{N-1} I_{P_i}(\ell)}$</td>
    </tr>
    <tr>
      <td>Recall by label</td><td>$TPR(\ell)=\frac{TP}{P}=
          \frac{\sum_{i=0}^{N-1} I_{P_i}(\ell) \cdot I_{L_i}(\ell)}
          {\sum_{i=0}^{N-1} I_{L_i}(\ell)}$</td>
    </tr>
    <tr>
      <td>F1-measure by label</td><td>$F1(\ell) = 2
                            \cdot \left(\frac{PPV(\ell) \cdot TPR(\ell)}
                            {PPV(\ell) + TPR(\ell)}\right)$</td>
    </tr>
    <tr>
      <td>Hamming Loss</td>
      <td>
        $\frac{1}{N \cdot \left|L\right|} \sum_{i=0}^{N - 1} \left|L_i\right| + \left|P_i\right| - 2\left|L_i
          \cap P_i\right|$
      </td>
    </tr>
    <tr>
      <td>Subset Accuracy</td>
      <td>$\frac{1}{N} \sum_{i=0}^{N-1} I_{\{L_i\}}(P_i)$</td>
    </tr>
    <tr>
      <td>F1 Measure</td>
      <td>$\frac{1}{N} \sum_{i=0}^{N-1} 2 \frac{\left|P_i \cap L_i\right|}{\left|P_i\right| + \left|L_i\right|}$</td>
    </tr>
    <tr>
      <td>Micro precision</td>
      <td>$\frac{TP}{TP + FP}=\frac{\sum_{i=0}^{N-1} \left|P_i \cap L_i\right|}
          {\sum_{i=0}^{N-1} \left|P_i \cap L_i\right| + \sum_{i=0}^{N-1} \left|P_i - L_i\right|}$</td>
    </tr>
    <tr>
      <td>Micro recall</td>
      <td>$\frac{TP}{TP + FN}=\frac{\sum_{i=0}^{N-1} \left|P_i \cap L_i\right|}
        {\sum_{i=0}^{N-1} \left|P_i \cap L_i\right| + \sum_{i=0}^{N-1} \left|L_i - P_i\right|}$</td>
    </tr>
    <tr>
      <td>Micro F1 Measure</td>
      <td>
        $2 \cdot \frac{TP}{2 \cdot TP + FP + FN}=2 \cdot \frac{\sum_{i=0}^{N-1} \left|P_i \cap L_i\right|}{2 \cdot
        \sum_{i=0}^{N-1} \left|P_i \cap L_i\right| + \sum_{i=0}^{N-1} \left|L_i - P_i\right| + \sum_{i=0}^{N-1}
        \left|P_i - L_i\right|}$
      </td>
    </tr>
  </tbody>
</table>

**Examples**

The following code snippets illustrate how to evaluate the performance of a multilabel classifier. The examples
use the synthetic prediction and label data for multilabel classification shown below.

Document predictions:

* doc 0 - predict 0, 1 - class 0, 2
* doc 1 - predict 0, 2 - class 0, 1
* doc 2 - predict none - class 0
* doc 3 - predict 2 - class 2
* doc 4 - predict 2, 0 - class 2, 0
* doc 5 - predict 0, 1, 2 - class 0, 1
* doc 6 - predict 1 - class 1, 2

Predicted classes:

* class 0 - doc 0, 1, 4, 5 (total 4)
* class 1 - doc 0, 5, 6 (total 3)
* class 2 - doc 1, 3, 4, 5 (total 4)

True classes:

* class 0 - doc 0, 1, 2, 4, 5 (total 5)
* class 1 - doc 1, 5, 6 (total 3)
* class 2 - doc 0, 3, 4, 6 (total 4)

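
For orientation, here is a minimal sketch of one way the data above could be encoded and fed to `MultilabelMetrics`
(the array encoding and the active `SparkContext` `sc` are assumptions for illustration; the full examples in the tabs
below follow the same pattern):

```scala
import org.apache.spark.mllib.evaluation.MultilabelMetrics
import org.apache.spark.rdd.RDD

// One (prediction set, label set) pair per document, encoded as arrays of class ids;
// `sc` is assumed to be an active SparkContext.
val predictionAndLabels: RDD[(Array[Double], Array[Double])] = sc.parallelize(Seq(
  (Array(0.0, 1.0), Array(0.0, 2.0)),
  (Array(0.0, 2.0), Array(0.0, 1.0)),
  (Array.empty[Double], Array(0.0)),
  (Array(2.0), Array(2.0)),
  (Array(2.0, 0.0), Array(2.0, 0.0)),
  (Array(0.0, 1.0, 2.0), Array(0.0, 1.0)),
  (Array(1.0), Array(1.0, 2.0))))

val metrics = new MultilabelMetrics(predictionAndLabels)
println(metrics.precision)       // document-averaged precision
println(metrics.recall(0.0))     // recall restricted to class 0
println(metrics.hammingLoss)
println(metrics.subsetAccuracy)  // fraction of documents whose predicted set matches exactly
```
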
<div class="codetabs">

<div data-lang="scala" markdown="1">
Refer to the [`MultilabelMetrics` Scala docs](api/scala/org/apache/spark/mllib/evaluation/MultilabelMetrics.html) for details on the API.

{% include_example scala/org/apache/spark/examples/mllib/MultiLabelMetricsExample.scala %}

</div>

<div data-lang="java" markdown="1">
Refer to the [`MultilabelMetrics` Java docs](api/java/org/apache/spark/mllib/evaluation/MultilabelMetrics.html) for details on the API.

{% include_example java/org/apache/spark/examples/mllib/JavaMultiLabelClassificationMetricsExample.java %}

</div>

<div data-lang="python" markdown="1">
Refer to the [`MultilabelMetrics` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.evaluation.MultilabelMetrics) for more details on the API.

{% include_example python/mllib/multi_label_metrics_example.py %}

</div>
</div>

### Ranking systems

The role of a ranking algorithm (often thought of as a [recommender system](https://en.wikipedia.org/wiki/Recommender_system))
is to return to the user a set of relevant items or documents based on some training data. The definition of relevance
may vary and is usually application specific. Ranking system metrics aim to quantify the effectiveness of these
rankings or recommendations in various contexts. Some metrics compare a set of recommended documents to a ground truth
set of relevant documents, while other metrics may incorporate numerical ratings explicitly.

**Available metrics**

A ranking system usually deals with a set of $M$ users

$$U = \left\{u_0, u_1, ..., u_{M-1}\right\}$$

Each user ($u_i$) has a set of $N_i$ ground truth relevant documents

$$D_i = \left\{d_0, d_1, ..., d_{N_i-1}\right\}$$

and a list of $Q_i$ recommended documents, in order of decreasing relevance

$$R_i = \left[r_0, r_1, ..., r_{Q_i-1}\right]$$

The goal of the ranking system is to produce the most relevant set of documents for each user. The relevance of the
sets and the effectiveness of the algorithms can be measured using the metrics listed below.

It is necessary to define a function which, provided a recommended document and a set of ground truth relevant
documents, returns a relevance score for the recommended document.

$$rel_D(r) = \begin{cases}1 & \text{if $r \in D$}, \\ 0 & \text{otherwise}.\end{cases}$$

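
As a quick orientation before the formal definitions below, the following sketch shows how per-user (recommendations,
ground truth) pairs map onto `RankingMetrics`; the document ids and the active `SparkContext` `sc` are made up for
illustration:

```scala
import org.apache.spark.mllib.evaluation.RankingMetrics

// One (recommended documents, ground-truth relevant documents) pair per user,
// with recommendations ordered by decreasing predicted relevance.
// Document ids and `sc` (an active SparkContext) are assumptions for illustration.
val predictionAndLabels = sc.parallelize(Seq(
  (Array(1, 6, 2, 7, 8), Array(1, 2, 3, 4, 5)),
  (Array(4, 1, 5, 6, 2), Array(1, 2, 3))))

val metrics = new RankingMetrics(predictionAndLabels)
println(metrics.precisionAt(5))        // precision at k = 5
println(metrics.meanAveragePrecision)  // MAP
println(metrics.ndcgAt(5))             // NDCG at k = 5
```
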
<table class="table">
  <thead>
    <tr><th>Metric</th><th>Definition</th><th>Notes</th></tr>
  </thead>
  <tbody>
    <tr>
      <td>
        Precision at k
      </td>
      <td>
        $p(k)=\frac{1}{M} \sum_{i=0}^{M-1} {\frac{1}{k} \sum_{j=0}^{\text{min}(Q_i, k) - 1} rel_{D_i}(R_i(j))}$
      </td>
      <td>
        <a href="https://en.wikipedia.org/wiki/Evaluation_measures_(information_retrieval)#Precision_at_K">Precision at k</a> is a measure of
         how many of the first k recommended documents are in the set of true relevant documents averaged across all
         users. In this metric, the order of the recommendations is not taken into account.
      </td>
    </tr>
    <tr>
      <td>Mean Average Precision</td>
      <td>
        $MAP=\frac{1}{M} \sum_{i=0}^{M-1} {\frac{1}{N_i} \sum_{j=0}^{Q_i-1} \frac{rel_{D_i}(R_i(j))}{j + 1}}$
      </td>
      <td>
        <a href="https://en.wikipedia.org/wiki/Evaluation_measures_(information_retrieval)#Mean_average_precision">MAP</a> is a measure of how
         many of the recommended documents are in the set of true relevant documents, where the
        order of the recommendations is taken into account (i.e. relevant documents that appear lower in the ranking
        contribute less, so mistakes near the top of the list are penalized more heavily).
      </td>
    </tr>
    <tr>
      <td>Normalized Discounted Cumulative Gain</td>
      <td>
        $NDCG(k)=\frac{1}{M} \sum_{i=0}^{M-1} {\frac{1}{IDCG(D_i, k)}\sum_{j=0}^{n-1}
          \frac{rel_{D_i}(R_i(j))}{\text{log}(j+2)}} \\
        \text{Where} \\
        \hspace{5 mm} n = \text{min}\left(\text{max}\left(Q_i, N_i\right),k\right) \\
        \hspace{5 mm} IDCG(D, k) = \sum_{j=0}^{\text{min}(\left|D\right|, k) - 1} \frac{1}{\text{log}(j+2)}$
      </td>
      <td>
        <a href="https://en.wikipedia.org/wiki/Discounted_cumulative_gain#Normalized_DCG">NDCG at k</a> is a
        measure of how many of the first k recommended documents are in the set of true relevant documents averaged
        across all users. In contrast to precision at k, this metric takes into account the order of the recommendations
        (documents are assumed to be in order of decreasing relevance).
      </td>
    </tr>
  </tbody>
</table>

**Examples**

The following code snippets illustrate how to load a sample dataset, train an alternating least squares recommendation
model on the data, and evaluate the performance of the recommender by several ranking metrics. A brief summary of the
methodology is provided below.

MovieLens ratings are on a scale of 1-5:

 * 5: Must see
 * 4: Will enjoy
 * 3: It's okay
 * 2: Fairly bad
 * 1: Awful

So we should not recommend a movie if the predicted rating is less than 3.
To map ratings to confidence scores, we use:

 * 5 -> 2.5
 * 4 -> 1.5
 * 3 -> 0.5
 * 2 -> -0.5
 * 1 -> -1.5.

This mapping means unobserved entries are generally between It's okay and Fairly bad. The semantics of 0 in this
expanded world of non-positive weights are "the same as never having interacted at all."

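
In code, the mapping above is simply a shift of each rating by 2.5; a minimal sketch, assuming `ratings` is an
existing `RDD[Rating]` of raw MovieLens ratings:

```scala
import org.apache.spark.mllib.recommendation.Rating

// Shift each 1-5 rating to the confidence score described above (rating - 2.5);
// `ratings` is assumed to be an existing RDD[Rating] of raw MovieLens ratings.
val scaledRatings = ratings.map(r => Rating(r.user, r.product, r.rating - 2.5))
```
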
<div class="codetabs">

<div data-lang="scala" markdown="1">
Refer to the [`RegressionMetrics` Scala docs](api/scala/org/apache/spark/mllib/evaluation/RegressionMetrics.html) and [`RankingMetrics` Scala docs](api/scala/org/apache/spark/mllib/evaluation/RankingMetrics.html) for details on the API.

{% include_example scala/org/apache/spark/examples/mllib/RankingMetricsExample.scala %}

</div>

<div data-lang="java" markdown="1">
Refer to the [`RegressionMetrics` Java docs](api/java/org/apache/spark/mllib/evaluation/RegressionMetrics.html) and [`RankingMetrics` Java docs](api/java/org/apache/spark/mllib/evaluation/RankingMetrics.html) for details on the API.

{% include_example java/org/apache/spark/examples/mllib/JavaRankingMetricsExample.java %}

</div>

<div data-lang="python" markdown="1">
Refer to the [`RegressionMetrics` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.evaluation.RegressionMetrics) and [`RankingMetrics` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.evaluation.RankingMetrics) for more details on the API.

{% include_example python/mllib/ranking_metrics_example.py %}

</div>
</div>

## Regression model evaluation

[Regression analysis](https://en.wikipedia.org/wiki/Regression_analysis) is used when predicting a continuous output
variable from a number of independent variables.

**Available metrics**

<table class="table">
  <thead>
    <tr><th>Metric</th><th>Definition</th></tr>
  </thead>
  <tbody>
    <tr>
      <td>Mean Squared Error (MSE)</td>
      <td>$MSE = \frac{\sum_{i=0}^{N-1} (\mathbf{y}_i - \hat{\mathbf{y}}_i)^2}{N}$</td>
    </tr>
    <tr>
      <td>Root Mean Squared Error (RMSE)</td>
      <td>$RMSE = \sqrt{\frac{\sum_{i=0}^{N-1} (\mathbf{y}_i - \hat{\mathbf{y}}_i)^2}{N}}$</td>
    </tr>
    <tr>
      <td>Mean Absolute Error (MAE)</td>
      <td>$MAE=\frac{1}{N}\sum_{i=0}^{N-1} \left|\mathbf{y}_i - \hat{\mathbf{y}}_i\right|$</td>
    </tr>
    <tr>
      <td>Coefficient of Determination $(R^2)$</td>
      <td>$R^2=1 - \frac{N \cdot MSE}{\text{VAR}(\mathbf{y}) \cdot (N-1)}=1-\frac{\sum_{i=0}^{N-1}
        (\mathbf{y}_i - \hat{\mathbf{y}}_i)^2}{\sum_{i=0}^{N-1}(\mathbf{y}_i-\bar{\mathbf{y}})^2}$</td>
    </tr>
    <tr>
      <td>Explained Variance</td>
      <td>$1 - \frac{\text{VAR}(\mathbf{y} - \mathbf{\hat{y}})}{\text{VAR}(\mathbf{y})}$</td>
    </tr>
  </tbody>
</table>
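
For reference, a minimal sketch of computing these metrics with `RegressionMetrics` on a few hypothetical
(prediction, observation) pairs (the numbers and the active `SparkContext` `sc` are made up for illustration):

```scala
import org.apache.spark.mllib.evaluation.RegressionMetrics

// Hypothetical (prediction, observation) pairs; `sc` is assumed to be an active SparkContext.
val predictionAndObservations = sc.parallelize(Seq(
  (2.5, 3.0), (0.0, -0.5), (2.1, 2.0), (7.8, 8.0)))

val metrics = new RegressionMetrics(predictionAndObservations)
println(metrics.meanSquaredError)      // MSE
println(metrics.rootMeanSquaredError)  // RMSE
println(metrics.meanAbsoluteError)     // MAE
println(metrics.r2)                    // coefficient of determination
println(metrics.explainedVariance)
```
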