the-tree/docs/mllib-isotonic-regression.md

---
layout: global
title: Isotonic regression - RDD-based API
displayTitle: Regression - RDD-based API
license: |
  Licensed to the Apache Software Foundation (ASF) under one or more
  contributor license agreements.  See the NOTICE file distributed with
  this work for additional information regarding copyright ownership.
  The ASF licenses this file to You under the Apache License, Version 2.0
  (the "License"); you may not use this file except in compliance with
  the License.  You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.
---

## Isotonic regression
[Isotonic regression](http://en.wikipedia.org/wiki/Isotonic_regression)
belongs to the family of regression algorithms. Formally, isotonic regression is the following
problem: given a finite set of real numbers `$Y = \{y_1, y_2, ..., y_n\}$` representing observed
responses and `$X = \{x_1, x_2, ..., x_n\}$` representing the unknown response values to be fitted,
find the values that minimize

`\begin{equation}
  f(x) = \sum_{i=1}^n w_i (y_i - x_i)^2
\end{equation}`

subject to the complete order constraint
`$x_1\le x_2\le ...\le x_n$`, where the `$w_i$` are positive weights.
The resulting function is called the isotonic regression, and it is unique.
It can be viewed as a least squares problem under an order restriction.
Essentially, isotonic regression is the
[monotonic function](http://en.wikipedia.org/wiki/Monotonic_function)
that best fits the original data points.
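
As a small worked illustration (the numbers are made up for this sketch and do not come from Spark),
take `$n = 3$`, unit weights `$w_i = 1$`, and observed responses `$y = (1, 3, 2)$`.
Setting `$x = y$` would violate `$x_2 \le x_3$`, so the second and third values are pooled
into their weighted mean, which gives the fit `$x = (1, 2.5, 2.5)$`:

`\begin{equation}
  x_2 = x_3 = \frac{w_2 y_2 + w_3 y_3}{w_2 + w_3} = \frac{3 + 2}{2} = 2.5,
  \qquad
  f(x) = (1 - 1)^2 + (3 - 2.5)^2 + (2 - 2.5)^2 = 0.5
\end{equation}`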

`spark.mllib` supports a
[pool adjacent violators algorithm](https://doi.org/10.1198/TECH.2010.10111)
which uses an approach to
[parallelizing isotonic regression](https://doi.org/10.1007/978-3-642-99789-1_10).
The training input is an RDD of tuples of three double values that represent
label, feature, and weight, in this order. Additionally, the `IsotonicRegression` algorithm has one
optional parameter called `isotonic`, which defaults to true.
This argument specifies whether the fitted function is
isotonic (monotonically increasing) or antitonic (monotonically decreasing).
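
To make the idea of pooling adjacent violators concrete, here is a minimal single-machine sketch
in plain Python. It is not the parallel implementation used by `spark.mllib`. The function name
`pava_fit` is hypothetical, the input is assumed to have distinct features, and handling the
antitonic case by negating the labels is a choice made for this sketch. The input mirrors the
(label, feature, weight) tuple order described above.

```python
def pava_fit(points, isotonic=True):
    """Single-machine pool-adjacent-violators sketch (hypothetical helper).

    points: iterable of (label, feature, weight) tuples with positive weights
            and distinct features (an assumption of this sketch).
    Returns a list of (feature, fitted_label) pairs sorted by feature.
    """
    # Antitonic fit = isotonic fit on negated labels, negated back at the end.
    sign = 1.0 if isotonic else -1.0
    ordered = sorted(points, key=lambda p: p[1])

    # Each block stores [weighted label sum, total weight, number of points].
    blocks = []
    for label, _feature, weight in ordered:
        blocks.append([sign * label * weight, weight, 1])
        # Pool backwards while the previous block's mean exceeds the last one.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            s, w, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += w
            blocks[-1][2] += c

    # Every point in a block receives the block's weighted mean.
    fitted = []
    for s, w, c in blocks:
        fitted.extend([sign * s / w] * c)
    return [(p[1], y) for p, y in zip(ordered, fitted)]


data = [(1.0, 1.0, 1.0), (3.0, 2.0, 1.0), (2.0, 3.0, 1.0)]
print(pava_fit(data))                  # [(1.0, 1.0), (2.0, 2.5), (3.0, 2.5)]
print(pava_fit(data, isotonic=False))  # [(1.0, 2.0), (2.0, 2.0), (3.0, 2.0)]
```

Keeping a weighted sum and a total weight per block means that pooling two blocks is a constant-time
update, and the weighted mean of a pooled block is the least squares value for all of its points.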

Training returns an `IsotonicRegressionModel` that can be used to predict
labels for both known and unknown features. The result of isotonic regression
is treated as a piecewise linear function. The rules for prediction are therefore:

* If the prediction input exactly matches a training feature,
  then the associated prediction is returned. If there are multiple predictions with the same
  feature, then one of them is returned. Which one is undefined
  (same as `java.util.Arrays.binarySearch`).
* If the prediction input is lower or higher than all training features,
  then the prediction with the lowest or highest feature is returned, respectively.
  If there are multiple predictions with the same feature,
  then the lowest or highest is returned, respectively.
* If the prediction input falls between two training features, then the prediction is treated
  as a piecewise linear function and the interpolated value is calculated from the
  predictions of the two closest features. If there are multiple values
  with the same feature, then the same rules as in the previous point are used.
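
The three rules above can be sketched in plain Python as follows. This is an illustration only,
not the code of `IsotonicRegressionModel`. The function name `predict` and the argument names
`boundaries` and `predictions` are chosen for this sketch, and the training features are assumed
to be sorted and distinct, so the duplicate-feature cases are not covered.

```python
from bisect import bisect_left


def predict(boundaries, predictions, x):
    """Piecewise linear prediction sketch.

    boundaries:  training features, sorted ascending and distinct (assumption).
    predictions: fitted labels, one per entry in boundaries.
    """
    # Rule 2: inputs outside the training range are clamped to the end values.
    if x <= boundaries[0]:
        return predictions[0]
    if x >= boundaries[-1]:
        return predictions[-1]

    # Binary search for the first boundary that is >= x.
    i = bisect_left(boundaries, x)

    # Rule 1: an exact match returns the associated prediction.
    if boundaries[i] == x:
        return predictions[i]

    # Rule 3: interpolate linearly between the two closest features.
    x0, x1 = boundaries[i - 1], boundaries[i]
    y0, y1 = predictions[i - 1], predictions[i]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


boundaries = [1.0, 2.0, 4.0]
predictions = [1.0, 2.5, 3.5]
print(predict(boundaries, predictions, 2.0))   # 2.5 (exact match)
print(predict(boundaries, predictions, 10.0))  # 3.5 (above all features)
print(predict(boundaries, predictions, 3.0))   # 3.0 (interpolated)
```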

### Examples

<div class="codetabs">
<div data-lang="scala" markdown="1">
Data are read from a file where each line has the format `label,feature`,
e.g. `4710.28,500.00`. The data are split into training and test sets.
The model is created using the training set, and the mean squared error is calculated from the
predicted labels and the real labels in the test set.

Refer to the [`IsotonicRegression` Scala docs](api/scala/org/apache/spark/mllib/regression/IsotonicRegression.html) and [`IsotonicRegressionModel` Scala docs](api/scala/org/apache/spark/mllib/regression/IsotonicRegressionModel.html) for details on the API.

{% include_example scala/org/apache/spark/examples/mllib/IsotonicRegressionExample.scala %}
</div>
<div data-lang="java" markdown="1">
Data are read from a file where each line has the format `label,feature`,
e.g. `4710.28,500.00`. The data are split into training and test sets.
The model is created using the training set, and the mean squared error is calculated from the
predicted labels and the real labels in the test set.

Refer to the [`IsotonicRegression` Java docs](api/java/org/apache/spark/mllib/regression/IsotonicRegression.html) and [`IsotonicRegressionModel` Java docs](api/java/org/apache/spark/mllib/regression/IsotonicRegressionModel.html) for details on the API.

{% include_example java/org/apache/spark/examples/mllib/JavaIsotonicRegressionExample.java %}
</div>
<div data-lang="python" markdown="1">
Data are read from a file where each line has the format `label,feature`,
e.g. `4710.28,500.00`. The data are split into training and test sets.
The model is created using the training set, and the mean squared error is calculated from the
predicted labels and the real labels in the test set.

Refer to the [`IsotonicRegression` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.regression.IsotonicRegression) and [`IsotonicRegressionModel` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.regression.IsotonicRegressionModel) for details on the API.

{% include_example python/mllib/isotonic_regression_example.py %}
</div>
</div>