0001 ---
0002 layout: global
0003 title: ML Pipelines
0004 displayTitle: ML Pipelines
0005 license: |
0006 Licensed to the Apache Software Foundation (ASF) under one or more
0007 contributor license agreements. See the NOTICE file distributed with
0008 this work for additional information regarding copyright ownership.
0009 The ASF licenses this file to You under the Apache License, Version 2.0
0010 (the "License"); you may not use this file except in compliance with
0011 the License. You may obtain a copy of the License at
0012
0013 http://www.apache.org/licenses/LICENSE-2.0
0014
0015 Unless required by applicable law or agreed to in writing, software
0016 distributed under the License is distributed on an "AS IS" BASIS,
0017 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
0018 See the License for the specific language governing permissions and
0019 limitations under the License.
0020 ---
0021
0022 `\[
0023 \newcommand{\R}{\mathbb{R}}
0024 \newcommand{\E}{\mathbb{E}}
0025 \newcommand{\x}{\mathbf{x}}
0026 \newcommand{\y}{\mathbf{y}}
0027 \newcommand{\wv}{\mathbf{w}}
0028 \newcommand{\av}{\mathbf{\alpha}}
0029 \newcommand{\bv}{\mathbf{b}}
0030 \newcommand{\N}{\mathbb{N}}
0031 \newcommand{\id}{\mathbf{I}}
0032 \newcommand{\ind}{\mathbf{1}}
0033 \newcommand{\0}{\mathbf{0}}
0034 \newcommand{\unit}{\mathbf{e}}
0035 \newcommand{\one}{\mathbf{1}}
0036 \newcommand{\zero}{\mathbf{0}}
0037 \]`
0038
0039 In this section, we introduce the concept of ***ML Pipelines***.
0040 ML Pipelines provide a uniform set of high-level APIs built on top of
0041 [DataFrames](sql-programming-guide.html) that help users create and tune practical
0042 machine learning pipelines.
0043
0044 **Table of Contents**
0045
0046 * This will become a table of contents (this text will be scraped).
0047 {:toc}
0048
0049 # Main concepts in Pipelines
0050
0051 MLlib standardizes APIs for machine learning algorithms to make it easier to combine multiple
0052 algorithms into a single pipeline, or workflow.
0053 This section covers the key concepts introduced by the Pipelines API, where the pipeline concept is
0054 mostly inspired by the [scikit-learn](http://scikit-learn.org/) project.
0055
0056 * **[`DataFrame`](ml-pipeline.html#dataframe)**: This ML API uses `DataFrame` from Spark SQL as an ML
0057 dataset, which can hold a variety of data types.
0058 E.g., a `DataFrame` could have different columns storing text, feature vectors, true labels, and predictions.
0059
0060 * **[`Transformer`](ml-pipeline.html#transformers)**: A `Transformer` is an algorithm which can transform one `DataFrame` into another `DataFrame`.
0061 E.g., an ML model is a `Transformer` which transforms a `DataFrame` with features into a `DataFrame` with predictions.
0062
0063 * **[`Estimator`](ml-pipeline.html#estimators)**: An `Estimator` is an algorithm which can be fit on a `DataFrame` to produce a `Transformer`.
0064 E.g., a learning algorithm is an `Estimator` which trains on a `DataFrame` and produces a model.
0065
0066 * **[`Pipeline`](ml-pipeline.html#pipeline)**: A `Pipeline` chains multiple `Transformer`s and `Estimator`s together to specify an ML workflow.
0067
0068 * **[`Parameter`](ml-pipeline.html#parameters)**: All `Transformer`s and `Estimator`s now share a common API for specifying parameters.
0069
0070 ## DataFrame
0071
0072 Machine learning can be applied to a wide variety of data types, such as vectors, text, images, and structured data.
0073 This API adopts the `DataFrame` from Spark SQL in order to support a variety of data types.
0074
0075 `DataFrame` supports many basic and structured types; see the [Spark SQL datatype reference](sql-reference.html#data-types) for a list of supported types.
0076 In addition to the types listed in the Spark SQL guide, `DataFrame` can use ML [`Vector`](mllib-data-types.html#local-vector) types.
0077
A `DataFrame` can be created either implicitly or explicitly from a regular `RDD`. See the code examples below and the [Spark SQL programming guide](sql-programming-guide.html) for more detail.
0079
0080 Columns in a `DataFrame` are named. The code examples below use names such as "text", "features", and "label".
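
For instance, a small labeled text dataset can be created directly from local data. The following Python sketch (not one of the bundled examples; the app name is arbitrary) builds a `DataFrame` with exactly those column names:

{% highlight python %}
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("MLPipelineDocs").getOrCreate()

# Each row holds an id, a raw text document, and a label.
training = spark.createDataFrame([
    (0, "a b c d e spark", 1.0),
    (1, "b d", 0.0),
    (2, "spark f g h", 1.0),
    (3, "hadoop mapreduce", 0.0)
], ["id", "text", "label"])
{% endhighlight %}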
0081
0082 ## Pipeline components
0083
0084 ### Transformers
0085
0086 A `Transformer` is an abstraction that includes feature transformers and learned models.
0087 Technically, a `Transformer` implements a method `transform()`, which converts one `DataFrame` into
0088 another, generally by appending one or more columns.
0089 For example:
0090
0091 * A feature transformer might take a `DataFrame`, read a column (e.g., text), map it into a new
0092 column (e.g., feature vectors), and output a new `DataFrame` with the mapped column appended.
0093 * A learning model might take a `DataFrame`, read the column containing feature vectors, predict the
0094 label for each feature vector, and output a new `DataFrame` with predicted labels appended as a
0095 column.
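
As a concrete sketch of the first case, `Tokenizer` is a feature transformer that appends a words column (here applied to the `training` DataFrame from the sketch above):

{% highlight python %}
from pyspark.ml.feature import Tokenizer

# transform() returns a new DataFrame with a "words" column appended;
# the input DataFrame itself is left unchanged.
tokenizer = Tokenizer(inputCol="text", outputCol="words")
words_df = tokenizer.transform(training)
words_df.select("text", "words").show(truncate=False)
{% endhighlight %}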
0096
0097 ### Estimators
0098
0099 An `Estimator` abstracts the concept of a learning algorithm or any algorithm that fits or trains on
0100 data.
0101 Technically, an `Estimator` implements a method `fit()`, which accepts a `DataFrame` and produces a
0102 `Model`, which is a `Transformer`.
0103 For example, a learning algorithm such as `LogisticRegression` is an `Estimator`, and calling
0104 `fit()` trains a `LogisticRegressionModel`, which is a `Model` and hence a `Transformer`.
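
In code, this looks roughly as follows (a Python sketch; the tiny dataset and parameter values are illustrative, and `spark` is the session from the earlier sketch):

{% highlight python %}
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors

df = spark.createDataFrame([
    (Vectors.dense([0.0, 1.1, 0.1]), 1.0),
    (Vectors.dense([2.0, 1.0, -1.0]), 0.0),
    (Vectors.dense([2.0, 1.3, 1.0]), 1.0)
], ["features", "label"])

lr = LogisticRegression(maxIter=10, regParam=0.01)  # an Estimator
model = lr.fit(df)                 # fit() produces a LogisticRegressionModel...
predictions = model.transform(df)  # ...which is a Transformer
{% endhighlight %}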
0105
0106 ### Properties of pipeline components
0107
0108 `Transformer.transform()`s and `Estimator.fit()`s are both stateless. In the future, stateful algorithms may be supported via alternative concepts.
0109
0110 Each instance of a `Transformer` or `Estimator` has a unique ID, which is useful in specifying parameters (discussed below).
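
In Python, for example, the ID is exposed as `uid` (a quick sketch; the printed values are auto-generated):

{% highlight python %}
from pyspark.ml.feature import Tokenizer

t1 = Tokenizer()
t2 = Tokenizer()
# Two instances of the same class still get distinct, auto-generated IDs.
print(t1.uid)  # e.g. Tokenizer_a1b2c3d4e5f6
print(t2.uid)  # a different ID
{% endhighlight %}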
0111
0112 ## Pipeline
0113
0114 In machine learning, it is common to run a sequence of algorithms to process and learn from data.
0115 E.g., a simple text document processing workflow might include several stages:
0116
0117 * Split each document's text into words.
0118 * Convert each document's words into a numerical feature vector.
0119 * Learn a prediction model using the feature vectors and labels.
0120
0121 MLlib represents such a workflow as a `Pipeline`, which consists of a sequence of
0122 `PipelineStage`s (`Transformer`s and `Estimator`s) to be run in a specific order.
0123 We will use this simple workflow as a running example in this section.
0124
0125 ### How it works
0126
0127 A `Pipeline` is specified as a sequence of stages, and each stage is either a `Transformer` or an `Estimator`.
0128 These stages are run in order, and the input `DataFrame` is transformed as it passes through each stage.
0129 For `Transformer` stages, the `transform()` method is called on the `DataFrame`.
0130 For `Estimator` stages, the `fit()` method is called to produce a `Transformer` (which becomes part of the `PipelineModel`, or fitted `Pipeline`), and that `Transformer`'s `transform()` method is called on the `DataFrame`.
0131
0132 We illustrate this for the simple text document workflow. The figure below is for the *training time* usage of a `Pipeline`.
0133
0134 <p style="text-align: center;">
0135 <img
0136 src="img/ml-Pipeline.png"
0137 title="ML Pipeline Example"
0138 alt="ML Pipeline Example"
0139 width="80%"
0140 />
0141 </p>
0142
0143 Above, the top row represents a `Pipeline` with three stages.
0144 The first two (`Tokenizer` and `HashingTF`) are `Transformer`s (blue), and the third (`LogisticRegression`) is an `Estimator` (red).
0145 The bottom row represents data flowing through the pipeline, where cylinders indicate `DataFrame`s.
0146 The `Pipeline.fit()` method is called on the original `DataFrame`, which has raw text documents and labels.
0147 The `Tokenizer.transform()` method splits the raw text documents into words, adding a new column with words to the `DataFrame`.
0148 The `HashingTF.transform()` method converts the words column into feature vectors, adding a new column with those vectors to the `DataFrame`.
0149 Now, since `LogisticRegression` is an `Estimator`, the `Pipeline` first calls `LogisticRegression.fit()` to produce a `LogisticRegressionModel`.
If the `Pipeline` had more stages, it would call the `LogisticRegressionModel`'s `transform()`
method on the `DataFrame` before passing the `DataFrame` to the next stage.
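
Expressed in code, the training-time usage above looks roughly like the following Python sketch (the `training` DataFrame is the one from the DataFrame section; parameter values are illustrative):

{% highlight python %}
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer

tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.001)

# fit() runs Tokenizer.transform(), then HashingTF.transform(), then
# LogisticRegression.fit(), and returns the fitted PipelineModel.
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])
model = pipeline.fit(training)
{% endhighlight %}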
0152
0153 A `Pipeline` is an `Estimator`.
0154 Thus, after a `Pipeline`'s `fit()` method runs, it produces a `PipelineModel`, which is a
0155 `Transformer`.
0156 This `PipelineModel` is used at *test time*; the figure below illustrates this usage.
0157
0158 <p style="text-align: center;">
0159 <img
0160 src="img/ml-PipelineModel.png"
0161 title="ML PipelineModel Example"
0162 alt="ML PipelineModel Example"
0163 width="80%"
0164 />
0165 </p>
0166
0167 In the figure above, the `PipelineModel` has the same number of stages as the original `Pipeline`, but all `Estimator`s in the original `Pipeline` have become `Transformer`s.
0168 When the `PipelineModel`'s `transform()` method is called on a test dataset, the data are passed
0169 through the fitted pipeline in order.
0170 Each stage's `transform()` method updates the dataset and passes it to the next stage.
0171
0172 `Pipeline`s and `PipelineModel`s help to ensure that training and test data go through identical feature processing steps.
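
Continuing the sketch, scoring unlabeled documents at test time only requires calling `transform()` on the fitted `PipelineModel` (variable and column names as above):

{% highlight python %}
test = spark.createDataFrame([
    (4, "spark i j k"),
    (5, "l m n"),
    (6, "spark hadoop spark")
], ["id", "text"])

# The same tokenization and hashing applied during training are re-applied
# here before the fitted model makes predictions.
prediction = model.transform(test)
prediction.select("id", "text", "prediction").show()
{% endhighlight %}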
0173
0174 ### Details
0175
0176 *DAG `Pipeline`s*: A `Pipeline`'s stages are specified as an ordered array. The examples given here are all for linear `Pipeline`s, i.e., `Pipeline`s in which each stage uses data produced by the previous stage. It is possible to create non-linear `Pipeline`s as long as the data flow graph forms a Directed Acyclic Graph (DAG). This graph is currently specified implicitly based on the input and output column names of each stage (generally specified as parameters). If the `Pipeline` forms a DAG, then the stages must be specified in topological order.
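
For example, a DAG with two branches can be expressed like this (a Python sketch with illustrative column names): the `VectorAssembler` merges the two branches, and the stages are listed in topological order.

{% highlight python %}
from pyspark.ml import Pipeline
from pyspark.ml.feature import HashingTF, Tokenizer, VectorAssembler

# Branch 1: title -> titleWords -> titleFeatures
titleTokenizer = Tokenizer(inputCol="title", outputCol="titleWords")
titleTF = HashingTF(inputCol="titleWords", outputCol="titleFeatures")

# Branch 2: body -> bodyWords -> bodyFeatures
bodyTokenizer = Tokenizer(inputCol="body", outputCol="bodyWords")
bodyTF = HashingTF(inputCol="bodyWords", outputCol="bodyFeatures")

# Join point: the implied data-flow graph is a DAG, not a chain.
assembler = VectorAssembler(inputCols=["titleFeatures", "bodyFeatures"],
                            outputCol="features")

dagPipeline = Pipeline(
    stages=[titleTokenizer, titleTF, bodyTokenizer, bodyTF, assembler])
{% endhighlight %}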
0177
0178 *Runtime checking*: Since `Pipeline`s can operate on `DataFrame`s with varied types, they cannot use
0179 compile-time type checking.
0180 `Pipeline`s and `PipelineModel`s instead do runtime checking before actually running the `Pipeline`.
0181 This type checking is done using the `DataFrame` *schema*, a description of the data types of columns in the `DataFrame`.
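
You can inspect the schema these checks operate on with `printSchema()` (a sketch; `words_df` is the tokenized DataFrame from the Transformers section, and the commented output reflects that example):

{% highlight python %}
words_df.printSchema()
# root
#  |-- id: long (nullable = true)
#  |-- text: string (nullable = true)
#  |-- label: double (nullable = true)
#  |-- words: array (nullable = true)
#  |    |-- element: string (containsNull = true)
{% endhighlight %}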
0182
0183 *Unique Pipeline stages*: A `Pipeline`'s stages should be unique instances. E.g., the same instance
0184 `myHashingTF` should not be inserted into the `Pipeline` twice since `Pipeline` stages must have
0185 unique IDs. However, different instances `myHashingTF1` and `myHashingTF2` (both of type `HashingTF`)
0186 can be put into the same `Pipeline` since different instances will be created with different IDs.
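
For instance (a sketch; the column names are illustrative):

{% highlight python %}
from pyspark.ml.feature import HashingTF

# Distinct instances get distinct IDs, so both may be stages of one Pipeline.
myHashingTF1 = HashingTF(inputCol="titleWords", outputCol="titleFeatures")
myHashingTF2 = HashingTF(inputCol="bodyWords", outputCol="bodyFeatures")
{% endhighlight %}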
0187
0188 ## Parameters
0189
0190 MLlib `Estimator`s and `Transformer`s use a uniform API for specifying parameters.
0191
0192 A `Param` is a named parameter with self-contained documentation.
0193 A `ParamMap` is a set of (parameter, value) pairs.
0194
0195 There are two main ways to pass parameters to an algorithm:
0196
0197 1. Set parameters for an instance. E.g., if `lr` is an instance of `LogisticRegression`, one could
0198 call `lr.setMaxIter(10)` to make `lr.fit()` use at most 10 iterations.
This API resembles the API used in the `spark.mllib` package.
0200 2. Pass a `ParamMap` to `fit()` or `transform()`. Any parameters in the `ParamMap` will override parameters previously specified via setter methods.
0201
0202 Parameters belong to specific instances of `Estimator`s and `Transformer`s.
0203 For example, if we have two `LogisticRegression` instances `lr1` and `lr2`, then we can build a `ParamMap` with both `maxIter` parameters specified: `ParamMap(lr1.maxIter -> 10, lr2.maxIter -> 20)`.
0204 This is useful if there are two algorithms with the `maxIter` parameter in a `Pipeline`.
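
Both mechanisms look like this in Python (a sketch using the `df` from the Estimators sketch; in `pyspark.ml`, a plain `dict` of `(Param, value)` pairs serves as the param map):

{% highlight python %}
from pyspark.ml.classification import LogisticRegression

lr = LogisticRegression()
lr.setMaxIter(10)  # 1. set a parameter on the instance itself

# 2. pass a param map to fit(); these values override the setter above,
#    so this run uses maxIter = 20.
model = lr.fit(df, {lr.maxIter: 20, lr.regParam: 0.01})
{% endhighlight %}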
0205
0206 ## ML persistence: Saving and Loading Pipelines
0207
It is often worthwhile to save a model or a pipeline to disk for later use. In Spark 1.6, model import/export functionality was added to the Pipeline API.
0209 As of Spark 2.3, the DataFrame-based API in `spark.ml` and `pyspark.ml` has complete coverage.
0210
0211 ML persistence works across Scala, Java and Python. However, R currently uses a modified format,
0212 so models saved in R can only be loaded back in R; this should be fixed in the future and is
0213 tracked in [SPARK-15572](https://issues.apache.org/jira/browse/SPARK-15572).
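
Saving and loading a fitted pipeline is a one-liner in each direction (a Python sketch; `model` is the fitted `PipelineModel` from the earlier sketch, and the path is illustrative):

{% highlight python %}
from pyspark.ml import PipelineModel

# Persist the fitted PipelineModel (any Hadoop-compatible path works).
model.write().overwrite().save("/tmp/spark-doc-pipeline-model")

# Load it back later, possibly in a different application.
sameModel = PipelineModel.load("/tmp/spark-doc-pipeline-model")
{% endhighlight %}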
0214
0215 ### Backwards compatibility for ML persistence
0216
0217 In general, MLlib maintains backwards compatibility for ML persistence. I.e., if you save an ML
0218 model or Pipeline in one version of Spark, then you should be able to load it back and use it in a
0219 future version of Spark. However, there are rare exceptions, described below.
0220
0221 Model persistence: Is a model or Pipeline saved using Apache Spark ML persistence in Spark
0222 version X loadable by Spark version Y?
0223
0224 * Major versions: No guarantees, but best-effort.
0225 * Minor and patch versions: Yes; these are backwards compatible.
0226 * Note about the format: There are no guarantees for a stable persistence format, but model loading itself is designed to be backwards compatible.
0227
0228 Model behavior: Does a model or Pipeline in Spark version X behave identically in Spark version Y?
0229
0230 * Major versions: No guarantees, but best-effort.
0231 * Minor and patch versions: Identical behavior, except for bug fixes.
0232
0233 For both model persistence and model behavior, any breaking changes across a minor version or patch
0234 version are reported in the Spark version release notes. If a breakage is not reported in release
0235 notes, then it should be treated as a bug to be fixed.
0236
0237 # Code examples
0238
0239 This section gives code examples illustrating the functionality discussed above.
0240 For more info, please refer to the API documentation
0241 ([Scala](api/scala/org/apache/spark/ml/package.html),
0242 [Java](api/java/org/apache/spark/ml/package-summary.html),
0243 and [Python](api/python/pyspark.ml.html)).
0244
0245 ## Example: Estimator, Transformer, and Param
0246
0247 This example covers the concepts of `Estimator`, `Transformer`, and `Param`.
0248
0249 <div class="codetabs">
0250
0251 <div data-lang="scala" markdown="1">
0252
0253 Refer to the [`Estimator` Scala docs](api/scala/org/apache/spark/ml/Estimator.html),
0254 the [`Transformer` Scala docs](api/scala/org/apache/spark/ml/Transformer.html) and
0255 the [`Params` Scala docs](api/scala/org/apache/spark/ml/param/Params.html) for details on the API.
0256
0257 {% include_example scala/org/apache/spark/examples/ml/EstimatorTransformerParamExample.scala %}
0258 </div>
0259
0260 <div data-lang="java" markdown="1">
0261
0262 Refer to the [`Estimator` Java docs](api/java/org/apache/spark/ml/Estimator.html),
0263 the [`Transformer` Java docs](api/java/org/apache/spark/ml/Transformer.html) and
0264 the [`Params` Java docs](api/java/org/apache/spark/ml/param/Params.html) for details on the API.
0265
0266 {% include_example java/org/apache/spark/examples/ml/JavaEstimatorTransformerParamExample.java %}
0267 </div>
0268
0269 <div data-lang="python" markdown="1">
0270
0271 Refer to the [`Estimator` Python docs](api/python/pyspark.ml.html#pyspark.ml.Estimator),
0272 the [`Transformer` Python docs](api/python/pyspark.ml.html#pyspark.ml.Transformer) and
0273 the [`Params` Python docs](api/python/pyspark.ml.html#pyspark.ml.param.Params) for more details on the API.
0274
0275 {% include_example python/ml/estimator_transformer_param_example.py %}
0276 </div>
0277
0278 </div>
0279
0280 ## Example: Pipeline
0281
0282 This example follows the simple text document `Pipeline` illustrated in the figures above.
0283
0284 <div class="codetabs">
0285
0286 <div data-lang="scala" markdown="1">
0287
0288 Refer to the [`Pipeline` Scala docs](api/scala/org/apache/spark/ml/Pipeline.html) for details on the API.
0289
0290 {% include_example scala/org/apache/spark/examples/ml/PipelineExample.scala %}
0291 </div>
0292
<div data-lang="java" markdown="1">

Refer to the [`Pipeline` Java docs](api/java/org/apache/spark/ml/Pipeline.html) for details on the API.
0297
0298 {% include_example java/org/apache/spark/examples/ml/JavaPipelineExample.java %}
0299 </div>
0300
0301 <div data-lang="python" markdown="1">
0302
0303 Refer to the [`Pipeline` Python docs](api/python/pyspark.ml.html#pyspark.ml.Pipeline) for more details on the API.
0304
0305 {% include_example python/ml/pipeline_example.py %}
0306 </div>
0307
0308 </div>
0309
0310 ## Model selection (hyperparameter tuning)
0311
0312 A big benefit of using ML Pipelines is hyperparameter optimization. See the [ML Tuning Guide](ml-tuning.html) for more information on automatic model selection.