the-tree/docs/ml-guide.md

0001 ---
0002 layout: global
0003 title: "MLlib: Main Guide"
0004 displayTitle: "Machine Learning Library (MLlib) Guide"
0005 license: |
0006   Licensed to the Apache Software Foundation (ASF) under one or more
0007   contributor license agreements.  See the NOTICE file distributed with
0008   this work for additional information regarding copyright ownership.
0009   The ASF licenses this file to You under the Apache License, Version 2.0
0010   (the "License"); you may not use this file except in compliance with
0011   the License.  You may obtain a copy of the License at
0012
0013      http://www.apache.org/licenses/LICENSE-2.0
0014
0015   Unless required by applicable law or agreed to in writing, software
0016   distributed under the License is distributed on an "AS IS" BASIS,
0017   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
0018   See the License for the specific language governing permissions and
0019   limitations under the License.
0020 ---
0021
0022 MLlib is Spark's machine learning (ML) library.
0023 Its goal is to make practical machine learning scalable and easy.
0024 At a high level, it provides tools such as:
0025
0026 * ML Algorithms: common learning algorithms such as classification, regression, clustering, and collaborative filtering
0027 * Featurization: feature extraction, transformation, dimensionality reduction, and selection
0028 * Pipelines: tools for constructing, evaluating, and tuning ML Pipelines
0029 * Persistence: saving and loading algorithms, models, and Pipelines
0030 * Utilities: linear algebra, statistics, data handling, etc.
0031
0032 # Announcement: DataFrame-based API is primary API
0033
0034 **The MLlib RDD-based API is now in maintenance mode.**
0035
0036 As of Spark 2.0, the [RDD](rdd-programming-guide.html#resilient-distributed-datasets-rdds)-based APIs in the `spark.mllib` package have entered maintenance mode.
0037 The primary Machine Learning API for Spark is now the [DataFrame](sql-programming-guide.html)-based API in the `spark.ml` package.
0038
0039 *What are the implications?*
0040
0041 * MLlib will still support the RDD-based API in `spark.mllib` with bug fixes.
0042 * MLlib will not add new features to the RDD-based API.
0043 * In the Spark 2.x releases, MLlib will add features to the DataFrame-based API to reach feature parity with the RDD-based API.
0044
0045 *Why is MLlib switching to the DataFrame-based API?*
0046
0047 * DataFrames provide a more user-friendly API than RDDs.  The many benefits of DataFrames include Spark Datasources, SQL/DataFrame queries, Tungsten and Catalyst optimizations, and uniform APIs across languages.
0048 * The DataFrame-based API for MLlib provides a uniform API across ML algorithms and across multiple languages.
0049 * DataFrames facilitate practical ML Pipelines, particularly feature transformations.  See the [Pipelines guide](ml-pipeline.html) for details.
0050
0051 *What is "Spark ML"?*
0052
0053 * "Spark ML" is not an official name, but it is occasionally used to refer to the MLlib DataFrame-based API.
0054   This is mainly due to the `org.apache.spark.ml` Scala package name used by the DataFrame-based API,
0055   and the "Spark ML Pipelines" term we used initially to emphasize the pipeline concept.
0056
0057 *Is MLlib deprecated?*
0058
0059 * No. MLlib includes both the RDD-based API and the DataFrame-based API.
0060   The RDD-based API is now in maintenance mode,
0061   but neither API is deprecated, nor is MLlib as a whole.
0062
0063 # Dependencies
0064
0065 MLlib uses the linear algebra package [Breeze](http://www.scalanlp.org/), which depends on
0066 [netlib-java](https://github.com/fommil/netlib-java) for optimised numerical processing.
0067 If native libraries[^1] are not available at runtime, you will see a warning message and a pure JVM
0068 implementation will be used instead.
0069
0070 Due to licensing issues with runtime proprietary binaries, we do not include `netlib-java`'s native
0071 proxies by default.
0072 To configure `netlib-java` / Breeze to use system optimised binaries, include
0073 `com.github.fommil.netlib:all:1.1.2` (or build Spark with `-Pnetlib-lgpl`) as a dependency of your
0074 project and read the [netlib-java](https://github.com/fommil/netlib-java) documentation for your
0075 platform's additional installation instructions.
0076
0077 The most popular native BLAS libraries, such as [Intel MKL](https://software.intel.com/en-us/mkl) and [OpenBLAS](http://www.openblas.net), can use multiple threads in a single operation, which can conflict with Spark's execution model.
0078
0079 Configuring these BLAS implementations to use a single thread per operation may actually improve performance (see [SPARK-21305](https://issues.apache.org/jira/browse/SPARK-21305)). It is usually optimal to match the number of BLAS threads to the number of cores each Spark task is configured to use, which is 1 by default and typically left at 1.
0080
0081 To learn how to configure the number of threads these BLAS implementations use, refer to resources such as the following: [Intel MKL](https://software.intel.com/en-us/articles/recommended-settings-for-calling-intel-mkl-routines-from-multi-threaded-applications) or [Intel oneMKL](https://software.intel.com/en-us/onemkl-linux-developer-guide-improving-performance-with-threading) and [OpenBLAS](https://github.com/xianyi/OpenBLAS/wiki/faq#multi-threaded). Note that if native BLAS is not properly configured on the system, the Java implementation (f2jBLAS) will be used as a fallback.
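
As a rough illustration of the thread-count advice above, the following sketch builds Spark configuration entries that set the native BLAS thread count to the number of cores per task. The helper name `blas_thread_conf` is hypothetical. The sketch assumes that the BLAS libraries honour the `OPENBLAS_NUM_THREADS` and `MKL_NUM_THREADS` environment variables and that these are passed to executors through `spark.executorEnv.*` settings. Check the linked BLAS documentation for your platform before relying on it.

```python
def blas_thread_conf(task_cpus=1):
    """Build Spark conf entries that match BLAS threads to cores per task.

    Hypothetical helper: it only assembles configuration strings and does
    not verify that a native BLAS library is actually installed.
    """
    if task_cpus < 1:
        raise ValueError("task_cpus must be at least 1")
    threads = str(task_cpus)
    return {
        # Cores reserved for each Spark task (1 by default).
        "spark.task.cpus": threads,
        # Assumed thread-count variables for OpenBLAS and Intel MKL,
        # forwarded to executor processes via spark.executorEnv.*.
        "spark.executorEnv.OPENBLAS_NUM_THREADS": threads,
        "spark.executorEnv.MKL_NUM_THREADS": threads,
    }


# Print the entries in a form that could be passed to spark-submit.
for key, value in sorted(blas_thread_conf().items()):
    print(f"--conf {key}={value}")
```

These settings only affect executors. If the driver also performs linear algebra, the same environment variables would need to be set in the driver's environment.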
0082
0083 To use MLlib in Python, you will need [NumPy](http://www.numpy.org) version 1.4 or newer.
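
One way to confirm that an environment meets this NumPy requirement before launching a PySpark job is a small check such as the sketch below. The helper names `parse_version` and `numpy_is_new_enough` are illustrative, not part of Spark or NumPy, and the check compares only the major and minor version components.

```python
import re

MIN_NUMPY = (1, 4)  # minimum version stated in this guide


def parse_version(version):
    """Return the (major, minor) components of a version string."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return (int(match.group(1)), int(match.group(2)))


def numpy_is_new_enough(version, minimum=MIN_NUMPY):
    """Check whether a NumPy version string meets the minimum."""
    return parse_version(version) >= minimum


try:
    import numpy
except ImportError:
    print("NumPy is not installed")
else:
    status = "ok" if numpy_is_new_enough(numpy.__version__) else "too old"
    print(f"NumPy {numpy.__version__}: {status}")
```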
0084
0085 [^1]: To learn more about the benefits and background of system optimised natives, you may wish to
0086     watch Sam Halliday's ScalaX talk on [High Performance Linear Algebra in Scala](http://fommil.github.io/scalax14/#/).
0087
0088 # Highlights in 3.0
0089
0090 The list below highlights some of the new features and enhancements added to MLlib in the `3.0`
0091 release of Spark:
0092
0093 * Multiple columns support was added to `Binarizer` ([SPARK-23578](https://issues.apache.org/jira/browse/SPARK-23578)), `StringIndexer` ([SPARK-11215](https://issues.apache.org/jira/browse/SPARK-11215)), `StopWordsRemover` ([SPARK-29808](https://issues.apache.org/jira/browse/SPARK-29808)) and PySpark `QuantileDiscretizer` ([SPARK-22796](https://issues.apache.org/jira/browse/SPARK-22796)).
0094 * Tree-Based Feature Transformation was added
0095 ([SPARK-13677](https://issues.apache.org/jira/browse/SPARK-13677)).
0096 * Two new evaluators `MultilabelClassificationEvaluator` ([SPARK-16692](https://issues.apache.org/jira/browse/SPARK-16692)) and `RankingEvaluator` ([SPARK-28045](https://issues.apache.org/jira/browse/SPARK-28045)) were added.
0097 * Sample weights support was added in `DecisionTreeClassifier/Regressor` ([SPARK-19591](https://issues.apache.org/jira/browse/SPARK-19591)), `RandomForestClassifier/Regressor` ([SPARK-9478](https://issues.apache.org/jira/browse/SPARK-9478)), `GBTClassifier/Regressor` ([SPARK-9612](https://issues.apache.org/jira/browse/SPARK-9612)),  `MulticlassClassificationEvaluator` ([SPARK-24101](https://issues.apache.org/jira/browse/SPARK-24101)), `RegressionEvaluator` ([SPARK-24102](https://issues.apache.org/jira/browse/SPARK-24102)), `BinaryClassificationEvaluator` ([SPARK-24103](https://issues.apache.org/jira/browse/SPARK-24103)), `BisectingKMeans` ([SPARK-30351](https://issues.apache.org/jira/browse/SPARK-30351)), `KMeans` ([SPARK-29967](https://issues.apache.org/jira/browse/SPARK-29967)) and `GaussianMixture` ([SPARK-30102](https://issues.apache.org/jira/browse/SPARK-30102)).
0098 * R API for `PowerIterationClustering` was added
0099 ([SPARK-19827](https://issues.apache.org/jira/browse/SPARK-19827)).
0100 * A Spark ML listener for tracking ML pipeline status was added
0101 ([SPARK-23674](https://issues.apache.org/jira/browse/SPARK-23674)).
0102 * Fit with validation set was added to Gradient Boosted Trees in Python
0103 ([SPARK-24333](https://issues.apache.org/jira/browse/SPARK-24333)).
0104 * [`RobustScaler`](ml-features.html#robustscaler) transformer was added
0105 ([SPARK-28399](https://issues.apache.org/jira/browse/SPARK-28399)).
0106 * [`Factorization Machines`](ml-classification-regression.html#factorization-machines) classifier and regressor were added
0107 ([SPARK-29224](https://issues.apache.org/jira/browse/SPARK-29224)).
0108 * Gaussian Naive Bayes Classifier ([SPARK-16872](https://issues.apache.org/jira/browse/SPARK-16872)) and Complement Naive Bayes Classifier ([SPARK-29942](https://issues.apache.org/jira/browse/SPARK-29942)) were added.
0109 * ML function parity between Scala and Python
0110 ([SPARK-28958](https://issues.apache.org/jira/browse/SPARK-28958)).
0111 * `predictRaw` was made public in all classification models. `predictProbability` was made public in all classification models except `LinearSVCModel`
0112 ([SPARK-30358](https://issues.apache.org/jira/browse/SPARK-30358)).
0113
0114 # Migration Guide
0115
0116 The migration guide is now archived [on this page](ml-migration-guide.html).
0117