---
layout: global
title: "MLlib: Main Guide"
displayTitle: "Machine Learning Library (MLlib) Guide"
license: |
  Licensed to the Apache Software Foundation (ASF) under one or more
  contributor license agreements.  See the NOTICE file distributed with
  this work for additional information regarding copyright ownership.
  The ASF licenses this file to You under the Apache License, Version 2.0
  (the "License"); you may not use this file except in compliance with
  the License.  You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.
---

MLlib is Spark's machine learning (ML) library.
Its goal is to make practical machine learning scalable and easy.
At a high level, it provides tools such as:

* ML Algorithms: common learning algorithms such as classification, regression, clustering, and collaborative filtering
* Featurization: feature extraction, transformation, dimensionality reduction, and selection
* Pipelines: tools for constructing, evaluating, and tuning ML Pipelines
* Persistence: saving and loading algorithms, models, and Pipelines
* Utilities: linear algebra, statistics, data handling, etc.

# Announcement: DataFrame-based API is primary API

**The MLlib RDD-based API is now in maintenance mode.**

As of Spark 2.0, the [RDD](rdd-programming-guide.html#resilient-distributed-datasets-rdds)-based APIs in the `spark.mllib` package have entered maintenance mode.
The primary Machine Learning API for Spark is now the [DataFrame](sql-programming-guide.html)-based API in the `spark.ml` package.

*What are the implications?*

* MLlib will still support the RDD-based API in `spark.mllib` with bug fixes.
* MLlib will not add new features to the RDD-based API.
* In the Spark 2.x releases, MLlib will add features to the DataFrame-based API to reach feature parity with the RDD-based API.

*Why is MLlib switching to the DataFrame-based API?*

* DataFrames provide a more user-friendly API than RDDs.  The many benefits of DataFrames include Spark Datasources, SQL/DataFrame queries, Tungsten and Catalyst optimizations, and uniform APIs across languages.
* The DataFrame-based API for MLlib provides a uniform API across ML algorithms and across multiple languages.
* DataFrames facilitate practical ML Pipelines, particularly feature transformations.  See the [Pipelines guide](ml-pipeline.html) for details.

*What is "Spark ML"?*

* "Spark ML" is not an official name, but it is occasionally used to refer to the MLlib DataFrame-based API.
  This is mainly due to the `org.apache.spark.ml` Scala package name used by the DataFrame-based API,
  and the "Spark ML Pipelines" term we used initially to emphasize the pipeline concept.

*Is MLlib deprecated?*

* No. MLlib includes both the RDD-based API and the DataFrame-based API.
  The RDD-based API is now in maintenance mode,
  but neither API is deprecated, nor is MLlib as a whole.

# Dependencies

MLlib uses the linear algebra package [Breeze](http://www.scalanlp.org/), which depends on
[netlib-java](https://github.com/fommil/netlib-java) for optimised numerical processing.
If native libraries[^1] are not available at runtime, you will see a warning message and a pure JVM
implementation will be used instead.

Due to licensing issues with runtime proprietary binaries, we do not include `netlib-java`'s native
proxies by default.
To configure `netlib-java` / Breeze to use system optimised binaries, include
`com.github.fommil.netlib:all:1.1.2` (or build Spark with `-Pnetlib-lgpl`) as a dependency of your
project and read the [netlib-java](https://github.com/fommil/netlib-java) documentation for your
platform's additional installation instructions.

The most popular native BLAS implementations, such as [Intel MKL](https://software.intel.com/en-us/mkl) and [OpenBLAS](http://www.openblas.net), can use multiple threads in a single operation, which can conflict with Spark's execution model.

Configuring these BLAS implementations to use a single thread per operation may actually improve performance (see [SPARK-21305](https://issues.apache.org/jira/browse/SPARK-21305)). It is usually optimal to match the number of BLAS threads to the number of cores each Spark task is configured to use, which is 1 by default and typically left at 1.

Please refer to resources like the following to understand how to configure the number of threads these BLAS implementations use: [Intel MKL](https://software.intel.com/en-us/articles/recommended-settings-for-calling-intel-mkl-routines-from-multi-threaded-applications) or [Intel oneMKL](https://software.intel.com/en-us/onemkl-linux-developer-guide-improving-performance-with-threading) and [OpenBLAS](https://github.com/xianyi/OpenBLAS/wiki/faq#multi-threaded). Note that if native BLAS is not properly configured on the system, the Java implementation (f2jBLAS) will be used as a fallback.
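
One possible way to keep native BLAS threads in line with the cores per task is to forward thread-count environment variables to executors through Spark's `spark.executorEnv.*` settings. The sketch below is an illustration under assumptions, not an official recipe: the helper names `blas_thread_conf` and `to_submit_args` are hypothetical, and it assumes your OpenBLAS build honors `OPENBLAS_NUM_THREADS` and your MKL build honors `MKL_NUM_THREADS`. The driver process would need the same variables set separately.

```python
def blas_thread_conf(task_cpus=1):
    """Return Spark conf pairs that pin native BLAS threads to task_cpus.

    Hypothetical helper: it only builds key/value pairs and does not
    contact a cluster.
    """
    if task_cpus < 1:
        raise ValueError("task_cpus must be at least 1")
    threads = str(task_cpus)
    return {
        # Cores reserved per task (Spark's default is 1).
        "spark.task.cpus": threads,
        # Environment variables exported to each executor process.
        "spark.executorEnv.OPENBLAS_NUM_THREADS": threads,
        "spark.executorEnv.MKL_NUM_THREADS": threads,
    }


def to_submit_args(conf):
    """Render conf pairs as a list of spark-submit --conf arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args.extend(["--conf", f"{key}={value}"])
    return args


# Print the flags for the default of one core (and one BLAS thread) per task.
print(" ".join(to_submit_args(blas_thread_conf())))
```

The same pairs could instead be applied one by one with `SparkSession.builder.config(key, value)` when the session is created in code.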

To use MLlib in Python, you will need [NumPy](http://www.numpy.org) version 1.4 or newer.
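
A quick way to check this requirement is to compare the installed version numerically rather than as a string (as strings, "1.10" sorts before "1.4"). The helper name `meets_minimum` below is hypothetical, and the sketch only looks at the leading numeric components of the version.

```python
import re


def meets_minimum(version, minimum=(1, 4)):
    """Return True if a dotted version string is at least `minimum`."""
    parts = []
    for piece in version.split(".")[: len(minimum)]:
        # Keep only the leading digits, so "0rc1" is read as 0.
        match = re.match(r"\d+", piece)
        parts.append(int(match.group()) if match else 0)
    # Pad short versions such as "2" so the tuple comparison is well defined.
    parts += [0] * (len(minimum) - len(parts))
    return tuple(parts) >= tuple(minimum)


try:
    import numpy

    print(numpy.__version__, meets_minimum(numpy.__version__))
except ImportError:
    print("NumPy is not installed")
```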

[^1]: To learn more about the benefits and background of system optimised natives, you may wish to
    watch Sam Halliday's ScalaX talk on [High Performance Linear Algebra in Scala](http://fommil.github.io/scalax14/#/).

# Highlights in 3.0

The list below highlights some of the new features and enhancements added to MLlib in the `3.0`
release of Spark:

* Multiple columns support was added to `Binarizer` ([SPARK-23578](https://issues.apache.org/jira/browse/SPARK-23578)), `StringIndexer` ([SPARK-11215](https://issues.apache.org/jira/browse/SPARK-11215)), `StopWordsRemover` ([SPARK-29808](https://issues.apache.org/jira/browse/SPARK-29808)) and PySpark `QuantileDiscretizer` ([SPARK-22796](https://issues.apache.org/jira/browse/SPARK-22796)).
* Tree-Based Feature Transformation was added
([SPARK-13677](https://issues.apache.org/jira/browse/SPARK-13677)).
* Two new evaluators `MultilabelClassificationEvaluator` ([SPARK-16692](https://issues.apache.org/jira/browse/SPARK-16692)) and `RankingEvaluator` ([SPARK-28045](https://issues.apache.org/jira/browse/SPARK-28045)) were added.
* Sample weights support was added in `DecisionTreeClassifier/Regressor` ([SPARK-19591](https://issues.apache.org/jira/browse/SPARK-19591)), `RandomForestClassifier/Regressor` ([SPARK-9478](https://issues.apache.org/jira/browse/SPARK-9478)), `GBTClassifier/Regressor` ([SPARK-9612](https://issues.apache.org/jira/browse/SPARK-9612)), `MulticlassClassificationEvaluator` ([SPARK-24101](https://issues.apache.org/jira/browse/SPARK-24101)), `RegressionEvaluator` ([SPARK-24102](https://issues.apache.org/jira/browse/SPARK-24102)), `BinaryClassificationEvaluator` ([SPARK-24103](https://issues.apache.org/jira/browse/SPARK-24103)), `BisectingKMeans` ([SPARK-30351](https://issues.apache.org/jira/browse/SPARK-30351)), `KMeans` ([SPARK-29967](https://issues.apache.org/jira/browse/SPARK-29967)) and `GaussianMixture` ([SPARK-30102](https://issues.apache.org/jira/browse/SPARK-30102)).
* R API for `PowerIterationClustering` was added
([SPARK-19827](https://issues.apache.org/jira/browse/SPARK-19827)).
* A Spark ML listener for tracking ML pipeline status was added
([SPARK-23674](https://issues.apache.org/jira/browse/SPARK-23674)).
* Fit with validation set was added to Gradient Boosted Trees in Python
([SPARK-24333](https://issues.apache.org/jira/browse/SPARK-24333)).
* [`RobustScaler`](ml-features.html#robustscaler) transformer was added
([SPARK-28399](https://issues.apache.org/jira/browse/SPARK-28399)).
* [`Factorization Machines`](ml-classification-regression.html#factorization-machines) classifier and regressor were added
([SPARK-29224](https://issues.apache.org/jira/browse/SPARK-29224)).
* Gaussian Naive Bayes Classifier ([SPARK-16872](https://issues.apache.org/jira/browse/SPARK-16872)) and Complement Naive Bayes Classifier ([SPARK-29942](https://issues.apache.org/jira/browse/SPARK-29942)) were added.
* ML function parity between Scala and Python
([SPARK-28958](https://issues.apache.org/jira/browse/SPARK-28958)).
* `predictRaw` was made public in all classification models. `predictProbability` was made public in all classification models except `LinearSVCModel`
([SPARK-30358](https://issues.apache.org/jira/browse/SPARK-30358)).

# Migration Guide

The migration guide is now archived [on this page](ml-migration-guide.html).