the-tree/docs/mllib-naive-bayes.md

---
layout: global
title: Naive Bayes - RDD-based API
displayTitle: Naive Bayes - RDD-based API
license: |
  Licensed to the Apache Software Foundation (ASF) under one or more
  contributor license agreements.  See the NOTICE file distributed with
  this work for additional information regarding copyright ownership.
  The ASF licenses this file to You under the Apache License, Version 2.0
  (the "License"); you may not use this file except in compliance with
  the License.  You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.
---

[Naive Bayes](http://en.wikipedia.org/wiki/Naive_Bayes_classifier) is a simple
multiclass classification algorithm that assumes every pair of features is
independent given the label. Naive Bayes can be trained very efficiently. In a
single pass over the training data, it computes the conditional probability
distribution of each feature given the label. It then applies Bayes' theorem to
compute the conditional probability distribution of the label given an observation
and uses that distribution for prediction.

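The rule described above can be written out as follows. This is the standard textbook formulation rather than anything specific to `spark.mllib`; the symbols $x_1, \dots, x_n$ (feature values), $y$ (label), and $\hat{y}$ (predicted label) are introduced here for illustration.

$$
P(y \mid x_1, \dots, x_n) = \frac{P(y) \prod_{j=1}^{n} P(x_j \mid y)}{P(x_1, \dots, x_n)},
\qquad
\hat{y} = \arg\max_{y} \Big[ \log P(y) + \sum_{j=1}^{n} \log P(x_j \mid y) \Big]
$$

The denominator does not depend on $y$, so prediction only needs the per-label prior and the per-feature conditional distributions, both of which can be estimated from counts.
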
`spark.mllib` supports [multinomial naive
Bayes](http://en.wikipedia.org/wiki/Naive_Bayes_classifier#Multinomial_naive_Bayes)
and [Bernoulli naive Bayes](http://nlp.stanford.edu/IR-book/html/htmledition/the-bernoulli-model-1.html).
These models are typically used for [document classification](http://nlp.stanford.edu/IR-book/html/htmledition/naive-bayes-text-classification-1.html).
In that context, each observation is a document and each
feature represents a term. The feature value is the frequency of the term (in multinomial naive Bayes) or
a zero or one indicating whether the term was found in the document (in Bernoulli naive Bayes).
Feature values must be nonnegative. The model type is selected with an optional parameter,
"multinomial" or "bernoulli", with "multinomial" as the default.
[Additive smoothing](http://en.wikipedia.org/wiki/Lidstone_smoothing) can be used by
setting the parameter $\lambda$ (default $1.0$). For document classification, the input feature
vectors are usually sparse, and sparse vectors should be supplied as input to take advantage of
sparsity. Since the training data is only used once, it is not necessary to cache it.

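The single-pass training and additive smoothing described above can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the `spark.mllib` implementation: the function names `train_multinomial_nb` and `predict`, the dict-based sparse representation, and the toy data are all hypothetical. It uses the common smoothed estimate $\theta_{yj} = (N_{yj} + \lambda) / (N_y + \lambda n)$, where $N_{yj}$ is the total count of term $j$ in documents with label $y$, $N_y = \sum_j N_{yj}$, and $n$ is the number of features.

```python
import math
from collections import defaultdict


def train_multinomial_nb(data, num_features, lam=1.0):
    """Fit multinomial naive Bayes in one pass over (label, features) pairs.

    `features` is a sparse dict {feature_index: nonnegative count}.
    Returns (labels, log_prior, log_theta), where log_theta[label][j]
    is the smoothed log P(feature j | label).
    """
    doc_counts = defaultdict(int)                           # documents per label
    term_counts = defaultdict(lambda: defaultdict(float))   # term totals per label

    # Single pass: only per-label counts are accumulated.
    for label, features in data:
        doc_counts[label] += 1
        for j, value in features.items():
            if value < 0:
                raise ValueError("feature values must be nonnegative")
            term_counts[label][j] += value

    labels = sorted(doc_counts)
    total_docs = sum(doc_counts.values())
    log_prior, log_theta = {}, {}
    for label in labels:
        # Assumption: the class prior is smoothed with the same lambda.
        log_prior[label] = math.log(
            (doc_counts[label] + lam) / (total_docs + lam * len(labels)))
        denom = sum(term_counts[label].values()) + lam * num_features
        log_theta[label] = [
            math.log((term_counts[label].get(j, 0.0) + lam) / denom)
            for j in range(num_features)
        ]
    return labels, log_prior, log_theta


def predict(model, features):
    """Return the label with the highest log posterior (up to a constant)."""
    labels, log_prior, log_theta = model

    def score(label):
        # Absent terms have count zero and contribute nothing to the sum.
        return log_prior[label] + sum(
            value * log_theta[label][j] for j, value in features.items())

    return max(labels, key=score)


# Hypothetical toy corpus: three terms, two labels.
training = [
    (0.0, {0: 3.0, 1: 1.0}),
    (0.0, {0: 2.0}),
    (1.0, {1: 1.0, 2: 4.0}),
    (1.0, {2: 3.0}),
]
model = train_multinomial_nb(training, num_features=3)
print(predict(model, {0: 2.0}))  # 0.0
print(predict(model, {2: 1.0}))  # 1.0
```

Because only per-label counts are accumulated, each training record is read once, which is consistent with the remark that caching the input is unnecessary. Storing features as index-to-count mappings means absent terms cost nothing during training or scoring.
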
## Examples

<div class="codetabs">
<div data-lang="scala" markdown="1">

[NaiveBayes](api/scala/org/apache/spark/mllib/classification/NaiveBayes$.html) implements
multinomial naive Bayes. It takes an RDD of
[LabeledPoint](api/scala/org/apache/spark/mllib/regression/LabeledPoint.html), an optional
smoothing parameter `lambda`, and an optional model type parameter (default is "multinomial") as input, and outputs a
[NaiveBayesModel](api/scala/org/apache/spark/mllib/classification/NaiveBayesModel.html), which
can be used for evaluation and prediction.

Refer to the [`NaiveBayes` Scala docs](api/scala/org/apache/spark/mllib/classification/NaiveBayes$.html) and [`NaiveBayesModel` Scala docs](api/scala/org/apache/spark/mllib/classification/NaiveBayesModel.html) for details on the API.

{% include_example scala/org/apache/spark/examples/mllib/NaiveBayesExample.scala %}
</div>
<div data-lang="java" markdown="1">

[NaiveBayes](api/java/org/apache/spark/mllib/classification/NaiveBayes.html) implements
multinomial naive Bayes. It takes a Scala RDD of
[LabeledPoint](api/java/org/apache/spark/mllib/regression/LabeledPoint.html) and an
optional smoothing parameter `lambda` as input, and outputs a
[NaiveBayesModel](api/java/org/apache/spark/mllib/classification/NaiveBayesModel.html), which
can be used for evaluation and prediction.

Refer to the [`NaiveBayes` Java docs](api/java/org/apache/spark/mllib/classification/NaiveBayes.html) and [`NaiveBayesModel` Java docs](api/java/org/apache/spark/mllib/classification/NaiveBayesModel.html) for details on the API.

{% include_example java/org/apache/spark/examples/mllib/JavaNaiveBayesExample.java %}
</div>
<div data-lang="python" markdown="1">

[NaiveBayes](api/python/pyspark.mllib.html#pyspark.mllib.classification.NaiveBayes) implements multinomial
naive Bayes. It takes an RDD of
[LabeledPoint](api/python/pyspark.mllib.html#pyspark.mllib.regression.LabeledPoint) and an optional
smoothing parameter `lambda` as input, and outputs a
[NaiveBayesModel](api/python/pyspark.mllib.html#pyspark.mllib.classification.NaiveBayesModel), which can be
used for evaluation and prediction.

Note that the Python API does not yet support model save/load but will in the future.

Refer to the [`NaiveBayes` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.classification.NaiveBayes) and [`NaiveBayesModel` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.classification.NaiveBayesModel) for more details on the API.

{% include_example python/mllib/naive_bayes_example.py %}
</div>
</div>