0001 ---
0002 layout: global
0003 title: Extracting, transforming and selecting features
0004 displayTitle: Extracting, transforming and selecting features
0005 license: |
0006 Licensed to the Apache Software Foundation (ASF) under one or more
0007 contributor license agreements. See the NOTICE file distributed with
0008 this work for additional information regarding copyright ownership.
0009 The ASF licenses this file to You under the Apache License, Version 2.0
0010 (the "License"); you may not use this file except in compliance with
0011 the License. You may obtain a copy of the License at
0012
0013 http://www.apache.org/licenses/LICENSE-2.0
0014
0015 Unless required by applicable law or agreed to in writing, software
0016 distributed under the License is distributed on an "AS IS" BASIS,
0017 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
0018 See the License for the specific language governing permissions and
0019 limitations under the License.
0020 ---
0021
0022 This section covers algorithms for working with features, roughly divided into these groups:
0023
0024 * Extraction: Extracting features from "raw" data
0025 * Transformation: Scaling, converting, or modifying features
0026 * Selection: Selecting a subset from a larger set of features
0027 * Locality Sensitive Hashing (LSH): This class of algorithms combines aspects of feature transformation with other algorithms.
0028
0029 **Table of Contents**
0030
0031 * This will become a table of contents (this text will be scraped).
0032 {:toc}
0033
0034
0035 # Feature Extractors
0036
0037 ## TF-IDF
0038
0039 [Term frequency-inverse document frequency (TF-IDF)](http://en.wikipedia.org/wiki/Tf%E2%80%93idf)
0040 is a feature vectorization method widely used in text mining to reflect the importance of a term
0041 to a document in the corpus. Denote a term by `$t$`, a document by `$d$`, and the corpus by `$D$`.
0042 Term frequency `$TF(t, d)$` is the number of times that term `$t$` appears in document `$d$`, while
document frequency `$DF(t, D)$` is the number of documents that contain term `$t$`. If we only use
term frequency to measure importance, it is very easy to over-emphasize terms that appear very
often but carry little information about the document, e.g. "a", "the", and "of". If a term appears
very often across the corpus, it carries little information about any particular document.
0047 Inverse document frequency is a numerical measure of how much information a term provides:
0048 `\[
0049 IDF(t, D) = \log \frac{|D| + 1}{DF(t, D) + 1},
0050 \]`
where `$|D|$` is the total number of documents in the corpus. Since the logarithm is used, if a term
0052 appears in all documents, its IDF value becomes 0. Note that a smoothing term is applied to avoid
0053 dividing by zero for terms outside the corpus. The TF-IDF measure is simply the product of TF and IDF:
0054 `\[
0055 TFIDF(t, d, D) = TF(t, d) \cdot IDF(t, D).
0056 \]`
0057 There are several variants on the definition of term frequency and document frequency.
0058 In MLlib, we separate TF and IDF to make them flexible.
0059
0060 **TF**: Both `HashingTF` and `CountVectorizer` can be used to generate the term frequency vectors.
0061
0062 `HashingTF` is a `Transformer` which takes sets of terms and converts those sets into
0063 fixed-length feature vectors. In text processing, a "set of terms" might be a bag of words.
0064 `HashingTF` utilizes the [hashing trick](http://en.wikipedia.org/wiki/Feature_hashing).
0065 A raw feature is mapped into an index (term) by applying a hash function. The hash function
0066 used here is [MurmurHash 3](https://en.wikipedia.org/wiki/MurmurHash). Then term frequencies
0067 are calculated based on the mapped indices. This approach avoids the need to compute a global
0068 term-to-index map, which can be expensive for a large corpus, but it suffers from potential hash
0069 collisions, where different raw features may become the same term after hashing. To reduce the
0070 chance of collision, we can increase the target feature dimension, i.e. the number of buckets
0071 of the hash table. Since a simple modulo on the hashed value is used to determine the vector index,
0072 it is advisable to use a power of two as the feature dimension, otherwise the features will not
0073 be mapped evenly to the vector indices. The default feature dimension is `$2^{18} = 262,144$`.
An optional binary toggle parameter controls term frequency counts. When set to true, all nonzero
0075 frequency counts are set to 1. This is especially useful for discrete probabilistic models that
0076 model binary, rather than integer, counts.
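
For a rough illustration, here is a minimal Scala sketch (assuming a `SparkSession` named `spark` is in scope; the toy data is made up) that applies `HashingTF` with a reduced feature dimension and the binary toggle enabled:

~~~~
import org.apache.spark.ml.feature.HashingTF

import spark.implicits._

// Toy data: each row is a pre-tokenized "document".
val docs = Seq(
  (0, Seq("spark", "spark", "hashing", "trick")),
  (1, Seq("feature", "vector"))
).toDF("id", "words")

val hashingTF = new HashingTF()
  .setInputCol("words")
  .setOutputCol("rawFeatures")
  .setNumFeatures(1 << 10)  // a power of two, as advised above
  .setBinary(true)          // record presence (1.0) rather than counts

hashingTF.transform(docs).select("id", "rawFeatures").show(truncate = false)
~~~~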
0077
`CountVectorizer` converts text documents to vectors of term counts. Refer to
[CountVectorizer](ml-features.html#countvectorizer) for more details.
0080
0081 **IDF**: `IDF` is an `Estimator` which is fit on a dataset and produces an `IDFModel`. The
0082 `IDFModel` takes feature vectors (generally created from `HashingTF` or `CountVectorizer`) and
0083 scales each feature. Intuitively, it down-weights features which appear frequently in a corpus.
0084
0085 **Note:** `spark.ml` doesn't provide tools for text segmentation.
0086 We refer users to the [Stanford NLP Group](http://nlp.stanford.edu/) and
0087 [scalanlp/chalk](https://github.com/scalanlp/chalk).
0088
0089 **Examples**
0090
0091 In the following code segment, we start with a set of sentences. We split each sentence into words
0092 using `Tokenizer`. For each sentence (bag of words), we use `HashingTF` to hash the sentence into
0093 a feature vector. We use `IDF` to rescale the feature vectors; this generally improves performance
0094 when using text as features. Our feature vectors could then be passed to a learning algorithm.
0095
0096 <div class="codetabs">
0097 <div data-lang="scala" markdown="1">
0098
0099 Refer to the [HashingTF Scala docs](api/scala/org/apache/spark/ml/feature/HashingTF.html) and
0100 the [IDF Scala docs](api/scala/org/apache/spark/ml/feature/IDF.html) for more details on the API.
0101
0102 {% include_example scala/org/apache/spark/examples/ml/TfIdfExample.scala %}
0103 </div>
0104
0105 <div data-lang="java" markdown="1">
0106
0107 Refer to the [HashingTF Java docs](api/java/org/apache/spark/ml/feature/HashingTF.html) and the
0108 [IDF Java docs](api/java/org/apache/spark/ml/feature/IDF.html) for more details on the API.
0109
0110 {% include_example java/org/apache/spark/examples/ml/JavaTfIdfExample.java %}
0111 </div>
0112
0113 <div data-lang="python" markdown="1">
0114
0115 Refer to the [HashingTF Python docs](api/python/pyspark.ml.html#pyspark.ml.feature.HashingTF) and
0116 the [IDF Python docs](api/python/pyspark.ml.html#pyspark.ml.feature.IDF) for more details on the API.
0117
0118 {% include_example python/ml/tf_idf_example.py %}
0119 </div>
0120 </div>
0121
0122 ## Word2Vec
0123
0124 `Word2Vec` is an `Estimator` which takes sequences of words representing documents and trains a
0125 `Word2VecModel`. The model maps each word to a unique fixed-size vector. The `Word2VecModel`
0126 transforms each document into a vector using the average of all words in the document; this vector
0127 can then be used as features for prediction, document similarity calculations, etc.
0128 Please refer to the [MLlib user guide on Word2Vec](mllib-feature-extraction.html#word2vec) for more
0129 details.
0130
0131 **Examples**
0132
0133 In the following code segment, we start with a set of documents, each of which is represented as a sequence of words. For each document, we transform it into a feature vector. This feature vector could then be passed to a learning algorithm.
0134
0135 <div class="codetabs">
0136 <div data-lang="scala" markdown="1">
0137
0138 Refer to the [Word2Vec Scala docs](api/scala/org/apache/spark/ml/feature/Word2Vec.html)
0139 for more details on the API.
0140
0141 {% include_example scala/org/apache/spark/examples/ml/Word2VecExample.scala %}
0142 </div>
0143
0144 <div data-lang="java" markdown="1">
0145
0146 Refer to the [Word2Vec Java docs](api/java/org/apache/spark/ml/feature/Word2Vec.html)
0147 for more details on the API.
0148
0149 {% include_example java/org/apache/spark/examples/ml/JavaWord2VecExample.java %}
0150 </div>
0151
0152 <div data-lang="python" markdown="1">
0153
0154 Refer to the [Word2Vec Python docs](api/python/pyspark.ml.html#pyspark.ml.feature.Word2Vec)
0155 for more details on the API.
0156
0157 {% include_example python/ml/word2vec_example.py %}
0158 </div>
0159 </div>
0160
0161 ## CountVectorizer
0162
0163 `CountVectorizer` and `CountVectorizerModel` aim to help convert a collection of text documents
0164 to vectors of token counts. When an a-priori dictionary is not available, `CountVectorizer` can
be used as an `Estimator` to extract the vocabulary and generate a `CountVectorizerModel`. The
0166 model produces sparse representations for the documents over the vocabulary, which can then be
0167 passed to other algorithms like LDA.
0168
0169 During the fitting process, `CountVectorizer` will select the top `vocabSize` words ordered by
0170 term frequency across the corpus. An optional parameter `minDF` also affects the fitting process
0171 by specifying the minimum number (or fraction if < 1.0) of documents a term must appear in to be
0172 included in the vocabulary. Another optional binary toggle parameter controls the output vector.
If set to true, all nonzero counts are set to 1. This is especially useful for discrete probabilistic
0174 models that model binary, rather than integer, counts.
0175
0176 **Examples**
0177
0178 Assume that we have the following DataFrame with columns `id` and `texts`:
0179
0180 ~~~~
0181 id | texts
0182 ----|----------
0183 0 | Array("a", "b", "c")
0184 1 | Array("a", "b", "b", "c", "a")
0185 ~~~~
0186
Each row in `texts` is a document of type Array[String].
Calling fit on `CountVectorizer` produces a `CountVectorizerModel` with vocabulary (a, b, c).
The output column "vector" after transformation then contains:
0190
0191 ~~~~
0192 id | texts | vector
0193 ----|---------------------------------|---------------
0194 0 | Array("a", "b", "c") | (3,[0,1,2],[1.0,1.0,1.0])
0195 1 | Array("a", "b", "b", "c", "a") | (3,[0,1,2],[2.0,2.0,1.0])
0196 ~~~~
0197
0198 Each vector represents the token counts of the document over the vocabulary.
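
As a minimal Scala sketch of the example above (assuming a `SparkSession` named `spark` is in scope; `setMinDF(2)` is added purely to illustrate the parameter):

~~~~
import org.apache.spark.ml.feature.CountVectorizer

import spark.implicits._

val df = Seq(
  (0, Array("a", "b", "c")),
  (1, Array("a", "b", "b", "c", "a"))
).toDF("id", "texts")

val cvModel = new CountVectorizer()
  .setInputCol("texts")
  .setOutputCol("vector")
  .setVocabSize(3)   // keep at most 3 terms
  .setMinDF(2)       // a term must appear in at least 2 documents
  .fit(df)

// For this toy data the learned vocabulary consists of "a", "b" and "c".
cvModel.transform(df).show(truncate = false)
~~~~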
0199
0200 <div class="codetabs">
0201 <div data-lang="scala" markdown="1">
0202
0203 Refer to the [CountVectorizer Scala docs](api/scala/org/apache/spark/ml/feature/CountVectorizer.html)
0204 and the [CountVectorizerModel Scala docs](api/scala/org/apache/spark/ml/feature/CountVectorizerModel.html)
0205 for more details on the API.
0206
0207 {% include_example scala/org/apache/spark/examples/ml/CountVectorizerExample.scala %}
0208 </div>
0209
0210 <div data-lang="java" markdown="1">
0211
0212 Refer to the [CountVectorizer Java docs](api/java/org/apache/spark/ml/feature/CountVectorizer.html)
0213 and the [CountVectorizerModel Java docs](api/java/org/apache/spark/ml/feature/CountVectorizerModel.html)
0214 for more details on the API.
0215
0216 {% include_example java/org/apache/spark/examples/ml/JavaCountVectorizerExample.java %}
0217 </div>
0218
0219 <div data-lang="python" markdown="1">
0220
0221 Refer to the [CountVectorizer Python docs](api/python/pyspark.ml.html#pyspark.ml.feature.CountVectorizer)
0222 and the [CountVectorizerModel Python docs](api/python/pyspark.ml.html#pyspark.ml.feature.CountVectorizerModel)
0223 for more details on the API.
0224
0225 {% include_example python/ml/count_vectorizer_example.py %}
0226 </div>
0227 </div>
0228
0229 ## FeatureHasher
0230
0231 Feature hashing projects a set of categorical or numerical features into a feature vector of
0232 specified dimension (typically substantially smaller than that of the original feature
0233 space). This is done using the [hashing trick](https://en.wikipedia.org/wiki/Feature_hashing)
0234 to map features to indices in the feature vector.
0235
0236 The `FeatureHasher` transformer operates on multiple columns. Each column may contain either
0237 numeric or categorical features. Behavior and handling of column data types is as follows:
0238
0239 - Numeric columns: For numeric features, the hash value of the column name is used to map the
0240 feature value to its index in the feature vector. By default, numeric features are not treated
0241 as categorical (even when they are integers). To treat them as categorical, specify the relevant
0242 columns using the `categoricalCols` parameter.
0243 - String columns: For categorical features, the hash value of the string "column_name=value"
0244 is used to map to the vector index, with an indicator value of `1.0`. Thus, categorical features
0245 are "one-hot" encoded (similarly to using [OneHotEncoder](ml-features.html#onehotencoder) with
0246 `dropLast=false`).
0247 - Boolean columns: Boolean values are treated in the same way as string columns. That is,
0248 boolean features are represented as "column_name=true" or "column_name=false", with an indicator
0249 value of `1.0`.
0250
0251 Null (missing) values are ignored (implicitly zero in the resulting feature vector).
0252
0253 The hash function used here is also the [MurmurHash 3](https://en.wikipedia.org/wiki/MurmurHash)
0254 used in [HashingTF](ml-features.html#tf-idf). Since a simple modulo on the hashed value is used to
0255 determine the vector index, it is advisable to use a power of two as the numFeatures parameter;
0256 otherwise the features will not be mapped evenly to the vector indices.
0257
0258 **Examples**
0259
0260 Assume that we have a DataFrame with 4 input columns `real`, `bool`, `stringNum`, and `string`.
These columns of different data types illustrate how the transformer produces a
column of feature vectors.
0263
0264 ~~~~
0265 real| bool|stringNum|string
0266 ----|-----|---------|------
0267 2.2| true| 1| foo
0268 3.3|false| 2| bar
0269 4.4|false| 3| baz
0270 5.5|false| 4| foo
0271 ~~~~
0272
0273 Then the output of `FeatureHasher.transform` on this DataFrame is:
0274
0275 ~~~~
0276 real|bool |stringNum|string|features
0277 ----|-----|---------|------|-------------------------------------------------------
0278 2.2 |true |1 |foo |(262144,[51871, 63643,174475,253195],[1.0,1.0,2.2,1.0])
0279 3.3 |false|2 |bar |(262144,[6031, 80619,140467,174475],[1.0,1.0,1.0,3.3])
0280 4.4 |false|3 |baz |(262144,[24279,140467,174475,196810],[1.0,1.0,4.4,1.0])
0281 5.5 |false|4 |foo |(262144,[63643,140467,168512,174475],[1.0,1.0,1.0,5.5])
0282 ~~~~
0283
0284 The resulting feature vectors could then be passed to a learning algorithm.
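
The sketch below (a Scala illustration assuming a `SparkSession` named `spark`; marking `stringNum` as categorical via `categoricalCols` is an illustrative variation, not part of the table above) shows how such a DataFrame could be hashed:

~~~~
import org.apache.spark.ml.feature.FeatureHasher

import spark.implicits._

val df = Seq(
  (2.2, true, 1, "foo"),
  (3.3, false, 2, "bar"),
  (4.4, false, 3, "baz"),
  (5.5, false, 4, "foo")
).toDF("real", "bool", "stringNum", "string")

val hasher = new FeatureHasher()
  .setInputCols("real", "bool", "stringNum", "string")
  .setCategoricalCols(Array("stringNum"))  // treat this numeric column as categorical
  .setOutputCol("features")

hasher.transform(df).show(truncate = false)
~~~~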
0285
0286 <div class="codetabs">
0287 <div data-lang="scala" markdown="1">
0288
0289 Refer to the [FeatureHasher Scala docs](api/scala/org/apache/spark/ml/feature/FeatureHasher.html)
0290 for more details on the API.
0291
0292 {% include_example scala/org/apache/spark/examples/ml/FeatureHasherExample.scala %}
0293 </div>
0294
0295 <div data-lang="java" markdown="1">
0296
0297 Refer to the [FeatureHasher Java docs](api/java/org/apache/spark/ml/feature/FeatureHasher.html)
0298 for more details on the API.
0299
0300 {% include_example java/org/apache/spark/examples/ml/JavaFeatureHasherExample.java %}
0301 </div>
0302
0303 <div data-lang="python" markdown="1">
0304
0305 Refer to the [FeatureHasher Python docs](api/python/pyspark.ml.html#pyspark.ml.feature.FeatureHasher)
0306 for more details on the API.
0307
0308 {% include_example python/ml/feature_hasher_example.py %}
0309 </div>
0310 </div>
0311
0312 # Feature Transformers
0313
0314 ## Tokenizer
0315
0316 [Tokenization](http://en.wikipedia.org/wiki/Lexical_analysis#Tokenization) is the process of taking text (such as a sentence) and breaking it into individual terms (usually words). A simple [Tokenizer](api/scala/org/apache/spark/ml/feature/Tokenizer.html) class provides this functionality. The example below shows how to split sentences into sequences of words.
0317
0318 [RegexTokenizer](api/scala/org/apache/spark/ml/feature/RegexTokenizer.html) allows more
0319 advanced tokenization based on regular expression (regex) matching.
By default, the parameter "pattern" (regex, default: `"\\s+"`) is used as the delimiter to split the input text.
Alternatively, users can set the parameter "gaps" to false, indicating that the regex "pattern" denotes
"tokens" rather than splitting gaps, and find all matching occurrences as the tokenization result.
0323
0324 **Examples**
0325
0326 <div class="codetabs">
0327 <div data-lang="scala" markdown="1">
0328
0329 Refer to the [Tokenizer Scala docs](api/scala/org/apache/spark/ml/feature/Tokenizer.html)
0330 and the [RegexTokenizer Scala docs](api/scala/org/apache/spark/ml/feature/RegexTokenizer.html)
0331 for more details on the API.
0332
0333 {% include_example scala/org/apache/spark/examples/ml/TokenizerExample.scala %}
0334 </div>
0335
0336 <div data-lang="java" markdown="1">
0337
0338 Refer to the [Tokenizer Java docs](api/java/org/apache/spark/ml/feature/Tokenizer.html)
0339 and the [RegexTokenizer Java docs](api/java/org/apache/spark/ml/feature/RegexTokenizer.html)
0340 for more details on the API.
0341
0342 {% include_example java/org/apache/spark/examples/ml/JavaTokenizerExample.java %}
0343 </div>
0344
0345 <div data-lang="python" markdown="1">
0346
0347 Refer to the [Tokenizer Python docs](api/python/pyspark.ml.html#pyspark.ml.feature.Tokenizer) and
0348 the [RegexTokenizer Python docs](api/python/pyspark.ml.html#pyspark.ml.feature.RegexTokenizer)
0349 for more details on the API.
0350
0351 {% include_example python/ml/tokenizer_example.py %}
0352 </div>
0353 </div>
0354
0355 ## StopWordsRemover
0356 [Stop words](https://en.wikipedia.org/wiki/Stop_words) are words which
0357 should be excluded from the input, typically because the words appear
0358 frequently and don't carry as much meaning.
0359
0360 `StopWordsRemover` takes as input a sequence of strings (e.g. the output
0361 of a [Tokenizer](ml-features.html#tokenizer)) and drops all the stop
0362 words from the input sequences. The list of stopwords is specified by
0363 the `stopWords` parameter. Default stop words for some languages are accessible
0364 by calling `StopWordsRemover.loadDefaultStopWords(language)`, for which available
0365 options are "danish", "dutch", "english", "finnish", "french", "german", "hungarian",
0366 "italian", "norwegian", "portuguese", "russian", "spanish", "swedish" and "turkish".
0367 A boolean parameter `caseSensitive` indicates if the matches should be case sensitive
0368 (false by default).
0369
0370 **Examples**
0371
0372 Assume that we have the following DataFrame with columns `id` and `raw`:
0373
0374 ~~~~
0375 id | raw
0376 ----|----------
0377 0 | [I, saw, the, red, balloon]
0378 1 | [Mary, had, a, little, lamb]
0379 ~~~~
0380
0381 Applying `StopWordsRemover` with `raw` as the input column and `filtered` as the output
0382 column, we should get the following:
0383
0384 ~~~~
0385 id | raw | filtered
0386 ----|-----------------------------|--------------------
0387 0 | [I, saw, the, red, balloon] | [saw, red, balloon]
0388 1 | [Mary, had, a, little, lamb]|[Mary, little, lamb]
0389 ~~~~
0390
0391 In `filtered`, the stop words "I", "the", "had", and "a" have been
0392 filtered out.
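
A minimal Scala sketch of this (assuming a `SparkSession` named `spark`; loading the default English stop words explicitly is shown only to illustrate `loadDefaultStopWords`):

~~~~
import org.apache.spark.ml.feature.StopWordsRemover

import spark.implicits._

val df = Seq(
  (0, Seq("I", "saw", "the", "red", "balloon")),
  (1, Seq("Mary", "had", "a", "little", "lamb"))
).toDF("id", "raw")

val remover = new StopWordsRemover()
  .setInputCol("raw")
  .setOutputCol("filtered")
  .setStopWords(StopWordsRemover.loadDefaultStopWords("english"))
  .setCaseSensitive(false)

remover.transform(df).show(truncate = false)
~~~~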
0393
0394 <div class="codetabs">
0395
0396 <div data-lang="scala" markdown="1">
0397
0398 Refer to the [StopWordsRemover Scala docs](api/scala/org/apache/spark/ml/feature/StopWordsRemover.html)
0399 for more details on the API.
0400
0401 {% include_example scala/org/apache/spark/examples/ml/StopWordsRemoverExample.scala %}
0402 </div>
0403
0404 <div data-lang="java" markdown="1">
0405
0406 Refer to the [StopWordsRemover Java docs](api/java/org/apache/spark/ml/feature/StopWordsRemover.html)
0407 for more details on the API.
0408
0409 {% include_example java/org/apache/spark/examples/ml/JavaStopWordsRemoverExample.java %}
0410 </div>
0411
0412 <div data-lang="python" markdown="1">
0413
0414 Refer to the [StopWordsRemover Python docs](api/python/pyspark.ml.html#pyspark.ml.feature.StopWordsRemover)
0415 for more details on the API.
0416
0417 {% include_example python/ml/stopwords_remover_example.py %}
0418 </div>
0419 </div>
0420
0421 ## $n$-gram
0422
0423 An [n-gram](https://en.wikipedia.org/wiki/N-gram) is a sequence of $n$ tokens (typically words) for some integer $n$. The `NGram` class can be used to transform input features into $n$-grams.
0424
0425 `NGram` takes as input a sequence of strings (e.g. the output of a [Tokenizer](ml-features.html#tokenizer)). The parameter `n` is used to determine the number of terms in each $n$-gram. The output will consist of a sequence of $n$-grams where each $n$-gram is represented by a space-delimited string of $n$ consecutive words. If the input sequence contains fewer than `n` strings, no output is produced.
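
Below is a short Scala sketch (assuming a `SparkSession` named `spark`; toy data) that produces bigrams from tokenized input:

~~~~
import org.apache.spark.ml.feature.NGram

import spark.implicits._

val df = Seq(
  (0, Seq("I", "saw", "the", "red", "balloon"))
).toDF("id", "words")

val ngram = new NGram()
  .setN(2)                // bigrams
  .setInputCol("words")
  .setOutputCol("ngrams")

// Each output element is a space-delimited string of 2 consecutive words,
// e.g. "I saw", "saw the", ...
ngram.transform(df).select("ngrams").show(truncate = false)
~~~~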
0426
0427 **Examples**
0428
0429 <div class="codetabs">
0430
0431 <div data-lang="scala" markdown="1">
0432
0433 Refer to the [NGram Scala docs](api/scala/org/apache/spark/ml/feature/NGram.html)
0434 for more details on the API.
0435
0436 {% include_example scala/org/apache/spark/examples/ml/NGramExample.scala %}
0437 </div>
0438
0439 <div data-lang="java" markdown="1">
0440
0441 Refer to the [NGram Java docs](api/java/org/apache/spark/ml/feature/NGram.html)
0442 for more details on the API.
0443
0444 {% include_example java/org/apache/spark/examples/ml/JavaNGramExample.java %}
0445 </div>
0446
0447 <div data-lang="python" markdown="1">
0448
0449 Refer to the [NGram Python docs](api/python/pyspark.ml.html#pyspark.ml.feature.NGram)
0450 for more details on the API.
0451
0452 {% include_example python/ml/n_gram_example.py %}
0453 </div>
0454 </div>
0455
0456
0457 ## Binarizer
0458
0459 Binarization is the process of thresholding numerical features to binary (0/1) features.
0460
0461 `Binarizer` takes the common parameters `inputCol` and `outputCol`, as well as the `threshold`
0462 for binarization. Feature values greater than the threshold are binarized to 1.0; values equal
0463 to or less than the threshold are binarized to 0.0. Both Vector and Double types are supported
0464 for `inputCol`.
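
As a minimal Scala sketch (assuming a `SparkSession` named `spark`; toy data), thresholding a Double column at 0.5:

~~~~
import org.apache.spark.ml.feature.Binarizer

import spark.implicits._

val df = Seq((0, 0.1), (1, 0.8), (2, 0.5)).toDF("id", "feature")

val binarizer = new Binarizer()
  .setInputCol("feature")
  .setOutputCol("binarized_feature")
  .setThreshold(0.5)  // values > 0.5 become 1.0; values <= 0.5 become 0.0

binarizer.transform(df).show()
~~~~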
0465
0466 **Examples**
0467
0468 <div class="codetabs">
0469 <div data-lang="scala" markdown="1">
0470
0471 Refer to the [Binarizer Scala docs](api/scala/org/apache/spark/ml/feature/Binarizer.html)
0472 for more details on the API.
0473
0474 {% include_example scala/org/apache/spark/examples/ml/BinarizerExample.scala %}
0475 </div>
0476
0477 <div data-lang="java" markdown="1">
0478
0479 Refer to the [Binarizer Java docs](api/java/org/apache/spark/ml/feature/Binarizer.html)
0480 for more details on the API.
0481
0482 {% include_example java/org/apache/spark/examples/ml/JavaBinarizerExample.java %}
0483 </div>
0484
0485 <div data-lang="python" markdown="1">
0486
0487 Refer to the [Binarizer Python docs](api/python/pyspark.ml.html#pyspark.ml.feature.Binarizer)
0488 for more details on the API.
0489
0490 {% include_example python/ml/binarizer_example.py %}
0491 </div>
0492 </div>
0493
0494 ## PCA
0495
0496 [PCA](http://en.wikipedia.org/wiki/Principal_component_analysis) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. A [PCA](api/scala/org/apache/spark/ml/feature/PCA.html) class trains a model to project vectors to a low-dimensional space using PCA. The example below shows how to project 5-dimensional feature vectors into 3-dimensional principal components.
0497
0498 **Examples**
0499
0500 <div class="codetabs">
0501 <div data-lang="scala" markdown="1">
0502
0503 Refer to the [PCA Scala docs](api/scala/org/apache/spark/ml/feature/PCA.html)
0504 for more details on the API.
0505
0506 {% include_example scala/org/apache/spark/examples/ml/PCAExample.scala %}
0507 </div>
0508
0509 <div data-lang="java" markdown="1">
0510
0511 Refer to the [PCA Java docs](api/java/org/apache/spark/ml/feature/PCA.html)
0512 for more details on the API.
0513
0514 {% include_example java/org/apache/spark/examples/ml/JavaPCAExample.java %}
0515 </div>
0516
0517 <div data-lang="python" markdown="1">
0518
0519 Refer to the [PCA Python docs](api/python/pyspark.ml.html#pyspark.ml.feature.PCA)
0520 for more details on the API.
0521
0522 {% include_example python/ml/pca_example.py %}
0523 </div>
0524 </div>
0525
0526 ## PolynomialExpansion
0527
0528 [Polynomial expansion](http://en.wikipedia.org/wiki/Polynomial_expansion) is the process of expanding your features into a polynomial space, which is formulated by an n-degree combination of original dimensions. A [PolynomialExpansion](api/scala/org/apache/spark/ml/feature/PolynomialExpansion.html) class provides this functionality. The example below shows how to expand your features into a 3-degree polynomial space.
0529
0530 **Examples**
0531
0532 <div class="codetabs">
0533 <div data-lang="scala" markdown="1">
0534
0535 Refer to the [PolynomialExpansion Scala docs](api/scala/org/apache/spark/ml/feature/PolynomialExpansion.html)
0536 for more details on the API.
0537
0538 {% include_example scala/org/apache/spark/examples/ml/PolynomialExpansionExample.scala %}
0539 </div>
0540
0541 <div data-lang="java" markdown="1">
0542
0543 Refer to the [PolynomialExpansion Java docs](api/java/org/apache/spark/ml/feature/PolynomialExpansion.html)
0544 for more details on the API.
0545
0546 {% include_example java/org/apache/spark/examples/ml/JavaPolynomialExpansionExample.java %}
0547 </div>
0548
0549 <div data-lang="python" markdown="1">
0550
0551 Refer to the [PolynomialExpansion Python docs](api/python/pyspark.ml.html#pyspark.ml.feature.PolynomialExpansion)
0552 for more details on the API.
0553
0554 {% include_example python/ml/polynomial_expansion_example.py %}
0555 </div>
0556 </div>
0557
0558 ## Discrete Cosine Transform (DCT)
0559
0560 The [Discrete Cosine
0561 Transform](https://en.wikipedia.org/wiki/Discrete_cosine_transform)
0562 transforms a length $N$ real-valued sequence in the time domain into
0563 another length $N$ real-valued sequence in the frequency domain. A
0564 [DCT](api/scala/org/apache/spark/ml/feature/DCT.html) class
0565 provides this functionality, implementing the
0566 [DCT-II](https://en.wikipedia.org/wiki/Discrete_cosine_transform#DCT-II)
0567 and scaling the result by $1/\sqrt{2}$ such that the representing matrix
0568 for the transform is unitary. No shift is applied to the transformed
0569 sequence (e.g. the $0$th element of the transformed sequence is the
0570 $0$th DCT coefficient and _not_ the $N/2$th).
0571
0572 **Examples**
0573
0574 <div class="codetabs">
0575 <div data-lang="scala" markdown="1">
0576
0577 Refer to the [DCT Scala docs](api/scala/org/apache/spark/ml/feature/DCT.html)
0578 for more details on the API.
0579
0580 {% include_example scala/org/apache/spark/examples/ml/DCTExample.scala %}
0581 </div>
0582
0583 <div data-lang="java" markdown="1">
0584
0585 Refer to the [DCT Java docs](api/java/org/apache/spark/ml/feature/DCT.html)
0586 for more details on the API.
0587
0588 {% include_example java/org/apache/spark/examples/ml/JavaDCTExample.java %}
0589 </div>
0590
0591 <div data-lang="python" markdown="1">
0592
0593 Refer to the [DCT Python docs](api/python/pyspark.ml.html#pyspark.ml.feature.DCT)
0594 for more details on the API.
0595
0596 {% include_example python/ml/dct_example.py %}
0597 </div>
0598 </div>
0599
0600 ## StringIndexer
0601
0602 `StringIndexer` encodes a string column of labels to a column of label indices.
0603 `StringIndexer` can encode multiple columns. The indices are in `[0, numLabels)`, and four ordering options are supported:
0604 "frequencyDesc": descending order by label frequency (most frequent label assigned 0),
0605 "frequencyAsc": ascending order by label frequency (least frequent label assigned 0),
0606 "alphabetDesc": descending alphabetical order, and "alphabetAsc": ascending alphabetical order
0607 (default = "frequencyDesc"). Note that in case of equal frequency when under
0608 "frequencyDesc"/"frequencyAsc", the strings are further sorted by alphabet.
0609
Unseen labels will be put at index `numLabels` if the user chooses to keep them.
0611 If the input column is numeric, we cast it to string and index the string
0612 values. When downstream pipeline components such as `Estimator` or
0613 `Transformer` make use of this string-indexed label, you must set the input
0614 column of the component to this string-indexed column name. In many cases,
0615 you can set the input column with `setInputCol`.
0616
0617 **Examples**
0618
0619 Assume that we have the following DataFrame with columns `id` and `category`:
0620
0621 ~~~~
0622 id | category
0623 ----|----------
0624 0 | a
0625 1 | b
0626 2 | c
0627 3 | a
0628 4 | a
0629 5 | c
0630 ~~~~
0631
0632 `category` is a string column with three labels: "a", "b", and "c".
0633 Applying `StringIndexer` with `category` as the input column and `categoryIndex` as the output
0634 column, we should get the following:
0635
0636 ~~~~
0637 id | category | categoryIndex
0638 ----|----------|---------------
0639 0 | a | 0.0
0640 1 | b | 2.0
0641 2 | c | 1.0
0642 3 | a | 0.0
0643 4 | a | 0.0
0644 5 | c | 1.0
0645 ~~~~
0646
0647 "a" gets index `0` because it is the most frequent, followed by "c" with index `1` and "b" with
0648 index `2`.
0649
0650 Additionally, there are three strategies regarding how `StringIndexer` will handle
0651 unseen labels when you have fit a `StringIndexer` on one dataset and then use it
0652 to transform another:
0653
0654 - throw an exception (which is the default)
0655 - skip the row containing the unseen label entirely
0656 - put unseen labels in a special additional bucket, at index numLabels
0657
0658 **Examples**
0659
0660 Let's go back to our previous example but this time reuse our previously defined
0661 `StringIndexer` on the following dataset:
0662
0663 ~~~~
0664 id | category
0665 ----|----------
0666 0 | a
0667 1 | b
0668 2 | c
0669 3 | d
0670 4 | e
0671 ~~~~
0672
If you have not set how `StringIndexer` handles unseen labels, or have set it to
"error", an exception will be thrown.
However, if you call `setHandleInvalid("skip")`, the following dataset
will be generated:
0677
0678 ~~~~
0679 id | category | categoryIndex
0680 ----|----------|---------------
0681 0 | a | 0.0
0682 1 | b | 2.0
0683 2 | c | 1.0
0684 ~~~~
0685
0686 Notice that the rows containing "d" or "e" do not appear.
0687
0688 If you call `setHandleInvalid("keep")`, the following dataset
0689 will be generated:
0690
0691 ~~~~
0692 id | category | categoryIndex
0693 ----|----------|---------------
0694 0 | a | 0.0
0695 1 | b | 2.0
0696 2 | c | 1.0
0697 3 | d | 3.0
0698 4 | e | 3.0
0699 ~~~~
0700
Notice that the rows containing "d" or "e" are mapped to index "3.0".
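
A minimal Scala sketch of fitting on the first dataset and keeping unseen labels when transforming the second (assuming a `SparkSession` named `spark`):

~~~~
import org.apache.spark.ml.feature.StringIndexer

import spark.implicits._

val train = Seq((0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c"))
  .toDF("id", "category")
val test = Seq((0, "a"), (1, "b"), (2, "c"), (3, "d"), (4, "e"))
  .toDF("id", "category")

val model = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")
  .setHandleInvalid("keep")  // unseen labels go to the extra index numLabels (3.0 here)
  .fit(train)

model.transform(test).show()
~~~~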
0702
0703 <div class="codetabs">
0704
0705 <div data-lang="scala" markdown="1">
0706
0707 Refer to the [StringIndexer Scala docs](api/scala/org/apache/spark/ml/feature/StringIndexer.html)
0708 for more details on the API.
0709
0710 {% include_example scala/org/apache/spark/examples/ml/StringIndexerExample.scala %}
0711 </div>
0712
0713 <div data-lang="java" markdown="1">
0714
0715 Refer to the [StringIndexer Java docs](api/java/org/apache/spark/ml/feature/StringIndexer.html)
0716 for more details on the API.
0717
0718 {% include_example java/org/apache/spark/examples/ml/JavaStringIndexerExample.java %}
0719 </div>
0720
0721 <div data-lang="python" markdown="1">
0722
0723 Refer to the [StringIndexer Python docs](api/python/pyspark.ml.html#pyspark.ml.feature.StringIndexer)
0724 for more details on the API.
0725
0726 {% include_example python/ml/string_indexer_example.py %}
0727 </div>
0728 </div>
0729
0730
0731 ## IndexToString
0732
0733 Symmetrically to `StringIndexer`, `IndexToString` maps a column of label indices
0734 back to a column containing the original labels as strings. A common use case
0735 is to produce indices from labels with `StringIndexer`, train a model with those
0736 indices and retrieve the original labels from the column of predicted indices
0737 with `IndexToString`. However, you are free to supply your own labels.
0738
0739 **Examples**
0740
0741 Building on the `StringIndexer` example, let's assume we have the following
0742 DataFrame with columns `id` and `categoryIndex`:
0743
0744 ~~~~
0745 id | categoryIndex
0746 ----|---------------
0747 0 | 0.0
0748 1 | 2.0
0749 2 | 1.0
0750 3 | 0.0
0751 4 | 0.0
0752 5 | 1.0
0753 ~~~~
0754
Applying `IndexToString` with `categoryIndex` as the input column and
`originalCategory` as the output column, we are able to retrieve our original
labels (they will be inferred from the column's metadata):
0758
0759 ~~~~
0760 id | categoryIndex | originalCategory
0761 ----|---------------|-----------------
0762 0 | 0.0 | a
0763 1 | 2.0 | b
0764 2 | 1.0 | c
0765 3 | 0.0 | a
0766 4 | 0.0 | a
0767 5 | 1.0 | c
0768 ~~~~
0769
0770 <div class="codetabs">
0771 <div data-lang="scala" markdown="1">
0772
0773 Refer to the [IndexToString Scala docs](api/scala/org/apache/spark/ml/feature/IndexToString.html)
0774 for more details on the API.
0775
0776 {% include_example scala/org/apache/spark/examples/ml/IndexToStringExample.scala %}
0777
0778 </div>
0779
0780 <div data-lang="java" markdown="1">
0781
0782 Refer to the [IndexToString Java docs](api/java/org/apache/spark/ml/feature/IndexToString.html)
0783 for more details on the API.
0784
0785 {% include_example java/org/apache/spark/examples/ml/JavaIndexToStringExample.java %}
0786
0787 </div>
0788
0789 <div data-lang="python" markdown="1">
0790
0791 Refer to the [IndexToString Python docs](api/python/pyspark.ml.html#pyspark.ml.feature.IndexToString)
0792 for more details on the API.
0793
0794 {% include_example python/ml/index_to_string_example.py %}
0795
0796 </div>
0797 </div>
0798
0799 ## OneHotEncoder
0800
0801 [One-hot encoding](http://en.wikipedia.org/wiki/One-hot) maps a categorical feature, represented as a label index, to a binary vector with at most a single one-value indicating the presence of a specific feature value from among the set of all feature values. This encoding allows algorithms which expect continuous features, such as Logistic Regression, to use categorical features. For string type input data, it is common to encode categorical features using [StringIndexer](ml-features.html#stringindexer) first.
0802
`OneHotEncoder` can transform multiple columns, returning a one-hot-encoded output vector column for each input column. It is common to merge these vectors into a single feature vector using [VectorAssembler](ml-features.html#vectorassembler).
0804
`OneHotEncoder` supports the `handleInvalid` parameter to choose how to handle invalid input when transforming data. Available options include 'keep' (any invalid inputs are assigned to an extra categorical index) and 'error' (throw an error).
0806
0807 **Examples**
0808
0809 <div class="codetabs">
0810 <div data-lang="scala" markdown="1">
0811
0812 Refer to the [OneHotEncoder Scala docs](api/scala/org/apache/spark/ml/feature/OneHotEncoder.html) for more details on the API.
0813
0814 {% include_example scala/org/apache/spark/examples/ml/OneHotEncoderExample.scala %}
0815 </div>
0816
0817 <div data-lang="java" markdown="1">
0818
0819 Refer to the [OneHotEncoder Java docs](api/java/org/apache/spark/ml/feature/OneHotEncoder.html)
0820 for more details on the API.
0821
0822 {% include_example java/org/apache/spark/examples/ml/JavaOneHotEncoderExample.java %}
0823 </div>
0824
0825 <div data-lang="python" markdown="1">
0826
0827 Refer to the [OneHotEncoder Python docs](api/python/pyspark.ml.html#pyspark.ml.feature.OneHotEncoder) for more details on the API.
0828
0829 {% include_example python/ml/onehot_encoder_example.py %}
0830 </div>
0831 </div>
0832
0833 ## VectorIndexer
0834
0835 `VectorIndexer` helps index categorical features in datasets of `Vector`s.
0836 It can both automatically decide which features are categorical and convert original values to category indices. Specifically, it does the following:
0837
0838 1. Take an input column of type [Vector](api/scala/org/apache/spark/ml/linalg/Vector.html) and a parameter `maxCategories`.
2. Decide which features should be categorical based on the number of distinct values, where features with at most `maxCategories` distinct values are declared categorical.
0840 3. Compute 0-based category indices for each categorical feature.
0841 4. Index categorical features and transform original feature values to indices.
0842
0843 Indexing categorical features allows algorithms such as Decision Trees and Tree Ensembles to treat categorical features appropriately, improving performance.
0844
0845 **Examples**
0846
0847 In the example below, we read in a dataset of labeled points and then use `VectorIndexer` to decide which features should be treated as categorical. We transform the categorical feature values to their indices. This transformed data could then be passed to algorithms such as `DecisionTreeRegressor` that handle categorical features.
0848
0849 <div class="codetabs">
0850 <div data-lang="scala" markdown="1">
0851
0852 Refer to the [VectorIndexer Scala docs](api/scala/org/apache/spark/ml/feature/VectorIndexer.html)
0853 for more details on the API.
0854
0855 {% include_example scala/org/apache/spark/examples/ml/VectorIndexerExample.scala %}
0856 </div>
0857
0858 <div data-lang="java" markdown="1">
0859
0860 Refer to the [VectorIndexer Java docs](api/java/org/apache/spark/ml/feature/VectorIndexer.html)
0861 for more details on the API.
0862
0863 {% include_example java/org/apache/spark/examples/ml/JavaVectorIndexerExample.java %}
0864 </div>
0865
0866 <div data-lang="python" markdown="1">
0867
0868 Refer to the [VectorIndexer Python docs](api/python/pyspark.ml.html#pyspark.ml.feature.VectorIndexer)
0869 for more details on the API.
0870
0871 {% include_example python/ml/vector_indexer_example.py %}
0872 </div>
0873 </div>
0874
0875 ## Interaction
0876
0877 `Interaction` is a `Transformer` which takes vector or double-valued columns, and generates a single vector column that contains the product of all combinations of one value from each input column.
0878
0879 For example, if you have 2 vector type columns each of which has 3 dimensions as input columns, then you'll get a 9-dimensional vector as the output column.
0880
0881 **Examples**
0882
0883 Assume that we have the following DataFrame with the columns "id1", "vec1", and "vec2":
0884
0885 ~~~~
0886 id1|vec1 |vec2
0887 ---|--------------|--------------
0888 1 |[1.0,2.0,3.0] |[8.0,4.0,5.0]
0889 2 |[4.0,3.0,8.0] |[7.0,9.0,8.0]
0890 3 |[6.0,1.0,9.0] |[2.0,3.0,6.0]
0891 4 |[10.0,8.0,6.0]|[9.0,4.0,5.0]
0892 5 |[9.0,2.0,7.0] |[10.0,7.0,3.0]
0893 6 |[1.0,1.0,4.0] |[2.0,8.0,4.0]
0894 ~~~~
0895
Applying `Interaction` with those input columns,
the output column `interactedCol` contains:
0898
0899 ~~~~
0900 id1|vec1 |vec2 |interactedCol
0901 ---|--------------|--------------|------------------------------------------------------
0902 1 |[1.0,2.0,3.0] |[8.0,4.0,5.0] |[8.0,4.0,5.0,16.0,8.0,10.0,24.0,12.0,15.0]
0903 2 |[4.0,3.0,8.0] |[7.0,9.0,8.0] |[56.0,72.0,64.0,42.0,54.0,48.0,112.0,144.0,128.0]
0904 3 |[6.0,1.0,9.0] |[2.0,3.0,6.0] |[36.0,54.0,108.0,6.0,9.0,18.0,54.0,81.0,162.0]
0905 4 |[10.0,8.0,6.0]|[9.0,4.0,5.0] |[360.0,160.0,200.0,288.0,128.0,160.0,216.0,96.0,120.0]
0906 5 |[9.0,2.0,7.0] |[10.0,7.0,3.0]|[450.0,315.0,135.0,100.0,70.0,30.0,350.0,245.0,105.0]
0907 6 |[1.0,1.0,4.0] |[2.0,8.0,4.0] |[12.0,48.0,24.0,12.0,48.0,24.0,48.0,192.0,96.0]
0908 ~~~~
0909
0910 <div class="codetabs">
0911 <div data-lang="scala" markdown="1">
0912
0913 Refer to the [Interaction Scala docs](api/scala/org/apache/spark/ml/feature/Interaction.html)
0914 for more details on the API.
0915
0916 {% include_example scala/org/apache/spark/examples/ml/InteractionExample.scala %}
0917 </div>
0918
0919 <div data-lang="java" markdown="1">
0920
0921 Refer to the [Interaction Java docs](api/java/org/apache/spark/ml/feature/Interaction.html)
0922 for more details on the API.
0923
0924 {% include_example java/org/apache/spark/examples/ml/JavaInteractionExample.java %}
0925 </div>
0926
0927 <div data-lang="python" markdown="1">
0928
0929 Refer to the [Interaction Python docs](api/python/pyspark.ml.html#pyspark.ml.feature.Interaction)
0930 for more details on the API.
0931
0932 {% include_example python/ml/interaction_example.py %}
0933 </div>
0934 </div>
0935
0936 ## Normalizer
0937
0938 `Normalizer` is a `Transformer` which transforms a dataset of `Vector` rows, normalizing each `Vector` to have unit norm. It takes parameter `p`, which specifies the [p-norm](http://en.wikipedia.org/wiki/Norm_%28mathematics%29#p-norm) used for normalization. ($p = 2$ by default.) This normalization can help standardize your input data and improve the behavior of learning algorithms.
0939
0940 **Examples**
0941
0942 The following example demonstrates how to load a dataset in libsvm format and then normalize each row to have unit $L^1$ norm and unit $L^\infty$ norm.
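
A rough Scala sketch of the same idea on toy in-memory data (assuming a `SparkSession` named `spark`), normalizing to unit $L^1$ norm and then, via a transform-time parameter override, unit $L^\infty$ norm:

~~~~
import org.apache.spark.ml.feature.Normalizer
import org.apache.spark.ml.linalg.Vectors

val df = spark.createDataFrame(Seq(
  (0, Vectors.dense(1.0, 0.5, -1.0)),
  (1, Vectors.dense(2.0, 1.0, 1.0))
)).toDF("id", "features")

val normalizer = new Normalizer()
  .setInputCol("features")
  .setOutputCol("normFeatures")
  .setP(1.0)  // L^1 norm

normalizer.transform(df).show(truncate = false)

// Normalize to unit L^inf norm by overriding p at transform time.
normalizer.transform(df, normalizer.p -> Double.PositiveInfinity).show(truncate = false)
~~~~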
0943
0944 <div class="codetabs">
0945 <div data-lang="scala" markdown="1">
0946
0947 Refer to the [Normalizer Scala docs](api/scala/org/apache/spark/ml/feature/Normalizer.html)
0948 for more details on the API.
0949
0950 {% include_example scala/org/apache/spark/examples/ml/NormalizerExample.scala %}
0951 </div>
0952
0953 <div data-lang="java" markdown="1">
0954
0955 Refer to the [Normalizer Java docs](api/java/org/apache/spark/ml/feature/Normalizer.html)
0956 for more details on the API.
0957
0958 {% include_example java/org/apache/spark/examples/ml/JavaNormalizerExample.java %}
0959 </div>
0960
0961 <div data-lang="python" markdown="1">
0962
0963 Refer to the [Normalizer Python docs](api/python/pyspark.ml.html#pyspark.ml.feature.Normalizer)
0964 for more details on the API.
0965
0966 {% include_example python/ml/normalizer_example.py %}
0967 </div>
0968 </div>
0969
0970
0971 ## StandardScaler
0972
0973 `StandardScaler` transforms a dataset of `Vector` rows, normalizing each feature to have unit standard deviation and/or zero mean. It takes parameters:
0974
0975 * `withStd`: True by default. Scales the data to unit standard deviation.
0976 * `withMean`: False by default. Centers the data with mean before scaling. It will build a dense output, so take care when applying to sparse input.
0977
0978 `StandardScaler` is an `Estimator` which can be `fit` on a dataset to produce a `StandardScalerModel`; this amounts to computing summary statistics. The model can then transform a `Vector` column in a dataset to have unit standard deviation and/or zero mean features.
0979
Note that if the standard deviation of a feature is zero, it will return the default `0.0` value in the `Vector` for that feature.
0981
0982 **Examples**
0983
0984 The following example demonstrates how to load a dataset in libsvm format and then normalize each feature to have unit standard deviation.
0985
0986 <div class="codetabs">
0987 <div data-lang="scala" markdown="1">
0988
0989 Refer to the [StandardScaler Scala docs](api/scala/org/apache/spark/ml/feature/StandardScaler.html)
0990 for more details on the API.
0991
0992 {% include_example scala/org/apache/spark/examples/ml/StandardScalerExample.scala %}
0993 </div>
0994
0995 <div data-lang="java" markdown="1">
0996
0997 Refer to the [StandardScaler Java docs](api/java/org/apache/spark/ml/feature/StandardScaler.html)
0998 for more details on the API.
0999
1000 {% include_example java/org/apache/spark/examples/ml/JavaStandardScalerExample.java %}
1001 </div>
1002
1003 <div data-lang="python" markdown="1">
1004
1005 Refer to the [StandardScaler Python docs](api/python/pyspark.ml.html#pyspark.ml.feature.StandardScaler)
1006 for more details on the API.
1007
1008 {% include_example python/ml/standard_scaler_example.py %}
1009 </div>
1010 </div>
1011
1012
1013 ## RobustScaler
1014
`RobustScaler` transforms a dataset of `Vector` rows, removing the median and scaling the data according to a specific quantile range (by default the IQR: the interquartile range between the 1st quartile and the 3rd quartile). Its behavior is quite similar to `StandardScaler`, however the median and the quantile range are used instead of the mean and standard deviation, which makes it robust to outliers. It takes parameters:
1016
1017 * `lower`: 0.25 by default. Lower quantile to calculate quantile range, shared by all features.
1018 * `upper`: 0.75 by default. Upper quantile to calculate quantile range, shared by all features.
1019 * `withScaling`: True by default. Scales the data to quantile range.
1020 * `withCentering`: False by default. Centers the data with median before scaling. It will build a dense output, so take care when applying to sparse input.
1021
1022 `RobustScaler` is an `Estimator` which can be `fit` on a dataset to produce a `RobustScalerModel`; this amounts to computing quantile statistics. The model can then transform a `Vector` column in a dataset to have unit quantile range and/or zero median features.
1023
Note that if the quantile range of a feature is zero, it will return the default `0.0` value in the `Vector` for that feature.
1025
1026 **Examples**
1027
1028 The following example demonstrates how to load a dataset in libsvm format and then normalize each feature to have unit quantile range.
1029
1030 <div class="codetabs">
1031 <div data-lang="scala" markdown="1">
1032
1033 Refer to the [RobustScaler Scala docs](api/scala/org/apache/spark/ml/feature/RobustScaler.html)
1034 for more details on the API.
1035
1036 {% include_example scala/org/apache/spark/examples/ml/RobustScalerExample.scala %}
1037 </div>
1038
1039 <div data-lang="java" markdown="1">
1040
1041 Refer to the [RobustScaler Java docs](api/java/org/apache/spark/ml/feature/RobustScaler.html)
1042 for more details on the API.
1043
1044 {% include_example java/org/apache/spark/examples/ml/JavaRobustScalerExample.java %}
1045 </div>
1046
1047 <div data-lang="python" markdown="1">
1048
1049 Refer to the [RobustScaler Python docs](api/python/pyspark.ml.html#pyspark.ml.feature.RobustScaler)
1050 for more details on the API.
1051
1052 {% include_example python/ml/robust_scaler_example.py %}
1053 </div>
1054 </div>
1055
1056
1057 ## MinMaxScaler
1058
1059 `MinMaxScaler` transforms a dataset of `Vector` rows, rescaling each feature to a specific range (often [0, 1]). It takes parameters:
1060
1061 * `min`: 0.0 by default. Lower bound after transformation, shared by all features.
1062 * `max`: 1.0 by default. Upper bound after transformation, shared by all features.
1063
1064 `MinMaxScaler` computes summary statistics on a data set and produces a `MinMaxScalerModel`. The model can then transform each feature individually such that it is in the given range.
1065
1066 The rescaled value for a feature E is calculated as,
1067 `\begin{equation}
1068 Rescaled(e_i) = \frac{e_i - E_{min}}{E_{max} - E_{min}} * (max - min) + min
1069 \end{equation}`
For the case `$E_{max} == E_{min}$`, `$Rescaled(e_i) = 0.5 * (max + min)$`.
1071
Note that since zero values will probably be transformed to non-zero values, the output of the transformer will be a `DenseVector` even for sparse input.
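
A minimal Scala sketch on toy data (assuming a `SparkSession` named `spark`; `setMin`/`setMax` are shown with their default values purely to illustrate the parameters):

~~~~
import org.apache.spark.ml.feature.MinMaxScaler
import org.apache.spark.ml.linalg.Vectors

val df = spark.createDataFrame(Seq(
  (0, Vectors.dense(1.0, 0.1, -1.0)),
  (1, Vectors.dense(2.0, 1.1, 1.0)),
  (2, Vectors.dense(3.0, 10.1, 3.0))
)).toDF("id", "features")

val scaler = new MinMaxScaler()
  .setInputCol("features")
  .setOutputCol("scaledFeatures")
  .setMin(0.0)
  .setMax(1.0)

// The fitted model stores the per-feature minimum and maximum seen during fitting.
val scalerModel = scaler.fit(df)
scalerModel.transform(df).show(truncate = false)
~~~~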
1073
1074 **Examples**
1075
1076 The following example demonstrates how to load a dataset in libsvm format and then rescale each feature to [0, 1].
1077
1078 <div class="codetabs">
1079 <div data-lang="scala" markdown="1">
1080
1081 Refer to the [MinMaxScaler Scala docs](api/scala/org/apache/spark/ml/feature/MinMaxScaler.html)
1082 and the [MinMaxScalerModel Scala docs](api/scala/org/apache/spark/ml/feature/MinMaxScalerModel.html)
1083 for more details on the API.
1084
1085 {% include_example scala/org/apache/spark/examples/ml/MinMaxScalerExample.scala %}
1086 </div>
1087
1088 <div data-lang="java" markdown="1">
1089
1090 Refer to the [MinMaxScaler Java docs](api/java/org/apache/spark/ml/feature/MinMaxScaler.html)
1091 and the [MinMaxScalerModel Java docs](api/java/org/apache/spark/ml/feature/MinMaxScalerModel.html)
1092 for more details on the API.
1093
1094 {% include_example java/org/apache/spark/examples/ml/JavaMinMaxScalerExample.java %}
1095 </div>
1096
1097 <div data-lang="python" markdown="1">
1098
1099 Refer to the [MinMaxScaler Python docs](api/python/pyspark.ml.html#pyspark.ml.feature.MinMaxScaler)
1100 and the [MinMaxScalerModel Python docs](api/python/pyspark.ml.html#pyspark.ml.feature.MinMaxScalerModel)
1101 for more details on the API.
1102
1103 {% include_example python/ml/min_max_scaler_example.py %}
1104 </div>
1105 </div>
1106
1107
1108 ## MaxAbsScaler
1109
1110 `MaxAbsScaler` transforms a dataset of `Vector` rows, rescaling each feature to range [-1, 1]
1111 by dividing through the maximum absolute value in each feature. It does not shift/center the
1112 data, and thus does not destroy any sparsity.
1113
1114 `MaxAbsScaler` computes summary statistics on a data set and produces a `MaxAbsScalerModel`. The
1115 model can then transform each feature individually to range [-1, 1].
1116
1117 **Examples**
1118
1119 The following example demonstrates how to load a dataset in libsvm format and then rescale each feature to [-1, 1].
1120
1121 <div class="codetabs">
1122 <div data-lang="scala" markdown="1">
1123
1124 Refer to the [MaxAbsScaler Scala docs](api/scala/org/apache/spark/ml/feature/MaxAbsScaler.html)
1125 and the [MaxAbsScalerModel Scala docs](api/scala/org/apache/spark/ml/feature/MaxAbsScalerModel.html)
1126 for more details on the API.
1127
1128 {% include_example scala/org/apache/spark/examples/ml/MaxAbsScalerExample.scala %}
1129 </div>
1130
1131 <div data-lang="java" markdown="1">
1132
1133 Refer to the [MaxAbsScaler Java docs](api/java/org/apache/spark/ml/feature/MaxAbsScaler.html)
1134 and the [MaxAbsScalerModel Java docs](api/java/org/apache/spark/ml/feature/MaxAbsScalerModel.html)
1135 for more details on the API.
1136
1137 {% include_example java/org/apache/spark/examples/ml/JavaMaxAbsScalerExample.java %}
1138 </div>
1139
1140 <div data-lang="python" markdown="1">
1141
1142 Refer to the [MaxAbsScaler Python docs](api/python/pyspark.ml.html#pyspark.ml.feature.MaxAbsScaler)
1143 and the [MaxAbsScalerModel Python docs](api/python/pyspark.ml.html#pyspark.ml.feature.MaxAbsScalerModel)
1144 for more details on the API.
1145
1146 {% include_example python/ml/max_abs_scaler_example.py %}
1147 </div>
1148 </div>
1149
1150 ## Bucketizer
1151
1152 `Bucketizer` transforms a column of continuous features to a column of feature buckets, where the buckets are specified by users. It takes a parameter:
1153
* `splits`: Parameter for mapping continuous features into buckets. With n+1 splits, there are n buckets. A bucket defined by splits x,y holds values in the range [x,y) except the last bucket, which also includes y. Splits should be strictly increasing. Values at -inf, inf must be explicitly provided to cover all Double values; otherwise, values outside the splits specified will be treated as errors. Two examples of `splits` are `Array(Double.NegativeInfinity, 0.0, 1.0, Double.PositiveInfinity)` and `Array(0.0, 1.0, 2.0)`.
1155
1156 Note that if you have no idea of the upper and lower bounds of the targeted column, you should add `Double.NegativeInfinity` and `Double.PositiveInfinity` as the bounds of your splits to prevent a potential out of Bucketizer bounds exception.
1157
1158 Note also that the splits that you provided have to be in strictly increasing order, i.e. `s0 < s1 < s2 < ... < sn`.
1159
1160 More details can be found in the API docs for [Bucketizer](api/scala/org/apache/spark/ml/feature/Bucketizer.html).
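
For illustration, a minimal Scala sketch with infinite outer bounds in the splits (assuming a `SparkSession` named `spark`; toy data):

~~~~
import org.apache.spark.ml.feature.Bucketizer

import spark.implicits._

val df = Seq(-999.9, -0.5, -0.3, 0.0, 0.2, 999.9).toDF("features")

val splits = Array(Double.NegativeInfinity, -0.5, 0.0, 0.5, Double.PositiveInfinity)

val bucketizer = new Bucketizer()
  .setInputCol("features")
  .setOutputCol("bucketedFeatures")
  .setSplits(splits)  // 5 split points define 4 buckets, indexed 0.0 through 3.0

bucketizer.transform(df).show()
~~~~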
1161
1162 **Examples**
1163
The following example demonstrates how to bucketize a column of `Double`s into a column of bucket indices.
1165
1166 <div class="codetabs">
1167 <div data-lang="scala" markdown="1">
1168
1169 Refer to the [Bucketizer Scala docs](api/scala/org/apache/spark/ml/feature/Bucketizer.html)
1170 for more details on the API.
1171
1172 {% include_example scala/org/apache/spark/examples/ml/BucketizerExample.scala %}
1173 </div>
1174
1175 <div data-lang="java" markdown="1">
1176
1177 Refer to the [Bucketizer Java docs](api/java/org/apache/spark/ml/feature/Bucketizer.html)
1178 for more details on the API.
1179
1180 {% include_example java/org/apache/spark/examples/ml/JavaBucketizerExample.java %}
1181 </div>
1182
1183 <div data-lang="python" markdown="1">
1184
1185 Refer to the [Bucketizer Python docs](api/python/pyspark.ml.html#pyspark.ml.feature.Bucketizer)
1186 for more details on the API.
1187
1188 {% include_example python/ml/bucketizer_example.py %}
1189 </div>
1190 </div>
1191
1192 ## ElementwiseProduct
1193
ElementwiseProduct multiplies each input vector by a provided "weight" vector, using element-wise multiplication. In other words, it scales each column of the dataset by a scalar multiplier. This represents the [Hadamard product](https://en.wikipedia.org/wiki/Hadamard_product_%28matrices%29) between the input vector `v` and the transforming vector `w`, to yield a result vector.
1195
1196 `\[ \begin{pmatrix}
1197 v_1 \\
1198 \vdots \\
1199 v_N
1200 \end{pmatrix} \circ \begin{pmatrix}
1201 w_1 \\
1202 \vdots \\
1203 w_N
1204 \end{pmatrix}
1205 = \begin{pmatrix}
1206 v_1 w_1 \\
1207 \vdots \\
1208 v_N w_N
1209 \end{pmatrix}
1210 \]`
1211
1212 **Examples**
1213
The example below demonstrates how to transform vectors using a transforming vector value.
1215
1216 <div class="codetabs">
1217 <div data-lang="scala" markdown="1">
1218
1219 Refer to the [ElementwiseProduct Scala docs](api/scala/org/apache/spark/ml/feature/ElementwiseProduct.html)
1220 for more details on the API.
1221
1222 {% include_example scala/org/apache/spark/examples/ml/ElementwiseProductExample.scala %}
1223 </div>
1224
1225 <div data-lang="java" markdown="1">
1226
1227 Refer to the [ElementwiseProduct Java docs](api/java/org/apache/spark/ml/feature/ElementwiseProduct.html)
1228 for more details on the API.
1229
1230 {% include_example java/org/apache/spark/examples/ml/JavaElementwiseProductExample.java %}
1231 </div>
1232
1233 <div data-lang="python" markdown="1">
1234
1235 Refer to the [ElementwiseProduct Python docs](api/python/pyspark.ml.html#pyspark.ml.feature.ElementwiseProduct)
1236 for more details on the API.
1237
1238 {% include_example python/ml/elementwise_product_example.py %}
1239 </div>
1240 </div>
1241
1242 ## SQLTransformer
1243
`SQLTransformer` implements transformations that are defined by a SQL statement.
1245 Currently, we only support SQL syntax like `"SELECT ... FROM __THIS__ ..."`
1246 where `"__THIS__"` represents the underlying table of the input dataset.
1247 The select clause specifies the fields, constants, and expressions to display in
1248 the output, and can be any select clause that Spark SQL supports. Users can also
use Spark SQL built-in functions and UDFs to operate on these selected columns.
1250 For example, `SQLTransformer` supports statements like:
1251
1252 * `SELECT a, a + b AS a_b FROM __THIS__`
1253 * `SELECT a, SQRT(b) AS b_sqrt FROM __THIS__ where a > 5`
1254 * `SELECT a, b, SUM(c) AS c_sum FROM __THIS__ GROUP BY a, b`
1255
1256 **Examples**
1257
1258 Assume that we have the following DataFrame with columns `id`, `v1` and `v2`:
1259
1260 ~~~~
1261 id | v1 | v2
1262 ----|-----|-----
1263 0 | 1.0 | 3.0
1264 2 | 2.0 | 5.0
1265 ~~~~
1266
1267 This is the output of the `SQLTransformer` with statement `"SELECT *, (v1 + v2) AS v3, (v1 * v2) AS v4 FROM __THIS__"`:
1268
1269 ~~~~
1270 id | v1 | v2 | v3 | v4
1271 ----|-----|-----|-----|-----
1272 0 | 1.0 | 3.0 | 4.0 | 3.0
1273 2 | 2.0 | 5.0 | 7.0 |10.0
1274 ~~~~
1275
1276 <div class="codetabs">
1277 <div data-lang="scala" markdown="1">
1278
1279 Refer to the [SQLTransformer Scala docs](api/scala/org/apache/spark/ml/feature/SQLTransformer.html)
1280 for more details on the API.
1281
1282 {% include_example scala/org/apache/spark/examples/ml/SQLTransformerExample.scala %}
1283 </div>
1284
1285 <div data-lang="java" markdown="1">
1286
1287 Refer to the [SQLTransformer Java docs](api/java/org/apache/spark/ml/feature/SQLTransformer.html)
1288 for more details on the API.
1289
1290 {% include_example java/org/apache/spark/examples/ml/JavaSQLTransformerExample.java %}
1291 </div>
1292
1293 <div data-lang="python" markdown="1">
1294
1295 Refer to the [SQLTransformer Python docs](api/python/pyspark.ml.html#pyspark.ml.feature.SQLTransformer) for more details on the API.
1296
1297 {% include_example python/ml/sql_transformer.py %}
1298 </div>
1299 </div>
1300
1301 ## VectorAssembler
1302
1303 `VectorAssembler` is a transformer that combines a given list of columns into a single vector
1304 column.
1305 It is useful for combining raw features and features generated by different feature transformers
1306 into a single feature vector, in order to train ML models like logistic regression and decision
1307 trees.
1308 `VectorAssembler` accepts the following input column types: all numeric types, boolean type,
1309 and vector type.
1310 In each row, the values of the input columns will be concatenated into a vector in the specified
1311 order.
1312
1313 **Examples**
1314
1315 Assume that we have a DataFrame with the columns `id`, `hour`, `mobile`, `userFeatures`,
1316 and `clicked`:
1317
1318 ~~~
1319 id | hour | mobile | userFeatures | clicked
1320 ----|------|--------|------------------|---------
1321 0 | 18 | 1.0 | [0.0, 10.0, 0.5] | 1.0
1322 ~~~
1323
1324 `userFeatures` is a vector column that contains three user features.
We want to combine `hour`, `mobile`, and `userFeatures` into a single feature vector
called `features` and use it to predict whether or not a user clicked.
1327 If we set `VectorAssembler`'s input columns to `hour`, `mobile`, and `userFeatures` and
1328 output column to `features`, after transformation we should get the following DataFrame:
1329
1330 ~~~
1331 id | hour | mobile | userFeatures | clicked | features
1332 ----|------|--------|------------------|---------|-----------------------------
1333 0 | 18 | 1.0 | [0.0, 10.0, 0.5] | 1.0 | [18.0, 1.0, 0.0, 10.0, 0.5]
1334 ~~~
1335
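As a minimal PySpark sketch of this transformation (assuming a `SparkSession` named `spark`; the data mirrors the single-row example above):

{% highlight python %}
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.linalg import Vectors

dataset = spark.createDataFrame(
    [(0, 18, 1.0, Vectors.dense([0.0, 10.0, 0.5]), 1.0)],
    ["id", "hour", "mobile", "userFeatures", "clicked"])

# Concatenate the numeric columns and the vector column into one feature vector.
assembler = VectorAssembler(
    inputCols=["hour", "mobile", "userFeatures"], outputCol="features")
assembler.transform(dataset).select("features", "clicked").show(truncate=False)
{% endhighlight %}
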
1336 <div class="codetabs">
1337 <div data-lang="scala" markdown="1">
1338
1339 Refer to the [VectorAssembler Scala docs](api/scala/org/apache/spark/ml/feature/VectorAssembler.html)
1340 for more details on the API.
1341
1342 {% include_example scala/org/apache/spark/examples/ml/VectorAssemblerExample.scala %}
1343 </div>
1344
1345 <div data-lang="java" markdown="1">
1346
1347 Refer to the [VectorAssembler Java docs](api/java/org/apache/spark/ml/feature/VectorAssembler.html)
1348 for more details on the API.
1349
1350 {% include_example java/org/apache/spark/examples/ml/JavaVectorAssemblerExample.java %}
1351 </div>
1352
1353 <div data-lang="python" markdown="1">
1354
1355 Refer to the [VectorAssembler Python docs](api/python/pyspark.ml.html#pyspark.ml.feature.VectorAssembler)
1356 for more details on the API.
1357
1358 {% include_example python/ml/vector_assembler_example.py %}
1359 </div>
1360 </div>
1361
1362 ## VectorSizeHint
1363
1364 It can sometimes be useful to explicitly specify the size of the vectors for a column of
1365 `VectorType`. For example, `VectorAssembler` uses size information from its input columns to
1366 produce size information and metadata for its output column. While in some cases this information
can be obtained by inspecting the contents of the column, in a streaming DataFrame the contents are
1368 not available until the stream is started. `VectorSizeHint` allows a user to explicitly specify the
1369 vector size for a column so that `VectorAssembler`, or other transformers that might
1370 need to know vector size, can use that column as an input.
1371
To use `VectorSizeHint` a user must set the `inputCol` and `size` parameters. Applying this
transformer to a DataFrame produces a new DataFrame with updated metadata for `inputCol` specifying
the vector size. Downstream operations on the resulting DataFrame can get this size using the
metadata.

`VectorSizeHint` can also take an optional `handleInvalid` parameter which controls its
behavior when the vector column contains nulls or vectors of the wrong size. By default
`handleInvalid` is set to "error", indicating an exception should be thrown. This parameter can
also be set to "skip", indicating that rows containing invalid values should be filtered out from
the resulting DataFrame, or "optimistic", indicating that the column should not be checked for
invalid values and all rows should be kept. Note that the use of "optimistic" can cause the
resulting DataFrame to be in an inconsistent state, meaning the metadata for the column
`VectorSizeHint` was applied to does not match the contents of that column. Users should take care
to avoid this kind of inconsistent state.
1386
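As a minimal PySpark sketch (assuming a `SparkSession` named `spark`; the column names and size are illustrative), the hint is applied before `VectorAssembler` so the assembler knows the vector size up front:

{% highlight python %}
from pyspark.ml.feature import VectorAssembler, VectorSizeHint
from pyspark.ml.linalg import Vectors

dataset = spark.createDataFrame(
    [(0, 18, 1.0, Vectors.dense([0.0, 10.0, 0.5]), 1.0)],
    ["id", "hour", "mobile", "userFeatures", "clicked"])

# Declare that `userFeatures` always has 3 elements; rows with nulls or
# wrongly sized vectors are dropped because handleInvalid="skip".
sizeHint = VectorSizeHint(inputCol="userFeatures", size=3, handleInvalid="skip")
assembler = VectorAssembler(
    inputCols=["hour", "mobile", "userFeatures"], outputCol="features")

assembler.transform(sizeHint.transform(dataset)).select("features").show(truncate=False)
{% endhighlight %}
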
1387 <div class="codetabs">
1388 <div data-lang="scala" markdown="1">
1389
1390 Refer to the [VectorSizeHint Scala docs](api/scala/org/apache/spark/ml/feature/VectorSizeHint.html)
1391 for more details on the API.
1392
1393 {% include_example scala/org/apache/spark/examples/ml/VectorSizeHintExample.scala %}
1394 </div>
1395
1396 <div data-lang="java" markdown="1">
1397
1398 Refer to the [VectorSizeHint Java docs](api/java/org/apache/spark/ml/feature/VectorSizeHint.html)
1399 for more details on the API.
1400
1401 {% include_example java/org/apache/spark/examples/ml/JavaVectorSizeHintExample.java %}
1402 </div>
1403
1404 <div data-lang="python" markdown="1">
1405
1406 Refer to the [VectorSizeHint Python docs](api/python/pyspark.ml.html#pyspark.ml.feature.VectorSizeHint)
1407 for more details on the API.
1408
1409 {% include_example python/ml/vector_size_hint_example.py %}
1410 </div>
1411 </div>
1412
1413 ## QuantileDiscretizer
1414
1415 `QuantileDiscretizer` takes a column with continuous features and outputs a column with binned
1416 categorical features. The number of bins is set by the `numBuckets` parameter. It is possible
1417 that the number of buckets used will be smaller than this value, for example, if there are too few
1418 distinct values of the input to create enough distinct quantiles.
1419
1420 NaN values:
1421 NaN values will be removed from the column during `QuantileDiscretizer` fitting. This will produce
1422 a `Bucketizer` model for making predictions. During the transformation, `Bucketizer`
will raise an error when it finds NaN values in the dataset, but the user can also choose to either
keep or remove NaN values within the dataset by setting `handleInvalid`. If the user chooses to keep
NaN values, they will be handled specially and placed into their own bucket: for example, if 4 buckets
are used, then non-NaN data will be put into buckets[0-3], while NaNs will be placed in a special bucket[4].
1427
1428 Algorithm: The bin ranges are chosen using an approximate algorithm (see the documentation for
1429 [approxQuantile](api/scala/org/apache/spark/sql/DataFrameStatFunctions.html) for a
1430 detailed description). The precision of the approximation can be controlled with the
1431 `relativeError` parameter. When set to zero, exact quantiles are calculated
1432 (**Note:** Computing exact quantiles is an expensive operation). The lower and upper bin bounds
1433 will be `-Infinity` and `+Infinity` covering all real values.
1434
1435 **Examples**
1436
Assume that we have a DataFrame with the columns `id` and `hour`:
1438
~~~
 id | hour
----|------
 0  | 18.0
 1  | 19.0
 2  | 8.0
 3  | 5.0
 4  | 2.2
~~~
1452
1453 `hour` is a continuous feature with `Double` type. We want to turn the continuous feature into
1454 a categorical one. Given `numBuckets = 3`, we should get the following DataFrame:
1455
~~~
 id | hour | result
----|------|--------
 0  | 18.0 | 2.0
 1  | 19.0 | 2.0
 2  | 8.0  | 1.0
 3  | 5.0  | 1.0
 4  | 2.2  | 0.0
~~~
1469
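As a minimal PySpark sketch of this example (assuming a `SparkSession` named `spark`):

{% highlight python %}
from pyspark.ml.feature import QuantileDiscretizer

df = spark.createDataFrame(
    [(0, 18.0), (1, 19.0), (2, 8.0), (3, 5.0), (4, 2.2)], ["id", "hour"])

# Fit approximate quantile boundaries on `hour`, then bucket each value into [0, numBuckets).
discretizer = QuantileDiscretizer(numBuckets=3, inputCol="hour", outputCol="result")
discretizer.fit(df).transform(df).show()
{% endhighlight %}
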
1470 <div class="codetabs">
1471 <div data-lang="scala" markdown="1">
1472
1473 Refer to the [QuantileDiscretizer Scala docs](api/scala/org/apache/spark/ml/feature/QuantileDiscretizer.html)
1474 for more details on the API.
1475
1476 {% include_example scala/org/apache/spark/examples/ml/QuantileDiscretizerExample.scala %}
1477 </div>
1478
1479 <div data-lang="java" markdown="1">
1480
1481 Refer to the [QuantileDiscretizer Java docs](api/java/org/apache/spark/ml/feature/QuantileDiscretizer.html)
1482 for more details on the API.
1483
1484 {% include_example java/org/apache/spark/examples/ml/JavaQuantileDiscretizerExample.java %}
1485 </div>
1486
1487 <div data-lang="python" markdown="1">
1488
1489 Refer to the [QuantileDiscretizer Python docs](api/python/pyspark.ml.html#pyspark.ml.feature.QuantileDiscretizer)
1490 for more details on the API.
1491
1492 {% include_example python/ml/quantile_discretizer_example.py %}
1493 </div>
1494
1495 </div>
1496
1497
1498 ## Imputer
1499
The `Imputer` estimator completes missing values in a dataset, using either the mean or the
median of the columns in which the missing values are located. The input columns should be of
numeric type. Currently `Imputer` does not support categorical features and may produce
incorrect values for columns containing categorical features. `Imputer` can impute custom values
other than 'NaN' via `.setMissingValue(custom_value)`. For example, `.setMissingValue(0)` will treat
all occurrences of 0 as missing and impute them.

**Note** that all `null` values in the input columns are treated as missing, and so are also imputed.
1508
1509 **Examples**
1510
1511 Suppose that we have a DataFrame with the columns `a` and `b`:
1512
1513 ~~~
1514 a | b
1515 ------------|-----------
1516 1.0 | Double.NaN
1517 2.0 | Double.NaN
1518 Double.NaN | 3.0
1519 4.0 | 4.0
1520 5.0 | 5.0
1521 ~~~
1522
In this example, `Imputer` replaces all occurrences of `Double.NaN` (the default missing value)
with the mean (the default imputation strategy) computed from the other values in the corresponding columns.
Here, the surrogate values for columns `a` and `b` are 3.0 and 4.0 respectively. After
transformation, the missing values in the output columns are replaced by the surrogate value for
the relevant column.
1528
1529 ~~~
1530 a | b | out_a | out_b
1531 ------------|------------|-------|-------
1532 1.0 | Double.NaN | 1.0 | 4.0
1533 2.0 | Double.NaN | 2.0 | 4.0
1534 Double.NaN | 3.0 | 3.0 | 3.0
1535 4.0 | 4.0 | 4.0 | 4.0
1536 5.0 | 5.0 | 5.0 | 5.0
1537 ~~~
1538
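As a minimal PySpark sketch of this example (assuming a `SparkSession` named `spark`):

{% highlight python %}
from pyspark.ml.feature import Imputer

df = spark.createDataFrame(
    [(1.0, float("nan")), (2.0, float("nan")), (float("nan"), 3.0),
     (4.0, 4.0), (5.0, 5.0)], ["a", "b"])

# Mean imputation is the default strategy; use setStrategy("median") for the median.
imputer = Imputer(inputCols=["a", "b"], outputCols=["out_a", "out_b"])
imputer.fit(df).transform(df).show()
{% endhighlight %}
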
1539 <div class="codetabs">
1540 <div data-lang="scala" markdown="1">
1541
1542 Refer to the [Imputer Scala docs](api/scala/org/apache/spark/ml/feature/Imputer.html)
1543 for more details on the API.
1544
1545 {% include_example scala/org/apache/spark/examples/ml/ImputerExample.scala %}
1546 </div>
1547
1548 <div data-lang="java" markdown="1">
1549
1550 Refer to the [Imputer Java docs](api/java/org/apache/spark/ml/feature/Imputer.html)
1551 for more details on the API.
1552
1553 {% include_example java/org/apache/spark/examples/ml/JavaImputerExample.java %}
1554 </div>
1555
1556 <div data-lang="python" markdown="1">
1557
1558 Refer to the [Imputer Python docs](api/python/pyspark.ml.html#pyspark.ml.feature.Imputer)
1559 for more details on the API.
1560
1561 {% include_example python/ml/imputer_example.py %}
1562 </div>
1563 </div>
1564
1565 # Feature Selectors
1566
1567 ## VectorSlicer
1568
1569 `VectorSlicer` is a transformer that takes a feature vector and outputs a new feature vector with a
1570 sub-array of the original features. It is useful for extracting features from a vector column.
1571
1572 `VectorSlicer` accepts a vector column with specified indices, then outputs a new vector column
whose values are selected via those indices. There are two types of indices:
1574
1575 1. Integer indices that represent the indices into the vector, `setIndices()`.
1576
1577 2. String indices that represent the names of features into the vector, `setNames()`.
1578 *This requires the vector column to have an `AttributeGroup` since the implementation matches on
1579 the name field of an `Attribute`.*
1580
Specification by integer indices and by string names are both acceptable. Moreover, you can use
integer indices and string names simultaneously. At least one feature must be selected. Duplicate
features are not allowed, so there can be no overlap between the selected indices and names. Note
that if features are selected by name, an exception will be thrown if empty input attributes are encountered.
1585
1586 The output vector will order features with the selected indices first (in the order given),
1587 followed by the selected names (in the order given).
1588
1589 **Examples**
1590
1591 Suppose that we have a DataFrame with the column `userFeatures`:
1592
1593 ~~~
1594 userFeatures
1595 ------------------
1596 [0.0, 10.0, 0.5]
1597 ~~~
1598
`userFeatures` is a vector column that contains three user features. Assume that the first column
of `userFeatures` is all zeros, so we want to remove it and select only the last two columns.
1601 The `VectorSlicer` selects the last two elements with `setIndices(1, 2)` then produces a new vector
1602 column named `features`:
1603
1604 ~~~
1605 userFeatures | features
1606 ------------------|-----------------------------
1607 [0.0, 10.0, 0.5] | [10.0, 0.5]
1608 ~~~
1609
Suppose also that we have potential input attributes for `userFeatures`, i.e.
`["f1", "f2", "f3"]`; then we can use `setNames("f2", "f3")` to select them.
1612
1613 ~~~
1614 userFeatures | features
1615 ------------------|-----------------------------
1616 [0.0, 10.0, 0.5] | [10.0, 0.5]
1617 ["f1", "f2", "f3"] | ["f2", "f3"]
1618 ~~~
1619
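As a minimal PySpark sketch of selection by integer indices (assuming a `SparkSession` named `spark`; selection by name additionally requires attribute metadata on the vector column):

{% highlight python %}
from pyspark.ml.feature import VectorSlicer
from pyspark.ml.linalg import Vectors

df = spark.createDataFrame([(Vectors.dense([0.0, 10.0, 0.5]),)], ["userFeatures"])

# Keep only the elements at indices 1 and 2 of each vector.
slicer = VectorSlicer(inputCol="userFeatures", outputCol="features", indices=[1, 2])
slicer.transform(df).show(truncate=False)
{% endhighlight %}
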
1620 <div class="codetabs">
1621 <div data-lang="scala" markdown="1">
1622
1623 Refer to the [VectorSlicer Scala docs](api/scala/org/apache/spark/ml/feature/VectorSlicer.html)
1624 for more details on the API.
1625
1626 {% include_example scala/org/apache/spark/examples/ml/VectorSlicerExample.scala %}
1627 </div>
1628
1629 <div data-lang="java" markdown="1">
1630
1631 Refer to the [VectorSlicer Java docs](api/java/org/apache/spark/ml/feature/VectorSlicer.html)
1632 for more details on the API.
1633
1634 {% include_example java/org/apache/spark/examples/ml/JavaVectorSlicerExample.java %}
1635 </div>
1636
1637 <div data-lang="python" markdown="1">
1638
1639 Refer to the [VectorSlicer Python docs](api/python/pyspark.ml.html#pyspark.ml.feature.VectorSlicer)
1640 for more details on the API.
1641
1642 {% include_example python/ml/vector_slicer_example.py %}
1643 </div>
1644 </div>
1645
1646 ## RFormula
1647
1648 `RFormula` selects columns specified by an [R model formula](https://stat.ethz.ch/R-manual/R-devel/library/stats/html/formula.html).
1649 Currently we support a limited subset of the R operators, including '~', '.', ':', '+', and '-'.
1650 The basic operators are:
1651
1652 * `~` separate target and terms
1653 * `+` concat terms, "+ 0" means removing intercept
1654 * `-` remove a term, "- 1" means removing intercept
1655 * `:` interaction (multiplication for numeric values, or binarized categorical values)
1656 * `.` all columns except target
1657
Suppose `a` and `b` are double columns; we use the following simple examples to illustrate the effect of `RFormula`:
1659
1660 * `y ~ a + b` means model `y ~ w0 + w1 * a + w2 * b` where `w0` is the intercept and `w1, w2` are coefficients.
1661 * `y ~ a + b + a:b - 1` means model `y ~ w1 * a + w2 * b + w3 * a * b` where `w1, w2, w3` are coefficients.
1662
`RFormula` produces a vector column of features and a double or string column of labels.
As when formulas are used in R for linear regression, numeric columns will be cast to doubles.
String input columns will first be transformed with [StringIndexer](ml-features.html#stringindexer) using the ordering determined by `stringOrderType`,
the last category after ordering is dropped, and then the resulting indices will be one-hot encoded.
1667
Suppose a string feature column contains the values `{'b', 'a', 'b', 'a', 'c', 'b'}`; we set `stringOrderType` to control the encoding:
1669 ~~~
1670 stringOrderType | Category mapped to 0 by StringIndexer | Category dropped by RFormula
1671 ----------------|---------------------------------------|---------------------------------
1672 'frequencyDesc' | most frequent category ('b') | least frequent category ('c')
1673 'frequencyAsc' | least frequent category ('c') | most frequent category ('b')
1674 'alphabetDesc' | last alphabetical category ('c') | first alphabetical category ('a')
1675 'alphabetAsc' | first alphabetical category ('a') | last alphabetical category ('c')
1676 ~~~
1677
1678 If the label column is of type string, it will be first transformed to double with [StringIndexer](ml-features.html#stringindexer) using `frequencyDesc` ordering.
1679 If the label column does not exist in the DataFrame, the output label column will be created from the specified response variable in the formula.
1680
1681 **Note:** The ordering option `stringOrderType` is NOT used for the label column. When the label column is indexed, it uses the default descending frequency ordering in `StringIndexer`.
1682
1683 **Examples**
1684
1685 Assume that we have a DataFrame with the columns `id`, `country`, `hour`, and `clicked`:
1686
1687 ~~~
1688 id | country | hour | clicked
1689 ---|---------|------|---------
1690 7 | "US" | 18 | 1.0
1691 8 | "CA" | 12 | 0.0
1692 9 | "NZ" | 15 | 0.0
1693 ~~~
1694
1695 If we use `RFormula` with a formula string of `clicked ~ country + hour`, which indicates that we want to
1696 predict `clicked` based on `country` and `hour`, after transformation we should get the following DataFrame:
1697
1698 ~~~
1699 id | country | hour | clicked | features | label
1700 ---|---------|------|---------|------------------|-------
1701 7 | "US" | 18 | 1.0 | [0.0, 0.0, 18.0] | 1.0
1702 8 | "CA" | 12 | 0.0 | [0.0, 1.0, 12.0] | 0.0
1703 9 | "NZ" | 15 | 0.0 | [1.0, 0.0, 15.0] | 0.0
1704 ~~~
1705
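As a minimal PySpark sketch of this example (assuming a `SparkSession` named `spark`):

{% highlight python %}
from pyspark.ml.feature import RFormula

dataset = spark.createDataFrame(
    [(7, "US", 18, 1.0), (8, "CA", 12, 0.0), (9, "NZ", 15, 0.0)],
    ["id", "country", "hour", "clicked"])

# "clicked ~ country + hour": clicked becomes the label, while country
# (string-indexed and one-hot encoded) and hour become the feature vector.
formula = RFormula(formula="clicked ~ country + hour",
                   featuresCol="features", labelCol="label")
formula.fit(dataset).transform(dataset).select("features", "label").show(truncate=False)
{% endhighlight %}
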
1706 <div class="codetabs">
1707 <div data-lang="scala" markdown="1">
1708
1709 Refer to the [RFormula Scala docs](api/scala/org/apache/spark/ml/feature/RFormula.html)
1710 for more details on the API.
1711
1712 {% include_example scala/org/apache/spark/examples/ml/RFormulaExample.scala %}
1713 </div>
1714
1715 <div data-lang="java" markdown="1">
1716
1717 Refer to the [RFormula Java docs](api/java/org/apache/spark/ml/feature/RFormula.html)
1718 for more details on the API.
1719
1720 {% include_example java/org/apache/spark/examples/ml/JavaRFormulaExample.java %}
1721 </div>
1722
1723 <div data-lang="python" markdown="1">
1724
1725 Refer to the [RFormula Python docs](api/python/pyspark.ml.html#pyspark.ml.feature.RFormula)
1726 for more details on the API.
1727
1728 {% include_example python/ml/rformula_example.py %}
1729 </div>
1730 </div>
1731
1732 ## ChiSqSelector
1733
1734 `ChiSqSelector` stands for Chi-Squared feature selection. It operates on labeled data with
1735 categorical features. ChiSqSelector uses the
1736 [Chi-Squared test of independence](https://en.wikipedia.org/wiki/Chi-squared_test) to decide which
features to choose. It supports five selection methods: `numTopFeatures`, `percentile`, `fpr`, `fdr`, `fwe`:

* `numTopFeatures` chooses a fixed number of top features according to a chi-squared test. This is akin to yielding the features with the most predictive power.
* `percentile` is similar to `numTopFeatures` but chooses a fraction of all features instead of a fixed number.
* `fpr` chooses all features whose p-values are below a threshold, thus controlling the false positive rate of selection.
* `fdr` uses the [Benjamini-Hochberg procedure](https://en.wikipedia.org/wiki/False_discovery_rate#Benjamini.E2.80.93Hochberg_procedure) to choose all features whose false discovery rate is below a threshold.
* `fwe` chooses all features whose p-values are below a threshold. The threshold is scaled by 1/numFeatures, thus controlling the family-wise error rate of selection.

By default, the selection method is `numTopFeatures`, with the default number of top features set to 50.
The user can choose a selection method using `setSelectorType`.
1745
1746 **Examples**
1747
1748 Assume that we have a DataFrame with the columns `id`, `features`, and `clicked`, which is used as
1749 our target to be predicted:
1750
1751 ~~~
1752 id | features | clicked
1753 ---|-----------------------|---------
1754 7 | [0.0, 0.0, 18.0, 1.0] | 1.0
1755 8 | [0.0, 1.0, 12.0, 0.0] | 0.0
1756 9 | [1.0, 0.0, 15.0, 0.1] | 0.0
1757 ~~~
1758
1759 If we use `ChiSqSelector` with `numTopFeatures = 1`, then according to our label `clicked` the
1760 last column in our `features` is chosen as the most useful feature:
1761
1762 ~~~
1763 id | features | clicked | selectedFeatures
1764 ---|-----------------------|---------|------------------
1765 7 | [0.0, 0.0, 18.0, 1.0] | 1.0 | [1.0]
1766 8 | [0.0, 1.0, 12.0, 0.0] | 0.0 | [0.0]
1767 9 | [1.0, 0.0, 15.0, 0.1] | 0.0 | [0.1]
1768 ~~~
1769
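As a minimal PySpark sketch of this example (assuming a `SparkSession` named `spark`):

{% highlight python %}
from pyspark.ml.feature import ChiSqSelector
from pyspark.ml.linalg import Vectors

df = spark.createDataFrame(
    [(7, Vectors.dense([0.0, 0.0, 18.0, 1.0]), 1.0),
     (8, Vectors.dense([0.0, 1.0, 12.0, 0.0]), 0.0),
     (9, Vectors.dense([1.0, 0.0, 15.0, 0.1]), 0.0)],
    ["id", "features", "clicked"])

# Keep the single feature most associated with the label according to the chi-squared test.
selector = ChiSqSelector(numTopFeatures=1, featuresCol="features",
                         labelCol="clicked", outputCol="selectedFeatures")
selector.fit(df).transform(df).show()
{% endhighlight %}
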
1770 <div class="codetabs">
1771 <div data-lang="scala" markdown="1">
1772
1773 Refer to the [ChiSqSelector Scala docs](api/scala/org/apache/spark/ml/feature/ChiSqSelector.html)
1774 for more details on the API.
1775
1776 {% include_example scala/org/apache/spark/examples/ml/ChiSqSelectorExample.scala %}
1777 </div>
1778
1779 <div data-lang="java" markdown="1">
1780
1781 Refer to the [ChiSqSelector Java docs](api/java/org/apache/spark/ml/feature/ChiSqSelector.html)
1782 for more details on the API.
1783
1784 {% include_example java/org/apache/spark/examples/ml/JavaChiSqSelectorExample.java %}
1785 </div>
1786
1787 <div data-lang="python" markdown="1">
1788
1789 Refer to the [ChiSqSelector Python docs](api/python/pyspark.ml.html#pyspark.ml.feature.ChiSqSelector)
1790 for more details on the API.
1791
1792 {% include_example python/ml/chisq_selector_example.py %}
1793 </div>
1794 </div>
1795
1796 # Locality Sensitive Hashing
1797 [Locality Sensitive Hashing (LSH)](https://en.wikipedia.org/wiki/Locality-sensitive_hashing) is an important class of hashing techniques, which is commonly used in clustering, approximate nearest neighbor search and outlier detection with large datasets.
1798
1799 The general idea of LSH is to use a family of functions ("LSH families") to hash data points into buckets, so that the data points which are close to each other are in the same buckets with high probability, while data points that are far away from each other are very likely in different buckets. An LSH family is formally defined as follows.
1800
1801 In a metric space `(M, d)`, where `M` is a set and `d` is a distance function on `M`, an LSH family is a family of functions `h` that satisfy the following properties:
1802 `\[
1803 \forall p, q \in M,\\
1804 d(p,q) \leq r1 \Rightarrow Pr(h(p)=h(q)) \geq p1\\
1805 d(p,q) \geq r2 \Rightarrow Pr(h(p)=h(q)) \leq p2
1806 \]`
1807 This LSH family is called `(r1, r2, p1, p2)`-sensitive.
1808
1809 In Spark, different LSH families are implemented in separate classes (e.g., `MinHash`), and APIs for feature transformation, approximate similarity join and approximate nearest neighbor are provided in each class.
1810
1811 In LSH, we define a false positive as a pair of distant input features (with `$d(p,q) \geq r2$`) which are hashed into the same bucket, and we define a false negative as a pair of nearby features (with `$d(p,q) \leq r1$`) which are hashed into different buckets.
1812
1813 ## LSH Operations
1814
1815 We describe the major types of operations which LSH can be used for. A fitted LSH model has methods for each of these operations.
1816
1817 ### Feature Transformation
1818 Feature transformation is the basic functionality to add hashed values as a new column. This can be useful for dimensionality reduction. Users can specify input and output column names by setting `inputCol` and `outputCol`.
1819
1820 LSH also supports multiple LSH hash tables. Users can specify the number of hash tables by setting `numHashTables`. This is also used for [OR-amplification](https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Amplification) in approximate similarity join and approximate nearest neighbor. Increasing the number of hash tables will increase the accuracy but will also increase communication cost and running time.
1821
The type of `outputCol` is `Seq[Vector]` where the length of the array equals `numHashTables`, and the dimensions of the vectors are currently set to 1. In future releases, we will implement AND-amplification so that users can specify the dimensions of these vectors.
1823
1824 ### Approximate Similarity Join
1825 Approximate similarity join takes two datasets and approximately returns pairs of rows in the datasets whose distance is smaller than a user-defined threshold. Approximate similarity join supports both joining two different datasets and self-joining. Self-joining will produce some duplicate pairs.
1826
1827 Approximate similarity join accepts both transformed and untransformed datasets as input. If an untransformed dataset is used, it will be transformed automatically. In this case, the hash signature will be created as `outputCol`.
1828
In the joined dataset, the original datasets can be queried in `datasetA` and `datasetB`. A distance column will be added to the output dataset to show the true distance between each pair of rows returned.
1830
1831 ### Approximate Nearest Neighbor Search
1832 Approximate nearest neighbor search takes a dataset (of feature vectors) and a key (a single feature vector), and it approximately returns a specified number of rows in the dataset that are closest to the vector.
1833
1834 Approximate nearest neighbor search accepts both transformed and untransformed datasets as input. If an untransformed dataset is used, it will be transformed automatically. In this case, the hash signature will be created as `outputCol`.
1835
1836 A distance column will be added to the output dataset to show the true distance between each output row and the searched key.
1837
1838 **Note:** Approximate nearest neighbor search will return fewer than `k` rows when there are not enough candidates in the hash bucket.
1839
1840 ## LSH Algorithms
1841
1842 ### Bucketed Random Projection for Euclidean Distance
1843
1844 [Bucketed Random Projection](https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Stable_distributions) is an LSH family for Euclidean distance. The Euclidean distance is defined as follows:
1845 `\[
1846 d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_i (x_i - y_i)^2}
1847 \]`
Its LSH family projects feature vectors `$\mathbf{x}$` onto a random unit vector `$\mathbf{v}$` and partitions the projected results into hash buckets:
1849 `\[
1850 h(\mathbf{x}) = \Big\lfloor \frac{\mathbf{x} \cdot \mathbf{v}}{r} \Big\rfloor
1851 \]`
1852 where `r` is a user-defined bucket length. The bucket length can be used to control the average size of hash buckets (and thus the number of buckets). A larger bucket length (i.e., fewer buckets) increases the probability of features being hashed to the same bucket (increasing the numbers of true and false positives).
1853
1854 Bucketed Random Projection accepts arbitrary vectors as input features, and supports both sparse and dense vectors.
1855
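As a minimal PySpark sketch (assuming a `SparkSession` named `spark`; the `bucketLength`, `numHashTables`, thresholds and toy vectors below are illustrative), a fitted model supports all three LSH operations described above:

{% highlight python %}
from pyspark.ml.feature import BucketedRandomProjectionLSH
from pyspark.ml.linalg import Vectors
from pyspark.sql.functions import col

dfA = spark.createDataFrame([(0, Vectors.dense([1.0, 1.0])),
                             (1, Vectors.dense([1.0, -1.0])),
                             (2, Vectors.dense([-1.0, -1.0]))], ["id", "features"])
dfB = spark.createDataFrame([(3, Vectors.dense([1.0, 0.0])),
                             (4, Vectors.dense([-1.0, 0.0]))], ["id", "features"])

brp = BucketedRandomProjectionLSH(inputCol="features", outputCol="hashes",
                                  bucketLength=2.0, numHashTables=3)
model = brp.fit(dfA)

# Feature transformation: adds the `hashes` column.
model.transform(dfA).show(truncate=False)

# Approximate similarity join: pairs with Euclidean distance below 1.5.
model.approxSimilarityJoin(dfA, dfB, 1.5, distCol="EuclideanDistance") \
    .select(col("datasetA.id").alias("idA"),
            col("datasetB.id").alias("idB"),
            "EuclideanDistance").show()

# Approximate nearest neighbors: the 2 rows of dfA closest to the key vector.
model.approxNearestNeighbors(dfA, Vectors.dense([1.0, 0.0]), 2).show()
{% endhighlight %}
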
1856 <div class="codetabs">
1857 <div data-lang="scala" markdown="1">
1858
1859 Refer to the [BucketedRandomProjectionLSH Scala docs](api/scala/org/apache/spark/ml/feature/BucketedRandomProjectionLSH.html)
1860 for more details on the API.
1861
1862 {% include_example scala/org/apache/spark/examples/ml/BucketedRandomProjectionLSHExample.scala %}
1863 </div>
1864
1865 <div data-lang="java" markdown="1">
1866
1867 Refer to the [BucketedRandomProjectionLSH Java docs](api/java/org/apache/spark/ml/feature/BucketedRandomProjectionLSH.html)
1868 for more details on the API.
1869
1870 {% include_example java/org/apache/spark/examples/ml/JavaBucketedRandomProjectionLSHExample.java %}
1871 </div>
1872
1873 <div data-lang="python" markdown="1">
1874
1875 Refer to the [BucketedRandomProjectionLSH Python docs](api/python/pyspark.ml.html#pyspark.ml.feature.BucketedRandomProjectionLSH)
1876 for more details on the API.
1877
1878 {% include_example python/ml/bucketed_random_projection_lsh_example.py %}
1879 </div>
1880
1881 </div>
1882
1883 ### MinHash for Jaccard Distance
1884 [MinHash](https://en.wikipedia.org/wiki/MinHash) is an LSH family for Jaccard distance where input features are sets of natural numbers. Jaccard distance of two sets is defined by the cardinality of their intersection and union:
1885 `\[
1886 d(\mathbf{A}, \mathbf{B}) = 1 - \frac{|\mathbf{A} \cap \mathbf{B}|}{|\mathbf{A} \cup \mathbf{B}|}
1887 \]`
MinHash applies a random hash function `g` to each element in the set and takes the minimum of all hashed values:
1889 `\[
1890 h(\mathbf{A}) = \min_{a \in \mathbf{A}}(g(a))
1891 \]`
1892
The input sets for MinHash are represented as binary vectors, where the vector indices represent the elements themselves and the non-zero values in the vector represent the presence of those elements in the set. While both dense and sparse vectors are supported, typically sparse vectors are recommended for efficiency. For example, `Vectors.sparse(10, Seq((2, 1.0), (3, 1.0), (5, 1.0)))` means there are 10 elements in the space. This set contains elements 2, 3, and 5. All non-zero values are treated as binary "1" values.
1894
1895 **Note:** Empty sets cannot be transformed by MinHash, which means any input vector must have at least 1 non-zero entry.
1896
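As a minimal PySpark sketch (assuming a `SparkSession` named `spark`; the element space of size 6 and the distance threshold are illustrative), sets are encoded as sparse binary vectors before hashing:

{% highlight python %}
from pyspark.ml.feature import MinHashLSH
from pyspark.ml.linalg import Vectors

# Each row is a set over 6 possible elements, encoded as a sparse binary vector.
dfA = spark.createDataFrame([(0, Vectors.sparse(6, [0, 1, 2], [1.0, 1.0, 1.0])),
                             (1, Vectors.sparse(6, [2, 3, 4], [1.0, 1.0, 1.0]))],
                            ["id", "features"])
dfB = spark.createDataFrame([(2, Vectors.sparse(6, [1, 3, 5], [1.0, 1.0, 1.0]))],
                            ["id", "features"])

mh = MinHashLSH(inputCol="features", outputCol="hashes", numHashTables=5)
model = mh.fit(dfA)

# Approximate similarity join on Jaccard distance, keeping pairs with distance below 0.8.
model.approxSimilarityJoin(dfA, dfB, 0.8, distCol="JaccardDistance").show()
{% endhighlight %}
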
1897 <div class="codetabs">
1898 <div data-lang="scala" markdown="1">
1899
1900 Refer to the [MinHashLSH Scala docs](api/scala/org/apache/spark/ml/feature/MinHashLSH.html)
1901 for more details on the API.
1902
1903 {% include_example scala/org/apache/spark/examples/ml/MinHashLSHExample.scala %}
1904 </div>
1905
1906 <div data-lang="java" markdown="1">
1907
1908 Refer to the [MinHashLSH Java docs](api/java/org/apache/spark/ml/feature/MinHashLSH.html)
1909 for more details on the API.
1910
1911 {% include_example java/org/apache/spark/examples/ml/JavaMinHashLSHExample.java %}
1912 </div>
1913
1914 <div data-lang="python" markdown="1">
1915
1916 Refer to the [MinHashLSH Python docs](api/python/pyspark.ml.html#pyspark.ml.feature.MinHashLSH)
1917 for more details on the API.
1918
1919 {% include_example python/ml/min_hash_lsh_example.py %}
1920 </div>
1921 </div>