0001 ---
0002 layout: global
0003 title: Feature Extraction and Transformation - RDD-based API
0004 displayTitle: Feature Extraction and Transformation - RDD-based API
0005 license: |
0006 Licensed to the Apache Software Foundation (ASF) under one or more
0007 contributor license agreements. See the NOTICE file distributed with
0008 this work for additional information regarding copyright ownership.
0009 The ASF licenses this file to You under the Apache License, Version 2.0
0010 (the "License"); you may not use this file except in compliance with
0011 the License. You may obtain a copy of the License at
0012
0013 http://www.apache.org/licenses/LICENSE-2.0
0014
0015 Unless required by applicable law or agreed to in writing, software
0016 distributed under the License is distributed on an "AS IS" BASIS,
0017 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
0018 See the License for the specific language governing permissions and
0019 limitations under the License.
0020 ---
0021
0022 * Table of contents
0023 {:toc}
0024
0025
0026 ## TF-IDF
0027
0028 **Note** We recommend using the DataFrame-based API, which is detailed in the [ML user guide on
0029 TF-IDF](ml-features.html#tf-idf).
0030
0031 [Term frequency-inverse document frequency (TF-IDF)](http://en.wikipedia.org/wiki/Tf%E2%80%93idf) is a feature
0032 vectorization method widely used in text mining to reflect the importance of a term to a document in the corpus.
0033 Denote a term by `$t$`, a document by `$d$`, and the corpus by `$D$`.
0034 Term frequency `$TF(t, d)$` is the number of times that term `$t$` appears in document `$d$`,
0035 while document frequency `$DF(t, D)$` is the number of documents that contains term `$t$`.
0036 If we only use term frequency to measure the importance, it is very easy to over-emphasize terms that
0037 appear very often but carry little information about the document, e.g., "a", "the", and "of".
0038 If a term appears very often across the corpus, it means it doesn't carry special information about
0039 a particular document.
0040 Inverse document frequency is a numerical measure of how much information a term provides:
0041 `\[
0042 IDF(t, D) = \log \frac{|D| + 1}{DF(t, D) + 1},
0043 \]`
0044 where `$|D|$` is the total number of documents in the corpus.
0045 Since logarithm is used, if a term appears in all documents, its IDF value becomes 0.
0046 Note that a smoothing term is applied to avoid dividing by zero for terms outside the corpus.
0047 The TF-IDF measure is simply the product of TF and IDF:
0048 `\[
0049 TFIDF(t, d, D) = TF(t, d) \cdot IDF(t, D).
0050 \]`
0051 There are several variants on the definition of term frequency and document frequency.
0052 In `spark.mllib`, we separate TF and IDF to make them flexible.
0053
0054 Our implementation of term frequency utilizes the
0055 [hashing trick](http://en.wikipedia.org/wiki/Feature_hashing).
0056 A raw feature is mapped into an index (term) by applying a hash function.
0057 Then term frequencies are calculated based on the mapped indices.
0058 This approach avoids the need to compute a global term-to-index map,
0059 which can be expensive for a large corpus, but it suffers from potential hash collisions,
0060 where different raw features may become the same term after hashing.
0061 To reduce the chance of collision, we can increase the target feature dimension, i.e.,
0062 the number of buckets of the hash table.
0063 The default feature dimension is `$2^{20} = 1,048,576$`.
0064
0065 **Note:** `spark.mllib` doesn't provide tools for text segmentation.
0066 We refer users to the [Stanford NLP Group](http://nlp.stanford.edu/) and
0067 [scalanlp/chalk](https://github.com/scalanlp/chalk).
0068
0069 <div class="codetabs">
0070 <div data-lang="scala" markdown="1">
0071
0072 TF and IDF are implemented in [HashingTF](api/scala/org/apache/spark/mllib/feature/HashingTF.html)
0073 and [IDF](api/scala/org/apache/spark/mllib/feature/IDF.html).
0074 `HashingTF` takes an `RDD[Iterable[_]]` as the input.
0075 Each record could be an iterable of strings or other types.
0076
0077 Refer to the [`HashingTF` Scala docs](api/scala/org/apache/spark/mllib/feature/HashingTF.html) for details on the API.
0078
0079 {% include_example scala/org/apache/spark/examples/mllib/TFIDFExample.scala %}
0080 </div>
0081 <div data-lang="python" markdown="1">
0082
0083 TF and IDF are implemented in [HashingTF](api/python/pyspark.mllib.html#pyspark.mllib.feature.HashingTF)
0084 and [IDF](api/python/pyspark.mllib.html#pyspark.mllib.feature.IDF).
0085 `HashingTF` takes an RDD of list as the input.
0086 Each record could be an iterable of strings or other types.
0087
0088
0089 Refer to the [`HashingTF` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.feature.HashingTF) for details on the API.
0090
0091 {% include_example python/mllib/tf_idf_example.py %}
0092 </div>
0093 </div>
0094
0095 ## Word2Vec
0096
0097 [Word2Vec](https://code.google.com/p/word2vec/) computes distributed vector representation of words.
0098 The main advantage of the distributed
0099 representations is that similar words are close in the vector space, which makes generalization to
0100 novel patterns easier and model estimation more robust. Distributed vector representation is
0101 showed to be useful in many natural language processing applications such as named entity
0102 recognition, disambiguation, parsing, tagging and machine translation.
0103
0104 ### Model
0105
0106 In our implementation of Word2Vec, we used skip-gram model. The training objective of skip-gram is
0107 to learn word vector representations that are good at predicting its context in the same sentence.
0108 Mathematically, given a sequence of training words `$w_1, w_2, \dots, w_T$`, the objective of the
0109 skip-gram model is to maximize the average log-likelihood
0110 `\[
0111 \frac{1}{T} \sum_{t = 1}^{T}\sum_{j=-k}^{j=k} \log p(w_{t+j} | w_t)
0112 \]`
0113 where $k$ is the size of the training window.
0114
0115 In the skip-gram model, every word $w$ is associated with two vectors $u_w$ and $v_w$ which are
0116 vector representations of $w$ as word and context respectively. The probability of correctly
0117 predicting word $w_i$ given word $w_j$ is determined by the softmax model, which is
0118 `\[
0119 p(w_i | w_j ) = \frac{\exp(u_{w_i}^{\top}v_{w_j})}{\sum_{l=1}^{V} \exp(u_l^{\top}v_{w_j})}
0120 \]`
0121 where $V$ is the vocabulary size.
0122
0123 The skip-gram model with softmax is expensive because the cost of computing $\log p(w_i | w_j)$
0124 is proportional to $V$, which can be easily in order of millions. To speed up training of Word2Vec,
0125 we used hierarchical softmax, which reduced the complexity of computing of $\log p(w_i | w_j)$ to
0126 $O(\log(V))$
0127
0128 ### Example
0129
0130 The example below demonstrates how to load a text file, parse it as an RDD of `Seq[String]`,
0131 construct a `Word2Vec` instance and then fit a `Word2VecModel` with the input data. Finally,
0132 we display the top 40 synonyms of the specified word. To run the example, first download
0133 the [text8](http://mattmahoney.net/dc/text8.zip) data and extract it to your preferred directory.
0134 Here we assume the extracted file is `text8` and in same directory as you run the spark shell.
0135
0136 <div class="codetabs">
0137 <div data-lang="scala" markdown="1">
0138 Refer to the [`Word2Vec` Scala docs](api/scala/org/apache/spark/mllib/feature/Word2Vec.html) for details on the API.
0139
0140 {% include_example scala/org/apache/spark/examples/mllib/Word2VecExample.scala %}
0141 </div>
0142 <div data-lang="python" markdown="1">
0143 Refer to the [`Word2Vec` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.feature.Word2Vec) for more details on the API.
0144
0145 {% include_example python/mllib/word2vec_example.py %}
0146 </div>
0147 </div>
0148
0149 ## StandardScaler
0150
0151 Standardizes features by scaling to unit variance and/or removing the mean using column summary
0152 statistics on the samples in the training set. This is a very common pre-processing step.
0153
0154 For example, RBF kernel of Support Vector Machines or the L1 and L2 regularized linear models
0155 typically work better when all features have unit variance and/or zero mean.
0156
0157 Standardization can improve the convergence rate during the optimization process, and also prevents
0158 against features with very large variances exerting an overly large influence during model training.
0159
0160 ### Model Fitting
0161
0162 [`StandardScaler`](api/scala/org/apache/spark/mllib/feature/StandardScaler.html) has the
0163 following parameters in the constructor:
0164
0165 * `withMean` False by default. Centers the data with mean before scaling. It will build a dense
0166 output, so take care when applying to sparse input.
0167 * `withStd` True by default. Scales the data to unit standard deviation.
0168
0169 We provide a [`fit`](api/scala/org/apache/spark/mllib/feature/StandardScaler.html) method in
0170 `StandardScaler` which can take an input of `RDD[Vector]`, learn the summary statistics, and then
0171 return a model which can transform the input dataset into unit standard deviation and/or zero mean features
0172 depending how we configure the `StandardScaler`.
0173
0174 This model implements [`VectorTransformer`](api/scala/org/apache/spark/mllib/feature/VectorTransformer.html)
0175 which can apply the standardization on a `Vector` to produce a transformed `Vector` or on
0176 an `RDD[Vector]` to produce a transformed `RDD[Vector]`.
0177
0178 Note that if the variance of a feature is zero, it will return default `0.0` value in the `Vector`
0179 for that feature.
0180
0181 ### Example
0182
0183 The example below demonstrates how to load a dataset in libsvm format, and standardize the features
0184 so that the new features have unit standard deviation and/or zero mean.
0185
0186 <div class="codetabs">
0187 <div data-lang="scala" markdown="1">
0188 Refer to the [`StandardScaler` Scala docs](api/scala/org/apache/spark/mllib/feature/StandardScaler.html) for details on the API.
0189
0190 {% include_example scala/org/apache/spark/examples/mllib/StandardScalerExample.scala %}
0191 </div>
0192
0193 <div data-lang="python" markdown="1">
0194 Refer to the [`StandardScaler` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.feature.StandardScaler) for more details on the API.
0195
0196 {% include_example python/mllib/standard_scaler_example.py %}
0197 </div>
0198 </div>
0199
0200 ## Normalizer
0201
0202 Normalizer scales individual samples to have unit $L^p$ norm. This is a common operation for text
0203 classification or clustering. For example, the dot product of two $L^2$ normalized TF-IDF vectors
0204 is the cosine similarity of the vectors.
0205
0206 [`Normalizer`](api/scala/org/apache/spark/mllib/feature/Normalizer.html) has the following
0207 parameter in the constructor:
0208
0209 * `p` Normalization in $L^p$ space, $p = 2$ by default.
0210
0211 `Normalizer` implements [`VectorTransformer`](api/scala/org/apache/spark/mllib/feature/VectorTransformer.html)
0212 which can apply the normalization on a `Vector` to produce a transformed `Vector` or on
0213 an `RDD[Vector]` to produce a transformed `RDD[Vector]`.
0214
0215 Note that if the norm of the input is zero, it will return the input vector.
0216
0217 ### Example
0218
0219 The example below demonstrates how to load a dataset in libsvm format, and normalizes the features
0220 with $L^2$ norm, and $L^\infty$ norm.
0221
0222 <div class="codetabs">
0223 <div data-lang="scala" markdown="1">
0224 Refer to the [`Normalizer` Scala docs](api/scala/org/apache/spark/mllib/feature/Normalizer.html) for details on the API.
0225
0226 {% include_example scala/org/apache/spark/examples/mllib/NormalizerExample.scala %}
0227 </div>
0228
0229 <div data-lang="python" markdown="1">
0230 Refer to the [`Normalizer` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.feature.Normalizer) for more details on the API.
0231
0232 {% include_example python/mllib/normalizer_example.py %}
0233 </div>
0234 </div>
0235
0236 ## ChiSqSelector
0237
0238 [Feature selection](http://en.wikipedia.org/wiki/Feature_selection) tries to identify relevant
0239 features for use in model construction. It reduces the size of the feature space, which can improve
0240 both speed and statistical learning behavior.
0241
0242 [`ChiSqSelector`](api/scala/org/apache/spark/mllib/feature/ChiSqSelector.html) implements
0243 Chi-Squared feature selection. It operates on labeled data with categorical features. ChiSqSelector uses the
0244 [Chi-Squared test of independence](https://en.wikipedia.org/wiki/Chi-squared_test) to decide which
0245 features to choose. It supports five selection methods: `numTopFeatures`, `percentile`, `fpr`, `fdr`, `fwe`:
0246
0247 * `numTopFeatures` chooses a fixed number of top features according to a chi-squared test. This is akin to yielding the features with the most predictive power.
0248 * `percentile` is similar to `numTopFeatures` but chooses a fraction of all features instead of a fixed number.
0249 * `fpr` chooses all features whose p-values are below a threshold, thus controlling the false positive rate of selection.
0250 * `fdr` uses the [Benjamini-Hochberg procedure](https://en.wikipedia.org/wiki/False_discovery_rate#Benjamini.E2.80.93Hochberg_procedure) to choose all features whose false discovery rate is below a threshold.
0251 * `fwe` chooses all features whose p-values are below a threshold. The threshold is scaled by 1/numFeatures, thus controlling the family-wise error rate of selection.
0252
0253 By default, the selection method is `numTopFeatures`, with the default number of top features set to 50.
0254 The user can choose a selection method using `setSelectorType`.
0255
0256 The number of features to select can be tuned using a held-out validation set.
0257
0258 ### Model Fitting
0259
0260 The [`fit`](api/scala/org/apache/spark/mllib/feature/ChiSqSelector.html) method takes
0261 an input of `RDD[LabeledPoint]` with categorical features, learns the summary statistics, and then
0262 returns a `ChiSqSelectorModel` which can transform an input dataset into the reduced feature space.
0263 The `ChiSqSelectorModel` can be applied either to a `Vector` to produce a reduced `Vector`, or to
0264 an `RDD[Vector]` to produce a reduced `RDD[Vector]`.
0265
0266 Note that the user can also construct a `ChiSqSelectorModel` by hand by providing an array of selected feature indices (which must be sorted in ascending order).
0267
0268 ### Example
0269
0270 The following example shows the basic use of ChiSqSelector. The data set used has a feature matrix consisting of greyscale values that vary from 0 to 255 for each feature.
0271
0272 <div class="codetabs">
0273 <div data-lang="scala" markdown="1">
0274
0275 Refer to the [`ChiSqSelector` Scala docs](api/scala/org/apache/spark/mllib/feature/ChiSqSelector.html)
0276 for details on the API.
0277
0278 {% include_example scala/org/apache/spark/examples/mllib/ChiSqSelectorExample.scala %}
0279 </div>
0280
0281 <div data-lang="java" markdown="1">
0282
0283 Refer to the [`ChiSqSelector` Java docs](api/java/org/apache/spark/mllib/feature/ChiSqSelector.html)
0284 for details on the API.
0285
0286 {% include_example java/org/apache/spark/examples/mllib/JavaChiSqSelectorExample.java %}
0287 </div>
0288 </div>
0289
0290 ## ElementwiseProduct
0291
0292 `ElementwiseProduct` multiplies each input vector by a provided "weight" vector, using element-wise
0293 multiplication. In other words, it scales each column of the dataset by a scalar multiplier. This
0294 represents the [Hadamard product](https://en.wikipedia.org/wiki/Hadamard_product_%28matrices%29)
0295 between the input vector, `v` and transforming vector, `scalingVec`, to yield a result vector.
0296
0297 Denoting the `scalingVec` as "`w`", this transformation may be written as:
0298
0299 `\[ \begin{pmatrix}
0300 v_1 \\
0301 \vdots \\
0302 v_N
0303 \end{pmatrix} \circ \begin{pmatrix}
0304 w_1 \\
0305 \vdots \\
0306 w_N
0307 \end{pmatrix}
0308 = \begin{pmatrix}
0309 v_1 w_1 \\
0310 \vdots \\
0311 v_N w_N
0312 \end{pmatrix}
0313 \]`
0314
0315 [`ElementwiseProduct`](api/scala/org/apache/spark/mllib/feature/ElementwiseProduct.html) has the following parameter in the constructor:
0316
0317 * `scalingVec`: the transforming vector.
0318
0319 `ElementwiseProduct` implements [`VectorTransformer`](api/scala/org/apache/spark/mllib/feature/VectorTransformer.html) which can apply the weighting on a `Vector` to produce a transformed `Vector` or on an `RDD[Vector]` to produce a transformed `RDD[Vector]`.
0320
0321 ### Example
0322
0323 This example below demonstrates how to transform vectors using a transforming vector value.
0324
0325 <div class="codetabs">
0326 <div data-lang="scala" markdown="1">
0327
0328 Refer to the [`ElementwiseProduct` Scala docs](api/scala/org/apache/spark/mllib/feature/ElementwiseProduct.html) for details on the API.
0329
0330 {% include_example scala/org/apache/spark/examples/mllib/ElementwiseProductExample.scala %}
0331 </div>
0332
0333 <div data-lang="java" markdown="1">
0334 Refer to the [`ElementwiseProduct` Java docs](api/java/org/apache/spark/mllib/feature/ElementwiseProduct.html) for details on the API.
0335
0336 {% include_example java/org/apache/spark/examples/mllib/JavaElementwiseProductExample.java %}
0337 </div>
0338
0339 <div data-lang="python" markdown="1">
0340 Refer to the [`ElementwiseProduct` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.feature.ElementwiseProduct) for more details on the API.
0341
0342 {% include_example python/mllib/elementwise_product_example.py %}
0343 </div>
0344 </div>
0345
0346
0347 ## PCA
0348
0349 A feature transformer that projects vectors to a low-dimensional space using PCA.
0350 Details you can read at [dimensionality reduction](mllib-dimensionality-reduction.html).