---
layout: global
title: Feature Extraction and Transformation - RDD-based API
displayTitle: Feature Extraction and Transformation - RDD-based API
license: |
  Licensed to the Apache Software Foundation (ASF) under one or more
  contributor license agreements.  See the NOTICE file distributed with
  this work for additional information regarding copyright ownership.
  The ASF licenses this file to You under the Apache License, Version 2.0
  (the "License"); you may not use this file except in compliance with
  the License.  You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.
---

* Table of contents
{:toc}

## TF-IDF

**Note:** We recommend using the DataFrame-based API, which is detailed in the [ML user guide on
TF-IDF](ml-features.html#tf-idf).

[Term frequency-inverse document frequency (TF-IDF)](http://en.wikipedia.org/wiki/Tf%E2%80%93idf) is a feature
vectorization method widely used in text mining to reflect the importance of a term to a document in the corpus.
Denote a term by `$t$`, a document by `$d$`, and the corpus by `$D$`.
Term frequency `$TF(t, d)$` is the number of times that term `$t$` appears in document `$d$`,
while document frequency `$DF(t, D)$` is the number of documents that contain term `$t$`.
If we only use term frequency to measure importance, it is very easy to over-emphasize terms that
appear very often but carry little information about the document, e.g., "a", "the", and "of".
If a term appears very often across the corpus, it carries little information that is specific to
any particular document.
Inverse document frequency is a numerical measure of how much information a term provides:
`\[
IDF(t, D) = \log \frac{|D| + 1}{DF(t, D) + 1},
\]`
where `$|D|$` is the total number of documents in the corpus.
Since the logarithm is used, if a term appears in all documents, its IDF value becomes 0.
Note that a smoothing term is applied to avoid dividing by zero for terms outside the corpus.
The TF-IDF measure is simply the product of TF and IDF:
`\[
TFIDF(t, d, D) = TF(t, d) \cdot IDF(t, D).
\]`
There are several variants of the definition of term frequency and document frequency.
In `spark.mllib`, we separate TF and IDF to make them flexible.
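
As a quick sanity check of the formulas above (on a hypothetical corpus, not one of the datasets used in the examples below): suppose the corpus contains `$|D| = 100$` documents, term `$t$` appears in `$DF(t, D) = 4$` of them, and it occurs 3 times in document `$d$`. Then
`\[
IDF(t, D) = \log \frac{100 + 1}{4 + 1} \approx 3.01,
\qquad
TFIDF(t, d, D) = 3 \cdot 3.01 \approx 9.0,
\]`
while a term that appears in every document gets `$IDF(t, D) = \log(101 / 101) = 0$`, so its TF-IDF is 0 regardless of how often it occurs.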

Our implementation of term frequency utilizes the
[hashing trick](http://en.wikipedia.org/wiki/Feature_hashing).
A raw feature is mapped into an index (term) by applying a hash function.
Then term frequencies are calculated based on the mapped indices.
This approach avoids the need to compute a global term-to-index map,
which can be expensive for a large corpus, but it suffers from potential hash collisions,
where different raw features may become the same term after hashing.
To reduce the chance of collision, we can increase the target feature dimension, i.e.,
the number of buckets of the hash table.
The default feature dimension is `$2^{20} = 1,048,576$`.
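
For instance, here is a minimal Scala sketch of lowering the feature dimension to trade vector size against collision rate. The `SparkContext` `sc`, the input path, and the choice of `$2^{18}$` buckets are illustrative assumptions; `HashingTF(numFeatures)` and `transform` are the constructor and method documented in the API docs below.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Assumes an existing SparkContext `sc` and a whitespace-tokenized text file.
def hashedTermFrequencies(sc: SparkContext): RDD[Vector] = {
  val documents: RDD[Seq[String]] = sc
    .textFile("path/to/documents.txt")   // placeholder path
    .map(_.split(" ").toSeq)

  // 2^18 buckets instead of the default 2^20: smaller vectors, more collisions.
  val hashingTF = new HashingTF(numFeatures = 1 << 18)
  hashingTF.transform(documents)
}
```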

**Note:** `spark.mllib` doesn't provide tools for text segmentation.
We refer users to the [Stanford NLP Group](http://nlp.stanford.edu/) and
[scalanlp/chalk](https://github.com/scalanlp/chalk).

<div class="codetabs">
<div data-lang="scala" markdown="1">

TF and IDF are implemented in [HashingTF](api/scala/org/apache/spark/mllib/feature/HashingTF.html)
and [IDF](api/scala/org/apache/spark/mllib/feature/IDF.html).
`HashingTF` takes an `RDD[Iterable[_]]` as the input.
Each record could be an iterable of strings or other types.

Refer to the [`HashingTF` Scala docs](api/scala/org/apache/spark/mllib/feature/HashingTF.html) for details on the API.

{% include_example scala/org/apache/spark/examples/mllib/TFIDFExample.scala %}
</div>
<div data-lang="python" markdown="1">

TF and IDF are implemented in [HashingTF](api/python/pyspark.mllib.html#pyspark.mllib.feature.HashingTF)
and [IDF](api/python/pyspark.mllib.html#pyspark.mllib.feature.IDF).
`HashingTF` takes an RDD of lists as the input.
Each record could be a list of strings or other types.

Refer to the [`HashingTF` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.feature.HashingTF) for details on the API.

{% include_example python/mllib/tf_idf_example.py %}
</div>
</div>

## Word2Vec

[Word2Vec](https://code.google.com/p/word2vec/) computes distributed vector representations of words.
The main advantage of distributed
representations is that similar words are close in the vector space, which makes generalization to
novel patterns easier and model estimation more robust. Distributed vector representations have been
shown to be useful in many natural language processing applications such as named entity
recognition, disambiguation, parsing, tagging and machine translation.

### Model

In our implementation of Word2Vec, we use the skip-gram model. The training objective of skip-gram is
to learn word vector representations that are good at predicting a word's context in the same sentence.
Mathematically, given a sequence of training words `$w_1, w_2, \dots, w_T$`, the objective of the
skip-gram model is to maximize the average log-likelihood
`\[
\frac{1}{T} \sum_{t = 1}^{T}\sum_{j=-k}^{j=k} \log p(w_{t+j} | w_t)
\]`
where $k$ is the size of the training window.

In the skip-gram model, every word $w$ is associated with two vectors $u_w$ and $v_w$ which are
vector representations of $w$ as a word and as a context, respectively. The probability of correctly
predicting word $w_i$ given word $w_j$ is determined by the softmax model, which is
`\[
p(w_i | w_j ) = \frac{\exp(u_{w_i}^{\top}v_{w_j})}{\sum_{l=1}^{V} \exp(u_l^{\top}v_{w_j})}
\]`
where $V$ is the vocabulary size.

The skip-gram model with softmax is expensive because the cost of computing $\log p(w_i | w_j)$
is proportional to $V$, which can easily be in the order of millions. To speed up the training of Word2Vec,
we use hierarchical softmax, which reduces the complexity of computing $\log p(w_i | w_j)$ to
$O(\log(V))$.
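
For concreteness, here is a minimal Scala sketch of fitting the model described above and querying it for nearest neighbors. The `SparkContext` `sc`, the corpus path, the parameter values, and the query word are illustrative assumptions; `setVectorSize`, `setMinCount`, `fit`, and `findSynonyms` are part of the `spark.mllib` `Word2Vec` API.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.mllib.feature.{Word2Vec, Word2VecModel}

// Assumes an existing SparkContext `sc` and a whitespace-tokenized corpus.
def fitWord2Vec(sc: SparkContext): Word2VecModel = {
  val input = sc.textFile("path/to/corpus.txt").map(_.split(" ").toSeq)

  val word2vec = new Word2Vec()
    .setVectorSize(100)   // dimensionality of the learned word vectors
    .setMinCount(5)       // ignore words that occur fewer than 5 times
    .setNumIterations(1)

  val model = word2vec.fit(input)

  // Words whose vectors are closest (by cosine similarity) to that of "day".
  model.findSynonyms("day", 10).foreach { case (word, similarity) =>
    println(s"$word $similarity")
  }
  model
}
```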

### Example

The example below demonstrates how to load a text file, parse it as an RDD of `Seq[String]`,
construct a `Word2Vec` instance and then fit a `Word2VecModel` with the input data. Finally,
we display the top 40 synonyms of the specified word. To run the example, first download
the [text8](http://mattmahoney.net/dc/text8.zip) data and extract it to your preferred directory.
Here we assume the extracted file is named `text8` and is in the same directory as where you run the Spark shell.

<div class="codetabs">
<div data-lang="scala" markdown="1">
Refer to the [`Word2Vec` Scala docs](api/scala/org/apache/spark/mllib/feature/Word2Vec.html) for details on the API.

{% include_example scala/org/apache/spark/examples/mllib/Word2VecExample.scala %}
</div>
<div data-lang="python" markdown="1">
Refer to the [`Word2Vec` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.feature.Word2Vec) for more details on the API.

{% include_example python/mllib/word2vec_example.py %}
</div>
</div>

## StandardScaler

`StandardScaler` standardizes features by scaling to unit variance and/or removing the mean using column summary
statistics on the samples in the training set. This is a very common pre-processing step.

For example, the RBF kernel of Support Vector Machines or L1- and L2-regularized linear models
typically work better when all features have unit variance and/or zero mean.

Standardization can improve the convergence rate during the optimization process, and also prevents
features with very large variances from exerting an overly large influence during model training.

### Model Fitting

[`StandardScaler`](api/scala/org/apache/spark/mllib/feature/StandardScaler.html) has the
following parameters in the constructor:

* `withMean` False by default. Centers the data with the mean before scaling. It will build a dense
output, so take care when applying it to sparse input.
* `withStd` True by default. Scales the data to unit standard deviation.

We provide a [`fit`](api/scala/org/apache/spark/mllib/feature/StandardScaler.html) method in
`StandardScaler` which can take an input of `RDD[Vector]`, learn the summary statistics, and then
return a model which can transform the input dataset into features with unit standard deviation and/or zero mean,
depending on how we configure the `StandardScaler`.

This model implements [`VectorTransformer`](api/scala/org/apache/spark/mllib/feature/VectorTransformer.html)
which can apply the standardization on a `Vector` to produce a transformed `Vector` or on
an `RDD[Vector]` to produce a transformed `RDD[Vector]`.

Note that if the variance of a feature is zero, it will return the default `0.0` value in the `Vector`
for that feature.
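
For example, a minimal Scala sketch of the fit/transform flow under the two configurations. The `SparkContext` `sc` and the libsvm path are placeholders; the constructor arguments, `fit`, and `transform` are the API elements described above.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.util.MLUtils

// Assumes an existing SparkContext `sc`.
def standardize(sc: SparkContext): Unit = {
  val data = MLUtils.loadLibSVMFile(sc, "path/to/data_libsvm.txt")

  // Unit standard deviation only (the default): safe for sparse input.
  val scaler1 = new StandardScaler().fit(data.map(_.features))
  // Zero mean and unit standard deviation: note that centering builds dense output.
  val scaler2 = new StandardScaler(withMean = true, withStd = true).fit(data.map(_.features))

  val scaled1 = data.map(p => (p.label, scaler1.transform(p.features)))
  // Centering cannot be applied to sparse vectors, so densify before transforming.
  val scaled2 = data.map(p => (p.label, scaler2.transform(Vectors.dense(p.features.toArray))))
}
```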

### Example

The example below demonstrates how to load a dataset in libsvm format, and standardize the features
so that the new features have unit standard deviation and/or zero mean.

<div class="codetabs">
<div data-lang="scala" markdown="1">
Refer to the [`StandardScaler` Scala docs](api/scala/org/apache/spark/mllib/feature/StandardScaler.html) for details on the API.

{% include_example scala/org/apache/spark/examples/mllib/StandardScalerExample.scala %}
</div>

<div data-lang="python" markdown="1">
Refer to the [`StandardScaler` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.feature.StandardScaler) for more details on the API.

{% include_example python/mllib/standard_scaler_example.py %}
</div>
</div>

## Normalizer

`Normalizer` scales individual samples to have unit $L^p$ norm. This is a common operation for text
classification or clustering. For example, the dot product of two $L^2$-normalized TF-IDF vectors
is the cosine similarity of the vectors.

[`Normalizer`](api/scala/org/apache/spark/mllib/feature/Normalizer.html) has the following
parameter in the constructor:

* `p` Normalization in $L^p$ space, $p = 2$ by default.

`Normalizer` implements [`VectorTransformer`](api/scala/org/apache/spark/mllib/feature/VectorTransformer.html)
which can apply the normalization on a `Vector` to produce a transformed `Vector` or on
an `RDD[Vector]` to produce a transformed `RDD[Vector]`.

Note that if the norm of the input is zero, it will return the input vector.
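
A minimal Scala sketch of per-sample normalization with the default $L^2$ norm and with the $L^\infty$ norm; the `SparkContext` `sc` and the libsvm path are placeholders.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.mllib.feature.Normalizer
import org.apache.spark.mllib.util.MLUtils

// Assumes an existing SparkContext `sc`.
def normalizeSamples(sc: SparkContext): Unit = {
  val data = MLUtils.loadLibSVMFile(sc, "path/to/data_libsvm.txt")

  val normalizer1 = new Normalizer()                            // p = 2 by default
  val normalizer2 = new Normalizer(p = Double.PositiveInfinity) // L^infinity norm

  // Each sample in `data1` has unit L^2 norm; each sample in `data2` has unit L^infinity norm.
  val data1 = data.map(p => (p.label, normalizer1.transform(p.features)))
  val data2 = data.map(p => (p.label, normalizer2.transform(p.features)))
}
```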

### Example

The example below demonstrates how to load a dataset in libsvm format and normalize the features
with the $L^2$ norm and the $L^\infty$ norm.

<div class="codetabs">
<div data-lang="scala" markdown="1">
Refer to the [`Normalizer` Scala docs](api/scala/org/apache/spark/mllib/feature/Normalizer.html) for details on the API.

{% include_example scala/org/apache/spark/examples/mllib/NormalizerExample.scala %}
</div>

<div data-lang="python" markdown="1">
Refer to the [`Normalizer` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.feature.Normalizer) for more details on the API.

{% include_example python/mllib/normalizer_example.py %}
</div>
</div>

## ChiSqSelector

[Feature selection](http://en.wikipedia.org/wiki/Feature_selection) tries to identify relevant
features for use in model construction. It reduces the size of the feature space, which can improve
both speed and statistical learning behavior.

[`ChiSqSelector`](api/scala/org/apache/spark/mllib/feature/ChiSqSelector.html) implements
Chi-Squared feature selection. It operates on labeled data with categorical features. ChiSqSelector uses the
[Chi-Squared test of independence](https://en.wikipedia.org/wiki/Chi-squared_test) to decide which
features to choose. It supports five selection methods: `numTopFeatures`, `percentile`, `fpr`, `fdr`, `fwe`:

* `numTopFeatures` chooses a fixed number of top features according to a chi-squared test. This is akin to yielding the features with the most predictive power.
* `percentile` is similar to `numTopFeatures` but chooses a fraction of all features instead of a fixed number.
* `fpr` chooses all features whose p-values are below a threshold, thus controlling the false positive rate of selection.
* `fdr` uses the [Benjamini-Hochberg procedure](https://en.wikipedia.org/wiki/False_discovery_rate#Benjamini.E2.80.93Hochberg_procedure) to choose all features whose false discovery rate is below a threshold.
* `fwe` chooses all features whose p-values are below a threshold. The threshold is scaled by 1/numFeatures, thus controlling the family-wise error rate of selection.

By default, the selection method is `numTopFeatures`, with the default number of top features set to 50.
The user can choose a selection method using `setSelectorType`.

The number of features to select can be tuned using a held-out validation set.

### Model Fitting

The [`fit`](api/scala/org/apache/spark/mllib/feature/ChiSqSelector.html) method takes
an input of `RDD[LabeledPoint]` with categorical features, learns the summary statistics, and then
returns a `ChiSqSelectorModel` which can transform an input dataset into the reduced feature space.
The `ChiSqSelectorModel` can be applied either to a `Vector` to produce a reduced `Vector`, or to
an `RDD[Vector]` to produce a reduced `RDD[Vector]`.

Note that the user can also construct a `ChiSqSelectorModel` by hand by providing an array of selected feature indices (which must be sorted in ascending order).
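
A minimal Scala sketch of this fit/transform flow. The `SparkContext` `sc`, the libsvm path, the binning step (to make the continuous features categorical), and the choice of 50 features are illustrative assumptions.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.mllib.feature.ChiSqSelector
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.util.MLUtils

// Assumes an existing SparkContext `sc`.
def selectTopFeatures(sc: SparkContext): Unit = {
  val data = MLUtils.loadLibSVMFile(sc, "path/to/data_libsvm.txt")

  // ChiSqSelector expects categorical features, so bin the continuous values first.
  val discretized = data.map { lp =>
    LabeledPoint(lp.label, Vectors.dense(lp.features.toArray.map(x => (x / 16).floor)))
  }

  // Keep the 50 most predictive features according to the chi-squared test.
  val selector = new ChiSqSelector(50)
  val model = selector.fit(discretized)
  val reduced = discretized.map { lp =>
    LabeledPoint(lp.label, model.transform(lp.features))
  }
}
```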

### Example

The following example shows the basic use of ChiSqSelector. The data set used has a feature matrix consisting of greyscale values that vary from 0 to 255 for each feature.

<div class="codetabs">
<div data-lang="scala" markdown="1">

Refer to the [`ChiSqSelector` Scala docs](api/scala/org/apache/spark/mllib/feature/ChiSqSelector.html)
for details on the API.

{% include_example scala/org/apache/spark/examples/mllib/ChiSqSelectorExample.scala %}
</div>

<div data-lang="java" markdown="1">

Refer to the [`ChiSqSelector` Java docs](api/java/org/apache/spark/mllib/feature/ChiSqSelector.html)
for details on the API.

{% include_example java/org/apache/spark/examples/mllib/JavaChiSqSelectorExample.java %}
</div>
</div>

## ElementwiseProduct

`ElementwiseProduct` multiplies each input vector by a provided "weight" vector, using element-wise
multiplication. In other words, it scales each column of the dataset by a scalar multiplier. This
represents the [Hadamard product](https://en.wikipedia.org/wiki/Hadamard_product_%28matrices%29)
between the input vector `v` and the transforming vector `scalingVec`, to yield a result vector.

Denoting the `scalingVec` as "`w`", this transformation may be written as:

`\[ \begin{pmatrix}
v_1 \\
\vdots \\
v_N
\end{pmatrix} \circ \begin{pmatrix}
                    w_1 \\
                    \vdots \\
                    w_N
                    \end{pmatrix}
= \begin{pmatrix}
  v_1 w_1 \\
  \vdots \\
  v_N w_N
  \end{pmatrix}
\]`

[`ElementwiseProduct`](api/scala/org/apache/spark/mllib/feature/ElementwiseProduct.html) has the following parameter in the constructor:

* `scalingVec`: the transforming vector.

`ElementwiseProduct` implements [`VectorTransformer`](api/scala/org/apache/spark/mllib/feature/VectorTransformer.html) which can apply the weighting on a `Vector` to produce a transformed `Vector` or on an `RDD[Vector]` to produce a transformed `RDD[Vector]`.
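
A minimal Scala sketch with a hypothetical 3-dimensional weight vector; the `SparkContext` `sc` and the concrete values are illustrative only.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.mllib.feature.ElementwiseProduct
import org.apache.spark.mllib.linalg.Vectors

// Assumes an existing SparkContext `sc`.
def scaleColumns(sc: SparkContext): Unit = {
  // Two small example vectors.
  val data = sc.parallelize(Seq(Vectors.dense(1.0, 2.0, 3.0), Vectors.dense(4.0, 5.0, 6.0)))

  val scalingVec = Vectors.dense(0.0, 1.0, 2.0)
  val transformer = new ElementwiseProduct(scalingVec)

  // Batch transform: (1,2,3) -> (0,2,6) and (4,5,6) -> (0,5,12).
  val transformed = transformer.transform(data)
  transformed.collect().foreach(println)
}
```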

### Example

The example below demonstrates how to transform vectors using a transforming vector value.

<div class="codetabs">
<div data-lang="scala" markdown="1">

Refer to the [`ElementwiseProduct` Scala docs](api/scala/org/apache/spark/mllib/feature/ElementwiseProduct.html) for details on the API.

{% include_example scala/org/apache/spark/examples/mllib/ElementwiseProductExample.scala %}
</div>

<div data-lang="java" markdown="1">
Refer to the [`ElementwiseProduct` Java docs](api/java/org/apache/spark/mllib/feature/ElementwiseProduct.html) for details on the API.

{% include_example java/org/apache/spark/examples/mllib/JavaElementwiseProductExample.java %}
</div>

<div data-lang="python" markdown="1">
Refer to the [`ElementwiseProduct` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.feature.ElementwiseProduct) for more details on the API.

{% include_example python/mllib/elementwise_product_example.py %}
</div>
</div>

## PCA

`PCA` is a feature transformer that projects vectors onto a low-dimensional space using principal
component analysis. You can read the details at [dimensionality reduction](mllib-dimensionality-reduction.html).
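
For reference, a minimal Scala sketch of projecting an `RDD[Vector]` onto its top principal components with the `spark.mllib` `PCA` transformer; the `SparkContext` `sc`, the libsvm path, and the choice of `k = 3` are placeholders.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.mllib.feature.PCA
import org.apache.spark.mllib.util.MLUtils

// Assumes an existing SparkContext `sc`.
def projectToPrincipalComponents(sc: SparkContext): Unit = {
  val data = MLUtils.loadLibSVMFile(sc, "path/to/data_libsvm.txt")

  // Fit a PCA model that keeps the top 3 principal components.
  val pca = new PCA(3).fit(data.map(_.features))

  // Project each example onto the 3-dimensional subspace, keeping the labels.
  val projected = data.map(p => p.copy(features = pca.transform(p.features)))
}
```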