0001 ---
0002 layout: global
0003 title: Basic Statistics - RDD-based API
0004 displayTitle: Basic Statistics - RDD-based API
0005 license: |
0006 Licensed to the Apache Software Foundation (ASF) under one or more
0007 contributor license agreements. See the NOTICE file distributed with
0008 this work for additional information regarding copyright ownership.
0009 The ASF licenses this file to You under the Apache License, Version 2.0
0010 (the "License"); you may not use this file except in compliance with
0011 the License. You may obtain a copy of the License at
0012
0013 http://www.apache.org/licenses/LICENSE-2.0
0014
0015 Unless required by applicable law or agreed to in writing, software
0016 distributed under the License is distributed on an "AS IS" BASIS,
0017 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
0018 See the License for the specific language governing permissions and
0019 limitations under the License.
0020 ---
0021
0022 * Table of contents
0023 {:toc}
0024
0025
0026 `\[
0027 \newcommand{\R}{\mathbb{R}}
0028 \newcommand{\E}{\mathbb{E}}
0029 \newcommand{\x}{\mathbf{x}}
0030 \newcommand{\y}{\mathbf{y}}
0031 \newcommand{\wv}{\mathbf{w}}
0032 \newcommand{\av}{\mathbf{\alpha}}
0033 \newcommand{\bv}{\mathbf{b}}
0034 \newcommand{\N}{\mathbb{N}}
0035 \newcommand{\id}{\mathbf{I}}
0036 \newcommand{\ind}{\mathbf{1}}
0037 \newcommand{\0}{\mathbf{0}}
0038 \newcommand{\unit}{\mathbf{e}}
0039 \newcommand{\one}{\mathbf{1}}
0040 \newcommand{\zero}{\mathbf{0}}
0041 \]`
0042
0043 ## Summary statistics
0044
0045 We provide column summary statistics for `RDD[Vector]` through the function `colStats`
0046 available in `Statistics`.
0047
0048 <div class="codetabs">
0049 <div data-lang="scala" markdown="1">
0050
0051 [`colStats()`](api/scala/org/apache/spark/mllib/stat/Statistics$.html) returns an instance of
0052 [`MultivariateStatisticalSummary`](api/scala/org/apache/spark/mllib/stat/MultivariateStatisticalSummary.html),
0053 which contains the column-wise max, min, mean, variance, and number of nonzeros, as well as the
0054 total count.
0055
0056 Refer to the [`MultivariateStatisticalSummary` Scala docs](api/scala/org/apache/spark/mllib/stat/MultivariateStatisticalSummary.html) for details on the API.
0057
0058 {% include_example scala/org/apache/spark/examples/mllib/SummaryStatisticsExample.scala %}
0059 </div>
0060
0061 <div data-lang="java" markdown="1">
0062
0063 [`colStats()`](api/java/org/apache/spark/mllib/stat/Statistics.html) returns an instance of
0064 [`MultivariateStatisticalSummary`](api/java/org/apache/spark/mllib/stat/MultivariateStatisticalSummary.html),
0065 which contains the column-wise max, min, mean, variance, and number of nonzeros, as well as the
0066 total count.
0067
0068 Refer to the [`MultivariateStatisticalSummary` Java docs](api/java/org/apache/spark/mllib/stat/MultivariateStatisticalSummary.html) for details on the API.
0069
0070 {% include_example java/org/apache/spark/examples/mllib/JavaSummaryStatisticsExample.java %}
0071 </div>
0072
0073 <div data-lang="python" markdown="1">
0074 [`colStats()`](api/python/pyspark.mllib.html#pyspark.mllib.stat.Statistics.colStats) returns an instance of
0075 [`MultivariateStatisticalSummary`](api/python/pyspark.mllib.html#pyspark.mllib.stat.MultivariateStatisticalSummary),
0076 which contains the column-wise max, min, mean, variance, and number of nonzeros, as well as the
0077 total count.
0078
0079 Refer to the [`MultivariateStatisticalSummary` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.stat.MultivariateStatisticalSummary) for more details on the API.
0080
0081 {% include_example python/mllib/summary_statistics_example.py %}
0082 </div>
0083
0084 </div>
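Conceptually, the quantities `colStats` reports can be sketched in plain Python, without Spark; the `col_stats` helper below is illustrative and not part of the MLlib API:

```python
# Plain-Python sketch of the column-wise statistics colStats reports.
# col_stats is an illustrative helper, not part of the MLlib API.
def col_stats(rows):
    n = len(rows)
    cols = list(zip(*rows))  # transpose: one tuple of values per column
    means = [sum(c) / n for c in cols]
    # Unbiased sample variance (n - 1 in the denominator).
    variances = [sum((x - m) ** 2 for x in c) / (n - 1)
                 for c, m in zip(cols, means)]
    return {
        "count": n,
        "mean": means,
        "variance": variances,
        "max": [max(c) for c in cols],
        "min": [min(c) for c in cols],
        "numNonzeros": [sum(1 for x in c if x != 0.0) for c in cols],
    }

stats = col_stats([[1.0, 10.0, 100.0], [2.0, 20.0, 200.0], [3.0, 30.0, 300.0]])
print(stats["mean"])      # [2.0, 20.0, 200.0]
print(stats["variance"])  # [1.0, 100.0, 10000.0]
```

The sketch reports the unbiased sample variance (dividing by `n - 1`) for each column.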
0085
0086 ## Correlations
0087
Calculating the correlation between two series of data is a common operation in statistics. In `spark.mllib`
we provide the flexibility to calculate pairwise correlations among many series. The supported
correlation methods are currently Pearson's and Spearman's correlation.
0091
0092 <div class="codetabs">
0093 <div data-lang="scala" markdown="1">
0094 [`Statistics`](api/scala/org/apache/spark/mllib/stat/Statistics$.html) provides methods to
0095 calculate correlations between series. Depending on the type of input, two `RDD[Double]`s or
0096 an `RDD[Vector]`, the output will be a `Double` or the correlation `Matrix` respectively.
0097
0098 Refer to the [`Statistics` Scala docs](api/scala/org/apache/spark/mllib/stat/Statistics$.html) for details on the API.
0099
0100 {% include_example scala/org/apache/spark/examples/mllib/CorrelationsExample.scala %}
0101 </div>
0102
0103 <div data-lang="java" markdown="1">
0104 [`Statistics`](api/java/org/apache/spark/mllib/stat/Statistics.html) provides methods to
0105 calculate correlations between series. Depending on the type of input, two `JavaDoubleRDD`s or
0106 a `JavaRDD<Vector>`, the output will be a `Double` or the correlation `Matrix` respectively.
0107
0108 Refer to the [`Statistics` Java docs](api/java/org/apache/spark/mllib/stat/Statistics.html) for details on the API.
0109
0110 {% include_example java/org/apache/spark/examples/mllib/JavaCorrelationsExample.java %}
0111 </div>
0112
0113 <div data-lang="python" markdown="1">
0114 [`Statistics`](api/python/pyspark.mllib.html#pyspark.mllib.stat.Statistics) provides methods to
0115 calculate correlations between series. Depending on the type of input, two `RDD[Double]`s or
0116 an `RDD[Vector]`, the output will be a `Double` or the correlation `Matrix` respectively.
0117
0118 Refer to the [`Statistics` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.stat.Statistics) for more details on the API.
0119
0120 {% include_example python/mllib/correlations_example.py %}
0121 </div>
0122
0123 </div>
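The difference between the two methods can be sketched in plain Python; `pearson` and `spearman` below are illustrative helpers (the sketch assumes no tied values), not the Spark API:

```python
# Plain-Python sketch of the two supported correlation methods.
# pearson and spearman are illustrative helpers, not the Spark API.
from math import sqrt

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def spearman(xs, ys):
    # Spearman's correlation is Pearson's correlation of the ranks.
    # This sketch assumes no ties; a full version averages tied ranks.
    def ranks(vs):
        order = sorted(range(len(vs)), key=lambda i: vs[i])
        r = [0.0] * len(vs)
        for rank, i in enumerate(order):
            r[i] = float(rank + 1)
        return r
    return pearson(ranks(xs), ranks(ys))

x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 9.0]  # monotone, but not perfectly linear
print(pearson(x, y))   # slightly below 1
print(spearman(x, y))  # 1.0: the ranks agree exactly
```

As the example shows, Spearman's correlation is 1 for any strictly increasing relationship, while Pearson's measures only the linear component.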
0124
0125 ## Stratified sampling
0126
Unlike the other statistics functions, which reside in `spark.mllib`, the stratified sampling methods
`sampleByKey` and `sampleByKeyExact` can be performed on RDDs of key-value pairs. For stratified
sampling, the keys can be thought of as a label and the value as a specific attribute. For example,
the key can be man or woman, or document ids, and the respective values can be the list of ages
of the people in the population or the list of words in the documents. The `sampleByKey` method
flips a coin to decide whether an observation will be sampled, and therefore requires only one
pass over the data but provides only an *expected* sample size. `sampleByKeyExact` requires significantly
more resources than the per-stratum simple random sampling used in `sampleByKey`, but provides
the exact sample size with 99.99% confidence. `sampleByKeyExact` is currently not supported in
Python.
0137
0138 <div class="codetabs">
0139 <div data-lang="scala" markdown="1">
0140 [`sampleByKeyExact()`](api/scala/org/apache/spark/rdd/PairRDDFunctions.html) allows users to
0141 sample exactly $\lceil f_k \cdot n_k \rceil \, \forall k \in K$ items, where $f_k$ is the desired
0142 fraction for key $k$, $n_k$ is the number of key-value pairs for key $k$, and $K$ is the set of
0143 keys. Sampling without replacement requires one additional pass over the RDD to guarantee sample
0144 size, whereas sampling with replacement requires two additional passes.
0145
0146 {% include_example scala/org/apache/spark/examples/mllib/StratifiedSamplingExample.scala %}
0147 </div>
0148
0149 <div data-lang="java" markdown="1">
0150 [`sampleByKeyExact()`](api/java/org/apache/spark/api/java/JavaPairRDD.html) allows users to
0151 sample exactly $\lceil f_k \cdot n_k \rceil \, \forall k \in K$ items, where $f_k$ is the desired
0152 fraction for key $k$, $n_k$ is the number of key-value pairs for key $k$, and $K$ is the set of
0153 keys. Sampling without replacement requires one additional pass over the RDD to guarantee sample
0154 size, whereas sampling with replacement requires two additional passes.
0155
0156 {% include_example java/org/apache/spark/examples/mllib/JavaStratifiedSamplingExample.java %}
0157 </div>
0158 <div data-lang="python" markdown="1">
0159 [`sampleByKey()`](api/python/pyspark.html#pyspark.RDD.sampleByKey) allows users to
0160 sample approximately $\lceil f_k \cdot n_k \rceil \, \forall k \in K$ items, where $f_k$ is the
0161 desired fraction for key $k$, $n_k$ is the number of key-value pairs for key $k$, and $K$ is the
0162 set of keys.
0163
0164 *Note:* `sampleByKeyExact()` is currently not supported in Python.
0165
0166 {% include_example python/mllib/stratified_sampling_example.py %}
0167 </div>
0168
0169 </div>
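The per-stratum coin flip performed by `sampleByKey` can be sketched in plain Python; `sample_by_key` below is an illustrative helper, not the Spark API:

```python
# Plain-Python sketch of the per-stratum coin flip behind sampleByKey:
# each (key, value) pair is kept independently with probability fractions[key].
# sample_by_key is an illustrative helper, not the Spark API.
import random

def sample_by_key(pairs, fractions, seed=42):
    rng = random.Random(seed)
    return [(k, v) for k, v in pairs if rng.random() < fractions[k]]

data = [("a", i) for i in range(1000)] + [("b", i) for i in range(1000)]
fractions = {"a": 0.1, "b": 0.5}  # desired sampling fraction per key
sample = sample_by_key(data, fractions)
counts = {k: sum(1 for kk, _ in sample if kk == k) for k in fractions}
print(counts)  # roughly {"a": 100, "b": 500}, but not exactly
```

Because each pair is kept independently, the per-key sample sizes are correct only in expectation; guaranteeing them exactly is what makes `sampleByKeyExact` more expensive.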
0170
0171 ## Hypothesis testing
0172
Hypothesis testing is a powerful tool in statistics to determine whether a result is statistically
significant, i.e., whether it is likely to have occurred by chance. `spark.mllib` currently supports Pearson's
chi-squared ($\chi^2$) tests for goodness of fit and independence. The input data types determine
whether the goodness-of-fit or the independence test is conducted. The goodness-of-fit test requires
an input type of `Vector`, whereas the independence test requires a `Matrix` as input.
0178
0179 `spark.mllib` also supports the input type `RDD[LabeledPoint]` to enable feature selection via chi-squared
0180 independence tests.
0181
0182 <div class="codetabs">
0183 <div data-lang="scala" markdown="1">
[`Statistics`](api/scala/org/apache/spark/mllib/stat/Statistics$.html) provides methods to
run Pearson's chi-squared tests. The following example demonstrates how to run and interpret
hypothesis tests.

Refer to the [`ChiSqTestResult` Scala docs](api/scala/org/apache/spark/mllib/stat/test/ChiSqTestResult.html) for details on the API.
0187
0188 {% include_example scala/org/apache/spark/examples/mllib/HypothesisTestingExample.scala %}
0189 </div>
0190
0191 <div data-lang="java" markdown="1">
0192 [`Statistics`](api/java/org/apache/spark/mllib/stat/Statistics.html) provides methods to
0193 run Pearson's chi-squared tests. The following example demonstrates how to run and interpret
0194 hypothesis tests.
0195
0196 Refer to the [`ChiSqTestResult` Java docs](api/java/org/apache/spark/mllib/stat/test/ChiSqTestResult.html) for details on the API.
0197
0198 {% include_example java/org/apache/spark/examples/mllib/JavaHypothesisTestingExample.java %}
0199 </div>
0200
0201 <div data-lang="python" markdown="1">
[`Statistics`](api/python/pyspark.mllib.html#pyspark.mllib.stat.Statistics) provides methods to
0203 run Pearson's chi-squared tests. The following example demonstrates how to run and interpret
0204 hypothesis tests.
0205
0206 Refer to the [`Statistics` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.stat.Statistics) for more details on the API.
0207
0208 {% include_example python/mllib/hypothesis_testing_example.py %}
0209 </div>
0210
0211 </div>
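The goodness-of-fit statistic itself can be sketched in plain Python; `chi_squared_gof` is an illustrative helper, not the Spark API (which also reports the degrees of freedom and a p-value):

```python
# Plain-Python sketch of Pearson's chi-squared goodness-of-fit statistic.
# chi_squared_gof is an illustrative helper, not the Spark API; when no
# expected frequencies are supplied, this sketch assumes a uniform null
# hypothesis (all categories equally likely).
def chi_squared_gof(observed, expected=None):
    if expected is None:
        expected = [sum(observed) / len(observed)] * len(observed)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

stat = chi_squared_gof([10.0, 20.0, 30.0])
print(stat)  # 10.0, with len(observed) - 1 = 2 degrees of freedom
```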
0212
Additionally, `spark.mllib` provides a 1-sample, 2-sided implementation of the Kolmogorov-Smirnov (KS) test
for equality of probability distributions. By providing the name of a theoretical distribution
(currently only the normal distribution is supported) and its parameters, or a function to
calculate the cumulative distribution according to a given theoretical distribution, the user can
test the null hypothesis that their sample is drawn from that distribution. If the user
tests against the normal distribution (`distName="norm"`) but does not provide distribution
parameters, the test initializes to the standard normal distribution and logs an appropriate
message.
0221
0222 <div class="codetabs">
0223 <div data-lang="scala" markdown="1">
0224 [`Statistics`](api/scala/org/apache/spark/mllib/stat/Statistics$.html) provides methods to
0225 run a 1-sample, 2-sided Kolmogorov-Smirnov test. The following example demonstrates how to run
0226 and interpret the hypothesis tests.
0227
0228 Refer to the [`Statistics` Scala docs](api/scala/org/apache/spark/mllib/stat/Statistics$.html) for details on the API.
0229
0230 {% include_example scala/org/apache/spark/examples/mllib/HypothesisTestingKolmogorovSmirnovTestExample.scala %}
0231 </div>
0232
0233 <div data-lang="java" markdown="1">
0234 [`Statistics`](api/java/org/apache/spark/mllib/stat/Statistics.html) provides methods to
0235 run a 1-sample, 2-sided Kolmogorov-Smirnov test. The following example demonstrates how to run
0236 and interpret the hypothesis tests.
0237
0238 Refer to the [`Statistics` Java docs](api/java/org/apache/spark/mllib/stat/Statistics.html) for details on the API.
0239
0240 {% include_example java/org/apache/spark/examples/mllib/JavaHypothesisTestingKolmogorovSmirnovTestExample.java %}
0241 </div>
0242
0243 <div data-lang="python" markdown="1">
0244 [`Statistics`](api/python/pyspark.mllib.html#pyspark.mllib.stat.Statistics) provides methods to
0245 run a 1-sample, 2-sided Kolmogorov-Smirnov test. The following example demonstrates how to run
0246 and interpret the hypothesis tests.
0247
0248 Refer to the [`Statistics` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.stat.Statistics) for more details on the API.
0249
0250 {% include_example python/mllib/hypothesis_testing_kolmogorov_smirnov_test_example.py %}
0251 </div>
0252 </div>
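The statistic behind the test can be sketched in plain Python: it is the largest gap between the empirical CDF of the sample and the theoretical CDF. The helpers below are illustrative; Spark's `kolmogorovSmirnovTest` additionally reports a p-value:

```python
# Plain-Python sketch of the 1-sample, 2-sided KS statistic: the largest
# vertical gap between the empirical CDF and a theoretical CDF.
# The helper names are illustrative, not the Spark API.
from math import erf, sqrt

def std_normal_cdf(x):
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def ks_statistic(samples, cdf):
    xs = sorted(samples)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        d = max(d, abs(cdf(x) - i / n), abs((i + 1) / n - cdf(x)))
    return d

samples = [-0.5, 0.0, 0.5]
print(ks_statistic(samples, std_normal_cdf))  # about 0.31 for this tiny sample
```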
0253
0254 ### Streaming Significance Testing
0255 `spark.mllib` provides online implementations of some tests to support use cases
0256 like A/B testing. These tests may be performed on a Spark Streaming
0257 `DStream[(Boolean, Double)]` where the first element of each tuple
0258 indicates control group (`false`) or treatment group (`true`) and the
0259 second element is the value of an observation.
0260
0261 Streaming significance testing supports the following parameters:
0262
0263 * `peacePeriod` - The number of initial data points from the stream to
0264 ignore, used to mitigate novelty effects.
0265 * `windowSize` - The number of past batches to perform hypothesis
0266 testing over. Setting to `0` will perform cumulative processing using
0267 all prior batches.
0268
0269
0270 <div class="codetabs">
0271 <div data-lang="scala" markdown="1">
0272 [`StreamingTest`](api/scala/org/apache/spark/mllib/stat/test/StreamingTest.html)
0273 provides streaming hypothesis testing.
0274
0275 {% include_example scala/org/apache/spark/examples/mllib/StreamingTestExample.scala %}
0276 </div>
0277
0278 <div data-lang="java" markdown="1">
0279 [`StreamingTest`](api/java/index.html#org.apache.spark.mllib.stat.test.StreamingTest)
0280 provides streaming hypothesis testing.
0281
0282 {% include_example java/org/apache/spark/examples/mllib/JavaStreamingTestExample.java %}
0283 </div>
0284 </div>
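The two parameters above can be illustrated with a small plain-Python sketch of how the incoming batches are filtered and windowed. The class and method names are made up; the real `StreamingTest` does this internally and additionally runs the significance test itself:

```python
# Plain-Python sketch of peacePeriod and windowSize. Names are illustrative,
# not the Spark API; StreamingTest also computes the test statistic.
from collections import deque

class WindowedObservations:
    def __init__(self, peace_period=0, window_size=0):
        self.peace_period = peace_period  # initial batches to ignore
        self.window_size = window_size    # 0 means keep all prior batches
        self.seen_batches = 0
        self.batches = deque()

    def add_batch(self, batch):  # batch: list of (is_treatment, value) pairs
        self.seen_batches += 1
        if self.seen_batches <= self.peace_period:
            return  # still in the peace period: drop the batch
        self.batches.append(batch)
        if self.window_size > 0 and len(self.batches) > self.window_size:
            self.batches.popleft()  # slide the window forward

    def groups(self):
        control = [v for b in self.batches for t, v in b if not t]
        treatment = [v for b in self.batches for t, v in b if t]
        return control, treatment

w = WindowedObservations(peace_period=1, window_size=2)
for batch in ([(False, 0.0)],
              [(False, 1.0), (True, 2.0)],
              [(False, 3.0), (True, 4.0)],
              [(False, 5.0), (True, 6.0)]):
    w.add_batch(batch)
control, treatment = w.groups()
print(control, treatment)  # [3.0, 5.0] [4.0, 6.0]
```

The first batch falls inside the peace period and is dropped; of the remaining three, only the last two fit in the window, so the test sees only their observations.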
0285
0286
0287 ## Random data generation
0288
0289 Random data generation is useful for randomized algorithms, prototyping, and performance testing.
0290 `spark.mllib` supports generating random RDDs with i.i.d. values drawn from a given distribution:
0291 uniform, standard normal, or Poisson.
0292
0293 <div class="codetabs">
0294 <div data-lang="scala" markdown="1">
0295 [`RandomRDDs`](api/scala/org/apache/spark/mllib/random/RandomRDDs$.html) provides factory
0296 methods to generate random double RDDs or vector RDDs.
The following example generates a random double RDD, whose values follow the standard normal
distribution `N(0, 1)`, and then maps it to `N(1, 4)`.
0299
0300 Refer to the [`RandomRDDs` Scala docs](api/scala/org/apache/spark/mllib/random/RandomRDDs$.html) for details on the API.
0301
0302 {% highlight scala %}
0303 import org.apache.spark.SparkContext
0304 import org.apache.spark.mllib.random.RandomRDDs._
0305
0306 val sc: SparkContext = ...
0307
0308 // Generate a random double RDD that contains 1 million i.i.d. values drawn from the
0309 // standard normal distribution `N(0, 1)`, evenly distributed in 10 partitions.
0310 val u = normalRDD(sc, 1000000L, 10)
0311 // Apply a transform to get a random double RDD following `N(1, 4)`.
0312 val v = u.map(x => 1.0 + 2.0 * x)
0313 {% endhighlight %}
0314 </div>
0315
0316 <div data-lang="java" markdown="1">
0317 [`RandomRDDs`](api/java/index.html#org.apache.spark.mllib.random.RandomRDDs) provides factory
0318 methods to generate random double RDDs or vector RDDs.
The following example generates a random double RDD, whose values follow the standard normal
distribution `N(0, 1)`, and then maps it to `N(1, 4)`.

Refer to the [`RandomRDDs` Java docs](api/java/org/apache/spark/mllib/random/RandomRDDs.html) for details on the API.
0323
0324 {% highlight java %}
import org.apache.spark.api.java.JavaDoubleRDD;
import org.apache.spark.api.java.JavaSparkContext;
0327 import static org.apache.spark.mllib.random.RandomRDDs.*;
0328
0329 JavaSparkContext jsc = ...
0330
0331 // Generate a random double RDD that contains 1 million i.i.d. values drawn from the
0332 // standard normal distribution `N(0, 1)`, evenly distributed in 10 partitions.
0333 JavaDoubleRDD u = normalJavaRDD(jsc, 1000000L, 10);
0334 // Apply a transform to get a random double RDD following `N(1, 4)`.
0335 JavaDoubleRDD v = u.mapToDouble(x -> 1.0 + 2.0 * x);
0336 {% endhighlight %}
0337 </div>
0338
0339 <div data-lang="python" markdown="1">
0340 [`RandomRDDs`](api/python/pyspark.mllib.html#pyspark.mllib.random.RandomRDDs) provides factory
0341 methods to generate random double RDDs or vector RDDs.
The following example generates a random double RDD, whose values follow the standard normal
distribution `N(0, 1)`, and then maps it to `N(1, 4)`.
0344
0345 Refer to the [`RandomRDDs` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.random.RandomRDDs) for more details on the API.
0346
0347 {% highlight python %}
0348 from pyspark.mllib.random import RandomRDDs
0349
0350 sc = ... # SparkContext
0351
0352 # Generate a random double RDD that contains 1 million i.i.d. values drawn from the
0353 # standard normal distribution `N(0, 1)`, evenly distributed in 10 partitions.
u = RandomRDDs.normalRDD(sc, 1000000, 10)
0355 # Apply a transform to get a random double RDD following `N(1, 4)`.
0356 v = u.map(lambda x: 1.0 + 2.0 * x)
0357 {% endhighlight %}
0358 </div>
0359 </div>
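As a quick plain-Python sanity check of the transform used above (outside Spark), an affine map `1.0 + 2.0 * x` of `N(0, 1)` draws should yield mean 1 and variance 4:

```python
# Plain-Python check that 1.0 + 2.0 * x maps N(0, 1) draws to N(1, 4):
# mean 1 and variance 4 (standard deviation 2).
import random

rng = random.Random(0)
u = [rng.gauss(0.0, 1.0) for _ in range(100000)]
v = [1.0 + 2.0 * x for x in u]

mean = sum(v) / len(v)
var = sum((x - mean) ** 2 for x in v) / (len(v) - 1)
print(round(mean, 1), round(var, 1))  # approximately 1.0 and 4.0
```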
0360
0361 ## Kernel density estimation
0362
0363 [Kernel density estimation](https://en.wikipedia.org/wiki/Kernel_density_estimation) is a technique
0364 useful for visualizing empirical probability distributions without requiring assumptions about the
0365 particular distribution that the observed samples are drawn from. It computes an estimate of the
probability density function of a random variable, evaluated at a given set of points. It achieves
0367 this estimate by expressing the PDF of the empirical distribution at a particular point as the
0368 mean of PDFs of normal distributions centered around each of the samples.
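The estimate described above can be sketched in plain Python, without Spark; `kernel_density` is an illustrative helper, not the Spark API (which operates on an RDD of samples):

```python
# Plain-Python sketch of kernel density estimation with a Gaussian kernel:
# the density at each point is the mean of normal PDFs centred on the samples.
# kernel_density is an illustrative helper, not the Spark API.
from math import exp, pi, sqrt

def kernel_density(samples, bandwidth, points):
    norm = 1.0 / (bandwidth * sqrt(2.0 * pi))
    return [
        sum(norm * exp(-((p - s) / bandwidth) ** 2 / 2.0) for s in samples)
        / len(samples)
        for p in points
    ]

densities = kernel_density([1.0, 1.0, 2.0, 3.0], bandwidth=1.0, points=[2.0])
print(densities)  # about 0.28: the estimated density at x = 2.0
```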
0369
0370 <div class="codetabs">
0371
0372 <div data-lang="scala" markdown="1">
0373 [`KernelDensity`](api/scala/org/apache/spark/mllib/stat/KernelDensity.html) provides methods
0374 to compute kernel density estimates from an RDD of samples. The following example demonstrates how
0375 to do so.
0376
0377 Refer to the [`KernelDensity` Scala docs](api/scala/org/apache/spark/mllib/stat/KernelDensity.html) for details on the API.
0378
0379 {% include_example scala/org/apache/spark/examples/mllib/KernelDensityEstimationExample.scala %}
0380 </div>
0381
0382 <div data-lang="java" markdown="1">
0383 [`KernelDensity`](api/java/index.html#org.apache.spark.mllib.stat.KernelDensity) provides methods
0384 to compute kernel density estimates from an RDD of samples. The following example demonstrates how
0385 to do so.
0386
0387 Refer to the [`KernelDensity` Java docs](api/java/org/apache/spark/mllib/stat/KernelDensity.html) for details on the API.
0388
0389 {% include_example java/org/apache/spark/examples/mllib/JavaKernelDensityEstimationExample.java %}
0390 </div>
0391
0392 <div data-lang="python" markdown="1">
0393 [`KernelDensity`](api/python/pyspark.mllib.html#pyspark.mllib.stat.KernelDensity) provides methods
0394 to compute kernel density estimates from an RDD of samples. The following example demonstrates how
0395 to do so.
0396
0397 Refer to the [`KernelDensity` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.stat.KernelDensity) for more details on the API.
0398
0399 {% include_example python/mllib/kernel_density_estimation_example.py %}
0400 </div>
0401
0402 </div>