---
layout: global
title: Basic Statistics - RDD-based API
displayTitle: Basic Statistics - RDD-based API
license: |
  Licensed to the Apache Software Foundation (ASF) under one or more
  contributor license agreements.  See the NOTICE file distributed with
  this work for additional information regarding copyright ownership.
  The ASF licenses this file to You under the Apache License, Version 2.0
  (the "License"); you may not use this file except in compliance with
  the License.  You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.
---

* Table of contents
{:toc}


`\[
\newcommand{\R}{\mathbb{R}}
\newcommand{\E}{\mathbb{E}}
\newcommand{\x}{\mathbf{x}}
\newcommand{\y}{\mathbf{y}}
\newcommand{\wv}{\mathbf{w}}
\newcommand{\av}{\mathbf{\alpha}}
\newcommand{\bv}{\mathbf{b}}
\newcommand{\N}{\mathbb{N}}
\newcommand{\id}{\mathbf{I}}
\newcommand{\ind}{\mathbf{1}}
\newcommand{\0}{\mathbf{0}}
\newcommand{\unit}{\mathbf{e}}
\newcommand{\one}{\mathbf{1}}
\newcommand{\zero}{\mathbf{0}}
\]`

## Summary statistics

We provide column summary statistics for `RDD[Vector]` through the function `colStats`
available in `Statistics`.

<div class="codetabs">
<div data-lang="scala" markdown="1">

[`colStats()`](api/scala/org/apache/spark/mllib/stat/Statistics$.html) returns an instance of
[`MultivariateStatisticalSummary`](api/scala/org/apache/spark/mllib/stat/MultivariateStatisticalSummary.html),
which contains the column-wise max, min, mean, variance, and number of nonzeros, as well as the
total count.

Refer to the [`MultivariateStatisticalSummary` Scala docs](api/scala/org/apache/spark/mllib/stat/MultivariateStatisticalSummary.html) for details on the API.
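
A minimal sketch of the call pattern, assuming an existing SparkContext `sc`; the variable names and data values are illustrative:

{% highlight scala %}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.{MultivariateStatisticalSummary, Statistics}

// A small RDD[Vector] of observations; each Vector is one row.
val observations = sc.parallelize(Seq(
  Vectors.dense(1.0, 10.0, 100.0),
  Vectors.dense(2.0, 20.0, 200.0),
  Vectors.dense(3.0, 30.0, 300.0)
))

// Compute column-wise summary statistics in a single pass.
val summary: MultivariateStatisticalSummary = Statistics.colStats(observations)
println(summary.mean)         // mean value of each column
println(summary.variance)     // column-wise variance
println(summary.numNonzeros)  // number of nonzeros in each column
{% endhighlight %}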

{% include_example scala/org/apache/spark/examples/mllib/SummaryStatisticsExample.scala %}
</div>

<div data-lang="java" markdown="1">

[`colStats()`](api/java/org/apache/spark/mllib/stat/Statistics.html) returns an instance of
[`MultivariateStatisticalSummary`](api/java/org/apache/spark/mllib/stat/MultivariateStatisticalSummary.html),
which contains the column-wise max, min, mean, variance, and number of nonzeros, as well as the
total count.

Refer to the [`MultivariateStatisticalSummary` Java docs](api/java/org/apache/spark/mllib/stat/MultivariateStatisticalSummary.html) for details on the API.

{% include_example java/org/apache/spark/examples/mllib/JavaSummaryStatisticsExample.java %}
</div>

<div data-lang="python" markdown="1">
[`colStats()`](api/python/pyspark.mllib.html#pyspark.mllib.stat.Statistics.colStats) returns an instance of
[`MultivariateStatisticalSummary`](api/python/pyspark.mllib.html#pyspark.mllib.stat.MultivariateStatisticalSummary),
which contains the column-wise max, min, mean, variance, and number of nonzeros, as well as the
total count.

Refer to the [`MultivariateStatisticalSummary` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.stat.MultivariateStatisticalSummary) for more details on the API.

{% include_example python/mllib/summary_statistics_example.py %}
</div>

</div>

## Correlations

Calculating the correlation between two series of data is a common operation in statistics. `spark.mllib`
provides the flexibility to calculate pairwise correlations among many series. The supported
correlation methods are currently Pearson's and Spearman's correlation.

<div class="codetabs">
<div data-lang="scala" markdown="1">
[`Statistics`](api/scala/org/apache/spark/mllib/stat/Statistics$.html) provides methods to
calculate correlations between series. Depending on the type of input, two `RDD[Double]`s or
an `RDD[Vector]`, the output will be a `Double` or the correlation `Matrix` respectively.

Refer to the [`Statistics` Scala docs](api/scala/org/apache/spark/mllib/stat/Statistics$.html) for details on the API.
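
A minimal sketch of both call forms, assuming an existing SparkContext `sc`; the series values are illustrative:

{% highlight scala %}
import org.apache.spark.mllib.linalg._
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.rdd.RDD

val seriesX: RDD[Double] = sc.parallelize(Array(1, 2, 3, 3, 5))
// Must have the same number of partitions and cardinality as seriesX.
val seriesY: RDD[Double] = sc.parallelize(Array(11, 22, 33, 33, 555))

// Correlation of two series as a Double; the method defaults to "pearson".
val correlation: Double = Statistics.corr(seriesX, seriesY, "pearson")

val data: RDD[Vector] = sc.parallelize(Seq(
  Vectors.dense(1.0, 10.0, 100.0),
  Vectors.dense(2.0, 20.0, 200.0),
  Vectors.dense(5.0, 33.0, 366.0)
))
// Pairwise correlations among the columns of an RDD[Vector], as a Matrix.
val correlMatrix: Matrix = Statistics.corr(data, "pearson")
{% endhighlight %}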

{% include_example scala/org/apache/spark/examples/mllib/CorrelationsExample.scala %}
</div>

<div data-lang="java" markdown="1">
[`Statistics`](api/java/org/apache/spark/mllib/stat/Statistics.html) provides methods to
calculate correlations between series. Depending on the type of input, two `JavaDoubleRDD`s or
a `JavaRDD<Vector>`, the output will be a `Double` or the correlation `Matrix` respectively.

Refer to the [`Statistics` Java docs](api/java/org/apache/spark/mllib/stat/Statistics.html) for details on the API.

{% include_example java/org/apache/spark/examples/mllib/JavaCorrelationsExample.java %}
</div>

<div data-lang="python" markdown="1">
[`Statistics`](api/python/pyspark.mllib.html#pyspark.mllib.stat.Statistics) provides methods to
calculate correlations between series. Depending on the type of input, two `RDD[Double]`s or
an `RDD[Vector]`, the output will be a `Double` or the correlation `Matrix` respectively.

Refer to the [`Statistics` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.stat.Statistics) for more details on the API.

{% include_example python/mllib/correlations_example.py %}
</div>

</div>

## Stratified sampling

Unlike the other statistics functions, which reside in `spark.mllib`, the stratified sampling methods
`sampleByKey` and `sampleByKeyExact` can be performed on RDDs of key-value pairs. For stratified
sampling, the keys can be thought of as a label and the value as a specific attribute. For example,
the key can be male or female, or document ids, and the respective values can be the list of ages
of the people in the population or the list of words in the documents. The `sampleByKey` method
will flip a coin to decide whether an observation will be sampled or not, and therefore requires only
one pass over the data, providing an *expected* sample size. `sampleByKeyExact` requires significantly
more resources than the per-stratum simple random sampling used in `sampleByKey`, but will provide
the exact sample size with 99.99% confidence. `sampleByKeyExact` is currently not supported in
Python.

<div class="codetabs">
<div data-lang="scala" markdown="1">
[`sampleByKeyExact()`](api/scala/org/apache/spark/rdd/PairRDDFunctions.html) allows users to
sample exactly $\lceil f_k \cdot n_k \rceil \, \forall k \in K$ items, where $f_k$ is the desired
fraction for key $k$, $n_k$ is the number of key-value pairs for key $k$, and $K$ is the set of
keys. Sampling without replacement requires one additional pass over the RDD to guarantee sample
size, whereas sampling with replacement requires two additional passes.
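
A minimal sketch, assuming an existing SparkContext `sc`; the keys and fractions shown are illustrative:

{% highlight scala %}
// An RDD of key-value pairs; keys are the strata.
val data = sc.parallelize(
  Seq((1, 'a'), (1, 'b'), (2, 'c'), (2, 'd'), (2, 'e'), (3, 'f')))

// Specify the desired fraction for each stratum (key).
val fractions = Map(1 -> 0.1, 2 -> 0.6, 3 -> 0.3)

// Approximate sample: one pass, *expected* sample size per key.
val approxSample = data.sampleByKey(withReplacement = false, fractions = fractions)
// Exact sample: additional pass(es), exact sample size per key.
val exactSample = data.sampleByKeyExact(withReplacement = false, fractions = fractions)
{% endhighlight %}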

{% include_example scala/org/apache/spark/examples/mllib/StratifiedSamplingExample.scala %}
</div>

<div data-lang="java" markdown="1">
[`sampleByKeyExact()`](api/java/org/apache/spark/api/java/JavaPairRDD.html) allows users to
sample exactly $\lceil f_k \cdot n_k \rceil \, \forall k \in K$ items, where $f_k$ is the desired
fraction for key $k$, $n_k$ is the number of key-value pairs for key $k$, and $K$ is the set of
keys. Sampling without replacement requires one additional pass over the RDD to guarantee sample
size, whereas sampling with replacement requires two additional passes.

{% include_example java/org/apache/spark/examples/mllib/JavaStratifiedSamplingExample.java %}
</div>
<div data-lang="python" markdown="1">
[`sampleByKey()`](api/python/pyspark.html#pyspark.RDD.sampleByKey) allows users to
sample approximately $\lceil f_k \cdot n_k \rceil \, \forall k \in K$ items, where $f_k$ is the
desired fraction for key $k$, $n_k$ is the number of key-value pairs for key $k$, and $K$ is the
set of keys.

*Note:* `sampleByKeyExact()` is currently not supported in Python.

{% include_example python/mllib/stratified_sampling_example.py %}
</div>

</div>

## Hypothesis testing

Hypothesis testing is a powerful tool in statistics to determine whether a result is statistically
significant, i.e., whether or not it occurred by chance. `spark.mllib` currently supports Pearson's
chi-squared ($\chi^2$) tests for goodness of fit and independence. The input data types determine
whether the goodness of fit or the independence test is conducted. The goodness of fit test requires
an input type of `Vector`, whereas the independence test requires a `Matrix` as input.

`spark.mllib` also supports the input type `RDD[LabeledPoint]` to enable feature selection via chi-squared
independence tests.

<div class="codetabs">
<div data-lang="scala" markdown="1">
[`Statistics`](api/scala/org/apache/spark/mllib/stat/Statistics$.html) provides methods to
run Pearson's chi-squared tests. The following example demonstrates how to run and interpret
hypothesis tests.

Refer to the [`ChiSqTestResult` Scala docs](api/scala/org/apache/spark/mllib/stat/test/ChiSqTestResult.html) for details on the API.
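
A minimal sketch of the three input shapes, assuming an existing SparkContext `sc`; the counts and labeled points are illustrative:

{% highlight scala %}
import org.apache.spark.mllib.linalg.{Matrices, Vectors}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.stat.Statistics

// Goodness of fit on a Vector of event frequencies; without a second
// vector of expected frequencies, a uniform distribution is assumed.
val vec = Vectors.dense(0.1, 0.15, 0.2, 0.3, 0.25)
val goodnessOfFitResult = Statistics.chiSqTest(vec)
println(goodnessOfFitResult) // p-value, degrees of freedom, test statistic

// Independence test on a contingency Matrix.
val mat = Matrices.dense(3, 2, Array(1.0, 3.0, 5.0, 2.0, 4.0, 6.0))
val independenceResult = Statistics.chiSqTest(mat)

// Independence test of every feature against the label in an
// RDD[LabeledPoint]; returns one ChiSqTestResult per feature.
val obs = sc.parallelize(Seq(
  LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0)),
  LabeledPoint(1.0, Vectors.dense(1.0, 2.0, 0.0)),
  LabeledPoint(-1.0, Vectors.dense(-1.0, 0.0, -0.5))))
val featureTestResults = Statistics.chiSqTest(obs)
{% endhighlight %}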

{% include_example scala/org/apache/spark/examples/mllib/HypothesisTestingExample.scala %}
</div>

<div data-lang="java" markdown="1">
[`Statistics`](api/java/org/apache/spark/mllib/stat/Statistics.html) provides methods to
run Pearson's chi-squared tests. The following example demonstrates how to run and interpret
hypothesis tests.

Refer to the [`ChiSqTestResult` Java docs](api/java/org/apache/spark/mllib/stat/test/ChiSqTestResult.html) for details on the API.

{% include_example java/org/apache/spark/examples/mllib/JavaHypothesisTestingExample.java %}
</div>

<div data-lang="python" markdown="1">
[`Statistics`](api/python/pyspark.mllib.html#pyspark.mllib.stat.Statistics) provides methods to
run Pearson's chi-squared tests. The following example demonstrates how to run and interpret
hypothesis tests.

Refer to the [`Statistics` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.stat.Statistics) for more details on the API.

{% include_example python/mllib/hypothesis_testing_example.py %}
</div>

</div>

Additionally, `spark.mllib` provides a 1-sample, 2-sided implementation of the Kolmogorov-Smirnov (KS) test
for equality of probability distributions. By providing the name of a theoretical distribution
(currently only the normal distribution is supported) and its parameters, or a function to
calculate the cumulative distribution according to a given theoretical distribution, the user can
test the null hypothesis that their sample is drawn from that distribution. In the case that the
user tests against the normal distribution (`distName="norm"`) but does not provide distribution
parameters, the test initializes to the standard normal distribution and logs an appropriate
message.

<div class="codetabs">
<div data-lang="scala" markdown="1">
[`Statistics`](api/scala/org/apache/spark/mllib/stat/Statistics$.html) provides methods to
run a 1-sample, 2-sided Kolmogorov-Smirnov test. The following example demonstrates how to run
and interpret the hypothesis tests.

Refer to the [`Statistics` Scala docs](api/scala/org/apache/spark/mllib/stat/Statistics$.html) for details on the API.
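
A minimal sketch of both call forms, assuming an existing SparkContext `sc`; the sample values and the CDF lookup table are illustrative:

{% highlight scala %}
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.rdd.RDD

val data: RDD[Double] = sc.parallelize(Seq(0.1, 0.15, 0.2, 0.3, 0.25))

// Test against a normal distribution by name; the trailing varargs are
// the distribution parameters (here mean 0, standard deviation 1).
val testResult = Statistics.kolmogorovSmirnovTest(data, "norm", 0, 1)
println(testResult) // p-value, test statistic, and null hypothesis verdict

// Alternatively, test against a user-supplied CDF (any Double => Double).
val myCDF = Map(0.1 -> 0.2, 0.15 -> 0.6, 0.2 -> 0.05, 0.3 -> 0.05, 0.25 -> 0.1)
val testResult2 = Statistics.kolmogorovSmirnovTest(data, myCDF)
{% endhighlight %}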

{% include_example scala/org/apache/spark/examples/mllib/HypothesisTestingKolmogorovSmirnovTestExample.scala %}
</div>

<div data-lang="java" markdown="1">
[`Statistics`](api/java/org/apache/spark/mllib/stat/Statistics.html) provides methods to
run a 1-sample, 2-sided Kolmogorov-Smirnov test. The following example demonstrates how to run
and interpret the hypothesis tests.

Refer to the [`Statistics` Java docs](api/java/org/apache/spark/mllib/stat/Statistics.html) for details on the API.

{% include_example java/org/apache/spark/examples/mllib/JavaHypothesisTestingKolmogorovSmirnovTestExample.java %}
</div>

<div data-lang="python" markdown="1">
[`Statistics`](api/python/pyspark.mllib.html#pyspark.mllib.stat.Statistics) provides methods to
run a 1-sample, 2-sided Kolmogorov-Smirnov test. The following example demonstrates how to run
and interpret the hypothesis tests.

Refer to the [`Statistics` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.stat.Statistics) for more details on the API.

{% include_example python/mllib/hypothesis_testing_kolmogorov_smirnov_test_example.py %}
</div>
</div>

### Streaming Significance Testing
`spark.mllib` provides online implementations of some tests to support use cases
like A/B testing. These tests may be performed on a Spark Streaming
`DStream[(Boolean, Double)]` where the first element of each tuple
indicates control group (`false`) or treatment group (`true`) and the
second element is the value of an observation.

Streaming significance testing supports the following parameters, both of which are set in the sketch below:

* `peacePeriod` - The number of initial data points from the stream to
ignore, used to mitigate novelty effects.
* `windowSize` - The number of past batches to perform hypothesis
testing over. Setting to `0` will perform cumulative processing using
all prior batches.
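
A minimal configuration sketch, assuming an input `DStream` named `data` already exists (note that in recent Spark versions the observations are wrapped in `BinarySample` objects rather than raw `(Boolean, Double)` tuples):

{% highlight scala %}
import org.apache.spark.mllib.stat.test.StreamingTest

// Configure the online test; results are emitted as a DStream, one per batch.
val streamingTest = new StreamingTest()
  .setPeacePeriod(0)      // ignore no initial observations
  .setWindowSize(0)       // 0 = cumulative processing over all prior batches
  .setTestMethod("welch") // Welch's t-test; "student" is also supported

val out = streamingTest.registerStream(data)
out.print()
{% endhighlight %}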


<div class="codetabs">
<div data-lang="scala" markdown="1">
[`StreamingTest`](api/scala/org/apache/spark/mllib/stat/test/StreamingTest.html)
provides streaming hypothesis testing.

{% include_example scala/org/apache/spark/examples/mllib/StreamingTestExample.scala %}
</div>

<div data-lang="java" markdown="1">
[`StreamingTest`](api/java/index.html#org.apache.spark.mllib.stat.test.StreamingTest)
provides streaming hypothesis testing.

{% include_example java/org/apache/spark/examples/mllib/JavaStreamingTestExample.java %}
</div>
</div>


## Random data generation

Random data generation is useful for randomized algorithms, prototyping, and performance testing.
`spark.mllib` supports generating random RDDs with i.i.d. values drawn from a given distribution:
uniform, standard normal, or Poisson.

<div class="codetabs">
<div data-lang="scala" markdown="1">
[`RandomRDDs`](api/scala/org/apache/spark/mllib/random/RandomRDDs$.html) provides factory
methods to generate random double RDDs or vector RDDs.
The following example generates a random double RDD, whose values follow the standard normal
distribution `N(0, 1)`, and then maps it to `N(1, 4)`.
0299 
0300 Refer to the [`RandomRDDs` Scala docs](api/scala/org/apache/spark/mllib/random/RandomRDDs$.html) for details on the API.
0301 
0302 {% highlight scala %}
0303 import org.apache.spark.SparkContext
0304 import org.apache.spark.mllib.random.RandomRDDs._
0305 
0306 val sc: SparkContext = ...
0307 
0308 // Generate a random double RDD that contains 1 million i.i.d. values drawn from the
0309 // standard normal distribution `N(0, 1)`, evenly distributed in 10 partitions.
0310 val u = normalRDD(sc, 1000000L, 10)
0311 // Apply a transform to get a random double RDD following `N(1, 4)`.
0312 val v = u.map(x => 1.0 + 2.0 * x)
0313 {% endhighlight %}
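
The vector RDD and other-distribution factory methods follow the same pattern; a short sketch continuing from the `sc` above (sizes and parameters are illustrative):

{% highlight scala %}
// Generate an RDD[Vector] with 10000 rows of 5 i.i.d. standard normal values each.
val m = normalVectorRDD(sc, numRows = 10000L, numCols = 5)
// Poisson-distributed doubles with mean 3.0 are available as well.
val p = poissonRDD(sc, mean = 3.0, size = 10000L)
{% endhighlight %}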
</div>

<div data-lang="java" markdown="1">
[`RandomRDDs`](api/java/index.html#org.apache.spark.mllib.random.RandomRDDs) provides factory
methods to generate random double RDDs or vector RDDs.
The following example generates a random double RDD, whose values follow the standard normal
distribution `N(0, 1)`, and then maps it to `N(1, 4)`.

Refer to the [`RandomRDDs` Java docs](api/java/org/apache/spark/mllib/random/RandomRDDs.html) for details on the API.

{% highlight java %}
import org.apache.spark.api.java.JavaDoubleRDD;
import org.apache.spark.api.java.JavaSparkContext;
import static org.apache.spark.mllib.random.RandomRDDs.*;

JavaSparkContext jsc = ...

// Generate a random double RDD that contains 1 million i.i.d. values drawn from the
// standard normal distribution `N(0, 1)`, evenly distributed in 10 partitions.
JavaDoubleRDD u = normalJavaRDD(jsc, 1000000L, 10);
// Apply a transform to get a random double RDD following `N(1, 4)`.
JavaDoubleRDD v = u.mapToDouble(x -> 1.0 + 2.0 * x);
{% endhighlight %}
</div>

<div data-lang="python" markdown="1">
[`RandomRDDs`](api/python/pyspark.mllib.html#pyspark.mllib.random.RandomRDDs) provides factory
methods to generate random double RDDs or vector RDDs.
The following example generates a random double RDD, whose values follow the standard normal
distribution `N(0, 1)`, and then maps it to `N(1, 4)`.

Refer to the [`RandomRDDs` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.random.RandomRDDs) for more details on the API.

{% highlight python %}
from pyspark.mllib.random import RandomRDDs

sc = ... # SparkContext

# Generate a random double RDD that contains 1 million i.i.d. values drawn from the
# standard normal distribution `N(0, 1)`, evenly distributed in 10 partitions.
u = RandomRDDs.normalRDD(sc, 1000000, 10)
# Apply a transform to get a random double RDD following `N(1, 4)`.
v = u.map(lambda x: 1.0 + 2.0 * x)
{% endhighlight %}
</div>
</div>

## Kernel density estimation

[Kernel density estimation](https://en.wikipedia.org/wiki/Kernel_density_estimation) is a technique
useful for visualizing empirical probability distributions without requiring assumptions about the
particular distribution that the observed samples are drawn from. It computes an estimate of the
probability density function of a random variable, evaluated at a given set of points. It achieves
this estimate by expressing the PDF of the empirical distribution at a particular point as the
mean of PDFs of normal distributions centered around each of the samples.
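
Concretely, with a Gaussian kernel of bandwidth $h$ and samples $x_1, \ldots, x_n$, the estimate at a point $x$ is the standard Gaussian KDE (stated here for clarity; the bandwidth corresponds to the `setBandwidth` parameter in the API):

`\[
\hat{f}(x) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,h} \exp\left(-\frac{(x - x_i)^2}{2h^2}\right)
\]`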

<div class="codetabs">

<div data-lang="scala" markdown="1">
[`KernelDensity`](api/scala/org/apache/spark/mllib/stat/KernelDensity.html) provides methods
to compute kernel density estimates from an RDD of samples. The following example demonstrates how
to do so.

Refer to the [`KernelDensity` Scala docs](api/scala/org/apache/spark/mllib/stat/KernelDensity.html) for details on the API.
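
A minimal sketch, assuming an existing SparkContext `sc`; the sample data, bandwidth, and evaluation points are illustrative:

{% highlight scala %}
import org.apache.spark.mllib.stat.KernelDensity
import org.apache.spark.rdd.RDD

// An RDD of sample data from the distribution to be estimated.
val data: RDD[Double] = sc.parallelize(Seq(1, 1, 1, 2, 3, 4, 5, 5, 6, 7, 8, 9, 9))

// Construct the estimator with the sample data and a Gaussian kernel
// of standard deviation (bandwidth) 3.0.
val kd = new KernelDensity()
  .setSample(data)
  .setBandwidth(3.0)

// Evaluate the density estimate at the given points.
val densities = kd.estimate(Array(-1.0, 2.0, 5.0))
{% endhighlight %}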

{% include_example scala/org/apache/spark/examples/mllib/KernelDensityEstimationExample.scala %}
</div>

<div data-lang="java" markdown="1">
[`KernelDensity`](api/java/index.html#org.apache.spark.mllib.stat.KernelDensity) provides methods
to compute kernel density estimates from an RDD of samples. The following example demonstrates how
to do so.

Refer to the [`KernelDensity` Java docs](api/java/org/apache/spark/mllib/stat/KernelDensity.html) for details on the API.

{% include_example java/org/apache/spark/examples/mllib/JavaKernelDensityEstimationExample.java %}
</div>

<div data-lang="python" markdown="1">
[`KernelDensity`](api/python/pyspark.mllib.html#pyspark.mllib.stat.KernelDensity) provides methods
to compute kernel density estimates from an RDD of samples. The following example demonstrates how
to do so.

Refer to the [`KernelDensity` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.stat.KernelDensity) for more details on the API.

{% include_example python/mllib/kernel_density_estimation_example.py %}
</div>

</div>