the-tree/docs/ml-statistics.md

0001 ---
0002 layout: global
0003 title: Basic Statistics
0004 displayTitle: Basic Statistics
0005 license: |
0006   Licensed to the Apache Software Foundation (ASF) under one or more
0007   contributor license agreements.  See the NOTICE file distributed with
0008   this work for additional information regarding copyright ownership.
0009   The ASF licenses this file to You under the Apache License, Version 2.0
0010   (the "License"); you may not use this file except in compliance with
0011   the License.  You may obtain a copy of the License at
0012
0013      http://www.apache.org/licenses/LICENSE-2.0
0014
0015   Unless required by applicable law or agreed to in writing, software
0016   distributed under the License is distributed on an "AS IS" BASIS,
0017   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
0018   See the License for the specific language governing permissions and
0019   limitations under the License.
0020 ---
0021
0022
0023 `\[
0024 \newcommand{\R}{\mathbb{R}}
0025 \newcommand{\E}{\mathbb{E}}
0026 \newcommand{\x}{\mathbf{x}}
0027 \newcommand{\y}{\mathbf{y}}
0028 \newcommand{\wv}{\mathbf{w}}
0029 \newcommand{\av}{\mathbf{\alpha}}
0030 \newcommand{\bv}{\mathbf{b}}
0031 \newcommand{\N}{\mathbb{N}}
0032 \newcommand{\id}{\mathbf{I}}
0033 \newcommand{\ind}{\mathbf{1}}
0034 \newcommand{\0}{\mathbf{0}}
0035 \newcommand{\unit}{\mathbf{e}}
0036 \newcommand{\one}{\mathbf{1}}
0037 \newcommand{\zero}{\mathbf{0}}
0038 \]`
0039
0040 **Table of Contents**
0041
0042 * This will become a table of contents (this text will be scraped).
0043 {:toc}
0044
0045 ## Correlation
0046
0047 Calculating the correlation between two series of data is a common operation in statistics. In `spark.ml`
0048 we provide the flexibility to calculate pairwise correlations among many series. The supported
0049 correlation methods are currently Pearson's and Spearman's correlation.
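
To make the two methods concrete, here is a minimal plain-Python sketch of what a pairwise correlation matrix contains. It is not Spark's implementation and does not use the `spark.ml` API. The helper names (`pearson`, `average_ranks`, `spearman`, `correlation_matrix`) are hypothetical, and the sketch assumes small in-memory columns of equal length with non-zero variance.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (assumes non-zero variance)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation applied to the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


def correlation_matrix(columns, method="pearson"):
    """Pairwise correlations among several series, as a list of rows."""
    if method not in ("pearson", "spearman"):
        raise ValueError("unsupported method: " + method)
    corr = pearson if method == "pearson" else spearman
    return [[1.0 if i == j else corr(a, b) for j, b in enumerate(columns)]
            for i, a in enumerate(columns)]


# Illustrative data: three series, each monotonically increasing.
columns = [[1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0], [1.0, 4.0, 9.0, 16.0]]
pearson_matrix = correlation_matrix(columns, "pearson")
spearman_matrix = correlation_matrix(columns, "spearman")
```

In this data the third series is a nonlinear but monotonic function of the first, so its Spearman correlation with the first is 1 while its Pearson correlation is slightly below 1.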
0050
0051 <div class="codetabs">
0052 <div data-lang="scala" markdown="1">
0053 [`Correlation`](api/scala/org/apache/spark/ml/stat/Correlation$.html)
0054 computes the correlation matrix for the input Dataset of Vectors using the specified method.
0055 The output will be a DataFrame that contains the correlation matrix of the column of vectors.
0056
0057 {% include_example scala/org/apache/spark/examples/ml/CorrelationExample.scala %}
0058 </div>
0059
0060 <div data-lang="java" markdown="1">
0061 [`Correlation`](api/java/org/apache/spark/ml/stat/Correlation.html)
0062 computes the correlation matrix for the input Dataset of Vectors using the specified method.
0063 The output will be a DataFrame that contains the correlation matrix of the column of vectors.
0064
0065 {% include_example java/org/apache/spark/examples/ml/JavaCorrelationExample.java %}
0066 </div>
0067
0068 <div data-lang="python" markdown="1">
0069 [`Correlation`](api/python/pyspark.ml.html#pyspark.ml.stat.Correlation)
0070 computes the correlation matrix for the input Dataset of Vectors using the specified method.
0071 The output will be a DataFrame that contains the correlation matrix of the column of vectors.
0072
0073 {% include_example python/ml/correlation_example.py %}
0074 </div>
0075
0076 </div>
0077
0078 ## Hypothesis testing
0079
0080 Hypothesis testing is a powerful tool in statistics for determining whether a result is statistically
0081 significant, that is, whether it is unlikely to have occurred by chance. `spark.ml` currently supports Pearson's
0082 Chi-squared ($\chi^2$) tests for independence.
0083
0084 `ChiSquareTest` conducts Pearson's independence test for every feature against the label.
0085 For each feature, the (feature, label) pairs are converted into a contingency matrix for which
0086 the Chi-squared statistic is computed. All label and feature values must be categorical.
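
The contingency-matrix step described above can be sketched in plain Python for a single feature. This is an illustration of the statistic only, not the `ChiSquareTest` API. The function name `chi_square_independence` is hypothetical, and the sketch assumes categorical values held in small in-memory lists. Computing a p-value additionally requires the chi-squared distribution's CDF, which is omitted here.

```python
from collections import Counter


def chi_square_independence(features, labels):
    """Return (statistic, degrees_of_freedom) for Pearson's independence test."""
    n = len(features)
    observed = Counter(zip(features, labels))  # contingency matrix cells
    row_totals = Counter(features)             # counts per feature value
    col_totals = Counter(labels)               # counts per label value

    statistic = 0.0
    for f, row_total in row_totals.items():
        for l, col_total in col_totals.items():
            # Expected cell count if feature and label were independent.
            expected = row_total * col_total / n
            statistic += (observed[(f, l)] - expected) ** 2 / expected

    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof


# Illustrative data: the feature fully determines the label.
dependent = chi_square_independence([0, 0, 1, 1], [0, 0, 1, 1])
# Illustrative data: every (feature, label) combination is equally frequent.
independent = chi_square_independence([0, 0, 1, 1], [0, 1, 0, 1])
print(dependent)    # (4.0, 1)
print(independent)  # (0.0, 1)
```

A larger statistic for the same degrees of freedom means stronger evidence against independence. Repeating the computation for each feature gives one result per feature, which matches the per-feature description above.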
0087
0088 <div class="codetabs">
0089 <div data-lang="scala" markdown="1">
0090 Refer to the [`ChiSquareTest` Scala docs](api/scala/org/apache/spark/ml/stat/ChiSquareTest$.html) for details on the API.
0091
0092 {% include_example scala/org/apache/spark/examples/ml/ChiSquareTestExample.scala %}
0093 </div>
0094
0095 <div data-lang="java" markdown="1">
0096 Refer to the [`ChiSquareTest` Java docs](api/java/org/apache/spark/ml/stat/ChiSquareTest.html) for details on the API.
0097
0098 {% include_example java/org/apache/spark/examples/ml/JavaChiSquareTestExample.java %}
0099 </div>
0100
0101 <div data-lang="python" markdown="1">
0102 Refer to the [`ChiSquareTest` Python docs](api/python/index.html#pyspark.ml.stat.ChiSquareTest) for details on the API.
0103
0104 {% include_example python/ml/chi_square_test_example.py %}
0105 </div>
0106
0107 </div>
0108
0109 ## Summarizer
0110
0111 We provide summary statistics for vector columns of a `DataFrame` through `Summarizer`.
0112 Available metrics are the column-wise max, min, mean, sum, variance, std, and number of nonzeros,
0113 as well as the total count.
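
As a rough illustration of what "column-wise" means here, the following plain-Python sketch computes several of these metrics over a list of equal-length vectors. It does not use the `Summarizer` API. The function names `summarize` and `weighted_mean` and the dictionary keys are hypothetical. The variance uses the sample (n - 1) denominator, which is an assumption of this sketch and requires at least two rows.

```python
def summarize(rows):
    """Column-wise summary of equal-length numeric vectors (needs >= 2 rows)."""
    count = len(rows)
    columns = list(zip(*rows))  # transpose rows into columns
    means = [sum(col) / count for col in columns]
    # Sample variance with an (n - 1) denominator; an assumption of this sketch.
    variances = [sum((v - m) ** 2 for v in col) / (count - 1)
                 for col, m in zip(columns, means)]
    return {
        "count": count,
        "mean": means,
        "variance": variances,
        "max": [max(col) for col in columns],
        "min": [min(col) for col in columns],
        "numNonZeros": [sum(1 for v in col if v != 0) for col in columns],
    }


def weighted_mean(rows, weights):
    """Column-wise mean in which each row counts in proportion to its weight."""
    total = sum(weights)
    return [sum(w * v for w, v in zip(weights, col)) / total
            for col in zip(*rows)]


# Illustrative data: two vectors of length three, with one weight per vector.
rows = [[2.0, 3.0, 5.0], [4.0, 6.0, 7.0]]
summary = summarize(rows)
print(summary["mean"])                  # [3.0, 4.5, 6.0]
print(summary["variance"])              # [2.0, 4.5, 2.0]
print(weighted_mean(rows, [1.0, 0.0]))  # [2.0, 3.0, 5.0]
```

With weights of 1.0 and 0.0 the second vector contributes nothing, so the weighted mean equals the first vector. That is the kind of difference a weight column makes.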
0114
0115 <div class="codetabs">
0116 <div data-lang="scala" markdown="1">
0117 The following example demonstrates using [`Summarizer`](api/scala/org/apache/spark/ml/stat/Summarizer$.html)
0118 to compute the mean and variance for a vector column of the input DataFrame, with and without a weight column.
0119
0120 {% include_example scala/org/apache/spark/examples/ml/SummarizerExample.scala %}
0121 </div>
0122
0123 <div data-lang="java" markdown="1">
0124 The following example demonstrates using [`Summarizer`](api/java/org/apache/spark/ml/stat/Summarizer.html)
0125 to compute the mean and variance for a vector column of the input DataFrame, with and without a weight column.
0126
0127 {% include_example java/org/apache/spark/examples/ml/JavaSummarizerExample.java %}
0128 </div>
0129
0130 <div data-lang="python" markdown="1">
0131 Refer to the [`Summarizer` Python docs](api/python/index.html#pyspark.ml.stat.Summarizer) for details on the API.
0132
0133 {% include_example python/ml/summarizer_example.py %}
0134 </div>
0135
0136 </div>