0001 ---
0002 layout: global
0003 title: Basic Statistics
0004 displayTitle: Basic Statistics
0005 license: |
0006   Licensed to the Apache Software Foundation (ASF) under one or more
0007   contributor license agreements.  See the NOTICE file distributed with
0008   this work for additional information regarding copyright ownership.
0009   The ASF licenses this file to You under the Apache License, Version 2.0
0010   (the "License"); you may not use this file except in compliance with
0011   the License.  You may obtain a copy of the License at
0012  
0013      http://www.apache.org/licenses/LICENSE-2.0
0014  
0015   Unless required by applicable law or agreed to in writing, software
0016   distributed under the License is distributed on an "AS IS" BASIS,
0017   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
0018   See the License for the specific language governing permissions and
0019   limitations under the License.
0020 ---
0021 
0022 
0023 `\[
0024 \newcommand{\R}{\mathbb{R}}
0025 \newcommand{\E}{\mathbb{E}}
0026 \newcommand{\x}{\mathbf{x}}
0027 \newcommand{\y}{\mathbf{y}}
0028 \newcommand{\wv}{\mathbf{w}}
0029 \newcommand{\av}{\mathbf{\alpha}}
0030 \newcommand{\bv}{\mathbf{b}}
0031 \newcommand{\N}{\mathbb{N}}
0032 \newcommand{\id}{\mathbf{I}}
0033 \newcommand{\ind}{\mathbf{1}}
0034 \newcommand{\0}{\mathbf{0}}
0035 \newcommand{\unit}{\mathbf{e}}
0036 \newcommand{\one}{\mathbf{1}}
0037 \newcommand{\zero}{\mathbf{0}}
0038 \]`
0039 
0040 **Table of Contents**
0041 
0042 * This will become a table of contents (this text will be scraped).
0043 {:toc}
0044 
0045 ## Correlation
0046 
0047 Calculating the correlation between two series of data is a common operation in statistics. In `spark.ml`
0048 we provide the flexibility to calculate pairwise correlations among many series. The currently
0049 supported correlation methods are Pearson's and Spearman's.
0050 
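To make the difference between the two methods concrete, here is a minimal plain-Python sketch of a pairwise correlation matrix. It is an illustration only, not how `spark.ml` computes the result: the helper names (`pearson`, `average_ranks`, `spearman`, `correlation_matrix`) are hypothetical, and the sketch assumes Spearman's correlation is taken as Pearson's correlation applied to average ranks.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences.

    Assumes neither sequence is constant (a zero variance would divide by zero).
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


def correlation_matrix(columns, method=pearson):
    """Pairwise correlations among several series, as a list of rows."""
    return [[method(a, b) for b in columns] for a in columns]


# Three series: the second is a monotone but nonlinear function of the first.
series = [[1, 2, 3, 4], [1, 4, 9, 16], [4, 3, 2, 1]]
pearson_matrix = correlation_matrix(series, pearson)
spearman_matrix = correlation_matrix(series, spearman)
```

For the first two series Spearman's correlation is 1 because the relationship is monotone, while Pearson's is slightly below 1 because it is not linear.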
0051 <div class="codetabs">
0052 <div data-lang="scala" markdown="1">
0053 [`Correlation`](api/scala/org/apache/spark/ml/stat/Correlation$.html)
0054 computes the correlation matrix for the input Dataset of Vectors using the specified method.
0055 The output will be a DataFrame that contains the correlation matrix of the column of vectors.
0056 
0057 {% include_example scala/org/apache/spark/examples/ml/CorrelationExample.scala %}
0058 </div>
0059 
0060 <div data-lang="java" markdown="1">
0061 [`Correlation`](api/java/org/apache/spark/ml/stat/Correlation.html)
0062 computes the correlation matrix for the input Dataset of Vectors using the specified method.
0063 The output will be a DataFrame that contains the correlation matrix of the column of vectors.
0064 
0065 {% include_example java/org/apache/spark/examples/ml/JavaCorrelationExample.java %}
0066 </div>
0067 
0068 <div data-lang="python" markdown="1">
0069 [`Correlation`](api/python/pyspark.ml.html#pyspark.ml.stat.Correlation)
0070 computes the correlation matrix for the input Dataset of Vectors using the specified method.
0071 The output will be a DataFrame that contains the correlation matrix of the column of vectors.
0072 
0073 {% include_example python/ml/correlation_example.py %}
0074 </div>
0075 
0076 </div>
0077 
0078 ## Hypothesis testing
0079 
0080 Hypothesis testing is a powerful tool in statistics for determining whether a result is statistically
0081 significant, that is, whether it is unlikely to have occurred by chance. `spark.ml` currently supports
0082 Pearson's Chi-squared ($\chi^2$) tests for independence.
0083 
0084 `ChiSquareTest` conducts Pearson's independence test for every feature against the label.
0085 For each feature, the (feature, label) pairs are converted into a contingency matrix for which
0086 the Chi-squared statistic is computed. All label and feature values must be categorical.
0087 
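The contingency-matrix step described above can be sketched in plain Python for a single feature. This is an illustration under stated assumptions rather than the `ChiSquareTest` implementation: `contingency_table` and `chi_squared` are hypothetical helper names, and the sketch assumes every feature value and every label value occurs at least once, so that no expected count is zero.

```python
from collections import Counter


def contingency_table(features, labels):
    """Count (feature, label) pairs; rows are feature values, columns are label values."""
    counts = Counter(zip(features, labels))
    rows = sorted(set(features))
    cols = sorted(set(labels))
    return [[counts[(r, c)] for c in cols] for r in rows]


def chi_squared(table):
    """Pearson's Chi-squared statistic and degrees of freedom for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if feature and label were independent.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


# One categorical feature tested against a categorical label.
feature = [0, 0, 0, 1, 1, 1]
label = [0, 0, 1, 1, 1, 0]
table = contingency_table(feature, label)
statistic, dof = chi_squared(table)
```

Turning the statistic into a p-value requires the Chi-squared distribution with `dof` degrees of freedom, which the standard library does not provide; the sketch stops at the statistic for that reason.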
0088 <div class="codetabs">
0089 <div data-lang="scala" markdown="1">
0090 Refer to the [`ChiSquareTest` Scala docs](api/scala/org/apache/spark/ml/stat/ChiSquareTest$.html) for details on the API.
0091 
0092 {% include_example scala/org/apache/spark/examples/ml/ChiSquareTestExample.scala %}
0093 </div>
0094 
0095 <div data-lang="java" markdown="1">
0096 Refer to the [`ChiSquareTest` Java docs](api/java/org/apache/spark/ml/stat/ChiSquareTest.html) for details on the API.
0097 
0098 {% include_example java/org/apache/spark/examples/ml/JavaChiSquareTestExample.java %}
0099 </div>
0100 
0101 <div data-lang="python" markdown="1">
0102 Refer to the [`ChiSquareTest` Python docs](api/python/index.html#pyspark.ml.stat.ChiSquareTest) for details on the API.
0103 
0104 {% include_example python/ml/chi_square_test_example.py %}
0105 </div>
0106 
0107 </div>
0108 
0109 ## Summarizer
0110 
0111 We provide vector column summary statistics for `DataFrame` through `Summarizer`.
0112 Available metrics are the column-wise max, min, mean, sum, variance, std, and number of nonzeros,
0113 as well as the total count.
0114 
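As a rough picture of what these column-wise metrics mean, the following plain-Python sketch computes them for a small list of equal-length vectors. It is unweighted and illustrative only: `summarize` is a hypothetical helper, the dictionary keys are chosen for this example, and the variance is assumed to be the sample variance (dividing by `count - 1`), so at least two rows are required.

```python
import math


def summarize(vectors):
    """Column-wise summary of a list of equal-length numeric vectors (unweighted)."""
    count = len(vectors)
    columns = list(zip(*vectors))  # transpose rows into columns
    sums = [sum(col) for col in columns]
    means = [s / count for s in sums]
    # Sample variance: squared deviations divided by (count - 1).
    variances = [
        sum((v - m) ** 2 for v in col) / (count - 1)
        for col, m in zip(columns, means)
    ]
    return {
        "count": count,
        "max": [max(col) for col in columns],
        "min": [min(col) for col in columns],
        "sum": sums,
        "mean": means,
        "variance": variances,
        "std": [math.sqrt(v) for v in variances],
        "numNonZeros": [sum(1 for v in col if v != 0) for col in columns],
    }


# Each inner list plays the role of one row's vector.
rows = [[1.0, 0.0], [3.0, 4.0], [5.0, 0.0]]
summary = summarize(rows)
```

Every metric except `count` is a list with one entry per vector component, which mirrors the column-wise framing above.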
0115 <div class="codetabs">
0116 <div data-lang="scala" markdown="1">
0117 The following example demonstrates using [`Summarizer`](api/scala/org/apache/spark/ml/stat/Summarizer$.html)
0118 to compute the mean and variance for a vector column of the input dataframe, with and without a weight column.
0119 
0120 {% include_example scala/org/apache/spark/examples/ml/SummarizerExample.scala %}
0121 </div>
0122 
0123 <div data-lang="java" markdown="1">
0124 The following example demonstrates using [`Summarizer`](api/java/org/apache/spark/ml/stat/Summarizer.html)
0125 to compute the mean and variance for a vector column of the input dataframe, with and without a weight column.
0126 
0127 {% include_example java/org/apache/spark/examples/ml/JavaSummarizerExample.java %}
0128 </div>
0129 
0130 <div data-lang="python" markdown="1">
0131 Refer to the [`Summarizer` Python docs](api/python/index.html#pyspark.ml.stat.Summarizer) for details on the API.
0132 
0133 {% include_example python/ml/summarizer_example.py %}
0134 </div>
0135 
0136 </div>