MATH05, Descriptive statistics


Review : Descriptive statistics with Excel
Counting
Arithmetic mean(average), median
Variance and Standard deviation
Maximum, minimum
Quartile
Skewness and Kurtosis
Moment
Describe all at once
Statistical analysis

Review : Descriptive statistics with Excel

[Figures: descriptive_statistics, covariance_analysis, correlation_analysis]

Counting

Generating random number

import numpy as np

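# A seeded RandomState carries its own state, so a separate np.random.seed call is not needed
rv = np.random.RandomState(2019)
# 100 draws from a normal distribution (mean 168, standard deviation 15), rounded to whole numbers
np.round(rv.normal(168,15,(100,)))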

array([165., 180., 190., 188., 163., 178., 177., 172., 164., 182., 143.,
       163., 168., 160., 172., 165., 208., 175., 181., 160., 154., 169.,
       120., 184., 180., 175., 174., 175., 160., 155., 156., 161., 184.,
       171., 150., 154., 153., 177., 184., 172., 156., 153., 145., 150.,
       175., 165., 190., 156., 196., 161., 185., 159., 153., 155., 173.,
       173., 191., 162., 152., 158., 190., 136., 171., 173., 146., 158.,
       158., 159., 169., 145., 193., 178., 160., 153., 142., 143., 172.,
       170., 130., 165., 177., 190., 164., 167., 172., 160., 184., 158.,
       152., 175., 158., 156., 171., 164., 165., 160., 162., 140., 172.,
       148.])
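
np.random.RandomState is NumPy's legacy interface. As a hedged sketch, the same step can be written with the newer Generator interface. The names rng and sample are illustrative only, and the drawn values will not match the array above because the two interfaces use different bit generators.

```python
import numpy as np

# Seeded Generator: the same seed reproduces the same draws on every run
rng = np.random.default_rng(2019)

# 100 draws from a normal distribution (mean 168, standard deviation 15), rounded
sample = np.round(rng.normal(loc=168, scale=15, size=100))

print(sample.shape)  # (100,)
print(sample[:5])    # first few simulated values
```

Copying the printed values into a literal array, as the rest of this article does, keeps later results independent of the random number generator.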

import numpy as np

x = np.array([165., 180., 190., 188., 163., 178., 177., 172., 164., 182., 143.,
              163., 168., 160., 172., 165., 208., 175., 181., 160., 154., 169.,
              120., 184., 180., 175., 174., 175., 160., 155., 156., 161., 184.,
              171., 150., 154., 153., 177., 184., 172., 156., 153., 145., 150.,
              175., 165., 190., 156., 196., 161., 185., 159., 153., 155., 173.,
              173., 191., 162., 152., 158., 190., 136., 171., 173., 146., 158.,
              158., 159., 169., 145., 193., 178., 160., 153., 142., 143., 172.,
              170., 130., 165., 177., 190., 164., 167., 172., 160., 184., 158.,
              152., 175., 158., 156., 171., 164., 165., 160., 162., 140., 172.,
              148.])
len(x)

100

Visualization

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

x = np.array([165., 180., 190., 188., 163., 178., 177., 172., 164., 182., 143.,
              163., 168., 160., 172., 165., 208., 175., 181., 160., 154., 169.,
              120., 184., 180., 175., 174., 175., 160., 155., 156., 161., 184.,
              171., 150., 154., 153., 177., 184., 172., 156., 153., 145., 150.,
              175., 165., 190., 156., 196., 161., 185., 159., 153., 155., 173.,
              173., 191., 162., 152., 158., 190., 136., 171., 173., 146., 158.,
              158., 159., 169., 145., 193., 178., 160., 153., 142., 143., 172.,
              170., 130., 165., 177., 190., 164., 167., 172., 160., 184., 158.,
              152., 175., 158., 156., 171., 164., 165., 160., 162., 140., 172.,
              148.])

sns.set();
plt.hist(x)


Another method

import seaborn as sns
import numpy as np

x = np.array([165., 180., 190., 188., 163., 178., 177., 172., 164., 182., 143.,
              163., 168., 160., 172., 165., 208., 175., 181., 160., 154., 169.,
              120., 184., 180., 175., 174., 175., 160., 155., 156., 161., 184.,
              171., 150., 154., 153., 177., 184., 172., 156., 153., 145., 150.,
              175., 165., 190., 156., 196., 161., 185., 159., 153., 155., 173.,
              173., 191., 162., 152., 158., 190., 136., 171., 173., 146., 158.,
              158., 159., 169., 145., 193., 178., 160., 153., 142., 143., 172.,
              170., 130., 165., 177., 190., 164., 167., 172., 160., 184., 158.,
              152., 175., 158., 156., 171., 164., 165., 160., 162., 140., 172.,
              148.])

sns.set();
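# sns.distplot is deprecated in recent seaborn releases; histplot plus rugplot gives an equivalent histogram with a rug
sns.histplot(x, bins=10, kde=False)
sns.rugplot(x)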


Arithmetic mean(average), median

${\bar {x}}={\frac {1}{n}}\left(\sum _{i=1}^{n}{x_{i}}\right)={\frac {x_{1}+x_{2}+\cdots +x_{n}}{n}}$

If the distribution is symmetric about the sample mean, the sample median equals the sample mean.
If the distribution is symmetric and has a single peak (unimodal), the sample mode also equals the sample mean.
When data are added that make a symmetric distribution asymmetric, the sample mean is affected the most and the sample mode the least.
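
The three statements above can be checked on a small example. This is only a sketch: the toy values and the helper sample_mode are made up for illustration and are not part of the article's data.

```python
import numpy as np

def sample_mode(values):
    """Return the most frequent value (the smallest one if there is a tie)."""
    uniq, counts = np.unique(values, return_counts=True)
    return uniq[np.argmax(counts)]

# Symmetric, unimodal toy data: mean, median and mode coincide
sym = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5], dtype=float)
print(np.mean(sym), np.median(sym), sample_mode(sym))           # 3.0 3.0 3.0

# Three large values break the symmetry
skewed = np.append(sym, [20, 30, 40])
print(np.mean(skewed), np.median(skewed), sample_mode(skewed))  # 9.75 3.5 3.0
```

The mean moves from 3.0 to 9.75, the median only to 3.5, and the mode stays at 3.0.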

import numpy as np

x = np.array([165., 180., 190., 188., 163., 178., 177., 172., 164., 182., 143.,
              163., 168., 160., 172., 165., 208., 175., 181., 160., 154., 169.,
              120., 184., 180., 175., 174., 175., 160., 155., 156., 161., 184.,
              171., 150., 154., 153., 177., 184., 172., 156., 153., 145., 150.,
              175., 165., 190., 156., 196., 161., 185., 159., 153., 155., 173.,
              173., 191., 162., 152., 158., 190., 136., 171., 173., 146., 158.,
              158., 159., 169., 145., 193., 178., 160., 153., 142., 143., 172.,
              170., 130., 165., 177., 190., 164., 167., 172., 160., 184., 158.,
              152., 175., 158., 156., 171., 164., 165., 160., 162., 140., 172.,
              148.])
np.mean(x), np.median(x)

(165.76, 165.0)

x.mean()

165.76

Visualization

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

x = np.array([165., 180., 190., 188., 163., 178., 177., 172., 164., 182., 143.,
              163., 168., 160., 172., 165., 208., 175., 181., 160., 154., 169.,
              120., 184., 180., 175., 174., 175., 160., 155., 156., 161., 184.,
              171., 150., 154., 153., 177., 184., 172., 156., 153., 145., 150.,
              175., 165., 190., 156., 196., 161., 185., 159., 153., 155., 173.,
              173., 191., 162., 152., 158., 190., 136., 171., 173., 146., 158.,
              158., 159., 169., 145., 193., 178., 160., 153., 142., 143., 172.,
              170., 130., 165., 177., 190., 164., 167., 172., 160., 184., 158.,
              152., 175., 158., 156., 171., 164., 165., 160., 162., 140., 172.,
              148.])

sns.set();
plt.hist(x)
plt.axvline(x=np.mean(x), ls="--", c="r", linewidth=2, label="sample mean")
plt.axvline(x=np.median(x), ls="--", c="y", linewidth=2, label="sample median")
plt.legend()


Variance and Standard deviation

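Variance of a discrete random variable: $\operatorname {Var} (X)=\sum _{i=1}^{n}p_{i}\cdot (x_{i}-\mu )^{2}$. With equal weights $p_{i}=1/n$ and the sample mean in place of $\mu$, this gives the biased sample variance ${\hat {\sigma }}^{2}={\frac {1}{n}}\sum _{i=1}^{n}\left(x_{i}-{\bar {x}}\right)^{2}$. Unbiased sample variance: $s^{2}={\frac {n}{n-1}}{\hat {\sigma }}^{2}={\frac {n}{n-1}}\left({\frac {1}{n}}\sum _{i=1}^{n}\left(x_{i}-{\bar {x}}\right)^{2}\right)={\frac {1}{n-1}}\sum _{i=1}^{n}\left(x_{i}-{\bar {x}}\right)^{2}$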

import numpy as np

x = np.array([18,   5,  10,  23,  19,  -8,  10,   0,   0,   5,   2,  15,   8,
              2,   5,   4,  15,  -1,   4,  -7, -24,   7,   9,  -6,  23, -13])
              
# np.var(x)         : for population variance(biased sample variance)
# np.var(x, ddof=1) : for unbiased sample variance
np.var(x), np.var(x, ddof=1)

(115.23224852071006, 119.84153846153846)

# x.var()       : for population variance(biased sample variance)
# x.var(ddof=1) : for unbiased sample variance
x.var(), x.var(ddof=1)

(115.23224852071006, 119.84153846153846)
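
To connect these outputs to the variance formulas, here is a minimal sketch that computes both variances directly from the squared deviations. The variable names are illustrative.

```python
import numpy as np

x = np.array([18,   5,  10,  23,  19,  -8,  10,   0,   0,   5,   2,  15,   8,
              2,   5,   4,  15,  -1,   4,  -7, -24,   7,   9,  -6,  23, -13])

n = len(x)
squared_dev = (x - x.mean()) ** 2           # (x_i - mean)^2 for every observation

biased_var = squared_dev.sum() / n          # divisor n, same as np.var(x)
unbiased_var = squared_dev.sum() / (n - 1)  # divisor n - 1, same as np.var(x, ddof=1)

# The two estimates differ only by the factor n / (n - 1)
print(np.isclose(unbiased_var, biased_var * n / (n - 1)))  # True
print(np.isclose(biased_var, np.var(x)),
      np.isclose(unbiased_var, np.var(x, ddof=1)))          # True True
```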

Standard deviation of a discrete random variable: $\sigma ={\sqrt {\sum _{i=1}^{N}p_{i}(x_{i}-\mu )^{2}}},{\rm {\ \ where\ \ }}\mu =\sum _{i=1}^{N}p_{i}x_{i}.$ Sample standard deviation: $s={\sqrt {{\frac {1}{n-1}}\sum _{i=1}^{n}\left(x_{i}-{\bar {x}}\right)^{2}}}$

import numpy as np

x = np.array([18,   5,  10,  23,  19,  -8,  10,   0,   0,   5,   2,  15,   8,
              2,   5,   4,  15,  -1,   4,  -7, -24,   7,   9,  -6,  23, -13])
              
# np.std(x)         : population standard deviation (square root of the biased sample variance)
# np.std(x, ddof=1) : sample standard deviation (square root of the unbiased sample variance)
np.std(x), np.std(x, ddof=1)

(10.734628476137871, 10.947216014199157)

# x.std()       : population standard deviation (square root of the biased sample variance)
# x.std(ddof=1) : sample standard deviation (square root of the unbiased sample variance)
x.std(), x.std(ddof=1)

(10.734628476137871, 10.947216014199157)

import numpy as np

x = np.array([165., 180., 190., 188., 163., 178., 177., 172., 164., 182., 143.,
              163., 168., 160., 172., 165., 208., 175., 181., 160., 154., 169.,
              120., 184., 180., 175., 174., 175., 160., 155., 156., 161., 184.,
              171., 150., 154., 153., 177., 184., 172., 156., 153., 145., 150.,
              175., 165., 190., 156., 196., 161., 185., 159., 153., 155., 173.,
              173., 191., 162., 152., 158., 190., 136., 171., 173., 146., 158.,
              158., 159., 169., 145., 193., 178., 160., 153., 142., 143., 172.,
              170., 130., 165., 177., 190., 164., 167., 172., 160., 184., 158.,
              152., 175., 158., 156., 171., 164., 165., 160., 162., 140., 172.,
              148.])

x.mean(), x.var(), x.std()

(165.76, 224.16240000000002, 14.972053967308561)

Visualization

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

x = np.array([165., 180., 190., 188., 163., 178., 177., 172., 164., 182., 143.,
              163., 168., 160., 172., 165., 208., 175., 181., 160., 154., 169.,
              120., 184., 180., 175., 174., 175., 160., 155., 156., 161., 184.,
              171., 150., 154., 153., 177., 184., 172., 156., 153., 145., 150.,
              175., 165., 190., 156., 196., 161., 185., 159., 153., 155., 173.,
              173., 191., 162., 152., 158., 190., 136., 171., 173., 146., 158.,
              158., 159., 169., 145., 193., 178., 160., 153., 142., 143., 172.,
              170., 130., 165., 177., 190., 164., 167., 172., 160., 184., 158.,
              152., 175., 158., 156., 171., 164., 165., 160., 162., 140., 172.,
              148.])

plt.scatter(range(len(x)), x)
plt.plot([0, 100], [165.76, 165.76], c ="k", label="mean")
plt.annotate('',
             xytext = (50, 165.76),
             xy = (50, 180.73205396730856),
             arrowprops = {'facecolor' : 'red', 
                           'edgecolor' : 'red', 
                           'shrink' : 0.05})
plt.annotate('',
             xytext = (50, 165.76),
             xy = (50, 150.78794603269142),
             arrowprops = {'facecolor' : 'red',
                           'edgecolor' : 'red',
                           'shrink' : 0.05 })
plt.text(55, 170, 'Standard Deviation', c='red')
plt.legend()


Maximum, minimum

import numpy as np

x = np.array([18,   5,  10,  23,  19,  -8,  10,   0,   0,   5,   2,  15,   8,
              2,   5,   4,  15,  -1,   4,  -7, -24,   7,   9,  -6,  23, -13])
np.max(x), np.min(x)

(23, -24)

x.max(), x.min()

(23, -24)

Quartile

import numpy as np

x = np.array([18,   5,  10,  23,  19,  -8,  10,   0,   0,   5,   2,  15,   8,
              2,   5,   4,  15,  -1,   4,  -7, -24,   7,   9,  -6,  23, -13])

print(np.percentile(x, 0),
      np.percentile(x, 25),
      np.percentile(x, 50),
      np.percentile(x, 75),
      np.percentile(x, 100))
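
np.percentile also accepts a list of percentiles, which returns all five values at once. The sketch below uses that form and adds the interquartile range. The interquartile range is not covered in this article and is included only as a common companion to the quartiles.

```python
import numpy as np

x = np.array([18,   5,  10,  23,  19,  -8,  10,   0,   0,   5,   2,  15,   8,
              2,   5,   4,  15,  -1,   4,  -7, -24,   7,   9,  -6,  23, -13])

# Minimum, first quartile, median, third quartile, maximum
q0, q1, q2, q3, q4 = np.percentile(x, [0, 25, 50, 75, 100])
print(q0, q1, q2, q3, q4)  # -24.0 0.0 5.0 10.0 23.0

# Interquartile range: spread of the middle half of the data
iqr = q3 - q1
print(iqr)                 # 10.0
```

The 0th, 50th and 100th percentiles coincide with x.min(), np.median(x) and x.max().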

Skewness and Kurtosis

Pearson’s moment coefficient of skewness $\gamma _{1}=\operatorname {E} \left[\left({\frac {X-\mu }{\sigma }}\right)^{3}\right]={\frac {\mu _{3}}{\sigma ^{3}}}={\frac {\operatorname {E} \left[(X-\mu )^{3}\right]}{(\operatorname {E} \left[(X-\mu )^{2}\right])^{3/2}}}={\frac {\kappa _{3}}{\kappa _{2}^{3/2}}}$ ${\begin{aligned}\gamma _{1}&=\operatorname {E} \left[\left({\frac {X-\mu }{\sigma }}\right)^{3}\right]\\&={\frac {\operatorname {E} [X^{3}]-3\mu \operatorname {E} [X^{2}]+3\mu ^{2}\operatorname {E} [X]-\mu ^{3}}{\sigma ^{3}}}\\&={\frac {\operatorname {E} [X^{3}]-3\mu (\operatorname {E} [X^{2}]-\mu \operatorname {E} [X])-\mu ^{3}}{\sigma ^{3}}}\\&={\frac {\operatorname {E} [X^{3}]-3\mu \sigma ^{2}-\mu ^{3}}{\sigma ^{3}}}.\end{aligned}}$

Sample skewness

from scipy import stats
import numpy as np

x = np.array([18,   5,  10,  23,  19,  -8,  10,   0,   0,   5,   2,  15,   8,
              2,   5,   4,  15,  -1,   4,  -7, -24,   7,   9,  -6,  23, -13])
stats.skew(x)

-0.4762339485461929
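
As a sketch of what this number means, the moment coefficient $\mu _{3}/\sigma ^{3}$ can be computed by hand from the sample central moments with divisor n. This corresponds to the default (biased) setting of stats.skew. The variable names are illustrative.

```python
import numpy as np

x = np.array([18,   5,  10,  23,  19,  -8,  10,   0,   0,   5,   2,  15,   8,
              2,   5,   4,  15,  -1,   4,  -7, -24,   7,   9,  -6,  23, -13])

dev = x - x.mean()          # deviations from the sample mean
m2 = np.mean(dev ** 2)      # second central moment (biased variance)
m3 = np.mean(dev ** 3)      # third central moment

skewness = m3 / m2 ** 1.5   # m3 / sigma^3
print(skewness)             # approximately -0.4762
```

A negative value indicates that the left tail is the longer one, which fits the single low value of -24 in this sample.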

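Kurtosis $\operatorname {Kurt} [X]=\operatorname {E} \left[\left({\frac {X-\mu }{\sigma }}\right)^{4}\right]={\frac {\mu _{4}}{\sigma ^{4}}}={\frac {\operatorname {E} [(X-\mu )^{4}]}{(\operatorname {E} [(X-\mu )^{2}])^{2}}}$. By default, stats.kurtosis returns the excess kurtosis $\operatorname {Kurt} [X]-3$, so a normal distribution scores 0.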

from scipy import stats
import numpy as np

x = np.array([18,   5,  10,  23,  19,  -8,  10,   0,   0,   5,   2,  15,   8,
              2,   5,   4,  15,  -1,   4,  -7, -24,   7,   9,  -6,  23, -13])
stats.kurtosis(x)

0.37443381660038977
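
The output above can be reproduced by hand. This is a minimal sketch with illustrative variable names: it computes Pearson's kurtosis $\mu _{4}/\sigma ^{4}$ from the central moments and then subtracts 3 to obtain the excess kurtosis.

```python
import numpy as np

x = np.array([18,   5,  10,  23,  19,  -8,  10,   0,   0,   5,   2,  15,   8,
              2,   5,   4,  15,  -1,   4,  -7, -24,   7,   9,  -6,  23, -13])

dev = x - x.mean()                     # deviations from the sample mean
m2 = np.mean(dev ** 2)                 # second central moment
m4 = np.mean(dev ** 4)                 # fourth central moment

pearson_kurtosis = m4 / m2 ** 2        # equals 3 for a normal distribution
excess_kurtosis = pearson_kurtosis - 3

print(pearson_kurtosis)                # approximately 3.3744
print(excess_kurtosis)                 # approximately 0.3744
```

Passing fisher=False to stats.kurtosis requests Pearson's form instead of the excess form.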

Moment

$\mu _{n}=\int _{-\infty }^{\infty }(x-c)^{n}\,f(x)\,\mathrm {d} x.$

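Raw sample moment of order $k$ (about the origin): ${\frac {1}{n}}\sum _{i=1}^{n}X_{i}^{k}$. Central sample moment of order $k$ (about the sample mean): ${\frac {1}{n}}\sum _{i=1}^{n}\left(X_{i}-{\bar {X}}\right)^{k}$. stats.moment computes the central version, which is why the first moment below is 0 and the second equals the biased variance. Here $n$ is the sample size and $k$ the order, whereas the integral above uses $n$ for the order.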

from scipy import stats
import numpy as np

x = np.array([18,   5,  10,  23,  19,  -8,  10,   0,   0,   5,   2,  15,   8,
              2,   5,   4,  15,  -1,   4,  -7, -24,   7,   9,  -6,  23, -13])
stats.moment(x, 1), stats.moment(x, 2), stats.moment(x, 3), stats.moment(x, 4)

(0.0, 115.23224852071006, -589.0896677287208, 44807.32190968453)
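
To make the difference between raw and central moments concrete, here is a small sketch. The helpers raw_moment and central_moment are hypothetical names written for this example, not SciPy functions.

```python
import numpy as np

x = np.array([18,   5,  10,  23,  19,  -8,  10,   0,   0,   5,   2,  15,   8,
              2,   5,   4,  15,  -1,   4,  -7, -24,   7,   9,  -6,  23, -13],
             dtype=float)

def raw_moment(values, k):
    """k-th sample moment about the origin: mean of x_i^k."""
    return np.mean(values ** k)

def central_moment(values, k):
    """k-th sample moment about the sample mean: mean of (x_i - mean)^k."""
    return np.mean((values - values.mean()) ** k)

print(raw_moment(x, 1), raw_moment(x, 2))          # sample mean and mean of squares
print(central_moment(x, 1), central_moment(x, 2))  # about 0 and the biased variance

# Second central moment = second raw moment - (first raw moment)^2
print(np.isclose(central_moment(x, 2), raw_moment(x, 2) - raw_moment(x, 1) ** 2))  # True
```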

Describe all at once

import numpy as np
from scipy import stats

x = np.array([18,   5,  10,  23,  19,  -8,  10,   0,   0,   5,   2,  15,   8,
              2,   5,   4,  15,  -1,   4,  -7, -24,   7,   9,  -6,  23, -13])
stats.describe(x)

DescribeResult(nobs=26, minmax=(-24, 23), mean=4.8076923076923075, variance=119.84153846153846, skewness=-0.4762339485461929, kurtosis=0.37443381660038977)
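
One detail in this result is easy to miss: the reported variance, 119.84..., is the unbiased one (divisor n - 1), not the default of np.var. The sketch below rebuilds the first four fields with NumPy alone to make that visible. The dictionary summary is an illustrative container, not the DescribeResult type.

```python
import numpy as np

x = np.array([18,   5,  10,  23,  19,  -8,  10,   0,   0,   5,   2,  15,   8,
              2,   5,   4,  15,  -1,   4,  -7, -24,   7,   9,  -6,  23, -13])

summary = {
    "nobs": x.size,                 # number of observations
    "minmax": (x.min(), x.max()),   # smallest and largest value
    "mean": x.mean(),
    "variance": x.var(ddof=1),      # divisor n - 1, matches the value printed above
}
print(summary)

# With the default ddof=0 the result is the biased variance instead
print(x.var())                      # approximately 115.23
```

The skewness and kurtosis fields, by contrast, match the default (biased) outputs of stats.skew and stats.kurtosis shown earlier.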

Statistical analysis

Data-Preprocessing

Dataset

Green region

Violet region

Orange region

Red region

