MATH05, Descriptive statistics
List of posts to read before reading this article
Contents
- Review : Descriptive statistics with Excel
- Counting
- Arithmetic mean(average), median
- Variance and Standard deviation
- Maximum, minimum
- Quartile
- Skewness and Kurtosis
- Moment
- Describe all at once
- Statistical analysis
Review : Descriptive statistics with Excel
Counting
Generating random numbers
import numpy as np
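# A dedicated, seeded generator makes the sample reproducible;
# seeding the global np.random state as well is not needed.
rv = np.random.RandomState(2019)
# 100 draws from a normal distribution (mean 168, std 15), rounded to whole numbers
np.round(rv.normal(168, 15, (100,)))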
array([165., 180., 190., 188., 163., 178., 177., 172., 164., 182., 143.,
163., 168., 160., 172., 165., 208., 175., 181., 160., 154., 169.,
120., 184., 180., 175., 174., 175., 160., 155., 156., 161., 184.,
171., 150., 154., 153., 177., 184., 172., 156., 153., 145., 150.,
175., 165., 190., 156., 196., 161., 185., 159., 153., 155., 173.,
173., 191., 162., 152., 158., 190., 136., 171., 173., 146., 158.,
158., 159., 169., 145., 193., 178., 160., 153., 142., 143., 172.,
170., 130., 165., 177., 190., 164., 167., 172., 160., 184., 158.,
152., 175., 158., 156., 171., 164., 165., 160., 162., 140., 172.,
148.])
import numpy as np
x = np.array([165., 180., 190., 188., 163., 178., 177., 172., 164., 182., 143.,
163., 168., 160., 172., 165., 208., 175., 181., 160., 154., 169.,
120., 184., 180., 175., 174., 175., 160., 155., 156., 161., 184.,
171., 150., 154., 153., 177., 184., 172., 156., 153., 145., 150.,
175., 165., 190., 156., 196., 161., 185., 159., 153., 155., 173.,
173., 191., 162., 152., 158., 190., 136., 171., 173., 146., 158.,
158., 159., 169., 145., 193., 178., 160., 153., 142., 143., 172.,
170., 130., 165., 177., 190., 164., 167., 172., 160., 184., 158.,
152., 175., 158., 156., 171., 164., 165., 160., 162., 140., 172.,
148.])
len(x)
100
Visualization
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
x = np.array([165., 180., 190., 188., 163., 178., 177., 172., 164., 182., 143.,
163., 168., 160., 172., 165., 208., 175., 181., 160., 154., 169.,
120., 184., 180., 175., 174., 175., 160., 155., 156., 161., 184.,
171., 150., 154., 153., 177., 184., 172., 156., 153., 145., 150.,
175., 165., 190., 156., 196., 161., 185., 159., 153., 155., 173.,
173., 191., 162., 152., 158., 190., 136., 171., 173., 146., 158.,
158., 159., 169., 145., 193., 178., 160., 153., 142., 143., 172.,
170., 130., 165., 177., 190., 164., 167., 172., 160., 184., 158.,
152., 175., 158., 156., 171., 164., 165., 160., 162., 140., 172.,
148.])
sns.set();
plt.hist(x)
Another method
import seaborn as sns
import numpy as np
x = np.array([165., 180., 190., 188., 163., 178., 177., 172., 164., 182., 143.,
163., 168., 160., 172., 165., 208., 175., 181., 160., 154., 169.,
120., 184., 180., 175., 174., 175., 160., 155., 156., 161., 184.,
171., 150., 154., 153., 177., 184., 172., 156., 153., 145., 150.,
175., 165., 190., 156., 196., 161., 185., 159., 153., 155., 173.,
173., 191., 162., 152., 158., 190., 136., 171., 173., 146., 158.,
158., 159., 169., 145., 193., 178., 160., 153., 142., 143., 172.,
170., 130., 165., 177., 190., 164., 167., 172., 160., 184., 158.,
152., 175., 158., 156., 171., 164., 165., 160., 162., 140., 172.,
148.])
sns.set();
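# sns.distplot is deprecated in recent seaborn releases;
# histplot plus rugplot draws the same histogram with a rug.
sns.histplot(x, bins=10, kde=False)
sns.rugplot(x)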
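The bin counts behind the histograms above can also be inspected without plotting. The following is a minimal sketch, assuming NumPy's `np.histogram` with 10 equal-width bins (the same bin count used in the plots); the names `counts` and `edges` are chosen here for illustration only.

```python
import numpy as np

x = np.array([165., 180., 190., 188., 163., 178., 177., 172., 164., 182., 143.,
              163., 168., 160., 172., 165., 208., 175., 181., 160., 154., 169.,
              120., 184., 180., 175., 174., 175., 160., 155., 156., 161., 184.,
              171., 150., 154., 153., 177., 184., 172., 156., 153., 145., 150.,
              175., 165., 190., 156., 196., 161., 185., 159., 153., 155., 173.,
              173., 191., 162., 152., 158., 190., 136., 171., 173., 146., 158.,
              158., 159., 169., 145., 193., 178., 160., 153., 142., 143., 172.,
              170., 130., 165., 177., 190., 164., 167., 172., 160., 184., 158.,
              152., 175., 158., 156., 171., 164., 165., 160., 162., 140., 172.,
              148.])

# counts[i] is the number of values falling between edges[i] and edges[i + 1]
counts, edges = np.histogram(x, bins=10)

# Every bin is half-open [left, right) except the last, which includes its right edge
for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(f"{left:6.1f} to {right:6.1f} : {n}")
```

The counts sum to `len(x)`, so this is a quick check that no observation was dropped by the binning.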
Arithmetic mean(average), median
If the distribution is symmetric about the sample mean, the sample median equals the sample mean.
If the distribution is symmetric and has a single peak (unimodal), the sample mode also equals the sample mean.
When data are added that make a symmetric distribution asymmetric, the sample mean is affected most and the sample mode is affected least.
import numpy as np
x = np.array([165., 180., 190., 188., 163., 178., 177., 172., 164., 182., 143.,
163., 168., 160., 172., 165., 208., 175., 181., 160., 154., 169.,
120., 184., 180., 175., 174., 175., 160., 155., 156., 161., 184.,
171., 150., 154., 153., 177., 184., 172., 156., 153., 145., 150.,
175., 165., 190., 156., 196., 161., 185., 159., 153., 155., 173.,
173., 191., 162., 152., 158., 190., 136., 171., 173., 146., 158.,
158., 159., 169., 145., 193., 178., 160., 153., 142., 143., 172.,
170., 130., 165., 177., 190., 164., 167., 172., 160., 184., 158.,
152., 175., 158., 156., 171., 164., 165., 160., 162., 140., 172.,
148.])
np.mean(x), np.median(x)
(165.76, 165.0)
x.mean()
165.76
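The statement that the mean reacts most and the mode least to asymmetry can be checked on a toy sample. This is a minimal sketch with made-up data; the arrays `symmetric` and `skewed` and the helper `sample_mode` are hypothetical names introduced for illustration and are not part of the original examples.

```python
import numpy as np

def sample_mode(values):
    """Return the most frequent value (the smallest one if there is a tie)."""
    uniques, counts = np.unique(values, return_counts=True)
    return uniques[np.argmax(counts)]

# Hypothetical symmetric, unimodal sample centred on 3
symmetric = np.array([1., 2., 2., 3., 3., 3., 4., 4., 5.])
# Appending three large values makes the sample right-skewed
skewed = np.append(symmetric, [20., 30., 40.])

for name, data in [("symmetric", symmetric), ("skewed", skewed)]:
    print(name, np.mean(data), np.median(data), sample_mode(data))
```

For the symmetric sample all three statistics are 3.0. After the large values are appended the mean moves to 9.75, the median only to 3.5, and the mode stays at 3.0.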
Visualization
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
x = np.array([165., 180., 190., 188., 163., 178., 177., 172., 164., 182., 143.,
163., 168., 160., 172., 165., 208., 175., 181., 160., 154., 169.,
120., 184., 180., 175., 174., 175., 160., 155., 156., 161., 184.,
171., 150., 154., 153., 177., 184., 172., 156., 153., 145., 150.,
175., 165., 190., 156., 196., 161., 185., 159., 153., 155., 173.,
173., 191., 162., 152., 158., 190., 136., 171., 173., 146., 158.,
158., 159., 169., 145., 193., 178., 160., 153., 142., 143., 172.,
170., 130., 165., 177., 190., 164., 167., 172., 160., 184., 158.,
152., 175., 158., 156., 171., 164., 165., 160., 162., 140., 172.,
148.])
sns.set();
plt.hist(x)
plt.axvline(x=np.mean(x), ls="--", c="r", linewidth=2, label="sample mean")
plt.axvline(x=np.median(x), ls="--", c="y", linewidth=2, label="sample median")
plt.legend()
Variance and Standard deviation
Variance
import numpy as np
x = np.array([18, 5, 10, 23, 19, -8, 10, 0, 0, 5, 2, 15, 8,
2, 5, 4, 15, -1, 4, -7, -24, 7, 9, -6, 23, -13])
# np.var(x) : for population variance(biased sample variance)
# np.var(x, ddof=1) : for unbiased sample variance
np.var(x), np.var(x, ddof=1)
(115.23224852071006, 119.84153846153846)
# x.var() : for population variance(biased sample variance)
# x.var(ddof=1) : for unbiased sample variance
x.var(), x.var(ddof=1)
(115.23224852071006, 119.84153846153846)
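To make the role of `ddof` concrete, the two variances can be computed by hand. This is a minimal sketch of the usual definitions (sum of squared deviations divided by n, or by n - 1); the variable names are illustrative.

```python
import numpy as np

x = np.array([18, 5, 10, 23, 19, -8, 10, 0, 0, 5, 2, 15, 8,
              2, 5, 4, 15, -1, 4, -7, -24, 7, 9, -6, 23, -13])

n = len(x)
squared_dev = (x - x.mean()) ** 2

biased = squared_dev.sum() / n          # divides by n, like np.var(x)
unbiased = squared_dev.sum() / (n - 1)  # divides by n - 1, like np.var(x, ddof=1)

print(biased, unbiased)
```

The two results differ only by the factor n / (n - 1), which is why the gap shrinks as the sample grows.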
Standard deviation
import numpy as np
x = np.array([18, 5, 10, 23, 19, -8, 10, 0, 0, 5, 2, 15, 8,
2, 5, 4, 15, -1, 4, -7, -24, 7, 9, -6, 23, -13])
# np.std(x) : population standard deviation (square root of the biased sample variance)
# np.std(x, ddof=1) : sample standard deviation (square root of the unbiased sample variance; not itself an unbiased estimator)
np.std(x), np.std(x, ddof=1)
(10.734628476137871, 10.947216014199157)
# x.std() : population standard deviation (square root of the biased sample variance)
# x.std(ddof=1) : sample standard deviation (square root of the unbiased sample variance; not itself an unbiased estimator)
x.std(), x.std(ddof=1)
(10.734628476137871, 10.947216014199157)
import numpy as np
x = np.array([165., 180., 190., 188., 163., 178., 177., 172., 164., 182., 143.,
163., 168., 160., 172., 165., 208., 175., 181., 160., 154., 169.,
120., 184., 180., 175., 174., 175., 160., 155., 156., 161., 184.,
171., 150., 154., 153., 177., 184., 172., 156., 153., 145., 150.,
175., 165., 190., 156., 196., 161., 185., 159., 153., 155., 173.,
173., 191., 162., 152., 158., 190., 136., 171., 173., 146., 158.,
158., 159., 169., 145., 193., 178., 160., 153., 142., 143., 172.,
170., 130., 165., 177., 190., 164., 167., 172., 160., 184., 158.,
152., 175., 158., 156., 171., 164., 165., 160., 162., 140., 172.,
148.])
x.mean(), x.var(), x.std()
(165.76, 224.16240000000002, 14.972053967308561)
Visualization
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
x = np.array([165., 180., 190., 188., 163., 178., 177., 172., 164., 182., 143.,
163., 168., 160., 172., 165., 208., 175., 181., 160., 154., 169.,
120., 184., 180., 175., 174., 175., 160., 155., 156., 161., 184.,
171., 150., 154., 153., 177., 184., 172., 156., 153., 145., 150.,
175., 165., 190., 156., 196., 161., 185., 159., 153., 155., 173.,
173., 191., 162., 152., 158., 190., 136., 171., 173., 146., 158.,
158., 159., 169., 145., 193., 178., 160., 153., 142., 143., 172.,
170., 130., 165., 177., 190., 164., 167., 172., 160., 184., 158.,
152., 175., 158., 156., 171., 164., 165., 160., 162., 140., 172.,
148.])
plt.scatter(range(len(x)), x)
plt.plot([0, 100], [165.76, 165.76], c ="k", label="mean")
plt.annotate('',
xytext = (50, 165.76),
xy = (50, 180.73205396730856),
arrowprops = {'facecolor' : 'red',
'edgecolor' : 'red',
'shrink' : 0.05})
plt.annotate('',
xytext = (50, 165.76),
xy = (50, 150.78794603269142),
arrowprops = {'facecolor' : 'red',
'edgecolor' : 'red',
'shrink' : 0.05 })
plt.text(55, 170, 'Standard Deviation', c='red')
plt.legend()
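The hard-coded arrow endpoints in the plot above appear to be the sample mean plus and minus one standard deviation, using the values printed earlier by `x.mean()` and `x.std()`. A minimal sketch of that arithmetic, with the two numbers copied from the earlier output:

```python
# Values taken from the earlier output of x.mean() and x.std()
mean = 165.76
std = 14.972053967308561

upper = mean + std  # tip of the upward arrow
lower = mean - std  # tip of the downward arrow

print(lower, upper)
```

Computing the endpoints from `x.mean()` and `x.std()` inside the plotting code, rather than pasting the numbers, would keep the figure correct if the data change.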
Maximum, minimum
import numpy as np
x = np.array([18, 5, 10, 23, 19, -8, 10, 0, 0, 5, 2, 15, 8,
2, 5, 4, 15, -1, 4, -7, -24, 7, 9, -6, 23, -13])
np.max(x), np.min(x)
(23, -24)
x.max(), x.min()
(23, -24)
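The maximum and minimum also give the range of the sample, and `np.argmax` / `np.argmin` report where the extremes occur. A short sketch on the same array (the names `data_range`, `i_max`, and `i_min` are illustrative):

```python
import numpy as np

x = np.array([18, 5, 10, 23, 19, -8, 10, 0, 0, 5, 2, 15, 8,
              2, 5, 4, 15, -1, 4, -7, -24, 7, 9, -6, 23, -13])

data_range = x.max() - x.min()  # spread between the two extremes
i_max = np.argmax(x)            # index of the first occurrence of the maximum
i_min = np.argmin(x)            # index of the first occurrence of the minimum

print(data_range, i_max, i_min)
```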
Quartile
import numpy as np
x = np.array([18, 5, 10, 23, 19, -8, 10, 0, 0, 5, 2, 15, 8,
2, 5, 4, 15, -1, 4, -7, -24, 7, 9, -6, 23, -13])
print(np.percentile(x, 0),
np.percentile(x, 25),
np.percentile(x, 50),
np.percentile(x, 75),
np.percentile(x, 100))
-24.0 0.0 5.0 10.0 23.0
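The quartiles above can be combined into the interquartile range (IQR), which is often used to flag unusual observations. The sketch below assumes the common 1.5 × IQR rule for the fences; that rule is not part of the original example and is used here only as an illustration.

```python
import numpy as np

x = np.array([18, 5, 10, 23, 19, -8, 10, 0, 0, 5, 2, 15, 8,
              2, 5, 4, 15, -1, 4, -7, -24, 7, 9, -6, 23, -13])

# np.percentile accepts a list and returns all requested percentiles at once
q1, q2, q3 = np.percentile(x, [25, 50, 75])
iqr = q3 - q1

# Assumed rule of thumb: points more than 1.5 * IQR beyond the quartiles are flagged
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = x[(x < lower_fence) | (x > upper_fence)]

print(iqr, lower_fence, upper_fence, outliers)
```

With this data the IQR is 10.0, the fences are -15.0 and 25.0, and only -24 falls outside them.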
Skewness and Kurtosis
Pearson’s moment coefficient of skewness
Sample skewness
from scipy import stats
import numpy as np
x = np.array([18, 5, 10, 23, 19, -8, 10, 0, 0, 5, 2, 15, 8,
2, 5, 4, 15, -1, 4, -7, -24, 7, 9, -6, 23, -13])
stats.skew(x)
-0.4762339485461929
Kurtosis
from scipy import stats
import numpy as np
x = np.array([18, 5, 10, 23, 19, -8, 10, 0, 0, 5, 2, 15, 8,
2, 5, 4, 15, -1, 4, -7, -24, 7, 9, -6, 23, -13])
stats.kurtosis(x)
0.37443381660038977
Moment
Sample central moment
from scipy import stats
import numpy as np
x = np.array([18, 5, 10, 23, 19, -8, 10, 0, 0, 5, 2, 15, 8,
2, 5, 4, 15, -1, 4, -7, -24, 7, 9, -6, 23, -13])
stats.moment(x, 1), stats.moment(x, 2), stats.moment(x, 3), stats.moment(x, 4)
(0.0, 115.23224852071006, -589.0896677287208, 44807.32190968453)
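The central moments above are enough to reproduce the skewness and kurtosis reported earlier. The following is a minimal NumPy-only sketch, assuming the biased (population-style) definitions, skewness = m3 / m2^(3/2) and excess kurtosis = m4 / m2^2 - 3; the helper `central_moment` is a hypothetical name introduced here.

```python
import numpy as np

x = np.array([18, 5, 10, 23, 19, -8, 10, 0, 0, 5, 2, 15, 8,
              2, 5, 4, 15, -1, 4, -7, -24, 7, 9, -6, 23, -13])

def central_moment(values, k):
    """k-th sample central moment: mean of (value - mean) ** k."""
    return np.mean((values - values.mean()) ** k)

m2 = central_moment(x, 2)
m3 = central_moment(x, 3)
m4 = central_moment(x, 4)

skewness = m3 / m2 ** 1.5             # third moment scaled by std ** 3
excess_kurtosis = m4 / m2 ** 2 - 3.0  # 3 is subtracted so a normal distribution gives 0

print(skewness, excess_kurtosis)
```

The second central moment equals the population variance, so `m2` matches `np.var(x)`, and the two ratios agree with the `stats.skew(x)` and `stats.kurtosis(x)` outputs shown earlier.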
Describe all at once
import numpy as np
from scipy import stats
x = np.array([18, 5, 10, 23, 19, -8, 10, 0, 0, 5, 2, 15, 8,
2, 5, 4, 15, -1, 4, -7, -24, 7, 9, -6, 23, -13])
stats.describe(x)
DescribeResult(nobs=26, minmax=(-24, 23), mean=4.8076923076923075, variance=119.84153846153846, skewness=-0.4762339485461929, kurtosis=0.37443381660038977)
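Note that the `variance` field in the output above matches `np.var(x, ddof=1)`, not the default `np.var(x)`. A short sketch of reading individual fields from the result and switching the divisor with the `ddof` argument of `stats.describe`:

```python
import numpy as np
from scipy import stats

x = np.array([18, 5, 10, 23, 19, -8, 10, 0, 0, 5, 2, 15, 8,
              2, 5, 4, 15, -1, 4, -7, -24, 7, 9, -6, 23, -13])

res = stats.describe(x)           # default ddof=1: unbiased sample variance
res0 = stats.describe(x, ddof=0)  # ddof=0: population (biased) variance

# The result is a named tuple, so fields can be read by name
print(res.nobs, res.mean, res.variance)
print(res0.variance, np.var(x))
```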
Statistical analysis
Data-Preprocessing
Dataset
Green region
Violet region
Orange region
Red region
List of posts followed by this article
Reference
- deterministic and stochastic data, random variable
- expectation and transform of random variable
- (standard) variance
- multivariate random variables
- covariance and correlation
- conditional expectation, prediction