MATH05, Analysis of variance
Back to the previous page | Statistics
List of posts to read before reading this article
Contents
- One-way analysis of variance
- Tukey’s multi-comparison method, Holm-Bonferroni Method
- One-way ANOVA for regression analysis, model with constant
- One-way ANOVA for regression analysis, model without constant
- Comparing Model Using F Test
- Comparison of importance of variables using F test
- Adjusted Coefficient of Determination(Adjusted R2)
- Information Criterion
- Two-way analysis of variance
- Multivariate analysis of variance (MANOVA)
One-way analysis of variance
Load Dataset
Dataset description
Get and sort sample data
Twenty-two patients undergoing cardiac bypass surgery were randomized to one of three ventilation groups:
- Group I: Patients received 50% nitrous oxide and 50% oxygen mixture continuously for 24 h.
- Group II: Patients received a 50% nitrous oxide and 50% oxygen mixture only dirng the operation.
- Group III: Patients received no nitrous oxide but received 35-50% oxygen for 24 h.
The data show red cell folate levels for the three groups after 24h’ ventilation.
import urllib
# Get the data
inFile = 'altman_910.txt'
url_base = 'https://raw.githubusercontent.com/thomas-haslwanter/statsintro_python/master/ipynb/Data/data_altman/'
url = url_base + inFile
data = genfromtxt(urllib.request.urlopen(url), delimiter=',')
# Sort them into groups, according to column 1
group1 = data[data[:,1]==1,0]
group2 = data[data[:,1]==2,0]
group3 = data[data[:,1]==3,0]
Data
data
OUTPUT
array([[243., 1.],
[251., 1.],
[275., 1.],
[291., 1.],
[347., 1.],
[354., 1.],
[380., 1.],
[392., 1.],
[206., 2.],
[210., 2.],
[226., 2.],
[249., 2.],
[255., 2.],
[273., 2.],
[285., 2.],
[295., 2.],
[309., 2.],
[241., 3.],
[258., 3.],
[270., 3.],
[293., 3.],
[328., 3.]])
group1
OUTPUT
array([243., 251., 275., 291., 347., 354., 380., 392.])
group2
OUTPUT
array([206., 210., 226., 249., 255., 273., 285., 295., 309.])
group3
OUTPUT
array([241., 258., 270., 293., 328.])
Levene test for equal-variance
# check if the variances are equal with the "Levene"-test
from scipy import stats
(W,p) = stats.levene(group1, group2, group3)
if p<0.05:
print('Warning: the p-value of the Levene test is <0.05: p={0}'.format(p))
OUTPUT
Warning: the p-value of the Levene test is <0.05: p=0.045846812634186246
One-way ANOVA with scipy
F_statistic, pVal = stats.f_oneway(group1, group2, group3)
print('The results from the one-way ANOVA, with the data from Altman 910: F={0:.1f}, p={1:.5f}'.format(F_statistic, pVal))
if pVal < 0.05:
print('One of the groups is significantly different.')
OUTPUT
The results from the one-way ANOVA, with the data from Altman 910: F=3.7, p=0.04359
One of the groups is significantly different.
One-way ANOVA with statsmodel
import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm
# move dataset to dataframe of pandas
df = pd.DataFrame(data, columns=['value', 'treatment'])
# the "C" indicates categorical data
model = ols('value ~ C(treatment)', df).fit()
print(anova_lm(model))
OUTPUT
df sum_sq mean_sq F PR(>F)
C(treatment) 2.0 15515.766414 7757.883207 3.711336 0.043589
Residual 19.0 39716.097222 2090.320906 NaN NaN
Tukey’s multi-comparison method, Holm-Bonferroni Method
Setting up the data, and running an ANOVA
import numpy as np
import scipy.stats as stats
# Create four random groups of data with a mean difference of 1
group1 = np.random.normal(10, 3, 50)
group2 = np.random.normal(11, 3, 50)
group3 = np.random.normal(12, 3, 50)
group4 = np.random.normal(13, 3, 50)
# Show the results for Anova
F_statistic, pVal = stats.f_oneway(group1, group2, group3, group4)
print ('P value:')
print (pVal)
P value:
1.6462001201818463e-08
import pandas as pd
# Put into dataframe
df = pd.DataFrame()
df['treatment1'] = group1
df['treatment2'] = group2
df['treatment3'] = group3
df['treatment4'] = group4
# Stack the data (and rename columns):
stacked_data = df.stack().reset_index()
stacked_data = stacked_data.rename(columns={'level_0': 'id',
'level_1': 'treatment',
0:'result'})
# Show the first 8 rows:
print (stacked_data.head(8))
id treatment result
0 0 treatment1 12.980445
1 0 treatment2 8.444603
2 0 treatment3 10.713692
3 0 treatment4 10.777762
4 1 treatment1 14.350560
5 1 treatment2 9.436072
6 1 treatment3 12.715509
7 1 treatment4 15.016419
SUPPLEMENT
df.stack()
OUTPUT
0 treatment1 10.130439
treatment2 8.089465
treatment3 18.145643
treatment4 18.152929
1 treatment1 11.967849
treatment2 15.090930
treatment3 16.328056
treatment4 15.949430
2 treatment1 9.953008
treatment2 14.802174
treatment3 16.048117
treatment4 9.917037
3 treatment1 9.994950
treatment2 11.442346
treatment3 12.059078
treatment4 13.461622
4 treatment1 10.532563
treatment2 14.644867
treatment3 10.803938
treatment4 11.966377
5 treatment1 7.448921
treatment2 11.672544
treatment3 9.885759
treatment4 11.060560
6 treatment1 8.750608
treatment2 13.462668
treatment3 9.172649
treatment4 8.315324
7 treatment1 11.423393
treatment2 9.432116
...
42 treatment3 6.160599
treatment4 11.738915
43 treatment1 8.201168
treatment2 9.803940
treatment3 8.422076
treatment4 9.180611
44 treatment1 10.158036
treatment2 11.387816
treatment3 10.132787
treatment4 15.716954
45 treatment1 7.867300
treatment2 7.031535
treatment3 9.547684
treatment4 19.755411
46 treatment1 5.016449
treatment2 12.700640
treatment3 15.266099
treatment4 14.259488
47 treatment1 0.745349
treatment2 13.105580
treatment3 11.237676
treatment4 21.625642
48 treatment1 11.615028
treatment2 11.318771
treatment3 15.321465
treatment4 15.041703
49 treatment1 5.710994
treatment2 15.220437
treatment3 15.089208
treatment4 10.121185
Length: 200, dtype: float64
df.stack().reset_index()
OUTPUT
level_0 level_1 0
0 0 treatment1 10.130439
1 0 treatment2 8.089465
2 0 treatment3 18.145643
3 0 treatment4 18.152929
4 1 treatment1 11.967849
5 1 treatment2 15.090930
6 1 treatment3 16.328056
7 1 treatment4 15.949430
8 2 treatment1 9.953008
9 2 treatment2 14.802174
10 2 treatment3 16.048117
11 2 treatment4 9.917037
12 3 treatment1 9.994950
13 3 treatment2 11.442346
14 3 treatment3 12.059078
15 3 treatment4 13.461622
16 4 treatment1 10.532563
17 4 treatment2 14.644867
18 4 treatment3 10.803938
19 4 treatment4 11.966377
20 5 treatment1 7.448921
21 5 treatment2 11.672544
22 5 treatment3 9.885759
23 5 treatment4 11.060560
24 6 treatment1 8.750608
25 6 treatment2 13.462668
26 6 treatment3 9.172649
27 6 treatment4 8.315324
28 7 treatment1 11.423393
29 7 treatment2 9.432116
... ... ... ...
170 42 treatment3 6.160599
171 42 treatment4 11.738915
172 43 treatment1 8.201168
173 43 treatment2 9.803940
174 43 treatment3 8.422076
175 43 treatment4 9.180611
176 44 treatment1 10.158036
177 44 treatment2 11.387816
178 44 treatment3 10.132787
179 44 treatment4 15.716954
180 45 treatment1 7.867300
181 45 treatment2 7.031535
182 45 treatment3 9.547684
183 45 treatment4 19.755411
184 46 treatment1 5.016449
185 46 treatment2 12.700640
186 46 treatment3 15.266099
187 46 treatment4 14.259488
188 47 treatment1 0.745349
189 47 treatment2 13.105580
190 47 treatment3 11.237676
191 47 treatment4 21.625642
192 48 treatment1 11.615028
193 48 treatment2 11.318771
194 48 treatment3 15.321465
195 48 treatment4 15.041703
196 49 treatment1 5.710994
197 49 treatment2 15.220437
198 49 treatment3 15.089208
199 49 treatment4 10.121185
200 rows × 3 columns
Tukey’s multi-comparison method
from statsmodels.stats.multicomp import (pairwise_tukeyhsd, MultiComparison)
# Set up the data for comparison (creates a specialised object)
MultiComp = MultiComparison(stacked_data['result'],
stacked_data['treatment'])
# Show all pair-wise comparisons:
# Print the comparisons
results = MultiComp.tukeyhsd()
print(results.summary())
results.plot_simultaneous()
Multiple Comparison of Means - Tukey HSD,FWER=0.05
====================================================
group1 group2 meandiff lower upper reject
----------------------------------------------------
treatment1 treatment2 1.5021 -0.0392 3.0435 False
treatment1 treatment3 1.47 -0.0714 3.0113 False
treatment1 treatment4 3.8572 2.3159 5.3985 True
treatment2 treatment3 -0.0322 -1.5735 1.5091 False
treatment2 treatment4 2.355 0.8137 3.8963 True
treatment3 treatment4 2.3872 0.8459 3.9285 True
Holm-Bonferroni Method
comp = MultiComp.allpairtest(stats.ttest_rel, method='Holm')
print (comp[0])
Test Multiple Comparison ttest_rel
FWER=0.05 method=Holm
alphacSidak=0.01, alphacBonf=0.008
=====================================================
group1 group2 stat pval pval_corr reject
-----------------------------------------------------
treatment1 treatment2 -2.1234 0.0388 0.0776 False
treatment1 treatment3 -2.4304 0.0188 0.0564 False
treatment1 treatment4 -6.4443 0.0 0.0 True
treatment2 treatment3 0.0457 0.9637 0.9637 False
treatment2 treatment4 -3.7878 0.0004 0.0017 True
treatment3 treatment4 -5.0246 0.0 0.0 True
One-way ANOVA for regression analysis, model with constant
import pandas as pd
import statsmodels.api as sm
from sklearn.datasets import make_regression
X0, y, coef = make_regression(n_samples = 100, n_features=1, noise=30, coef=True, random_state=0)
dfX0 = pd.DataFrame(X0, columns=["X"])
dfX = sm.add_constant(dfX0)
dfy = pd.DataFrame(y, columns=["Y"])
df = pd.concat([dfX, dfy], axis=1)
model = sm.OLS.from_formula("Y ~ X", data=df)
result = model.fit()
print("TSS = ", result.uncentered_tss) # TSS: total sum of square
print("ESS = ", result.mse_model) # ESS: explained sum of squares
print("RSS = ", result.ssr) # RSS: residual sum of squares
print("ESS + RSS = ", result.mse_model + result.ssr)
print("R squared = ", result.rsquared)
OUTPUT 1 : ANOVA table for fixed model, single factor
TSS = 291345.7578983061
ESS = 188589.61349210917
RSS = 102754.33755137534
ESS + RSS = 291343.9510434845
R squared = 0.6473091780922585
Visualization for OUTPUT 1
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats
sns.distplot(y,
kde=False, fit=stats.norm, hist_kws={"color": "r", "alpha": 0.2}, fit_kws={"color": "r"},
label="TSS")
sns.distplot(result.fittedvalues,
kde=False, hist_kws={"color": "g", "alpha": 0.2}, fit=stats.norm, fit_kws={"color": "g"},
label="ESS")
sns.distplot(result.resid,
kde=False, hist_kws={"color": "b", "alpha": 0.2}, fit=stats.norm, fit_kws={"color": "b"},
label="RSS")
plt.legend()
plt.show()
OUTPUT 2 : Regression F-test and ANOVA Relationship
sm.stats.anova_lm(result)
df sum_sq mean_sq F PR(>F)
X 1.0 188589.613492 188589.613492 179.863766 6.601482e-24
Residual 98.0 102754.337551 1048.513648 NaN NaN
print(result.summary())
OLS Regression Results
==============================================================================
Dep. Variable: Y R-squared: 0.647
Model: OLS Adj. R-squared: 0.644
Method: Least Squares F-statistic: 179.9
Date: Fri, 25 Oct 2019 Prob (F-statistic): 6.60e-24
Time: 19:14:57 Log-Likelihood: -488.64
No. Observations: 100 AIC: 981.3
Df Residuals: 98 BIC: 986.5
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept -2.4425 3.244 -0.753 0.453 -8.880 3.995
X 43.0873 3.213 13.411 0.000 36.712 49.463
==============================================================================
Omnibus: 3.523 Durbin-Watson: 1.984
Prob(Omnibus): 0.172 Jarque-Bera (JB): 2.059
Skew: -0.073 Prob(JB): 0.357
Kurtosis: 2.312 Cond. No. 1.06
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Visualization : Coefficient of Determination(R2) and Correlation Coefficient
mathworld : correlation coefficient
Coefficient of Determination(R2)
Correlation Coefficient
import seaborn as sns
import matplotlib.pyplot as plt
sns.jointplot(result.fittedvalues, y)
plt.show()
One-way ANOVA for regression analysis, model without constant
import pandas as pd
import statsmodels.api as sm
from sklearn.datasets import make_regression
X0, y, coef = make_regression(
n_samples=100, n_features=1, noise=30, bias=100, coef=True, random_state=0)
dfX = pd.DataFrame(X0, columns=["X"])
dfy = pd.DataFrame(y, columns=["Y"])
df = pd.concat([dfX, dfy], axis=1)
model2 = sm.OLS.from_formula("Y ~ X + 0", data=df)
result2 = model2.fit()
result2.summary()
OUTPUT
Comparing Model Using F Test
OUTPUT
Comparison of importance of variables using F test
OUTPUT
Adjusted Coefficient of Determination(Adjusted R2)
OUTPUT
Information Criterion
OUTPUT
Two-way analysis of variance
Two-way ANOVA with repetition
Two-way ANOVA without repetition
Multivariate analysis of variance (MANOVA)
List of posts followed by this article
Reference