
MATH05, Analysis of variance

List of posts to read before reading this article


Contents


One-way analysis of variance

[Figure: ANOVA_oneway]

Load Dataset

Dataset description

Get and sort sample data

Twenty-two patients undergoing cardiac bypass surgery were randomized to one of three ventilation groups:

  • Group I: Patients received a 50% nitrous oxide and 50% oxygen mixture continuously for 24 h.
  • Group II: Patients received a 50% nitrous oxide and 50% oxygen mixture only during the operation.
  • Group III: Patients received no nitrous oxide but received 35-50% oxygen for 24 h.

The data show red cell folate levels for the three groups after 24 h of ventilation.


import urllib.request
import numpy as np

# Get the data
inFile = 'altman_910.txt'
url_base = 'https://raw.githubusercontent.com/thomas-haslwanter/statsintro_python/master/ipynb/Data/data_altman/'

url = url_base + inFile
data = np.genfromtxt(urllib.request.urlopen(url), delimiter=',')

# Sort them into groups, according to the second column (index 1)
group1 = data[data[:,1]==1,0]
group2 = data[data[:,1]==2,0]
group3 = data[data[:,1]==3,0]
Data
data
OUTPUT
array([[243.,   1.],
       [251.,   1.],
       [275.,   1.],
       [291.,   1.],
       [347.,   1.],
       [354.,   1.],
       [380.,   1.],
       [392.,   1.],
       [206.,   2.],
       [210.,   2.],
       [226.,   2.],
       [249.,   2.],
       [255.,   2.],
       [273.,   2.],
       [285.,   2.],
       [295.,   2.],
       [309.,   2.],
       [241.,   3.],
       [258.,   3.],
       [270.,   3.],
       [293.,   3.],
       [328.,   3.]])


group1
OUTPUT
array([243., 251., 275., 291., 347., 354., 380., 392.])


group2
OUTPUT
array([206., 210., 226., 249., 255., 273., 285., 295., 309.])


group3
OUTPUT
array([241., 258., 270., 293., 328.])



Levene test for equal variances

# check whether the variances are equal with the Levene test
from scipy import stats

(W, p) = stats.levene(group1, group2, group3)
if p < 0.05:
    print('Warning: the p-value of the Levene test is <0.05: p={0}'.format(p))
OUTPUT
Warning: the p-value of the Levene test is <0.05: p=0.045846812634186246


One-way ANOVA with scipy

F_statistic, pVal = stats.f_oneway(group1, group2, group3)

print('The results from the one-way ANOVA, with the data from Altman 910: F={0:.1f}, p={1:.5f}'.format(F_statistic, pVal))
if pVal < 0.05:
    print('One of the groups is significantly different.')
OUTPUT
The results from the one-way ANOVA, with the data from Altman 910: F=3.7, p=0.04359
One of the groups is significantly different.


One-way ANOVA with statsmodels

import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

# move dataset to dataframe of pandas
df = pd.DataFrame(data, columns=['value', 'treatment'])    

# the "C" indicates categorical data
model = ols('value ~ C(treatment)', df).fit()

print(anova_lm(model))
OUTPUT
                df        sum_sq      mean_sq         F    PR(>F)
C(treatment)   2.0  15515.766414  7757.883207  3.711336  0.043589
Residual      19.0  39716.097222  2090.320906       NaN       NaN
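
To make explicit what f_oneway and anova_lm compute, the short sketch below (not part of the original example) rebuilds the F statistic from the between-group and within-group sums of squares of the three Altman 910 groups:

import numpy as np

# Rebuild the one-way ANOVA F statistic by hand from the three groups
groups = [group1, group2, group3]
all_data = np.concatenate(groups)
grand_mean = all_data.mean()

# Between-group and within-group sums of squares
ss_between = sum(len(g) * (g.mean() - grand_mean)**2 for g in groups)
ss_within = sum(((g - g.mean())**2).sum() for g in groups)

df_between = len(groups) - 1               # 2
df_within = len(all_data) - len(groups)    # 19

F = (ss_between / df_between) / (ss_within / df_within)
print('F =', F)    # should reproduce the F value of about 3.7 shown above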





Tukey’s multiple comparison method, Holm-Bonferroni Method


Setting up the data, and running an ANOVA

import numpy as np
import scipy.stats as stats

# Create four random groups of data with a mean difference of 1
group1 = np.random.normal(10, 3, 50)
group2 = np.random.normal(11, 3, 50)
group3 = np.random.normal(12, 3, 50)
group4 = np.random.normal(13, 3, 50)

# Show the results for Anova
F_statistic, pVal = stats.f_oneway(group1, group2, group3, group4)
print ('P value:')
print (pVal)
P value:
1.6462001201818463e-08
import pandas as pd

# Put into dataframe
df = pd.DataFrame()
df['treatment1'] = group1
df['treatment2'] = group2
df['treatment3'] = group3
df['treatment4'] = group4

# Stack the data (and rename columns):
stacked_data = df.stack().reset_index()
stacked_data = stacked_data.rename(columns={'level_0': 'id',
                                            'level_1': 'treatment',
                                            0:'result'})
# Show the first 8 rows:
print (stacked_data.head(8))
   id   treatment     result
0   0  treatment1  12.980445
1   0  treatment2   8.444603
2   0  treatment3  10.713692
3   0  treatment4  10.777762
4   1  treatment1  14.350560
5   1  treatment2   9.436072
6   1  treatment3  12.715509
7   1  treatment4  15.016419
SUPPLEMENT
df.stack()
OUTPUT
0   treatment1    10.130439
    treatment2     8.089465
    treatment3    18.145643
    treatment4    18.152929
1   treatment1    11.967849
    treatment2    15.090930
    treatment3    16.328056
    treatment4    15.949430
2   treatment1     9.953008
    treatment2    14.802174
    treatment3    16.048117
    treatment4     9.917037
3   treatment1     9.994950
    treatment2    11.442346
    treatment3    12.059078
    treatment4    13.461622
4   treatment1    10.532563
    treatment2    14.644867
    treatment3    10.803938
    treatment4    11.966377
5   treatment1     7.448921
    treatment2    11.672544
    treatment3     9.885759
    treatment4    11.060560
6   treatment1     8.750608
    treatment2    13.462668
    treatment3     9.172649
    treatment4     8.315324
7   treatment1    11.423393
    treatment2     9.432116
                    ...    
42  treatment3     6.160599
    treatment4    11.738915
43  treatment1     8.201168
    treatment2     9.803940
    treatment3     8.422076
    treatment4     9.180611
44  treatment1    10.158036
    treatment2    11.387816
    treatment3    10.132787
    treatment4    15.716954
45  treatment1     7.867300
    treatment2     7.031535
    treatment3     9.547684
    treatment4    19.755411
46  treatment1     5.016449
    treatment2    12.700640
    treatment3    15.266099
    treatment4    14.259488
47  treatment1     0.745349
    treatment2    13.105580
    treatment3    11.237676
    treatment4    21.625642
48  treatment1    11.615028
    treatment2    11.318771
    treatment3    15.321465
    treatment4    15.041703
49  treatment1     5.710994
    treatment2    15.220437
    treatment3    15.089208
    treatment4    10.121185
Length: 200, dtype: float64


df.stack().reset_index()
OUTPUT
	level_0	level_1		0
0	0	treatment1	10.130439
1	0	treatment2	8.089465
2	0	treatment3	18.145643
3	0	treatment4	18.152929
4	1	treatment1	11.967849
5	1	treatment2	15.090930
6	1	treatment3	16.328056
7	1	treatment4	15.949430
8	2	treatment1	9.953008
9	2	treatment2	14.802174
10	2	treatment3	16.048117
11	2	treatment4	9.917037
12	3	treatment1	9.994950
13	3	treatment2	11.442346
14	3	treatment3	12.059078
15	3	treatment4	13.461622
16	4	treatment1	10.532563
17	4	treatment2	14.644867
18	4	treatment3	10.803938
19	4	treatment4	11.966377
20	5	treatment1	7.448921
21	5	treatment2	11.672544
22	5	treatment3	9.885759
23	5	treatment4	11.060560
24	6	treatment1	8.750608
25	6	treatment2	13.462668
26	6	treatment3	9.172649
27	6	treatment4	8.315324
28	7	treatment1	11.423393
29	7	treatment2	9.432116
...	...	...		...
170	42	treatment3	6.160599
171	42	treatment4	11.738915
172	43	treatment1	8.201168
173	43	treatment2	9.803940
174	43	treatment3	8.422076
175	43	treatment4	9.180611
176	44	treatment1	10.158036
177	44	treatment2	11.387816
178	44	treatment3	10.132787
179	44	treatment4	15.716954
180	45	treatment1	7.867300
181	45	treatment2	7.031535
182	45	treatment3	9.547684
183	45	treatment4	19.755411
184	46	treatment1	5.016449
185	46	treatment2	12.700640
186	46	treatment3	15.266099
187	46	treatment4	14.259488
188	47	treatment1	0.745349
189	47	treatment2	13.105580
190	47	treatment3	11.237676
191	47	treatment4	21.625642
192	48	treatment1	11.615028
193	48	treatment2	11.318771
194	48	treatment3	15.321465
195	48	treatment4	15.041703
196	49	treatment1	5.710994
197	49	treatment2	15.220437
198	49	treatment3	15.089208
199	49	treatment4	10.121185
200 rows × 3 columns



Tukey’s multiple comparison method

from statsmodels.stats.multicomp import (pairwise_tukeyhsd, MultiComparison)

# Set up the data for comparison (creates a specialised object)
MultiComp = MultiComparison(stacked_data['result'],
                            stacked_data['treatment'])

# Show all pair-wise comparisons:
# Print the comparisons
results = MultiComp.tukeyhsd()
print(results.summary())
results.plot_simultaneous()
 Multiple Comparison of Means - Tukey HSD,FWER=0.05 
====================================================
  group1     group2   meandiff  lower  upper  reject
----------------------------------------------------
treatment1 treatment2  1.5021  -0.0392 3.0435 False 
treatment1 treatment3   1.47   -0.0714 3.0113 False 
treatment1 treatment4  3.8572   2.3159 5.3985  True 
treatment2 treatment3 -0.0322  -1.5735 1.5091 False 
treatment2 treatment4  2.355    0.8137 3.8963  True 
treatment3 treatment4  2.3872   0.8459 3.9285  True 

[Figure: simultaneous confidence intervals from results.plot_simultaneous()]

Holm-Bonferroni Method

comp = MultiComp.allpairtest(stats.ttest_rel, method='Holm')
print (comp[0])
Test Multiple Comparison ttest_rel 
FWER=0.05 method=Holm
alphacSidak=0.01, alphacBonf=0.008
=====================================================
  group1     group2     stat   pval  pval_corr reject
-----------------------------------------------------
treatment1 treatment2 -2.1234 0.0388   0.0776  False 
treatment1 treatment3 -2.4304 0.0188   0.0564  False 
treatment1 treatment4 -6.4443  0.0      0.0     True 
treatment2 treatment3  0.0457 0.9637   0.9637  False 
treatment2 treatment4 -3.7878 0.0004   0.0017   True 
treatment3 treatment4 -5.0246  0.0      0.0     True 





One-way ANOVA for regression analysis, model with constant

import pandas as pd
import statsmodels.api as sm
from sklearn.datasets import make_regression

X0, y, coef = make_regression(n_samples = 100, n_features=1, noise=30, coef=True, random_state=0)
dfX0 = pd.DataFrame(X0, columns=["X"])
dfX = sm.add_constant(dfX0)
dfy = pd.DataFrame(y, columns=["Y"])
df = pd.concat([dfX, dfy], axis=1)

model = sm.OLS.from_formula("Y ~ X", data=df)
result = model.fit()

print("TSS = ", result.uncentered_tss)       # TSS: total sum of square
print("ESS = ", result.mse_model)            # ESS: explained sum of squares
print("RSS = ", result.ssr)                  # RSS: residual sum of squares
print("ESS + RSS = ", result.mse_model + result.ssr)
print("R squared = ", result.rsquared)
OUTPUT 1 : ANOVA table for fixed model, single factor
TSS =  291345.7578983061
ESS =  188589.61349210917
RSS =  102754.33755137534
ESS + RSS =  291343.9510434845
R squared =  0.6473091780922585
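
As a side note (a small sketch reusing result from above, not in the original post): for a model with an intercept, ESS + RSS equals the centered total sum of squares, which is why the uncentered TSS printed above differs slightly from ESS + RSS.

# ESS + RSS decomposes the *centered* TSS when the model contains an intercept.
# ess, ssr and centered_tss are standard statsmodels OLSResults attributes.
print("centered TSS = ", result.centered_tss)
print("ESS + RSS    = ", result.ess + result.ssr)
print("R squared    = ", result.ess / result.centered_tss)   # same as result.rsquared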

Visualization for OUTPUT 1
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats

sns.distplot(y,
             kde=False, fit=stats.norm, hist_kws={"color": "r", "alpha": 0.2}, fit_kws={"color": "r"},
             label="TSS")
sns.distplot(result.fittedvalues,
             kde=False, hist_kws={"color": "g", "alpha": 0.2}, fit=stats.norm, fit_kws={"color": "g"},
             label="ESS")
sns.distplot(result.resid,
             kde=False, hist_kws={"color": "b", "alpha": 0.2}, fit=stats.norm, fit_kws={"color": "b"},
             label="RSS")
plt.legend()
plt.show()

[Figure: distributions of y (TSS), the fitted values (ESS) and the residuals (RSS)]


OUTPUT 2 : Regression F-test and ANOVA Relationship

[Figure: relationship between the regression F-test and the ANOVA table]

sm.stats.anova_lm(result)
OUTPUT
              df         sum_sq        mean_sq           F        PR(>F)
X            1.0  188589.613492  188589.613492  179.863766  6.601482e-24
Residual    98.0  102754.337551    1048.513648         NaN           NaN
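
The relationship can be checked numerically (a sketch reusing result from above): the ANOVA F value is the ratio of the model and residual mean squares, it equals result.fvalue reported in the regression summary, and with a single predictor it is also the square of the slope's t statistic.

# F = MS_model / MS_resid; with one predictor it also equals t**2 of the slope
print("F from mean squares :", result.mse_model / result.mse_resid)
print("result.fvalue       :", result.fvalue)
print("t**2 for X          :", result.tvalues["X"]**2)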


print(result.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      Y   R-squared:                       0.647
Model:                            OLS   Adj. R-squared:                  0.644
Method:                 Least Squares   F-statistic:                     179.9
Date:                Fri, 25 Oct 2019   Prob (F-statistic):           6.60e-24
Time:                        19:14:57   Log-Likelihood:                -488.64
No. Observations:                 100   AIC:                             981.3
Df Residuals:                      98   BIC:                             986.5
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -2.4425      3.244     -0.753      0.453      -8.880       3.995
X             43.0873      3.213     13.411      0.000      36.712      49.463
==============================================================================
Omnibus:                        3.523   Durbin-Watson:                   1.984
Prob(Omnibus):                  0.172   Jarque-Bera (JB):                2.059
Skew:                          -0.073   Prob(JB):                        0.357
Kurtosis:                       2.312   Cond. No.                         1.06
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Visualization : Coefficient of Determination (R2) and Correlation Coefficient

mathworld : correlation coefficient

Coefficient of Determination (R2)

$R^2 = \dfrac{ESS}{TSS} = 1 - \dfrac{RSS}{TSS}$

Correlation Coefficient

For a population: $\rho_{X,Y} = \dfrac{\operatorname{Cov}(X, Y)}{\sigma_X \, \sigma_Y}$.  For a sample: $r_{xy} = \dfrac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\,\sqrt{\sum_i (y_i - \bar{y})^2}}$
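
For simple linear regression with an intercept, R2 is the squared correlation between the fitted values and y. A quick numerical check (a sketch reusing result and y from above, not in the original post):

import numpy as np

# The squared Pearson correlation between fitted values and y reproduces R squared
r = np.corrcoef(result.fittedvalues, y)[0, 1]
print("r^2             =", r**2)
print("result.rsquared =", result.rsquared)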


import seaborn as sns
import matplotlib.pyplot as plt

sns.jointplot(x=result.fittedvalues, y=y)
plt.show()

[Figure: joint plot of the fitted values against y]






One-way ANOVA for regression analysis, model without constant

import pandas as pd
import statsmodels.api as sm
from sklearn.datasets import make_regression

X0, y, coef = make_regression(
    n_samples=100, n_features=1, noise=30, bias=100, coef=True, random_state=0)
dfX = pd.DataFrame(X0, columns=["X"])
dfy = pd.DataFrame(y, columns=["Y"])
df = pd.concat([dfX, dfy], axis=1)

model2 = sm.OLS.from_formula("Y ~ X + 0", data=df)
result2 = model2.fit()
result2.summary()
OUTPUT

[Figure: OLS regression results summary for the model without a constant]
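
Note that when the constant is dropped ("Y ~ X + 0"), statsmodels reports an uncentered R2, i.e. it is computed against the uncentered total sum of squares. A minimal check on result2:

# Without an intercept, rsquared is based on the uncentered TSS
print("result2.rsquared         = ", result2.rsquared)
print("1 - RSS / uncentered TSS = ", 1 - result2.ssr / result2.uncentered_tss)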






Comparing Models Using the F Test
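
The original code for this section is missing. The sketch below is one possible way to do it with statsmodels (an assumption, not the author's original example): fit a reduced, intercept-only model and the full model "Y ~ X" on the regression DataFrame df from the sections above, then test the restriction with an F test.

# Nested-model F test: does adding X significantly improve the fit?
full = sm.OLS.from_formula("Y ~ X", data=df).fit()
reduced = sm.OLS.from_formula("Y ~ 1", data=df).fit()   # intercept-only model

f_value, p_value, df_diff = full.compare_f_test(reduced)
print("F = {0:.3f}, p = {1:.3g}".format(f_value, p_value))

# anova_lm can report the same comparison as a table
print(sm.stats.anova_lm(reduced, full))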







Comparison of the importance of variables using the F test
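
The original code for this section is also missing. As a hedged sketch, the example below builds a hypothetical two-feature regression (X1, X2 and df_imp are made-up names); the type-II ANOVA table then gives each variable a partial F test, and a larger F value indicates that the variable explains more of the variance given the other regressors.

import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical data: X1 has a much stronger effect on Y than X2
np.random.seed(0)
n = 100
X1 = np.random.normal(size=n)
X2 = np.random.normal(size=n)
y2 = 3*X1 + 0.5*X2 + np.random.normal(scale=2, size=n)
df_imp = pd.DataFrame({"X1": X1, "X2": X2, "Y": y2})

model_imp = sm.OLS.from_formula("Y ~ X1 + X2", data=df_imp).fit()
print(sm.stats.anova_lm(model_imp, typ=2))   # per-variable partial F tests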







Adjusted Coefficient of Determination (Adjusted R2)
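
No code is given for this section in the original; the sketch below reuses result from the "model with constant" section. The adjusted R2 penalises R2 for the number of regressors, so it only increases when an added variable improves the fit enough to justify the extra parameter.

# Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1), k = regressors excl. constant
n = result.nobs
k = result.df_model
r2 = result.rsquared

print("rsquared_adj (statsmodels) =", result.rsquared_adj)
print("manual formula             =", 1 - (1 - r2) * (n - 1) / (n - k - 1))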







Information Criterion
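
No code is given for this section either; the sketch below reuses result from above. AIC and BIC trade goodness of fit (the log-likelihood) against model complexity (the number of estimated parameters), and lower values indicate the preferred model.

import numpy as np

# AIC = -2*logL + 2k,  BIC = -2*logL + k*log(n), with k parameters incl. the constant
llf = result.llf
k = result.df_model + 1
n = result.nobs

print("AIC (statsmodels) =", result.aic)
print("AIC (manual)      =", -2*llf + 2*k)
print("BIC (statsmodels) =", result.bic)
print("BIC (manual)      =", -2*llf + np.log(n)*k)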








Two-way analysis of variance

Two-way ANOVA with repetition

[Figure: ANOVA_twoway_with_repetition]
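
The worked example for this case is not included in the post; the sketch below uses hypothetical data (fertilizer, water and yield_ are made-up names) with three repetitions per cell, so the interaction between the two factors can be estimated and tested with statsmodels.

import numpy as np
import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

# Hypothetical balanced design: 2 fertilizers x 2 water levels x 3 repetitions
np.random.seed(0)
rows = []
for f in ["A", "B"]:
    for w in ["low", "high"]:
        for _ in range(3):
            y = 10 + 2*(f == "B") + 3*(w == "high") + np.random.normal()
            rows.append({"fertilizer": f, "water": w, "yield_": y})
df_rep = pd.DataFrame(rows)

# The '*' expands to both main effects plus the interaction term
model_rep = ols("yield_ ~ C(fertilizer) * C(water)", data=df_rep).fit()
print(anova_lm(model_rep, typ=2))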





Two-way ANOVA without repetition

[Figure: ANOVA_twoway_without_repetition]
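
Without repetition there is only one observation per cell, so the interaction cannot be separated from the error term and the model keeps the main effects only. A hedged sketch on hypothetical data (same made-up column names as above):

import numpy as np
import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

# One observation per (fertilizer, water) cell: 3 x 3 = 9 rows
np.random.seed(1)
rows = []
for f in ["A", "B", "C"]:
    for w in ["low", "mid", "high"]:
        base = 10 + {"A": 0, "B": 2, "C": 4}[f] + {"low": 0, "mid": 1, "high": 3}[w]
        rows.append({"fertilizer": f, "water": w, "yield_": base + np.random.normal()})
df_norep = pd.DataFrame(rows)

# Main effects only; an interaction term would leave no residual degrees of freedom
model_norep = ols("yield_ ~ C(fertilizer) + C(water)", data=df_norep).fit()
print(anova_lm(model_norep, typ=2))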





Multivariate analysis of variance (MANOVA)
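
This section has no example in the original post. As a minimal sketch (hypothetical data; group, y1 and y2 are made-up names), the statsmodels MANOVA class tests whether two dependent variables differ jointly across three groups, reporting Wilks' lambda, Pillai's trace and related statistics.

import numpy as np
import pandas as pd
from statsmodels.multivariate.manova import MANOVA

# Hypothetical data: three groups, two responses shifted by group
np.random.seed(0)
groups = np.repeat(["g1", "g2", "g3"], 30)
shift = np.repeat([0.0, 1.0, 2.0], 30)
y1 = shift + np.random.normal(size=groups.size)
y2 = 0.5*shift + np.random.normal(size=groups.size)
df_manova = pd.DataFrame({"group": groups, "y1": y1, "y2": y2})

manova = MANOVA.from_formula("y1 + y2 ~ group", data=df_manova)
print(manova.mv_test())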





List of posts followed by this article


Reference

