MATH05, Analysis of variance

Back to the previous page ｜ Statistics
List of posts to read before reading this article

One-way analysis of variance
Two-way analysis of variance
- Two-way ANOVA with repetition
- Two-way ANOVA without repetition
Multivariate analysis of variance (MANOVA)

One-way analysis of variance

ANOVA_oneway

Load Dataset

Dataset description

Get and sort sample data

Twenty-two patients undergoing cardiac bypass surgery were randomized to one of three ventilation groups:

Group I: Patients received 50% nitrous oxide and 50% oxygen mixture continuously for 24 h.
Group II: Patients received a 50% nitrous oxide and 50% oxygen mixture only dirng the operation.
Group III: Patients received no nitrous oxide but received 35-50% oxygen for 24 h.

The data show red cell folate levels for the three groups after 24h’ ventilation.

import urllib

# Get the data
inFile = 'altman_910.txt'
url_base = 'https://raw.githubusercontent.com/thomas-haslwanter/statsintro_python/master/ipynb/Data/data_altman/'

url = url_base + inFile
data = genfromtxt(urllib.request.urlopen(url), delimiter=',')

# Sort them into groups, according to column 1
group1 = data[data[:,1]==1,0]
group2 = data[data[:,1]==2,0]
group3 = data[data[:,1]==3,0]

Data

data

OUTPUT

array([[243.,   1.],
       [251.,   1.],
       [275.,   1.],
       [291.,   1.],
       [347.,   1.],
       [354.,   1.],
       [380.,   1.],
       [392.,   1.],
       [206.,   2.],
       [210.,   2.],
       [226.,   2.],
       [249.,   2.],
       [255.,   2.],
       [273.,   2.],
       [285.,   2.],
       [295.,   2.],
       [309.,   2.],
       [241.,   3.],
       [258.,   3.],
       [270.,   3.],
       [293.,   3.],
       [328.,   3.]])

group1

OUTPUT

array([243., 251., 275., 291., 347., 354., 380., 392.])

group2

OUTPUT

array([206., 210., 226., 249., 255., 273., 285., 295., 309.])

group3

OUTPUT

array([241., 258., 270., 293., 328.])

Levene test for equal-variance

# check if the variances are equal with the "Levene"-test
from scipy import stats

(W,p) = stats.levene(group1, group2, group3)
if p<0.05:
    
    print('Warning: the p-value of the Levene test is <0.05: p={0}'.format(p))

OUTPUT

Warning: the p-value of the Levene test is <0.05: p=0.045846812634186246

One-way ANOVA with scipy

F_statistic, pVal = stats.f_oneway(group1, group2, group3)

print('The results from the one-way ANOVA, with the data from Altman 910: F={0:.1f}, p={1:.5f}'.format(F_statistic, pVal))
if pVal < 0.05:
    print('One of the groups is significantly different.')

OUTPUT

The results from the one-way ANOVA, with the data from Altman 910: F=3.7, p=0.04359
One of the groups is significantly different.

One-way ANOVA with statsmodel

import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

# move dataset to dataframe of pandas
df = pd.DataFrame(data, columns=['value', 'treatment'])    

# the "C" indicates categorical data
model = ols('value ~ C(treatment)', df).fit()

print(anova_lm(model))

OUTPUT

                df        sum_sq      mean_sq         F    PR(>F)
C(treatment)   2.0  15515.766414  7757.883207  3.711336  0.043589
Residual      19.0  39716.097222  2090.320906       NaN       NaN

Tukey’s multi-comparison method, Holm-Bonferroni Method

URL

Setting up the data, and running an ANOVA

import numpy as np
import scipy.stats as stats

# Create four random groups of data with a mean difference of 1
group1 = np.random.normal(10, 3, 50)
group2 = np.random.normal(11, 3, 50)
group3 = np.random.normal(12, 3, 50)
group4 = np.random.normal(13, 3, 50)

# Show the results for Anova
F_statistic, pVal = stats.f_oneway(group1, group2, group3, group4)
print ('P value:')
print (pVal)

P value:
1.6462001201818463e-08

import pandas as pd

# Put into dataframe
df = pd.DataFrame()
df['treatment1'] = group1
df['treatment2'] = group2
df['treatment3'] = group3
df['treatment4'] = group4

# Stack the data (and rename columns):
stacked_data = df.stack().reset_index()
stacked_data = stacked_data.rename(columns={'level_0': 'id',
                                            'level_1': 'treatment',
                                            0:'result'})
# Show the first 8 rows:
print (stacked_data.head(8))

   id   treatment     result
 0  treatment1  12.980445
 0  treatment2   8.444603
 0  treatment3  10.713692
 0  treatment4  10.777762
 1  treatment1  14.350560
 1  treatment2   9.436072
 1  treatment3  12.715509
 1  treatment4  15.016419

SUPPLEMENT

df.stack()

OUTPUT

0   treatment1    10.130439
    treatment2     8.089465
    treatment3    18.145643
    treatment4    18.152929
1   treatment1    11.967849
    treatment2    15.090930
    treatment3    16.328056
    treatment4    15.949430
2   treatment1     9.953008
    treatment2    14.802174
    treatment3    16.048117
    treatment4     9.917037
3   treatment1     9.994950
    treatment2    11.442346
    treatment3    12.059078
    treatment4    13.461622
4   treatment1    10.532563
    treatment2    14.644867
    treatment3    10.803938
    treatment4    11.966377
5   treatment1     7.448921
    treatment2    11.672544
    treatment3     9.885759
    treatment4    11.060560
6   treatment1     8.750608
    treatment2    13.462668
    treatment3     9.172649
    treatment4     8.315324
7   treatment1    11.423393
    treatment2     9.432116
                    ...    
42  treatment3     6.160599
    treatment4    11.738915
43  treatment1     8.201168
    treatment2     9.803940
    treatment3     8.422076
    treatment4     9.180611
44  treatment1    10.158036
    treatment2    11.387816
    treatment3    10.132787
    treatment4    15.716954
45  treatment1     7.867300
    treatment2     7.031535
    treatment3     9.547684
    treatment4    19.755411
46  treatment1     5.016449
    treatment2    12.700640
    treatment3    15.266099
    treatment4    14.259488
47  treatment1     0.745349
    treatment2    13.105580
    treatment3    11.237676
    treatment4    21.625642
48  treatment1    11.615028
    treatment2    11.318771
    treatment3    15.321465
    treatment4    15.041703
49  treatment1     5.710994
    treatment2    15.220437
    treatment3    15.089208
    treatment4    10.121185
Length: 200, dtype: float64

df.stack().reset_index()

OUTPUT

	level_0	level_1		0
0	treatment1	10.130439
0	treatment2	8.089465
0	treatment3	18.145643
0	treatment4	18.152929
1	treatment1	11.967849
1	treatment2	15.090930
1	treatment3	16.328056
1	treatment4	15.949430
2	treatment1	9.953008
2	treatment2	14.802174
2	treatment3	16.048117
2	treatment4	9.917037
3	treatment1	9.994950
3	treatment2	11.442346
3	treatment3	12.059078
3	treatment4	13.461622
4	treatment1	10.532563
4	treatment2	14.644867
4	treatment3	10.803938
4	treatment4	11.966377
5	treatment1	7.448921
5	treatment2	11.672544
5	treatment3	9.885759
5	treatment4	11.060560
6	treatment1	8.750608
6	treatment2	13.462668
6	treatment3	9.172649
6	treatment4	8.315324
7	treatment1	11.423393
7	treatment2	9.432116
...	...	...		...
42	treatment3	6.160599
42	treatment4	11.738915
43	treatment1	8.201168
43	treatment2	9.803940
43	treatment3	8.422076
43	treatment4	9.180611
44	treatment1	10.158036
44	treatment2	11.387816
44	treatment3	10.132787
44	treatment4	15.716954
45	treatment1	7.867300
45	treatment2	7.031535
45	treatment3	9.547684
45	treatment4	19.755411
46	treatment1	5.016449
46	treatment2	12.700640
46	treatment3	15.266099
46	treatment4	14.259488
47	treatment1	0.745349
47	treatment2	13.105580
47	treatment3	11.237676
47	treatment4	21.625642
48	treatment1	11.615028
48	treatment2	11.318771
48	treatment3	15.321465
48	treatment4	15.041703
49	treatment1	5.710994
49	treatment2	15.220437
49	treatment3	15.089208
49	treatment4	10.121185
rows × 3 columns

Tukey’s multi-comparison method

from statsmodels.stats.multicomp import (pairwise_tukeyhsd, MultiComparison)

# Set up the data for comparison (creates a specialised object)
MultiComp = MultiComparison(stacked_data['result'],
                            stacked_data['treatment'])

# Show all pair-wise comparisons:
# Print the comparisons
results = MultiComp.tukeyhsd()
print(results.summary())
results.plot_simultaneous()

 Multiple Comparison of Means - Tukey HSD,FWER=0.05 
====================================================
  group1     group2   meandiff  lower  upper  reject
----------------------------------------------------
treatment1 treatment2  1.5021  -0.0392 3.0435 False 
treatment1 treatment3   1.47   -0.0714 3.0113 False 
treatment1 treatment4  3.8572   2.3159 5.3985  True 
treatment2 treatment3 -0.0322  -1.5735 1.5091 False 
treatment2 treatment4  2.355    0.8137 3.8963  True 
treatment3 treatment4  2.3872   0.8459 3.9285  True 

Holm-Bonferroni Method

comp = MultiComp.allpairtest(stats.ttest_rel, method='Holm')
print (comp[0])

Test Multiple Comparison ttest_rel 
FWER=0.05 method=Holm
alphacSidak=0.01, alphacBonf=0.008
=====================================================
  group1     group2     stat   pval  pval_corr reject
-----------------------------------------------------
treatment1 treatment2 -2.1234 0.0388   0.0776  False 
treatment1 treatment3 -2.4304 0.0188   0.0564  False 
treatment1 treatment4 -6.4443  0.0      0.0     True 
treatment2 treatment3  0.0457 0.9637   0.9637  False 
treatment2 treatment4 -3.7878 0.0004   0.0017   True 
treatment3 treatment4 -5.0246  0.0      0.0     True 

One-way ANOVA for regression analysis, model with constant

import pandas as pd
import statsmodels.api as sm
from sklearn.datasets import make_regression

X0, y, coef = make_regression(n_samples = 100, n_features=1, noise=30, coef=True, random_state=0)
dfX0 = pd.DataFrame(X0, columns=["X"])
dfX = sm.add_constant(dfX0)
dfy = pd.DataFrame(y, columns=["Y"])
df = pd.concat([dfX, dfy], axis=1)

model = sm.OLS.from_formula("Y ~ X", data=df)
result = model.fit()

print("TSS = ", result.uncentered_tss)       # TSS: total sum of square
print("ESS = ", result.mse_model)            # ESS: explained sum of squares
print("RSS = ", result.ssr)                  # RSS: residual sum of squares
print("ESS + RSS = ", result.mse_model + result.ssr)
print("R squared = ", result.rsquared)

OUTPUT 1 : ANOVA table for fixed model, single factor

{\bar {y}}={\frac {1}{n}}\sum _{i=1}^{n}y_{i}

SS_{\text{tot}}=\sum _{i}(y_{i}-{\bar {y}})^{2},

SS_{\text{reg}}=\sum _{i}(f_{i}-{\bar {y}})^{2},

SS_{\text{res}}=\sum _{i}(y_{i}-f_{i})^{2}=\sum _{i}e_{i}^{2}\,

R^{2}\equiv 1-{SS_{\rm {res}} \over SS_{\rm {tot}}}\,

TSS =  291345.7578983061
ESS =  188589.61349210917
RSS =  102754.33755137534
ESS + RSS =  291343.9510434845
R squared =  0.6473091780922585

Visualization for OUTPUT 1

import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats

sns.distplot(y,
             kde=False, fit=stats.norm, hist_kws={"color": "r", "alpha": 0.2}, fit_kws={"color": "r"},
             label="TSS")
sns.distplot(result.fittedvalues,
             kde=False, hist_kws={"color": "g", "alpha": 0.2}, fit=stats.norm, fit_kws={"color": "g"},
             label="ESS")
sns.distplot(result.resid,
             kde=False, hist_kws={"color": "b", "alpha": 0.2}, fit=stats.norm, fit_kws={"color": "b"},
             label="RSS")
plt.legend()
plt.show()

download

OUTPUT 2 : Regression F-test and ANOVA Relationship

sm.stats.anova_lm(result)

	        df	  sum_sq	      mean_sq	      F	                PR(>F)
X	        1.0	  188589.613492	      188589.613492   179.863766	6.601482e-24
Residual	98.0	  102754.337551	      1048.513648     NaN	        NaN

print(result.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      Y   R-squared:                       0.647
Model:                            OLS   Adj. R-squared:                  0.644
Method:                 Least Squares   F-statistic:                     179.9
Date:                Fri, 25 Oct 2019   Prob (F-statistic):           6.60e-24
Time:                        19:14:57   Log-Likelihood:                -488.64
No. Observations:                 100   AIC:                             981.3
Df Residuals:                      98   BIC:                             986.5
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -2.4425      3.244     -0.753      0.453      -8.880       3.995
X             43.0873      3.213     13.411      0.000      36.712      49.463
==============================================================================
Omnibus:                        3.523   Durbin-Watson:                   1.984
Prob(Omnibus):                  0.172   Jarque-Bera (JB):                2.059
Skew:                          -0.073   Prob(JB):                        0.357
Kurtosis:                       2.312   Cond. No.                         1.06
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Visualization : Coefficient of Determination(R2) and Correlation Coefficient

mathworld : correlation coefficient

Coefficient of Determination(R2)

{\bar {y}}={\frac {1}{n}}\sum _{i=1}^{n}y_{i}

SS_{\text{tot}}=\sum _{i}(y_{i}-{\bar {y}})^{2},

SS_{\text{reg}}=\sum _{i}(f_{i}-{\bar {y}})^{2},

SS_{\text{res}}=\sum _{i}(y_{i}-f_{i})^{2}=\sum _{i}e_{i}^{2}\,

R^{2}\equiv 1-{SS_{\rm {res}} \over SS_{\rm {tot}}}\,

Correlation Coefficient

For a Population,

\rho _{X,Y}=\operatorname {corr} (X,Y)={\operatorname {cov} (X,Y) \over \sigma _{X}\sigma _{Y}}={\operatorname {E} [(X-\mu _{X})(Y-\mu _{Y})] \over \sigma _{X}\sigma _{Y}}

For a Sample,

r_{xy}\quad {\overset {\underset {\mathrm {def} }{}}{=}}\quad {\frac {\sum \limits _{i=1}^{n}(x_{i}-{\bar {x}})(y_{i}-{\bar {y}})}{(n-1)s_{x}s_{y}}}={\frac {\sum \limits _{i=1}^{n}(x_{i}-{\bar {x}})(y_{i}-{\bar {y}})}{\sqrt {\sum \limits _{i=1}^{n}(x_{i}-{\bar {x}})^{2}\sum \limits _{i=1}^{n}(y_{i}-{\bar {y}})^{2}}}},

import seaborn as sns
import matplotlib.pyplot as plt

sns.jointplot(result.fittedvalues, y)
plt.show()

download (1)

One-way ANOVA for regression analysis, model without constant

import pandas as pd
import statsmodels.api as sm
from sklearn.datasets import make_regression

X0, y, coef = make_regression(
    n_samples=100, n_features=1, noise=30, bias=100, coef=True, random_state=0)
dfX = pd.DataFrame(X0, columns=["X"])
dfy = pd.DataFrame(y, columns=["Y"])
df = pd.concat([dfX, dfy], axis=1)

model2 = sm.OLS.from_formula("Y ~ X + 0", data=df)
result2 = model2.fit()
result2.summary()

OUTPUT

Comparing Model Using F Test

OUTPUT

Comparison of importance of variables using F test

OUTPUT

Adjusted Coefficient of Determination(Adjusted R2)

OUTPUT

Information Criterion

OUTPUT

Two-way analysis of variance

Two-way ANOVA with repetition

ANOVA_twoway_with_repetition

Two-way ANOVA without repetition

ANOVA_twoway_without_repetition

Multivariate analysis of variance (MANOVA)

List of posts followed by this article

Reference

분산 분석과 모형 성능

OUTPUT

6626070
2997924

MATH05, Analysis of variance

Contents

One-way analysis of variance

Tukey’s multi-comparison method, Holm-Bonferroni Method

One-way ANOVA for regression analysis, model with constant

One-way ANOVA for regression analysis, model without constant

Comparing Model Using F Test

Comparison of importance of variables using F test

Adjusted Coefficient of Determination(Adjusted R2)

Information Criterion

Two-way analysis of variance

Two-way ANOVA with repetition

Two-way ANOVA without repetition

Multivariate analysis of variance (MANOVA)

6626070 2997924

MATH05, Analysis of variance

Contents

One-way analysis of variance

Tukey’s multi-comparison method, Holm-Bonferroni Method

One-way ANOVA for regression analysis, model with constant

One-way ANOVA for regression analysis, model without constant

Comparing Model Using F Test

Comparison of importance of variables using F test

Adjusted Coefficient of Determination(Adjusted R2)

Information Criterion

Two-way analysis of variance

Two-way ANOVA with repetition

Two-way ANOVA without repetition

Multivariate analysis of variance (MANOVA)

6626070
2997924