Before starting any test statistic, the following things should be checked.
Trying to check the normality of the data
• Are the values normally distributed or not? (A rule of thumb: if you have a large number of observations (>30), there is no need to worry much about this.)
• Are there outliers in the data?
• If the data were collected over a period of time, is there evidence of serial correlation?
To check the above 3 things:
das.df<-read.table('E:/R Exercise/R class/das.txt', header=T)
das.df
attach(das.df)
names(das.df)
Let's see the distribution of the data and check it for normality:
par(mfrow=c(2,2)) # partitioning the plot area into 2x2
plot(y)
boxplot(y)
hist(y,main="")
qqnorm(y) # quantile-quantile plot (QQ-plot)
qqline(y,lty=2) # fitting QQ-line
shapiro.test(y) # Test for normality
The p-value is a measure of the credibility of the null hypothesis. If a result is sufficiently unlikely to have occurred by chance (say, p < 0.05), we say that it is statistically significant. Here the p-value is < 0.05, so the null hypothesis is rejected: the data are not normally distributed.
Outliers
Here a value in the 'y' vector is changed, and the outlier shows up in the plots:
y2<-y # to demo outlier
y2[52]<-3.75
plot(y2)
boxplot(y2)
qqnorm(y2) # quantile-quantile plot (QQ-plot)
qqline(y2,lty=2) # fitting QQ-line
hist(y2) # one bar is a bit far from the rest
summary(y) # to check the inter-quartile distance
Outlier rule of thumb: an outlier is a value more than 1.5 times the inter-quartile range above the third quartile or below the first quartile.
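This rule can be applied directly in R; a minimal sketch with a made-up vector (the values below are hypothetical, chosen only to illustrate the 1.5 × IQR rule):

```r
# Hypothetical data: nine ordinary values and one suspiciously large one
v <- c(2.1, 2.4, 2.5, 2.6, 2.8, 2.9, 3.0, 3.1, 3.2, 9.9)

q <- quantile(v, c(0.25, 0.75))  # first and third quartiles
iqr <- q[2] - q[1]               # inter-quartile range (same as IQR(v))
lower <- q[1] - 1.5 * iqr        # values below this are flagged
upper <- q[2] + 1.5 * iqr        # values above this are flagged

v[v < lower | v > upper]         # only 9.9 is flagged as an outlier
```

The same bounds are what boxplot() uses to draw points beyond the whiskers.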
One-Sample Tests
Suppose we have a single sample. The questions we might want to answer are these:
• What is the mean value?
• Is the mean value significantly different from current expectation or theory (the population mean)?
• What is the level of uncertainty associated with our estimate of the mean value?
The t-test is used to compare means.
• Non-normality, outliers and serial correlation can all invalidate inferences made by standard parametric tests such as Student's t-test.
One Sample T-test
x<-c(1,1,2,2,2,3,3,3,3,3,4,4,4,4,4,4,5,5,5) # a vector
mean(x)
t.test(x,mu=5.7) # mu is the population mean; suppose here we know it
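Under the hood, the one-sample t statistic is just t = (mean − mu) / (sd / sqrt(n)). A quick sketch verifying this against t.test, reusing the same vector x as above:

```r
# Recomputing the one-sample t statistic by hand
x <- c(1,1,2,2,2,3,3,3,3,3,4,4,4,4,4,4,5,5,5)
mu <- 5.7                                    # hypothesised population mean

t.by.hand <- (mean(x) - mu) / (sd(x) / sqrt(length(x)))
t.by.hand
t.test(x, mu = 5.7)$statistic                # same value, computed by t.test
```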
Another example:
light<-read.table('E:/R Exercise/R class/light.txt',header=T)
light
attach(light)
names(light)
hist(speed, main="") # histogram plotting, with the graph title (main) left blank
The histogram is negatively skewed, so the t-test is not recommended.
summary(speed)
Instead of the t-test, the Wilcoxon signed-rank test should be performed here.
mean(speed)
wilcox.test(speed,mu=990) # instead of t.test; mu = population mean, which is supplied
In the data, 299000 has already been subtracted from the speed values.
During the study in 1879, the speed of light was thought to be 299990 (i.e. the population mean).
As 299000 has been subtracted, the population mean here is 990.
As the p-value is < 0.05, the null hypothesis is rejected: the mean speed of light is not equal to 990.
What is the inference then?
light$orginal<-speed+299000
head(light)
The speed values had 299000 subtracted from them, so the same value is added back:
attach(light)
colMeans(light) # means of both columns ('mean' of a data frame is defunct in recent R)
wilcox.test(orginal,mu=299990)
Speed 299990 was the established speed of light at that time.
t.test(orginal,mu=299990)
The t-test is not recommended here because of the non-normality; it still comes out significant, but GARBAGE IN = GARBAGE OUT.
Two-Sample t-test
Comparing two variances
• Before starting a test to compare two sample means, we need to test whether the sample variances are significantly different.
• If the variances are not significantly different, proceed to the further test statistics.
There are two simple tests for comparing two sample means:
• Student's t-test (when samples are independent, variances are constant, and errors are normally distributed)
• Wilcoxon's rank-sum test (when samples are independent but errors are not normally distributed, e.g. they are ranks or scores of some sort)
t.data<-read.table('E:/R Exercise/R class/t.test.data.txt',header=T)
attach(t.data)
names(t.data)
t.data
mean(gardenA)
mean(gardenB)
boxplot(t.data) # direct box plot
boxplot(t.data,notch=T) # box plot with notches
Notice that the notches do not overlap, indicating a significant difference in medians.
var.test(gardenA, gardenB) # testing for a difference in variances
If the variances are significantly different, Student's t-test is the wrong statistic.
t.test(gardenA, gardenB)
The sign of the t-value is irrelevant in the t-test.
Note that, because the means are significantly different, the confidence interval on the difference does not include zero (in fact, it goes from -3.085 up to -0.915). You might present the result like this: ozone concentration was significantly higher in garden B (mean = 5.0 pphm) than in garden A (mean = 3.0 pphm; t = 3.873, p = 0.0011 (2-tailed), d.f. = 18).
Wilcoxon Rank-sum Test
This is a non-parametric test, an alternative to Student's t-test.
t.data # object in R created above
wilcox.test (gardenA, gardenB)
Compare the p-values of the two tests above. Here the parametric test gives the smaller p-value: when its assumptions are met, the parametric test is more powerful than the non-parametric one; the non-parametric test is the safer choice when they are not.
Paired T-test
streams.df <- read.table('E:/R Exercise/R class/streams.txt',
header=T)
attach(streams.df)
streams.df
t.test (down,up, paired= T) # paired t-test
BUT
t.test(down,up) # This is Student's t-test
Check the difference in result
The same paired test can be done in the same way we treated the one
sample data
d<-(up-down)
d
t.test(d)
They give identical results.
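The equivalence can be checked with simulated data (the up.sim/down.sim values below are hypothetical, generated only to illustrate the point):

```r
# Simulated paired measurements: 16 'upstream' values and correlated 'downstream' values
set.seed(1)
up.sim   <- rnorm(16, mean = 5, sd = 1)
down.sim <- up.sim + rnorm(16, mean = 0.5, sd = 0.4)

t1 <- t.test(down.sim, up.sim, paired = TRUE)  # paired t-test
t2 <- t.test(down.sim - up.sim)                # one-sample t-test on the differences

c(t1$statistic, t2$statistic)                  # identical t values
c(t1$p.value, t2$p.value)                      # identical p-values
```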
Z-test
A rule of thumb is that if the number of samples exceeds 30, a z-test is performed instead of a t-test. But some scholars do not prefer the z-test and use the t-test instead; the results will be slightly different.
for.z<-read.table("E:/R Exercise/R class/t.test.data.txt",header=T)
attach(for.z)
names(for.z)
for.z
sapply(for.z, sd) # standard deviation of each column ('sd' of a data frame is defunct in recent R)
install.packages('BSDA')
library(BSDA) # load the library for the z-test (install first if needed)
test.res<-z.test(gardenA,gardenB, alternative='two.sided',
sigma.x=sd(gardenA), sigma.y=sd(gardenB))
test.res
Chi-square Test
Pearson's chi-square test is the best known of several chi-square test statistics.
It is used to assess two types of comparison: tests of goodness of fit and tests of independence.
A goodness-of-fit test establishes whether or not an observed frequency distribution differs from a theoretical (expected) distribution.
A test of independence assesses whether paired observations on two variables, expressed in a contingency table, are independent of each other.
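What chisq.test computes for a test of independence can be sketched by hand: the expected count in each cell is row total × column total / grand total, and the statistic is the sum of (observed − expected)² / expected over all cells. A minimal example with a made-up 2 × 2 table:

```r
# A hypothetical 2 x 2 contingency table, only to show what chisq.test computes
obs <- matrix(c(30, 10,
                20, 40), nrow = 2, byrow = TRUE)

expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)  # E = row total * col total / grand total
chi.sq   <- sum((obs - expected)^2 / expected)            # Pearson's chi-squared statistic
chi.sq

chisq.test(obs, correct = FALSE)$statistic                # same value, without Yates' correction
```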
Assumptions
• Random samples
• Large sample size
• Expected cell count > 5 (in all cells)
• Independence – cannot be used for data with correlated observations
Chi-square Test for Goodness of Fit
b.df<-read.table('E:/R Exercise/R class/count.txt',header=T)
b.df
chisq.test(b.df) # Pearson's Chi-squared test
If you need to extract the frequencies expected under the null
hypothesis of independence then use
chisq.test(b.df,correct=T)$expected
chisq.test(b.df,correct=F) # without Yates' Continuity correction
?chisq.test # for help
Chi-square test for Independence
library(MASS) # load the MASS package (default loaded)
data(survey) # this data file comes with R installation
head(survey)
levels(survey$Smoke) #checking levels of data in Smoke variable
levels(survey$Exer)
survey1.data<-survey[1:15,] # extracting and saving data from row 1 to 15, all column of data
Here the question is whether the students' smoking habit is independent of their exercise level, at the 0.05 significance level.
Check how many levels there are in each variable:
attach(survey1.data)
levels(Smoke)
levels(Exer)
Now build the contingency table of counts required for the test:
chitable=table(survey$Smoke, survey$Exer)
chitable
Now the contingency table is ready and we can run the chi-square test:
chisq.test(chitable)
The p-value may not be accurate, as at least one cell has a very low count.
chisq.test(chitable)$expected # For expected values
Fisher's exact test
This is an alternative to the chi-square test when the count in any cell of the contingency table is below 5.
ant.df<-read.table('E:/R Exercise/R class/antintree.txt')
ant.df
attach(ant.df)
fisher.test(ant.df)
The null hypothesis is not rejected, meaning there is no association between tree and ant colony; they are independent variables.
The Fisher test does not strictly need a pre-built contingency table; for example:
table<-read.table("E:/R Exercise/R class/fisher.txt",header=T)
attach(table)
head(table)
fisher.test(tree,nests)
Result: if an observed frequency is greater than the expected frequency, the variables are positively associated; if it is lower than expected, they are negatively associated. Check the p-value to see whether the association is significant.
Power Analysis (Power of Test)
The power of a test is the probability of rejecting the null hypothesis when it is false.
It has to do with Type II errors: β is the probability of accepting the null hypothesis when it is false.
Most statisticians work with α=0.05 and β =0.2
Now the power of a test is defined as 1−β = 0.8 under these standard assumptions, i.e. there is an 80% chance that we reject the null hypothesis when it is false.
power.t.test # power calculations for one- and two-sample t tests;
power.anova.test # power calculations for balanced one-way ANOVA tests.
?power.t.test
power.t.test (delta=2,sd=3.5,power=0.8)
Where:
n = the number of observations per group
delta = the difference in means we want to be able to detect (we need to think hard about this value)
sd = the standard deviation of the sample
sig.level = the significance level (Type I error probability), where we often accept the default value of 5%
power = the power you want the test to have, where we often accept the default value of 80%
type = the type of t-test you want to carry out: two-sample, one-sample or paired
alternative = whether you want a one- or two-tailed test, where you will typically want the default two-tailed test
This sounds like a lot of work, but we typically use all of the defaults, so we only need to specify delta and sd to work out the sample size n that will give us the power we want.
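As a sanity check, the power reported by power.t.test can be reproduced by brute-force simulation: draw many pairs of samples with the stated true difference, run a t-test on each, and count how often the null hypothesis is rejected. (The seed and number of replicates below are arbitrary choices.)

```r
# n per group for delta = 2, sd = 3.5, power = 0.8, rounded up to a whole number
n <- ceiling(power.t.test(delta = 2, sd = 3.5, power = 0.8)$n)

set.seed(42)
reps <- 2000
hits <- replicate(reps, {
  a <- rnorm(n, mean = 0, sd = 3.5)
  b <- rnorm(n, mean = 2, sd = 3.5)   # true difference of 2
  t.test(a, b)$p.value < 0.05         # did we reject H0?
})
mean(hits)                            # proportion of rejections: close to 0.8
```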
Analysis of Variance (ANOVA)
· ANOVA is the technique we use when all the explanatory variables are CATEGORICAL.
· The explanatory variables are called factors, and each factor has two or more levels.
· When there is a single factor with 3 or more levels, we use one-way ANOVA.
· If we had a single factor with just 2 levels, we would use Student's t-test or similar.
· Where there are 2 or more factors, we use two-way or three-way ANOVA, depending on the number of explanatory variables.
· Where there is replication at each combination of levels in a multi-way ANOVA, the experiment is called a factorial design; this allows us to study interactions between variables, in which we test whether the response to one factor depends on the level of another factor.
One-way ANOVA
• The idea of ANOVA is to compare two or more means, but it does this by comparing variances.
• One-way ANOVA is used when you have a continuous response variable and only one predictor variable, which is categorical and has more than two levels.
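The 'comparing means by comparing variances' idea can be sketched by hand with made-up numbers: the F ratio is the between-group mean square divided by the within-group mean square (the data below are hypothetical, three groups of five):

```r
# Hypothetical response values for three groups A, B, C
yv  <- c(6,7,5,6,6,  8,9,8,7,9,  4,5,4,6,5)
grp <- factor(rep(c('A','B','C'), each = 5))

grand <- mean(yv)
ssa <- sum(tapply(yv, grp, function(v) length(v) * (mean(v) - grand)^2)) # between-group SS
sse <- sum(tapply(yv, grp, function(v) sum((v - mean(v))^2)))            # within-group SS
f <- (ssa / 2) / (sse / 12)          # F = MS_between / MS_within (df = 2 and 12)
f
summary(aov(yv ~ grp))[[1]][['F value']][1]  # same value from aov
```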
Example:
• We have an experiment in which crop yields per unit were measured from 10 randomly selected fields on each of three soil types. All fields were sown with the same variety of seed and provided with the same fertilizer and pest-control inputs. The question is whether soil type significantly affects crop yield, and if so, to what extent.
result.df<-read.table('E:/R Exercise/R class/yields.txt', header=T)
result.df
attach(result.df)
sapply(list(sand,clay,loam),mean) # to calculate means
colMeans(result.df) # column means, to compare with the above command ('mean' of a data frame is defunct in recent R)
y<-c(sand,clay,loam) # combining the three yield columns into a single vector 'y'
y # checking 'y'
soil<-factor(rep(1:3,c(10,10,10))) # a factor 'soil' with levels 1 to 3, each repeated 10 times
soil # checking 'soil'
sapply(list(sand,clay,loam),var) # checking consistency of variance before going to ANOVA
The variances differ by more than a factor 2. But is this significant?
It can be tested for heteroscedasticity using the Fligner-Killeen test of homogeneity of variance.
fligner.test(y~soil) # Fligner-Killeen test
The difference is not significant. Here we have a categorical explanatory variable with three levels (soil types: sand, clay, loam).
The homogeneity of variances can also be checked with a box plot:
boxplot(result.df,names=c('sand','clay','loam'), ylab='Yield')
The data need to be restructured (a single response column plus a factor) before running the ANOVA:
result1.df<-read.table('E:/R Exercise/R class/yields1.txt', header=T)
result1.df
attach(result1.df)
barplot(tapply(yield,list(type), mean)) # bar plot for mean values of each 'type' levels
plot(result1.df) # box plot
tapply(yield,list(type),mean) # checking means
Alternate way to calculate above three means
mean(yield [type=='sand'])
mean(yield [type=='clay'])
mean(yield [type=='loam'])
model.aov<-aov(yield~type) # performing Analysis of Variance
summary(model.aov)
model.aov
Here the p-value is significant: there is a significant difference in yield between soil types.
Since there is a significant effect of 'type', we need to test which of the soil types differ from each other.
install.packages('multcomp')
library(multcomp) # loading function library
mc <- glht(model.aov, linfct = mcp(type = "Tukey")) # General Linear Hypotheses
linfct -- a specification of the linear hypotheses to be tested
type -- the name of the specific procedure
summary(mc)
Compare result with boxplot
plot(result1.df) # box plot; note that the inter-quartile ranges do not overlap, indicating significant differences
Two-way ANOVA
• Two-way ANOVA is used when you have a continuous response variable and two categorical predictor variables. To test for a possible interaction between the two predictors, you need replication at each combination of levels of both.
daphnia.df<-read.table('E:/R Exercise/R class/daphnia.txt', header=T)
daphnia.df[1:5,] # checking first 5 row to see the data types and heading names
attach(daphnia.df)
sapply(list(Water),levels) # checking levels of variable
sapply(list(Detergent),levels) # checking levels of variable
sapply(list(Daphnia),levels) # checking levels of variable
names(daphnia.df)
boxplot(Growth.rate~Detergent) # a simple box plot of growth rate by detergent
twoway.aov<- aov(Growth.rate~Water+Detergent+Daphnia) # Performing the analysis without interaction between variables
summary(twoway.aov)
model1<- aov(Growth.rate~Water*Detergent*Daphnia) # '*' fits main effects plus interactions
summary(model1)
summary.lm(twoway.aov) # viewing models in lm instead of ANOVA
summary.lm(model1) # viewing model1 in lm result instead ANOVA
Multivariate Analysis of Variance (MANOVA)
• When 2 or more response variables are measured
• Either analyse separately for each variable (conventional way) or together
• But there are occasions where you want to treat the group of response variables as one multivariate response
• The function for this is MANOVA.
data.df<-read.table('E:/R Exercise/R class/manova.txt', header=T)
data.df[1:5,] # check the variables and types
attach(data.df)
In this data set, there are 3 response and 2 predictor variables
y<-cbind(tear,gloss,opacity) # binding the response variables into a matrix
y
sapply(list(rate),levels) # checking levels of variable
sapply(list(additive),levels)
model<-manova(y~rate*additive)
summary (model) # overall test result
summary.aov(model) # gives a separate ANOVA result for each response variable
See the difference between separate analysis (ANOVA) and combined (MANOVA)
Analysis of Covariance (ANCOVA)
• It combines elements from regression and ANOVA.
• The response variable is continuous (as in the cases above), and there is at least one continuous explanatory variable and at least one categorical explanatory variable.
• In the analysis, either lm (linear regression) or aov (analysis of variance) can be used; the choice affects only the format of the summary, not the result.
• Let's work with an example concerning an experiment on the impact of grazing on the seed production of a biennial plant: 40 plants were allocated to 2 treatments, grazed and ungrazed.
read.table('E:/R Exercise/r class/ipomopsis.txt', header=T)->regrowth.df
attach(regrowth.df)
names(regrowth.df)
regrowth.df[1:5,]
sapply(list(Grazing),levels)
Here 'Fruit' is response variable and 'Root' & 'Grazing' are
explanatory variables
x11() # create a new plot window, without replacing the previous one
plot(Root[Grazing=='Grazed'],Fruit[Grazing=='Grazed'],
pch=17, col= 'blue', xlim=c(4,12),ylim=c(10,140))
par(new=T) # to add new parameter in graph without replace previous one
plot(Root[Grazing=='Ungrazed'],Fruit[Grazing=='Ungrazed'],
pch=16, col='red', xlim=c(4,12),ylim=c(10,140),xlab='',ylab='')
Let's use logical subscripts to draw linear regression lines for the two grazing treatments separately, using 'abline':
abline(lm(Fruit
[Grazing=='Grazed']~Root[Grazing=='Grazed']), lty=2, col='blue')
abline(lm(Fruit
[Grazing=='Ungrazed']~Root[Grazing=='Ungrazed']), lty=1,lwd=2, col='red')
In the analysis, we fit the most complicated model first, then simplify it by removing non-significant terms until the minimal adequate model is obtained.
ancova<-lm(Fruit~Grazing*Root)
summary(ancova) # summary for the linear model
anova(ancova)
ancova2<-lm(Fruit~Grazing+Root)
summary(ancova2)
Let's check which one is better:
anova(ancova,ancova2)
Our null hypothesis is: ancova = ancova2 (the two models do not differ).
Check the result: p = 0.75, so the null hypothesis is not rejected; there is no significant effect of the interaction term (Grazing:Root), which is the term removed in the second model. So we choose the simpler model here (simpler is better).
Now our model is
ancova2<-lm(Fruit~Grazing+Root)
summary(ancova2)
Now let's check whether 'Grazing' really has an effect on fruit production, by removing it from the model:
ancova3<-lm(Fruit~Root)
summary(ancova3)
As above, let's check whether there is a significant difference between 'ancova2' and 'ancova3' (the simpler model):
anova (ancova2,ancova3)
See the p-value: the two models are significantly different, so we should not choose the 'ancova3' model. In other words, removing 'Grazing' worsens the model significantly, meaning 'Grazing' has an effect and should be kept in the model.
Now our final simplest model is the 'ancova2' i.e.
lm(Fruit~Grazing+Root)
Now our result is like this
ancova.test<-lm(Fruit~Grazing+Root)
summary (ancova.test)
There is a significant effect of the grazing treatment on fruit production, and a significant effect of root size on fruit production.
Remember that the order of variables matters in ANCOVA: the sequential ANOVA table from lm(Fruit~Grazing+Root) is not the same as that from lm(Fruit~Root+Grazing). There is no definitive rule about the order but, as a rule of thumb, put first the variable you are most concerned with.
summary.aov(lm(Fruit~Grazing*Root))
summary.aov(lm(Fruit~Root*Grazing))