Test statistics

Before starting any test statistics, the following things should be checked:
          Are the values normally distributed or not?
          (As a rule of thumb, if you have a large sample (n > 30) this is less of a worry.)
          Are there outliers in the data?
          If data were collected over a period of time, is there evidence for serial correlation?
To check the above three things:
Let's see the distribution of the data:
das.df<-read.table('E:/R Exercise/R class/das.txt', header=T)
das.df
attach(das.df)
names(das.df)

Checking the normality of the data
par(mfrow=c(2,2))    # partitioning the plot area into 2x2
plot(y)
boxplot(y)
hist(y,main="")
qqnorm(y)                           # quantile-quantile plot (QQ-plot)
qqline(y,lty=2)            # fitting the QQ-line
shapiro.test(y)          # test for normality
The p-value is a measure of the credibility of the null hypothesis: if the observed result is sufficiently unlikely to have occurred by chance (say, p < 0.05), we call it statistically significant. In shapiro.test() the null hypothesis is that the data are normally distributed, so p < 0.05 means the data depart significantly from normality. Here the p-value is > 0.05, so we fail to reject the null hypothesis: there is no significant evidence against normality.
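This logic can be checked on simulated data (a hypothetical illustration, not the das.txt data — the rnorm/rexp samples below are made up for the demonstration):

```r
# Hypothetical illustration of shapiro.test():
# the null hypothesis of the test is that the sample IS normal
set.seed(42)                      # for reproducibility
normal.sample <- rnorm(100)       # drawn from a normal distribution
skewed.sample <- rexp(100)        # drawn from a (strongly skewed) exponential

shapiro.test(normal.sample)$p.value   # typically > 0.05: no evidence against normality
shapiro.test(skewed.sample)$p.value   # well below 0.05: normality rejected
```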

Outlier
Here one value in the 'y' vector is changed to create an outlier, which then stands out in the plots.
y2<-y                # to demo outlier
y2[52]<-3.75
plot(y2)
boxplot(y2)
qqnorm(y2)           # quantile-quantile plot (QQ-plot)
qqline(y2,lty=2)           # fitting the QQ-line
hist(y2)             # one bar sits far from the rest
summary(y2)          # to check the inter-quartile range
Outlier: a rule of thumb is that an outlier is a value more than 1.5 times the inter-quartile range above the third quartile or below the first quartile.
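The 1.5 × IQR rule can be computed directly. A minimal sketch on a small made-up vector (not the das.txt data):

```r
# Sketch of the 1.5 * IQR outlier rule on a made-up vector
y2 <- c(2.1, 2.3, 2.4, 2.5, 2.6, 2.7, 2.8, 3.75)   # last value looks suspicious
q   <- quantile(y2, c(0.25, 0.75))   # first and third quartiles
iqr <- IQR(y2)                       # inter-quartile range: Q3 - Q1
lower <- q[1] - 1.5 * iqr            # lower fence
upper <- q[2] + 1.5 * iqr            # upper fence
y2[y2 < lower | y2 > upper]          # values flagged as outliers -> 3.75
```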

One-Sample Tests
Suppose we have a single sample. The questions we might want to answer are these:
          What is the mean value?
          Is the mean value significantly different from current expectation or theory (the population mean)?
          What is the level of uncertainty associated with our estimate of the mean value?
The t-test is used to compare means.
          Non-normality, outliers and serial correlation can all invalidate inferences made by standard parametric tests like Student’s t-test.
One Sample T-test
x<-c(1,1,2,2,2,3,3,3,3,3,4,4,4,4,4,4,5,5,5)     # a vector
mean(x)
t.test(x,mu=5.7)                     # mu is population mean, suppose here we know it

Another example
light<-read.table('E:/R Exercise/R class/light.txt',header=T)
light
attach(light)
names(light)
hist(speed, main="")                    # histogram plotting, with graph title (main) blank
The histogram is negatively skewed, so a t-test is not recommended.
summary(speed)
Here instead of T-test, Wilcoxon signed rank test should be performed.
mean(speed)
wilcox.test(speed,mu=990)     # instead of t.test, mu= population mean, which is supplied
In the data, 299000 has already been subtracted from each speed value.
During the study in 1879, the speed of light was thought to be 299990 (i.e. the population mean).
Since 299000 has been subtracted, the population mean here is 990.
As the p-value < 0.05, H1 (the alternative hypothesis) is accepted: the mean speed of light is not equal to 990.
What is the inference then?
light$original<-speed+299000
head(light)              
The speed values already had 299000 subtracted, so the same value is added back.
attach(light)
mean(original)
wilcox.test(original,mu=299990)
A speed of 299990 was the established speed of light at that time.
t.test(original,mu=299990)    
The t-test is not recommended here because of the non-normality; it still looks significant, but GARBAGE IN = GARBAGE OUT.

Two sample T-test
Comparing two variances
          Before starting a test to compare two sample means, we need to test whether the sample variances are significantly different.
          If the variances are not significantly different, proceed with the test statistics below.
          There are two simple tests for comparing two sample means:
          Student’s t-test (when samples are independent, variances are constant, and errors are normally distributed)
          Wilcoxon’s rank-sum test (when samples are independent but errors are not normally distributed, e.g. they are ranks or scores of some sort)
t.data<-read.table('E:/R Exercise/R class/t.test.data.txt',header=T)
attach(t.data)
names(t.data)
t.data
mean(gardenA)
mean(gardenB)
boxplot(t.data)                  # direct box plot
boxplot(t.data,notch=T)               # box plot with notches
Notice that the notches do not overlap, indicating a significant difference in the medians.
var.test(gardenA, gardenB)       # testing variance difference
If the variances are significantly different, the classical Student's t-test is the wrong statistic (note that R's t.test performs the Welch version by default, which does not assume equal variances).
t.test(gardenA, gardenB)
The sign of the t-value is irrelevant; only its magnitude matters.
Note that, because the means are significantly different, the confidence interval on the difference does not include zero (in fact, it goes from -3.085 up to -0.915). You might present the result like this: ‘Ozone concentration was significantly higher in garden B (mean = 5.0 pphm) than in garden A (mean = 3.0 pphm; t = 3.873, p = 0.0011 (2-tailed), d.f. = 18).’

Wilcoxon Rank-sum Test
This is a non-parametric alternative to Student’s t-test.
t.data                                                          # object in R created above
wilcox.test (gardenA, gardenB)
Compare the p-values of the two tests above. When the assumptions of the parametric test are in doubt, the non-parametric result is the safer one to report, even though it typically has less power.

Paired T-test
streams.df <- read.table('E:/R Exercise/R class/streams.txt', header=T)
attach(streams.df)
streams.df
t.test (down,up, paired= T)               # Paired t-test
BUT
t.test(down,up)                                       # This is Student's t-test
Check the difference in result
The same paired comparison can also be done by treating the differences as one-sample data:
d<-(up-down)
d
t.test(d)
The results are identical.

Z-test
A rule of thumb is that if the number of samples exceeds 30, a z-test may be performed instead of a t-test. Some scholars prefer to keep using the t-test; the results will differ only slightly.
for.z<-read.table("E:/R Exercise/R class/t.test.data.txt",header=T)
attach(for.z)
names(for.z)
for.z
sapply(for.z, sd)                  # standard deviation of each column (sd() no longer works on a whole data frame)
install.packages('BSDA')
library(BSDA)                                     # load the BSDA library for z.test (install first if needed)
test.res<-z.test(gardenA,gardenB, alternative='two.sided', sigma.x=sd(gardenA), sigma.y=sd(gardenB))
test.res
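If BSDA is not available, the two-sample z statistic can also be computed by hand. A sketch with made-up vectors standing in for gardenA and gardenB (the t.test.data.txt file is not reproduced here):

```r
# Manual two-sample z-test (sketch; a and b are made-up stand-ins
# for gardenA and gardenB, chosen to have means 3 and 5)
a <- c(3, 4, 4, 3, 2, 3, 1, 3, 5, 2)
b <- c(5, 5, 6, 7, 4, 4, 3, 5, 6, 5)
z <- (mean(a) - mean(b)) / sqrt(var(a)/length(a) + var(b)/length(b))
p <- 2 * pnorm(-abs(z))        # two-sided p-value from the normal distribution
z; p                           # z is about -3.87; p is well below 0.05
```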


Chi-square Test
Pearson’s chi-squared test is the best known of several chi-squared test statistics.
It is used to assess two types of comparison: test of goodness of fit and test of independence.
Goodness of fit establishes whether or not an observed frequency distribution differs from a theoretical distribution (expected one)
Test of independence assesses whether paired observations on two variables, expressed in contingency table, are independent of each other
Assumptions
          Random samples
          Sample size – should be reasonably large
          Expected cell count > 5 (in all cells)
          Independence – cannot be used with correlated (non-independent) observations
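The expected-count assumption can be checked before trusting the test. A minimal sketch with a hand-made 2x2 table (hypothetical counts, not from the course data files):

```r
# Hypothetical 2x2 contingency table (illustrative counts only)
obs <- matrix(c(30, 10,
                20, 40), nrow = 2, byrow = TRUE,
              dimnames = list(group = c("A", "B"),
                              outcome = c("yes", "no")))
res <- chisq.test(obs)
res$expected          # check every expected count is > 5 before trusting the test
res$p.value           # here well below 0.05: the variables are not independent
```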
Chi-square test for Goodness of fit
Pearson’s chi-squared
b.df<-read.table('E:/R Exercise/R class/count.txt',header=T)
b.df
chisq.test(b.df)                                # Pearson's Chi-squared test
If you need to extract the frequencies expected under the null hypothesis of independence then use
chisq.test(b.df,correct=T)$expected
chisq.test(b.df,correct=F)         # without Yates' Continuity correction
?chisq.test                                             # for help

Chi-square test for Independence 
library(MASS)       # load the MASS package (ships with R, but still needs loading)
data(survey)         # this data file comes with R installation
head(survey)
levels(survey$Smoke)   #checking levels of data in Smoke variable
levels(survey$Exer)
survey1.data<-survey[1:15,] # extracting and saving data from row 1 to 15, all column of data
Here the question is whether the students' smoking habit is independent of their exercise level, at the 0.05 significance level.
Check how many levels there are in both variables:
attach(survey1.data)
levels(Smoke)
levels(Exer)
First tabulate the two variables into the contingency table required for the test:
chitable=table(survey$Smoke, survey$Exer) 
chitable
Now the contingency table is ready and we can run the chi-squared test:
chisq.test(chitable)
The p-value is not exact here, because at least one cell has a very low expected count (R warns that the chi-squared approximation may be incorrect).
chisq.test(chitable)$expected # For expected values


Fisher's exact test
This is an alternative to the chi-squared test when the expected count falls below 5 in any cell of the contingency table.
ant.df<-read.table('E:/R Exercise/R class/antintree.txt')
ant.df
attach(ant.df)
fisher.test(ant.df)
The null hypothesis is not rejected, meaning there is no evidence of association between tree and ant colony; they appear to be independent variables.
The Fisher test does not necessarily need a pre-built contingency table; it can also take two factors directly, for example:
fisher.df<-read.table("E:/R Exercise/R class/fisher.txt",header=T)   # avoid the name 'table', which masks the table() function
attach(fisher.df)
head(fisher.df)
fisher.test(tree,nests)
Result:
If the observed frequency is greater than the expected frequency, the variables are positively associated; if it is lower, they are negatively associated.
Whether the association is significant is judged from the p-value.
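A self-contained sketch on a hypothetical 2x2 table (illustrative counts only, not the antintree.txt data):

```r
# Hypothetical 2x2 table: ant colony presence vs tree species
counts <- matrix(c(9, 1,
                   2, 8), nrow = 2, byrow = TRUE,
                 dimnames = list(tree = c("A", "B"),
                                 ants = c("present", "absent")))
ft <- fisher.test(counts)
ft$p.value     # below 0.05 here -> significant association
ft$estimate    # odds ratio > 1 -> observed exceeds expected in the top-left cell
```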

Power Analysis (Power of Test)
The power of a test is the probability of rejecting the null hypothesis when it is false.
It has to do with Type II errors: β is the probability of accepting the null hypothesis when it is false.
Most statisticians work with α = 0.05 and β = 0.2.
The power of a test is then defined as 1 − β = 0.8 under these standard assumptions, i.e. there is an 80% chance that we reject a false null hypothesis.
power.t.test # power calculations for one- and two-sample t tests;
power.anova.test # power calculations for balanced one-way ANOVA tests.
?power.t.test
power.t.test (delta=2,sd=3.5,power=0.8)
Where, 
n = the number of observations per group
delta = the difference in means we want to be able to detect (we need to think hard about this value)
sd = the standard deviation of the sample
sig.level = the significance level (Type I error probability), where we often accept the default value of 5%
power = the power you want the test to have, where we often accept the default value of 80%
type = the type of t-test you want to carry out: two-sample, one-sample or paired
alternative = whether you want a one- or a two-tailed test, where you will typically want the default two-tailed test
This sounds like a lot of work, but we typically use all of the defaults, so we only need to specify the difference (delta) and sd to work out the sample size 'n' that will give us the power we want.
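power.t.test() solves for whichever single argument is left out. A sketch (the delta = 2, sd = 3.5 values mirror the call above; n = 30 in the second call is just an illustrative choice):

```r
# Leave out exactly one of n, delta, sd, power and power.t.test() solves for it
n.needed <- power.t.test(delta = 2, sd = 3.5, power = 0.8)$n
n.needed                     # sample size per group (about 49, so round up to 50)

min.delta <- power.t.test(n = 30, sd = 3.5, power = 0.8)$delta
min.delta                    # smallest difference detectable with 30 per group
```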

Analysis of Variance (ANOVA)
·      ANOVA is the technique we use when all the explanatory variables are CATEGORICAL.
·      The explanatory variables are called factors; each factor has two or more levels.
·      When there is a single factor with 3 or more levels, we use one-way ANOVA.
·      If we had a single factor with just 2 levels, we would use Student’s t-test or similar.
·      Where there are 2 or more factors, we use two-way or three-way ANOVA, depending on the number of explanatory variables.
·      Where there is replication at each combination of levels in a multi-way ANOVA, the experiment is called a factorial design; this allows us to study interactions between variables, in which we test whether the response to one factor depends on the level of another factor.

One-way ANOVA
          The idea of ANOVA is to compare two or more means, but it does this by comparing variances.
          One-way ANOVA is used when you have a continuous response variable and only one predictor variable, which is categorical and has more than two levels.
Example:
          We have an experiment in which crop yields per unit area were measured from 10 randomly selected fields on each of three soil types. All fields were sown with the same variety of seed and provided with the same fertilizer and pest-control inputs. The question is whether soil type significantly affects crop yield, and if so, to what extent.


result.df<-read.table('E:/R Exercise/R class/yields.txt', header=T)
result.df
attach(result.df)
sapply(list(sand,clay,loam),mean)   # to calculate means
sapply(result.df,mean) # just to compare with the command above (mean() no longer works on a whole data frame)
y<-c(sand,clay,loam) # combining the three columns into a single vector 'y'
y # checking 'y'
soil<-factor(rep(1:3,c(10,10,10))) # constructing a vector 'soil' assigning values 1 to 3 repeating 10 times for each
soil # checking 'soil'
sapply(list(sand,clay,loam),var) # checking consistency of variance before going to ANOVA


The variances differ by more than a factor of 2. But is this significant?
It can be tested for heteroscedasticity using the Fligner-Killeen test of homogeneity of variance.
fligner.test(y~soil) # Fligner-Killeen test
The difference is not significant. Here we have a categorical explanatory variable with three levels (soil types: sand, clay, loam).
The homogeneity of variance can also be checked with a box plot:
boxplot(result.df,names=c('sand','clay','loam'), ylab='Yield')


The data above need to be restructured into long format (one 'yield' column and one 'type' column) before running ANOVA:
result1.df<-read.table('E:/R Exercise/R class/yields1.txt', header=T)
result1.df
attach(result1.df)
barplot(tapply(yield,list(type), mean)) # bar plot for mean values of each 'type' levels
plot(result1.df) # box plot
tapply(yield,list(type),mean) # checking means
An alternative way to calculate the above three means:
mean(yield [type=='sand'])
mean(yield [type=='clay'])
mean(yield [type=='loam'])
model.aov<-aov(yield~type)         # performing Analysis of Variance
summary(model.aov)
model.aov
Here the p-value is significant: the alternative hypothesis is accepted, and there is a significant difference in yield between soil types.
Since there is a significant effect of 'type', we need to test which soil types differ from each other.
install.packages('multcomp')
library(multcomp)                 # loading function library
mc <- glht(model.aov, linfct = mcp(type = "Tukey")) # General Linear Hypotheses
linfct -- a specification of the linear hypotheses to be tested
type -- the name of the specific procedure
summary(mc)
Compare the result with the box plot:
plot(result1.df)          # box plot: see that the interquartile ranges barely overlap, suggesting significant differences


Two-way ANOVA
Two-way ANOVA is used when you have a continuous response variable and two categorical predictor variables. To test for a possible interaction between the two predictors, you need replication at each combination of their levels.
daphnia.df<-read.table('E:/R Exercise/R class/daphnia.txt', header=T)
daphnia.df[1:5,]            # checking first 5 row to see the data types and heading names
attach(daphnia.df)
sapply(list(Water),levels)  # checking levels of variable
sapply(list(Detergent),levels)   # checking levels of variable
sapply(list(Daphnia),levels)           # checking levels of variable
names(daphnia.df)
boxplot (Growth.rate~Detergent) # making a simple plot

twoway.aov<- aov(Growth.rate~Water+Detergent+Daphnia) # Performing the analysis without interaction between variables
summary(twoway.aov) 
model1<- aov(Growth.rate~Water*Detergent*Daphnia) # Performing the analysis: '*' give interaction
summary(model1)
summary.lm(twoway.aov)   # viewing models in lm instead of ANOVA
summary.lm(model1) # viewing model1 in lm result instead ANOVA



Multivariate Analysis of Variance (MANOVA)
MANOVA is used when 2 or more response variables are measured.
You can either analyse each response separately (the conventional way) or all together.
There are occasions where you want to treat the group of response variables as one multivariate response.
The function for this is manova().
data.df<-read.table('E:/R Exercise/R class/manova.txt', header=T)
data.df[1:5,] # check the variables and types
attach(data.df)
In this data set, there are 3 response and 2 predictor variables
y<-cbind(tear,gloss,opacity) # binding the three responses into a single matrix 'y'
y
sapply(list(rate),levels)          # checking levels of variable
sapply(list(additive),levels)
model<-manova(y~rate*additive)
summary (model)                  # overall test result
summary.aov(model)          # give individual variable result
See the difference between the separate analyses (ANOVA) and the combined one (MANOVA).

Analysis of Covariance (ANCOVA)
          It combines elements from regression and ANOVA.
          The response variable is continuous (as in the cases above), and there is at least one continuous explanatory variable and at least one categorical explanatory variable.
          In the analysis, either lm (linear regression) or aov (analysis of variance) can be used; the choice affects only the format of the summary, not the result.
          Let's work with an example concerning an experiment on the impact of grazing on the seed production of a biennial plant.
40 plants were allocated to 2 treatments: grazed and ungrazed.
read.table('E:/R Exercise/r class/ipomopsis.txt', header=T)->regrowth.df
attach(regrowth.df)
names(regrowth.df)
regrowth.df[1:5,]
sapply(list(Grazing),levels)

Here 'Fruit' is response variable and 'Root' & 'Grazing' are explanatory variables
x11()                                 # create new plot window, without replacing previous one
plot(Root[Grazing=='Grazed'],Fruit[Grazing=='Grazed'], pch=17, col= 'blue', xlim=c(4,12),ylim=c(10,140))
par(new=T)      # overlay the next plot on the current one without erasing it
plot(Root[Grazing=='Ungrazed'],Fruit[Grazing=='Ungrazed'], pch=16, col='red', xlim=c(4,12),ylim=c(10,140),xlab='',ylab='')

Let's use logical subscripts to draw linear regression lines for the two grazing treatments separately, using 'abline':
abline(lm(Fruit [Grazing=='Grazed']~Root[Grazing=='Grazed']), lty=2, col='blue')
abline(lm(Fruit [Grazing=='Ungrazed']~Root[Grazing=='Ungrazed']), lty=1,lwd=2, col='red')

In the analysis, we fit the most complicated model first, then simplify it by removing non-significant terms until the minimal adequate model is obtained.
ancova<-lm(Fruit~Grazing*Root)
summary(ancova)                   #summary for linear model
anova(ancova)
ancova2<-lm(Fruit~Grazing+Root)
summary(ancova2)
Lets check which one is better
anova(ancova,ancova2)
Our null hypothesis is that the two models are equivalent: ancova = ancova2.
Check the result: p = 0.75, so the null hypothesis is accepted and there is no difference between the models, i.e. there is no significant effect of the interaction term (Grazing:Root).
In the second model we have removed the interaction term.
So we choose the simpler model here (simpler is better).
Now our model is
ancova2<-lm(Fruit~Grazing+Root)
summary(ancova2)
Now let's check whether 'Grazing' really has any effect on fruit production, by removing it from the model:
ancova3<-lm(Fruit~Root)
summary(ancova3)
As above, let's check whether there is a significant difference between 'ancova2' and 'ancova3' (the simpler model):
anova (ancova2,ancova3)
The p-value says the two models are significantly different, so we should not choose 'ancova3': removing 'Grazing' loses real explanatory power, so it should stay in the model.
Now our final simplest model is the 'ancova2' i.e. lm(Fruit~Grazing+Root)
Now our result is like this
ancova.test<-lm(Fruit~Grazing+Root)
summary (ancova.test)
There is a significant effect of the grazing treatment on fruit production, and a significant effect of root size on fruit production.
Remember that the order of variables matters in ANCOVA (R uses sequential sums of squares): lm(Fruit~Grazing+Root) is not equal to lm(Fruit~Root+Grazing).
There is no single rule about the order, but a common practice is to put first the variable you are most concerned with.
summary.aov(lm(Fruit~Grazing*Root))
summary.aov(lm(Fruit~Root*Grazing))
