Regression
Linear regression
Used when both the response and explanatory variables are continuous (e.g. height, weight, volume, temperature, DBH, etc.)
We will cover
Linear regression
The linear model is: y = a + bx
Multiple regression: Yi = β0 + β1X1i + β2X2i + εi
Linear regression
It is written as 'lm' (linear model) in R
The response variable comes first, then the tilde (' ~ '), then the explanatory variable
bio.df<- read.table ('E:/R Exercise/r class/age.txt', header=T)
attach(bio.df)
bio.df [1:5,]
reg.lm <- lm (weight ~ age) # Response~Predictor
Here the linear model is saved in the object 'reg.lm', so that a summary of the result can be checked and the model can be used for other purposes. If we do not save it in an R object, it will be difficult to do these things.
reg.lm # check the lm result
plot(age,weight)
abline (reg.lm) # fitting linear model in plot
summary(reg.lm) # checking the summary of the linear model
summary(lm(weight~age))[[8]] # gives the R-squared value
The function lm(y ~ x) fits a linear model of 'y' as a function of 'x'
summary(reg.lm)[[8]] # to check the R-sq value
## see the equation " y = a + bx "
reg.lm # to check only the coefficients
Now the maximum likelihood equation (of the form y = a + bx) is:
Weight = 44.6831 + 0.7033*age
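As a quick check, the fitted equation can be reproduced by hand from the coefficients and compared with predict(); the value age = 20 below is just an illustrative number, not from the exercise data.
coef(reg.lm)[1] + coef(reg.lm)[2]*20 # intercept + slope*age, computed by hand for age = 20
predict(reg.lm, list(age = 20)) # the same value from predict()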
One more example
tannin.df<-read.table('E:/R Exercise/r class/tannin.txt', header=T)
attach(tannin.df)
tannin.df
fit.lm <-lm(growth ~ tannin)
fit.lm
summary(fit.lm)
summary(fit.lm)[[8]] # checking R Sq value
plot(tannin, growth)
abline(fit.lm)
x11() # create blank plot window
par(mfrow=c(2,3)) # creating partitions for 6 plots
plot(fit.lm) # plotting the regression property plots; this will plot only four plots
plot(fit.lm, which=1:6) # this will plot all 6 property plots
Plot interpretation, reading from top left, top right, bottom left to bottom right (the first four are the most important):
1. Residuals vs Fitted: it is bad if the scatter increases as the fitted values get bigger (from left to right). The plot should look like stars in the night sky, i.e. no pattern at all.
2. Normal Q-Q plot: it should be a straight line. If the pattern is S-shaped or banana-shaped, a different model needs to be fitted.
3. Scale-Location plot: a straight line is good; a triangular shape is not.
4. Cook's distance: shows the three most influential points and depicts graphically how strongly they influence the fit.
5. Residuals vs Leverage: shows which points are most influential with reference to the Cook's distance contours.
coef(fit.lm)
summary.lm (fit.lm)
To predict 'Y' for a given 'X'
predict (fit.lm, list(tannin=5.5)) # single prediction at a time
predict (fit.lm, list(tannin=c(3,5.5,10,12.1))) # multiple predictions at a time
Multiple Linear Regression
- A multiple regression is a statistical model with two or more continuous explanatory variables.
- It requires that the model should be as simple as possible.
- We achieve this by fitting a maximal model and then simplifying it by following one or more of these steps (a short sketch with update() follows this list):
-- Remove non-significant interaction terms.
-- Remove non-significant quadratic or other non-linear terms.
-- Remove non-significant explanatory variables.
-- Group together factor levels that do not differ from one another.
-- Amalgamate explanatory variables that have similar parameter values.
-- Set non-significant slopes to zero within ANCOVA.
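A minimal sketch of this simplification cycle using update(), with made-up data and hypothetical variable names (x1, x2) that are not from the exercise files:
set.seed(1)
x1 <- rnorm(40); x2 <- rnorm(40)
y <- 1 + 2*x1 + rnorm(40) # in these made-up data only x1 really matters
full.lm <- lm(y ~ x1*x2 + I(x1^2)) # maximal model: interaction plus quadratic term
m2.lm <- update(full.lm, . ~ . - x1:x2) # remove the non-significant interaction term
m3.lm <- update(m2.lm, . ~ . - I(x1^2)) # remove the non-significant quadratic term
anova(m3.lm, full.lm) # check that the deletions have not significantly worsened the fit
summary(m3.lm) # the simplified (minimal adequate) model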
data.df<-read.table('E:/R Exercise/r class/age.txt', header=T)
head(data.df) # to view top few lines of dataframe
attach(data.df)
fit.lm <- lm(weight ~ age + height)
summary(fit.lm) # show results
Other useful functions for regression models
coefficients(fit.lm) # check model coefficients, same thing as 'coef(fit.lm)'
confint(fit.lm, level=0.95) # Confidence Intervals for model parameters
fitted(fit.lm) # predicted or estimated values 'Y' (here weight)
residuals(fit.lm) # residuals (difference between estimated and observed 'Y' (here weight))
anova(fit.lm) # anova test result of regression model
Note: Analysis of variance in regression: SSY =SSR+SSE
The idea is simple: we take the total variation in y, SSY, and partition it into components that tell us about the explanatory power of our model. The variation that is explained by the model is called the regression sum of squares (denoted by SSR), and the unexplained variation is called the error sum of squares (denoted by SSE). Then SSY =SSR+SSE.
vcov(fit.lm) # covariance matrix for model parameters
influence(fit.lm) # regression diagnostics
?influence
predict(fit.lm, data.frame(age=21,height=58)) # predict 'Y' (here weight) when X1 (age=21), X2 (height=58)
summary(fit.lm)[[8]]
Note: when talking about the degree of scatter we actually quantify the lack of scatter, so the graph on the left, with a perfect fit (zero scatter) gets a value of 1, and the graph on the right, which shows no relationship at all between y and x (100% scatter), gets a value of 0. This quantity used to measure the lack of scatter is officially called the coefficient of determination, but everybody refers to it as ‘r squared’. This is an important definition that you should try to memorize: r2 is the fraction of the total variation in y that is explained by variation in x.
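A minimal sketch verifying SSY = SSR + SSE and this definition of r squared, using the multiple regression fit.lm (weight ~ age + height) fitted above, with 'weight' available from the attached data.df:
ssy <- sum((weight - mean(weight))^2) # total sum of squares SSY
sse <- sum(residuals(fit.lm)^2) # error sum of squares SSE (unexplained)
ssr <- sum((fitted(fit.lm) - mean(weight))^2) # regression sum of squares SSR (explained)
ssy; ssr + sse # the two should agree
ssr/ssy # r squared = the fraction of variation in y explained by the model
summary(fit.lm)$r.squared # the same value reported by summary()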
Write the maximum likelihood equation for the above analysis:
Weight = 16.5648316 + 0.5249316*age + 0.5120971*height
Identifying Relation Types
We can check the type of relation in a regression, such as a unimodal response, a linear response, or no response at all. The following steps illustrate which kind of relation is significant. Let's make all three kinds of models: the null model, the linear model, and the unimodal model, starting with the 'jan' predictor. This method is also called the "forward selection method".
ranna.df <- read.table ('E:/R Exercise/R class/ranuncrich_csv.csv', header = T, sep =',')
attach(ranna.df)
ranna.df[1:5,]
plot(jan,ranspno)
abline(lm(ranspno~1),col=3)
abline(lm(ranspno~jan),col=2,lty=2,lwd=2)
fit.ord<-lm(ranspno~jan+I(jan^2))
ord<-order(jan)
lines(jan[ord],fitted(fit.ord)[ord],lwd=3)
fit0.lm <- lm (ranspno~1) # null model, in the absence of any explanatory variables
summary(fit0.lm)
fit1.lm <- lm (ranspno~jan) # linear model or I order polynomial model (for jan)
summary(fit1.lm)
fit2.lm <- lm (ranspno~poly(jan,2)) # unimodal, II order polynomial model (for jan)
summary(fit2.lm)
anova (fit0.lm, fit1.lm, test ='F') # find which model is better
anova (fit1.lm, fit2.lm, test ='F') # find which model is better
# the linear model is better than the null model, and the II order polynomial is not better than the linear model, so accept the linear model
anova (fit0.lm, fit1.lm, test ='F') # test statistics
# Result: there is a significant influence of 'jan' temperature on 'ranspno'
### Try for following variables
# for 'July' temp
# for 'ppt' rainfall
# for 'range' elevation range
# for 'resum' productivity
# for 'lime' limestone or pH
fitl0.lm <- lm (ranspno~1) # null model
fitl1.lm <- lm (ranspno~lime) # linear model or I order
# categorical data or P/A data will not have a II order polynomial
anova (fitl0.lm, fitl1.lm, test = 'F')
anova (fitl1.lm)
Let's check out the "Backward selection method"
When there are many predictor variables, it can be hard to determine which one has the greater influence on the response. In such cases this method is used: all the variables are given in the linear model and the least influential ones are removed in later steps.
ranna.df <- read.table ('E:/R Exercise/R class/ranuncrich_csv.csv', header = T, sep =',')
attach(ranna.df)
ranna.df[1:5,]
fitt.lm <- lm (ranspno~(jan+july+ppt+range+resum)*lime)
step (fitt.lm)
Here, the lower the AIC value, the better the model; variables whose removal lowers the AIC should be removed. At every step, step() removes one parameter and checks the AIC value, so it is a kind of automated procedure. At the end, when the AIC no longer drops, it stops. The model it stops at is our starting model, on which we can check the significance of all the parameters and remove the insignificant ones to prepare the final best model, as in the later steps.
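As a quick illustration of what step() does at each stage, drop1() reports the AIC that would result from removing each term one at a time (a sketch, assuming fitt.lm has been fitted as above):
drop1(fitt.lm) # AIC after dropping each term; deletions that lower the AIC are worth making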
final.lm<-lm(ranspno ~ jan + july + ppt + range + resum + lime + ppt:lime + resum:lime)
summary(final.lm)
final2.lm<-lm(ranspno ~ jan + july + ppt + range + lime + ppt:lime + resum:lime)
summary(final2.lm)
final3.lm<-lm(ranspno ~ jan + july + ppt + ppt:lime)
summary(final3.lm)
Now our final best model for the 'ranspno' response variable is the 'final3.lm' model.
II order model, Unimodal
Here we will try to see if the variable has a unimodal relation.
tron.df <- read.table ('E:/R Exercise/R class/tron.csv', header = T, sep =',')
head(tron.df)
attach (tron.df)
names (tron.df)
tfit.lm<- lm (richness~altitude)
plot (altitude, richness)
anova (tfit.lm) # checking if the model is OK
abline (tfit.lm) # fitting the first-order linear regression line in the plot; it does not represent the data well
# the scatter plot shows a unimodal distribution, so
plot (altitude, richness)
fit2.lm<-lm(richness~altitude+I(altitude^2)) # create a II order polynomial regression (lm)
ord1 <-order(altitude) # create an ordering of the altitude values (needed to draw the II order line)
lines (altitude[ord1], fitted (fit2.lm)[ord1],lwd=3) # draw the II order polynomial line in the scatter plot
anova (fit2.lm)
anova (tfit.lm, fit2.lm) # compare between models
# Result shows that the II order model is better, with significant values for the II order term and the altitude predictor
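The same comparison can also be sketched with AIC, where the smaller value indicates the better trade-off between fit and complexity (tfit.lm and fit2.lm as fitted above):
AIC(tfit.lm, fit2.lm) # compare the AIC of the I order and II order models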
Generalized Linear Model
A GLM is used when the variance is not constant, and/or when the errors are not normally distributed. Certain kinds of response variables invariably suffer from these two important contraventions of the standard assumptions, and GLMs are excellent at dealing with them. Specifically, we might consider using GLMs when the response variable is (see the sketch after this list):
- Count data expressed as proportions (e.g. logistic regression)
- Count data that are not proportions (e.g. log-linear models of counts)
- Binary response variables (e.g. dead or alive)
- Data on time to death where the variance increases faster than linearly with the mean (e.g. time data with gamma errors)
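A minimal sketch of the matching 'family' arguments in glm(), using small made-up data vectors (not from the exercise files):
dose <- 1:8 # hypothetical explanatory variable
y.prop <- cbind(success = c(1,2,3,5,6,7,7,8), failure = c(7,6,5,3,2,1,1,0))
glm(y.prop ~ dose, family = binomial) # proportion data (logistic regression)
y.count <- c(2, 3, 5, 4, 8, 9, 13, 12)
glm(y.count ~ dose, family = poisson) # count data (log-linear model)
y.bin <- c(0, 0, 1, 0, 1, 1, 1, 1)
glm(y.bin ~ dose, family = binomial) # binary response (e.g. dead or alive)
y.time <- c(9.1, 7.4, 6.2, 4.8, 3.9, 2.5, 2.2, 1.4)
glm(y.time ~ dose, family = Gamma) # time-to-death data with gamma errors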
-- Many statistical problems involve binary response variables, such as: dead or alive; present or absent; male or female; occupied or empty; literate or illiterate
-- Binary analysis will be a useful option when at least one of the explanatory variables is continuous
-- In order to carry out modelling on a binary response variable we take the following steps (a minimal sketch follows this list):
-- Create a single vector containing 0 and 1 as the response variable
-- Use glm with family = binomial
-- Consider changing the link function from the default logit to the complementary log-log
-- Fit the model in the usual way
-- Test significance by deletion of terms from the maximal model and compare the change in deviance with chi-squared
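A minimal sketch of these steps with a made-up 0/1 response and explanatory variable (names are illustrative only), including the complementary log-log link and a deviance (chi-squared) test:
occ <- c(0, 0, 1, 0, 1, 1, 0, 1) # hypothetical 0/1 response vector
dist <- c(8.2, 7.5, 3.1, 2.9, 2.2, 6.4, 1.9, 0.7) # hypothetical continuous explanatory variable
fit.logit <- glm(occ ~ dist, family = binomial) # default logit link
fit.clog <- glm(occ ~ dist, family = binomial(link = 'cloglog')) # complementary log-log link
summary(fit.clog)
anova(fit.logit, test = 'Chisq') # change in deviance from deleting 'dist', compared with chi-squared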
In the coming example, the response variable is incidence: 1 = occupied, 0 = not occupied. The explanatory variables are area and isolation distance.
island.df<-read.table('E:/R Exercise/R class/island.txt',header=T)
attach(island.df); island.df[1:5,]
island.df<-data.frame(incidence, area, isolation)
head(island.df) # extracting only required variables
As there are two explanatory variables, the appropriate analysis is multiple regression. The response is binary, so we shall do logistic regression with binomial errors. We can begin by fitting a complex model with an interaction term.
model1<-glm(incidence~area*isolation, binomial)
Now check the simpler model as well: fit a simpler model without the interaction of the explanatory variables.
model2<-glm(incidence~area+isolation, binomial)
Now compare the two models using ANOVA (chi-squared).
summary(model1)
summary(model2)
anova(model1, model2, test='Chi')
The simpler model is not significantly worse, so accept model2 and inspect the parameter estimates.
summary(model2)
Let's plot the fitted model through the scatter plot separately for each predictor.
model.area<-glm(incidence~area, binomial)
model.isol<-glm(incidence~isolation, binomial)
par(mfrow=c(1,2))
plot(area, incidence); plot(isolation, incidence)
Here in the plot, the 0 and 1 values sit at the bottom and top of the plot. We can fit a line between them for better presentation. Let's prepare for it.
xv<-seq(0,9,by=0.01) # creating a vector to fit a line from 0 to 9 with interval 0.01
The range is from 0 to 9 because the value of area ranges within it.
xv # check the vector
yv<-predict(model.area, list(area=xv),type='response')
This assigns to a vector the predicted values of the response when area takes the values in the 'xv' vector, for example:
predict(model.area, list(area=c(0.00, 0.01,0.02)),type='response')
head(yv)
plot (area, incidence)
lines(xv,yv)
xv2<-seq(0,10,0.1)
yv2<-predict(model.isol, list(isolation=xv2),type='response')
As above, the range of isolation goes from 0 to 10.
plot(isolation, incidence)
lines(xv2,yv2)
Generalized Additive Model (GAM) with Binary Data
In many cases, however, you have one or more continuous explanatory variables, but you have no a priori reason to choose one particular parametric form over another for describing the shape of the relationship between the response variable and the explanatory variable(s). Generalized additive models (GAMs) are useful in such cases because they allow you to capture the shape of a relationship between y and x without prejudging the issue by choosing a particular parametric form.
island.df<-read.table('F:/R class/island.txt',header=T)
attach(island.df); island.df[1:5,]
library(mgcv) # library for GAM
model3<-gam(incidence~s(area)+s(isolation), binomial) # 's' is the non-parametric smoothing function of GAM; if a variable is to be treated parametrically, there is no need to give 's'
summary(model3)
# this indicates a highly significant effect of isolation but no effect of area
# remember previous GLM test
par(mfrow=c(1,2))
plot.gam(model3, residuals=T, pch=16)
# the line is not rigid as in a linear model; it is called a 'spline'
# splines are like a rubber sheet stretched from all sides, which bends more towards the points pulling with more force
# this plot suggests that area has a good effect
# let's assess the significance of area by deletion and compare the models
model4<-gam(incidence~s(isolation), binomial)
anova(model3,model4, test='Chi')
# the two models are significantly different, i.e. not similar, so the simpler model (model4) cannot be accepted; keep model3
# now an alternative is to fit area as a parametric term and isolation as a smoothed term
model5<-gam(incidence~area+s(isolation), binomial)
summary(model5)
# find that area and isolation both have significant effects on incidence
# Hence final model is 'model5'
Mixed Effect Models
Linear Random Effect Model
It was Eisenhart (1947) who realized that there were actually two fundamentally different sorts of categorical explanatory variables: he called these fixed effects and random effects. It will take a good deal of practice before you are confident in deciding whether a particular categorical explanatory variable should be treated as a fixed effect or a random effect, but in essence:
- Fixed effects influence only the mean of y;
- Random effects influence only the variance of y.
Fixed effects are unknown constants to be estimated from the data. Random effects govern the variance–covariance structure of the response variable (see p. 473). Nesting (or hierarchical structure) of random effects is a classic source of pseudoreplication, so it is important that you are able to recognize it and hence not fall into its trap. Random effects that come from the same group will be correlated, and this contravenes one of the fundamental assumptions of standard statistical models: independence of errors.
Random effects occur in two contrasting kinds of circumstances:
- observational studies with hierarchical structure;
- designed experiments with different spatial or temporal scales.
Fixed effects have informative factor levels, while random effects often have uninformative factor levels. The distinction is best seen by an example. In most mammal species the categorical variable sex has two levels: male and female. For any individual that you find, the knowledge that it is, say, female conveys a great deal of information about the individual, and this information draws on experience gleaned from many other individuals that were female. A female will have a whole set of attributes (associated with her being female) no matter what population that individual was drawn from. Take a different categorical variable like genotype. If we have two genotypes in a population we might label them A and B. If we take two more genotypes from a different population we might label them A and B as well. In a case like this, the label A does not convey any information at all about the genotype, other than that it is probably different from genotype B. In the case of sex, the factor level (male or female) is informative: sex is a fixed effect. In the case of genotype, the factor level (A or B) is uninformative: genotype is a random effect.
Random effects have factor levels that are drawn from a large (potentially very large) population in which the individuals differ in many ways, but we do not know exactly how or why they differ. To get a feel for the difference between fixed effects and random effects, consider examples like the sex and genotype cases above.
The important point is that because the random effects come from a large population, there is not much point in concentrating on estimating means of our small subset of factor levels, and no point at all in comparing individual pairs of means for different factor levels. Much better to recognize them for what they are, random samples from a much larger population, and to concentrate on their variance. This is the added variation caused by differences between the levels of the random effects. Variance components analysis is all about estimating the size of this variance, and working out its percentage contribution to the overall variation. There are five fundamental assumptions of linear mixed-effects models:
- Within-group errors are independent with mean zero and variance σ².
- Within-group errors are independent of the random effects.
- The random effects are normally distributed with mean zero and covariance matrix Ψ.
- The random effects are independent in different groups.
- The covariance matrix does not depend on the group.
The validity of these assumptions needs to be tested by employing a series of plotting methods involving the residuals, the fitted values and the predicted random effects (a minimal sketch follows the list below). The tricks with mixed-effects models are:
- learning which variables are random effects;
- specifying the fixed and random effects in two model formulae;
- getting the nesting structure of the random effects right;
- remembering to load library(lme4) or library(nlme) at the outset.
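A minimal sketch of such assumption checks with nlme's plotting methods, written here with a placeholder fitted object 'fit.lme' (see the worked examples below for real fits):
# plot(fit.lme) # standardized residuals vs fitted values (within-group errors)
# qqnorm(fit.lme, ~resid(.)) # normality of the within-group errors
# qqnorm(fit.lme, ~ranef(.)) # normality of the predicted random effects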
The issues fall into two broad categories: questions about experimental design and the management of experimental error (e.g. where does most of the variation occur, and where would increased replication be most profitable?); and questions about hierarchical structure, and the relative magnitude of variation at different levels within the hierarchy (e.g. studies on the genetics of individuals within families, families within parishes, and parishes within counties, to discover the relative importance of genetic and phenotypic variation).
Most ANOVA models are based on the assumption that there is a single error term. But in hierarchical studies and nested experiments, where the data are gathered at two or more different spatial scales, there is a different error variance for each different spatial scale. There are two reasonably clear-cut sets of circumstances where your first choice would be to use a linear mixed-effects model: you want to do variance components analysis because all your explanatory variables are categorical random effects and you don't have any fixed effects; or you do have fixed effects, but you also have pseudoreplication of one sort or another (e.g. temporal pseudoreplication resulting from repeated measurements on the same individuals; see p. 645). To test whether one should use a model with mixed effects or just a plain old linear model, Douglas Bates wrote in the R help archive: 'I would recommend the likelihood ratio test against a linear model fit by lm. The p-value returned from this test will be conservative because you are testing on the boundary of the parameter space.'
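A minimal sketch of that likelihood ratio test, using small simulated data (illustrative only); the mixed model must be fitted with method = 'ML' so that its likelihood is comparable with the lm() fit:
library(nlme)
set.seed(1)
grp <- factor(rep(1:6, each = 5)) # simulated grouping factor (the random effect)
x <- rnorm(30)
y <- 2 + 0.5*x + rep(rnorm(6), each = 5) + rnorm(30) # simulated response with a group-level effect
sim.lm <- lm(y ~ x) # plain linear model
sim.lme <- lme(y ~ x, random = ~1|grp, method = 'ML') # mixed model, ML for comparability
lrt <- 2*(as.numeric(logLik(sim.lme)) - as.numeric(logLik(sim.lm)))
pchisq(lrt, df = 1, lower.tail = FALSE) # conservative p-value (testing on the boundary)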
vac.df <- read.table ('F:/R class/vac.csv', header = T, sep = ',')
vac.df[18:28,]
attach (vac.df)
names(vac.df)
## The vaccination and tank are predictor variables and the growth is the response variable
## Vaccination is a fixed effect variable as it is controlled by the experimenter
## while tank is a random effect variable (it is not the variable of primary interest but still has influence)
sapply (vac.df, data.class)
# To check variable type (whether categorical or numerical)
levels(vacc)
# Here tank is numerical; it should be a factor as it is just a tank label
vac.df$tank<- as.factor (vac.df$tank)
# tank is numerical but should be categorical, so change it with the above command
sapply (vac.df, data.class)
attach (vac.df) # after conversion of variable type, attach is must
head(vac.df)
Syntax for random and fixed effects
library (nlme) # load the library (add-on package) before giving the command (must)
fit.lme <- lme (growth~vacc, random = ~1|tank) # lme = linear mixed effect (regression model)
fit.lme
anova (fit.lme)
summary(fit.lme)
# Vaccination has a good effect on growth irrespective of the tanks
One more example
yields.df<-read.table("E:/R Exercise/R class/splityield.txt",header=T)
attach(yields.df)
names(yields.df)
head(yields.df)
sapply(yields.df,data.class)
levels(block)
levels(irrigation)
levels(density)
levels(fertilizer)
library(nlme)
model.lme<-lme(yield~irrigation*density*fertilizer,random=~1|block/irrigation/density)
summary(model.lme)
# This output suggests that the only significant effects are the main effect of irrigation (p = 0.0318) and the irrigation by density interaction (p = 0.0057). The three-way interaction is not significant, so we remove it, fitting all terms up to two-way interactions:
model.lme2<-lme(yield~(irrigation+density+fertilizer)^2, random=~1|block/irrigation/density)
summary(model.lme2)
# The fertilizer by density interaction is not significant, so we remove it:
model.lme3<- lme(yield~irrigation*density+irrigation*fertilizer, random=~1|block/irrigation/density)
summary(model.lme3)
# The moral is that you must do the model simplification to get the appropriate p values. Remember, too, that if you want to use anova to compare mixed models with different fixed-effects structures, then you must use maximum likelihood (method = "ML") rather than the default restricted maximum likelihood (REML). Here is the analysis again, but this time using anova to compare models with progressively simplified fixed effects:
model.lme<-lme(yield~irrigation*density*fertilizer,random=~1|block/irrigation/density,method="ML")
model.lme.2<-update(model.lme,~. - irrigation:density:fertilizer) # this is equivalent to fitting yield ~ irrigation+density+fertilizer+irrigation:density+irrigation:fertilizer+density:fertilizer
anova(model.lme, model.lme.2)
summary(model.lme.2)
model.lme.3<-update(model.lme.2,~. - density:fertilizer)
anova(model.lme.3,model.lme.2)
model.lme.4<-update(model.lme.3,~. - irrigation:fertilizer)
anova(model.lme.3,model.lme.4)
model.lme.5<-update(model.lme.2,~. - irrigation:density)
anova(model.lme.5,model.lme.2)
# here the best model is 'model.lme.3'
summary(model.lme.3)
# Model-checking plots show that the residuals are well behaved:
plot(model.lme.3)
# The response variable is a reasonably linear function of the fitted values:
plot(model.lme.3,yield~fitted(.))
# and the errors are reasonably close to normally distributed in all four blocks:
qqnorm(model.lme.3,~ resid(.)| block)
Comparing LM and GLM
hum.df<-read.table('F:/R class/humlesnurr.csv', header = T, sep=',')
attach(hum.df)
head(hum.df)
plot (food,eggs) # scatter plot
fit1.lm<-lm(eggs~food) # linear model
abline(fit1.lm) # fitting the linear model in the scatter plot
summary (fit1.lm)
fit0.lm<- lm(eggs~1) # null model
anova(fit0.lm,fit1.lm, test ='F') # comparing the linear and null models
par (mfrow=c(2,2)) # preparing window to plot
plot (fit1.lm,panel=panel.smooth) # plot the lm
fit1.glm<-glm(eggs~food,poisson(link='log')) # glm model
summary(fit1.glm)
# or
fit1.glm<-glm(eggs~food,poisson) # glm model
plot (food,eggs) # scatter plot
ord1<-order(food) # preparing for polynomial model
points(food[ord1],fitted(fit1.glm)[ord1],type='l')
par(mfrow=c(1,4))
plot (fit1.glm, panel=panel.smooth)
fit.llm<- lm(log(eggs)~food)
plot(food,eggs)
abline(fit1.lm, lty=3,lwd=3)
abline(fit1.lm) # linear model (to compare)
points(food[ord1],fitted(fit1.glm)[ord1],type='l', col='red')
points(food[ord1],exp(fitted(fit.llm))[ord1],type='l', col='blue')
pairs(island.df)
Hierarchical Partitioning
Collinearity problems can be effectively alleviated using an analytical method named hierarchical partitioning (HP hereafter; [3]). HP reduces collinearity problems by determining the independent contribution of each explanatory variable to the response variable and separating it from the joint contribution that results from correlation with other variables.
For models with 9 or fewer variables, the nominal ranking of importance of the explanatory variables did not change with different entering orders in the models.
HP should be applied with caution when more than 9 explanatory variables are used to rank covariate importance. The explained variance is not a useful parameter to use in models with more than 9 independent variables.
E.g. data: linear regression of log(electrical conductivity) in streams against seven independent variables describing catchment characteristics (from Hatt et al. 2004).
library(hier.part) # load the package first; the urbanwq example data come with it
data(urbanwq)
env <- urbanwq[,2:8]
png(filename="hier.part_%03d_large.png", width=1000, height=800) # start the graphics device; by default the resulting plot files are saved in the working directory (e.g. the My Documents folder)
hier.part(urbanwq$lec, env, fam = "gaussian", gof = "Rsqu")
dev.off()