Regression
Linear regression
Used when both the response and explanatory variables are continuous (e.g. height, weight, volume, temperature, DBH, etc.)
We will cover
Linear regression
The linear model is: y = a + bx
Multiple regression: Yi = β0 + β1X1i + β2X2i + εi
Linear regression
It is written as 'lm' (linear model) in R
The response variable comes first, then the tilde (' ~ '), then the explanatory variable
bio.df<- read.table ('E:/R Exercise/r class/age.txt', header=T)
attach(bio.df)
bio.df [1:5,]
reg.lm <- lm (weight ~ age) # Response~Predictor
Here the linear model is saved in the object 'reg.lm', so that a summary of the result can be checked and the model can be used for other purposes. If we do not save it in an R object, it will be difficult to do these things.
reg.lm # check the lm result
plot(age,weight)
abline (reg.lm) # fitting linear model in plot
summary(reg.lm) # checking the summary of the linear model
summary(lm(weight~age))[[8]] # gives the R-squared value
The function lm(y ~ x) fits a linear model of 'y' as a function of 'x'
summary(reg.lm)[[8]] # to check the R-sq value
## see the equation " y = a + bx "
reg.lm # to check only the coefficients
Now the maximum likelihood equation (of the form y = a + bx) is:
Weight = 44.6831 + 0.7033*age
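As a quick check, the fitted equation can be reproduced by hand from the coefficients and compared with predict(); the value age = 20 below is just an illustrative number, not from the exercise data.
coef(reg.lm)[1] + coef(reg.lm)[2]*20 # intercept + slope*age, computed by hand for age = 20
predict(reg.lm, list(age = 20)) # the same value from predict()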
One more example
tannin.df<-read.table('E:/R Exercise/r class/tannin.txt', header=T)
attach(tannin.df)
tannin.df
fit.lm <-lm(growth ~ tannin)
fit.lm
summary(fit.lm)
summary(fit.lm)[[8]] # checking R Sq value
plot(tannin, growth)
abline(fit.lm)
x11() # create blank plot window
par(mfrow=c(2,3)) # creating partitions for 6 plots
plot(fit.lm) # plotting the regression property plots; this will plot only four plots
plot(fit.lm, which=1:6) # this will plot all 6 property plots
Plot interpretation, reading from top left, top right, bottom left to bottom right (the first four are the most important):
1. Residuals vs Fitted: it is bad if the scatter increases as the fitted values get bigger (from left to right). The plot should look like stars in the night sky, i.e. no pattern at all.
2. Normal Q-Q plot: it should be a straight line. If the pattern is S-shaped or banana-shaped, a different model needs to be fitted.
3. Scale-Location plot: a straight line is good; a triangular shape is not.
4. Cook's distance: shows the three most influential points and depicts graphically how strongly they influence the fit.
5. Residuals vs Leverage: shows which points are most influential with reference to the Cook's distance contours.
coef(fit.lm)
summary.lm (fit.lm)
To predict 'Y' for a given 'X'
predict (fit.lm, list(tannin=5.5)) # single prediction at a time
predict (fit.lm, list(tannin=c(3,5.5,10,12.1))) # multiple predictions at a time
Multiple Linear Regression
- A multiple regression is a statistical model with two or more continuous explanatory variables.
- It requires that the model should be as simple as possible.
- We achieve this by fitting a maximal model and then simplifying it by following one or more of these steps (a short sketch with update() follows this list):
-- Remove non-significant interaction terms.
-- Remove non-significant quadratic or other non-linear terms.
-- Remove non-significant explanatory variables.
-- Group together factor levels that do not differ from one another.
-- Amalgamate explanatory variables that have similar parameter values.
-- Set non-significant slopes to zero within ANCOVA.
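A minimal sketch of this simplification cycle using update(), with made-up data and hypothetical variable names (x1, x2) that are not from the exercise files:
set.seed(1)
x1 <- rnorm(40); x2 <- rnorm(40)
y <- 1 + 2*x1 + rnorm(40) # in these made-up data only x1 really matters
full.lm <- lm(y ~ x1*x2 + I(x1^2)) # maximal model: interaction plus quadratic term
m2.lm <- update(full.lm, . ~ . - x1:x2) # remove the non-significant interaction term
m3.lm <- update(m2.lm, . ~ . - I(x1^2)) # remove the non-significant quadratic term
anova(m3.lm, full.lm) # check that the deletions have not significantly worsened the fit
summary(m3.lm) # the simplified (minimal adequate) model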
data.df<-read.table('E:/R Exercise/r class/age.txt', header=T)
head(data.df) # to view top few lines of dataframe
attach(data.df)
fit.lm <- lm(weight ~ age + height)
summary(fit.lm) # show results
Other useful functions for regression models
coefficients(fit.lm) # check model coefficients, same thing as 'coef(fit.lm)'
confint(fit.lm, level=0.95) # Confidence Intervals for model parameters
fitted(fit.lm) # predicted or estimated values 'Y' (here weight)
residuals(fit.lm) # residuals (difference between estimated and observed 'Y' (here weight))
anova(fit.lm) # anova test result of regression model
Note: Analysis of variance in regression: SSY =SSR+SSE
The idea is simple: we take the total variation in y, SSY, and partition it into components that tell us about the explanatory power of our model. The variation that is explained by the model is called the regression sum of squares (denoted by SSR), and the unexplained variation is called the error sum of squares (denoted by SSE). Then SSY =SSR+SSE.
vcov(fit.lm) # covariance matrix for model parameters
influence(fit.lm) # regression diagnostics
?influence
predict(fit.lm, data.frame(age=21,height=58)) # predict 'Y' (here weight) when X1 (age=21), X2 (height=58)
summary(fit.lm)[[8]]
Note: when talking about the degree of scatter we actually quantify the lack of scatter, so the graph on the left, with a perfect fit (zero scatter) gets a value of 1, and the graph on the right, which shows no relationship at all between y and x (100% scatter), gets a value of 0. This quantity used to measure the lack of scatter is officially called the coefficient of determination, but everybody refers to it as ‘r squared’. This is an important definition that you should try to memorize: r2 is the fraction of the total variation in y that is explained by variation in x.
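A minimal sketch verifying SSY = SSR + SSE and this definition of r squared, using the multiple regression fit.lm (weight ~ age + height) fitted above, with 'weight' available from the attached data.df:
ssy <- sum((weight - mean(weight))^2) # total sum of squares SSY
sse <- sum(residuals(fit.lm)^2) # error sum of squares SSE (unexplained)
ssr <- sum((fitted(fit.lm) - mean(weight))^2) # regression sum of squares SSR (explained)
ssy; ssr + sse # the two should agree
ssr/ssy # r squared = the fraction of variation in y explained by the model
summary(fit.lm)$r.squared # the same value reported by summary()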
Write the maximum likelihood equation for the above analysis:
Weight = 16.5648316 + 0.5249316*age + 0.5120971*height
Identifying Relation Types
We can check the type of relation in a regression, such as a unimodal response, a linear response, or no response at all. The following steps illustrate which kind of relation is significant. Let's make all three kinds of models: the null model, the linear model, and the unimodal model, starting with the 'jan' predictor. This method is also called the "forward selection method".
ranna.df <- read.table ('E:/R Exercise/R class/ranuncrich_csv.csv', header = T, sep =',')
attach(ranna.df)
ranna.df[1:5,]
plot(jan,ranspno)
abline(lm(ranspno~1),col=3)
abline(lm(ranspno~jan),col=2,lty=2,lwd=2)
fit.ord<-lm(ranspno~jan+I(jan^2))
ord<-order(jan)
lines(jan[ord],fitted(fit.ord)[ord],lwd=3)
fit0.lm <- lm (ranspno~1) # null model, in the absence of any explanatory variables
summary(fit0.lm)
fit1.lm <- lm (ranspno~jan) # linear model or I order polynomial model (for jan)
summary(fit1.lm)
fit2.lm <- lm (ranspno~poly(jan,2)) # unimodal, II order polynomial model (for jan)
summary(fit2.lm)
anova (fit0.lm, fit1.lm, test ='F') # find which model is better
anova (fit1.lm, fit2.lm, test ='F') # find which model is better
# the linear model is better than the null model, and the II order polynomial is not better than the linear model, so accept the linear model
anova (fit0.lm, fit1.lm, test ='F') # test statistics
# Result: there is a significant influence of 'jan' temperature on 'ranspno'
### Try for following variables
# for 'July' temp
# for 'ppt' rainfall
# for 'range' elevation range
# for 'resum' productivity
# for 'lime' limestone or pH
fitl0.lm <- lm (ranspno~1) # null model
fitl1.lm <- lm (ranspno~lime) # linear model or I order
# categorical data or P/A data will not have a II order polynomial
anova (fitl0.lm, fitl1.lm, test = 'F')
anova (fitl1.lm)
Let's check out the "Backward selection method"
When there are many predictor variables, it can be hard to determine which one has the greater influence on the response. In such cases this method is used: all the variables are given in the linear model and the least influential ones are removed in later steps.
ranna.df <- read.table ('E:/R Exercise/R class/ranuncrich_csv.csv', header = T, sep =',')
attach(ranna.df)
ranna.df[1:5,]
fitt.lm <- lm (ranspno~(jan+july+ppt+range+resum)*lime)
step (fitt.lm)
Here, the lower the AIC value, the better the model; variables whose removal lowers the AIC should be removed. At every step, step() removes one parameter and checks the AIC value, so it is a kind of automated procedure. At the end, when the AIC no longer drops, it stops. The model it stops at is our starting model, on which we can check the significance of all the parameters and remove the insignificant ones to prepare the final best model, as in the later steps.
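As a quick illustration of what step() does at each stage, drop1() reports the AIC that would result from removing each term one at a time (a sketch, assuming fitt.lm has been fitted as above):
drop1(fitt.lm) # AIC after dropping each term; deletions that lower the AIC are worth making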
final.lm<-lm(ranspno ~ jan + july + ppt + range + resum + lime + ppt:lime + resum:lime)
summary(final.lm)
final2.lm<-lm(ranspno ~ jan + july + ppt + range + lime + ppt:lime + resum:lime)
summary(final2.lm)
final3.lm<-lm(ranspno ~ jan + july + ppt + ppt:lime)
summary(final3.lm)
Now our final best model for the 'ranspno' response variable is the 'final3.lm' model.
II order model, Unimodal
Here we will try to see if the variable has a unimodal relation.
tron.df <- read.table ('E:/R Exercise/R class/tron.csv', header = T, sep =',')
head(tron.df)
attach (tron.df)
names (tron.df)
tfit.lm<- lm (richness~altitude)
plot (altitude, richness)
anova (tfit.lm) # checking if the model is OK
abline (tfit.lm) # fitting the first-order linear regression line in the plot; it does not represent the data well
# the scatter plot shows a unimodal distribution, so
plot (altitude, richness)
fit2.lm<-lm(richness~altitude+I(altitude^2)) # create a II order polynomial regression (lm)
ord1 <-order(altitude) # create an ordering of the altitude values (needed to draw the II order line)
lines (altitude[ord1], fitted (fit2.lm)[ord1],lwd=3) # draw the II order polynomial line in the scatter plot
anova (fit2.lm)
anova (tfit.lm, fit2.lm) # compare between models
# Result shows that the II order model is better, with significant values for the II order term and the altitude predictor
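The same comparison can also be sketched with AIC, where the smaller value indicates the better trade-off between fit and complexity (tfit.lm and fit2.lm as fitted above):
AIC(tfit.lm, fit2.lm) # compare the AIC of the I order and II order models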
Generalized Linear Model
A GLM is used when the variance is not constant, and/or when the errors are not normally distributed. Certain kinds of response variables invariably suffer from these two important contraventions of the standard assumptions, and GLMs are excellent at dealing with them. Specifically, we might consider using GLMs when the response variable is (see the sketch after this list):
- Count data expressed as proportions (e.g. logistic regression)
- Count data that are not proportions (e.g. log-linear models of counts)
- Binary response variables (e.g. dead or alive)
- Data on time to death where the variance increases faster than linearly with the mean (e.g. time data with gamma errors)
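A minimal sketch of the matching 'family' arguments in glm(), using small made-up data vectors (not from the exercise files):
dose <- 1:8 # hypothetical explanatory variable
y.prop <- cbind(success = c(1,2,3,5,6,7,7,8), failure = c(7,6,5,3,2,1,1,0))
glm(y.prop ~ dose, family = binomial) # proportion data (logistic regression)
y.count <- c(2, 3, 5, 4, 8, 9, 13, 12)
glm(y.count ~ dose, family = poisson) # count data (log-linear model)
y.bin <- c(0, 0, 1, 0, 1, 1, 1, 1)
glm(y.bin ~ dose, family = binomial) # binary response (e.g. dead or alive)
y.time <- c(9.1, 7.4, 6.2, 4.8, 3.9, 2.5, 2.2, 1.4)
glm(y.time ~ dose, family = Gamma) # time-to-death data with gamma errors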
-- Many statistical problems involve binary response variables, such as: dead or alive; present or absent; male or female; occupied or empty; literate or illiterate
-- Binary analysis will be a useful option when at least one of the explanatory variables is continuous
-- In order to carry out modelling on a binary response variable we take the following steps (a minimal sketch follows this list):
-- Create a single vector containing 0 and 1 as the response variable
-- Use glm with family = binomial
-- Consider changing the link function from the default logit to the complementary log-log
-- Fit the model in the usual way
-- Test significance by deletion of terms from the maximal model and compare the change in deviance with chi-squared
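A minimal sketch of these steps with a made-up 0/1 response and explanatory variable (names are illustrative only), including the complementary log-log link and a deviance (chi-squared) test:
occ <- c(0, 0, 1, 0, 1, 1, 0, 1) # hypothetical 0/1 response vector
dist <- c(8.2, 7.5, 3.1, 2.9, 2.2, 6.4, 1.9, 0.7) # hypothetical continuous explanatory variable
fit.logit <- glm(occ ~ dist, family = binomial) # default logit link
fit.clog <- glm(occ ~ dist, family = binomial(link = 'cloglog')) # complementary log-log link
summary(fit.clog)
anova(fit.logit, test = 'Chisq') # change in deviance from deleting 'dist', compared with chi-squared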
In the coming example, the response variable is incidence: 1 = occupied, 0 = not occupied. The explanatory variables are area and isolation distance.
island.df<-read.table('E:/R Exercise/R class/island.txt',header=T)
attach(island.df); island.df[1:5,]
island.df<-data.frame(incidence, area, isolation)
head(island.df) # extracting only required variables
As there are two explanatory variables, the appropriate analysis is multiple regression. The response is binary, so we shall do logistic regression with binomial errors. We can begin by fitting a complex model with an interaction term.
model1<-glm(incidence~area*isolation, binomial)
Now check the simpler model as well: fit a simpler model without the interaction of the explanatory variables.
model2<-glm(incidence~area+isolation, binomial)
Now compare the two models using ANOVA (chi-squared).
summary(model1)
summary(model2)
anova(model1, model2, test='Chi')
The simpler model is not significantly worse, so accept model2 and inspect the parameter estimates.
summary(model2)
Let's plot the fitted model through the scatter plot separately for each predictor.
model.area<-glm(incidence~area, binomial)
model.isol<-glm(incidence~isolation, binomial)
par(mfrow=c(1,2))
plot(area, incidence); plot(isolation, incidence)
Here in the plot, the 0 and 1 values sit at the bottom and top of the plot. We can fit a line between them for better presentation. Let's prepare for it.
xv<-seq(0,9,by=0.01) # creating a vector to fit a line from 0 to 9 with interval 0.01
The range is from 0 to 9 because the value of area ranges within it.
xv # check the vector
yv<-predict(model.area, list(area=xv),type='response')
This assigns to a vector the predicted values of the response when area takes the values in the 'xv' vector, for example:
predict(model.area, list(area=c(0.00, 0.01,0.02)),type='response')
head(yv)
plot (area, incidence)
lines(xv,yv)
xv2<-seq(0,10,0.1)
yv2<-predict(model.isol, list(isolation=xv2),type='response')
As above, the range of isolation goes from 0 to 10.
plot(isolation, incidence)
lines(xv2,yv2)
Generalized Additive Model (GAM) with Binary Data
In many cases, however, you have one or more continuous explanatory variables, but you have no a priori reason to choose one particular parametric form over another for describing the shape of the relationship between the response variable and the explanatory variable(s). Generalized additive models (GAMs) are useful in such cases because they allow you to capture the shape of a relationship between y and x without prejudging the issue by choosing a particular parametric form.
island.df<-read.table('F:/R class/island.txt',header=T)
attach(island.df); island.df[1:5,]
library(mgcv) # library for GAM
model3<-gam(incidence~s(area)+s(isolation), binomial) # 's' is the non-parametric smoothing function of GAM; if a variable is to be treated parametrically, there is no need to give 's'
summary(model3)
# this indicates a highly significant effect of isolation but no effect of area
# remember previous GLM test
par(mfrow=c(1,2))
plot.gam(model3, residuals=T, pch=16)
# the line is not rigid as in a linear model; it is called a 'spline'
# splines are like a rubber sheet stretched from all sides, which bends more towards the points pulling with more force
# this plot suggests that area has a good effect
# let's assess the significance of area by deletion and compare the models
model4<-gam(incidence~s(isolation), binomial)
anova(model3,model4, test='Chi')
# the two models are significantly different, i.e. not similar, so the simpler model (model4) cannot be accepted; keep model3
# now an alternative is to fit area as a parametric term and isolation as a smoothed term
model5<-gam(incidence~area+s(isolation), binomial)
summary(model5)
# find that area and isolation both have significant effects on incidence
# Hence final model is 'model5'
Mixed Effect Models
Linear Random Effect Model
It was Eisenhart (1947) who realized that there were actually two fundamentally different sorts of categorical explanatory variables: he called these fixed effects and random effects. It will take a good deal of practice before you are confident in deciding whether a particular categorical explanatory variable should be treated as a fixed effect or a random effect, but in essence:
- Fixed effects influence only the mean of y;
- Random effects influence only the variance of y.
Fixed effects are unknown constants to be estimated from the data. Random effects govern the variance–covariance structure of the response variable (see p. 473). Nesting (or hierarchical structure) of random effects is a classic source of pseudoreplication, so it is important that you are able to recognize it and hence not fall into its trap. Random effects that come from the same group will be correlated, and this contravenes one of the fundamental assumptions of standard statistical models: independence of errors.
Random effects occur in two contrasting kinds of circumstances:
- observational studies with hierarchical structure;
- designed experiments with different spatial or temporal scales.
Fixed effects have informative factor levels, while random effects often have uninformative factor levels. The distinction is best seen by an example. In most mammal species the categorical variable sex has two levels: male and female. For any individual that you find, the knowledge that it is, say, female conveys a great deal of information about the individual, and this information draws on experience gleaned from many other individuals that were female. A female will have a whole set of attributes (associated with her being female) no matter what population that individual was drawn from. Take a different categorical variable like genotype. If we have two genotypes in a population we might label them A and B. If we take two more genotypes from a different population we might label them A and B as well. In a case like this, the label A does not convey any information at all about the genotype, other than that it is probably different from genotype B. In the case of sex, the factor level (male or female) is informative: sex is a fixed effect. In the case of genotype, the factor level (A or B) is uninformative: genotype is a random effect.
Random effects have factor levels that are drawn from a large (potentially very large) population in which the individuals differ in many ways, but we do not know exactly how or why they differ. To get a feel for the difference between fixed effects and random effects, consider examples like the sex and genotype cases above.
The important point is that because the random effects come from a large population, there is not much point in concentrating on estimating means of our small subset of factor levels, and no point at all in comparing individual pairs of means for different factor levels. Much better to recognize them for what they are, random samples from a much larger population, and to concentrate on their variance. This is the added variation caused by differences between the levels of the random effects. Variance components analysis is all about estimating the size of this variance, and working out its percentage contribution to the overall variation. There are five fundamental assumptions of linear mixed-effects models:
- Within-group errors are independent with mean zero and variance σ².
- Within-group errors are independent of the random effects.
- The random effects are normally distributed with mean zero and covariance matrix Ψ.
- The random effects are independent in different groups.
- The covariance matrix does not depend on the group.
The validity of these assumptions needs to be tested by employing a series of plotting methods involving the residuals, the fitted values and the predicted random effects (a minimal sketch follows the list below). The tricks with mixed-effects models are:
- learning which variables are random effects;
- specifying the fixed and random effects in two model formulae;
- getting the nesting structure of the random effects right;
- remembering to load library(lme4) or library(nlme) at the outset.
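A minimal sketch of such assumption checks with nlme's plotting methods, written here with a placeholder fitted object 'fit.lme' (see the worked examples below for real fits):
# plot(fit.lme) # standardized residuals vs fitted values (within-group errors)
# qqnorm(fit.lme, ~resid(.)) # normality of the within-group errors
# qqnorm(fit.lme, ~ranef(.)) # normality of the predicted random effects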
The issues fall into two broad categories: questions about experimental design and the management of experimental error (e.g. where does most of the variation occur, and where would increased replication be most profitable?); and questions about hierarchical structure, and the relative magnitude of variation at different levels within the hierarchy (e.g. studies on the genetics of individuals within families, families within parishes, and parishes within counties, to discover the relative importance of genetic and phenotypic variation).
Most ANOVA models are based on the assumption that there is a single error term. But in hierarchical studies and nested experiments, where the data are gathered at two or more different spatial scales, there is a different error variance for each different spatial scale. There are two reasonably clear-cut sets of circumstances where your first choice would be to use a linear mixed-effects model: you want to do variance components analysis because all your explanatory variables are categorical random effects and you don't have any fixed effects; or you do have fixed effects, but you also have pseudoreplication of one sort or another (e.g. temporal pseudoreplication resulting from repeated measurements on the same individuals; see p. 645). To test whether one should use a model with mixed effects or just a plain old linear model, Douglas Bates wrote in the R help archive: 'I would recommend the likelihood ratio test against a linear model fit by lm. The p-value returned from this test will be conservative because you are testing on the boundary of the parameter space.'
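A minimal sketch of that likelihood ratio test, using small simulated data (illustrative only); the mixed model must be fitted with method = 'ML' so that its likelihood is comparable with the lm() fit:
library(nlme)
set.seed(1)
grp <- factor(rep(1:6, each = 5)) # simulated grouping factor (the random effect)
x <- rnorm(30)
y <- 2 + 0.5*x + rep(rnorm(6), each = 5) + rnorm(30) # simulated response with a group-level effect
sim.lm <- lm(y ~ x) # plain linear model
sim.lme <- lme(y ~ x, random = ~1|grp, method = 'ML') # mixed model, ML for comparability
lrt <- 2*(as.numeric(logLik(sim.lme)) - as.numeric(logLik(sim.lm)))
pchisq(lrt, df = 1, lower.tail = FALSE) # conservative p-value (testing on the boundary)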
vac.df <- read.table ('F:/R class/vac.csv', header = T, sep = ',')
vac.df[18:28,]
attach (vac.df)
names(vac.df)
## The vaccination and tank are predictor variables and the growth is the response variable
## Vaccination is a fixed effect variable as it is controlled by the experimenter
## while tank is a random effect variable (it is not the variable of primary interest but still has influence)
sapply (vac.df, data.class)
# To check variable type (whether categorical or numerical)
levels(vacc)
# Here tank is numerical; it should be a factor as it is just a tank label
vac.df$tank<- as.factor (vac.df$tank)
# tank is numerical but should be categorical, so change it with the above command
sapply (vac.df, data.class)
attach (vac.df) # after conversion of variable type, attach is must
head(vac.df)
Syntax for random and fixed effects
library (nlme) # load the library (add-on package) before giving the command (must)
fit.lme <- lme (growth~vacc, random = ~1|tank) # lme = linear mixed effect (regression model)
fit.lme
anova (fit.lme)
summary(fit.lme)
# Vaccination has a good effect on growth irrespective of the tanks
One more example
yields.df<-read.table("E:/R Exercise/R class/splityield.txt",header=T)
attach(yields.df)
names(yields.df)
head(yields.df)
sapply(yields.df,data.class)
levels(block)
levels(irrigation)
levels(density)
levels(fertilizer)
library(nlme)
model.lme<-lme(yield~irrigation*density*fertilizer,random=~1|block/irrigation/density)
summary(model.lme)
# This output suggests that the only significant effects are the main effect of irrigation (p = 0.0318) and the irrigation by density interaction (p = 0.0057). The three-way interaction is not significant, so we remove it, fitting all terms up to two-way interactions:
model.lme2<-lme(yield~(irrigation+density+fertilizer)^2, random=~1|block/irrigation/density)
summary(model.lme2)
# The fertilizer by density interaction is not significant, so we remove it:
model.lme3<- lme(yield~irrigation*density+irrigation*fertilizer, random=~1|block/irrigation/density)
summary(model.lme3)
# The moral is that you must do the model simplification to get the appropriate p values. Remember, too, that if you want to use anova to compare mixed models with different fixed-effects structures, then you must use maximum likelihood (method = "ML") rather than the default restricted maximum likelihood (REML). Here is the analysis again, but this time using anova to compare models with progressively simplified fixed effects:
model.lme<-lme(yield~irrigation*density*fertilizer,random=~1|block/irrigation/density,method="ML")
model.lme.2<-update(model.lme,~. - irrigation:density:fertilizer) # this is equivalent to fitting yield ~ irrigation+density+fertilizer+irrigation:density+irrigation:fertilizer+density:fertilizer
anova(model.lme, model.lme.2)
summary(model.lme.2)
model.lme.3<-update(model.lme.2,~. - density:fertilizer)
anova(model.lme.3,model.lme.2)
model.lme.4<-update(model.lme.3,~. - irrigation:fertilizer)
anova(model.lme.3,model.lme.4)
model.lme.5<-update(model.lme.2,~. - irrigation:density)
anova(model.lme.5,model.lme.2)
# here the best model is 'model.lme.3'
summary(model.lme.3)
# Model-checking plots show that the residuals are well behaved:
plot(model.lme.3)
# The response variable is a reasonably linear function of the fitted values:
plot(model.lme.3,yield~fitted(.))
# and the errors are reasonably close to normally distributed in all four blocks:
qqnorm(model.lme.3,~ resid(.)| block)
Comparing LM and GLM
hum.df<-read.table('F:/R class/humlesnurr.csv', header = T, sep=',')
attach(hum.df)
head(hum.df)
plot (food,eggs) # scatter plot
fit1.lm<-lm(eggs~food) # linear model
abline(fit1.lm) # fitting the linear model in the scatter plot
summary (fit1.lm)
fit0.lm<- lm(eggs~1) # null model
anova(fit0.lm,fit1.lm, test ='F') # comparing the linear and null models
par (mfrow=c(2,2)) # preparing window to plot
plot (fit1.lm,panel=panel.smooth) # plot the lm
fit1.glm<-glm(eggs~food,poisson(link='log')) # glm model
summary(fit1.glm)
# or
fit1.glm<-glm(eggs~food,poisson) # glm model
plot (food,eggs) # scatter plot
ord1<-order(food) # preparing for polynomial model
points(food[ord1],fitted(fit1.glm)[ord1],type='l')
par(mfrow=c(1,4))
plot (fit1.glm, panel=panel.smooth)
fit.llm<- lm(log(eggs)~food)
plot(food,eggs)
abline(fit1.lm, lty=3,lwd=3)
abline(fit1.lm) # linear model (to compare)
points(food[ord1],fitted(fit1.glm)[ord1],type='l', col='red')
points(food[ord1],exp(fitted(fit.llm))[ord1],type='l', col='blue')
pairs(island.df)
Hierarchical Partitioning
Collinearity problems can be effectively alleviated using an analytical method named hierarchical partitioning (HP hereafter; [3]). HP reduces collinearity problems by determining the independent contribution of each explanatory variable to the response variable and separating it from the joint contribution that results from correlation with other variables.
For models with 9 or fewer variables, the nominal ranking of importance of the explanatory variables did not change with different entering orders in the models.
HP should be applied with caution when more than 9 explanatory variables are used to rank covariate importance. The explained variance is not a useful parameter to use in models with more than 9 independent variables.
E.g. data: linear regression of log(electrical conductivity) in streams against seven independent variables describing catchment characteristics (from Hatt et al. 2004).
library(hier.part) # load the package first; the urbanwq example data come with it
data(urbanwq)
env <- urbanwq[,2:8]
png(filename="hier.part_%03d_large.png", width=1000, height=800) # start the graphics device; by default the resulting plot files are saved in the working directory (e.g. the My Documents folder)
hier.part(urbanwq$lec, env, fam = "gaussian", gof = "Rsqu")
dev.off()