Linear Regresssion with R
R and Tidyverse Beginners Guide is my previous article.
Essentially these are findings on a Statistics course I did which is geared towards biological scientists. The joy of finding things out and exploring something different..
These are my notes up to the point I got lost!
There may be wont be articles in the future here on (an area I don’t need to know about, but looks fascinating)
- Multiple Regression
- ANCOVA
- Generalised
- Mixed
Like any other discipline there are a lot of terms, which I’m going to try to define, and will go through slowly.
Univariate Linear Regression
Or linear regression, or simple linear regression, or model 1 regression
So a univariate regression but we are looking at a bivariate data (also called bivariate relationship)
Science is about making imprecise measurements and asking questions / making hypothesis. So we need a way to help us quantifyably make sense of these measurements.
“Regression is a statistical method .. that attempts to determine the strength and character of the relationship between one dependent variable (usually denoted by Y) and a series of other variables (known as independent variables). “
Simple linear regression uses one independent variable to explain (or predict) the outcome of the dependent variable.
By using a straight best-fit line
- Predictor / Independent variable - eg altitude (x axis), airspeed m/s
- Response / Dependent variable - eg temperature (y axis)… the variable we are trying to predict
It’s important that the predictor variables (measurements) are more accurate (contain less error) that the response. Because the most commonly used univariate regression aka model 1 regression assumes this.
eg What do we predict the temperature to be at sea level if we have some sample data betweem 600m and 1100m
Strategy
- Data exploring (plotting) and getting a feel for what the regressional values should be
- Model construction
- Model diagnostics
- Model evaluation
1. Data Exploring
Viewing the data manually to check quality control
Also histograms to see any any records are sharply outside the range of other entries.
library(tidyverse)
df_Park <- read_csv("stuff.csv")
df_Park
View(df_Park)
# historams checking for any outliers
# 11 measurements for each birdid
hist(df_Park$Bird_ID)
# how many times did 4-6 windspeed occur (6 times)
hist(df_Park$Airspeed)
# how may times did each band of amp occur eg 60-70 ( 3 times)
hist(df_Park$Wingbeat_Amp)
Lets do a scatter plot
of the data using ggplot2 from tidyverse.
We need to decide which is x and y by the questions we are trying to ask:
“how swallows respond to wind” not “how the wind is affected by a single swallow’s wing movements”
# scatter plot
# if it looks like a reasonably straight line, we can do a simple linear regression
df_Park %>%
filter(Bird_ID == 1) %>%
ggplot(aes(x = Airspeed, y = Wingbeat_Amp)) +
geom_point() +
theme_bw()
Residual
A univariate regresion seeks to establish a straight line relationship between predictor and reponse subject.
Minimize the unexplained variation residual variation
or residual distance
The best fit line is the one which minimises the sums of the squared deviations from the line (so that negative numbers are taken out of the running)
C / Intercept C value of 50 and slope m
Let’s do a sanity check regression straight line equation of the form y = mx + c
# y = mx + c
# we can roughly tell that the c / co-efficient / intercept should be 50
# ie when x=0, y=50
df_Park %>%
filter(Bird_ID == 1) %>%
ggplot(aes(x = Airspeed, y = Wingbeat_Amp)) +
geom_point() +
xlim(0, 14) +
ylim(40, 130) +
theme_bw()
From eyeballing we can say C is around 50.
How to get m, ie the slope?
what is the change in y over the whole range: 123 - 72 = 51 what is the change in x over the whole range: 14 - 4 = 10
gradient or slope coefficient should be 51/10 = 5.1
y = 5.1x + 51
so when x is 10, y is 102, which looks right
2. Model Construction
# lm is linear model
# response variable (the variable we are trying to predict eg windbeat amplitude) on lhs of ~
# predictors (something we can measure eg windspeed) on rhs of ~
lm_wba_1 <- lm(Wingbeat_Amp ~ Airspeed,
data = df_bird1)
lm_wba_1
naming of prefix lm, and suffix number as we’ll build many of these
3. Model diagnostics
Lets check to see if the model adheres to assumptions and to look for high influence observations.
- the predictor variables should be a measure with negigble error
- the variance in the response variable should be constant for all values of the predictor(s) - homogenity of variance.
- residuals must be independant (one observation cannot predict the otner)
- residuals should be normally distributed
# Gives coefficients of (Intercept) 47.988 ie the C value
# Airspeed of 5.051 ie the airspeed slope m value
lm_wba_1
View(lm_wba_1)
str(lm_wba_1)
par(mfrow = c(2, 2))
plot(lm_wba_1)
par(mfrow = c(1, 1))
1.residuals vs fits - The plot is used to detect non-linearity, unequal error variances, and outliers. funnel like? curvilinearity - are residuals generally positive, then negative, then positive? yes this is the case here. Maybe a non-linear function would be a better analysis.
2.quantile quantile qq normal plot are the residuals a normal distribution?
histogram will give something ie looking for outlier data too
hist(lm_wba_1$residuals)
3.scale location plot should show no pattern. Is assessing hetrogeneity of variance but uses a different transformation of residuals, which can show problems that 1. graph cannot.
4.cook’s distance plot
ADF Method - Augmented Dickey-Fuller test.. test the null hypothesis
4.Model Evaluation
summary(lm_wba_1)
results:
Call:
lm(formula = Wingbeat_Amp ~ Airspeed, data = df_bird1)
Residuals:
Min 1Q Median 3Q Max
-6.0427 -1.1243 0.2078 1.4225 4.5966
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 47.988 2.919 16.44 5.08e-08 ***
Airspeed 5.030 0.306 16.44 5.08e-08 ***
The Coeffieients look good:
y = mx + c
y = 5.030x + 47.988
y-intercept (c) - (Intercept) is in brackets - wingbeat amplitude
slope (m) - is listed as Airspeed - slope is associated with a predictor so will take its name (univariate regression)
The coefficients each have a standard errori (SE)
- t-value - test statistic for the null hypothesis of a zero coefficient value
- p-value associated with this null hypothesis
Our p-values are tiny, so we should reject the hypothesis that “no association between airspeed and wingbeat amplitude”
What is the confidence interval for the slope?
# confidence interval (CI)
confint(lm_wba_1)
F-statistic - this is a test of the whole regression, and is associated with the null hypothesis that the model does not predict the response.
ANOVA - don’t use this any more.
Conclusion
This is as far as I got!
Learning the terms and scientific strategies to analyse data essentially with a straight line fit was fascinating.