This article covers:

  • What is R
  • Tidying Data
  • Charting

The next article covers Linear Regression

What is R

R (r-project.org) is a programming language for statistical computing and graphics.

R is an implementation of S combined with lexical scoping semantics inspired by Scheme.

  • 1976 - S was created
  • 1991 - work began at the University of Auckland on an alternative implementation of S
  • 1993 - R had its first release - named after its 2 authors, Ross Ihaka and Robert Gentleman, and as a play on S

wikipedia.org on R_(programming_language)

Why use R / Who uses R

  • Biologists and other scientists - analyse experimental data
  • Data Scientists - wrangle data

I’ve noticed that people who already know Python, C# or another high-level language tend to use that for the wrangling.

Pandas is a common Python library for wrangling.

For storage, people who are comfortable with SQL often keep the data in Postgres or MySQL and then chart with R. This means you can use SQL to get the data out of the database in the shape you want.

  • Data Manipulation / Data Wrangling / Data Munging - transforming raw data into another format with the intent of making it more appropriate for analysis (dplyr package in Tidyverse, pronounced dee-plier)
  • Tidying data, ie reshaping it into a consistent layout (tidyr package in Tidyverse)
  • Visualising / making charts to publish papers in journals (ggplot2 in Tidyverse)

Alternatives to R include Python (or any high-level language, but Python is very popular for data science), Excel, SPSS and MATLAB.

Use R because it is

  • Scriptable
  • Free
  • Powerful
  • Popular

What does it do

  • Data wrangling
  • Data visualisation

R implements a wide variety of statistical and graphical techniques:

  • linear and non-linear modelling
  • stats tests
  • time series analysis
  • classification
  • clustering

What is Tidyverse

Tidyverse is an opinionated collection of R packages.

R and RStudio

Download the latest version of R from r-project.org - currently on 4.0.3 on 6th Nov 2020

Download RStudio - 1.3.1093 on 6th Nov 2020.

Tools, Global Options

I prefer to set my default directory to c:\r so that, when working on different machines, nothing is shared between them except the raw R files and projects, which will be in Git. The default user directory for me was linked to my OneDrive.

Whilst here these are my preferences:

  • General - working folder as c:\r
  • Code, Soft wrap R files tick
  • Code, vim keybindings
  • Panel layout, Console in top right
  • Packages, change CRAN mirror to UK (London)
  • Appearance, Editor Theme, Pastel on Dark
  • Appearance, Editor font, Consolas

My preferred RStudio setup

R Packages

I change the .libPaths() folder to c:\r\library\

Update this setting in C:\Program Files\R\R-4.0.3\etc\Rprofile.site as

# my custom library path
.libPaths("C:/r/library")

By default the library path is under ~, which on my Windows machine is a synced OneDrive folder. R creates a folder called R there and installs libraries into it: around 370MB of libraries in roughly 30,000 separate files.

# display the path
.libPaths()

install.packages("installr")
# the package is called installr (not installer)
library(installr)

If you get an error:

“WARNING: Rtools is required to build R packages but is not currently installed. Please download and install the appropriate version of Rtools before proceeding:”

I solved this by installing RTools:

RTools

Download RTools if you get errors saying you need to install it (shown above)

# .Renviron in Documents folders
PATH="${RTOOLS40_HOME}\usr\bin;${PATH}"

Then install tidyverse using the dropdown: Tools, Install Packages, Tidyverse or do the install.packages command below

# install on local machine
install.packages("tidyverse")

# bring in the tidyverse libraries
library(tidyverse)

# or just bring in a single package, eg dplyr
library(dplyr)

okay so we are good to go

RStudio Keyboard shortcuts

ctrl enter - run - This is by far my most used keyboard shortcut

ctrl shift c - comment / uncomment

ctrl shift m - pipe %>% aka magrittr

alt - (alt and minus) - assignment <-
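To show what the pipe and assignment shortcuts above actually insert, here is a minimal sketch. The altitudes vector is made up for illustration, and magrittr is loaded directly only to keep the example small (loading dplyr or tidyverse also makes %>% available).

```r
library(magrittr)  # provides %>%

# assignment with <-
altitudes <- c(120, 296, 85, 410)  # hypothetical values

# nested call: has to be read from the inside out
nested <- round(mean(sqrt(altitudes)), 1)

# piped call: read left to right, each result becomes the
# first argument of the next function
piped <- altitudes %>%
  sqrt() %>%
  mean() %>%
  round(1)

# both forms compute the same thing
identical(nested, piped)
```

The piped form tends to be easier to read once there are more than two or three steps, which is why it shows up in most dplyr code.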

R Useful code

# clear R of all objects
rm(list=ls())

# install package on local machine
install.packages("tidyverse")

# bring in tidyverse library
library(tidyverse)

# load in csv into a new dataframe using tidyverse's readr package
df_stuffcount <- read_csv("StuffCount.csv")

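# Section 1 ####
# (a comment ending in four or more # marks a foldable code section in RStudio)

# useful to print every row of the df rather than a truncated tibble preview
print.data.frame(df_stuffcount)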

Pluralsight

Data Science with R by Matthew Renze

Tidyverse: R Playbook

The goal (of data science) is to turn data into information, and information into insight

Carly Fiorina - former CEO of HP

Reading data

readr

eg read_csv(), and read_log() for web log files

There are also lots of other sources, including Postgres, httr (web APIs) and rvest (web scraping); however, as a developer I’m going to stick to

alt text

This is useful to open up a Windows Explorer window so can copy the file into the correct place.

alt text

which lets me see the csv it is importing (like in Excel CSV import), then it generates the code to do it using readr.

library(readr)
data <- read_csv("data.csv")
View(data)

Wrangling data

Data Manipulation / Data Wrangling / Data Munging - transforming raw data into another format with the intent of making it more appropriate for analysis.

dplyr pronounced - Dee plier

dplyr cheatsheet here

verbs

  • filter - where
  • arrange - sort / order by
  • mutate - new column

more

  • group_by - group by
  • summarise - aggregate the grouped data (usually comes after a group_by)
  • join - eg left_join, inner_join
  • arrange(desc(Year)) - sort descending
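The verbs above can be sketched on a small data frame. The films and studios tables and all their column names are invented for illustration; the SQL equivalents in the comments are only rough analogies.

```r
library(dplyr)

# hypothetical data, made up for this example
films <- data.frame(
  Title  = c("A", "B", "C", "D", "E"),
  Year   = c(1999, 1999, 2005, 2005, 2010),
  Rating = c(7.1, 8.3, 6.4, 9.0, 5.5),
  Votes  = c(100, 250, 80, 400, 60)
)
studios <- data.frame(
  Title  = c("A", "B", "C", "D"),
  Studio = c("North", "North", "South", "South")
)

# filter ~ WHERE, mutate ~ computed column, arrange ~ ORDER BY
good_films <- films %>%
  filter(Rating >= 6) %>%
  mutate(WeightedScore = Rating * Votes) %>%
  arrange(desc(Year))

# group_by + summarise ~ GROUP BY with aggregate functions
by_year <- films %>%
  group_by(Year) %>%
  summarise(
    n_films     = n(),
    mean_rating = mean(Rating)
  )

# left_join ~ LEFT JOIN: film E has no studio row, so Studio is NA
with_studio <- films %>%
  left_join(studios, by = "Title")

good_films
by_year
with_studio
```

Each verb takes a data frame and returns a new one, so nothing here modifies films in place.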

Tidying data

tidyr
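As a rough illustration of what tidying with tidyr can look like, the sketch below reshapes a wide table (one column per year) into a long one (one row per observation). The data and column names are hypothetical, and pivot_longer / pivot_wider assume tidyr 1.0.0 or later.

```r
library(tidyr)

# hypothetical wide data: one column per year
wide <- data.frame(
  Site   = c("North", "South"),
  `2019` = c(10, 7),
  `2020` = c(12, 9),
  check.names = FALSE  # keep 2019 / 2020 as column names
)

# wide -> long: every column except Site becomes a (Year, Count) pair
long <- pivot_longer(
  wide,
  cols      = -Site,
  names_to  = "Year",
  values_to = "Count"
)

# long -> wide again, to show the reshaping is reversible
wide_again <- pivot_wider(long, names_from = Year, values_from = Count)

long
wide_again
```

The long layout is usually the one dplyr and ggplot2 work with most comfortably, since each variable lives in its own column.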

Data Visualisation

10 Simple rules for Better Figures

# clear R of all objects
rm(list=ls())


# package tidyverse readr
library(readr)
# library(tidyverse)

df_stuff <- read_csv("stuff.csv")

# x rows (shows the tibble - types too)
df_stuff

print.data.frame(df_stuff)

# select / mssql style data viewer
View(df_stuff)

# view a histogram of the vector (array of dbls)
# hist() is part of base R and not tidyverse
hist(df_stuff$TREATMENT)

R for a C# Application Builder

Logfile Analysis in R

Ideas to look into: log file analysis, server log analysis, a web scraping library?

I suspect the real benefit for people like me, who know a general-purpose language like C# as well as SQL, is that R makes good statistical analysis easy and can then chart the data.

Postgres

RPostgres is more up to date and has more GitHub stars than RPostgreSQL, and may be slightly faster.

DBI defines R’s interfaces to databases. RPostgres implements this spec.

Here is some sample code:

# Install the latest RPostgres release from CRAN:
# install.packages("RPostgres")

library(DBI)
library(tidyverse)

con <- dbConnect(RPostgres::Postgres(),
                 dbname = 'imdbr',
                 host = 'localhost',
                 port = 5432,
                 user = 'postgres',
                 password = 'letmein')

# show all db tables
dbListTables(con)

# get entire table
dbReadTable(con, "rating")

# send and fetch
res <- dbSendQuery(con, "SELECT * FROM rating limit 100")
dbFetch(res)
# free the result set once it has been fetched
dbClearResult(res)

# does send and fetch together - handy
df_ratings <- dbGetQuery(con, "SELECT * FROM rating limit 100")
df_ratings

summary(df_ratings)

hist(df_ratings$average_rating)

# close the connection when finished
dbDisconnect(con)

Analysing data

It’s very important to understand the raw data and what it actually means.

  • Excel to view raw data, then export to csv

  • read_csv import - does it work, and are the types it infers okay, eg chr, dbl

  • summary(dataframe) to find the max, min and types

  • View(dataframe) and sorting - move the view to a different quadrant to see the max / min / obvious errors, eg NA (null) values too

  • Histogram of each variable to check for outliers and distribution (does it make sense)
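The checks above could be strung together roughly as follows. This is only a sketch: the CSV content, column names and the 2960 outlier are made up (echoing the altitude typo discussed in this article), and the file is written to a temporary path so the example is self-contained.

```r
library(readr)

# hypothetical CSV written to a temp file for the example
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "SITE,TREATMENT,ALTITUDE",
  "a,1,120",
  "b,2,296",
  "c,3,2960",
  "d,1,NA"
), csv_path)

# state the expected types explicitly rather than relying on inference
df_check <- read_csv(
  csv_path,
  col_types = cols(
    SITE      = col_character(),
    TREATMENT = col_integer(),
    ALTITUDE  = col_double()
  )
)

# min, max, quartiles and NA count per column
summary(df_check)

# number of missing values in each column
colSums(is.na(df_check))

# histogram of a numeric variable to spot outliers such as 2960
hist(df_check$ALTITUDE)
```

Spelling out col_types is optional, but it turns a silently wrong guess (for example a number column read as chr) into something you notice straight away.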

Correcting data errors

I would usually do this in Excel or a higher-level language, especially for whitespace, nulls and spurious unexpected characters.

# find the error
# it is row 94 that has ALTITUDE 2960 instead of 296
TLD %>% 
  # this just puts in a row number
  rownames_to_column() %>% 
  filter(ALTITUDE > 1500)

# 195 rows
summary(TLD)
TLD

# can now fix with indexing
TLD[94,5] <- 296

# or more functional using tidyverse
TLD <- TLD %>% 
  mutate(ALTITUDE = if_else(ALTITUDE == 2960, 296, ALTITUDE))

# OR
TLD$ALTITUDE <- recode(TLD$ALTITUDE, `2960` = 296)

Transforming data

The raw data (and, more importantly, the residuals of a model fitted to it) may be skewed, so we can transform it to bring it closer to a normal (bell-shaped) distribution.

We want a normal distribution so that we can run standard types of analysis on it.
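A log transform is one common choice for right-skewed data, though not the only one. The sketch below uses simulated log-normal values (not real measurements) and a simple mean-versus-median gap as a rough indicator of skew; both are assumptions made for illustration.

```r
set.seed(42)  # make the simulated data reproducible

# hypothetical right-skewed measurements
raw <- rlnorm(1000, meanlog = 0, sdlog = 1)

# log transform pulls in the long right tail
transformed <- log(raw)

# in a symmetric distribution the mean and median sit close together,
# so a smaller gap suggests less skew
gap_raw <- abs(mean(raw) - median(raw))
gap_log <- abs(mean(transformed) - median(transformed))
gap_log < gap_raw

# compare the two histograms side by side
old_par <- par(mfrow = c(1, 2))
hist(raw, main = "Raw")
hist(transformed, main = "log(raw)")
par(old_par)
```

In practice you would look at the residuals of the fitted model rather than the raw variable, but the same before-and-after comparison applies.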

Ggplot

R cookbook for Graphs

Type of Charts

Visualise the data

  • Histogram (used to show distribution of variables eg Altitude)

Very useful for spotting mistakes in the data, eg an altitude of more than 1300m in the UK

  • Bar charts (used to compare variables)
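Both chart types can be sketched in ggplot2 as below. The df_sites data is simulated, with one deliberately wrong altitude of 2960 planted in it so the histogram has something to reveal; the column names are hypothetical.

```r
library(ggplot2)

set.seed(1)  # reproducible simulated data

# hypothetical data: 3 treatments, 20 sites each, last altitude is a typo-style outlier
df_sites <- data.frame(
  TREATMENT = factor(rep(1:3, each = 20)),
  ALTITUDE  = c(runif(59, min = 50, max = 600), 2960)
)

# histogram: distribution of one variable, the outlier shows as a lone bar on the right
p_hist <- ggplot(df_sites, aes(x = ALTITUDE)) +
  geom_histogram(bins = 30) +
  labs(title = "Distribution of altitude", x = "Altitude (m)", y = "Count")

# bar chart: compare mean altitude across treatments
means <- aggregate(ALTITUDE ~ TREATMENT, data = df_sites, FUN = mean)
p_bar <- ggplot(means, aes(x = TREATMENT, y = ALTITUDE)) +
  geom_col() +
  labs(title = "Mean altitude by treatment", x = "Treatment", y = "Mean altitude (m)")

print(p_hist)
print(p_bar)
```

Note how the single bad value also inflates the mean for treatment 3 in the bar chart, which is a good reason to check the histogram first.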

Terms

  • R - language
  • RStudio - IDE

  • base R - no use of Tidyverse

  • Tidyverse
    • dplyr - for wrangling data
    • tidyr - tidying data
    • ggplot2 - charting
  • Data Structures
    • Data frame - columns can be different types, eg like a table
    • Vector - 1d array
    • Matrix - 2d array
    • Array - n-dimensional array

For experimentation we have fixed factors (eg experiment type) and measurements

  • Variable - a measurement
  • Factor / Fixed Factor of the experiment, eg a Treatment can be 1, 2 or 3 only

  • Zero inflation - an excess of 0 values in the data

  • Gaussian (normal) distribution of data - bell curve
  • Right-skewed data - most of the data sits on the left of the histogram, with a long tail stretching to the right
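The data structure and factor terms above map onto base R roughly like this. All values and names are invented for illustration.

```r
# Vector: 1d, every element has the same type
altitude <- c(120, 296, 85)

# Matrix: 2d, every element has the same type
m <- matrix(1:6, nrow = 2, ncol = 3)

# Array: n dimensions (here 3)
a <- array(1:24, dim = c(2, 3, 4))

# Factor: a variable restricted to a fixed set of levels,
# eg a Treatment that can only be 1, 2 or 3
treatment <- factor(c(1, 3, 2), levels = c(1, 2, 3))

# Data frame: a table whose columns can have different types
df <- data.frame(
  SITE      = c("a", "b", "c"),
  TREATMENT = treatment,
  ALTITUDE  = altitude
)

# str() shows the type of each column
str(df)
```

Storing a treatment as a factor rather than a plain number tells modelling and plotting functions to treat it as a category, not as a quantity.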

Tutorials

Introduction to Spatial Data in R