Chapter 9 Supplementary materials
9.1 Beginner R
- Basic R functionality
- Reading in data
- Tidy R philosophy
- Manipulating data with core
tidyr
functions
9.1.1 Basic R functionality
- Variables
- Functions
- Operators
9.1.2 Variables
The way R stores your information will determine the kinds of functions/operators you can use
- Data: dataframes
- Values: lists, vectors, matrices, etc.
# Variables can be numbers, strings, etc.
<- 200
some_val <- TRUE
other_val
# Variables can also be lists of numbers, strings, etc.
<- c("a", "b", "c")
a_list
# We can also make dataframes (which are essentially special lists)
<- tibble(column_1 = 1:3,
a_dataframe column_2 = a_list)
9.1.3 Functions
Functions take a certain number and certain types of “arguments”
base
R functions: part of downloading R- Packages: need to be installed, then loaded
install.packages(package)
once (or to update package)library(package)
every R session
Use Help window or ?package
to check argument names, types, and defaults
- Named arguments without values are required (and user-defined)
- Named arguments with values show defaults
- Elipses mean that you can add other optional arguments
# file: required
# header = TRUE: default
# ...: other potential arguments, like stringsAsFactors = FALSE
read.csv(file, header = TRUE, sep = ",", quote = "\"",
dec = ".", fill = TRUE, comment.char = "", ...)
9.1.4 base
R
# base R function
# Count the number of rows in this dataframe
nrow(a_dataframe)
## [1] 3
# If I give nrow() too many arguments, it will throw an error
nrow(a_dataframe, a_list)
# If I give it the wrong kind of argument, it will just return NULL
# Some functions won't run at all with wrong kind of argument
nrow(a_list)
## NULL
9.1.5 Packages
# scholar package
library(scholar)
# get_publications function
# Pull publications from Google Scholar for Marie Curie
get_publications("EmD_lTEAAAAJ&EmD_lTEAAAAJ&") %>%
::filter(cites > 30) %>%
dplyrdistinct(title, .keep_all = TRUE) %>%
select(author, title) %>%
head(2) %>%
kable()
author | title |
---|---|
E Curie | Madame Curie |
M Curie, S Mme | Curie |
9.1.6 Operators
- Relational: >, <, ==, !=, <=, >=
is.na
,exists
, etc. will return TRUE/FALSE valuesgrep
,filter
,str_detect
, etc. use TRUE/FALSE values
- Logical: !, &, &&, |, ||
- Arithmetic
9.1.7 Reading in data
General parameters for csv files
read.csv("file_name.csv",
header = TRUE,
stringsAsFactors = FALSE,
check.names = FALSE,
na.strings = "")
Avoid special characters (including spaces) in file names, directories, and column headers!
readxl
package for Excel spreadsheets- APIs for direct access to online data
qualtRics
package for Qualtrics dataggmap
package for Google services (geolocation data)
read_table
fromreadr
package for text files
9.1.8 Tidy R philosophy
- One variable per column
- One observation per row
9.1.9 Manipulating data with tidyr
%>%
: pass the results of one function on to anotherselect
: choose columns by namemutate
: add/change columnsfilter
: filter for (or out) rowsgroup_by
andsummarise
: perform operations on groups of datagather
andspread
pivot_longer
andpivot_wider
: condense multiple columns into one or the inverseseparate
andunite
: split a column into multiple or the inverse
9.1.10 Other helpful tidyr
functions
slice
: choose a rowpull
: choose a column- Helper functions for
select
(e.g.,contains
) join
family of functions: combine datasets based on a shared unique identifierunion
: combine datasets by rows (column names must be the same)replace_na
/drop_na
: alter/remove rows with NA values
9.1.11 Helpful base
R functions
rbind
andcbind
: add rows/columnsnrow
andncol
: count rows/columnsunique
: pull unique values- Indexing with
var$column
andvar[row, column]
which
with column/row indexing
9.1.12 Other packages and functions
tibble
package for dataframes withtibble
kableExtra
forkable
tablesggplot2
package for graphs (cheat sheet here)factor
for ordering text labels in graphs- Use
na.rm = TRUE
argument (e.g., inmean
) to remove NA values from calculations
9.2 Intermediate R
- Dealing with free-response text
- Dynamic variable creation and reference
- Helpful functions
- Other tips
9.2.1 Free-response text
- Regular expressions (cheat sheet here)
- Pattern matching
- Return position/value of elements that match a pattern:
grep
,agrep
- Return TRUE/FALSE:
grepl
,agrepl
,str_detect
- Change a pattern:
sub
,gsub
,replace
- Return position of pattern for all elements:
regexpr
,gregexpr
,regexec
- Return position/value of elements that match a pattern:
- Use
perl = TRUE
argument to handle especially complex patterns
# List of elements
<- c("apple", "banana", "pear", "pinapple")
fruit
# grep position
grep(pattern = "le", x = fruit)
## [1] 1 4
# grep value
grep(pattern = "le", x = fruit, value = TRUE)
## [1] "apple" "pinapple"
# agrep (can also specify value = TRUE)
agrep(pattern = "le", x = fruit, max.distance = 0.1)
## [1] 1 3 4
# regexpr
# match.length attribute gives starting position of match
# index.type attribute gives length of matched text
regexpr(pattern = "le", text = fruit)
## [1] 4 -1 -1 7
## attr(,"match.length")
## [1] 2 -1 -1 2
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE
9.2.2 Dynamic variables
# Variables
<- rep(1:3, 3)
vals <- "assign_example"
name
# Assign values to variable name
assign(name, vals)
# Use the variable as usual
assign_example
## [1] 1 2 3 1 2 3 1 2 3
# You can also get the new variable from the name
get(name)
## [1] 1 2 3 1 2 3 1 2 3
# You can add to this variable dynamically as well
assign(name, c(get(name), 4:6))
# New output
assign_example
## [1] 1 2 3 1 2 3 1 2 3 4 5 6
9.2.3 Helpful functions
apply
,lapply
,sapply
,tapply
source
variables from R scripts%notin%
and%in%
(compared to != and ==)
# Example using apply: go across columns of dataset
# and substitute characters
<- apply(language, 2,
language function(x) gsub("\\\\", "", x, fixed = TRUE))
# Source other sripts
source("data_cleaning.R", local = TRUE)
# Helper function
"%notin%" <- Negate("%in%")
# Example from processing pipeline for Qualtrics data
<- c("0","00","107")
unusable %>% dplyr::filter(Progress==100 & ID %notin% unusable) dat
9.2.4 Other tips
- Store
ggplot2
parameters in alist
list
vs.c
9.3 Graphing with ggplot
Now for some general tips on using ggplot
. I’ll go over the following topics:
- Formatting
- Layering plot objects
- Saving
ggplot
functions in lists - Color palettes
color
vs.fill
- Using the
theme
function
- Visualization
- Using
Rmisc
for error bars - Using
patchwork
to combine multiple plots
- Using
9.3.1 Formatting
Recall the scatter plot from Section 5.3.2. If you’re playing around with this code yourself, remember that we had to convert cyl
to a factor so that R would treat it as a discrete rather than a continuous variable.
# Output scatter plot
mtcars_scatter
A major aspect of ggplot
that you should be aware of is that the plots are built in sequence.
In the graph above (the code also appears below), ggplot(mtcars)
came first, which just set up a blank graph, then there’s geom_jitter
, which added the wt
and mpg
data and point colors based on the cyl
variable, and so on.
If I wanted the regression line from geom_smooth
to appear beneath the points rather than on top of them, I could’ve just reordered the function calls.
This will also help you troubleshoot if you have issues outputting graphs; just comment out individual plot objects, being careful not to leave any dangling plus signs, and rerun to find the problem.
This functionality also means that you can save several layers in a list to apply them easily to multiple graphs. This can save you time and effort when you want to have a consistent look across different graphs; just update a particular parameter in your list once, and then it’ll apply across your graphs.
Some functions, like labs
, support this layering very well by allowing multiple calls to the same function.
Others, like scale_y_continuous
, will override previous calls in the same graph (you’ll get a warning message when this happens).
You’ll notice that I have to put commas in between each function in the list rather than plus signs.
I also have to save these in a list
(rather than with c
).
# Create list of specifications
<- list(scale_y_continuous(expand = c(0,0), limits = c(0,6)),
format_list scale_x_continuous(expand = c(0,0), limits = c(10,35)),
scale_color_brewer(type = "qual", palette = "Paired"),
theme_classic(),
labs(y = "Vehicle weight (1000 lbs)",
x = "Miles per gallon",
color = "Cylinders"))
# Create scatter plot
# Move legend to the bottom
<- ggplot(mtcars) +
mtcars_scatter geom_jitter(aes(mpg, wt, color = cyl)) +
geom_smooth(formula = y ~ x, aes(mpg, wt), method = "lm", se = TRUE, level = 0.95,
fill = "#d7d8db", color = "black", size = 0.5) +
+
format_list theme(legend.position = "bottom")
# Output scatter plot with new legend position
mtcars_scatter
You’ll want to be familiar with the color brewer palettes that I used here.
The website I’ve linked allows you to play with the different options and choose ones with particular characteristics, like color blind-friendly palettes.
You can use the scale_color/fill_brewer
functions with ggplot
to apply these color palettes.
Relatedly, there’s a difference between setting fill
and color
options in ggplot
.
Generally, scatter plots get color
settings, while bar plots get fill
settings.
If I’d set fill = cyl
in geom_jitter
, I would’ve gotten a legend with the three cylinder types, but the colors of the points wouldn’t have changed.
Similarly, if I’d used scale_fill_brewer
rather than scale_color_brewer
, the colors of the points would still have been the R defaults (or you would get an error, depending on the data type).
Lastly on formatting, you’ll want to become familiar with the theme
function as your graphs increase in complexity.
The description for this function says it best: “Themes are a powerful way to customize the non-data components of your plots: i.e. titles, labels, fonts, background, gridlines, and legends.”
I typically layer theme_classic
with several theme
specifications.
Search ?theme
in your R console or Help window for the full list of parameters.
9.3.2 Visualization
I often need to make bar graphs to present data.
I also tend to want to include error bars on these bar graphs.
There are different error bars you can calculate, but I tend to have error bars that represent 95% confident intervals around my means.
You’ll first need to install the Rmisc
package to do this, then transform your data into a long format.
I also turned the categorical variables that I cared about into factors.
This allows you to control the exact label text and order in your graph.
Then, I used the summarySE
function to create a table with the means and confidence intervals.
# Create long-form data for creating error table
# Mutate columns to create factors with particular labels
# This is harder to change once you're at ggplot stage
<- mtcars %>%
mtcars_long select(mpg, wt, cyl) %>%
pivot_longer(cols = -cyl, names_to = "variables", values_to = "measures") %>%
mutate(cyl = factor(cyl, levels = c(4, 6, 8),
labels = c("4-cylinder", "6-cylinder", "8-cylinder")),
variables = factor(variables, levels = c("mpg", "wt"),
labels = c("Miles per gallon", "Vehicle weight (1000 lbs)")))
# Make error table
<- summarySE(mtcars_long,
mtcars_error measurevar = "measures",
groupvars = c("cyl", "variables"))
# Output error table
mtcars_error
## cyl variables N measures sd se ci
## 1 4-cylinder Miles per gallon 11 26.663636 4.5098277 1.3597642 3.0297434
## 2 4-cylinder Vehicle weight (1000 lbs) 11 2.285727 0.5695637 0.1717299 0.3826381
## 3 6-cylinder Miles per gallon 7 19.742857 1.4535670 0.5493967 1.3443253
## 4 6-cylinder Vehicle weight (1000 lbs) 7 3.117143 0.3563455 0.1346860 0.3295647
## 5 8-cylinder Miles per gallon 14 15.100000 2.5600481 0.6842016 1.4781278
## 6 8-cylinder Vehicle weight (1000 lbs) 14 3.999214 0.7594047 0.2029595 0.4384672
Now I can make a bar graph with this error table.
My main ggplot
object uses the cyl
column and plots the means, while the geom_errorbar
object uses the confidence interval (ci
) column to create the error bars.
You’ll also notice that I changed the width of the error bars and their position, so that they were centered over the bars.
# Create bar graph
<- ggplot(mtcars_error, aes(cyl, measures, fill = cyl)) +
mtcars_bar geom_bar(position="dodge", stat="identity") +
geom_errorbar(aes(ymin = measures - ci, ymax = measures + ci),
width = 0.1, position = position_dodge(0.9)) +
coord_flip() +
scale_y_continuous(expand = expansion(mult = c(0, .2))) +
scale_fill_brewer(type = "qual", palette = "Paired") +
theme_classic() +
facet_wrap(~ variables, scales = "free_x") +
labs(y = "Measures",
x = "Cylinders",
fill = "Cylinders") +
theme(axis.ticks.y = element_blank(),
legend.position = "none",
panel.spacing.x = unit(1.5, "lines"),
axis.text.y = element_text(angle = 90, hjust = 0.5))
# Output bar graph
mtcars_bar
Sometimes, you’ll want to output multiple graphs with a particular layout.
The patchwork
package allows you to do this easily.
The syntax is very similar to ggplot
syntax, in that you add plots together with a plus sign.
You can also add a plot_layout
object to change aspects of the plots, like their widths.
+ mtcars_scatter + plot_layout(widths = c(2, 1)) mtcars_bar
9.4 Practice materials
You can find practice materials for R Markdown and R here. You’ll want to clone this repository or download the following documents/folders:
- brownbag_activity.Rmd
- data_cleaning.R
- apa.csl
- /data