Chapter 9 Supplementary materials

9.1 Beginner R

  • Basic R functionality
  • Reading in data
  • Tidy R philosophy
  • Manipulating data with core tidyr functions

9.1.1 Basic R functionality

  • Variables
  • Functions
  • Operators

9.1.2 Variables

The way R stores your information will determine the kinds of functions/operators you can use

  • Data: dataframes
  • Values: lists, vectors, matrices, etc.
# Variables can be numbers, strings, etc.
some_val <- 200
other_val <- TRUE

# Variables can also be lists of numbers, strings, etc.
a_list <- c("a", "b", "c")

# We can also make dataframes (which are essentially special lists)
a_dataframe <- tibble(column_1 = 1:3,
                      column_2 = a_list)

9.1.3 Functions

Functions take a certain number and certain types of “arguments”

  • base R functions: part of downloading R
  • Packages: need to be installed, then loaded
    • install.packages(package) once (or to update package)
    • library(package) every R session

Use Help window or ?package to check argument names, types, and defaults

  • Named arguments without values are required (and user-defined)
  • Named arguments with values show defaults
  • Elipses mean that you can add other optional arguments
# file: required
# header = TRUE: default
# ...: other potential arguments, like stringsAsFactors = FALSE
read.csv(file, header = TRUE, sep = ",", quote = "\"",
         dec = ".", fill = TRUE, comment.char = "", ...)

9.1.4 base R

# base R function
# Count the number of rows in this dataframe
nrow(a_dataframe)
## [1] 3
# If I give nrow() too many arguments, it will throw an error
nrow(a_dataframe, a_list)
# If I give it the wrong kind of argument, it will just return NULL
# Some functions won't run at all with wrong kind of argument
nrow(a_list)
## NULL

9.1.5 Packages

# scholar package 
library(scholar)

# get_publications function
# Pull publications from Google Scholar for Marie Curie
get_publications("EmD_lTEAAAAJ&EmD_lTEAAAAJ&") %>% 
  dplyr::filter(cites > 30) %>% 
  distinct(title, .keep_all = TRUE) %>%
  select(author, title) %>%
  head(2) %>%
  kable()
author title
E Curie Madame Curie
M Curie, S Mme Curie

9.1.6 Operators

  • Relational: >, <, ==, !=, <=, >=
    • is.na, exists, etc. will return TRUE/FALSE values
    • grep, filter, str_detect, etc. use TRUE/FALSE values
  • Logical: !, &, &&, |, ||
  • Arithmetic

9.1.7 Reading in data

General parameters for csv files

read.csv("file_name.csv", 
         header = TRUE, 
         stringsAsFactors = FALSE, 
         check.names = FALSE, 
         na.strings = "")

Avoid special characters (including spaces) in file names, directories, and column headers!

  • readxl package for Excel spreadsheets
  • APIs for direct access to online data
    • qualtRics package for Qualtrics data
    • ggmap package for Google services (geolocation data)
  • read_table from readr package for text files

9.1.8 Tidy R philosophy

  • One variable per column
  • One observation per row

9.1.9 Manipulating data with tidyr

  • %>%: pass the results of one function on to another
  • select: choose columns by name
  • mutate: add/change columns
  • filter: filter for (or out) rows
  • group_by and summarise: perform operations on groups of data
  • gather and spread pivot_longer and pivot_wider: condense multiple columns into one or the inverse
  • separate and unite: split a column into multiple or the inverse

9.1.10 Other helpful tidyr functions

  • slice: choose a row
  • pull: choose a column
  • Helper functions for select (e.g., contains)
  • join family of functions: combine datasets based on a shared unique identifier
  • union: combine datasets by rows (column names must be the same)
  • replace_na/drop_na: alter/remove rows with NA values

9.1.11 Helpful base R functions

  • rbind and cbind: add rows/columns
  • nrow and ncol: count rows/columns
  • unique: pull unique values
  • Indexing with var$column and var[row, column]
  • which with column/row indexing

9.1.12 Other packages and functions

  • tibble package for dataframes with tibble
  • kableExtra for kable tables
  • ggplot2 package for graphs (cheat sheet here)
  • factor for ordering text labels in graphs
  • Use na.rm = TRUE argument (e.g., in mean) to remove NA values from calculations

9.2 Intermediate R

  • Dealing with free-response text
  • Dynamic variable creation and reference
  • Helpful functions
  • Other tips

9.2.1 Free-response text

  • Regular expressions (cheat sheet here)
  • Pattern matching
    • Return position/value of elements that match a pattern: grep, agrep
    • Return TRUE/FALSE: grepl, agrepl, str_detect
    • Change a pattern: sub, gsub, replace
    • Return position of pattern for all elements: regexpr, gregexpr, regexec
  • Use perl = TRUE argument to handle especially complex patterns
# List of elements
fruit <- c("apple", "banana", "pear", "pinapple")

# grep position
grep(pattern = "le", x = fruit)
## [1] 1 4
# grep value
grep(pattern = "le", x = fruit, value = TRUE)
## [1] "apple"    "pinapple"
# agrep (can also specify value = TRUE)
agrep(pattern = "le", x = fruit, max.distance = 0.1)
## [1] 1 3 4
# regexpr
# match.length attribute gives starting position of match
# index.type attribute gives length of matched text
regexpr(pattern = "le", text = fruit)
## [1]  4 -1 -1  7
## attr(,"match.length")
## [1]  2 -1 -1  2
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE

9.2.2 Dynamic variables

# Variables
vals <- rep(1:3, 3)
name <- "assign_example"

# Assign values to variable name
assign(name, vals)

# Use the variable as usual
assign_example
## [1] 1 2 3 1 2 3 1 2 3
# You can also get the new variable from the name
get(name)
## [1] 1 2 3 1 2 3 1 2 3
# You can add to this variable dynamically as well
assign(name, c(get(name), 4:6))

# New output
assign_example
##  [1] 1 2 3 1 2 3 1 2 3 4 5 6

9.2.3 Helpful functions

  • apply, lapply, sapply, tapply
  • source variables from R scripts
  • %notin% and %in% (compared to != and ==)
# Example using apply: go across columns of dataset 
  # and substitute characters
language <- apply(language, 2, 
                  function(x) gsub("\\\\", "", x, fixed = TRUE))

# Source other sripts
source("data_cleaning.R", local = TRUE)

# Helper function
"%notin%" <- Negate("%in%")

# Example from processing pipeline for Qualtrics data
unusable <- c("0","00","107")
dat %>% dplyr::filter(Progress==100 & ID %notin% unusable)

9.2.4 Other tips

9.3 Graphing with ggplot

Now for some general tips on using ggplot. I’ll go over the following topics:

  • Formatting
    • Layering plot objects
    • Saving ggplot functions in lists
    • Color palettes
    • color vs. fill
    • Using the theme function
  • Visualization
    • Using Rmisc for error bars
    • Using patchwork to combine multiple plots

9.3.1 Formatting

Recall the scatter plot from Section 5.3.2. If you’re playing around with this code yourself, remember that we had to convert cyl to a factor so that R would treat it as a discrete rather than a continuous variable.

# Output scatter plot
mtcars_scatter

A major aspect of ggplot that you should be aware of is that the plots are built in sequence. In the graph above (the code also appears below), ggplot(mtcars) came first, which just set up a blank graph, then there’s geom_jitter, which added the wt and mpg data and point colors based on the cyl variable, and so on. If I wanted the regression line from geom_smooth to appear beneath the points rather than on top of them, I could’ve just reordered the function calls. This will also help you troubleshoot if you have issues outputting graphs; just comment out individual plot objects, being careful not to leave any dangling plus signs, and rerun to find the problem.

This functionality also means that you can save several layers in a list to apply them easily to multiple graphs. This can save you time and effort when you want to have a consistent look across different graphs; just update a particular parameter in your list once, and then it’ll apply across your graphs.

Some functions, like labs, support this layering very well by allowing multiple calls to the same function. Others, like scale_y_continuous, will override previous calls in the same graph (you’ll get a warning message when this happens). You’ll notice that I have to put commas in between each function in the list rather than plus signs. I also have to save these in a list(rather than with c).

# Create list of specifications
format_list <- list(scale_y_continuous(expand = c(0,0), limits = c(0,6)),
                    scale_x_continuous(expand = c(0,0), limits = c(10,35)),
                    scale_color_brewer(type = "qual", palette = "Paired"),
                    theme_classic(),
                    labs(y = "Vehicle weight (1000 lbs)",
                         x = "Miles per gallon",
                         color = "Cylinders"))

# Create scatter plot
  # Move legend to the bottom
mtcars_scatter <- ggplot(mtcars) +
  geom_jitter(aes(mpg, wt, color = cyl)) +
  geom_smooth(formula = y ~ x, aes(mpg, wt), method = "lm", se = TRUE, level = 0.95, 
              fill = "#d7d8db", color = "black", size = 0.5) +
  format_list +
  theme(legend.position = "bottom")

# Output scatter plot with new legend position
mtcars_scatter

You’ll want to be familiar with the color brewer palettes that I used here. The website I’ve linked allows you to play with the different options and choose ones with particular characteristics, like color blind-friendly palettes. You can use the scale_color/fill_brewer functions with ggplot to apply these color palettes.

Relatedly, there’s a difference between setting fill and color options in ggplot. Generally, scatter plots get color settings, while bar plots get fill settings. If I’d set fill = cyl in geom_jitter, I would’ve gotten a legend with the three cylinder types, but the colors of the points wouldn’t have changed. Similarly, if I’d used scale_fill_brewer rather than scale_color_brewer, the colors of the points would still have been the R defaults (or you would get an error, depending on the data type).

Lastly on formatting, you’ll want to become familiar with the theme function as your graphs increase in complexity. The description for this function says it best: “Themes are a powerful way to customize the non-data components of your plots: i.e. titles, labels, fonts, background, gridlines, and legends.” I typically layer theme_classic with several theme specifications. Search ?theme in your R console or Help window for the full list of parameters.

9.3.2 Visualization

I often need to make bar graphs to present data. I also tend to want to include error bars on these bar graphs. There are different error bars you can calculate, but I tend to have error bars that represent 95% confident intervals around my means. You’ll first need to install the Rmisc package to do this, then transform your data into a long format. I also turned the categorical variables that I cared about into factors. This allows you to control the exact label text and order in your graph. Then, I used the summarySE function to create a table with the means and confidence intervals.

# Create long-form data for creating error table
  # Mutate columns to create factors with particular labels
  # This is harder to change once you're at ggplot stage
mtcars_long <- mtcars %>%
  select(mpg, wt, cyl) %>%
  pivot_longer(cols = -cyl, names_to = "variables", values_to = "measures") %>%
  mutate(cyl = factor(cyl, levels = c(4, 6, 8), 
                      labels = c("4-cylinder", "6-cylinder", "8-cylinder")),
         variables = factor(variables, levels = c("mpg", "wt"), 
                            labels = c("Miles per gallon", "Vehicle weight (1000 lbs)")))

# Make error table
mtcars_error <- summarySE(mtcars_long, 
                          measurevar = "measures", 
                          groupvars = c("cyl", "variables"))

# Output error table
mtcars_error
##          cyl                 variables  N  measures        sd        se        ci
## 1 4-cylinder          Miles per gallon 11 26.663636 4.5098277 1.3597642 3.0297434
## 2 4-cylinder Vehicle weight (1000 lbs) 11  2.285727 0.5695637 0.1717299 0.3826381
## 3 6-cylinder          Miles per gallon  7 19.742857 1.4535670 0.5493967 1.3443253
## 4 6-cylinder Vehicle weight (1000 lbs)  7  3.117143 0.3563455 0.1346860 0.3295647
## 5 8-cylinder          Miles per gallon 14 15.100000 2.5600481 0.6842016 1.4781278
## 6 8-cylinder Vehicle weight (1000 lbs) 14  3.999214 0.7594047 0.2029595 0.4384672

Now I can make a bar graph with this error table. My main ggplot object uses the cyl column and plots the means, while the geom_errorbar object uses the confidence interval (ci) column to create the error bars. You’ll also notice that I changed the width of the error bars and their position, so that they were centered over the bars.

# Create bar graph
mtcars_bar <- ggplot(mtcars_error, aes(cyl, measures, fill = cyl)) +
  geom_bar(position="dodge", stat="identity") +
  geom_errorbar(aes(ymin = measures - ci, ymax = measures + ci), 
                width = 0.1, position = position_dodge(0.9)) +
  coord_flip() +
  scale_y_continuous(expand = expansion(mult = c(0, .2))) +
  scale_fill_brewer(type = "qual", palette = "Paired") +
  theme_classic() +
  facet_wrap(~ variables, scales = "free_x") +
  labs(y = "Measures",
       x = "Cylinders",
       fill = "Cylinders") +
  theme(axis.ticks.y = element_blank(),
        legend.position = "none",
        panel.spacing.x = unit(1.5, "lines"),
        axis.text.y = element_text(angle = 90, hjust = 0.5))

# Output bar graph
mtcars_bar

Sometimes, you’ll want to output multiple graphs with a particular layout. The patchwork package allows you to do this easily. The syntax is very similar to ggplot syntax, in that you add plots together with a plus sign. You can also add a plot_layout object to change aspects of the plots, like their widths.

 mtcars_bar + mtcars_scatter + plot_layout(widths = c(2, 1))

9.4 Practice materials

You can find practice materials for R Markdown and R here. You’ll want to clone this repository or download the following documents/folders:

  • brownbag_activity.Rmd
  • data_cleaning.R
  • apa.csl
  • /data