How to reshape data using the powerful dplyr package
Calculate new variables to analyze
Summarize data differently to suit your unit of analysis
Sort data to make it easier to visualize
How to use the pipe |> to simplify code
How to join two datasets together using different approaches and conditions
Additional resources for manipulating and joining data using dplyr
You should have:
A template R script that we will fill out today called class_script_blank.R
dplyr cheatsheet
A link to a website where you can find the slides and final code from this workshop.
Basic knowledge of R
Installing and loading packages
Basic terminology of R or programming in general
Libraries
Today we’ll be using the tidyverse library, which includes dplyr.
library(tidyverse)
Warning: package 'tidyverse' was built under R version 4.3.2
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.2 ✔ readr 2.1.4
✔ forcats 1.0.0 ✔ stringr 1.5.0
✔ ggplot2 3.4.2 ✔ tibble 3.2.1
✔ lubridate 1.9.2 ✔ tidyr 1.3.0
✔ purrr 1.0.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Review: reading and viewing data
# we'll be looking at data on Groundhog predictions
groundhogs <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2024/2024-01-30/groundhogs.csv')
predictions <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2024/2024-01-30/predictions.csv')
You can view a dataframe in R using View() or by clicking the object in the environment pane.
Let’s take a look at our groundhog predictions dataset:
head(predictions)
Our task today
We are groundhog mythbusters and our goal is to collect some summary statistics about the groundhog prediction phenomenon.
Our main question is whether different groundhogs are more or less likely to predict an early spring.
Over the course of this workshop, we’ll be creating summary tables that will set us up for further visualization and analysis.
The dplyr package
dplyr is a “grammar” of data manipulation
dplyr is a set of R functions that work together to help you solve the most common data manipulation challenges, like:
Filtering out certain rows and sorting your data
Calculating a new column based on existing columns
Grouping and summarizing data
Joining data from different sources
dplyr grammar
dplyr’s core is a set of functions that can be divided into 4 groups based on what they operate across:
rows
columns
groups
tables
We’ll call these the dplyr verbs
dplyr also contains a number of useful helper functions that operate on single values or arrays. We’ll introduce those along the way.
anatomy of a dplyr verb
In every dplyr verb:
the first argument is always a dataframe
the output is always a new dataframe
arguments with a . in front of them are settings for the function, not column names
the pipe 💉
Each verb does one thing very well, so you’ll typically chain together multiple verbs. The pipe helps you do this by passing the result of every action onto the next action.
The pipe is represented in R as |>. Everything to the left of the pipe is passed as the first argument to the thing immediately to the right of the pipe.
x |> f(y) is equivalent to f(x, y)
x |> f(y) |> g(z) is equivalent to g(f(x, y), z)
⚠️ The pipe does not save new variables automatically
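A minimal sketch of both points, using a tiny stand-in tibble rather than the real predictions data:

```r
library(dplyr)

# a tiny stand-in for the predictions data
df <- tibble(year = c(2019, 2020, 2021), shadow = c(TRUE, FALSE, TRUE))

# df |> filter(year >= 2020) is equivalent to filter(df, year >= 2020):
# the pipe passes df as the first argument to filter()
recent <- df |> filter(year >= 2020)  # assign with <- to save the result
```

Without the `recent <-` assignment, the filtered result would only be printed and then discarded.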
row verbs
filter() : keep rows based on the value of one or more columns
arrange(): changes the row order based on one or more columns
distinct(): finds all the unique rows based on the values of one or more columns
row verbs: filter()
filter() : keep rows based on the value of one or more columns
You can compose conditions using ==, >, <, >=, <=, !=, and include multiple conditions using & or |
The %in% operator can serve as a combination of | and ==
row verbs: filter()
# find groundhog predictions from 2020
filter(predictions, year == 2020)

# find groundhog predictions from 2020 and 2021
filter(predictions, year == 2020 | year == 2021)
filter(predictions, year %in% c(2020, 2021))

# find groundhog predictions from 2020 where a shadow was seen
filter(predictions, year == 2020 & shadow == TRUE)
filter(): your turn!
Find groundhog predictions between 1900 and 2000. Bonus: Use the pipe in your answer!
filter(): your turn!
# find predictions between 1900 and 2000
predictions |> filter(year >= 1900 & year <= 2000)
filter(): useful helper functions
between() tests if a variable falls between two values (inclusive)
near() tests if a variable is within a certain range of a given number (you can set the tolerance)
is.na() tests whether the variable is NA. Use it in conjunction with ! to filter for non-NA values.
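A quick sketch of two of these helpers on a toy tibble (invented here for illustration):

```r
library(dplyr)

df <- tibble(year = c(1899, 1950, 2001), shadow = c(TRUE, NA, FALSE))

# between() is inclusive on both ends
mid_century <- df |> filter(between(year, 1900, 2000))

# combine is.na() with ! to keep only rows where shadow was recorded
has_shadow_record <- df |> filter(!is.na(shadow))
```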
row verbs: arrange()
arrange(): changes the row order based on one or more columns
You can wrap the columns with desc() to sort in descending order
# sort our predictions by year
arrange(predictions, year)

# sort our predictions by year, descending
arrange(predictions, desc(year))
row verbs: distinct()
distinct(): finds all the unique rows based on the values of one or more columns
Without any additional inputs, distinct() finds and keeps the first occurrence of all unique rows
You can optionally supply one or more columns to check for distinct combinations of their values
If you want to retain all of the columns, set the .keep_all argument to TRUE
# list the unique years in the predictions dataset
predictions |> distinct(year)
🚩checkpoint: row verbs
Let’s put it all together!
Remove rows with no prediction record
Remove duplicate predictions
Sort the result by year, descending
Assign the result to predictions, overwriting the previous dataframe
🚩checkpoint: row verbs
# create a subset of your data where "shadow" has a value of either TRUE or FALSE.
# Make sure there are no duplicate rows, and sort the result by descending year.
predictions <- predictions |>
  filter(shadow %in% c(TRUE, FALSE)) |>
  distinct(year, id, .keep_all = TRUE) |>
  arrange(desc(year))
group verbs
group_by() groups your dataframe
summarize() reduces the dataframe to a summary table with one row for each group and one or more calculations by group
group verbs: group_by()
group_by() groups your dataframe
On its own, it doesn’t change your data. But you can feed the “grouped” output into other special functions to apply different transformations to each group in your data.
# group predictions by year
predictions |> group_by(year)
group verbs: summarize()
summarize() reduces the dataframe to a summary table with one row for each group and one or more calculations by group
The syntax is dataframe |> group_by(column) |> summarize(new_variable = summary_function(...))
One of the most important summaries is n(), which counts the observations (rows) in each group.
Let’s try it together: How many predictions were made in each year?
n() within summarize()
# How many predictions were made in each year?
predictions |>
  group_by(year) |>
  summarize(n_predictions = n()) |>
  arrange(desc(year))
summarize() helper functions
Other powerful summary functions include:
n_distinct(): counts the number of distinct values of a given column within a group
max() and min(): finds the max and min value of a given column within a group
Exercises:
How many different groundhogs made predictions each year?
What is the first year each groundhog made a prediction?
summarize() helper functions
# How many different groundhogs made predictions each year?
predictions |>
  group_by(year) |>
  summarize(n_groundhogs = n_distinct(id)) |>
  arrange(desc(n_groundhogs))
summarize() helper functions
# What is the first year each groundhog made a prediction?
predictions |>
  group_by(id) |>
  summarize(first_prediction = min(year))
sum() within summarize()
sum(): finds the sum of a given column within a group. You can also specify conditions within sum() to calculate the number of records within a group that meet a certain condition.
Exercise: Let’s return to our dataframe with the number of predictions in each year. How would we add a column for the number of shadows seen in each year?
sum() within summarize()
# Let's return to our dataframe with the number of predictions in each year.
# How would we add a column for the number of shadows seen in each year?
predictions |>
  group_by(year) |>
  summarize(n_predictions = n(), n_shadows = sum(shadow == TRUE))
🚩checkpoint: group verbs
Your turn! Create a dataframe with three variables:
groundhog id
the number of total predictions each groundhog has made
the number of times each groundhog has seen its shadow.
🚩checkpoint: group verbs
# Create a dataframe with 3 variables:
# - groundhog id
# - the number of total predictions each groundhog has made
# - the number of times each groundhog has seen its shadow
predictions |>
  group_by(id) |>
  summarize(n_predictions = n(), n_shadows = sum(shadow == TRUE))
column verbs
Now that we’ve calculated some summary variables within the groups that interest us (groundhog and year), we might want to use those summary variables to calculate more new variables.
mutate() adds new columns calculated from existing columns
select() keeps a subset of columns
rename() renames columns
column verbs: mutate()
mutate() adds new columns calculated from existing columns
By default, new columns are added on the right side of the dataframe. You can use the .before or .after arguments to specify where the new variable should fall
# calculate how many characters are in the details field and put the variable after id
predictions |> mutate(details_length = nchar(details), .after = id)
re-coding data with mutate()
if_else() tests for a condition and returns one value if true and another if false.
# create a column that indicates whether the prediction was made by Punxsutawney Phil
predictions |> mutate(phil = if_else(id == 1, TRUE, FALSE))
re-coding data with mutate()
case_when() tests for multiple conditions and maps them to values accordingly.
# create a column that indicates the century of the prediction
predictions |> mutate(century = case_when(
  year < 1900 ~ '19',
  year >= 1900 & year < 2000 ~ '20',
  year >= 2000 ~ '21',
  TRUE ~ 'Year out of range'
))
column verbs: select() and rename()
select() keeps a subset of columns
You can select by name, by a range of names, by data type (select(where(is.character))), or with helper functions such as starts_with(), ends_with(), or contains()
You can rename variables as you select them with = , with the new name on the left and old on the right
rename() works the same way as renaming in select with =
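A small sketch of select() and rename() on a toy tibble (column names invented for illustration):

```r
library(dplyr)

df <- tibble(id = c(1, 2), year = c(2020, 2021), shadow = c(TRUE, FALSE))

# keep id (renamed to groundhog_id), year, and any column starting with "sh"
subset_df <- df |> select(groundhog_id = id, year, starts_with("sh"))

# rename() keeps every column, changing only the ones you name
renamed_df <- df |> rename(prediction_year = year)
```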
🚩checkpoint: put it all together!
Let’s return to our original research question: Are certain groundhogs more likely to see their shadow than others?
Working off of our table with the number of predictions and number of shadows seen per groundhog, let’s:
Add a column called shadow_percent that gives the percentage of time each groundhog sees its shadow
Filter for groundhogs with more than 5 predictions
Keep only the variables id and shadow_percent, and rename id to groundhog_id
Assign the result to a variable groundhog_predictions
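One possible solution sketch, using a toy stand-in for the per-groundhog summary table from the previous checkpoint (the real table comes from summarizing predictions by id):

```r
library(dplyr)

# stand-in for the per-groundhog summary (id, n_predictions, n_shadows)
per_groundhog <- tibble(
  id = c(1, 2),
  n_predictions = c(10, 3),
  n_shadows = c(6, 1)
)

groundhog_predictions <- per_groundhog |>
  mutate(shadow_percent = n_shadows / n_predictions * 100) |>
  filter(n_predictions > 5) |>
  select(groundhog_id = id, shadow_percent)
```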
We’ve done a lot with the mere 4 variables in our predictions table!
What if we wanted to enhance our data with more information about each groundhog from the groundhogs table?
head(groundhogs)
join terminology
There are two main types of join:
mutating joins add variables from one dataframe to another based on matching characteristics between the two
filtering joins subset one dataframe based on matching characteristics with another dataframe
join terminology 🗝️
Every join involves a primary key and a foreign key
A primary key is a variable or set of variables that uniquely identifies an observation
A foreign key is just another table’s primary key that matches your table’s primary key. It might have a different name or be spread across more or fewer variables.
The first step when joining data is to identify the primary and foreign keys you’ll work with
Always check that your primary key is unique to each row! Foreign keys may or may not be unique.
joining predictions & groundhogs
head(groundhog_predictions, 3)
head(groundhogs, 3)
joining predictions & groundhogs
How would we determine if there is a difference between the average shadow prediction rate of different types of groundhogs?
primary key: groundhog_id in groundhog_predictions
foreign key: id in groundhogs
joining predictions & groundhogs
We want to add the variables from groundhogs to our groundhog_predictions table
We’ll need a mutating join, specifically a left join.
A left join retains all rows in the left dataframe, and adds additional data in from the right dataframe if the keys match.
right_join() keeps everything in the right dataframe and adds in data from the left
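A left join on toy versions of our two tables (values invented for illustration), matching groundhog_id in the left table to id in the right:

```r
library(dplyr)

gp <- tibble(groundhog_id = c(1, 2), shadow_percent = c(60, 40))
gh <- tibble(id = c(1, 2, 3), name = c("Phil", "Willie", "Sam"))

# keep every row of gp, pulling in matching columns from gh
joined <- left_join(gp, gh, by = join_by(groundhog_id == id))
```

Groundhog 3 appears only in gh, so it is dropped; a right_join() with the same arguments would keep it instead.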
more mutating joins
inner_join() keeps rows with keys that appear in both dataframes
full_join() keeps all rows from both dataframes
filtering joins
filtering joins subset one dataframe based on matching characteristics with another dataframe. In a filtering join, no columns are added; rows are only kept or dropped.
semi_join(x, y) keeps all rows in x with a match in y
anti_join(x, y) returns all rows in x without a match in y
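A sketch of both filtering joins on the same toy tables used above (values invented for illustration):

```r
library(dplyr)

gp <- tibble(groundhog_id = c(1, 2))
gh <- tibble(id = c(1, 2, 3), name = c("Phil", "Willie", "Sam"))

# semi_join keeps gh rows with a match in gp; no columns are added
matched <- semi_join(gh, gp, by = join_by(id == groundhog_id))

# anti_join keeps gh rows without a match in gp
unmatched <- anti_join(gh, gp, by = join_by(id == groundhog_id))
```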
join exercises
groundhog_predictions contains one row for each of 50 unique groundhogs
groundhogs contains one row for each of 65 unique groundhogs
Every groundhog in groundhog_predictions appears in groundhogs
How many rows would each of the following joins have: right join with groundhogs on the right, inner join, full join, semi_join, anti_join?
more complex join conditions
Within join_by(), we can use more complex conditions than whether key == key
You can use other numeric operations like >, <, etc.
The closest() function matches the closest key to another key based on some criteria (closest value at all, closest value that is larger, etc.)
between() and within() can test whether a value falls between two other values. This is useful if you want to join events that happened within a given time span.
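A sketch of a between() join condition on invented event and time-span tables (the column names here are hypothetical, not from the groundhog data):

```r
library(dplyr)

events <- tibble(event = c("a", "b"), day = c(3, 10))
spans  <- tibble(span = c("early", "late"), start = c(1, 8), end = c(7, 14))

# match each event to the span whose [start, end] range contains its day
joined <- left_join(events, spans, by = join_by(between(day, start, end)))
```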
other table verbs
bind_rows() pastes rows onto the bottom of a dataframe
bind_cols() pastes columns onto the right of a dataframe.
There are no matching conditions in these functions; you can think of them as copy-and-paste.
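A minimal sketch of both, on invented toy tibbles:

```r
library(dplyr)

top <- tibble(id = 1, year = 2020)
bottom <- tibble(id = 2, year = 2021)
stacked <- bind_rows(top, bottom)      # rows pasted underneath, columns matched by name

left_part <- tibble(id = c(1, 2))
right_part <- tibble(shadow = c(TRUE, FALSE))
side_by_side <- bind_cols(left_part, right_part)  # columns pasted on; rows matched by position only
```

Because bind_cols() matches rows purely by position, make sure both inputs are in the same order before pasting.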
🏁 finish line
Let’s put everything we’ve learned together!
Let’s create a summary table that gives the rate at which each type of groundhog sees its shadow
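One possible solution sketch, using toy stand-ins for the two tables and assuming (as in the workshop data) that groundhogs has a type column describing each groundhog:

```r
library(dplyr)

# toy stand-ins; the real tables come from the steps above
groundhog_predictions <- tibble(groundhog_id = c(1, 2, 3),
                                shadow_percent = c(60, 40, 80))
groundhogs <- tibble(id = c(1, 2, 3),
                     type = c("Groundhog", "Taxidermied", "Groundhog"))

# join groundhog type onto the per-groundhog rates, then average by type
shadow_by_type <- groundhog_predictions |>
  left_join(groundhogs, by = join_by(groundhog_id == id)) |>
  group_by(type) |>
  summarize(mean_shadow_percent = mean(shadow_percent))
```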