Reshaping and Joining Data in R with dplyr

Welcome!

Today’s session will not be recorded.

Please feel free to ask questions verbally or in the chat, and I will do my best to answer them as we move through the training today.

We will take a break halfway through the training at 11am.

JHU Data Services

Find out more at dataservices.library.jhu.edu
Email us for a consultation at dataservices@jhu.edu
Share your research data at archive.data.jhu.edu

What you will learn today

How to reshape data using the powerful dplyr package
- Calculate new variables to analyze
- Summarize data differently to suit your unit of analysis
- Sort data to make it easier to visualize
How to use the pipe |> to simplify code
How to join two datasets together using different approaches and conditions
Additional resources for manipulating and joining data using dplyr

You should have:

A template R script that we will fill out today called class_script_blank.R
dplyr cheatsheet
A link to a website where you can find the slides and final code from this workshop.
Basic knowledge of R
- Installing and loading packages
- Basic terminology of R or programming in general

Libraries

Today we’ll be using the tidyverse library, which includes dplyr.

library(tidyverse)

Warning: package 'tidyverse' was built under R version 4.3.2

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.2     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Review: reading and viewing data

# we'll be looking at data on Groundhog predictions
groundhogs <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2024/2024-01-30/groundhogs.csv')
predictions <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2024/2024-01-30/predictions.csv')

You can view a dataframe in R using View() or by clicking the object in the environment pane.

Let’s take a look at our groundhog predictions dataset:

head(predictions)

Our task today

We are groundhog mythbusters and our goal is to collect some summary statistics about the groundhog prediction phenomenon.
Our main question is whether different groundhogs are more or less likely to predict an early spring.
Over the course of this workshop, we’ll be creating summary tables that will set us up for further visualization and analysis.

The dplyr package

dplyr is a “grammar” of data manipulation
dplyr is a set of R functions that work together to help you solve the most common data manipulation challenges, like:
- Filtering out certain rows and sorting your data
- Calculating a new column based on existing columns
- Grouping and summarizing data
- Joining data from different sources

dplyr grammar

dplyr’s core is a set functions that can be divided into 4 groups based on what they operate across:
- rows
- columns
- groups
- tables
We’ll call these the dplyr verbs
dplyr also contains a number of useful helper functions that operate on single values or arrays. We’ll introduce those along the way.

anatomy of a dplyr verb

In every dplyr verb:

the first argument is always dataframe
the output is always a new dataframe
arguments with a . in front of them are settings for the function, not column names

the pipe 💉

Each verb does one thing very well, so you’ll typically chain together multiple verbs. The pipe helps you do this by passing the result of every action onto the next action.
The pipe is represented in R as |>. Everything to the left of the pipe is passed as the first argument to the thing immediately to the right of the pipe.
x |> f(y) is equivalent to f(x, y)
x |> f(y) |> g(x) is equivalent tog(f(x,y), z)
⚠️ The pipe does not save new variables automatically

row verbs

filter() : keep rows based on the value of one or more columns
arrange(): changes the row order based on one or more columns
distinct(): finds all the unique rows based on the values of one or more columns

row verbs: `filter()`

filter() : keep rows based on the value of one or more columns
You can compose conditions using ==, >, <, >=, <=, !=, and include multiple conditions using & or |
The %in% operator can serve as a combination of | and ==

row verbs: `filter()`

# find groundhog predictions from 2020
filter(predictions, year == 2020)

# find groundhog predictions from 2020 and 2021
filter(predictions, year == 2020 | year == 2021)

filter(predictions, year %in% c(2020, 2021))

# find groundhog predictions from 2020 where a shadow was seen
filter(predictions, year == 2020 & shadow == TRUE)

`filter()`: your turn!

Find groundhog predictions between 1900 and 2000. Bonus: Use the pipe in your answer!

`filter()`: your turn!

# find predictions between 1900 and 2000
predictions |>
  filter(year >= 1900 & year <= 2000)

`filter()`: useful helper functions

between() tests if a variable falls between two values (inclusive)
near() tests if a variable is within a certain range of a given number (you can set the tolerance)
is.na() tests whether the variable is NA. Use is conjunction with ! to filter for non-NA values.

row verbs: `arrange()`

arrange(): changes the row order based on one or more columns

You can wrap the columns with desc() to sort in descending order

# sort our predictions by year
arrange(predictions, year)

# sort our predictions by year, descending
arrange(predictions, desc(year))

row verbs: `distinct()`

distinct(): finds all the unique rows based on the values of one or more columns

Without any additional inputs, distinct() finds and keeps the first occurence of all unique rows
You can optionally supply one or more columns to check for distinct combinations of
If you want to retain all of the columns, set the .keep_all argument to TRUE

# list the unique years in the predictions dataset
predictions |>
  distinct(year)

🚩checkpoint: row verbs

Let’s put it all together!

Remove rows with no prediction record
Remove duplicate predictions
Sort the result by year, descending
Assign the result to predictions, overwriting the previous dataframe

Open up the script and look at the predictions table in R
We’ll mainly be working with our predictions table to answer our question of whether there is any pattern to groundhog predictions
There are a few things we want to change and verify in the data before we start calculating new variabales
The way I like to approach data reshaping is to envision the table I want to end up with and work backwards from there. For our question, it would be nice to have table with one row per year that gives the percentage of groundhogs that saw their shadow that year.
In order to make this percentage, we need to have counts of the number of groundhogs that saw their shadow in a given year, and the total number that made predictions each year
Once you have more experience using dplyr, you’ll start to be able to identify things that will throw off your desired groupings and calculations.
One thing that jumps out to me about the predictions table is that there are rows that represent years with no recorded prediction - we want to filter these out because they will mess up our counts of predictions each year
We also want to make sure there is only one record per groundhog per year
Finally, we want to sort the dataframe in descending order simply because the later years are more interesting because there are more groundhogs on the scene.
Write the code in the script

🚩checkpoint: row verbs

# create a subset of your data where "shadow" has a value of either TRUE or FALSE. Make sure there are no duplicate rows, and sort the result by ascending year.
predictions <- predictions |>
  filter(shadow %in% c(TRUE, FALSE)) |>
  distinct(year, id, .keep_all = TRUE) |>
  arrange(desc(year))

group verbs

group_by() groups your dataframe
summarize() reduces the dataframe to a summary table with one row for each group and one or more calculations by group

group verbs: `group_by()`

group_by() groups your dataframe

On it’s own, it doesn’t change your data. But you can feed the “grouped” output into other special functions to apply different transformations to each group in your data.

# group predictions by year 
predictions |>
  group_by(year)

group verbs: `summarize()`

summarize() reduces the dataframe to a summary table with one row for each group and one or more calculations by group
The syntax is dataframe |> group_by(column) |> summarize(new_variable = summary_function(..))
One of the most important summaries is n(), which counts the observations (rows) in each group.
Let’s try it together: How many predictions were made in each year?

`n()`within `summarize()`

# How many predictions were made in each year?
predictions |>
  group_by(year) |>
  summarize(n_predictions = n()) |>
  arrange(desc(year))

`summarize()` helper functions

Other powerful summary functions include:
- n_distinct(): counts the number of distinct values of a given column within a group
- max() and min(): finds the max and min value of a given column within a group
Exercises:
- How many different groundhogs made predictions each year?
- What is the first year each groundhog made a prediction?

`summarize()` helper functions

# How many different groundhogs made predictions each year?
predictions |>
  group_by(year) |>
  summarize(n_groundhogs = n_distinct(id)) |>
  arrange(desc(n_groundhogs))

`summarize()` helper functions

# What is the first year each groundhog made a prediction?
predictions |>
  group_by(id) |>
  summarize(first_prediction = min(year))

`sum()` within `summarize()`

sum(): finds the sum of a given column within a group. You can also specify conditions within sum() to calculate the number of records within a group that meet a certain condition.
Exercise: Let’s return to our dataframe with the number of predictions in each year. How would we add a column for the number of shadows seen in each year?

`sum()` within `summarize()`

# Let's return to our dataframe with the number of predictions in each year. 
# How would we add a column for the number of shadows seen in each year?
predictions |>
  group_by(year) |>
  summarize(n_predictions = n(),
            n_shadows = sum(shadow == TRUE))

🚩checkpoint: group verbs

Your turn! Create a dataframe with three variables:

groundhog id
the number of total predictions each groundhog has made
the number of times each groundhog has seen it’s shadow.

🚩checkpoint: group verbs

# Create a dataframe with 3 variables: 
# groundhog id
# the number of total predictions each groundhog has made
# the number of times each groundhog has seen its shadow
predictions |>
  group_by(id) |>
  summarize(n_predictions = n(),
            n_shadows = sum(shadow == TRUE))

column verbs

Now that we’ve calculated some summary variables within the groups that interest us (groundhog and year), we might want to use those summary variables to calculate more new variables.

mutate() adds new columns calculated from existing columns
select() keeps a subset of columns
rename() renames columns

column verbs: `mutate()`

mutate() adds new columns calculated from existing columns

By default, columns are added on the left side of the dataframe. You can use the .before or .after to specify where the new variable should fall

# calculate how many characters are in the details field and put the variable after id
predictions |>
  mutate(details_length = nchar(details), .after = id)

re-coding data with `mutate()`

if_else() tests for a condition and returns one value if true and another if false.

# create a column that indicates whether the prediction was made by Punxatawney Phil
predictions |> 
  mutate(phil = if_else(id == 1, 'TRUE', 'FALSE'))

re-coding data with `mutate()`

case_when() tests for multiple conditions and maps them to values accordingly.

# create a column that indicates the century of the predictions
predictions |> 
  mutate(century = case_when(year < 1900 ~ '19',
                             year < 2000 & year >= 1900 ~ '20',
                             year >= 2000 ~ '21',
                             TRUE ~ 'Year out of range'))

column verbs: `select()` and `rename()`

select() keeps a subset of columns
- You can select by name, series, test for data type (select(where(is.character()))) or use other helper functions such as starts_with(), ends_with(), or contains()
- You can rename variables as you select them with = , with the new name on the left and old on the right
rename() works the same way as renaming in select with =

🚩checkpoint: put it all together!

Let’s return to our original research question: Are certain groundhogs more likely to see their shadow than others?

Working off of our table with the number of predictions and number of shadows seen per groundhog, lets:

Add a column called shadow_percent that gives the percentage of time each groundhog sees its shadow
Filter for groundhogs with more than 5 predictions
Keep only the variables id and shadow_percent, and rename id to groundhog_id
Assign the result to a variable groundhog_predictions

🚩checkpoint: put it all together!

groundhog_predictions <- predictions |>
  group_by(id) |>
  summarize(n_predictions = n(),
            n_shadows = sum(shadow == TRUE)) |>
  mutate(shadow_percent = n_shadows/n_predictions) |>
  filter(n_predictions > 5) |>
  select(id, shadow_percent) |>
  rename(groundhog_id = id)

table verbs: joining data

We’ve done a lot with the mere 4 variables in our predictions table!

What if we wanted to enhance our data with more information about each groundhog from the groundhogs table?

head(groundhogs)

join terminology

There are two main types of join:

mutating joins add variables from one dataframe to another based on matching characteristics between the two
filtering joins subset one dataframe based on matching characteristics with another dataframe

join terminology 🗝️

Every join involves a primary key and a foreign key
- A primary key is a variable or set of variables that uniquely identifies an observation
- A foreign key is just another table’s primary key that matches your tables’ primary key. It might have a different name or be spread across more or less variables.
The first step when joining data is to identify the primary and foreign keys you’ll work with
Always check that your primary key is unique to each row! Foreign keys may or may not be unique.

joining predictions & groundhogs

head(groundhog_predictions, 3)

head(groundhogs, 3)

joining predictions & groundhogs

How would we determine if there is a difference between the average shadow prediction rate of different types of groundhogs?
primary key: groundhog_id in groundhog_predictions
foreign key: id in groundhogs

joining predictions & groundhogs

We want to add the variables from groundhogs to our groundhog_predictions table
We’ll need a mutating join, specifically a left join.
A left join retains all rows in the left dataframe, and adds additional data in from the right dataframe if the keys match.
left_join(x, y, join_by(x.key == y.key))

joining predictions & groundhogs

left_join(groundhog_predictions, groundhogs, join_by(groundhog_id == id))

more mutating joins

right_join() keeps everything in the right dataframe and adds in data from the left

more mutating joins

inner_join() keeps rows with keys that appear in both dataframes

full_join() keeps all rows from both dataframes

filtering joins

filtering joins subset one dataframe based on matching characteristics with another dataframe. In filtering

semi_join(x, y) keeps all rows in x with a match in y

anti_join(x, y) returns all rows in x without a match in y

join exercises

groundhog_predictions contains one row per 50 unique groundhogs
groundhogs contains one row per 65 unique groundhogs
Every groundhog in groundhog_predictions appears in groundhogs
- How many rows would each of the following joins have: right join with groundhogs on the right, inner join, full join, semi_join, anti_join?

more complex join conditions

Within join_by(), we can use more complex conditions than whether key == key
You can use other numeric operations like >, <, etc.
The closest() function matches the closest key to another key based on some criteria (closest value at all, closest value that is larger, etc.)
between() and within() can test whether a value falls between two other values. This is useful if you want to join events that happened within a given time span.

other table verbs

bind_rows() pastes rows onto the bottom of a dataframe
bind_cols() pastes columns onto the right of a dataframe.
There are no conditions in these functions, you can think of them as copy-and-paste.

🏁 finish line

Let’s put everything we’ve learned together!

Let’s create a summary table that gives the rate at which each type of groundhog sees its’ shadow

🏁 finish line

predictions |>
  left_join(groundhogs, join_by(id == id)) |>
  group_by(type) |>
  summarize(n_groundhogs = n_distinct(id),
            n_predictions = n(),
            n_shadows = sum(shadow == TRUE)) |>
  mutate(percent_shadow = n_shadows/n_predictions)

🏆 bonus exercises

Write code to calculate the column predictions_count in groundhogs
Write code to calculate the column is_groundhog in groundhogs
Calculate the proportion of groundhogs from each country that make predictions each year
Add a column to groundhogs indicating the first year each groundhog saw its shadow
Create a summary table showing the first year each type of groundhog made a prediction

summary: verbs & helper functions

Verbs:

filter() , arrange() , distinct()
group_by() , summarize()
mutate() , select(), rename()
left_, right_, inner_, full_, semi_, anti_ joins
bind _rows and _cols

Helper functions:

desc()
n(), n_distinct(), min(), max() , sum()
if_else() and case_when()
between() and within()

resources

R for Data Science 2e, Chapters 3 & 19
dplyr documentation
Software Carpentries: Data Frame Manipulation with dplyr
Website for the workshop: https://jhu-data-services.github.io/dplyr-quarto-site/

thank you! 🙏

Please take the post-workshop survey: https://bit.ly/dplyr-survey

Future trainings:

Designing Effective Data Visualizations: 3/14 12-2pm
Finding a Repository to Share Research Data: 3/26 12-1pm
Preparing to Share Human Participant Data in a Repository: 4/24 12-1pm

Reshaping and Joining Data in R with dplyr

Welcome!

JHU Data Services

We help faculty, researchers, and students find, use, manage, visualize, and share data.

What you will learn today

You should have:

Libraries

Review: reading and viewing data

Our task today

The dplyr package

dplyr grammar

anatomy of a dplyr verb

the pipe 💉

row verbs

row verbs: filter()

row verbs: filter()

filter(): your turn!

filter(): your turn!

filter(): useful helper functions

row verbs: arrange()

row verbs: distinct()

🚩checkpoint: row verbs

🚩checkpoint: row verbs

group verbs

group verbs: group_by()

group verbs: summarize()

n()within summarize()

summarize() helper functions

summarize() helper functions

summarize() helper functions

sum() within summarize()

sum() within summarize()

🚩checkpoint: group verbs

🚩checkpoint: group verbs

column verbs

column verbs: mutate()

re-coding data with mutate()

re-coding data with mutate()

column verbs: select() and rename()

🚩checkpoint: put it all together!

🚩checkpoint: put it all together!

table verbs: joining data

join terminology

join terminology 🗝️

joining predictions & groundhogs

joining predictions & groundhogs

joining predictions & groundhogs

joining predictions & groundhogs

more mutating joins

more mutating joins

filtering joins

join exercises

more complex join conditions

other table verbs

🏁 finish line

🏁 finish line

🏆 bonus exercises

summary: verbs & helper functions

resources

thank you! 🙏

row verbs: `filter()`

row verbs: `filter()`

`filter()`: your turn!

`filter()`: your turn!

`filter()`: useful helper functions

row verbs: `arrange()`

row verbs: `distinct()`

group verbs: `group_by()`

group verbs: `summarize()`

`n()`within `summarize()`

`summarize()` helper functions

`summarize()` helper functions

`summarize()` helper functions

`sum()` within `summarize()`

`sum()` within `summarize()`

column verbs: `mutate()`

re-coding data with `mutate()`

re-coding data with `mutate()`

column verbs: `select()` and `rename()`