In that same talk, the all_of function was introduced as a way to safely
select columns from data frames. In this post, I will discuss how to use
this function and the .data pronoun to safely select columns in
production-grade R code.
Setup
First, I will set up my environment for the rest of the post:
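A minimal setup sketch, assuming dplyr is installed and that the built-in mtcars dataset serves as the example data (the exact packages and options used are an assumption):

```r
# Assumed setup: attach dplyr. mtcars ships with R's datasets package,
# so no extra data loading is needed.
library(dplyr)

# List the columns that are available for selection.
names(mtcars)
```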
Safely Selecting a Single Column
As I mentioned in my last post, “data masking” enables you to “blend data with
the workspace”. This lets you easily select a single column from a
data frame like this:
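A sketch of such a call, using the dplyr:: prefix so it runs without attaching the package:

```r
# The bare name mpg is resolved as a column of mtcars.
single_col <- dplyr::select(mtcars, mpg)
names(single_col)
```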
Here the dplyr::select function knows you are trying to select the “mpg”
column from the mtcars data frame. You can also use a variable to store the
column name you want to select:
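For illustration, assuming the variable holds the name "mpg":

```r
# col_to_select is not a column of mtcars, so select() falls back to the
# variable's value. Recent tidyselect versions flag this ambiguous form
# with a note or a deprecation warning, but it still selects mpg.
col_to_select <- "mpg"
via_variable <- dplyr::select(mtcars, col_to_select)
names(via_variable)
```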
N.B.: We’ll ignore the note regarding the usage of all_of for now and come back
to it at the end of the post.
The dplyr::select function figures out that you are selecting the column whose
name is stored in the col_to_select variable and not a phantom column called
“col_to_select” (since no such column exists). Now what happens when you do
something like this?
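A sketch of the situation, assuming a variable that happens to share its name with an existing column (the value "mpg" is an illustrative choice):

```r
# The variable is called gear, but mtcars also has a gear column.
gear <- "mpg"
masked <- dplyr::select(mtcars, gear)

# The column wins: the result contains gear, not mpg.
names(masked)
```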
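Due to data masking, the “gear” column name takes precedence over the value
stored in the gear variable (strictly speaking, dplyr::select uses tidy
selection, a close relative of data masking that applies the same precedence
here). There are two ways to deal with this. You can explicitly refer to the
column by double quoting its name.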
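With a quoted name there is nothing for a variable to collide with:

```r
# A string literal always refers to the column of that name.
quoted <- dplyr::select(mtcars, "gear")
names(quoted)
```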
If you want to use a variable to store the column name, then the safest thing to
do is to combine it with the .data pronoun:
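A hedged sketch, again assuming the variable gear holds "mpg". Note that tidyselect 1.2.0 and later deprecate .data inside selections in favour of all_of, so this form may emit a warning on a current installation:

```r
# .data[[gear]] looks up the column named by the *value* of gear,
# so the gear column no longer shadows the variable.
gear <- "mpg"
via_pronoun <- dplyr::select(mtcars, .data[[gear]])
names(via_pronoun)
```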
Safely Selecting Multiple Columns
If you want to safely select multiple columns, then you should start by double
quoting the column names to make it explicit:
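For example, picking two columns (the choice of mpg and cyl is illustrative):

```r
# Each quoted string names exactly one column.
multi_quoted <- dplyr::select(mtcars, "mpg", "cyl")
names(multi_quoted)
```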
If you want to use a variable to select multiple columns, you should create a
character vector storing the different column names:
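One way this might look, with the same two illustrative columns:

```r
# cols_to_select is not a column of mtcars, so its values are used.
# As with a single name, recent tidyselect versions warn that this
# bare-variable form is ambiguous.
cols_to_select <- c("mpg", "cyl")
multi_via_variable <- dplyr::select(mtcars, cols_to_select)
names(multi_via_variable)
```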
Here the dplyr::select function is smart enough to figure out that you are
asking for multiple columns stored in the cols_to_select variable. However,
what happens if you have something like this?
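The collision can be reproduced by naming the vector after an existing column (the name gear and its values are assumptions for illustration):

```r
# The vector is named gear, clashing with the gear column of mtcars.
gear <- c("mpg", "cyl")
masked_multi <- dplyr::select(mtcars, gear)

# Only the gear column comes back, not mpg and cyl.
names(masked_multi)
```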
Similar to the scenario where we were selecting a single column, data masking
results in the “gear” column name taking precedence over the values in the
gear variable. The result is unexpected: you get the “gear” column rather than
the columns named in the variable. This is where the all_of function comes in:
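A short sketch using the all_of helper that dplyr re-exports from tidyselect, with the same assumed gear vector:

```r
# all_of() treats gear strictly as a character vector of column names,
# so the gear column cannot shadow it.
gear <- c("mpg", "cyl")
safe_multi <- dplyr::select(mtcars, dplyr::all_of(gear))
names(safe_multi)
```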
The function literally means: select “all of” the columns named in the
character vector. By using all_of, you guard yourself against any data
masking.
.data vs. all_of
Near the beginning of the post, we saw the following note when using a variable
with the dplyr::select function:
Based on this, it seems that using the function all_of is the recommended way
to select columns when using a variable even when it’s only a single column.
In other words,
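As an illustration of the single-column case, assuming the variable holds "mpg":

```r
# all_of() works just as well with a length-one character vector.
col_to_select <- "mpg"
single_all_of <- dplyr::select(mtcars, dplyr::all_of(col_to_select))
names(single_all_of)
```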
This is as valid as using the .data pronoun. My personal preference is to
stick to the .data pronoun when I only want to select a single column
and to use all_of when my character vector has multiple values in it. This
simply makes it more obvious to me what I should expect from the
dplyr::select call.
Conclusions
In conclusion, my recommendations for safely selecting data frame columns in the
tidyverse are as follows:
Be explicit in both single-column and multiple-column selections by double
quoting the column names.
If you want to use a variable to store a single column name, then combine it
with the .data pronoun.
If you want to use a variable to store a character vector of multiple column
names, then use the all_of function.