Safely Selecting Data Frame Columns in Your Tidyverse Code
In my previous post “Use of the .data and .env Pronouns to Disambiguate Your Tidyverse Code”, I discussed how the .data and .env pronouns should be used to write production-grade R code. The post was inspired by Lionel Henry’s talk titled “Interactivity and Programming in the tidyverse”. In that same talk, the all_of function was introduced as a way to safely select columns from data frames. In this post, I will discuss how to use this function and the .data pronoun to safely select columns in production-grade R code.
Setup
First, I will set up my environment for the rest of the post:
# Set mtcars to tibble to control the maximum number of printed rows. This is
# just to make the post easier to follow
library("magrittr")
library("tibble")
options(tibble.print_max = 6, tibble.print_min = 6)
mtcars <- as_tibble(mtcars)
Safely Selecting a Single Column
As I mentioned in my last post, “data masking” enables you to “blend data with the workspace”. This lets you select a single column from a data frame like this:
dplyr::select(mtcars, mpg)
## # A tibble: 32 x 1
## mpg
## <dbl>
## 1 21
## 2 21
## 3 22.8
## 4 21.4
## 5 18.7
## 6 18.1
## # … with 26 more rows
Here the dplyr::select function knows you are trying to select the “mpg” column from the mtcars data frame. You can also store the name of the column you want to select in a variable:
col_to_select <- "mpg"
dplyr::select(mtcars, col_to_select)
## Note: Using an external vector in selections is ambiguous.
## ℹ Use `all_of(col_to_select)` instead of `col_to_select` to silence this message.
## ℹ See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This message is displayed once per session.
## # A tibble: 32 x 1
## mpg
## <dbl>
## 1 21
## 2 21
## 3 22.8
## 4 21.4
## 5 18.7
## 6 18.1
## # … with 26 more rows
N.B.: We’ll ignore the note regarding the usage of all_of for now and come back to it at the end of the post.
dplyr::select figures out that you are selecting the column whose name is stored in the col_to_select variable, and not a phantom column called “col_to_select” (since no such column exists). Now what happens when you do something like this?
gear <- "mpg"
dplyr::select(mtcars, gear)
## # A tibble: 32 x 1
## gear
## <dbl>
## 1 4
## 2 4
## 3 4
## 4 3
## 5 3
## 6 3
## # … with 26 more rows
Due to data masking, the “gear” column name takes precedence over the value stored in the gear variable. There are two ways to deal with this. If you want the column itself, you can refer to it explicitly by wrapping its name in double quotes:
gear <- "mpg"
# Ignores the `gear` variable.
dplyr::select(mtcars, "gear")
## # A tibble: 32 x 1
## gear
## <dbl>
## 1 4
## 2 4
## 3 4
## 4 3
## 5 3
## 6 3
## # … with 26 more rows
If you want to use a variable to store the column name, then the safest thing to do is to combine it with the .data pronoun:
gear <- "mpg"
dplyr::select(mtcars, .data[[gear]])
## # A tibble: 32 x 1
## mpg
## <dbl>
## 1 21
## 2 21
## 3 22.8
## 4 21.4
## 5 18.7
## 6 18.1
## # … with 26 more rows
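The same collision can happen inside a function, where an argument name may match a column name by accident. Below is a minimal sketch of that situation, assuming a hypothetical helper called select_one (not part of dplyr) whose argument is deliberately named gear. It relies on the behaviour of the package versions listed at the end of this post; newer tidyselect releases may emit a deprecation warning for .data inside selections and point to all_of instead.

```r
# Hypothetical helper for illustration only. The argument is deliberately
# named `gear`, which is also a column of mtcars, to reproduce the collision.
select_one <- function(df, gear) {
  # Guard: this helper is meant for exactly one column name.
  stopifnot(is.character(gear), length(gear) == 1)
  # .data[[gear]] looks up the column named by the *value* of `gear`.
  dplyr::select(df, .data[[gear]])
}

selected <- select_one(mtcars, "mpg")
names(selected)
```

Because the column is looked up through the string stored in the argument, the name of the argument no longer matters.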
Safely Selecting Multiple Columns
If you want to safely select multiple columns, then you should start by double quoting the column names to make the selection explicit:
dplyr::select(mtcars, "mpg", "carb")
## # A tibble: 32 x 2
## mpg carb
## <dbl> <dbl>
## 1 21 4
## 2 21 4
## 3 22.8 1
## 4 21.4 1
## 5 18.7 2
## 6 18.1 1
## # … with 26 more rows
If you want to use a variable to select multiple columns, you should create a character vector storing the different column names:
cols_to_select <- c("mpg", "carb")
dplyr::select(mtcars, cols_to_select)
## # A tibble: 32 x 2
## mpg carb
## <dbl> <dbl>
## 1 21 4
## 2 21 4
## 3 22.8 1
## 4 21.4 1
## 5 18.7 2
## 6 18.1 1
## # … with 26 more rows
Here the dplyr::select function is smart enough to figure out that you are asking for the multiple columns stored in the cols_to_select variable. However, what happens if you have something like this?
gear <- c("mpg", "carb")
dplyr::select(mtcars, gear)
## # A tibble: 32 x 1
## gear
## <dbl>
## 1 4
## 2 4
## 3 4
## 4 3
## 5 3
## 6 3
## # … with 26 more rows
As in the single-column scenario, data masking results in the “gear” column name taking precedence over the values in the gear variable, which produces an unexpected result. This is where the all_of function comes in:
gear <- c("mpg", "carb")
dplyr::select(mtcars, all_of(gear))
## # A tibble: 32 x 2
## mpg carb
## <dbl> <dbl>
## 1 21 4
## 2 21 4
## 3 22.8 1
## 4 21.4 1
## 5 18.7 2
## 6 18.1 1
## # … with 26 more rows
The function literally means “select all of the columns named in this character vector”. By using all_of, you guard yourself against data masking: the variable is always treated as a vector of column names, never as a column.
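One consequence of the “all of” semantics is worth seeing in code: all_of is strict, so a name that does not exist in the data frame raises an error rather than being dropped. The sketch below contrasts it with any_of, a companion helper from the same tidyselect package that skips missing names. The column name "not_a_column" is made up for the illustration, and the helpers are called with an explicit tidyselect:: prefix on the assumption that neither dplyr nor tidyselect is attached.

```r
cols <- c("mpg", "not_a_column")

# all_of() is strict: a column that does not exist is an error.
strict <- tryCatch(
  dplyr::select(mtcars, tidyselect::all_of(cols)),
  error = function(e) "error"
)

# any_of() is lenient: names that do not exist are skipped.
lenient <- dplyr::select(mtcars, tidyselect::any_of(cols))

strict
names(lenient)
```

For production code the strict behaviour is usually what you want, since a misspelled column name fails loudly instead of silently producing a narrower result.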
.data vs. all_of
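Near the beginning of the post, we saw a note recommending all_of when using a variable with the dplyr::select function. The note is displayed only once per session, so it does not appear again when the same code is rerun: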
col_to_select <- "mpg"
dplyr::select(mtcars, col_to_select)
## # A tibble: 32 x 1
## mpg
## <dbl>
## 1 21
## 2 21
## 3 22.8
## 4 21.4
## 5 18.7
## 6 18.1
## # … with 26 more rows
Based on this, it seems that using the all_of function is the recommended way to select columns when using a variable, even when it’s only a single column. In other words,
col_to_select <- "mpg"
dplyr::select(mtcars, all_of(col_to_select))
## # A tibble: 32 x 1
## mpg
## <dbl>
## 1 21
## 2 21
## 3 22.8
## 4 21.4
## 5 18.7
## 6 18.1
## # … with 26 more rows
This is as valid as using the .data pronoun. My personal preference is to stick to the .data pronoun when I only want to select a single column and to use all_of when my character vector has multiple values in it. This simply makes it more obvious to me what I should expect from the dplyr::select call.
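If you would rather not maintain two code paths, all_of treats a length-one character vector exactly like a longer one, so a single wrapper can cover both cases. This is a minimal sketch under that assumption; select_cols is a hypothetical name, not a dplyr function.

```r
# Hypothetical wrapper for illustration only: accepts one or more column
# names as a character vector and selects exactly those columns.
select_cols <- function(df, cols) {
  stopifnot(is.character(cols), length(cols) >= 1)
  dplyr::select(df, tidyselect::all_of(cols))
}

one  <- select_cols(mtcars, "mpg")
many <- select_cols(mtcars, c("mpg", "carb"))

names(one)
names(many)
```

The trade-off is the one described above: the call site no longer signals whether one column or several are expected.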
Conclusions
In conclusion, my recommendations for safely selecting data frame columns in the tidyverse are as follows:
- Be explicit in single and multiple column selections by double quoting the column names.
- If you want to use a variable to store a single column name, then combine it with the .data pronoun.
- If you want to use a variable to store a character vector of multiple column names, then use the all_of function.
devtools::session_info()
## ─ Session info ───────────────────────────────────────────────────────────────
## setting value
## version R version 3.6.2 (2019-12-12)
## os macOS Sierra 10.12.6
## system x86_64, darwin16.7.0
## ui unknown
## language (EN)
## collate en_GB.UTF-8
## ctype en_GB.UTF-8
## tz Europe/London
## date 2020-05-08
##
## ─ Packages ───────────────────────────────────────────────────────────────────
## package * version date lib source
## argparse * 2.0.1 2019-03-08 [1] CRAN (R 3.6.2)
## assertthat 0.2.1 2019-03-21 [1] CRAN (R 3.6.2)
## backports 1.1.6 2020-04-05 [1] CRAN (R 3.6.2)
## callr 3.4.3 2020-03-28 [1] CRAN (R 3.6.2)
## cli 2.0.2 2020-02-28 [1] CRAN (R 3.6.2)
## crayon 1.3.4 2017-09-16 [1] CRAN (R 3.6.2)
## desc 1.2.0 2018-05-01 [1] CRAN (R 3.6.2)
## devtools 2.3.0 2020-04-10 [1] CRAN (R 3.6.2)
## digest 0.6.25 2020-02-23 [1] CRAN (R 3.6.2)
## dplyr 0.8.5 2020-03-07 [1] CRAN (R 3.6.2)
## ellipsis 0.3.0 2019-09-20 [1] CRAN (R 3.6.2)
## evaluate 0.14 2019-05-28 [1] CRAN (R 3.6.2)
## fansi 0.4.1 2020-01-08 [1] CRAN (R 3.6.2)
## findpython 1.0.5 2019-03-08 [1] CRAN (R 3.6.2)
## fs 1.4.1 2020-04-04 [1] CRAN (R 3.6.2)
## glue * 1.4.0 2020-04-03 [1] CRAN (R 3.6.2)
## jsonlite 1.6.1 2020-02-02 [1] CRAN (R 3.6.2)
## knitr * 1.28 2020-02-06 [1] CRAN (R 3.6.2)
## lifecycle 0.2.0 2020-03-06 [1] CRAN (R 3.6.2)
## magrittr * 1.5 2014-11-22 [1] CRAN (R 3.6.2)
## memoise 1.1.0 2017-04-21 [1] CRAN (R 3.6.2)
## pillar 1.4.3 2019-12-20 [1] CRAN (R 3.6.2)
## pkgbuild 1.0.6 2019-10-09 [1] CRAN (R 3.6.2)
## pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 3.6.2)
## pkgload 1.0.2 2018-10-29 [1] CRAN (R 3.6.2)
## prettyunits 1.1.1 2020-01-24 [1] CRAN (R 3.6.2)
## processx 3.4.2 2020-02-09 [1] CRAN (R 3.6.2)
## ps 1.3.2 2020-02-13 [1] CRAN (R 3.6.2)
## purrr 0.3.4 2020-04-17 [1] CRAN (R 3.6.2)
## R6 2.4.1 2019-11-12 [1] CRAN (R 3.6.2)
## Rcpp 1.0.4.6 2020-04-09 [1] CRAN (R 3.6.2)
## remotes 2.1.1 2020-02-15 [1] CRAN (R 3.6.2)
## rlang 0.4.5 2020-03-01 [1] CRAN (R 3.6.2)
## rprojroot 1.3-2 2018-01-03 [1] CRAN (R 3.6.2)
## sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 3.6.2)
## stringi 1.4.6 2020-02-17 [1] CRAN (R 3.6.2)
## stringr 1.4.0 2019-02-10 [1] CRAN (R 3.6.2)
## testthat 2.3.2 2020-03-02 [1] CRAN (R 3.6.2)
## tibble * 3.0.1 2020-04-20 [1] CRAN (R 3.6.2)
## tidyselect 1.0.0 2020-01-27 [1] CRAN (R 3.6.2)
## usethis 1.6.0 2020-04-09 [1] CRAN (R 3.6.2)
## utf8 1.1.4 2018-05-24 [1] CRAN (R 3.6.2)
## vctrs 0.2.4 2020-03-10 [1] CRAN (R 3.6.2)
## withr 2.1.2 2018-03-15 [1] CRAN (R 3.6.2)
## xfun 0.12 2020-01-13 [1] CRAN (R 3.6.2)
##
## [1] /usr/local/Cellar/r/3.6.2/lib/R/library