As someone who has been writing tidyverse code for a few years, I’ve always found it difficult to bring the concept of tidyeval into a production-level environment. The recent rstudio::conf 2020 talk by Lionel Henry titled “Interactivity and Programming in the tidyverse” shed some much-needed light on how to write safe production-level tidyverse code.
I wanted to use this blog post to highlight Lionel’s tip on using the .data and .env pronouns to disambiguate your tidyverse code. I’ve already been using the .data pronoun for quite some time, but didn’t realize that it wasn’t enough by itself. However, if you use it in combination with the .env pronoun, you can make sure that you don’t produce unanticipated results in your R code!
Lionel Henry starts the talk by discussing the idea of “data masking”, which allows you to “blend data with the workspace”. In essence, it enables you to write something like dplyr::filter(mtcars, cyl == carb). Here the dplyr::filter function knows that you are filtering for rows where the “cyl” column matches the “carb” column. Data masking allows us to refer to the columns “cyl” and “carb” in the mtcars dataframe by their bare names, without having to spell out the dataframe for each column as in mtcars$cyl == mtcars$carb.
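As a minimal sketch of this comparison (assuming the dplyr package is installed and using R's built-in mtcars dataset; the variable names masked and explicit are only illustrative), the data-masked filter and an explicit base R subset might look like this:

```r
library(dplyr)

# Data-masked form: the bare names cyl and carb are looked up in mtcars.
masked <- mtcars %>% dplyr::filter(cyl == carb)

# Explicit form: base R subsetting that names the dataframe for each column.
explicit <- mtcars[mtcars$cyl == mtcars$carb, ]

# Both forms should select the same number of rows.
nrow(masked) == nrow(explicit)
```

The explicit form is more verbose, which is exactly the typing that data masking saves during interactive work.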
Similarly, if you define a variable called num_cyl in your workspace and filter mtcars on cyl == num_cyl, the dplyr::filter function knows that you are filtering for rows where the “cyl” column matches the value stored in the num_cyl variable, because there is no column called “num_cyl”. This is great because it simplifies your code when you are exploring the data.
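A small sketch of this fall-through to the workspace, assuming dplyr is installed; the value 6 assigned to num_cyl is an assumption chosen for illustration:

```r
library(dplyr)

# Workspace variable (value assumed for illustration).
# mtcars has no column called "num_cyl".
num_cyl <- 6

# "cyl" is found in the data; "num_cyl" is not, so it is
# looked up in the workspace instead.
six_cyl <- mtcars %>% dplyr::filter(cyl == num_cyl)

# Every remaining row should have cyl equal to num_cyl.
unique(six_cyl$cyl)
```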
Unexpected Data Masking
Now what do you expect to happen if you set carb <- 6 in your workspace and then filter mtcars on cyl == carb?
You might have expected it to return all rows where the cyl column had a value of 6, as this was the number the carb variable was set to. However, in this case the mtcars dataframe has a column called “carb”, and this took precedence over the value in the variable. To me, this produced unanticipated results due to unexpected data masking.
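The surprise can be reproduced with a short sketch, assuming dplyr is installed (the names masked and intended are illustrative only):

```r
library(dplyr)

# Workspace variable that shares its name with a column in mtcars.
carb <- 6

# The "carb" column masks the workspace variable, so this
# compares the cyl column against the carb column.
masked <- mtcars %>% dplyr::filter(cyl == carb)

# What was probably intended: rows where cyl equals the value 6.
intended <- mtcars[mtcars$cyl == 6, ]

nrow(masked)    # number of rows where the cyl and carb columns agree
nrow(intended)  # number of rows where cyl is 6
```

The two row counts differ, and nothing in the output warns you that the column was used instead of the variable.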
This is generally not a big issue when you are using R in interactive mode. You would know which variables are in your workspace and what the column names are in your dataframe. Additionally, if you ran into a column name that matched a variable name, you could just rename the column or the variable and move on.
However, when writing production-level R code, you might not have this luxury. You really want to be unambiguous about which values the R code should be using. So what’s the solution?
The .data and .env Pronouns to the Rescue
This is where the .data and .env pronouns come into play. The pronouns refer to data in your dataframe and workspace respectively. In this case, filtering mtcars on .data$cyl == .env$carb would produce the intended results. The .data$cyl indicates that we want to filter on the “cyl” column in the mtcars dataframe. The .env$carb indicates that we should be filtering on the value stored in the workspace variable carb and NOT the “carb” column in mtcars.
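A minimal sketch of the disambiguated filter, assuming a reasonably recent dplyr (and rlang) in which both pronouns are available inside data-masking verbs:

```r
library(dplyr)

# Workspace variable that shares its name with a column in mtcars.
carb <- 6

# .data$cyl refers to the column; .env$carb refers to the
# workspace variable, so the "carb" column is not consulted.
result <- mtcars %>% dplyr::filter(.data$cyl == .env$carb)

# Every remaining row should have cyl equal to 6.
unique(result$cyl)
```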
If we wanted to filter for rows where the “cyl” and “carb” column values matched, then we would filter on .data$cyl == .data$carb instead.
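For the column-to-column comparison, a sketch under the same assumptions (dplyr installed, built-in mtcars data) could look like this:

```r
library(dplyr)

# Still present in the workspace, but not used by the filter below.
carb <- 6

# Both sides use .data, so both refer to columns of mtcars.
matching <- mtcars %>% dplyr::filter(.data$cyl == .data$carb)

# Every remaining row should have equal cyl and carb values.
all(matching$cyl == matching$carb)
```

Writing both pronouns out makes the intent visible to a reader who does not know what happens to be defined in the workspace.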
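This principle should also be applied to your functions. The .env pronoun also takes into account lexical scoping. Suppose cyl_val is set to 6 in the global environment, and a function sets its own cyl_val to 8 before filtering on .data$cyl == .env$cyl_val. The function filters by the value 8 and not 6, as it determines the value from the cyl_val variable in the function environment and not the global environment.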
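The scoping behaviour can be sketched as follows, assuming dplyr is installed; the function name filter_by_cyl is hypothetical and only serves the illustration:

```r
library(dplyr)

# Value in the global environment.
cyl_val <- 6

# Hypothetical helper: it defines its own cyl_val, which is the
# one .env$cyl_val resolves to inside the function body.
filter_by_cyl <- function(df) {
  cyl_val <- 8
  df %>% dplyr::filter(.data$cyl == .env$cyl_val)
}

result <- filter_by_cyl(mtcars)

# Every remaining row should have cyl equal to 8, not 6.
unique(result$cyl)
```

In real code the value would more likely arrive as a function argument, which .env resolves in the same way.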
In summary, make sure to use the .data and .env pronouns in your R code!
I have gotten into the habit of doing this regardless of whether I am writing
production-level code or not. As far as I can tell, there is no harm in being
explicit in my R code aside from the extra few characters you have to type. If
you don’t do it, you might get unexpected data masking that could produce
results you might not know are wrong. In a production-level environment, you
can’t afford this type of mistake.
I would also highly recommend watching the full presentation by Lionel Henry to get more tips on safely using tidyverse code in a production-level environment.