As someone who has been writing tidyverse code for a few years, I’ve always found it difficult to bring tidyeval into a production-level environment. The recent rstudio::conf 2020 talk by Lionel Henry, titled “Interactivity and Programming in the tidyverse”, shed some much-needed light on how to write safe production-level tidyverse code.

I wanted to use this blog post to highlight Lionel’s tip on using the .data and .env pronouns to disambiguate your tidyverse code. I’ve been using the .data pronoun for quite some time, but didn’t realize that it wasn’t enough by itself. Used in combination with the .env pronoun, it makes explicit whether a name refers to a column or to a variable in your workspace, which guards against unanticipated results in your R code!

Lionel Henry starts the talk by discussing the idea of “data masking”, which allows you to “blend data with the workspace”. In essence, it enables you to do something like:
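
A minimal sketch of that idea, assuming dplyr is attached and using the built-in mtcars dataset (the result name matching is only there for illustration):

```r
library(dplyr)

# cyl and carb are looked up as columns of mtcars (data masking),
# so no mtcars$ prefix is needed
matching <- mtcars %>%
  filter(cyl == carb)

matching
```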

Here the dplyr::filter function knows that you are filtering for rows where the “cyl” column matches the “carb” column. Data masking allows us to refer to the columns “cyl” and “carb” in the mtcars dataframe by their bare names, without having to prefix each one with the dataframe like this:
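
In base R, the explicit version would look roughly like this sketch (again using the built-in mtcars dataset; the result name is illustrative):

```r
# Every column reference spells out the dataframe it comes from
matching_base <- mtcars[mtcars$cyl == mtcars$carb, ]

matching_base
```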

Similarly, if you do this:
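
A sketch of this case, assuming a workspace variable named num_cyl; the value 6 is my choice for illustration:

```r
library(dplyr)

# num_cyl lives in the workspace, not in mtcars
num_cyl <- 6

# cyl is found in the data; num_cyl is not, so dplyr falls back
# to the variable in the calling environment
six_cyl <- mtcars %>%
  filter(cyl == num_cyl)

six_cyl
```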

The dplyr::filter function knows that you are filtering for rows where the “cyl” column matches the value stored in the num_cyl variable. Names are looked up in the dataframe first, and because mtcars has no column called “num_cyl”, the lookup falls back to the workspace. This is great because it simplifies your code when you are exploring the data.

Now what do you expect to happen with this piece of code?
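
A sketch of the code in question, assuming a workspace variable named carb that is set to 6 (the result name is illustrative):

```r
library(dplyr)

# A workspace variable that happens to share its name with a column
carb <- 6

ambiguous <- mtcars %>%
  filter(cyl == carb)

ambiguous
```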

You might have expected it to return all rows where the cyl column had a value of 6, as this was the number the carb variable was set to. However, in this case the mtcars dataframe has a column called “carb”, and the column takes precedence over the variable of the same name. To me, this produced unanticipated results due to unexpected data masking.

This is generally not a big issue when you are using R in interactive mode. You would know which variables are in your workspace and what the column names are in your dataframe. Additionally, if you ran into a column name that matched a variable name, you could just rename one of them and move on.

However, when writing production-level R code, you might not have this luxury. You really want to be unambiguous about which values the R code should be using. So what’s the solution?

# The .data and .env Pronouns to the Rescue

This is where the .data and .env pronouns come into play. The pronouns refer to data in your dataframe and workspace respectively. In this case:
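
Based on the pronoun syntax described in the next paragraph, a sketch of the disambiguated filter (assuming the workspace variable carb is set to 6, as before):

```r
library(dplyr)

# Workspace variable that shares its name with a column of mtcars
carb <- 6

# .data[["cyl"]] is the column; .env[["carb"]] is the workspace variable
explicit <- mtcars %>%
  filter(.data[["cyl"]] == .env[["carb"]])

explicit
```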

would produce the intended results. The .data[["cyl"]] tells dplyr::filter to filter on the “cyl” column in the mtcars dataframe. The .env[["carb"]] indicates that we should be filtering on the value stored in the workspace variable carb and NOT the “carb” column in mtcars.

If we wanted to filter for rows where the “cyl” and “carb” column values matched, then we would do this:
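
A sketch of the column-to-column comparison, with both sides using the .data pronoun (the result name is illustrative):

```r
library(dplyr)

# Both sides are explicitly columns of mtcars, so a workspace
# variable named carb cannot interfere
carb <- 6

both_columns <- mtcars %>%
  filter(.data[["cyl"]] == .data[["carb"]])

both_columns
```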

This principle should also be applied to your functions. The .env pronoun also takes into account lexical scoping:
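
A sketch of that behaviour, using the values 6 and 8 mentioned in the next paragraph; the function name filter_by_cyl is hypothetical:

```r
library(dplyr)

# Global variable with the same name as the one inside the function
cyl_val <- 6

# Hypothetical helper, for illustration only
filter_by_cyl <- function(df) {
  # Local variable: this is the one .env finds first
  cyl_val <- 8
  df %>%
    filter(.data[["cyl"]] == .env[["cyl_val"]])
}

eight_cyl <- filter_by_cyl(mtcars)
eight_cyl
```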

Here it filters by the value 8 and not 6, because .env resolves cyl_val in the function environment first rather than in the global environment.

# Conclusions

In summary, make sure to use the .data and .env pronouns in your R code! I have gotten into the habit of doing this regardless of whether I am writing production-level code or not. As far as I can tell, there is no harm in being explicit aside from the few extra characters you have to type. If you don’t do it, you might get unexpected data masking that produces results you might not know are wrong. In a production-level environment, you can’t afford this type of mistake.

I would also highly recommend watching the full presentation by Lionel Henry to get more tips on safely using tidyverse code in a production-level environment.