Creating Local Variables Within Pipe Chains
A few months ago, I discovered that it is possible to create a local variable within a tidyverse (magrittr) pipe chain by wrapping a step in curly braces. For example: data %>% {new_var <- mean(.$x); filter(., x > new_var)}
To me, this was an irrationally exciting discovery! Why? If you’re like me and you much prefer a single pipe chain over several separate function calls and intermediate objects stored in your environment, the ability to create local variables on the fly within a pipe chain makes for some remarkably concise (and, at least to me, pleasing) code.
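Here is a minimal sketch of the idea, using the built-in iris data so that it runs on its own. The names `avg_length` and `result` are illustrative only. Inside the braces the piped value is available as `.`, and anything assigned with `<-` should exist only while that block runs.

```r
library(magrittr)

result <- iris %>%
  {
    # local variable: created inside the braces, not in the calling environment
    avg_length <- mean(.$Sepal.Length)
    # the last expression is the value the block passes down the pipe
    .[.$Sepal.Length > avg_length, ]
  }

nrow(result)           # rows with above-average sepal length
exists("avg_length")   # checks whether the local variable leaked out
```

Note that when the right-hand side is a braced block, the pipe does not insert the data as a first argument for you, so every call inside the braces needs an explicit `.`.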
To provide an example, let’s say that we want to create a table of means for some variables, and we want each column header to also report a second statistic about that column.
Let’s use the trusty iris data set.
Loading data
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.8 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.4.0
## ✔ readr 2.1.2 ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
data <- iris %>%
  janitor::clean_names()
head(data)
##   sepal_length sepal_width petal_length petal_width species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
If you’re unfamiliar with the iris data, there are three species of iris included, with four attribute variables describing the plants. Which attribute varies the most across the different species? One reasonable way to answer this question would be to 1) find the mean of each attribute within each species, and then 2) calculate, for each attribute, the standard deviation of those species means. Let’s make a table containing all of this information.
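To make the two steps concrete, here is a small base R sketch for a single attribute. It uses the original `iris` column names rather than the cleaned ones, and `species_means` is just an illustrative name.

```r
# Step 1: mean petal length within each species
species_means <- tapply(iris$Petal.Length, iris$Species, mean)
species_means

# Step 2: standard deviation of those species means;
# a larger value means the species differ more on this attribute
round(sd(species_means), 2)
```

Repeating this for each of the four attributes gives one standard deviation per column, which is the number we want to show alongside the means.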
Creating table
First, I need to calculate the mean of each variable within each species.
data %>%
  group_by(species) %>%
  summarize(
    across(everything(), ~ mean(.x))
  )
## # A tibble: 3 × 5
##   species    sepal_length sepal_width petal_length petal_width
##   <fct>             <dbl>       <dbl>        <dbl>       <dbl>
## 1 setosa             5.01        3.43         1.46       0.246
## 2 versicolor         5.94        2.77         4.26       1.33
## 3 virginica          6.59        2.97         5.55       2.03
Easy enough. Now, the tricky part: How can we include information about the standard deviations of these columns? A first thought might be to calculate these values and append them as a separate row. This would work, but it would be awkward to format cleanly. An alternative approach is to include the standard deviations within the column labels themselves. Normally this would involve a few clunky steps, but with the ability to create local variables within a pipe, it turns out to be a pretty easy and concise task!
data %>%
  group_by(species) %>%
  summarize(
    across(everything(), ~ mean(.x))
  ) %>%
  {
    # SD of the species means for each attribute, rounded and stored in a local variable, "sds"
    sds <- summarize(., across(-1, ~ sd(.x))) %>%
      mutate(across(everything(), ~ round(.x, 2)))
    # rename every column except species to include its SD;
    # unlist() turns the one-row tibble into a plain vector for paste0()
    rename_with(., .cols = -1, ~ paste0(.x, " (SD = ", unlist(sds), ")"))
  } %>%
  gt::gt()
species | sepal_length (SD = 0.8) | sepal_width (SD = 0.34) | petal_length (SD = 2.09) | petal_width (SD = 0.9) |
---|---|---|---|---|
setosa | 5.006 | 3.428 | 1.462 | 0.246 |
versicolor | 5.936 | 2.770 | 4.260 | 1.326 |
virginica | 6.588 | 2.974 | 5.552 | 2.026 |
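The rename step above pairs labels and SDs by position, which works because both come from the same columns in the same order. One possible variation, sketched here as an alternative rather than the original approach, keeps the SDs as a named vector and looks each one up by column name. It uses the raw `iris` column names so that it runs without janitor, and `labelled` is an illustrative name.

```r
library(dplyr)

labelled <- iris %>%
  group_by(Species) %>%
  summarize(across(everything(), mean)) %>%
  {
    # named numeric vector: SD of the species means for each attribute
    sds <- round(sapply(.[-1], sd), 2)
    # .x holds the selected column names, so sds[.x] matches each SD by name
    rename_with(., ~ paste0(.x, " (SD = ", sds[.x], ")"), .cols = -1)
  }

names(labelled)
```

Matching by name means the labels stay correct even if the columns are later reordered between the two steps.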
Pretty cool, huh? But what if you wanted to keep the sds object for something later on? Easy: assign it to a global variable with <<-!
data %>%
  group_by(species) %>%
  summarize(
    across(everything(), ~ mean(.x))
  ) %>%
  {
    # SD of the species means for each attribute, rounded and stored in a local variable, "sds"
    sds <- summarize(., across(-1, ~ sd(.x))) %>%
      mutate(across(everything(), ~ round(.x, 2)))
    # copy the local variable to a global one, "SDs", which persists in the environment
    SDs <<- sds
    # rename every column except species to include its SD;
    # unlist() turns the one-row tibble into a plain vector for paste0()
    rename_with(., .cols = -1, ~ paste0(.x, " (SD = ", unlist(sds), ")"))
  } %>%
  gt::gt()
species | sepal_length (SD = 0.8) | sepal_width (SD = 0.34) | petal_length (SD = 2.09) | petal_width (SD = 0.9) |
---|---|---|---|---|
setosa | 5.006 | 3.428 | 1.462 | 0.246 |
versicolor | 5.936 | 2.770 | 4.260 | 1.326 |
virginica | 6.588 | 2.974 | 5.552 | 2.026 |
And here is the stored SDs object:
SDs
## # A tibble: 1 × 4
##   sepal_length sepal_width petal_length petal_width
##          <dbl>       <dbl>        <dbl>       <dbl>
## 1          0.8        0.34         2.09         0.9
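One caveat with `<<-`: it walks up the enclosing environments and assigns to the first existing variable of that name, creating one in the global environment only if none is found. If the pipeline lives inside a function, that may not be what you want. A side-effect-free alternative, sketched here on the assumption that you are happy to end the chain before the `gt::gt()` step, is to return both pieces from the braces as a list. The names `out`, `table`, and `sds` are illustrative.

```r
library(dplyr)

out <- iris %>%
  group_by(Species) %>%
  summarize(across(everything(), mean)) %>%
  {
    # local variable: SD of the species means for each attribute
    sds <- round(sapply(.[-1], sd), 2)
    # hand back both the relabelled table and the SDs
    list(
      table = rename_with(., ~ paste0(.x, " (SD = ", sds[.x], ")"), .cols = -1),
      sds   = sds
    )
  }

out$sds     # the stored SDs, with no global assignment
out$table   # the relabelled table, ready to pass to gt::gt() if gt is installed
```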
There are plenty more useful applications beyond what I’ve shown here, but hopefully this example provides a glimpse into the potential power of using local variables within pipes.