Advanced Data Operations

tidyverse

recoding

Author

Josef Fruehwald

Published

September 18, 2024

Some very useful, occasionally used tidyverse functions

Some data tasks only come up every so often, but when they do, having just the right function on hand is key.

Finding the right function

Every once in a while, I’ll just scroll through the reference pages of the core tidyverse packages. They have pretty good naming conventions, so you can start to get an idea of what they can do from the name of the function. My goal is just to give my future self an “I think I saw something that can do this once” feeling when I’m dealing with a specific data task.

readr - data input/output
stringr - strings
dplyr - data summarizing, etc.
tidyr - data reshaping
forcats - categorical variable work
lubridate - working with dates
purrr - vectors and lists

library(tidyverse)
library(gt)

Important to remember

The two functions I think are important to just remember are:

stringr::str_detect()
dplyr::case_when()

Pattern searching

str_detect() will search a string, or a vector of strings, for a regular expression, and return TRUE or FALSE for each string.

fruits <- c("raspberry", "cherry", "apple", "strawberry")

str_detect(
  fruits,
  "berry"
)

[1]  TRUE FALSE FALSE  TRUE

For any given thing you might want to do with a string, check out the stringr documentation.
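Since the pattern is a regular expression, the usual anchors and character classes work too. Here is a minimal sketch reusing the same fruits vector; the variable names (ends_in_rry, starts_with_vowel, berries) are just made up for illustration.

```r
library(stringr)

fruits <- c("raspberry", "cherry", "apple", "strawberry")

# "$" anchors the pattern to the end of each string
ends_in_rry <- str_detect(fruits, "rry$")

# "^" anchors to the start; [aeiou] matches any one lowercase vowel
starts_with_vowel <- str_detect(fruits, "^[aeiou]")

# The logical vector can be used to subset the original vector
berries <- fruits[str_detect(fruits, "berry")]
berries
```

The same logical vector is what you would hand to filter() or case_when() when working inside a data frame.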

“Vectorized Switch”

dplyr::case_when() is useful for creating new data columns, or recoding existing columns.

tibble(
  month=month.name
) |>
  mutate(
    oysters = case_when(
      str_ends(month, "r") ~ "yes",
      .default = "no"
    )
  ) |>
  gt() |> 
    tab_header(
      title = "Should you eat oysters?"
    )

1: If the month name ends in "r", return "yes". Otherwise, "no"

Should you eat oysters?
month	oysters
January	no
February	no
March	no
April	no
May	no
June	no
July	no
August	no
September	yes
October	yes
November	yes
December	yes
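case_when() is not limited to a single condition plus a default. As a rough sketch of recoding with several branches, here is a hypothetical month-to-season coding; the season groupings are my own illustration, not part of the oyster example.

```r
library(dplyr)

seasons <- tibble(
  month = month.name
) |>
  mutate(
    season = case_when(
      # Each row takes the value from the first condition that is TRUE
      month %in% c("December", "January", "February") ~ "winter",
      month %in% c("March", "April", "May") ~ "spring",
      month %in% c("June", "July", "August") ~ "summer",
      # Everything left over (September through November)
      .default = "fall"
    )
  )

seasons
```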

Examples

With just these two additional functions and what we’ve done in the tidyverse already, we can do some pretty powerful analyses.

Peterson & Barney

library(babynames)
library(phonTools)
data("pb52")

I noticed, when making a plot of the Peterson & Barney data, that there was some stripeyness to it.

pb52 |> 
  ggplot(
    aes(
      x = f0,
      y = f1,
      color = f1/f0
    )
  )+
    geom_point()

I suspect this is because they were visually measuring the formants as printed out from a physical Spectrograph Machine. As such, it would have been easier, and more principled, to record a formant value directly on one of the harmonics of F0.

pb52 |>
  mutate(
    ratio = f1/f0,
    rounded = round(ratio),
    harmonic = case_when(
      between(
        ratio,
        rounded - 0.05,
        rounded + 0.05
      ) ~ rounded,
      .default = NA
    )
  ) ->
  harmonic_coding

1: Get the f1:f0 ratio, and the closest harmonic of the fundamental frequency.
2: case_when() will evaluate a sequence of logical statements. For each row, it returns the value to the right of ~ for the first statement that is true. If none of the statements are true for a row, the .default value is returned.
3: between() will return true if ratio is greater than or equal to rounded-0.05 and less than or equal to rounded+0.05.
4: The value to return for all points that don’t satisfy the between() condition.
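To see the harmonic coding on a scale you can check by hand, here is a small sketch that applies the same logic to four made-up f0/f1 pairs. These values are invented for illustration and are not rows from pb52.

```r
library(dplyr)

# Invented f0/f1 values in Hz, chosen so the ratios are easy to check by hand
toy <- tibble(
  f0 = c(100, 100, 200, 150),
  f1 = c(202, 240, 792, 750)
) |>
  mutate(
    ratio = f1 / f0,           # 2.02, 2.4, 3.96, 5
    rounded = round(ratio),    # 2, 2, 4, 5
    harmonic = case_when(
      # Keep the harmonic number only when the ratio is within 0.05 of it
      between(ratio, rounded - 0.05, rounded + 0.05) ~ rounded,
      .default = NA
    )
  )

toy
```

The second row has a ratio of 2.4, which is too far from 2 to count as sitting on a harmonic, so it gets NA.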

harmonic_coding |> 
  ggplot(
    aes(
      f0,
      f1, 
      color = factor(harmonic-1)
    )
  )+
    geom_point()+
    labs(
      color = "harmonic"
    )

The rise of -aden

If you look at trends in babynames, one feature really pops out around the turn of the 21st century.

library(babynames)
library(geomtextpath)

babynames |>
  mutate(
    last_letter = str_extract(
      name,
      "\\w$"
    ),
    last_letter = fct_lump(
      last_letter,
      n = 9,
      w = n
    )
  ) |>
  summarise(
    .by = c(year, sex, last_letter),
    total = sum(n)
  ) |>
  mutate(
    .by = c(year, sex),
    prop = total/sum(total)
  ) ->
  last_letter_df

last_letter_df |>
  ggplot(
    aes(year, prop)
  )+
   geom_textline(
     aes(
       label = last_letter,
       color = last_letter
     )
   )+
    facet_wrap(~sex)+
    guides(
      color = "none"
    )

1: This will pull out just the last letter from each name.
2: This will clump together every letter that isn’t in the top 9 most frequent.
3: This will get the total number of babies with the last letter by year by sex.
4: This will calculate the proportion of babies who have this last letter in their name, by year and sex.
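The first two steps can be tried out on a handful of names. This is only a sketch: the names and counts are invented, and it uses fct_lump_n(), the n-specific variant that the forcats documentation now recommends over the general fct_lump().

```r
library(stringr)
library(forcats)

# Invented names and counts, just for illustration
names_toy <- c("Aiden", "Mary", "Noah", "Jayden", "Lucy", "Eli")
counts_toy <- c(50, 20, 10, 30, 5, 8)

# "\\w$" matches the single word character at the end of each string
last_letter <- str_extract(names_toy, "\\w$")

# Keep the 2 letters with the largest weighted totals; lump the rest.
# Weighted totals here: n = 80, y = 25, h = 10, i = 8
lumped <- fct_lump_n(last_letter, n = 2, w = counts_toy)

lumped
```

Passing w matters here: without it, the lumping would count rows (names) rather than babies.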

Boy names ending in <n> have shot up dramatically in popularity. But is it just any name, or have the final syllables /ej.dɨn/ disproportionately contributed?

babynames |>
  mutate(
    aden = case_when(
      str_detect(
        name,
        "[aAe]i?y?[dt][aeiouy]n$"
      ) ~ "aden",
      str_detect(
        name,
        "[dt][aeiouy]n$"
      ) ~ "den",
      str_detect(
        name,
        "[aeiouy]n$"
      ) ~ "en",
      str_detect(
        name,
        "n$"
      ) ~ "n",
      .default = "other"
    )
  ) |>
  summarise(
    .by = c(year, sex, aden),
    total = sum(n)
  ) |>
  mutate(
    .by = c(year, sex),
    prop = total/sum(total)
  ) |>
  filter(
    aden != "other"
  ) |>
  ggplot(
    aes(
      year,
      prop,
      fill = aden
    )
  )+
    geom_area(
      position = "stack"
    )+
    scale_y_continuous(
      expand = expansion(mult = c(0, 0.05))
    )+
    scale_x_continuous(
      expand = expansion(0)
    )+  
    facet_wrap(~sex)+
    labs(
      title = "Baby names ending in _",
      fill = NULL
    )

1: case_when() checks the logical statements in order. For each row, the first one that returns TRUE determines the value, and the remaining statements are ignored for that row.
2: Trying to capture Payton and Aiden and Braden etc.
3: If not the /ej/ vowel quality, then names like Landon, Easton, Eden.
4: Any vowel+N ending.
5: Any remaining names ending in n: John, Quinn.
6: Get the total number of babies in each year, in each sex, in each name ending.
7: Get the proportion of babies in each year, in each sex, that have each name ending.
8: For the purpose of plotting, I don’t care about “other”.
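To check that the ordering does what the annotations say, here is a sketch that runs the same patterns over a few hand-picked names. The n_first column is a hypothetical reordering I added to show what goes wrong when the least specific pattern comes first.

```r
library(dplyr)
library(stringr)

toy_names <- tibble(
  name = c("Aiden", "Payton", "Landon", "Eden", "Megan", "John", "Quinn", "Mary")
) |>
  mutate(
    # Most specific pattern first, as in the babynames code
    aden = case_when(
      str_detect(name, "[aAe]i?y?[dt][aeiouy]n$") ~ "aden",
      str_detect(name, "[dt][aeiouy]n$") ~ "den",
      str_detect(name, "[aeiouy]n$") ~ "en",
      str_detect(name, "n$") ~ "n",
      .default = "other"
    ),
    # Hypothetical reordering: "n$" matches first, so "aden" is never reached
    n_first = case_when(
      str_detect(name, "n$") ~ "n",
      str_detect(name, "[aAe]i?y?[dt][aeiouy]n$") ~ "aden",
      .default = "other"
    )
  )

toy_names
```

Eden lands in "den" rather than "aden" because the character class [aAe] does not include an uppercase E.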

Reuse

CC-BY 4.0

Citation

BibTeX citation:

@online{fruehwald2024,
  author = {Fruehwald, Josef},
  title = {Advanced {Data} {Operations}},
  date = {2024-09-18},
  url = {https://lin611-2024.github.io/notes/meetings/2024-09-18_advanced-data.html},
  langid = {en}
}

For attribution, please cite this work as:

Fruehwald, Josef. 2024. “Advanced Data Operations.” September 18, 2024. https://lin611-2024.github.io/notes/meetings/2024-09-18_advanced-data.html.