5 Data Manipulation with dplyr

https://learn.datacamp.com/courses/data-manipulation-with-dplyr

Main functions and concepts covered in this BP chapter:

  1. Understanding your data
  2. Selecting columns
  3. Arranging observations
  4. The filter() and arrange() verbs
    • Filtering for conditions
    • Filtering and arranging
  5. The mutate() verb
    • Calculating the number of government employees
    • Calculating the percentage of women in a county
  6. Mutate(), filter(), and arrange()
  7. The count() verb
    • Counting by region
    • Counting citizens by state
    • Mutating and counting
  8. The group_by(), summarize(), and ungroup() verbs
    • Summarizing
    • Summarizing by state
    • Summarizing by state and region
  9. The slice_min and slice_max verbs
    • Selecting a county from each region
    • Finding the lowest-income state in each region
  10. Using summarize, slice_max, and count together
  11. Selecting
    • Selecting columns
    • Select helpers
  12. The rename verb
    • Renaming a column after count
    • Renaming a column as part of a select
  13. The relocate verb
    • Using relocate
  14. Choosing among the four verbs
  15. Filtering and arranging for one year
  16. Finding the most popular names each year
  17. Visualizing names with ggplot2
  18. Grouped mutates
    • Finding the year each name is most common
    • Adding the total and maximum for each name
    • Visualizing the normalized change in popularity
  19. Window functions
    • Using ratios to describe the frequency of a name
    • Biggest jumps in a name

Packages used in this chapter:

## Load all packages used in this chapter
library(tidyverse) #includes dplyr, ggplot2, and other common packages
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Datasets used in this chapter:

## Load datasets used in this chapter
counties <- read_rds("data/counties.rds")
babynames <- read_rds("data/babynames.rds")

Tip: don’t forget the pipe (%>%) between each step
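As a quick reminder of what the pipe does (a minimal sketch, not one of the course exercises): x %>% f(y) is just another way of writing f(x, y), so each verb receives the result of the previous step as its first argument.

# These two lines are equivalent; the pipe passes the left-hand side
# as the first argument of the next function
counties %>% select(state, county)
select(counties, state, county)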

5.1 Transforming Data with dplyr

5.1.1 Understanding your data

Take a look at the counties dataset using the glimpse() function.

Q: What is the first value in the income variable?

A: 51281

glimpse(counties)
## Rows: 3,138
## Columns: 40
## $ census_id          <chr> "1001", "1003", "1005", "1007", "1009", "1011", "10…
## $ state              <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alabam…
## $ county             <chr> "Autauga", "Baldwin", "Barbour", "Bibb", "Blount", …
## $ region             <chr> "South", "South", "South", "South", "South", "South…
## $ metro              <chr> "Metro", "Metro", "Nonmetro", "Metro", "Metro", "No…
## $ population         <dbl> 55221, 195121, 26932, 22604, 57710, 10678, 20354, 1…
## $ men                <dbl> 26745, 95314, 14497, 12073, 28512, 5660, 9502, 5627…
## $ women              <dbl> 28476, 99807, 12435, 10531, 29198, 5018, 10852, 603…
## $ hispanic           <dbl> 2.6, 4.5, 4.6, 2.2, 8.6, 4.4, 1.2, 3.5, 0.4, 1.5, 7…
## $ white              <dbl> 75.8, 83.1, 46.2, 74.5, 87.9, 22.2, 53.3, 73.0, 57.…
## $ black              <dbl> 18.5, 9.5, 46.7, 21.4, 1.5, 70.7, 43.8, 20.3, 40.3,…
## $ native             <dbl> 0.4, 0.6, 0.2, 0.4, 0.3, 1.2, 0.1, 0.2, 0.2, 0.6, 0…
## $ asian              <dbl> 1.0, 0.7, 0.4, 0.1, 0.1, 0.2, 0.4, 0.9, 0.8, 0.3, 0…
## $ pacific            <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0…
## $ citizens           <dbl> 40725, 147695, 20714, 17495, 42345, 8057, 15581, 88…
## $ income             <dbl> 51281, 50254, 32964, 38678, 45813, 31938, 32229, 41…
## $ income_err         <dbl> 2391, 1263, 2973, 3995, 3141, 5884, 1793, 925, 2949…
## $ income_per_cap     <dbl> 24974, 27317, 16824, 18431, 20532, 17580, 18390, 21…
## $ income_per_cap_err <dbl> 1080, 711, 798, 1618, 708, 2055, 714, 489, 1366, 15…
## $ poverty            <dbl> 12.9, 13.4, 26.7, 16.8, 16.7, 24.6, 25.4, 20.5, 21.…
## $ child_poverty      <dbl> 18.6, 19.2, 45.3, 27.9, 27.2, 38.4, 39.2, 31.6, 37.…
## $ professional       <dbl> 33.2, 33.1, 26.8, 21.5, 28.5, 18.8, 27.5, 27.3, 23.…
## $ service            <dbl> 17.0, 17.7, 16.1, 17.9, 14.1, 15.0, 16.6, 17.7, 14.…
## $ office             <dbl> 24.2, 27.1, 23.1, 17.8, 23.9, 19.7, 21.9, 24.2, 26.…
## $ construction       <dbl> 8.6, 10.8, 10.8, 19.0, 13.5, 20.1, 10.3, 10.5, 11.5…
## $ production         <dbl> 17.1, 11.2, 23.1, 23.7, 19.9, 26.4, 23.7, 20.4, 24.…
## $ drive              <dbl> 87.5, 84.7, 83.8, 83.2, 84.9, 74.9, 84.5, 85.3, 85.…
## $ carpool            <dbl> 8.8, 8.8, 10.9, 13.5, 11.2, 14.9, 12.4, 9.4, 11.9, …
## $ transit            <dbl> 0.1, 0.1, 0.4, 0.5, 0.4, 0.7, 0.0, 0.2, 0.2, 0.2, 0…
## $ walk               <dbl> 0.5, 1.0, 1.8, 0.6, 0.9, 5.0, 0.8, 1.2, 0.3, 0.6, 1…
## $ other_transp       <dbl> 1.3, 1.4, 1.5, 1.5, 0.4, 1.7, 0.6, 1.2, 0.4, 0.7, 1…
## $ work_at_home       <dbl> 1.8, 3.9, 1.6, 0.7, 2.3, 2.8, 1.7, 2.7, 2.1, 2.5, 1…
## $ mean_commute       <dbl> 26.5, 26.4, 24.1, 28.8, 34.9, 27.5, 24.6, 24.1, 25.…
## $ employed           <dbl> 23986, 85953, 8597, 8294, 22189, 3865, 7813, 47401,…
## $ private_work       <dbl> 73.6, 81.5, 71.8, 76.8, 82.0, 79.5, 77.4, 74.1, 85.…
## $ public_work        <dbl> 20.9, 12.3, 20.8, 16.1, 13.5, 15.1, 16.2, 20.8, 12.…
## $ self_employed      <dbl> 5.5, 5.8, 7.3, 6.7, 4.2, 5.4, 6.2, 5.0, 2.8, 7.9, 4…
## $ family_work        <dbl> 0.0, 0.4, 0.1, 0.4, 0.4, 0.0, 0.2, 0.1, 0.0, 0.5, 0…
## $ unemployment       <dbl> 7.6, 7.5, 17.6, 8.3, 7.7, 18.0, 10.9, 12.3, 8.9, 7.…
## $ land_area          <dbl> 594.44, 1589.78, 884.88, 622.58, 644.78, 622.81, 77…

5.1.2 Selecting columns

Select the following four columns from the counties variable:

state county population poverty

counties %>%
  # Select the columns
  select(state, county, population, poverty)
## # A tibble: 3,138 × 4
##    state   county   population poverty
##    <chr>   <chr>         <dbl>   <dbl>
##  1 Alabama Autauga       55221    12.9
##  2 Alabama Baldwin      195121    13.4
##  3 Alabama Barbour       26932    26.7
##  4 Alabama Bibb          22604    16.8
##  5 Alabama Blount        57710    16.7
##  6 Alabama Bullock       10678    24.6
##  7 Alabama Butler        20354    25.4
##  8 Alabama Calhoun      116648    20.5
##  9 Alabama Chambers      34079    21.6
## 10 Alabama Cherokee      26008    19.2
## # ℹ 3,128 more rows

5.1.3 Arranging observations

Here you see the counties_selected dataset with a few interesting variables selected. The variables private_work, public_work, and self_employed give the percentage of people in each county who work for private companies, for the government, or for themselves.

Add a verb to sort the observations in descending order of the public_work variable.

counties_selected <- counties %>%
  select(state, county, population, private_work, public_work, self_employed)

counties_selected %>%
  # Add a verb to sort in descending order of public_work
 arrange(desc(public_work))
## # A tibble: 3,138 × 6
##    state        county         population private_work public_work self_employed
##    <chr>        <chr>               <dbl>        <dbl>       <dbl>         <dbl>
##  1 Hawaii       Kalawao                85         25          64.1          10.9
##  2 Alaska       Yukon-Koyukuk…       5644         33.3        61.7           5.1
##  3 Wisconsin    Menominee            4451         36.8        59.1           3.7
##  4 North Dakota Sioux                4380         32.9        56.8          10.2
##  5 South Dakota Todd                 9942         34.4        55             9.8
##  6 Alaska       Lake and Peni…       1474         42.2        51.6           6.1
##  7 California   Lassen              32645         42.6        50.5           6.8
##  8 South Dakota Buffalo              2038         48.4        49.5           1.8
##  9 South Dakota Dewey                5579         34.9        49.2          14.7
## 10 Texas        Kenedy                565         51.9        48.1           0  
## # ℹ 3,128 more rows

5.1.4 The filter() and arrange() verbs

5.1.4.1 Filtering for conditions

You use the filter() verb to get only observations that match a particular condition, or match multiple conditions.

Find only the counties that have a population above one million (1000000).

counties_selected <- counties %>%
  select(state, county, population)

counties_selected %>%
  # Filter for counties with a population above 1000000
  filter(population > 1000000)
## # A tibble: 41 × 3
##    state      county         population
##    <chr>      <chr>               <dbl>
##  1 Arizona    Maricopa          4018143
##  2 California Alameda           1584983
##  3 California Contra Costa      1096068
##  4 California Los Angeles      10038388
##  5 California Orange            3116069
##  6 California Riverside         2298032
##  7 California Sacramento        1465832
##  8 California San Bernardino    2094769
##  9 California San Diego         3223096
## 10 California Santa Clara       1868149
## # ℹ 31 more rows

We can filter on two conditions at once. Find only the counties in the state of California that also have a population above one million (1000000).

counties_selected <- counties %>%
  select(state, county, population)

counties_selected %>%
  # Filter for California counties with a population above 1000000
  filter(state == "California", population > 1000000)
## # A tibble: 9 × 3
##   state      county         population
##   <chr>      <chr>               <dbl>
## 1 California Alameda           1584983
## 2 California Contra Costa      1096068
## 3 California Los Angeles      10038388
## 4 California Orange            3116069
## 5 California Riverside         2298032
## 6 California Sacramento        1465832
## 7 California San Bernardino    2094769
## 8 California San Diego         3223096
## 9 California Santa Clara       1868149

5.1.4.2 Filtering and arranging

We’re often interested in both filtering and sorting a dataset, to focus on observations of particular interest to you. Here, you’ll find counties that are extreme examples of what fraction of the population works in the private sector. Filter for counties in the state of Texas that have more than ten thousand people (10000), and sort them in descending order of the percentage of people employed in private work.

counties_selected <- counties %>%
  select(state, county, population, private_work, public_work, self_employed)

# Filter for Texas and more than 10000 people; sort in descending order of private_work
counties_selected %>%
  # Filter for Texas and more than 10000 people
  filter(state == "Texas", population > 10000) %>%
  # Sort in descending order of private_work
  arrange(desc(private_work))
## # A tibble: 169 × 6
##    state county  population private_work public_work self_employed
##    <chr> <chr>        <dbl>        <dbl>       <dbl>         <dbl>
##  1 Texas Gregg       123178         84.7         9.8           5.4
##  2 Texas Collin      862215         84.1        10             5.8
##  3 Texas Dallas     2485003         83.9         9.5           6.4
##  4 Texas Harris     4356362         83.4        10.1           6.3
##  5 Texas Andrews      16775         83.1         9.6           6.8
##  6 Texas Tarrant    1914526         83.1        11.4           5.4
##  7 Texas Titus        32553         82.5        10             7.4
##  8 Texas Denton      731851         82.2        11.9           5.7
##  9 Texas Ector       149557         82          11.2           6.7
## 10 Texas Moore        22281         82          11.7           5.9
## # ℹ 159 more rows

5.1.5 The mutate() verb

Datasets don’t always have the variables we need. We can use the mutate() verb to add new variables or change existing variables.
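As a minimal sketch (not one of the exercises), mutate() can also overwrite an existing column rather than adding a new one; here the land_area column, recorded in square miles, is converted to square kilometers (1 square mile is roughly 2.59 square kilometers):

# Overwrite an existing column in place with mutate()
counties %>%
  mutate(land_area = land_area * 2.59)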

5.1.5.1 Calculating the number of government employees

In the DataCamp video, you used the unemployment variable, which is a percentage, to calculate the number of unemployed people in each county. In this exercise, you’ll do the same with another percentage variable: public_work. Use mutate() to add a column called public_workers to the dataset, with the number of people employed in public (government) work. Then sort the new column in descending order.

counties_selected <- counties %>%
  select(state, county, population, public_work)

counties_selected %>%
  # Add a new column public_workers with the number of people employed in public work
  mutate(public_workers = public_work * population / 100) %>%
  # Sort in descending order of the public_workers column
  arrange(desc(public_workers))
## # A tibble: 3,138 × 5
##    state      county         population public_work public_workers
##    <chr>      <chr>               <dbl>       <dbl>          <dbl>
##  1 California Los Angeles      10038388        11.5       1154415.
##  2 Illinois   Cook              5236393        11.5        602185.
##  3 California San Diego         3223096        14.8        477018.
##  4 Arizona    Maricopa          4018143        11.7        470123.
##  5 Texas      Harris            4356362        10.1        439993.
##  6 New York   Kings             2595259        14.4        373717.
##  7 California San Bernardino    2094769        16.7        349826.
##  8 California Riverside         2298032        14.9        342407.
##  9 California Sacramento        1465832        21.8        319551.
## 10 California Orange            3116069        10.2        317839.
## # ℹ 3,128 more rows

5.1.5.2 Calculating the percentage of women in a county

The dataset includes columns for the total number (not percentage) of men and women in each county. You could use this, along with the population variable, to compute the fraction of men (or women) within each county. Select the columns state, county, population, men, and women. Add a new variable called proportion_women with the fraction of the county’s population made up of women.

counties_selected <- counties %>%
  # Select the columns state, county, population, men, and women
  select(state, county, population, men, women)

counties_selected %>%
  # Calculate proportion_women as the fraction of the population made up of women
  mutate(proportion_women = women / (men + women))
## # A tibble: 3,138 × 6
##    state   county   population   men women proportion_women
##    <chr>   <chr>         <dbl> <dbl> <dbl>            <dbl>
##  1 Alabama Autauga       55221 26745 28476            0.516
##  2 Alabama Baldwin      195121 95314 99807            0.512
##  3 Alabama Barbour       26932 14497 12435            0.462
##  4 Alabama Bibb          22604 12073 10531            0.466
##  5 Alabama Blount        57710 28512 29198            0.506
##  6 Alabama Bullock       10678  5660  5018            0.470
##  7 Alabama Butler        20354  9502 10852            0.533
##  8 Alabama Calhoun      116648 56274 60374            0.518
##  9 Alabama Chambers      34079 16258 17821            0.523
## 10 Alabama Cherokee      26008 12975 13033            0.501
## # ℹ 3,128 more rows

5.1.6 Mutate(), filter(), and arrange()

In this exercise, you’ll put together everything you’ve learned in this chapter (select(), mutate(), filter() and arrange()), to find the counties with the highest proportion of men. Select the state, county, and population columns, and add a proportion_men column with the fractional male population using a single verb. Filter for counties with a population of at least ten thousand (10000). Arrange counties in descending order of their proportion of men.

Hint: use .keep = "none" to get rid of all extra columns.

counties %>%
  # Keep state, county, and population, and add proportion_men
  mutate(state, county, population, proportion_men = men / population, .keep = "none") %>%
  # Filter for population of at least 10,000
  filter(population >= 10000) %>% 
  # Arrange proportion of men in descending order 
  arrange(desc(proportion_men))
## # A tibble: 2,437 × 4
##    state      county         population proportion_men
##    <chr>      <chr>               <dbl>          <dbl>
##  1 Virginia   Sussex              11864          0.685
##  2 California Lassen              32645          0.668
##  3 Georgia    Chattahoochee       11914          0.666
##  4 Louisiana  West Feliciana      15415          0.664
##  5 Florida    Union               15191          0.647
##  6 Texas      Jones               19978          0.633
##  7 Missouri   DeKalb              12782          0.632
##  8 Texas      Madison             13838          0.625
##  9 Virginia   Greensville         11760          0.621
## 10 Texas      Anderson            57915          0.612
## # ℹ 2,427 more rows

5.2 Aggregating Data

So far in this course, we’ve learned to select variables from a dataset, to filter and sort observations, and to create new variables with the mutate verb. But, so far, we’ve been working at the same level as the initial data, where every observation corresponds to one US county. In this chapter, you’ll learn to aggregate data, that is, to take many observations and summarize them into one. This is a common strategy for making datasets manageable and interpretable.

5.2.1 The count() verb

One way we can aggregate data is to count it: to find out the number of observations. The dplyr verb for this is count().

5.2.1.1 Counting by region

The counties dataset contains columns for region, state, population, and the number of citizens, which we selected and saved as the counties_selected table. In this exercise, you’ll focus on the region column.

Use count() to find the number of counties in each region, using a second argument to sort in descending order.

#Create counties_selected2
counties_selected2 <- counties %>%
  select(county, region, state, population, citizens)
# Use count to find the number of counties in each region

counties_selected2 %>%
  count(region, sort=TRUE)
## # A tibble: 4 × 2
##   region            n
##   <chr>         <int>
## 1 South          1420
## 2 North Central  1054
## 3 West            447
## 4 Northeast       217

5.2.1.2 Counting citizens by state

You can weight your count by a particular variable rather than just finding the number of counties. In this case, you’ll find the number of citizens in each state. Count the counties in each state, weighted by the citizens column, and sort in descending order.

# Find number of counties per state, weighted by citizens, sorted in descending order
counties_selected2 %>%
  count(state, wt=citizens, sort=TRUE)
## # A tibble: 50 × 2
##    state                 n
##    <chr>             <dbl>
##  1 California     24280349
##  2 Texas          16864864
##  3 Florida        13933052
##  4 New York       13531404
##  5 Pennsylvania    9710416
##  6 Illinois        8979999
##  7 Ohio            8709050
##  8 Michigan        7380136
##  9 North Carolina  7107998
## 10 Georgia         6978660
## # ℹ 40 more rows

5.2.1.3 Mutating and counting

You can combine multiple verbs to answer increasingly complicated questions about your data. For example: “What are the US states where the most people walk to work?” You’ll use the walk column, which gives the percentage of people in each county who walk to work, to add a new column and count based on it.

Use mutate() to calculate and add a column called population_walk, containing the total number of people who walk to work in a county. Use a (weighted and sorted) count() to find the total number of people who walk to work in each state.

# Save new data set
counties_selected3 <- counties %>%
  select(county, region, state, population, walk)

counties_selected3 %>%
  # Add population_walk containing the total number of people who walk to work 
  mutate(population_walk = population * walk / 100) %>%
  # Count weighted by the new column, sort in descending order
  count(state, wt = population_walk, sort = TRUE)
## # A tibble: 50 × 2
##    state                n
##    <chr>            <dbl>
##  1 New York      1237938.
##  2 California    1017964.
##  3 Pennsylvania   505397.
##  4 Texas          430783.
##  5 Illinois       400346.
##  6 Massachusetts  316765.
##  7 Florida        284723.
##  8 New Jersey     273047.
##  9 Ohio           266911.
## 10 Washington     239764.
## # ℹ 40 more rows

5.2.2 The group_by(), summarize(), and ungroup() verbs

We’ve learned to aggregate data using the count() verb, but count() is a special case of a more general pair of verbs: group_by() and summarize().
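For intuition, here is a minimal sketch (not from the course) showing that count() on one column gives the same result as grouping by that column and summarizing with n():

# These two pipelines produce the same table of counties per region
counties %>%
  count(region)

counties %>%
  group_by(region) %>%
  summarize(n = n())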

5.2.2.1 Summarizing

The summarize() verb is very useful for collapsing a large dataset into a single observation. Summarize the counties data set to find the following columns: min_population (with the smallest population), max_unemployment (with the maximum unemployment), and average_income (with the mean of the income variable).

# Save new dataset
counties_selected4 <- counties %>%
  select(county, population, income, unemployment)

counties_selected4 %>%
  # Summarize to find minimum population, maximum unemployment, and average income
  summarize(min_population = min(population), max_unemployment = max(unemployment), average_income = mean(income))
## # A tibble: 1 × 3
##   min_population max_unemployment average_income
##            <dbl>            <dbl>          <dbl>
## 1             85             29.4         46832.

5.2.2.2 Summarizing by state

Another interesting column is land_area, which shows the land area in square miles. Here, you’ll summarize both population and land area by state, with the purpose of finding the density (in people per square mile). Group the data by state, and summarize to create the columns total_area (the total land_area in square miles) and total_population (the total population).

# Save new data set
counties_selected5 <- counties %>%
  select(state, county, population, land_area)


counties_selected5 %>%
  # Group by state 
  group_by(state) %>%
  # Find the total area and population
  summarize(total_area = sum(land_area), total_population = sum(population) )
## # A tibble: 50 × 3
##    state       total_area total_population
##    <chr>            <dbl>            <dbl>
##  1 Alabama         50645.          4830620
##  2 Alaska         553560.           725461
##  3 Arizona        113594.          6641928
##  4 Arkansas        52035.          2958208
##  5 California     155779.         38421464
##  6 Colorado       103642.          5278906
##  7 Connecticut      4842.          3593222
##  8 Delaware         1949.           926454
##  9 Florida         53625.         19645772
## 10 Georgia         57514.         10006693
## # ℹ 40 more rows

Add a density column containing the total population per square mile, using the columns you already created, then arrange in descending order.

counties_selected5 %>%
  group_by(state) %>%
  summarize(total_area = sum(land_area),
            total_population = sum(population)) %>%
  # Add a density column
  mutate(density = total_population / total_area) %>%
  # Sort by density in descending order
  arrange(desc(density))
## # A tibble: 50 × 4
##    state         total_area total_population density
##    <chr>              <dbl>            <dbl>   <dbl>
##  1 New Jersey         7354.          8904413   1211.
##  2 Rhode Island       1034.          1053661   1019.
##  3 Massachusetts      7800.          6705586    860.
##  4 Connecticut        4842.          3593222    742.
##  5 Maryland           9707.          5930538    611.
##  6 Delaware           1949.           926454    475.
##  7 New York          47126.         19673174    417.
##  8 Florida           53625.         19645772    366.
##  9 Pennsylvania      44743.         12779559    286.
## 10 Ohio              40861.         11575977    283.
## # ℹ 40 more rows

5.2.2.3 Summarizing by state and region

You can group by multiple columns instead of grouping by one. Here, you’ll practice aggregating by state and region, and notice how useful it is for performing multiple aggregations in a row.

Summarize to find the total population, as a column called total_pop, in each combination of region and state, grouped in that order.

#Save new data set
counties_selected6 <- counties %>%
  select(region, state, county, population)

counties_selected6 %>%
  # Group and summarize to find the total population
  group_by(region, state) %>%
  summarize(total_pop = sum(population))
## `summarise()` has grouped output by 'region'. You can override using the
## `.groups` argument.
## # A tibble: 50 × 3
## # Groups:   region [4]
##    region        state        total_pop
##    <chr>         <chr>            <dbl>
##  1 North Central Illinois      12873761
##  2 North Central Indiana        6568645
##  3 North Central Iowa           3093526
##  4 North Central Kansas         2892987
##  5 North Central Michigan       9900571
##  6 North Central Minnesota      5419171
##  7 North Central Missouri       6045448
##  8 North Central Nebraska       1869365
##  9 North Central North Dakota    721640
## 10 North Central Ohio          11575977
## # ℹ 40 more rows

Notice the tibble is still grouped by region; use another summarize() step to calculate two new columns: the average state population in each region (average_pop) and the median state population in each region (median_pop).

counties_selected6 %>%
  # Group and summarize to find the total population
  group_by(region, state) %>%
  summarize(total_pop = sum(population)) %>%
  # Calculate the average_pop and median_pop columns 
  summarize(average_pop = mean(total_pop), median_pop = median(total_pop))
## `summarise()` has grouped output by 'region'. You can override using the
## `.groups` argument.
## # A tibble: 4 × 3
##   region        average_pop median_pop
##   <chr>               <dbl>      <dbl>
## 1 North Central    5627687.    5580644
## 2 Northeast        6221058.    3593222
## 3 South            7370486     4804098
## 4 West             5722755.    2798636

5.2.3 The slice_min and slice_max verbs

What if instead of aggregating each state, we wanted to find only the largest or smallest counties in each state? dplyr’s slice_min and slice_max verbs allow us to extract the most extreme observations from each group.
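Before applying them within groups, here is a minimal ungrouped sketch (not one of the exercises) of what slice_max() and slice_min() do on their own:

# The three most populous counties in the whole dataset
counties %>%
  slice_max(population, n = 3)

# The three least populous counties in the whole dataset
counties %>%
  slice_min(population, n = 3)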

5.2.3.1 Selecting a county from each region

Previously, you used the walk column, which gives the percentage of people in each county who walk to work, to add a new column and count the total number of people who walk to work in each state. Now, find the county in each region with the highest percentage of citizens who walk to work.

# Save new data set
counties_selected7 <- counties %>%
  select(region, state, county, metro, population, walk)

counties_selected7 %>%
  # Group by region
  group_by(region) %>%
  # Find the county with the highest percentage of people who walk to work
  slice_max(walk, n = 1)
## # A tibble: 4 × 6
## # Groups:   region [4]
##   region        state        county                 metro    population  walk
##   <chr>         <chr>        <chr>                  <chr>         <dbl> <dbl>
## 1 North Central North Dakota McIntosh               Nonmetro       2759  17.5
## 2 Northeast     New York     New York               Metro       1629507  20.7
## 3 South         Virginia     Lexington city         Nonmetro       7071  31.7
## 4 West          Alaska       Aleutians East Borough Nonmetro       3304  71.2

5.2.3.2 Finding the lowest-income state in each region

You’ve been learning to combine multiple dplyr verbs. Here, you’ll combine group_by(), summarize(), and slice_min() to find the state in each region with the lowest average income.

When you group by multiple columns and then summarize, it’s important to remember that the summarize “peels off” one of the groups, but leaves the rest on. For example, if you group_by(X, Y) then summarize, the result will still be grouped by X. Calculate the average income (as average_income) of counties within each region and state (notice the group_by() has already been done for you). Find the state with the lowest average_income in each region.

# Save new data set
counties_selected8 <- counties %>%
  select(region, state, county, population, income)

counties_selected8 %>%
  group_by(region, state) %>%
  # Calculate average income
  summarize(average_income = mean(income)) %>%
  # Find the lowest income state in each region
  slice_min(average_income, n=1)
## `summarise()` has grouped output by 'region'. You can override using the
## `.groups` argument.
## # A tibble: 4 × 3
## # Groups:   region [4]
##   region        state       average_income
##   <chr>         <chr>                <dbl>
## 1 North Central Missouri            41755.
## 2 Northeast     Maine               46142.
## 3 South         Mississippi         34939.
## 4 West          New Mexico          40184.

5.2.4 Using summarize, slice_max, and count together

In this chapter, you’ve learned to use six dplyr verbs related to aggregation: count(), group_by(), summarize(), ungroup(), slice_max(), and slice_min(). In this exercise, you’ll combine them to answer a question:

In how many states do more people live in metro areas than non-metro areas?

Recall that the metro column has one of the two values "Metro" (for high-density city areas) or "Nonmetro" (for suburban and country areas).

For each combination of state and metro, find the total population as total_pop.

#Save new data set
counties_selected9 <- counties %>%
  select(state, metro, population)

counties_selected9 %>%
  # Find the total population for each combination of state and metro
  group_by(state, metro) %>%
  summarize(total_pop = sum(population))
## `summarise()` has grouped output by 'state'. You can override using the
## `.groups` argument.
## # A tibble: 97 × 3
## # Groups:   state [50]
##    state      metro    total_pop
##    <chr>      <chr>        <dbl>
##  1 Alabama    Metro      3671377
##  2 Alabama    Nonmetro   1159243
##  3 Alaska     Metro       494990
##  4 Alaska     Nonmetro    230471
##  5 Arizona    Metro      6295145
##  6 Arizona    Nonmetro    346783
##  7 Arkansas   Metro      1806867
##  8 Arkansas   Nonmetro   1151341
##  9 California Metro     37587429
## 10 California Nonmetro    834035
## # ℹ 87 more rows

Extract the most populated row from each state, which will be either Metro or Nonmetro.

counties_selected9 %>%
  # Find the total population for each combination of state and metro
  group_by(state, metro) %>%
  summarize(total_pop = sum(population)) %>%
  # Extract the most populated row for each state
  slice_max(total_pop, n=1)
## `summarise()` has grouped output by 'state'. You can override using the
## `.groups` argument.
## # A tibble: 50 × 3
## # Groups:   state [50]
##    state       metro total_pop
##    <chr>       <chr>     <dbl>
##  1 Alabama     Metro   3671377
##  2 Alaska      Metro    494990
##  3 Arizona     Metro   6295145
##  4 Arkansas    Metro   1806867
##  5 California  Metro  37587429
##  6 Colorado    Metro   4590896
##  7 Connecticut Metro   3406918
##  8 Delaware    Metro    926454
##  9 Florida     Metro  18941821
## 10 Georgia     Metro   8233886
## # ℹ 40 more rows

Ungroup, then count how often Metro or Nonmetro appears to see how many states have more people living in those areas.

counties_selected9 %>%
  # Find the total population for each combination of state and metro
  group_by(state, metro) %>%
  summarize(total_pop = sum(population)) %>%
  # Extract the most populated row for each state
  slice_max(total_pop, n = 1) %>%
  # Count the states with more people in Metro or Nonmetro areas
  ungroup() %>%
  count(metro)
## `summarise()` has grouped output by 'state'. You can override using the
## `.groups` argument.
## # A tibble: 2 × 2
##   metro        n
##   <chr>    <int>
## 1 Metro       44
## 2 Nonmetro     6

5.3 Selecting and Transforming Data

We’ve already seen that we can select the columns that we’re interested in by listing them inside the select verb. This chapter focuses on advanced methods of selecting and transforming columns. There are a ton of variables in the counties dataset, and we often only want to work with a subset of them.

5.3.1 Selecting

We can also select a range of columns. For example, there is a series of columns containing information on how people get to work. If we want to select this range of columns, from drive to work_at_home, we can write drive:work_at_home.
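As a minimal sketch of that range selection (the exercise below uses a different range):

# Select every commute-related column, from drive through work_at_home
counties %>%
  select(drive:work_at_home)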

5.3.1.1 Selecting columns

Using the select() verb, we can answer interesting questions about our dataset by focusing in on related groups of variables. The colon (:) is useful for selecting many columns at a time.

Use glimpse() to examine all the variables in the counties table.

Select the columns for state, county, population, and (using a colon) all five of those industry-related variables; there are five consecutive variables in the table related to the industry of people’s work: professional, service, office, construction, and production.

Arrange the table in descending order of service to find which counties have the highest rates of working in the service industry.

# Glimpse the counties table
glimpse(counties)
## Rows: 3,138
## Columns: 40
## $ census_id          <chr> "1001", "1003", "1005", "1007", "1009", "1011", "10…
## $ state              <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alabam…
## $ county             <chr> "Autauga", "Baldwin", "Barbour", "Bibb", "Blount", …
## $ region             <chr> "South", "South", "South", "South", "South", "South…
## $ metro              <chr> "Metro", "Metro", "Nonmetro", "Metro", "Metro", "No…
## $ population         <dbl> 55221, 195121, 26932, 22604, 57710, 10678, 20354, 1…
## $ men                <dbl> 26745, 95314, 14497, 12073, 28512, 5660, 9502, 5627…
## $ women              <dbl> 28476, 99807, 12435, 10531, 29198, 5018, 10852, 603…
## $ hispanic           <dbl> 2.6, 4.5, 4.6, 2.2, 8.6, 4.4, 1.2, 3.5, 0.4, 1.5, 7…
## $ white              <dbl> 75.8, 83.1, 46.2, 74.5, 87.9, 22.2, 53.3, 73.0, 57.…
## $ black              <dbl> 18.5, 9.5, 46.7, 21.4, 1.5, 70.7, 43.8, 20.3, 40.3,…
## $ native             <dbl> 0.4, 0.6, 0.2, 0.4, 0.3, 1.2, 0.1, 0.2, 0.2, 0.6, 0…
## $ asian              <dbl> 1.0, 0.7, 0.4, 0.1, 0.1, 0.2, 0.4, 0.9, 0.8, 0.3, 0…
## $ pacific            <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0…
## $ citizens           <dbl> 40725, 147695, 20714, 17495, 42345, 8057, 15581, 88…
## $ income             <dbl> 51281, 50254, 32964, 38678, 45813, 31938, 32229, 41…
## $ income_err         <dbl> 2391, 1263, 2973, 3995, 3141, 5884, 1793, 925, 2949…
## $ income_per_cap     <dbl> 24974, 27317, 16824, 18431, 20532, 17580, 18390, 21…
## $ income_per_cap_err <dbl> 1080, 711, 798, 1618, 708, 2055, 714, 489, 1366, 15…
## $ poverty            <dbl> 12.9, 13.4, 26.7, 16.8, 16.7, 24.6, 25.4, 20.5, 21.…
## $ child_poverty      <dbl> 18.6, 19.2, 45.3, 27.9, 27.2, 38.4, 39.2, 31.6, 37.…
## $ professional       <dbl> 33.2, 33.1, 26.8, 21.5, 28.5, 18.8, 27.5, 27.3, 23.…
## $ service            <dbl> 17.0, 17.7, 16.1, 17.9, 14.1, 15.0, 16.6, 17.7, 14.…
## $ office             <dbl> 24.2, 27.1, 23.1, 17.8, 23.9, 19.7, 21.9, 24.2, 26.…
## $ construction       <dbl> 8.6, 10.8, 10.8, 19.0, 13.5, 20.1, 10.3, 10.5, 11.5…
## $ production         <dbl> 17.1, 11.2, 23.1, 23.7, 19.9, 26.4, 23.7, 20.4, 24.…
## $ drive              <dbl> 87.5, 84.7, 83.8, 83.2, 84.9, 74.9, 84.5, 85.3, 85.…
## $ carpool            <dbl> 8.8, 8.8, 10.9, 13.5, 11.2, 14.9, 12.4, 9.4, 11.9, …
## $ transit            <dbl> 0.1, 0.1, 0.4, 0.5, 0.4, 0.7, 0.0, 0.2, 0.2, 0.2, 0…
## $ walk               <dbl> 0.5, 1.0, 1.8, 0.6, 0.9, 5.0, 0.8, 1.2, 0.3, 0.6, 1…
## $ other_transp       <dbl> 1.3, 1.4, 1.5, 1.5, 0.4, 1.7, 0.6, 1.2, 0.4, 0.7, 1…
## $ work_at_home       <dbl> 1.8, 3.9, 1.6, 0.7, 2.3, 2.8, 1.7, 2.7, 2.1, 2.5, 1…
## $ mean_commute       <dbl> 26.5, 26.4, 24.1, 28.8, 34.9, 27.5, 24.6, 24.1, 25.…
## $ employed           <dbl> 23986, 85953, 8597, 8294, 22189, 3865, 7813, 47401,…
## $ private_work       <dbl> 73.6, 81.5, 71.8, 76.8, 82.0, 79.5, 77.4, 74.1, 85.…
## $ public_work        <dbl> 20.9, 12.3, 20.8, 16.1, 13.5, 15.1, 16.2, 20.8, 12.…
## $ self_employed      <dbl> 5.5, 5.8, 7.3, 6.7, 4.2, 5.4, 6.2, 5.0, 2.8, 7.9, 4…
## $ family_work        <dbl> 0.0, 0.4, 0.1, 0.4, 0.4, 0.0, 0.2, 0.1, 0.0, 0.5, 0…
## $ unemployment       <dbl> 7.6, 7.5, 17.6, 8.3, 7.7, 18.0, 10.9, 12.3, 8.9, 7.…
## $ land_area          <dbl> 594.44, 1589.78, 884.88, 622.58, 644.78, 622.81, 77…
counties %>%
  # Select state, county, population, and industry-related columns
  select(state, county, population, professional:production) %>%
  # Arrange service in descending order 
  arrange(desc(service))
## # A tibble: 3,138 × 8
##    state   county population professional service office construction production
##    <chr>   <chr>       <dbl>        <dbl>   <dbl>  <dbl>        <dbl>      <dbl>
##  1 Missis… Tunica      10477         23.9    36.6   21.5          3.5       14.5
##  2 Texas   Kinney       3577         30      36.5   11.6         20.5        1.3
##  3 Texas   Kenedy        565         24.9    34.1   20.5         20.5        0  
##  4 New Yo… Bronx     1428357         24.3    33.3   24.2          7.1       11  
##  5 Texas   Brooks       7221         19.6    32.4   25.3         11.1       11.5
##  6 Colora… Fremo…      46809         26.6    32.2   22.8         10.7        7.6
##  7 Texas   Culbe…       2296         20.1    32.2   24.2         15.7        7.8
##  8 Califo… Del N…      27788         33.9    31.5   18.8          8.9        6.8
##  9 Minnes… Mahno…       5496         26.8    31.5   18.7         13.1        9.9
## 10 Virgin… Lanca…      11129         30.3    31.2   22.8          8.1        7.6
## # ℹ 3,128 more rows

5.3.1.2 Select helpers

In the video you learned about the select helper starts_with(). Another select helper is ends_with(), which finds the columns that end with a particular string.
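Since the exercise below uses ends_with(), here is a minimal sketch of starts_with() (not part of the exercise):

# Columns whose names start with "income"
counties %>%
  select(state, county, starts_with("income"))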

Select the columns state, county, population, and all those that end with work. Filter just for the counties where at least 50% of the population is engaged in public work.

counties %>%
  # Select the state, county, population, and those ending with "work"
  select(state, county, population, ends_with("work")) %>%
  # Filter for counties that have at least 50% of people engaged in public work
  filter(public_work > 50)
## # A tibble: 7 × 6
##   state        county            population private_work public_work family_work
##   <chr>        <chr>                  <dbl>        <dbl>       <dbl>       <dbl>
## 1 Alaska       Lake and Peninsu…       1474         42.2        51.6         0.2
## 2 Alaska       Yukon-Koyukuk Ce…       5644         33.3        61.7         0  
## 3 California   Lassen                 32645         42.6        50.5         0.1
## 4 Hawaii       Kalawao                   85         25          64.1         0  
## 5 North Dakota Sioux                   4380         32.9        56.8         0.1
## 6 South Dakota Todd                    9942         34.4        55           0.8
## 7 Wisconsin    Menominee               4451         36.8        59.1         0.4

5.3.2 The rename verb

Rather than only selecting columns, we’ll sometimes want to rename the ones we already have. We’ll learn a new verb to do this: the rename() verb.

5.3.2.1 Renaming a column after count

The rename() verb is often useful for changing the name of a column that comes out of another verb, such as count(). In this exercise, you’ll rename the default n column generated from count() to something more descriptive.

Use count() to determine how many counties are in each state.

counties %>%
  # Count the number of counties in each state
  count(state)
## # A tibble: 50 × 2
##    state           n
##    <chr>       <int>
##  1 Alabama        67
##  2 Alaska         28
##  3 Arizona        15
##  4 Arkansas       75
##  5 California     58
##  6 Colorado       64
##  7 Connecticut     8
##  8 Delaware        3
##  9 Florida        67
## 10 Georgia       159
## # ℹ 40 more rows

Notice the n column in the output; use rename() to rename that to num_counties.

counties %>%
  # Count the number of counties in each state
  count(state) %>%
  # Rename the n column to num_counties
  rename(num_counties=n)
## # A tibble: 50 × 2
##    state       num_counties
##    <chr>              <int>
##  1 Alabama               67
##  2 Alaska                28
##  3 Arizona               15
##  4 Arkansas              75
##  5 California            58
##  6 Colorado              64
##  7 Connecticut            8
##  8 Delaware               3
##  9 Florida               67
## 10 Georgia              159
## # ℹ 40 more rows

5.3.2.2 Renaming a column as part of a select

rename() isn’t the only way you can choose a new name for a column; you can also choose a name as part of a select().

Select the columns state, county, and poverty from the counties dataset; in the same step, rename the poverty column to poverty_rate.

counties %>%
  # Select state, county, and poverty as poverty_rate
  select(state, county, poverty_rate = poverty)
## # A tibble: 3,138 × 3
##    state   county   poverty_rate
##    <chr>   <chr>           <dbl>
##  1 Alabama Autauga          12.9
##  2 Alabama Baldwin          13.4
##  3 Alabama Barbour          26.7
##  4 Alabama Bibb             16.8
##  5 Alabama Blount           16.7
##  6 Alabama Bullock          24.6
##  7 Alabama Butler           25.4
##  8 Alabama Calhoun          20.5
##  9 Alabama Chambers         21.6
## 10 Alabama Cherokee         19.2
## # ℹ 3,128 more rows

5.3.3 The relocate verb

We’ve learned about the select, rename, and mutate verbs for transforming data, but now we’ll learn a fourth one: relocate().

5.3.3.1 Using relocate

As you learned in the video, the relocate() verb allows you to move columns around, either relative to other columns or to their overall position in the tibble.

You’ve been given the counties_selected tibble, which contains the columns you need for your analysis of population density, but in an order that isn’t easy to read. You’ll use your new-found skills to put them right!

Move the density column to the end of the tibble. Move the population column to before the land_area column.

counties_selected10 <- counties %>%
  select(population, land_area) %>%
  mutate(density = population / land_area)

counties_selected10 %>%
  # Move the density column to the end
  relocate(density, .after = last_col()) %>%
  # Move the population column to before land_area
  relocate(population, .before = land_area)
## # A tibble: 3,138 × 3
##    population land_area density
##         <dbl>     <dbl>   <dbl>
##  1      55221      594.    92.9
##  2     195121     1590.   123. 
##  3      26932      885.    30.4
##  4      22604      623.    36.3
##  5      57710      645.    89.5
##  6      10678      623.    17.1
##  7      20354      777.    26.2
##  8     116648      606.   193. 
##  9      34079      597.    57.1
## 10      26008      554.    47.0
## # ℹ 3,128 more rows

5.3.3.2 Choosing among the four verbs

In this chapter you’ve learned about the four verbs: select(), mutate(), relocate(), and rename(). Here, you’ll choose the appropriate verb for each situation. You won’t need to change anything inside the parentheses.

Choose the right verb for changing the name of the unemployment column to unemployment_rate. Choose the right verb for keeping only the columns state, county, and the ones containing poverty. Calculate a new column called fraction_women with the fraction of the population made up of women, without dropping any columns. Choose the right verb for moving the region column to before the state column.

# Change the name of the unemployment column
counties %>%
  rename(unemployment_rate = unemployment)
## # A tibble: 3,138 × 40
##    census_id state   county   region metro population   men women hispanic white
##    <chr>     <chr>   <chr>    <chr>  <chr>      <dbl> <dbl> <dbl>    <dbl> <dbl>
##  1 1001      Alabama Autauga  South  Metro      55221 26745 28476      2.6  75.8
##  2 1003      Alabama Baldwin  South  Metro     195121 95314 99807      4.5  83.1
##  3 1005      Alabama Barbour  South  Nonm…      26932 14497 12435      4.6  46.2
##  4 1007      Alabama Bibb     South  Metro      22604 12073 10531      2.2  74.5
##  5 1009      Alabama Blount   South  Metro      57710 28512 29198      8.6  87.9
##  6 1011      Alabama Bullock  South  Nonm…      10678  5660  5018      4.4  22.2
##  7 1013      Alabama Butler   South  Nonm…      20354  9502 10852      1.2  53.3
##  8 1015      Alabama Calhoun  South  Metro     116648 56274 60374      3.5  73  
##  9 1017      Alabama Chambers South  Nonm…      34079 16258 17821      0.4  57.3
## 10 1019      Alabama Cherokee South  Nonm…      26008 12975 13033      1.5  91.7
## # ℹ 3,128 more rows
## # ℹ 30 more variables: black <dbl>, native <dbl>, asian <dbl>, pacific <dbl>,
## #   citizens <dbl>, income <dbl>, income_err <dbl>, income_per_cap <dbl>,
## #   income_per_cap_err <dbl>, poverty <dbl>, child_poverty <dbl>,
## #   professional <dbl>, service <dbl>, office <dbl>, construction <dbl>,
## #   production <dbl>, drive <dbl>, carpool <dbl>, transit <dbl>, walk <dbl>,
## #   other_transp <dbl>, work_at_home <dbl>, mean_commute <dbl>, …
# Keep the state and county columns, and the columns containing poverty
counties %>%
  select(state, county, contains("poverty"))
## # A tibble: 3,138 × 4
##    state   county   poverty child_poverty
##    <chr>   <chr>      <dbl>         <dbl>
##  1 Alabama Autauga     12.9          18.6
##  2 Alabama Baldwin     13.4          19.2
##  3 Alabama Barbour     26.7          45.3
##  4 Alabama Bibb        16.8          27.9
##  5 Alabama Blount      16.7          27.2
##  6 Alabama Bullock     24.6          38.4
##  7 Alabama Butler      25.4          39.2
##  8 Alabama Calhoun     20.5          31.6
##  9 Alabama Chambers    21.6          37.2
## 10 Alabama Cherokee    19.2          30.1
## # ℹ 3,128 more rows
# Calculate the fraction_women column without dropping the other columns
counties %>%
  mutate(fraction_women = women / population)
## # A tibble: 3,138 × 41
##    census_id state   county   region metro population   men women hispanic white
##    <chr>     <chr>   <chr>    <chr>  <chr>      <dbl> <dbl> <dbl>    <dbl> <dbl>
##  1 1001      Alabama Autauga  South  Metro      55221 26745 28476      2.6  75.8
##  2 1003      Alabama Baldwin  South  Metro     195121 95314 99807      4.5  83.1
##  3 1005      Alabama Barbour  South  Nonm…      26932 14497 12435      4.6  46.2
##  4 1007      Alabama Bibb     South  Metro      22604 12073 10531      2.2  74.5
##  5 1009      Alabama Blount   South  Metro      57710 28512 29198      8.6  87.9
##  6 1011      Alabama Bullock  South  Nonm…      10678  5660  5018      4.4  22.2
##  7 1013      Alabama Butler   South  Nonm…      20354  9502 10852      1.2  53.3
##  8 1015      Alabama Calhoun  South  Metro     116648 56274 60374      3.5  73  
##  9 1017      Alabama Chambers South  Nonm…      34079 16258 17821      0.4  57.3
## 10 1019      Alabama Cherokee South  Nonm…      26008 12975 13033      1.5  91.7
## # ℹ 3,128 more rows
## # ℹ 31 more variables: black <dbl>, native <dbl>, asian <dbl>, pacific <dbl>,
## #   citizens <dbl>, income <dbl>, income_err <dbl>, income_per_cap <dbl>,
## #   income_per_cap_err <dbl>, poverty <dbl>, child_poverty <dbl>,
## #   professional <dbl>, service <dbl>, office <dbl>, construction <dbl>,
## #   production <dbl>, drive <dbl>, carpool <dbl>, transit <dbl>, walk <dbl>,
## #   other_transp <dbl>, work_at_home <dbl>, mean_commute <dbl>, …
# Move the region column to before state
counties %>%
  relocate(region, .before = state)
## # A tibble: 3,138 × 40
##    census_id region state   county   metro population   men women hispanic white
##    <chr>     <chr>  <chr>   <chr>    <chr>      <dbl> <dbl> <dbl>    <dbl> <dbl>
##  1 1001      South  Alabama Autauga  Metro      55221 26745 28476      2.6  75.8
##  2 1003      South  Alabama Baldwin  Metro     195121 95314 99807      4.5  83.1
##  3 1005      South  Alabama Barbour  Nonm…      26932 14497 12435      4.6  46.2
##  4 1007      South  Alabama Bibb     Metro      22604 12073 10531      2.2  74.5
##  5 1009      South  Alabama Blount   Metro      57710 28512 29198      8.6  87.9
##  6 1011      South  Alabama Bullock  Nonm…      10678  5660  5018      4.4  22.2
##  7 1013      South  Alabama Butler   Nonm…      20354  9502 10852      1.2  53.3
##  8 1015      South  Alabama Calhoun  Metro     116648 56274 60374      3.5  73  
##  9 1017      South  Alabama Chambers Nonm…      34079 16258 17821      0.4  57.3
## 10 1019      South  Alabama Cherokee Nonm…      26008 12975 13033      1.5  91.7
## # ℹ 3,128 more rows
## # ℹ 30 more variables: black <dbl>, native <dbl>, asian <dbl>, pacific <dbl>,
## #   citizens <dbl>, income <dbl>, income_err <dbl>, income_per_cap <dbl>,
## #   income_per_cap_err <dbl>, poverty <dbl>, child_poverty <dbl>,
## #   professional <dbl>, service <dbl>, office <dbl>, construction <dbl>,
## #   production <dbl>, drive <dbl>, carpool <dbl>, transit <dbl>, walk <dbl>,
## #   other_transp <dbl>, work_at_home <dbl>, mean_commute <dbl>, …

5.4 Case Study: The babynames Dataset

So far in this course, we’ve been using the counties dataset, which contains US census data at a county level. In this chapter, we’re going to analyze a new dataset, one representing the names of babies born in the United States each year.

5.4.1 Filtering and arranging for one year

The dplyr verbs you’ve learned are useful for exploring data. For instance, you could find out the most common names in a particular year.

Filter for only the year 1990. Sort the table in descending order of the number of babies born.

babynames %>%
  # Filter for the year 1990
  filter(year == 1990) %>%
  # Sort the number column in descending order 
  arrange(desc(number))
## # A tibble: 21,223 × 3
##     year name        number
##    <dbl> <chr>        <int>
##  1  1990 Michael      65560
##  2  1990 Christopher  52520
##  3  1990 Jessica      46615
##  4  1990 Ashley       45797
##  5  1990 Matthew      44925
##  6  1990 Joshua       43382
##  7  1990 Brittany     36650
##  8  1990 Amanda       34504
##  9  1990 Daniel       33963
## 10  1990 David        33862
## # ℹ 21,213 more rows

5.4.3 Visualizing names with ggplot2

The dplyr package is very useful for exploring data, but it’s especially useful when combined with other tidyverse packages like ggplot2.

Filter for only the names Steven, Thomas, and Matthew, and assign the result to an object called selected_names. Hint: use %in% with a vector containing the names.

selected_names <- babynames %>%
  # Filter for the names Steven, Thomas, and Matthew 
  filter(name %in% c("Steven", "Thomas", "Matthew"))

Visualize those three names as a line plot over time, with each name represented by a different color.

# Plot the names using a different color for each name
ggplot(selected_names, aes(x = year, y = number, color = name)) +
  geom_line()

5.4.4 Grouped mutates

Just like group_by and summarize work well together, group_by and mutate are a great pair.

5.4.4.1 Finding the year each name is most common

In an earlier video, you learned how to filter for a particular name to determine the frequency of that name over time. Now, you’re going to explore which year each name was the most common.

To do this, you’ll be combining the grouped mutate approach with a slice_max().

First, calculate the total number of people born in each year in this dataset as year_total. Next, use year_total to calculate the fraction of people born in each year who have each name.

# Calculate the fraction of people born each year with the same name
babynames %>%
  group_by(year) %>%
  mutate(year_total = sum(number)) %>%
  ungroup() %>%
  mutate(fraction = number / year_total)
## # A tibble: 332,595 × 5
##     year name    number year_total  fraction
##    <dbl> <chr>    <int>      <int>     <dbl>
##  1  1880 Aaron      102     201478 0.000506 
##  2  1880 Ab           5     201478 0.0000248
##  3  1880 Abbie       71     201478 0.000352 
##  4  1880 Abbott       5     201478 0.0000248
##  5  1880 Abby         6     201478 0.0000298
##  6  1880 Abe         50     201478 0.000248 
##  7  1880 Abel         9     201478 0.0000447
##  8  1880 Abigail     12     201478 0.0000596
##  9  1880 Abner       27     201478 0.000134 
## 10  1880 Abraham     81     201478 0.000402 
## # ℹ 332,585 more rows

Now use your newly calculated fraction column, in combination with slice_max(), to identify the year each name was most common.

# Calculate the fraction of people born each year with the same name
babynames %>%
  group_by(year) %>%
  mutate(year_total = sum(number)) %>%
  ungroup() %>%
  mutate(fraction = number / year_total) %>%
  # Find the year each name is most common
  group_by(name) %>%
  slice_max(fraction, n = 1)
## # A tibble: 48,040 × 5
## # Groups:   name [48,040]
##     year name     number year_total   fraction
##    <dbl> <chr>     <int>      <int>      <dbl>
##  1  2015 Aaban        15    3648781 0.00000411
##  2  2015 Aadam        22    3648781 0.00000603
##  3  2010 Aadan        11    3672066 0.00000300
##  4  2015 Aadarsh      15    3648781 0.00000411
##  5  2010 Aaden       450    3672066 0.000123  
##  6  2015 Aadhav       31    3648781 0.00000850
##  7  2015 Aadhavan      5    3648781 0.00000137
##  8  2015 Aadhya      265    3648781 0.0000726 
##  9  2010 Aadi         54    3672066 0.0000147 
## 10  2005 Aadil        20    3828460 0.00000522
## # ℹ 48,030 more rows

5.4.4.2 Adding the total and maximum for each name

In the video, you learned how you could group by the year and use mutate() to add a total for that year.

In these exercises, you’ll learn to normalize by a different, but also interesting metric: you’ll divide each name by the maximum for that name. This means that every name will peak at 1.

Once you add new columns, the result will still be grouped by name. This splits it into 48,000 groups, which actually makes later steps like mutates slower.

Use a grouped mutate to add two columns: name_total, the total number of babies born with that name across the entire dataset, and name_max, the maximum number of babies born with that name in any single year.

babynames %>%
  # Add columns name_total and name_max for each name
  group_by(name) %>%
  mutate(name_total = sum(number),
         name_max = max(number))
## # A tibble: 332,595 × 5
## # Groups:   name [48,040]
##     year name    number name_total name_max
##    <dbl> <chr>    <int>      <int>    <int>
##  1  1880 Aaron      102     114739    14635
##  2  1880 Ab           5         77       31
##  3  1880 Abbie       71       4330      445
##  4  1880 Abbott       5        217       51
##  5  1880 Abby         6      11272     1753
##  6  1880 Abe         50       1832      271
##  7  1880 Abel         9      10565     3245
##  8  1880 Abigail     12      72600    15762
##  9  1880 Abner       27       1552      199
## 10  1880 Abraham     81      17882     2449
## # ℹ 332,585 more rows

Add another step to ungroup the table. Add a column called fraction_max containing the number in the year divided by name_max.

babynames %>%
  # Add columns name_total and name_max for each name
  group_by(name) %>%
  mutate(name_total = sum(number),
         name_max = max(number)) %>%
  # Ungroup the table 
  ungroup() %>%
  # Add the fraction_max column containing the number divided by the name maximum
  mutate(fraction_max = number / name_max)
## # A tibble: 332,595 × 6
##     year name    number name_total name_max fraction_max
##    <dbl> <chr>    <int>      <int>    <int>        <dbl>
##  1  1880 Aaron      102     114739    14635     0.00697 
##  2  1880 Ab           5         77       31     0.161   
##  3  1880 Abbie       71       4330      445     0.160   
##  4  1880 Abbott       5        217       51     0.0980  
##  5  1880 Abby         6      11272     1753     0.00342 
##  6  1880 Abe         50       1832      271     0.185   
##  7  1880 Abel         9      10565     3245     0.00277 
##  8  1880 Abigail     12      72600    15762     0.000761
##  9  1880 Abner       27       1552      199     0.136   
## 10  1880 Abraham     81      17882     2449     0.0331  
## # ℹ 332,585 more rows

5.4.4.3 Visualizing the normalized change in popularity

You picked a few names and calculated each of them as a fraction of their peak. This is a type of “normalizing” a name, where you’re focused on the relative change within each name rather than the overall popularity of the name.

Filter the names_normalized table to limit it to the three names Steven, Thomas, and Matthew. Create a line plot from names_filtered to visualize fraction_max over time, colored by name.

# Table provided in datacamp
names_normalized <- babynames %>%
                     group_by(name) %>%
                     mutate(name_total = sum(number),
                            name_max = max(number)) %>%
                     ungroup() %>%
                     mutate(fraction_max = number / name_max)

names_filtered <- names_normalized %>%
  # Filter for the names Steven, Thomas, and Matthew
  filter(name %in% c("Steven", "Thomas", "Matthew"))

# Visualize the names in names_filtered over time
ggplot(data = names_filtered, aes(x = year, y = fraction_max, color = name)) +
  geom_line()

5.4.5 Window functions

A window function takes a vector, and returns another vector of the same length.
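The window function used below is dplyr’s lag(), which shifts a vector by one position so that each value can be compared with the previous one. A minimal standalone sketch (not from the course):

# lag() returns the previous value at each position (NA for the first)
v <- c(1, 3, 6, 10)
lag(v)      # NA 1 3 6
v - lag(v)  # NA 2 3 4   (differences between consecutive values)
v / lag(v)  # NA 3 2 1.67 (ratios between consecutive values)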

5.4.5.1 Using ratios to describe the frequency of a name

In the video, you learned how to find the difference in the frequency of a baby name between consecutive years. What if instead of finding the difference, you wanted to find the ratio?

You’ll start with the babynames_fraction data, which already contains the fraction column, so that you can consider the popularity of each name within each year.

Arrange the data in ascending order of name and then year. Group by name so that your mutate works within each name. Add a column ratio containing the ratio (not the difference) of fraction between consecutive years.

Hint: when creating a new variable with the lag() function, the pattern is variable <operator> lag(variable); the operator differs depending on whether you are calculating a ratio (/) or a difference (-).

# Data set given in data camp
babynames_fraction <- babynames %>%
  group_by(year) %>% 
  mutate(year_total = sum(number)) %>%
  ungroup() %>% 
  mutate(fraction = number / year_total)

babynames_fraction %>%
  # Arrange the data in order of name, then year 
  arrange(name, year) %>%
  # Group the data by name
  group_by(name) %>%
  # Add a ratio column that contains the ratio of fraction between each year 
  mutate(ratio = fraction / lag(fraction))
## # A tibble: 332,595 × 6
## # Groups:   name [48,040]
##     year name    number year_total   fraction  ratio
##    <dbl> <chr>    <int>      <int>      <dbl>  <dbl>
##  1  2010 Aaban        9    3672066 0.00000245 NA    
##  2  2015 Aaban       15    3648781 0.00000411  1.68 
##  3  1995 Aadam        6    3652750 0.00000164 NA    
##  4  2000 Aadam        6    3767293 0.00000159  0.970
##  5  2005 Aadam        6    3828460 0.00000157  0.984
##  6  2010 Aadam        7    3672066 0.00000191  1.22 
##  7  2015 Aadam       22    3648781 0.00000603  3.16 
##  8  2010 Aadan       11    3672066 0.00000300 NA    
##  9  2015 Aadan       10    3648781 0.00000274  0.915
## 10  2000 Aadarsh      5    3767293 0.00000133 NA    
## # ℹ 332,585 more rows

5.4.5.2 Biggest jumps in a name

Previously, you added a ratio column giving the ratio of a baby name’s frequency between consecutive years, to describe changes in the name’s popularity. Now, you’ll look at a subset of that data, called babynames_ratios_filtered, to look further into the names that experienced the biggest jumps in popularity between consecutive years.

From each name in the data, keep the observation (the year) with the largest ratio; note the data is already grouped by name. Sort the ratio column in descending order. Then filter the fraction column to keep only results greater than or equal to 0.001.

# Dataset given in datacamp
babynames_ratios_filtered <- babynames_fraction %>%
                     arrange(name, year) %>%
                     group_by(name) %>%
                     mutate(ratio = fraction / lag(fraction)) %>%
                     filter(fraction >= 0.00001)


babynames_ratios_filtered %>%
  # Extract the largest ratio from each name 
  slice_max(ratio, n=1) %>%
  # Sort the ratio column in descending order 
  arrange(desc(ratio)) %>%
  # Filter for fractions greater than or equal to 0.001
  filter(fraction >= 0.001)
## # A tibble: 291 × 6
## # Groups:   name [291]
##     year name    number year_total fraction ratio
##    <dbl> <chr>    <int>      <int>    <dbl> <dbl>
##  1  1960 Tammy    14365    4152075  0.00346  70.1
##  2  2005 Nevaeh    4610    3828460  0.00120  45.8
##  3  1940 Brenda    5460    2301630  0.00237  37.5
##  4  1885 Grover     774     240822  0.00321  36.0
##  5  1945 Cheryl    8170    2652029  0.00308  24.9
##  6  1955 Lori      4980    4012691  0.00124  23.2
##  7  2010 Khloe     5411    3672066  0.00147  23.2
##  8  1950 Debra     6189    3502592  0.00177  22.6
##  9  2010 Bentley   4001    3672066  0.00109  22.4
## 10  1935 Marlene   4840    2088487  0.00232  16.8
## # ℹ 281 more rows