3  Tidy Data

Learning Objectives

  • Understand the concept of tidy data (wide and long formats).
  • Be able to reshape data between long and wide formats.
  • Understand the benefits of using tidy data for data manipulation, visualization, and analysis.

3.1 “Wide” and “Long” Formatted Data

Datasets are usually structured in one of two ways:

  • Wide format: each variable has its own column
  • Long format: each observation has its own row

Take a look at this example of a dataset on federal R&D spending by U.S. government department:

fed_spend_wide <- read_csv(here('data', 'fed_spend_wide.csv'))

head(fed_spend_wide)
#> # A tibble: 6 × 15
#>    year   DHS   DOC   DOD   DOE   DOT   EPA   HHS Interior  NASA   NIH   NSF
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>    <dbl> <dbl> <dbl> <dbl>
#> 1  1976     0   819 35696 10882  1142   968  9226     1152 12513  8025  2372
#> 2  1977     0   837 37967 13741  1095   966  9507     1082 12553  8214  2395
#> 3  1978     0   871 37022 15663  1156  1175 10533     1125 12516  8802  2446
#> 4  1979     0   952 37174 15612  1004  1102 10127     1176 13079  9243  2404
#> 5  1980     0   945 37005 15226  1048   903 10045     1082 13837  9093  2407
#> 6  1981     0   829 41737 14798   978   901  9644      990 13276  8580  2300
#> # ℹ 3 more variables: Other <dbl>, USDA <dbl>, VA <dbl>

As the name suggests, this dataset is in wide format, where each department is a column.

Compare this to the long format:

fed_spend_long <- read_csv(here('data', 'fed_spend_long.csv'))

head(fed_spend_long)
#> # A tibble: 6 × 3
#>   department  year rd_budget_mil
#>   <chr>      <dbl>         <dbl>
#> 1 DOD         1976         35696
#> 2 NASA        1976         12513
#> 3 DOE         1976         10882
#> 4 HHS         1976          9226
#> 5 NIH         1976          8025
#> 6 NSF         1976          2372

This is the same dataset, but in long format, where each observation has its own row. Here you will notice that the department names are now variables, and the R&D spending is a single column. The year is repeated for each department.

How do we know which format a dataset is in?

A helpful heuristic is to ask yourself:

Do the names describe the values?

  • If Yes: “Long” format
  • If No: “Wide” format

3.2 Tidy data = “Long” format

When we refer to “tidy data”, we are referring to data in long format that follows the “tidy” principles:

  • Each variable has its own column
  • Each observation has its own row

You can verify that the federal spending dataset is in long format:

head(fed_spend_long)
#> # A tibble: 6 × 3
#>   department  year rd_budget_mil
#>   <chr>      <dbl>         <dbl>
#> 1 DOD         1976         35696
#> 2 NASA        1976         12513
#> 3 DOE         1976         10882
#> 4 HHS         1976          9226
#> 5 NIH         1976          8025
#> 6 NSF         1976          2372

3.3 Reshaping data

We use the pivot_longer() and pivot_wider() functions to reshape data between long and wide formats.

3.3.1 From “long” to “wide” with pivot_wider()

head(fed_spend_long)
#> # A tibble: 6 × 3
#>   department  year rd_budget_mil
#>   <chr>      <dbl>         <dbl>
#> 1 DOD         1976         35696
#> 2 NASA        1976         12513
#> 3 DOE         1976         10882
#> 4 HHS         1976          9226
#> 5 NIH         1976          8025
#> 6 NSF         1976          2372

To convert to wide format, we need to specify which column to convert into the column names and which column to convert into the column values:

fed_spend_wide <- fed_spend_long %>%
  pivot_wider(
    names_from = department,
    values_from = rd_budget_mil
  )

head(fed_spend_wide)
#> # A tibble: 6 × 15
#>    year   DOD  NASA   DOE   HHS   NIH   NSF  USDA Interior   DOT   EPA   DOC
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>    <dbl> <dbl> <dbl> <dbl>
#> 1  1976 35696 12513 10882  9226  8025  2372  1837     1152  1142   968   819
#> 2  1977 37967 12553 13741  9507  8214  2395  1796     1082  1095   966   837
#> 3  1978 37022 12516 15663 10533  8802  2446  1962     1125  1156  1175   871
#> 4  1979 37174 13079 15612 10127  9243  2404  2054     1176  1004  1102   952
#> 5  1980 37005 13837 15226 10045  9093  2407  1887     1082  1048   903   945
#> 6  1981 41737 13276 14798  9644  8580  2300  1964      990   978   901   829
#> # ℹ 3 more variables: DHS <dbl>, VA <dbl>, Other <dbl>

3.3.2 From “wide” to “long” with pivot_longer()

head(fed_spend_wide)
#> # A tibble: 6 × 15
#>    year   DOD  NASA   DOE   HHS   NIH   NSF  USDA Interior   DOT   EPA   DOC
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>    <dbl> <dbl> <dbl> <dbl>
#> 1  1976 35696 12513 10882  9226  8025  2372  1837     1152  1142   968   819
#> 2  1977 37967 12553 13741  9507  8214  2395  1796     1082  1095   966   837
#> 3  1978 37022 12516 15663 10533  8802  2446  1962     1125  1156  1175   871
#> 4  1979 37174 13079 15612 10127  9243  2404  2054     1176  1004  1102   952
#> 5  1980 37005 13837 15226 10045  9093  2407  1887     1082  1048   903   945
#> 6  1981 41737 13276 14798  9644  8580  2300  1964      990   978   901   829
#> # ℹ 3 more variables: DHS <dbl>, VA <dbl>, Other <dbl>

To convert to long format, we need to create new column names for the names and values, and we also need to specify which columns to convert:

fed_spend_long <- fed_spend_wide %>%
  pivot_longer( 
    names_to = "department",
    values_to = "rd_budget_mil",
    cols = DOD:Other
  )

head(fed_spend_long)
#> # A tibble: 6 × 3
#>    year department rd_budget_mil
#>   <dbl> <chr>              <dbl>
#> 1  1976 DOD                35696
#> 2  1976 NASA               12513
#> 3  1976 DOE                10882
#> 4  1976 HHS                 9226
#> 5  1976 NIH                 8025
#> 6  1976 NSF                 2372

You can also set cols by selecting which columns not to use, like this:

fed_spend_long <- fed_spend_wide %>%
  pivot_longer(
    names_to = "department", 
    values_to = "rd_budget_mil",
    cols = -year
  )

head(fed_spend_long)
#> # A tibble: 6 × 3
#>    year department rd_budget_mil
#>   <dbl> <chr>              <dbl>
#> 1  1976 DOD                35696
#> 2  1976 NASA               12513
#> 3  1976 DOE                10882
#> 4  1976 HHS                 9226
#> 5  1976 NIH                 8025
#> 6  1976 NSF                 2372

3.4 Tidy data wrangling

There’s a good reason why we like to keep our data in a long (tidy) format: it lets us use a consistent set of functions for manipulating, summarizing, and visualizing data.

Here’s a quick explanation with cute graphics, by Allison Horst:

3.4.1 Example 1: Total R&D spending in each year

This is a simple enough task, but with our data in wide format, we need to add each column manually like this:

fed_spend_wide %>%
  mutate(total = DHS + DOC + DOD + DOE + DOT + EPA + HHS + Interior + NASA + NIH + NSF + Other + USDA + VA) %>%
  select(year, total)
#> # A tibble: 42 × 2
#>    year total
#>   <dbl> <dbl>
#> 1  1976 86227
#> 2  1977 91807
#> 3  1978 94864
#> 4  1979 96601
#> 5  1980 96305
#> 6  1981 98304
#> # ℹ 36 more rows

This is cumbersome, and prone to error.

An alternative approach is to embrace the tidy data format. We’ll use pivot_longer() to convert the data into long format first, and then we’ll summarise the data:

fed_spend_long <- fed_spend_wide %>%
  pivot_longer(
    names_to = "department", 
    values_to = "rd_budget_mil",
    cols = -year
  ) 

head(fed_spend_long)
#> # A tibble: 6 × 3
#>    year department rd_budget_mil
#>   <dbl> <chr>              <dbl>
#> 1  1976 DOD                35696
#> 2  1976 NASA               12513
#> 3  1976 DOE                10882
#> 4  1976 HHS                 9226
#> 5  1976 NIH                 8025
#> 6  1976 NSF                 2372

Now that our data is in long format, we can use the group_by() and summarise() functions to compute the total R&D spending in each year:

fed_spend_long %>%
  group_by(year) %>%
  summarise(total = sum(rd_budget_mil))
#> # A tibble: 42 × 2
#>    year total
#>   <dbl> <dbl>
#> 1  1976 86227
#> 2  1977 91807
#> 3  1978 94864
#> 4  1979 96601
#> 5  1980 96305
#> 6  1981 98304
#> # ℹ 36 more rows

This is a much cleaner approach, and it’s much less prone to error as we don’t need to manually specify each column that we are summing.

3.4.2 Example 2: Visualizing total spending by department

Let’s see how we can do this with our data in wide format:

head(fed_spend_wide)
#> # A tibble: 6 × 15
#>    year   DOD  NASA   DOE   HHS   NIH   NSF  USDA Interior   DOT   EPA   DOC
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>    <dbl> <dbl> <dbl> <dbl>
#> 1  1976 35696 12513 10882  9226  8025  2372  1837     1152  1142   968   819
#> 2  1977 37967 12553 13741  9507  8214  2395  1796     1082  1095   966   837
#> 3  1978 37022 12516 15663 10533  8802  2446  1962     1125  1156  1175   871
#> 4  1979 37174 13079 15612 10127  9243  2404  2054     1176  1004  1102   952
#> 5  1980 37005 13837 15226 10045  9093  2407  1887     1082  1048   903   945
#> 6  1981 41737 13276 14798  9644  8580  2300  1964      990   978   901   829
#> # ℹ 3 more variables: DHS <dbl>, VA <dbl>, Other <dbl>
ggplot(fed_spend_wide) +
  geom_col(aes(x = rd_budget_mil, y = department)) +
  theme_bw() +
  labs(
      x = "R&D Spending ($Millions)",
      y = "Federal Agency"
  )
#> Error in `geom_col()`:
#> ! Problem while computing aesthetics.
#> ℹ Error occurred in the 1st layer.
#> Caused by error:
#> ! object 'rd_budget_mil' not found

In wide format, we actually cannot plot the total by department, because there is no department variable!

Here we must first convert to long format, and then we can plot the data:

fed_spend_wide %>%
  pivot_longer(
    names_to = "department",
    values_to = "rd_budget_mil",
    cols = -year
  ) %>%
  ggplot() +
  geom_col(aes(x = rd_budget_mil, y = reorder(department, rd_budget_mil))) +
  theme_bw() +
  labs(
    x = "R&D Spending ($Millions)",
    y = "Federal Agency"
  )