4  Cleaning Data

Learning Objectives

  • Merge multiple datasets effectively with joins.
  • Clean, organize, and recode variables.
  • Handle and process date data.

4.1 Merging Data Sets with Joins

Let’s say we have two data sets: state names, and state abbreviations. Now we want to merge them into a data frame so that the state names and abbreviations are in the same row.

head(names)
#>   state_name
#> 1    Alabama
#> 2     Alaska
#> 3    Arizona
#> 4   Arkansas
#> 5 California
#> 6   Colorado
head(abbs)
#>   state_abb
#> 1        AK
#> 2        AL
#> 3        AR
#> 4        AZ
#> 5        CA
#> 6        CO

We might be tempted to simply merge the two data sets using bind_cols(), like this:

result <- bind_cols(names, abbs)
head(result)
#>   state_name state_abb
#> 1    Alabama        AK
#> 2     Alaska        AL
#> 3    Arizona        AR
#> 4   Arkansas        AZ
#> 5 California        CA
#> 6   Colorado        CO

While this looks like it works, if you look closely, you’ll see that the state abbreviations are all mixed up (e.g., Alabama and Alaska are swapped).

This is where joins come into play.

Joining is a way to combine two data sets based on a common variable.

There are three types of joins that we can utilize to properly merge data sets.

  1. inner_join()
  2. left_join() / right_join()
  3. full_join()

We’ll take a look at the different joins applied to the band_members & band_instruments data sets.

band_members
#> # A tibble: 3 × 2
#>   name  band   
#>   <chr> <chr>  
#> 1 Mick  Stones 
#> 2 John  Beatles
#> 3 Paul  Beatles
band_instruments
#> # A tibble: 3 × 2
#>   name  plays 
#>   <chr> <chr> 
#> 1 John  guitar
#> 2 Paul  bass  
#> 3 Keith guitar

4.1.1 inner_join()

In inner joins, only the rows that have a match on the common variable are kept. This gif illustrates the inner join:

Applied to our example, we get:

band_members %>%
    inner_join(band_instruments)
#> # A tibble: 2 × 3
#>   name  band    plays 
#>   <chr> <chr>   <chr> 
#> 1 John  Beatles guitar
#> 2 Paul  Beatles bass

4.1.2 full_join()

In full joins, all rows from both data sets are kept. This gif illustrates the full join:

Applied to our example, we get:

band_members %>%
    full_join(band_instruments)
#> # A tibble: 4 × 3
#>   name  band    plays 
#>   <chr> <chr>   <chr> 
#> 1 Mick  Stones  <NA>  
#> 2 John  Beatles guitar
#> 3 Paul  Beatles bass  
#> 4 Keith <NA>    guitar

Note here that NA values are introduced for the missing matches.

4.1.3 left_join()

In left joins, all rows from the left data set are kept. This gif illustrates the left join:

Applied to our example, we get:

band_members %>%
    left_join(band_instruments)
#> # A tibble: 3 × 3
#>   name  band    plays 
#>   <chr> <chr>   <chr> 
#> 1 Mick  Stones  <NA>  
#> 2 John  Beatles guitar
#> 3 Paul  Beatles bass

Here we get an NA for Mick Stones as we don’t know what instrument he plays (he is not in the band_instruments data set).

4.1.4 right_join()

Right joins are just the opposite of left joins. All rows from the right data set are kept. This gif illustrates the right join:

Applied to our example, we get:

band_members %>%
    right_join(band_instruments)
#> # A tibble: 3 × 3
#>   name  band    plays 
#>   <chr> <chr>   <chr> 
#> 1 John  Beatles guitar
#> 2 Paul  Beatles bass  
#> 3 Keith <NA>    guitar

4.1.5 We can also specify the joining variable name

If you don’t specify the joining variable name, R will use the first variable name that appears in both data sets:

band_members %>%
    left_join(band_instruments)
#> Joining with `by = join_by(name)`
#> # A tibble: 3 × 3
#>   name  band    plays 
#>   <chr> <chr>   <chr> 
#> 1 Mick  Stones  <NA>  
#> 2 John  Beatles guitar
#> 3 Paul  Beatles bass

Here you see the message Joining with `by = join_by(name)`, because R realized that the name variable appears in each dataset.

We can be more explicit by specifying the joining variable name, like this:

band_members %>%
    left_join(
        band_instruments,
        by = 'name'
    )
#> # A tibble: 3 × 3
#>   name  band    plays 
#>   <chr> <chr>   <chr> 
#> 1 Mick  Stones  <NA>  
#> 2 John  Beatles guitar
#> 3 Paul  Beatles bass

Here we get no message because we told R which variable to use to join.

4.1.6 What if the names differ?

If the matching column has a different name in each data frame, use by = c("left_name" = "joining_name") to specify the correct joining name.

For example, in the two data frames below we have name in one of them and artist in the other. Since these don’t match, R won’t know which column to use by default to do the join. Here is how we can tell it that these two columns should be used for matching:

band_members
#> # A tibble: 3 × 2
#>   name  band   
#>   <chr> <chr>  
#> 1 Mick  Stones 
#> 2 John  Beatles
#> 3 Paul  Beatles
band_instruments2
#> # A tibble: 3 × 2
#>   artist plays 
#>   <chr>  <chr> 
#> 1 John   guitar
#> 2 Paul   bass  
#> 3 Keith  guitar
band_members %>%
  left_join(
    band_instruments2,
    by = c("name" = "artist")
  ) 
#> # A tibble: 3 × 3
#>   name  band    plays 
#>   <chr> <chr>   <chr> 
#> 1 Mick  Stones  <NA>  
#> 2 John  Beatles guitar
#> 3 Paul  Beatles bass

We can also just rename the joining variable before joining, like this:

band_members %>%
  rename(artist = name) %>%
  left_join(
    band_instruments2,
    by = "artist"
  )
#> # A tibble: 3 × 3
#>   artist band    plays 
#>   <chr>  <chr>   <chr> 
#> 1 Mick   Stones  <NA>  
#> 2 John   Beatles guitar
#> 3 Paul   Beatles bass

4.2 Variable Types and Names

4.2.1 Fixing Variable Types

Always check variable types after reading in data!

A lot of times variables get encoded into a data type that is not what you would want after reading in an external data file. Take a look at this data set:

wind <- read_excel(here::here('data', 'US_State_Wind_Energy_Facts_2018.xlsx'))

glimpse(wind)
#> Rows: 50
#> Columns: 7
#> $ Ranking                         <chr> "1.0", "2.0", "3.0", "4.0", "5.0", …
#> $ State                           <chr> "TEXAS", "OKLAHOMA", "IOWA", "CALIF…
#> $ `Installed Capacity (MW)`       <dbl> 23262, 7495, 7312, 5686, 5110, 4464…
#> $ `Equivalent Homes Powered`      <chr> "6235000.0", "2268000.0", "1935000.…
#> $ `Total Investment ($ Millions)` <chr> "42000.0", "13700.0", "14200.0", "1…
#> $ `Wind Projects Online`          <dbl> 136, 45, 107, 104, 35, 49, 98, 31, …
#> $ `# of Wind Turbines`            <chr> "12750.0", "3717.0", "4145.0", "697…

Here we see that many variables that should be numeric are actually characters (e.g. Ranking, Equivalent Homes Powered, Total Investment ($ Millions), and # of Wind Turbines).

To address this, one of the first things we usually do is convert incorrect data types to the appropriate ones, like this:

wind <- read_excel(here::here('data', 'US_State_Wind_Energy_Facts_2018.xlsx')) %>%
  mutate(
    Ranking = as.numeric(Ranking),
    `Equivalent Homes Powered` = as.numeric(`Equivalent Homes Powered`),
    `Total Investment ($ Millions)` = as.numeric(`Total Investment ($ Millions)`),
    `# of Wind Turbines` = as.numeric(`# of Wind Turbines`)
  )
    
glimpse(wind)
#> Rows: 50
#> Columns: 7
#> $ Ranking                         <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, …
#> $ State                           <chr> "TEXAS", "OKLAHOMA", "IOWA", "CALIF…
#> $ `Installed Capacity (MW)`       <dbl> 23262, 7495, 7312, 5686, 5110, 4464…
#> $ `Equivalent Homes Powered`      <dbl> 6235000, 2268000, 1935000, 1298000,…
#> $ `Total Investment ($ Millions)` <dbl> 42000, 13700, 14200, 12600, 9400, 8…
#> $ `Wind Projects Online`          <dbl> 136, 45, 107, 104, 35, 49, 98, 31, …
#> $ `# of Wind Turbines`            <dbl> 12750, 3717, 4145, 6972, 2795, 2632…

Be careful converting strings to numbers!

If you use as.numeric() (as we did above), you can get NAs if there are any non-numeric characters in the data. For example:

This works great:

as.numeric(c("2.1", "3.7", "4.50"))
#> [1] 2.1 3.7 4.5

The $ symbol breaks this:

as.numeric(c("$2.1", "$3.7", "$4.50"))
#> [1] NA NA NA

A solution is to use parse_number() instead, which parses strings for numbers:

This works great:

parse_number(c("2.1", "3.7", "4.50"))
#> [1] 2.1 3.7 4.5

The $ symbol breaks this:

parse_number(c("$2.1", "$3.7", "$4.50"))
#> [1] 2.1 3.7 4.5

4.2.2 Fixing variable names

You might also notice that the variables names in the wind dataset are long and messy:

wind <- read_excel(here::here('data', 'US_State_Wind_Energy_Facts_2018.xlsx'))

names(wind)
#> [1] "Ranking"                       "State"                        
#> [3] "Installed Capacity (MW)"       "Equivalent Homes Powered"     
#> [5] "Total Investment ($ Millions)" "Wind Projects Online"         
#> [7] "# of Wind Turbines"

We can easily fix these with janitor::clean_names():

library(janitor)

wind <- read_excel(here::here('data', 'US_State_Wind_Energy_Facts_2018.xlsx')) %>%
  clean_names()

names(wind)
#> [1] "ranking"                   "state"                    
#> [3] "installed_capacity_mw"     "equivalent_homes_powered" 
#> [5] "total_investment_millions" "wind_projects_online"     
#> [7] "number_of_wind_turbines"

We can even specify the style of the names:

library(janitor)

wind <- read_excel(here::here('data', 'US_State_Wind_Energy_Facts_2018.xlsx')) %>%
  clean_names(case = 'lower_camel')

names(wind)
#> [1] "ranking"                 "state"                  
#> [3] "installedCapacityMw"     "equivalentHomesPowered" 
#> [5] "totalInvestmentMillions" "windProjectsOnline"     
#> [7] "numberOfWindTurbines"

4.3 Using select() to rename and reorder variables

Here’s an example data set on the sleeping patterns of different mammals:

glimpse(msleep)
#> Rows: 83
#> Columns: 11
#> $ name         <chr> "Cheetah", "Owl monkey", "Mountain beaver", "Greater s…
#> $ genus        <chr> "Acinonyx", "Aotus", "Aplodontia", "Blarina", "Bos", "…
#> $ vore         <chr> "carni", "omni", "herbi", "omni", "herbi", "herbi", "c…
#> $ order        <chr> "Carnivora", "Primates", "Rodentia", "Soricomorpha", "…
#> $ conservation <chr> "lc", NA, "nt", "lc", "domesticated", NA, "vu", NA, "d…
#> $ sleep_total  <dbl> 12.1, 17.0, 14.4, 14.9, 4.0, 14.4, 8.7, 7.0, 10.1, 3.0…
#> $ sleep_rem    <dbl> NA, 1.8, 2.4, 2.3, 0.7, 2.2, 1.4, NA, 2.9, NA, 0.6, 0.…
#> $ sleep_cycle  <dbl> NA, NA, NA, 0.1333333, 0.6666667, 0.7666667, 0.3833333…
#> $ awake        <dbl> 11.9, 7.0, 9.6, 9.1, 20.0, 9.6, 15.3, 17.0, 13.9, 21.0…
#> $ brainwt      <dbl> NA, 0.01550, NA, 0.00029, 0.42300, NA, NA, NA, 0.07000…
#> $ bodywt       <dbl> 50.000, 0.480, 1.350, 0.019, 600.000, 3.850, 20.490, 0…

The primary use of select() is to choose which columns to keep or drop:

Selecting variables to keep

msleep %>%
  select(name:order, sleep_total:sleep_cycle) %>%
  glimpse()
#> Rows: 83
#> Columns: 7
#> $ name        <chr> "Cheetah", "Owl monkey", "Mountain beaver", "Greater sh…
#> $ genus       <chr> "Acinonyx", "Aotus", "Aplodontia", "Blarina", "Bos", "B…
#> $ vore        <chr> "carni", "omni", "herbi", "omni", "herbi", "herbi", "ca…
#> $ order       <chr> "Carnivora", "Primates", "Rodentia", "Soricomorpha", "A…
#> $ sleep_total <dbl> 12.1, 17.0, 14.4, 14.9, 4.0, 14.4, 8.7, 7.0, 10.1, 3.0,…
#> $ sleep_rem   <dbl> NA, 1.8, 2.4, 2.3, 0.7, 2.2, 1.4, NA, 2.9, NA, 0.6, 0.8…
#> $ sleep_cycle <dbl> NA, NA, NA, 0.1333333, 0.6666667, 0.7666667, 0.3833333,…

Selecting variables to drop

msleep %>%
  select(-(name:order)) %>%
  glimpse()
#> Rows: 83
#> Columns: 7
#> $ conservation <chr> "lc", NA, "nt", "lc", "domesticated", NA, "vu", NA, "d…
#> $ sleep_total  <dbl> 12.1, 17.0, 14.4, 14.9, 4.0, 14.4, 8.7, 7.0, 10.1, 3.0…
#> $ sleep_rem    <dbl> NA, 1.8, 2.4, 2.3, 0.7, 2.2, 1.4, NA, 2.9, NA, 0.6, 0.…
#> $ sleep_cycle  <dbl> NA, NA, NA, 0.1333333, 0.6666667, 0.7666667, 0.3833333…
#> $ awake        <dbl> 11.9, 7.0, 9.6, 9.1, 20.0, 9.6, 15.3, 17.0, 13.9, 21.0…
#> $ brainwt      <dbl> NA, 0.01550, NA, 0.00029, 0.42300, NA, NA, NA, 0.07000…
#> $ bodywt       <dbl> 50.000, 0.480, 1.350, 0.019, 600.000, 3.850, 20.490, 0…

We can also select columns based on partial column names

For example, we can select columns that start with “sleep”:

msleep %>%
  select(name, starts_with("sleep")) %>%
  glimpse()
#> Rows: 83
#> Columns: 4
#> $ name        <chr> "Cheetah", "Owl monkey", "Mountain beaver", "Greater sh…
#> $ sleep_total <dbl> 12.1, 17.0, 14.4, 14.9, 4.0, 14.4, 8.7, 7.0, 10.1, 3.0,…
#> $ sleep_rem   <dbl> NA, 1.8, 2.4, 2.3, 0.7, 2.2, 1.4, NA, 2.9, NA, 0.6, 0.8…
#> $ sleep_cycle <dbl> NA, NA, NA, 0.1333333, 0.6666667, 0.7666667, 0.3833333,…

Or we can select columns that contain “eep” and end with “wt”:

msleep %>%
  select(contains("eep"), ends_with("wt")) %>%
  glimpse()
#> Rows: 83
#> Columns: 5
#> $ sleep_total <dbl> 12.1, 17.0, 14.4, 14.9, 4.0, 14.4, 8.7, 7.0, 10.1, 3.0,…
#> $ sleep_rem   <dbl> NA, 1.8, 2.4, 2.3, 0.7, 2.2, 1.4, NA, 2.9, NA, 0.6, 0.8…
#> $ sleep_cycle <dbl> NA, NA, NA, 0.1333333, 0.6666667, 0.7666667, 0.3833333,…
#> $ brainwt     <dbl> NA, 0.01550, NA, 0.00029, 0.42300, NA, NA, NA, 0.07000,…
#> $ bodywt      <dbl> 50.000, 0.480, 1.350, 0.019, 600.000, 3.850, 20.490, 0.…

You can also use select() to select columns based on their data type

Select only numeric columns:

msleep %>%
    select_if(is.numeric) %>%
    glimpse()
#> Rows: 83
#> Columns: 6
#> $ sleep_total <dbl> 12.1, 17.0, 14.4, 14.9, 4.0, 14.4, 8.7, 7.0, 10.1, 3.0,…
#> $ sleep_rem   <dbl> NA, 1.8, 2.4, 2.3, 0.7, 2.2, 1.4, NA, 2.9, NA, 0.6, 0.8…
#> $ sleep_cycle <dbl> NA, NA, NA, 0.1333333, 0.6666667, 0.7666667, 0.3833333,…
#> $ awake       <dbl> 11.9, 7.0, 9.6, 9.1, 20.0, 9.6, 15.3, 17.0, 13.9, 21.0,…
#> $ brainwt     <dbl> NA, 0.01550, NA, 0.00029, 0.42300, NA, NA, NA, 0.07000,…
#> $ bodywt      <dbl> 50.000, 0.480, 1.350, 0.019, 600.000, 3.850, 20.490, 0.…

Select only character columns:

msleep %>%
    select_if(is.character) %>%
    glimpse()
#> Rows: 83
#> Columns: 5
#> $ name         <chr> "Cheetah", "Owl monkey", "Mountain beaver", "Greater s…
#> $ genus        <chr> "Acinonyx", "Aotus", "Aplodontia", "Blarina", "Bos", "…
#> $ vore         <chr> "carni", "omni", "herbi", "omni", "herbi", "herbi", "c…
#> $ order        <chr> "Carnivora", "Primates", "Rodentia", "Soricomorpha", "…
#> $ conservation <chr> "lc", NA, "nt", "lc", "domesticated", NA, "vu", NA, "d…

You can also use select() to reorder variables. The key is to use everything(), which tells R to keep all the other variables in the data frame.

For example, if we wanted to move the conservation and awake variables to the first two columns, we could do this:

msleep %>%
    select(conservation, awake, everything()) %>%
    glimpse()
#> Rows: 83
#> Columns: 11
#> $ conservation <chr> "lc", NA, "nt", "lc", "domesticated", NA, "vu", NA, "d…
#> $ awake        <dbl> 11.9, 7.0, 9.6, 9.1, 20.0, 9.6, 15.3, 17.0, 13.9, 21.0…
#> $ name         <chr> "Cheetah", "Owl monkey", "Mountain beaver", "Greater s…
#> $ genus        <chr> "Acinonyx", "Aotus", "Aplodontia", "Blarina", "Bos", "…
#> $ vore         <chr> "carni", "omni", "herbi", "omni", "herbi", "herbi", "c…
#> $ order        <chr> "Carnivora", "Primates", "Rodentia", "Soricomorpha", "…
#> $ sleep_total  <dbl> 12.1, 17.0, 14.4, 14.9, 4.0, 14.4, 8.7, 7.0, 10.1, 3.0…
#> $ sleep_rem    <dbl> NA, 1.8, 2.4, 2.3, 0.7, 2.2, 1.4, NA, 2.9, NA, 0.6, 0.…
#> $ sleep_cycle  <dbl> NA, NA, NA, 0.1333333, 0.6666667, 0.7666667, 0.3833333…
#> $ brainwt      <dbl> NA, 0.01550, NA, 0.00029, 0.42300, NA, NA, NA, 0.07000…
#> $ bodywt       <dbl> 50.000, 0.480, 1.350, 0.019, 600.000, 3.850, 20.490, 0…

You can also use select() to rename variables.

The pattern here is new = old:

msleep %>%
  select(
    animal = name,
    extinction_threat = conservation
  ) %>%
  glimpse()
#> Rows: 83
#> Columns: 2
#> $ animal            <chr> "Cheetah", "Owl monkey", "Mountain beaver", "Grea…
#> $ extinction_threat <chr> "lc", NA, "nt", "lc", "domesticated", NA, "vu", N…

Notice however that this drops everything else.

If you wanted to just rename them but keep all other variables, use rename():

msleep %>%
  rename(
    animal = name,
    extinction_threat = conservation
  ) %>%
  glimpse()
#> Rows: 83
#> Columns: 11
#> $ animal            <chr> "Cheetah", "Owl monkey", "Mountain beaver", "Grea…
#> $ genus             <chr> "Acinonyx", "Aotus", "Aplodontia", "Blarina", "Bo…
#> $ vore              <chr> "carni", "omni", "herbi", "omni", "herbi", "herbi…
#> $ order             <chr> "Carnivora", "Primates", "Rodentia", "Soricomorph…
#> $ extinction_threat <chr> "lc", NA, "nt", "lc", "domesticated", NA, "vu", N…
#> $ sleep_total       <dbl> 12.1, 17.0, 14.4, 14.9, 4.0, 14.4, 8.7, 7.0, 10.1…
#> $ sleep_rem         <dbl> NA, 1.8, 2.4, 2.3, 0.7, 2.2, 1.4, NA, 2.9, NA, 0.…
#> $ sleep_cycle       <dbl> NA, NA, NA, 0.1333333, 0.6666667, 0.7666667, 0.38…
#> $ awake             <dbl> 11.9, 7.0, 9.6, 9.1, 20.0, 9.6, 15.3, 17.0, 13.9,…
#> $ brainwt           <dbl> NA, 0.01550, NA, 0.00029, 0.42300, NA, NA, NA, 0.…
#> $ bodywt            <dbl> 50.000, 0.480, 1.350, 0.019, 600.000, 3.850, 20.4…

4.4 Recoding Variables

4.4.1 Recoding with ifelse()

The ifelse() function takes the pattern:

if <condition>, <value if condition is TRUE>, <value if condition if FALSE>

Example: Create a variable, cost_high, that is TRUE if the repair costs were greater than the median costs and FALSE otherwise.

wildlife_impacts1 <- wildlife_impacts %>%
  rename(cost = cost_repairs_infl_adj) %>%
  filter(!is.na(cost)) %>%
  mutate(
    cost_median = median(cost),
    cost_high = ifelse(cost > cost_median, TRUE, FALSE)
  )

wildlife_impacts1 %>%
  select(cost, cost_median, cost_high) %>%
  head()
#> # A tibble: 6 × 3
#>     cost cost_median cost_high
#>    <dbl>       <dbl> <lgl>    
#> 1   1000       26783 FALSE    
#> 2    200       26783 FALSE    
#> 3  10000       26783 FALSE    
#> 4 100000       26783 TRUE     
#> 5  20000       26783 FALSE    
#> 6 487000       26783 TRUE

4.4.2 Recoding with nested ifelse()

Often times you may want to make a category variable that takes different values based on different conditions. You can “nest” multiple ifelse() statements to achieve this.

For example, let’s say we wanted to create a variable, season, based on the incident_month variable. We could do this like so:

wildlife_impacts2 <- wildlife_impacts %>%
  mutate(season = ifelse(
    incident_month %in% c(3, 4, 5), 'spring', ifelse(
    incident_month %in% c(6, 7, 8), 'summer', ifelse(
    incident_month %in% c(9, 10, 11), 'fall', 'winter')))
  ) 

wildlife_impacts2 %>%
  distinct(incident_month, season) %>%
  head()
#> # A tibble: 6 × 2
#>   incident_month season
#>            <dbl> <chr> 
#> 1             12 winter
#> 2             11 fall  
#> 3             10 fall  
#> 4              9 fall  
#> 5              8 summer
#> 6              7 summer

Here we’re entering into another ifelse() statement as the “else” condition of an earlier condition.

4.4.3 Recoding with case_when()

You can achieve the same thing as the example about usin ghte case_when() function, which produces slightly cleaner code.

Note: If you don’t include the final TRUE ~ 'winter' condition, you’ll get NA for those cases.

wildlife_impacts2 <- wildlife_impacts %>%
  mutate(season = case_when(
    incident_month %in% c(3, 4, 5) ~ 'spring',
    incident_month %in% c(6, 7, 8) ~ 'summer',
    incident_month %in% c(9, 10, 11) ~ 'fall',
    TRUE ~ 'winter')
  ) 

wildlife_impacts2 %>%
  distinct(incident_month, season) %>%
  head()
#> # A tibble: 6 × 2
#>   incident_month season
#>            <dbl> <chr> 
#> 1             12 winter
#> 2             11 fall  
#> 3             10 fall  
#> 4              9 fall  
#> 5              8 summer
#> 6              7 summer

4.4.4 Recoding with case_when() and between()

Here’s another approach to achieve the same thing but using the between() function to note any months between a start and end month:

wildlife_impacts2 <- wildlife_impacts %>%
  mutate(season = case_when(
    between(incident_month, 3, 5) ~ 'spring',
    between(incident_month, 6, 8) ~ 'summer',
    between(incident_month, 9, 11) ~ 'fall',
    TRUE ~ 'winter') 
  )

wildlife_impacts2 %>%
    distinct(incident_month, season) %>%
    head()
#> # A tibble: 6 × 2
#>   incident_month season
#>            <dbl> <chr> 
#> 1             12 winter
#> 2             11 fall  
#> 3             10 fall  
#> 4              9 fall  
#> 5              8 summer
#> 6              7 summer

4.4.5 Break a single variable into two with separate()

The tb_rates data set contains information on tuberculosis rates in different countries. It looks like this:

tb_rates
#> # A tibble: 6 × 3
#>   country      year rate             
#>   <chr>       <dbl> <chr>            
#> 1 Afghanistan  1999 745/19987071     
#> 2 Afghanistan  2000 2666/20595360    
#> 3 Brazil       1999 37737/172006362  
#> 4 Brazil       2000 80488/174504898  
#> 5 China        1999 212258/1272915272
#> 6 China        2000 213766/1280428583

We can break the rate variable into two separate variables, cases and population, using the separate() function:

tb_rates %>%
  separate(rate, into = c("cases", "population"))
#> # A tibble: 6 × 4
#>   country      year cases  population
#>   <chr>       <dbl> <chr>  <chr>     
#> 1 Afghanistan  1999 745    19987071  
#> 2 Afghanistan  2000 2666   20595360  
#> 3 Brazil       1999 37737  172006362 
#> 4 Brazil       2000 80488  174504898 
#> 5 China        1999 212258 1272915272
#> 6 China        2000 213766 1280428583

You can also specify the separator:

tb_rates %>%
  separate(
      rate,
      into = c("cases", "population"),
      sep = "/"
  )
#> # A tibble: 6 × 4
#>   country      year cases  population
#>   <chr>       <dbl> <chr>  <chr>     
#> 1 Afghanistan  1999 745    19987071  
#> 2 Afghanistan  2000 2666   20595360  
#> 3 Brazil       1999 37737  172006362 
#> 4 Brazil       2000 80488  174504898 
#> 5 China        1999 212258 1272915272
#> 6 China        2000 213766 1280428583

You can also convert the new variables to numeric by adding convert = TRUE:

tb_rates %>%
  separate(
    rate, 
    into = c("cases", "population"),
    sep = "/", 
    convert = TRUE
  )
#> # A tibble: 6 × 4
#>   country      year  cases population
#>   <chr>       <dbl>  <int>      <int>
#> 1 Afghanistan  1999    745   19987071
#> 2 Afghanistan  2000   2666   20595360
#> 3 Brazil       1999  37737  172006362
#> 4 Brazil       2000  80488  174504898
#> 5 China        1999 212258 1272915272
#> 6 China        2000 213766 1280428583

4.5 Dealing with Dates Data

Dates are notoriously annoying to deal with.

4.5.1 Create Dates from Strings

To deal with dates, we’ll use the lubridate package.

Check out the lubridate cheat sheet

The key concept to remember with lubridate is:

order is the ONLY thing that matters!

The package has special functions that convert date strings into formal date formats. The function names are based on the order of the date:

Year-Month-Day

ymd('2020-02-26')
#> [1] "2020-02-26"
ymd('2020 Feb 26')
#> [1] "2020-02-26"
ymd('2020 Feb. 26')
#> [1] "2020-02-26"
ymd('2020 february 26')
#> [1] "2020-02-26"

Month-Day-Year

mdy('February 26, 2020')
#> [1] "2020-02-26"
mdy('Feb. 26, 2020')
#> [1] "2020-02-26"
mdy('Feb 26 2020')
#> [1] "2020-02-26"

Day-Month-Year

dmy('26 February 2020')
#> [1] "2020-02-26"
dmy('26 Feb. 2020')
#> [1] "2020-02-26"
dmy('26 Feb, 2020')
#> [1] "2020-02-26"

4.5.2 Extracting information from dates

The today() function returns the current date:

date <- today()
date
#> [1] "2024-09-06"

You can extract different components of the date using the year(), month(), and day() functions:

year(date)
#> [1] 2024
month(date)
#> [1] 9
day(date)
#> [1] 6

You can also get the month name in a specific format using the month() function:

month(date, label = TRUE, abbr = FALSE)
#> [1] September
#> 12 Levels: January < February < March < April < May < June < ... < December

You can also get the weekday name in a specific format using the wday() function:

wday(date)
#> [1] 6
wday(date)
#> [1] 6
wday(date, label = TRUE, abbr = TRUE)
#> [1] Fri
#> Levels: Sun < Mon < Tue < Wed < Thu < Fri < Sat

The same functions can also be used to modify elements of dates:

date <- today()
date
#> [1] "2024-09-06"
# Change the year
year(date) <- 2016
date 
#> [1] "2016-09-06"
# Change the day
day(date) <- 30
date
#> [1] "2016-09-30"