head(names)
#> state_name
#> 1 Alabama
#> 2 Alaska
#> 3 Arizona
#> 4 Arkansas
#> 5 California
#> 6 Colorado
head(abbs)
#> state_abb
#> 1 AK
#> 2 AL
#> 3 AR
#> 4 AZ
#> 5 CA
#> 6 CO
Learning Objectives
- Merge multiple datasets effectively with joins.
- Clean, organize, and recode variables.
- Handle and process date data.
Let’s say we have two data sets: state names, and state abbreviations. Now we want to merge them into a data frame so that the state names and abbreviations are in the same row.
head(names)
#> state_name
#> 1 Alabama
#> 2 Alaska
#> 3 Arizona
#> 4 Arkansas
#> 5 California
#> 6 Colorado
head(abbs)
#> state_abb
#> 1 AK
#> 2 AL
#> 3 AR
#> 4 AZ
#> 5 CA
#> 6 CO
We might be tempted to simply merge the two data sets using bind_cols()
, like this:
<- bind_cols(names, abbs)
result head(result)
#> state_name state_abb
#> 1 Alabama AK
#> 2 Alaska AL
#> 3 Arizona AR
#> 4 Arkansas AZ
#> 5 California CA
#> 6 Colorado CO
While this looks like it works, if you look closely, you’ll see that the state abbreviations are all mixed up (e.g., Alabama and Alaska are swapped).
This is where joins come into play.
Joining is a way to combine two data sets based on a common variable.
There are three types of joins that we can utilize to properly merge data sets.
inner_join()
left_join()
/ right_join()
full_join()
We’ll take a look at the different joins applied to the band_members
& band_instruments
data sets.
band_members
#> # A tibble: 3 × 2
#> name band
#> <chr> <chr>
#> 1 Mick Stones
#> 2 John Beatles
#> 3 Paul Beatles
band_instruments
#> # A tibble: 3 × 2
#> name plays
#> <chr> <chr>
#> 1 John guitar
#> 2 Paul bass
#> 3 Keith guitar
inner_join()
In inner joins, only the rows that have a match on the common variable are kept. This gif illustrates the inner join:
Applied to our example, we get:
%>%
band_members inner_join(band_instruments)
#> # A tibble: 2 × 3
#> name band plays
#> <chr> <chr> <chr>
#> 1 John Beatles guitar
#> 2 Paul Beatles bass
full_join()
In full joins, all rows from both data sets are kept. This gif illustrates the full join:
Applied to our example, we get:
%>%
band_members full_join(band_instruments)
#> # A tibble: 4 × 3
#> name band plays
#> <chr> <chr> <chr>
#> 1 Mick Stones <NA>
#> 2 John Beatles guitar
#> 3 Paul Beatles bass
#> 4 Keith <NA> guitar
Note here that NA
values are introduced for the missing matches.
left_join()
In left joins, all rows from the left data set are kept. This gif illustrates the left join:
Applied to our example, we get:
%>%
band_members left_join(band_instruments)
#> # A tibble: 3 × 3
#> name band plays
#> <chr> <chr> <chr>
#> 1 Mick Stones <NA>
#> 2 John Beatles guitar
#> 3 Paul Beatles bass
Here we get an NA
for Mick Stones as we don’t know what instrument he plays (he is not in the band_instruments
data set).
right_join()
Right joins are just the opposite of left joins. All rows from the right data set are kept. This gif illustrates the right join:
Applied to our example, we get:
%>%
band_members right_join(band_instruments)
#> # A tibble: 3 × 3
#> name band plays
#> <chr> <chr> <chr>
#> 1 John Beatles guitar
#> 2 Paul Beatles bass
#> 3 Keith <NA> guitar
If you don’t specify the joining variable name, R will use the first variable name that appears in both data sets:
%>%
band_members left_join(band_instruments)
#> Joining with `by = join_by(name)`
#> # A tibble: 3 × 3
#> name band plays
#> <chr> <chr> <chr>
#> 1 Mick Stones <NA>
#> 2 John Beatles guitar
#> 3 Paul Beatles bass
Here you see the message Joining with `by = join_by(name)`
, because R realized that the name
variable appears in each dataset.
We can be more explicit by specifying the joining variable name, like this:
%>%
band_members left_join(
band_instruments,by = 'name'
)
#> # A tibble: 3 × 3
#> name band plays
#> <chr> <chr> <chr>
#> 1 Mick Stones <NA>
#> 2 John Beatles guitar
#> 3 Paul Beatles bass
Here we get no message because we told R which variable to use to join.
If the matching column has a different name in each data frame, use by = c("left_name" = "joining_name")
to specify the correct joining name.
For example, in the two data frames below we have name
in one of them and artist
in the other. Since these don’t match, R won’t know which column to use by default to do the join. Here is how we can tell it that these two columns should be used for matching:
band_members
#> # A tibble: 3 × 2
#> name band
#> <chr> <chr>
#> 1 Mick Stones
#> 2 John Beatles
#> 3 Paul Beatles
band_instruments2
#> # A tibble: 3 × 2
#> artist plays
#> <chr> <chr>
#> 1 John guitar
#> 2 Paul bass
#> 3 Keith guitar
%>%
band_members left_join(
band_instruments2,by = c("name" = "artist")
)
#> # A tibble: 3 × 3
#> name band plays
#> <chr> <chr> <chr>
#> 1 Mick Stones <NA>
#> 2 John Beatles guitar
#> 3 Paul Beatles bass
We can also just rename the joining variable before joining, like this:
%>%
band_members rename(artist = name) %>%
left_join(
band_instruments2,by = "artist"
)
#> # A tibble: 3 × 3
#> artist band plays
#> <chr> <chr> <chr>
#> 1 Mick Stones <NA>
#> 2 John Beatles guitar
#> 3 Paul Beatles bass
Always check variable types after reading in data!
A lot of times variables get encoded into a data type that is not what you would want after reading in an external data file. Take a look at this data set:
<- read_excel(here::here('data', 'US_State_Wind_Energy_Facts_2018.xlsx'))
wind
glimpse(wind)
#> Rows: 50
#> Columns: 7
#> $ Ranking <chr> "1.0", "2.0", "3.0", "4.0", "5.0", …
#> $ State <chr> "TEXAS", "OKLAHOMA", "IOWA", "CALIF…
#> $ `Installed Capacity (MW)` <dbl> 23262, 7495, 7312, 5686, 5110, 4464…
#> $ `Equivalent Homes Powered` <chr> "6235000.0", "2268000.0", "1935000.…
#> $ `Total Investment ($ Millions)` <chr> "42000.0", "13700.0", "14200.0", "1…
#> $ `Wind Projects Online` <dbl> 136, 45, 107, 104, 35, 49, 98, 31, …
#> $ `# of Wind Turbines` <chr> "12750.0", "3717.0", "4145.0", "697…
Here we see that many variables that should be numeric are actually characters (e.g. Ranking
, Equivalent Homes Powered
, Total Investment ($ Millions)
, and # of Wind Turbines
).
To address this, one of the first things we usually do is convert incorrect data types to the appropriate ones, like this:
<- read_excel(here::here('data', 'US_State_Wind_Energy_Facts_2018.xlsx')) %>%
wind mutate(
Ranking = as.numeric(Ranking),
`Equivalent Homes Powered` = as.numeric(`Equivalent Homes Powered`),
`Total Investment ($ Millions)` = as.numeric(`Total Investment ($ Millions)`),
`# of Wind Turbines` = as.numeric(`# of Wind Turbines`)
)
glimpse(wind)
#> Rows: 50
#> Columns: 7
#> $ Ranking <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, …
#> $ State <chr> "TEXAS", "OKLAHOMA", "IOWA", "CALIF…
#> $ `Installed Capacity (MW)` <dbl> 23262, 7495, 7312, 5686, 5110, 4464…
#> $ `Equivalent Homes Powered` <dbl> 6235000, 2268000, 1935000, 1298000,…
#> $ `Total Investment ($ Millions)` <dbl> 42000, 13700, 14200, 12600, 9400, 8…
#> $ `Wind Projects Online` <dbl> 136, 45, 107, 104, 35, 49, 98, 31, …
#> $ `# of Wind Turbines` <dbl> 12750, 3717, 4145, 6972, 2795, 2632…
Be careful converting strings to numbers!
If you use as.numeric()
(as we did above), you can get NAs if there are any non-numeric characters in the data. For example:
This works great:
as.numeric(c("2.1", "3.7", "4.50"))
#> [1] 2.1 3.7 4.5
The $
symbol breaks this:
as.numeric(c("$2.1", "$3.7", "$4.50"))
#> [1] NA NA NA
A solution is to use parse_number()
instead, which parses strings for numbers:
This works great:
parse_number(c("2.1", "3.7", "4.50"))
#> [1] 2.1 3.7 4.5
The $
symbol breaks this:
parse_number(c("$2.1", "$3.7", "$4.50"))
#> [1] 2.1 3.7 4.5
You might also notice that the variables names in the wind
dataset are long and messy:
<- read_excel(here::here('data', 'US_State_Wind_Energy_Facts_2018.xlsx'))
wind
names(wind)
#> [1] "Ranking" "State"
#> [3] "Installed Capacity (MW)" "Equivalent Homes Powered"
#> [5] "Total Investment ($ Millions)" "Wind Projects Online"
#> [7] "# of Wind Turbines"
We can easily fix these with janitor::clean_names()
:
library(janitor)
<- read_excel(here::here('data', 'US_State_Wind_Energy_Facts_2018.xlsx')) %>%
wind clean_names()
names(wind)
#> [1] "ranking" "state"
#> [3] "installed_capacity_mw" "equivalent_homes_powered"
#> [5] "total_investment_millions" "wind_projects_online"
#> [7] "number_of_wind_turbines"
We can even specify the style of the names:
library(janitor)
<- read_excel(here::here('data', 'US_State_Wind_Energy_Facts_2018.xlsx')) %>%
wind clean_names(case = 'lower_camel')
names(wind)
#> [1] "ranking" "state"
#> [3] "installedCapacityMw" "equivalentHomesPowered"
#> [5] "totalInvestmentMillions" "windProjectsOnline"
#> [7] "numberOfWindTurbines"
select()
to rename and reorder variablesHere’s an example data set on the sleeping patterns of different mammals:
glimpse(msleep)
#> Rows: 83
#> Columns: 11
#> $ name <chr> "Cheetah", "Owl monkey", "Mountain beaver", "Greater s…
#> $ genus <chr> "Acinonyx", "Aotus", "Aplodontia", "Blarina", "Bos", "…
#> $ vore <chr> "carni", "omni", "herbi", "omni", "herbi", "herbi", "c…
#> $ order <chr> "Carnivora", "Primates", "Rodentia", "Soricomorpha", "…
#> $ conservation <chr> "lc", NA, "nt", "lc", "domesticated", NA, "vu", NA, "d…
#> $ sleep_total <dbl> 12.1, 17.0, 14.4, 14.9, 4.0, 14.4, 8.7, 7.0, 10.1, 3.0…
#> $ sleep_rem <dbl> NA, 1.8, 2.4, 2.3, 0.7, 2.2, 1.4, NA, 2.9, NA, 0.6, 0.…
#> $ sleep_cycle <dbl> NA, NA, NA, 0.1333333, 0.6666667, 0.7666667, 0.3833333…
#> $ awake <dbl> 11.9, 7.0, 9.6, 9.1, 20.0, 9.6, 15.3, 17.0, 13.9, 21.0…
#> $ brainwt <dbl> NA, 0.01550, NA, 0.00029, 0.42300, NA, NA, NA, 0.07000…
#> $ bodywt <dbl> 50.000, 0.480, 1.350, 0.019, 600.000, 3.850, 20.490, 0…
The primary use of select()
is to choose which columns to keep or drop:
Selecting variables to keep
%>%
msleep select(name:order, sleep_total:sleep_cycle) %>%
glimpse()
#> Rows: 83
#> Columns: 7
#> $ name <chr> "Cheetah", "Owl monkey", "Mountain beaver", "Greater sh…
#> $ genus <chr> "Acinonyx", "Aotus", "Aplodontia", "Blarina", "Bos", "B…
#> $ vore <chr> "carni", "omni", "herbi", "omni", "herbi", "herbi", "ca…
#> $ order <chr> "Carnivora", "Primates", "Rodentia", "Soricomorpha", "A…
#> $ sleep_total <dbl> 12.1, 17.0, 14.4, 14.9, 4.0, 14.4, 8.7, 7.0, 10.1, 3.0,…
#> $ sleep_rem <dbl> NA, 1.8, 2.4, 2.3, 0.7, 2.2, 1.4, NA, 2.9, NA, 0.6, 0.8…
#> $ sleep_cycle <dbl> NA, NA, NA, 0.1333333, 0.6666667, 0.7666667, 0.3833333,…
Selecting variables to drop
%>%
msleep select(-(name:order)) %>%
glimpse()
#> Rows: 83
#> Columns: 7
#> $ conservation <chr> "lc", NA, "nt", "lc", "domesticated", NA, "vu", NA, "d…
#> $ sleep_total <dbl> 12.1, 17.0, 14.4, 14.9, 4.0, 14.4, 8.7, 7.0, 10.1, 3.0…
#> $ sleep_rem <dbl> NA, 1.8, 2.4, 2.3, 0.7, 2.2, 1.4, NA, 2.9, NA, 0.6, 0.…
#> $ sleep_cycle <dbl> NA, NA, NA, 0.1333333, 0.6666667, 0.7666667, 0.3833333…
#> $ awake <dbl> 11.9, 7.0, 9.6, 9.1, 20.0, 9.6, 15.3, 17.0, 13.9, 21.0…
#> $ brainwt <dbl> NA, 0.01550, NA, 0.00029, 0.42300, NA, NA, NA, 0.07000…
#> $ bodywt <dbl> 50.000, 0.480, 1.350, 0.019, 600.000, 3.850, 20.490, 0…
We can also select columns based on partial column names
For example, we can select columns that start with “sleep”:
%>%
msleep select(name, starts_with("sleep")) %>%
glimpse()
#> Rows: 83
#> Columns: 4
#> $ name <chr> "Cheetah", "Owl monkey", "Mountain beaver", "Greater sh…
#> $ sleep_total <dbl> 12.1, 17.0, 14.4, 14.9, 4.0, 14.4, 8.7, 7.0, 10.1, 3.0,…
#> $ sleep_rem <dbl> NA, 1.8, 2.4, 2.3, 0.7, 2.2, 1.4, NA, 2.9, NA, 0.6, 0.8…
#> $ sleep_cycle <dbl> NA, NA, NA, 0.1333333, 0.6666667, 0.7666667, 0.3833333,…
Or we can select columns that contain “eep” and end with “wt”:
%>%
msleep select(contains("eep"), ends_with("wt")) %>%
glimpse()
#> Rows: 83
#> Columns: 5
#> $ sleep_total <dbl> 12.1, 17.0, 14.4, 14.9, 4.0, 14.4, 8.7, 7.0, 10.1, 3.0,…
#> $ sleep_rem <dbl> NA, 1.8, 2.4, 2.3, 0.7, 2.2, 1.4, NA, 2.9, NA, 0.6, 0.8…
#> $ sleep_cycle <dbl> NA, NA, NA, 0.1333333, 0.6666667, 0.7666667, 0.3833333,…
#> $ brainwt <dbl> NA, 0.01550, NA, 0.00029, 0.42300, NA, NA, NA, 0.07000,…
#> $ bodywt <dbl> 50.000, 0.480, 1.350, 0.019, 600.000, 3.850, 20.490, 0.…
You can also use select()
to select columns based on their data type
Select only numeric columns:
%>%
msleep select_if(is.numeric) %>%
glimpse()
#> Rows: 83
#> Columns: 6
#> $ sleep_total <dbl> 12.1, 17.0, 14.4, 14.9, 4.0, 14.4, 8.7, 7.0, 10.1, 3.0,…
#> $ sleep_rem <dbl> NA, 1.8, 2.4, 2.3, 0.7, 2.2, 1.4, NA, 2.9, NA, 0.6, 0.8…
#> $ sleep_cycle <dbl> NA, NA, NA, 0.1333333, 0.6666667, 0.7666667, 0.3833333,…
#> $ awake <dbl> 11.9, 7.0, 9.6, 9.1, 20.0, 9.6, 15.3, 17.0, 13.9, 21.0,…
#> $ brainwt <dbl> NA, 0.01550, NA, 0.00029, 0.42300, NA, NA, NA, 0.07000,…
#> $ bodywt <dbl> 50.000, 0.480, 1.350, 0.019, 600.000, 3.850, 20.490, 0.…
Select only character columns:
%>%
msleep select_if(is.character) %>%
glimpse()
#> Rows: 83
#> Columns: 5
#> $ name <chr> "Cheetah", "Owl monkey", "Mountain beaver", "Greater s…
#> $ genus <chr> "Acinonyx", "Aotus", "Aplodontia", "Blarina", "Bos", "…
#> $ vore <chr> "carni", "omni", "herbi", "omni", "herbi", "herbi", "c…
#> $ order <chr> "Carnivora", "Primates", "Rodentia", "Soricomorpha", "…
#> $ conservation <chr> "lc", NA, "nt", "lc", "domesticated", NA, "vu", NA, "d…
You can also use select()
to reorder variables. The key is to use everything()
, which tells R to keep all the other variables in the data frame.
For example, if we wanted to move the conservation
and awake
variables to the first two columns, we could do this:
%>%
msleep select(conservation, awake, everything()) %>%
glimpse()
#> Rows: 83
#> Columns: 11
#> $ conservation <chr> "lc", NA, "nt", "lc", "domesticated", NA, "vu", NA, "d…
#> $ awake <dbl> 11.9, 7.0, 9.6, 9.1, 20.0, 9.6, 15.3, 17.0, 13.9, 21.0…
#> $ name <chr> "Cheetah", "Owl monkey", "Mountain beaver", "Greater s…
#> $ genus <chr> "Acinonyx", "Aotus", "Aplodontia", "Blarina", "Bos", "…
#> $ vore <chr> "carni", "omni", "herbi", "omni", "herbi", "herbi", "c…
#> $ order <chr> "Carnivora", "Primates", "Rodentia", "Soricomorpha", "…
#> $ sleep_total <dbl> 12.1, 17.0, 14.4, 14.9, 4.0, 14.4, 8.7, 7.0, 10.1, 3.0…
#> $ sleep_rem <dbl> NA, 1.8, 2.4, 2.3, 0.7, 2.2, 1.4, NA, 2.9, NA, 0.6, 0.…
#> $ sleep_cycle <dbl> NA, NA, NA, 0.1333333, 0.6666667, 0.7666667, 0.3833333…
#> $ brainwt <dbl> NA, 0.01550, NA, 0.00029, 0.42300, NA, NA, NA, 0.07000…
#> $ bodywt <dbl> 50.000, 0.480, 1.350, 0.019, 600.000, 3.850, 20.490, 0…
You can also use select()
to rename variables.
The pattern here is new = old
:
%>%
msleep select(
animal = name,
extinction_threat = conservation
%>%
) glimpse()
#> Rows: 83
#> Columns: 2
#> $ animal <chr> "Cheetah", "Owl monkey", "Mountain beaver", "Grea…
#> $ extinction_threat <chr> "lc", NA, "nt", "lc", "domesticated", NA, "vu", N…
Notice however that this drops everything else.
If you wanted to just rename them but keep all other variables, use rename()
:
%>%
msleep rename(
animal = name,
extinction_threat = conservation
%>%
) glimpse()
#> Rows: 83
#> Columns: 11
#> $ animal <chr> "Cheetah", "Owl monkey", "Mountain beaver", "Grea…
#> $ genus <chr> "Acinonyx", "Aotus", "Aplodontia", "Blarina", "Bo…
#> $ vore <chr> "carni", "omni", "herbi", "omni", "herbi", "herbi…
#> $ order <chr> "Carnivora", "Primates", "Rodentia", "Soricomorph…
#> $ extinction_threat <chr> "lc", NA, "nt", "lc", "domesticated", NA, "vu", N…
#> $ sleep_total <dbl> 12.1, 17.0, 14.4, 14.9, 4.0, 14.4, 8.7, 7.0, 10.1…
#> $ sleep_rem <dbl> NA, 1.8, 2.4, 2.3, 0.7, 2.2, 1.4, NA, 2.9, NA, 0.…
#> $ sleep_cycle <dbl> NA, NA, NA, 0.1333333, 0.6666667, 0.7666667, 0.38…
#> $ awake <dbl> 11.9, 7.0, 9.6, 9.1, 20.0, 9.6, 15.3, 17.0, 13.9,…
#> $ brainwt <dbl> NA, 0.01550, NA, 0.00029, 0.42300, NA, NA, NA, 0.…
#> $ bodywt <dbl> 50.000, 0.480, 1.350, 0.019, 600.000, 3.850, 20.4…
ifelse()
The ifelse()
function takes the pattern:
if <condition>, <value if condition is TRUE>, <value if condition if FALSE>
Example: Create a variable, cost_high
, that is TRUE
if the repair costs were greater than the median costs and FALSE
otherwise.
<- wildlife_impacts %>%
wildlife_impacts1 rename(cost = cost_repairs_infl_adj) %>%
filter(!is.na(cost)) %>%
mutate(
cost_median = median(cost),
cost_high = ifelse(cost > cost_median, TRUE, FALSE)
)
%>%
wildlife_impacts1 select(cost, cost_median, cost_high) %>%
head()
#> # A tibble: 6 × 3
#> cost cost_median cost_high
#> <dbl> <dbl> <lgl>
#> 1 1000 26783 FALSE
#> 2 200 26783 FALSE
#> 3 10000 26783 FALSE
#> 4 100000 26783 TRUE
#> 5 20000 26783 FALSE
#> 6 487000 26783 TRUE
ifelse()
Often times you may want to make a category variable that takes different values based on different conditions. You can “nest” multiple ifelse()
statements to achieve this.
For example, let’s say we wanted to create a variable, season
, based on the incident_month
variable. We could do this like so:
<- wildlife_impacts %>%
wildlife_impacts2 mutate(season = ifelse(
%in% c(3, 4, 5), 'spring', ifelse(
incident_month %in% c(6, 7, 8), 'summer', ifelse(
incident_month %in% c(9, 10, 11), 'fall', 'winter')))
incident_month
)
%>%
wildlife_impacts2 distinct(incident_month, season) %>%
head()
#> # A tibble: 6 × 2
#> incident_month season
#> <dbl> <chr>
#> 1 12 winter
#> 2 11 fall
#> 3 10 fall
#> 4 9 fall
#> 5 8 summer
#> 6 7 summer
Here we’re entering into another ifelse()
statement as the “else” condition of an earlier condition.
case_when()
You can achieve the same thing as the example about usin ghte case_when()
function, which produces slightly cleaner code.
Note: If you don’t include the final TRUE ~ 'winter'
condition, you’ll get NA
for those cases.
<- wildlife_impacts %>%
wildlife_impacts2 mutate(season = case_when(
%in% c(3, 4, 5) ~ 'spring',
incident_month %in% c(6, 7, 8) ~ 'summer',
incident_month %in% c(9, 10, 11) ~ 'fall',
incident_month TRUE ~ 'winter')
)
%>%
wildlife_impacts2 distinct(incident_month, season) %>%
head()
#> # A tibble: 6 × 2
#> incident_month season
#> <dbl> <chr>
#> 1 12 winter
#> 2 11 fall
#> 3 10 fall
#> 4 9 fall
#> 5 8 summer
#> 6 7 summer
case_when()
and between()
Here’s another approach to achieve the same thing but using the between()
function to note any months between a start and end month:
<- wildlife_impacts %>%
wildlife_impacts2 mutate(season = case_when(
between(incident_month, 3, 5) ~ 'spring',
between(incident_month, 6, 8) ~ 'summer',
between(incident_month, 9, 11) ~ 'fall',
TRUE ~ 'winter')
)
%>%
wildlife_impacts2 distinct(incident_month, season) %>%
head()
#> # A tibble: 6 × 2
#> incident_month season
#> <dbl> <chr>
#> 1 12 winter
#> 2 11 fall
#> 3 10 fall
#> 4 9 fall
#> 5 8 summer
#> 6 7 summer
separate()
The tb_rates
data set contains information on tuberculosis rates in different countries. It looks like this:
tb_rates
#> # A tibble: 6 × 3
#> country year rate
#> <chr> <dbl> <chr>
#> 1 Afghanistan 1999 745/19987071
#> 2 Afghanistan 2000 2666/20595360
#> 3 Brazil 1999 37737/172006362
#> 4 Brazil 2000 80488/174504898
#> 5 China 1999 212258/1272915272
#> 6 China 2000 213766/1280428583
We can break the rate
variable into two separate variables, cases
and population
, using the separate()
function:
%>%
tb_rates separate(rate, into = c("cases", "population"))
#> # A tibble: 6 × 4
#> country year cases population
#> <chr> <dbl> <chr> <chr>
#> 1 Afghanistan 1999 745 19987071
#> 2 Afghanistan 2000 2666 20595360
#> 3 Brazil 1999 37737 172006362
#> 4 Brazil 2000 80488 174504898
#> 5 China 1999 212258 1272915272
#> 6 China 2000 213766 1280428583
You can also specify the separator:
%>%
tb_rates separate(
rate,into = c("cases", "population"),
sep = "/"
)
#> # A tibble: 6 × 4
#> country year cases population
#> <chr> <dbl> <chr> <chr>
#> 1 Afghanistan 1999 745 19987071
#> 2 Afghanistan 2000 2666 20595360
#> 3 Brazil 1999 37737 172006362
#> 4 Brazil 2000 80488 174504898
#> 5 China 1999 212258 1272915272
#> 6 China 2000 213766 1280428583
You can also convert the new variables to numeric by adding convert = TRUE
:
%>%
tb_rates separate(
rate, into = c("cases", "population"),
sep = "/",
convert = TRUE
)
#> # A tibble: 6 × 4
#> country year cases population
#> <chr> <dbl> <int> <int>
#> 1 Afghanistan 1999 745 19987071
#> 2 Afghanistan 2000 2666 20595360
#> 3 Brazil 1999 37737 172006362
#> 4 Brazil 2000 80488 174504898
#> 5 China 1999 212258 1272915272
#> 6 China 2000 213766 1280428583
Dates are notoriously annoying to deal with.
To deal with dates, we’ll use the lubridate
package.
Check out the lubridate
cheat sheet
The key concept to remember with lubridate
is:
order is the ONLY thing that matters!
The package has special functions that convert date strings into formal date formats. The function names are based on the order of the date:
Year-Month-Day
ymd('2020-02-26')
#> [1] "2020-02-26"
ymd('2020 Feb 26')
#> [1] "2020-02-26"
ymd('2020 Feb. 26')
#> [1] "2020-02-26"
ymd('2020 february 26')
#> [1] "2020-02-26"
Month-Day-Year
mdy('February 26, 2020')
#> [1] "2020-02-26"
mdy('Feb. 26, 2020')
#> [1] "2020-02-26"
mdy('Feb 26 2020')
#> [1] "2020-02-26"
Day-Month-Year
dmy('26 February 2020')
#> [1] "2020-02-26"
dmy('26 Feb. 2020')
#> [1] "2020-02-26"
dmy('26 Feb, 2020')
#> [1] "2020-02-26"
The today()
function returns the current date:
<- today()
date date
#> [1] "2024-09-06"
You can extract different components of the date using the year()
, month()
, and day()
functions:
year(date)
#> [1] 2024
month(date)
#> [1] 9
day(date)
#> [1] 6
You can also get the month name in a specific format using the month()
function:
month(date, label = TRUE, abbr = FALSE)
#> [1] September
#> 12 Levels: January < February < March < April < May < June < ... < December
You can also get the weekday name in a specific format using the wday()
function:
wday(date)
#> [1] 6
wday(date)
#> [1] 6
wday(date, label = TRUE, abbr = TRUE)
#> [1] Fri
#> Levels: Sun < Mon < Tue < Wed < Thu < Fri < Sat
The same functions can also be used to modify elements of dates:
<- today()
date date
#> [1] "2024-09-06"
# Change the year
year(date) <- 2016
date
#> [1] "2016-09-06"
# Change the day
day(date) <- 30
date
#> [1] "2016-09-30"