2 Intermediate R

https://learn.datacamp.com/courses/intermediate-r

Main functions and concepts covered in this BP chapter:

  1. Relational (> >= == != <= <) and Logical (& |) Operators
  2. if, else if, and else statements
  3. while() and for() loops, with break and next
  4. Functions, and writing your own using function()
  5. Installing and loading packages
    • install.packages()
    • library()
    • require()
  6. lapply(), sapply(), vapply()
  7. unlist()
  8. Some useful functions (see 2.5)
  9. Regular Expressions (RegEx)
    • grepl() and grep() for searching
    • sub() and gsub() for replacing
    • Note: regular expressions are crucial in many aspects of programming, so learning how to use them at some point would be beneficial
  10. Dates and times

Packages used in this chapter:

## Load all packages used in this chapter
# There are no packages used in this chapter
# The exception is that ggplot2 is loaded below, but it's loaded as an example of how to load a package
# So, for this chapter only we'll load it down there

Datasets used in this chapter:

## Load datasets used in this chapter
# There are no datasets used in this chapter

2.1 Conditionals and Control Flow

See section 1.2.5 for additional discussion.

Relational and Logical Operators:

  • < less than
  • > greater than
  • <= less than or equal to
  • >= greater than or equal to
  • == equal
  • != not equal
  • & and
  • | or

In general, you can think of TRUE == 1 and FALSE == 0. That means:

TRUE > FALSE
## [1] TRUE

For character strings, R determines greater than based on alphabetical order, so "a" < "b". For example:

"Jack" < "John"
## [1] TRUE
"John" < "John Smith"
## [1] TRUE

Logical comparisons also work with vectors, both vector compared to constant and vector compared to another vector:

# The linkedin and facebook vectors have already been created for you
linkedin <- c(16, 9, 13, 5, 2, 17, 14)
facebook <- c(17, 7, 5, 16, 8, 13, 14)

# vector compared to constant
linkedin > 15
## [1]  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE
# vector compared to vector
linkedin > facebook
## [1] FALSE  TRUE  TRUE FALSE FALSE  TRUE FALSE
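Because TRUE behaves like 1 and FALSE like 0, the result of a comparison can be counted or averaged directly. The short sketch below reuses the linkedin vector from above; the names popular_days, n_popular, and share_popular are only illustrative.

```r
linkedin <- c(16, 9, 13, 5, 2, 17, 14)

# Logical vector: TRUE on days with more than 15 views
popular_days <- linkedin > 15

# sum() counts the TRUE values; mean() gives the share of TRUE values
n_popular <- sum(popular_days)       # 2 (the days with 16 and 17 views)
share_popular <- mean(popular_days)  # 2 out of 7 days
```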

2.1.0.1 And, Or

And (&) is TRUE if both parts are TRUE:

y <- 4
y > 5 & y < 10
## [1] FALSE
y <- 8
y > 5 & y < 10
## [1] TRUE
y <- 12
y > 5 & y < 10
## [1] FALSE

Or (|) is TRUE if one part or the other, or both, are TRUE. Using the same example as above for and, all three are TRUE because at least one side is TRUE. It’s only FALSE in the fourth example, where neither comparison is TRUE.

y <- 4
y > 5 | y < 10
## [1] TRUE
y <- 8
y > 5 | y < 10
## [1] TRUE
y <- 12
y > 5 | y < 10
## [1] TRUE
y <- 12
y < 5 | y < 10
## [1] FALSE

2.1.0.2 Not (!)

Not (!) is an exclamation point. Combined with an equals sign (!=), it checks whether two things are NOT equal. On its own, it swaps TRUE and FALSE: if something is TRUE, putting ! before it makes it FALSE, and if something is FALSE, putting ! before it makes it TRUE.

5 != 7
## [1] TRUE
5 == 7
## [1] FALSE
y <- FALSE
y
## [1] FALSE
!y
## [1] TRUE

You can also use ! before a function that returns a logical value:

z <- 7
is.numeric(z)
## [1] TRUE
!is.numeric(z)
## [1] FALSE

You can put multiple conditions in parentheses and NOT the entire thing:

y <- 4
y > 5 | y < 10
## [1] TRUE
!(y > 5 | y < 10)
## [1] FALSE

2.1.0.3 Double and and or

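The double forms (&& and ||) work on single TRUE/FALSE values instead of returning TRUE/FALSE for each element in a vector. In older versions of R they silently used only the first entry of each vector; since R 4.3.0, giving them a vector longer than one is an error, as the output here shows: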

x <- c(TRUE,FALSE,TRUE,FALSE)
y <- c(FALSE,TRUE,TRUE,FALSE)
x & y
## [1] FALSE FALSE  TRUE FALSE
x | y
## [1]  TRUE  TRUE  TRUE FALSE
x && y

ERROR: in x && y: ‘length = 4’ in coercion to ‘logical(1)’

x || y

ERROR: in x || y: ‘length = 4’ in coercion to ‘logical(1)’
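A sketch of where each form tends to fit, assuming you need a single TRUE/FALSE (for example, for an if condition): collapse a vector with any() or all(), and use && when each side is a single value. The variable value is made up for illustration.

```r
x <- c(TRUE, FALSE, TRUE, FALSE)
y <- c(FALSE, TRUE, TRUE, FALSE)

# Collapse element-wise results into one logical value
any_both <- any(x & y)     # TRUE: both are TRUE in position 3
all_either <- all(x | y)   # FALSE: neither is TRUE in position 4

# && stops as soon as the answer is known, so the comparison on the
# right is never evaluated when value is NULL
value <- NULL
is_positive <- !is.null(value) && value > 0   # FALSE
```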

2.1.1 Conditional (If Else) Statements

if statement: runs the specified code if the condition is met.

else if and else statements: used with an if statement to say what to do if the initial if condition isn’t true.

x <- 2
if(x<0) {
  print("x is a negative number")
} else if(x == 0) {
  print("x is equal to 0")
} else {
  print("x is a positive number")
}
## [1] "x is a positive number"

2.1.1.1 ifelse

I meant to Google “R if else” but left out the space and found the ifelse() function. It lets you do a simple version of if() and else() in one line, just like the Excel If() function from QDM.

x <- 7
y <- ifelse(x<10,"It is", "It is not")
y
## [1] "It is"
z <- ifelse(x>10,"It is", "It is not")
z
## [1] "It is not"
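ifelse() also works element by element on a whole vector, which a plain if statement does not. A minimal sketch using the linkedin vector from earlier (the labels "popular" and "quiet" are made up for illustration):

```r
linkedin <- c(16, 9, 13, 5, 2, 17, 14)

# One label per element: "popular" where views exceed 10, otherwise "quiet"
day_labels <- ifelse(linkedin > 10, "popular", "quiet")
day_labels
```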

2.2 Loops

2.2.1 While loop

While loops continue to execute the code over and over as long as the condition remains true.

Basic while loop

ctr <- 1
while(ctr <= 7) {
  print(paste("ctr is set to", ctr))
  ctr <- ctr + 1
}
## [1] "ctr is set to 1"
## [1] "ctr is set to 2"
## [1] "ctr is set to 3"
## [1] "ctr is set to 4"
## [1] "ctr is set to 5"
## [1] "ctr is set to 6"
## [1] "ctr is set to 7"
print(paste("reached the end with value ctr = ",ctr))
## [1] "reached the end with value ctr =  8"

2.2.1.1 break in a while loop

The while loop keeps going as long as the condition remains TRUE. You can also use a break statement to stop the while loop early.

# break out of loop if i is divisible by 4
i <- 1
while (i <= 10) {
  print(paste("i: ",i))
  if ((i)%%4==0) {
    break
  }
  i <- i + 1
}
## [1] "i:  1"
## [1] "i:  2"
## [1] "i:  3"
## [1] "i:  4"
print(paste("reached the end with value i = ",i))
## [1] "reached the end with value i =  4"

2.2.2 For Loop

for([variable] in [sequence]) {}

The sequence can be numbers. In this case, it works similarly to a while loop, except you don’t need to manually move the counter to the next value.

for(ctr in 1:4){
  print(ctr)
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4

The sequence can also be a matrix, a list, etc., and R will figure out how to step through it for you. Here’s an example with a list:

# The nyc list is already specified
nyc <- list(pop = 8405837, 
            boroughs = c("Manhattan", "Bronx", "Brooklyn", "Queens", "Staten Island"), 
            capital = FALSE)
# Loop version 1
for (n in nyc) {
    print(n)
}
## [1] 8405837
## [1] "Manhattan"     "Bronx"         "Brooklyn"      "Queens"       
## [5] "Staten Island"
## [1] FALSE
# Loop version 2
for (i in 1:length(nyc)) {
    print(nyc[[i]])
}
## [1] 8405837
## [1] "Manhattan"     "Bronx"         "Brooklyn"      "Queens"       
## [5] "Staten Island"
## [1] FALSE
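One caveat with the 1:length() style of loop version 2: if the list is empty, 1:0 counts down through 1 and 0, so the loop body still runs. A small sketch, assuming an empty list, shows the difference from seq_along(); the counter names are illustrative.

```r
empty_list <- list()

# 1:length(empty_list) is 1:0, i.e. c(1, 0), so the body runs twice
runs_colon <- 0
for (i in 1:length(empty_list)) {
  runs_colon <- runs_colon + 1
}

# seq_along(empty_list) is an empty sequence, so the body never runs
runs_seq <- 0
for (i in seq_along(empty_list)) {
  runs_seq <- runs_seq + 1
}
```

For non-empty lists the two forms give the same iterations.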

You can loop over a matrix directly, in which case R goes through the elements one at a time, column by column, or you can nest for loops to go through the rows and columns explicitly:

m <- matrix(1:9,byrow=TRUE,nrow=3)
m
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
## [3,]    7    8    9
for(i in m){
  print(i)
}
## [1] 1
## [1] 4
## [1] 7
## [1] 2
## [1] 5
## [1] 8
## [1] 3
## [1] 6
## [1] 9
for(r in 1:nrow(m)){
  for(c in 1:ncol(m)){
    print(paste("m[", r, ",", c ,"] = " ,m[r,c],sep=""))
  }
}
## [1] "m[1,1] = 1"
## [1] "m[1,2] = 2"
## [1] "m[1,3] = 3"
## [1] "m[2,1] = 4"
## [1] "m[2,2] = 5"
## [1] "m[2,3] = 6"
## [1] "m[3,1] = 7"
## [1] "m[3,2] = 8"
## [1] "m[3,3] = 9"

2.2.2.1 next and break in a for loop

The break statement abandons the active loop: the remaining code in the loop is skipped and the loop is not iterated over anymore.

The next statement skips the remainder of the code in the loop, but continues the iteration.

# The linkedin vector has already been defined for you
linkedin <- c(16, 9, 13, 5, 2, 17, 14)

# Extend the for loop
for (li in linkedin) {
  if (li > 10) {
    print("You're popular!")
  } else {
    print("Be more visible!")
  }
  
  # Add if statement with break
  if(li>16){
    print("This is ridiculous, I'm outta here!")
    break
  }
  
  # Add if statement with next
  if(li<5){
    print( "This is too embarrassing!")
    next
  }
  
  print(li)
}
## [1] "You're popular!"
## [1] 16
## [1] "Be more visible!"
## [1] 9
## [1] "You're popular!"
## [1] 13
## [1] "Be more visible!"
## [1] 5
## [1] "Be more visible!"
## [1] "This is too embarrassing!"
## [1] "You're popular!"
## [1] "This is ridiculous, I'm outta here!"
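A stripped-down sketch of the same two statements, without the printing, may make the difference easier to see. It adds up odd numbers only (next skips the even ones) and stops entirely once a number above 7 is reached (break).

```r
total <- 0
for (i in 1:10) {
  if (i %% 2 == 0) {
    next          # skip even numbers, continue with the next i
  }
  if (i > 7) {
    break         # leave the loop completely when i reaches 9
  }
  total <- total + i
}
total   # 1 + 3 + 5 + 7 = 16
```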

2.3 Functions

Functions take an input or inputs and then do something with it, sometimes making changes, sometimes printing something, sometimes returning something.

Type a ? before a function to find out what it does (it opens in the help window to the right).

args(function_name) returns a short list of the arguments so you don’t need to look through the longer help description. It is less helpful for generic functions, though: args(mean) shows only (x, ...), while ?mean also lists the trim and na.rm arguments.

args(sd)
## function (x, na.rm = FALSE) 
## NULL

Reading about the function can help you get things to work correctly when they don’t initially. For example, suppose there’s an NA value and you try to take the mean. If you read about the mean() function, you learn how to deal with the NA value.

linkedin <- c(16, 9, 13, 5, NA, 17, 14)
# Doesn't work because of NA
mean(linkedin)
## [1] NA
# Remove NA and it works
mean(linkedin,na.rm=TRUE)
## [1] 12.33333

2.3.1 Required vs optional arguments

Here’s what is listed for mean: mean(x, trim = 0, na.rm = FALSE, ...)

x is required. It doesn’t have a default, so if you don’t supply x, the function won’t work.

trim and na.rm both have default values. This makes them optional arguments. You can specify the default value explicitly if you want, but if you don’t (i.e., if you just leave them out entirely), the function will use the default value. If you don’t want to use the default value, then you need to specify another value.
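As a small illustration of defaults, here are three calls to mean() on made-up data: one that relies on the defaults, one that overrides trim, and one that names every argument (named arguments can be given in any order).

```r
x <- c(1, 2, 3, 4, 100)

# Defaults are used: trim = 0, na.rm = FALSE
mean_default <- mean(x)               # 22

# Override trim: drop the lowest and highest 20% before averaging
mean_trimmed <- mean(x, trim = 0.2)   # mean of 2, 3, 4 is 3

# Named arguments can appear in any order
mean_named <- mean(na.rm = TRUE, x = c(1, NA, 3))   # 2
```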

2.3.2 Writing your own function

Use the function() function to write your own function. Assign it into a variable with the name you want to name the function (i.e., what you’ll call later to use your function). Arguments for your function go in parentheses of the function() function. The body of the function goes in curly brackets. It will return whatever comes last in the function, but if you want to return something earlier, use the return() function.

divideAndAdd1 <- function(a,b) {
  if(b == 0) {
    return("Can't divide by 0")
  }
  a/b + 1
}
divideAndAdd1(10,0)
## [1] "Can't divide by 0"
divideAndAdd1(10,2)
## [1] 6
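Returning a message string means the function sometimes gives back a number and sometimes text. One alternative, shown here only as a sketch, is to signal an error with stop() and let the caller decide what to do using tryCatch(). The name divide_and_add1 is a hypothetical variant of the function above.

```r
divide_and_add1 <- function(a, b) {
  if (b == 0) {
    stop("Can't divide by 0")   # signal an error instead of returning text
  }
  a / b + 1
}

ok <- divide_and_add1(10, 2)    # 6

# tryCatch() runs the handler if an error is signalled
msg <- tryCatch(
  divide_and_add1(10, 0),
  error = function(e) conditionMessage(e)
)
msg   # "Can't divide by 0"
```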

2.3.3 Function Scoping

Function Scoping implies that variables that are defined inside a function are not accessible outside that function. In this example, the variable used as the argument for the function and the one used inside the function aren’t accessible outside the function. Trying to use them gives an error.

pow_two <- function(x_is_an_arg) {
  y_is_inside_function <- x_is_an_arg ^ 2
  return(y_is_inside_function)
}
pow_two(4)
## [1] 16
y_is_inside_function

ERROR: in eval(expr, envir, enclos): object ‘y_is_inside_function’ not found

x_is_an_arg

ERROR: in eval(expr, envir, enclos): object ‘x_is_an_arg’ not found

2.3.3.1 By value, not by reference

R functions pass arguments by value, not by reference. This means that if you pass a variable to a function and the function changes the value of that argument, the variable you passed in still has its original value once the function returns. Here’s an example:

addOne <- function(x_in_func) {
  x_in_func <- x_in_func + 1
  return(x_in_func)
}
x <- 7
x
## [1] 7
addOne(x)
## [1] 8
# The value of x is still 7
x
## [1] 7

This is true even if the name of the variable is the same

addOneVersion2 <- function(y) {
  y <- y + 1
  return(y)
}
y <- 7
y
## [1] 7
addOneVersion2(y)
## [1] 8
# The value of y is still 7
y
## [1] 7
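If you do want the variable to change, assign the function’s return value back to it. A minimal sketch using the same addOne idea:

```r
addOne <- function(x_in_func) {
  x_in_func + 1
}

x <- 7
x <- addOne(x)   # overwrite x with the returned value
x                # 8
```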

2.3.3.2 Nested functions

It’s fine to write a function, and then write a second function that calls that first function (just like it’s fine to call built in functions inside custom functions).

Here’s an example:

interpret <- function(num_views) {
  if (num_views > 15) {
    print("You're popular!")
    return(num_views)
  } else {
    print("Try to be more visible!")
    return(0)
  }
}
# views: vector with data to interpret
# return_sum: return total number of views on popular days
interpret_all <- function(views, return_sum = TRUE) {
  count <- 0
  for (v in views) {
    count <- count + interpret(v)
  }
  if (return_sum == TRUE) {
    count
  } else {
     NULL
  }
}
# Call the interpret_all() function on both linkedin and facebook
facebook
## [1] 17  7  5 16  8 13 14
interpret_all(facebook)
## [1] "You're popular!"
## [1] "Try to be more visible!"
## [1] "Try to be more visible!"
## [1] "You're popular!"
## [1] "Try to be more visible!"
## [1] "Try to be more visible!"
## [1] "Try to be more visible!"
## [1] 33

2.3.4 Install/Load Packages

Before you can load a package it must be installed. You only install once but you must load the package in each session you want to use it.

install.packages("ggplot2") installs the ggplot2 package

Do NOT leave an install.packages() line in a file so that every time you run the file it installs the package. This makes it run very slowly.

library(ggplot2) loads the ggplot2 package once it is installed

search() returns the packages currently loaded. You don’t have to run this function unless you want to see which packages are loaded. Usually we don’t run it (so don’t think you have to run it each time you load a package).

# do NOT leave this line so that it runs each time you build the book!!!!!
# install.packages("ggplot2")
search()
## [1] ".GlobalEnv"        "package:stats"     "package:graphics" 
## [4] "package:grDevices" "package:utils"     "package:datasets" 
## [7] "package:methods"   "Autoloads"         "package:base"
# This will give an error because ggplot2 isn't loaded yet
qplot(mtcars$wt, mtcars$hp)

ERROR: in qplot(mtcars$wt, mtcars$hp): could not find function “qplot”

# Load ggplot 2
library(ggplot2)
# Now this will work:
qplot(mtcars$wt, mtcars$hp)
## Warning: `qplot()` was deprecated in ggplot2 3.4.0.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

# Now search() will display that ggplot2 is loaded
search()
##  [1] ".GlobalEnv"        "package:ggplot2"   "package:stats"    
##  [4] "package:graphics"  "package:grDevices" "package:utils"    
##  [7] "package:datasets"  "package:methods"   "Autoloads"        
## [10] "package:base"

You can also load packages using require(). require() returns TRUE if it’s already installed (and loads it) and FALSE if it’s not.

isItInstalled <- require(ggplot2)
isItInstalled
## [1] TRUE
isItInstalled <- require(not_a_real_package)
## Loading required package: not_a_real_package
## Warning in library(package, lib.loc = lib.loc, character.only = TRUE,
## logical.return = TRUE, : there is no package called 'not_a_real_package'
isItInstalled
## [1] FALSE

This is most useful if you want to check if a package is installed, install it if it’s not, and then load it.

The following example calls require(ggplot2). If the package is installed, require() loads it and returns TRUE. The not (!) before require turns the TRUE into FALSE, so the body of the if statement is not executed. However, if the package is not installed, require(ggplot2) returns FALSE. The not (!) makes that TRUE, so the body of the if statement is executed: it installs the package and then loads it. In this way you know that the package will be available, but it is only installed when it isn’t already.

if(!require(ggplot2)){
  install.packages("ggplot2")
  library(ggplot2)
}
qplot(mtcars$mpg, mtcars$hp) 
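If you only want to check whether a package is installed, without attaching it to the search path, base R also provides requireNamespace(). The sketch below wraps it in a hypothetical helper called is_installed(); it uses the stats package (which ships with R) so nothing is installed when it runs.

```r
# Hypothetical helper: TRUE if the package can be loaded, FALSE otherwise
is_installed <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

has_stats <- is_installed("stats")               # TRUE: stats ships with R
has_fake <- is_installed("not_a_real_package")   # FALSE
```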

2.4 The apply family

2.4.1 Overview

  • lapply()
    • apply function over list or vector
    • output = list
  • sapply()
    • apply function over list or vector
    • try to simplify list to array
  • vapply()
    • apply function over list or vector
    • explicitly specify output format

2.4.2 lapply()

lapply([list/vector], [function]) applies the function to each item in the list or vector, but the output will always be a list.

unlist() can turn a list back into a vector.

How to split data!

pioneers <- c("GAUSS:1777", "BAYES:1702", "PASCAL:1623", "PEARSON:1857")
# Split names from birth year
split_math <- strsplit(pioneers, split = ":")
split_math
## [[1]]
## [1] "GAUSS" "1777" 
## 
## [[2]]
## [1] "BAYES" "1702" 
## 
## [[3]]
## [1] "PASCAL" "1623"  
## 
## [[4]]
## [1] "PEARSON" "1857"
class(split_math)
## [1] "list"

Use lapply() to change each element in the list split_math

# Convert to lowercase strings: split_low
split_low <- lapply(split_math, tolower)
# Take a look at the structure of split_low
str(split_low)
## List of 4
##  $ : chr [1:2] "gauss" "1777"
##  $ : chr [1:2] "bayes" "1702"
##  $ : chr [1:2] "pascal" "1623"
##  $ : chr [1:2] "pearson" "1857"

When you pass the function to lapply(), do NOT put parentheses after its name: lapply(x, function_name). If the function takes more arguments, put them after another comma in lapply().
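Here is a minimal sketch of passing an extra argument through lapply(). The helper select_el() and its index argument are made up for illustration; it simply picks one element out of each split string.

```r
pioneers <- c("GAUSS:1777", "BAYES:1702", "PASCAL:1623", "PEARSON:1857")
split_math <- strsplit(pioneers, split = ":")

# Hypothetical helper: return the element of x at position index
select_el <- function(x, index) {
  x[index]
}

# index = 1 and index = 2 are handed on to select_el() by lapply()
names_only <- lapply(split_math, select_el, index = 1)
years <- unlist(lapply(split_math, select_el, index = 2))
years
```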

2.4.2.1 Anonymous function

You can also use lapply() with a function that isn’t defined ahead of time. An anonymous function is one you don’t assign to a name with <-; instead, you just type the function where you need it. Here’s an example:

split_low
## [[1]]
## [1] "gauss" "1777" 
## 
## [[2]]
## [1] "bayes" "1702" 
## 
## [[3]]
## [1] "pascal" "1623"  
## 
## [[4]]
## [1] "pearson" "1857"
# lapply with anonymous function that returns NULL if name has more than 5 characters
res <- lapply(split_low, function(x) {
  if (nchar(x[1]) > 5) {
    return(NULL)
  } else {
    return(x[2])
  }
})
res
## [[1]]
## [1] "1777"
## 
## [[2]]
## [1] "1702"
## 
## [[3]]
## NULL
## 
## [[4]]
## NULL

Example of what unlist() does

res
## [[1]]
## [1] "1777"
## 
## [[2]]
## [1] "1702"
## 
## [[3]]
## NULL
## 
## [[4]]
## NULL
class(res)
## [1] "list"
resUnlisted <- unlist(res)
resUnlisted
## [1] "1777" "1702"
class(resUnlisted)
## [1] "character"

So it turns the list into a vector (in this case, a character vector), which also removes the NULL elements.

2.4.3 sapply()

Does the same thing as lapply(), except it then tries to simplify the resulting list into an array (a vector or matrix) if it can, much as unlist() would. If it can’t, it leaves it as a list.

temp <- list(c(3,7,9,6,-1),c(6,9,12,13,5),c(4,8,3,-1,-3),c(1,4,7,2,-2),c(5,7,9,4,2),c(-3,5,8,9,4),c(3,6,9,4,1))

resL <- lapply(temp,min)
resL
## [[1]]
## [1] -1
## 
## [[2]]
## [1] 5
## 
## [[3]]
## [1] -3
## 
## [[4]]
## [1] -2
## 
## [[5]]
## [1] 2
## 
## [[6]]
## [1] -3
## 
## [[7]]
## [1] 1
class(resL)
## [1] "list"
resS <- sapply(temp,min)
resS
## [1] -1  5 -3 -2  2 -3  1
class(resS)
## [1] "numeric"

If sapply can’t turn it into a vector, it leaves it as a list. Here’s an example

# Definition of below_zero()
below_zero <- function(x) {
  return(x[x < 0])
}

# Apply below_zero over temp using sapply(): freezing_s
freezing_s <- sapply(temp, below_zero)

# Apply below_zero over temp using lapply(): freezing_l
freezing_l <- lapply(temp, below_zero)

# Are freezing_s and freezing_l identical?
identical(freezing_s, freezing_l)
## [1] TRUE

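identical() tells you whether two objects (vectors, matrices, lists, and so on) are exactly the same. Because the results have different lengths, sapply() couldn’t turn them into a vector, so the results of sapply() and lapply() are identical.

Here is how it deals with NULL. cat() prints its text but returns NULL, so each call to print_info() returns NULL: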
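When every result has the same length greater than one, sapply() simplifies to a matrix instead of a vector. A short sketch using the same temp list with range(), which returns the minimum and maximum of each element:

```r
temp <- list(c(3,7,9,6,-1), c(6,9,12,13,5), c(4,8,3,-1,-3), c(1,4,7,2,-2),
             c(5,7,9,4,2), c(-3,5,8,9,4), c(3,6,9,4,1))

# range() returns two numbers per element, so the result is a
# matrix with 2 rows (min, max) and one column per list element
extremes <- sapply(temp, range)
dim(extremes)    # 2 7
extremes[1, ]    # the minimums: -1 5 -3 -2 2 -3 1
```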

# temp is already available in the workspace

# Definition of print_info()
print_info <- function(x) {
  cat("The average temperature is", mean(x), "\n")
}

# Apply print_info() over temp using sapply()
res <- sapply(temp, print_info)
## The average temperature is 4.8 
## The average temperature is 9 
## The average temperature is 2.2 
## The average temperature is 2.4 
## The average temperature is 5.4 
## The average temperature is 4.6 
## The average temperature is 4.6
res
## [[1]]
## NULL
## 
## [[2]]
## NULL
## 
## [[3]]
## NULL
## 
## [[4]]
## NULL
## 
## [[5]]
## NULL
## 
## [[6]]
## NULL
## 
## [[7]]
## NULL
class(res)
## [1] "list"
# Apply print_info() over temp using lapply()
res <- lapply(temp, print_info)
## The average temperature is 4.8 
## The average temperature is 9 
## The average temperature is 2.2 
## The average temperature is 2.4 
## The average temperature is 5.4 
## The average temperature is 4.6 
## The average temperature is 4.6
res
## [[1]]
## NULL
## 
## [[2]]
## NULL
## 
## [[3]]
## NULL
## 
## [[4]]
## NULL
## 
## [[5]]
## NULL
## 
## [[6]]
## NULL
## 
## [[7]]
## NULL
class(res)
## [1] "list"

2.4.4 vapply()

You must specify the output you are looking for. This makes vapply() safer than sapply() because you know what structure the function will return. If it can’t return that, it gives an error (which is good because we know something didn’t work the way we expected it to).

The two examples below demonstrate how to specify the data type vapply() will return.

cities <- c("New York", "Paris", "London", "Tokyo","Rio", "Cape Town")
vapply(cities, nchar, numeric(1))
##  New York     Paris    London     Tokyo       Rio Cape Town 
##         8         5         6         5         3         9
# If you tell it to expect something numeric with length 2 instead of 1, we get an error
vapply(cities, nchar, numeric(2))

ERROR: in vapply(cities, nchar, numeric(2)): values must be length 2, but FUN(X[[1]]) result is length 1

first_and_last <- function(name) {  
  name <- gsub(" ", "", name)  
  letters <- strsplit(name, split = "")[[1]]
  return(c(first = min(letters), last = max(letters)))
}
vapply(cities, first_and_last, character(2))
##       New York Paris London Tokyo Rio Cape Town
## first "e"      "a"   "d"    "k"   "i" "a"      
## last  "Y"      "s"   "o"    "y"   "R" "w"
## If you tell this one to expect numeric length 2, you get an error
vapply(cities, first_and_last, numeric(2))

ERROR: in vapply(cities, first_and_last, numeric(2)): values must be type ‘double’, but FUN(X[[1]]) result is type ‘character’
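Another place the difference shows up is empty input. As a sketch, assuming a character vector with no cities in it: sapply() has nothing to simplify and returns a list, while vapply() still returns the type you asked for.

```r
no_cities <- character(0)

# sapply() cannot simplify zero results, so it gives back an empty list
s_empty <- sapply(no_cities, nchar)

# vapply() keeps the promised type: an empty numeric vector
v_empty <- vapply(no_cities, nchar, numeric(1))

is.list(s_empty)      # TRUE
is.numeric(v_empty)   # TRUE
```

Code that later does arithmetic on the result is less likely to break on the vapply() version.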

2.5 Utilities

2.5.1 Easy & Useful Functions

  • rev() reverses elements
  • sort() sorts elements
  • print()
  • identical()
  • abs() gives absolute value of each vector element
  • round() rounds each vector element (1.1 to 1)
  • sum() adds up all vector elements
  • mean() finds the average of all elements in a vector
  • list() creates a list of different elements
  • logical() creates a logical vector (values are TRUE or FALSE); log(), by contrast, computes logarithms
  • character() creates a character vector (strings)
  • seq(arg1, arg2, by = #) generates a sequence of numbers: arg1 says where to start the sequence, arg2 says where to end it, and the by argument specifies the increment
seq(1,10, by = 3)
## [1]  1  4  7 10
  • rep(vector, times = #) replicates its input (typically a vector or list); the times argument specifies how often the whole input is repeated. Alternatively, use the each argument to specify how often each individual element is repeated
rep(seq(1,10, by = 3), times = 2)
## [1]  1  4  7 10  1  4  7 10
rep(seq(1,10, by = 3), each = 2)
## [1]  1  1  4  4  7  7 10 10
  • sort(vector, decreasing=FALSE) sorts an input vector by numerics, characters, or logical values by ascending order, setting decreasing=TRUE will reverse the order of arranging
sort(rep(seq(1,10, by = 3), times = 2))
## [1]  1  1  4  4  7  7 10 10
  • str() shows content of data structures in concise way
  • is.*(X) checks if data structure X is a * (list, vector,matrix…)
  • as.*(X) converts data structure X to a * (list, vector,matrix…)
  • unlist() converts list to vector
  • append() adds elements to a vector or list, merges vectors or lists
  • rev() reverses list
  • diff() calculates the difference between consecutive elements
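A few of the functions from the list, tried on a small made-up vector (the name views and its values are only for illustration):

```r
views <- c(16, 9, 13, 5)

reversed <- rev(views)                      # 5 13 9 16
sorted_desc <- sort(views, decreasing = TRUE)   # 16 13 9 5
changes <- diff(views)                      # -7 4 -8
total_change <- sum(abs(changes))           # 7 + 4 + 8 = 19
longer <- append(views, c(2, 17))           # 16 9 13 5 2 17
is_num <- is.numeric(views)                 # TRUE
as_text <- as.character(views)              # "16" "9" "13" "5"
```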

2.5.2 Regular Expressions

Uses of regular expressions:

  • a sequence of characters and metacharacters that form a search pattern, which you can use to match strings
  • check whether certain patterns exist in a text
  • to replace patterns with R elements
  • to extract certain patterns out of a string

2.5.2.1 Searching for patterns (grepl, grep)

The grepl() function: returns TRUE when a pattern is found in the corresponding character string

The grep() function: returns a vector of indices of the character strings that contains the pattern

Both grep() and grepl() need a pattern and an x argument, where pattern is the regular expression you want to match and x is the character vector to search.

Meta characters:

  • pattern = “a” searches for any “a” anywhere
  • pattern = “^a” searches for “a” only at the beginning
  • pattern = “a$” searches for “a” only at the end
  • pattern = “a|i” searches for “a” or “i” (i.e., | is the “or” meta character)
  • A few more are listed below
  • There are many others. See ?regex (and DC has a course on regex)
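The metacharacters above can be tried out on a small made-up vector (the animals vector is only for illustration):

```r
animals <- c("ant", "banana", "iguana", "zebra", "kiwi")

starts_a <- grepl("^a", animals)   # only "ant" starts with "a"
ends_a <- grepl("a$", animals)     # "banana", "iguana", "zebra" end with "a"

# | means "or": starts with "a" OR ends with "i"
either <- grep("^a|i$", animals)   # positions 1 ("ant") and 5 ("kiwi")
animals[either]
```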

Regular character add-ons to make a more robust pattern:

  • @, because a valid email must contain an at-sign.
  • .*, which matches any character (.) zero or more times (*). Both the dot and the asterisk are metacharacters. You can use them to match any character between the at-sign and the “.edu” portion of an email address.
  • \\.edu$, to match the “.edu” part of the email at the end of the string. The \\ part escapes the dot: it tells R that you want to use the . as an actual character.

Example

# The emails vector has already been defined for you
emails <- c("john.doe@ivyleague.edu", "education@world.gov", "dalai.lama@peace.org",
            "invalid.edu", "quant@bigdatacollege.edu", "cookie.monster@sesame.tv", "my.education@gmail.com")
# Use grepl() to match for "edu"
grepl(pattern = "edu", emails)
## [1]  TRUE  TRUE FALSE  TRUE  TRUE FALSE  TRUE
# Use grep() to match for "edu", save result to hits
hits <- grep(pattern = "edu", emails)
hits
## [1] 1 2 4 5 7
# Subset emails using hits
emails[hits]
## [1] "john.doe@ivyleague.edu"   "education@world.gov"     
## [3] "invalid.edu"              "quant@bigdatacollege.edu"
## [5] "my.education@gmail.com"

Correct the code so it matches only “.edu”

# Use grepl() to match for .edu addresses more robustly
grepl(pattern = "@.*\\.edu$", emails)
## [1]  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE
# Use grep() to match for .edu addresses more robustly, save result to hits
hits <- grep(pattern = "@.*\\.edu$", emails)
# Subset emails using hits
emails[hits]
## [1] "john.doe@ivyleague.edu"   "quant@bigdatacollege.edu"

2.5.2.2 Replacing (sub, gsub)

The sub(pattern = <regex>, replacement = <str>, x = <str>) looks for a pattern from x and replaces the first match from that pattern with replacement

The gsub(pattern = <regex>, replacement = <str>, x = <str>) looks for a pattern from x and replaces all matches from that pattern with replacement
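The difference between the two is easiest to see on a string with several matches. A minimal sketch with a made-up phone number, plus one example of capturing part of a match with parentheses and \\1:

```r
phone <- "555-123-4567"

first_only <- sub("-", " ", phone)    # "555 123-4567": first match only
every_one <- gsub("-", " ", phone)    # "555 123 4567": all matches

# Parentheses capture the part before the @; \\1 puts it in the replacement
user <- sub("^(.*)@.*$", "\\1", "john.doe@ivyleague.edu")   # "john.doe"
```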

Regular expressions used in the example below:

  • .*: “any character that is matched zero or more times”.
  • \\s: Match a whitespace character such as a space. On its own, “s” is just the letter “s”; \s is the regex symbol for whitespace, and the extra \ is the escape character R needs to keep the backslash inside a string.
  • [0-9]+: Match the numbers 0 to 9, at least once (+).
  • ([0-9]+): The parentheses are used to make parts of the matching string available to define the replacement.
  • The \\1 in the replacement argument of sub() gets set to the string that is captured by the regular expression [0-9]+
# Use sub() to convert the .edu email domains to datacamp.edu
sub(pattern="@.*\\.edu$", replacement="@datacamp.edu", x=emails)
## [1] "john.doe@datacamp.edu"    "education@world.gov"     
## [3] "dalai.lama@peace.org"     "invalid.edu"             
## [5] "quant@datacamp.edu"       "cookie.monster@sesame.tv"
## [7] "my.education@gmail.com"
# Use sub() to convert all valid email domains to datacamp.edu
sub(pattern="@.*\\.*$", replacement="@datacamp.edu", x=emails)
## [1] "john.doe@datacamp.edu"       "education@datacamp.edu"     
## [3] "dalai.lama@datacamp.edu"     "invalid.edu"                
## [5] "quant@datacamp.edu"          "cookie.monster@datacamp.edu"
## [7] "my.education@datacamp.edu"

Example: Awards

awards <- c("Won 1 Oscar.",
  "Won 1 Oscar. Another 9 wins & 24 nominations.",
  "1 win and 2 nominations.",
  "2 wins & 3 nominations.",
  "Nominated for 2 Golden Globes. 1 more win & 2 nominations.",
  "4 wins & 1 nomination.")
sub(".*\\s([0-9]+)\\snomination.*$", "\\1", awards)
## [1] "Won 1 Oscar." "24"           "2"            "3"            "2"           
## [6] "1"
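Notice that the first element comes back unchanged because the pattern doesn’t match it: sub() leaves non-matching strings alone. One way to handle that, shown here as a sketch and not the only option, is to check with grepl() first and use NA where there is no match.

```r
awards <- c("Won 1 Oscar.",
  "Won 1 Oscar. Another 9 wins & 24 nominations.",
  "1 win and 2 nominations.",
  "2 wins & 3 nominations.",
  "Nominated for 2 Golden Globes. 1 more win & 2 nominations.",
  "4 wins & 1 nomination.")

pattern <- ".*\\s([0-9]+)\\snomination.*$"

# Keep the captured number where the pattern matches, NA elsewhere
nominations <- ifelse(grepl(pattern, awards),
                      sub(pattern, "\\1", awards),
                      NA)
nominations <- as.integer(nominations)
nominations   # NA 24 2 3 2 1
```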

2.5.3 Dates and Times

Dates are represented by Date objects, which store the number of days since January 1, 1970. Note that this is different from Excel, which stores dates as the number of days since January 1, 1900.

# Get the current date: today
today <- Sys.Date()
# How many days since January 1, 1970?
unclass(today)
## [1] 19793

Times are represented by POSIXct objects, which store the number of seconds since January 1, 1970. Sys.time() returns the current time.

# Get the current time: now
now <- Sys.time()
# How many seconds since January 1, 1970?
unclass(now)
## [1] 1710121043

Dates and times before January 1, 1970 are stored as negative numbers.

as.Date() takes a character string of the date and uses a set of symbols to describe how the date is written. Here are the symbols:
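A quick check of the underlying numbers, using fixed dates so the results don’t depend on when the code is run (the time zone is set to UTC explicitly for the same reason):

```r
day_zero <- as.numeric(as.Date("1970-01-01"))     # 0
day_nine <- as.numeric(as.Date("1970-01-10"))     # 9 days after the start
day_before <- as.numeric(as.Date("1969-12-31"))   # -1: before 1970 is negative

# One minute after midnight on January 1, 1970 is 60 seconds
one_minute <- as.numeric(as.POSIXct("1970-01-01 00:01:00", tz = "UTC"))
```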

  • %Y: 4-digit year (1982)
  • %y: 2-digit year (82)
  • %m: 2-digit month (01)
  • %d: 2-digit day of the month (13)
  • %A: weekday (Wednesday)
  • %a: abbreviated weekday (Wed)
  • %B: month (January)
  • %b: abbreviated month (Jan)

The default formats are "%Y-%m-%d" or "%Y/%m/%d"

Convert Character strings to dates:

as.Date("1982-01-13")
## [1] "1982-01-13"
as.Date("Jan-13-82", format = "%b-%d-%y")
## [1] "1982-01-13"
as.Date("13 January, 1982", format = "%d %B, %Y")
## [1] "1982-01-13"
# Definition of character strings representing dates
str1 <- "May 23, '96"
str2 <- "2012-03-15"
str3 <- "30/January/2006"
# Convert the strings to dates: date1, date2, date3
date1 <- as.Date(str1, format = "%b %d, '%y")
date2 <- as.Date(str2, format = "%Y-%m-%d")
date3 <- as.Date(str3, format = "%d/%B/%Y")
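If the format you supply doesn’t describe the string, as.Date() returns NA instead of giving an error, so it is worth checking the result. A small sketch with an all-numeric date (month names such as %b and %B depend on your computer’s language settings, so numbers are used here):

```r
# Format matches the string: day/month/year
parsed <- as.Date("13/01/1982", format = "%d/%m/%Y")
format(parsed)     # "1982-01-13"

# Format does not match the string, so the result is NA
mismatch <- as.Date("13/01/1982", format = "%Y-%m-%d")
is.na(mismatch)    # TRUE
```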

Convert dates to character strings using different date notation:

today <- Sys.Date()
format(Sys.Date(), format = "%d %B, %Y")
## [1] "11 March, 2024"
format(Sys.Date(), format = "Today is a %A!")
## [1] "Today is a Monday!"
format(date1, "%A")
## [1] "Thursday"
format(date2, "%d")
## [1] "15"
format(date3, "%b %Y")
## [1] "Jan 2006"

as.POSIXct() converts a character string to a POSIXct object

format() converts a POSIXct object to a character string

Symbols to format the time:

  • %H: military hours as a decimal number (00-23)
  • %I: hours as a decimal number (01-12)
  • %M: minutes as a decimal number
  • %S: seconds as a decimal number
  • %T: shorthand notation for the typical format %H:%M:%S
  • %p: AM/PM indicator

The default format is "%Y-%m-%d %H:%M:%S"

Examples:

# Definition of character strings representing times
str1 <- "May 23, '96 hours:23 minutes:01 seconds:45"
str2 <- "2012-3-12 14:23:08"
# Convert the strings to POSIXct objects: time1, time2
time1 <- as.POSIXct(str1, format = "%B %d, '%y hours:%H minutes:%M seconds:%S")
time2 <- as.POSIXct(str2, format = "%Y-%m-%d %H:%M:%S")
# Convert times to formatted strings
format(time1, "%M")
## [1] "01"
format(time2, "%I:%M %p")
## [1] "02:23 PM"
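The examples above use your computer’s time zone. Both as.POSIXct() and format() also accept a tz argument, sketched below with UTC and, as an assumed example of another zone name, "Asia/Tokyo" (the available zone names depend on your system).

```r
# Read the string as a UTC time
t_utc <- as.POSIXct("2012-03-12 14:23:08", tz = "UTC")

# Show the same moment in UTC and in another time zone
utc_clock <- format(t_utc, "%H:%M", tz = "UTC")   # "14:23"
tokyo_clock <- format(t_utc, "%H:%M", tz = "Asia/Tokyo")
```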

2.5.3.1 Calculations with dates and times

today <- Sys.Date()
today + 1
## [1] "2024-03-12"
today - 1
## [1] "2024-03-10"
as.Date("2022-12-20") - as.Date("2022-12-15")
## Time difference of 5 days

And with Times:

now <- Sys.time()
now + 60*60          # add an hour
## [1] "2024-03-11 02:37:22 UTC"
now - 60*60 * 24     # subtract a day
## [1] "2024-03-10 01:37:22 UTC"
birth <- as.POSIXct("1879-03-14 14:37:23")
death <- as.POSIXct("1955-04-18 03:47:12")
einstein <- death - birth
einstein
## Time difference of 27792.55 days
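Subtracting dates or times gives a difftime object. If you need a plain number, or want to pick the units yourself, difftime() and as.numeric() can help. A minimal sketch with the same two dates used above:

```r
start <- as.Date("2022-12-15")
end <- as.Date("2022-12-20")

# Plain number of days
n_days <- as.numeric(end - start)                             # 5

# Same gap expressed in hours
n_hours <- as.numeric(difftime(end, start, units = "hours"))  # 120
```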