1 Introduction to R

https://learn.datacamp.com/courses/free-introduction-to-r

Main functions and concepts covered in this BP chapter:

  1. Arithmetic
  2. Variables and assignment (<-)
  3. Data Types
  4. Vectors, matrices, data frames, lists
  5. c()
  6. names()
  7. matrix()
  8. rbind(), cbind()
  9. rowSums(), colSums()
  10. Factors
  11. levels()
  12. head(), tail()
  13. Data frame $ notation
  14. subset()
  15. order()
  16. list()

Packages used in this chapter:

## Load all packages used in this chapter
# There are no packages used in this chapter

Datasets used in this chapter:

## Load datasets used in this chapter
# The only dataset used is mtcars, which is built into R so we don't have to load it

1.1 Intro to basics

The # symbol is used to add comments to R code inside a code chunk. Outside of a code chunk, # indicates a heading, so don’t use # outside code chunks except to create headings. More # symbols mean a smaller heading: 1 for BP chapters (e.g., “Introduction to R” above), 2 for DataCamp course chapters (e.g., “Intro to basics” above), and 3 or more for subsections within each DataCamp chapter (e.g., “Arithmetic operators” below).

1.1.1 Arithmetic operators

In its most basic form, R can be used as a simple calculator:

Math                 R Code   Output
Addition             6 + 5    11
Subtraction          6 - 5    1
Multiplication       6 * 5    30
Division             6 / 5    1.2
Exponentiation       6^2      36
Modulo (remainder)   6 %% 5   1

1.1.2 Variables

A variable allows you to store a value (e.g. 4) or an object (e.g. a function description) in R so you can access it later by name. You can then use the variable anywhere a value is used, such as in arithmetic, and you can change its value later.

# Assign a value to the variables my_apples and my_oranges
my_apples <- 5
my_oranges <- 6

# Add these two variables together
my_apples + my_oranges
## [1] 11
# Create the variable my_fruit and then display its value
my_fruit <-  my_apples + my_oranges
my_fruit
## [1] 11
# Change value of my_apples to 7
my_apples <- 7

# note how sum will change, but value of my_fruit hasn't changed
my_apples + my_oranges
## [1] 13
my_fruit
## [1] 11

You can also display a variable created in a previous code chunk:

my_apples
## [1] 7

If you try to use the name of a variable that doesn’t exist, you’ll get an error. Usually that will stop the output file from being created (“knit” for a single RMD file, or “built” when building a bookdown book). However, for this BP we’re telling R to keep going when there’s an error and instead display the error message with yellow font and crimson background. All errors should be fixed unless the code is included specifically to demonstrate an error, as here:

my_fruit <- my_apples + my_oranges + my_pears

ERROR: in eval(expr, envir, enclos): object ‘my_pears’ not found

1.1.3 Data Types

Basic data types in R include:

  • Decimal values like 4.5 are called numerics.
  • Whole numbers stored with the L suffix, like 4L, are called integers (a plain 4 is stored as a numeric). Integers are also numerics.
  • Boolean values (TRUE or FALSE) are called logical.
  • Text (or string) values are called characters. These are put in quotation marks.

You can check the type of a variable using the class() function.

# Declare variables of different types
my_numeric <- 42
my_character <- "universe"
my_logical <- FALSE 

# Check class of my_numeric
class(my_numeric)
## [1] "numeric"
# Check class of my_character
class(my_character)
## [1] "character"
# Check class of my_logical
class(my_logical)
## [1] "logical"
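As a small supplementary sketch (not one of the DataCamp exercises), the difference between a numeric and an integer can be checked with class(). The variable names below are made up for illustration.

```r
# Hypothetical variable names, for illustration only
plain_four <- 4      # no suffix: stored as a numeric (double)
integer_four <- 4L   # L suffix: stored as an integer

class(plain_four)
## [1] "numeric"
class(integer_four)
## [1] "integer"

# Integers still count as numbers, so is.numeric() is TRUE for both
is.numeric(integer_four)
## [1] TRUE
```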

Doing arithmetic on variables of incompatible types, such as adding a numeric to a character, results in an error:

my_numeric + my_character

ERROR: in my_numeric + my_character: non-numeric argument to binary operator

1.2 Vectors

1.2.1 Create a vector

A vector stores elements of the same data type (numeric, character, or logical). Use the combine function c() to create a vector, listing the elements separated by commas: c(element1, element2, element3, ...).

# Poker winnings from Monday to Friday
poker_vector <- c(140, -50, 20, -120, 240)

# Roulette winnings from Monday to Friday
roulette_vector <-  c(-24,-50,100,-350,10)
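Because a vector holds a single data type, c() converts mixed inputs to one common type rather than raising an error. The following is a minimal sketch of that behavior; the vector names are hypothetical and not from the DataCamp course.

```r
# Hypothetical example: mixing a number, text, and a logical in one c() call
mixed_vector <- c(140, "Tuesday", TRUE)

# Everything has been converted to text, the most general of the three types
class(mixed_vector)
## [1] "character"

# Mixing only numbers and logicals gives a numeric vector (TRUE becomes 1)
number_and_logical <- c(140, TRUE)
class(number_and_logical)
## [1] "numeric"
```

This is worth keeping in mind when a vector that should be numeric turns out to be character: one stray text value is enough to convert the whole vector.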

1.2.2 Name elements of a vector

The names() function lets you assign names to the elements of a vector. If we want to assign the same names to multiple vectors, it’s best to type the names once, store them in a variable, and reuse that variable each time.

# The variable days_vector
days_vector <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")
 
# Assign the names of the day to roulette_vector and poker_vector
names(poker_vector) <- days_vector
names(roulette_vector) <- days_vector

1.2.3 Selecting elements of a vector

Select elements of a vector with square brackets, using either the element’s position or its name:

poker_vector[2]
## Tuesday 
##     -50
poker_vector["Tuesday"]
## Tuesday 
##     -50

Select multiple elements using c(). For example, select Monday, Wednesday, Friday:

poker_vector[c(1,3,5)]
##    Monday Wednesday    Friday 
##       140        20       240
poker_vector[c("Monday","Wednesday","Friday")]
##    Monday Wednesday    Friday 
##       140        20       240

Select all elements between a start and an end point using a colon:

poker_vector[3:5]
## Wednesday  Thursday    Friday 
##        20      -120       240

1.2.4 Vector arithmetic

You can do arithmetic on numeric vectors of the same length.

# Assign to total_daily how much you won/lost on each day
total_daily <- poker_vector + roulette_vector
poker_vector
##    Monday   Tuesday Wednesday  Thursday    Friday 
##       140       -50        20      -120       240
roulette_vector
##    Monday   Tuesday Wednesday  Thursday    Friday 
##       -24       -50       100      -350        10
total_daily
##    Monday   Tuesday Wednesday  Thursday    Friday 
##       116      -100       120      -470       250

Use the sum() function to add up all elements (here, all days of the week), and the mean() function to find the average.

sum(total_daily)
## [1] -84
mean(total_daily)
## [1] -16.8
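The same functions can be applied to each game separately and the results compared. This is a minimal sketch using the winnings from above; the names total_poker and total_roulette are chosen here for illustration.

```r
# Winnings from Monday to Friday (same values as above)
poker_vector <- c(140, -50, 20, -120, 240)
roulette_vector <- c(-24, -50, 100, -350, 10)

# Total winnings per game
total_poker <- sum(poker_vector)
total_roulette <- sum(roulette_vector)
total_poker
## [1] 230
total_roulette
## [1] -314

# Did poker do better than roulette over the week?
total_poker > total_roulette
## [1] TRUE
```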

1.2.5 Logical Comparisons

See section 2.1 for additional discussion.

  • < less than
  • > greater than
  • <= less than or equal to
  • >= greater than or equal to
  • == equal
  • != not equal

These work on individual values and on vectors (element by element):

poker_vector
##    Monday   Tuesday Wednesday  Thursday    Friday 
##       140       -50        20      -120       240
roulette_vector
##    Monday   Tuesday Wednesday  Thursday    Friday 
##       -24       -50       100      -350        10
total_daily
##    Monday   Tuesday Wednesday  Thursday    Friday 
##       116      -100       120      -470       250
# Compare two vectors
poker_vector >= roulette_vector
##    Monday   Tuesday Wednesday  Thursday    Friday 
##      TRUE      TRUE     FALSE      TRUE      TRUE
# Compare vector elements to number
total_daily >= 0
##    Monday   Tuesday Wednesday  Thursday    Friday 
##      TRUE     FALSE      TRUE     FALSE      TRUE
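Since TRUE counts as 1 and FALSE as 0 in arithmetic, a logical comparison can be passed to sum() or mean() to count matches. A short sketch, with total_daily recreated so it runs on its own:

```r
# Daily totals from above (poker plus roulette)
total_daily <- c(Monday = 116, Tuesday = -100, Wednesday = 120,
                 Thursday = -470, Friday = 250)

# Number of days with a non-negative total
sum(total_daily >= 0)
## [1] 3

# Share of days with a non-negative total
mean(total_daily >= 0)
## [1] 0.6
```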

1.2.6 Selection by comparison

You can combine logical operators with selection to select elements that meet specific requirements. To do this, you use a selection vector

# Which days did you make money on poker?
selection_vector <- poker_vector > 0

The selection vector has the value TRUE when that day’s value is greater than 0 and FALSE otherwise (that is, when it is 0 or less):

selection_vector
##    Monday   Tuesday Wednesday  Thursday    Friday 
##      TRUE     FALSE      TRUE     FALSE      TRUE

Then use the selection vector inside [] to keep the elements marked TRUE and drop the ones marked FALSE:

# Select from poker_vector these days
poker_winning_days <- poker_vector[selection_vector]

poker_winning_days
##    Monday Wednesday    Friday 
##       140        20       240

You can also do it directly without creating the separate selection vector

poker_vector[poker_vector > 0]
##    Monday Wednesday    Friday 
##       140        20       240
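The same pattern can be applied to names() to get only the names of the matching days, or combined with sum() to summarize just the selected elements. A minimal sketch, with the vector recreated so the code is self-contained:

```r
# Poker winnings with day names (same values as above)
poker_vector <- c(Monday = 140, Tuesday = -50, Wednesday = 20,
                  Thursday = -120, Friday = 240)

# Names of the winning days only
winning_day_names <- names(poker_vector)[poker_vector > 0]

# Total won on winning days only
winning_total <- sum(poker_vector[poker_vector > 0])
winning_total
## [1] 400
```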

1.3 Matrices

A matrix, created with matrix(), is a collection of elements of the same data type (numeric, character, or logical) arranged into a fixed number of rows and columns.

1.3.1 Construct a matrix from values

We can use the matrix() function. The first argument is the collection of elements that R will arrange into the rows and columns of the matrix. We can use any numbers we want, for example c(7,1,1,4,6,2,1,7,7). Here, we’re going to use the numbers 1 to 9 in order. To do this, we could use c(1, 2, 3, 4, 5, 6, 7, 8, 9), but for numbers that go in order, a useful shortcut is 1:9.

The byrow=TRUE argument indicates that the matrix is filled by the rows. If we want the matrix to be filled by the columns, we just use byrow = FALSE.

The nrow=3 argument indicates that the matrix should have three rows.

# Construct a matrix with 3 rows that contain the numbers 1 up to 9
matrix(1:9,byrow=TRUE,nrow=3)
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
## [3,]    7    8    9
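For comparison, here is a sketch of the same call with byrow = FALSE, which fills the matrix column by column instead. The name by_col is chosen here for illustration.

```r
# Same numbers, but filled down the columns instead of across the rows
by_col <- matrix(1:9, byrow = FALSE, nrow = 3)
by_col
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9
```

The column-filled matrix is the transpose of the row-filled one, so the choice of byrow matters whenever the order of the input values has meaning.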

1.3.2 Construct a matrix from vectors

Suppose we have 3 vectors (which we’ll create here, but we might have them from somewhere else)

# Box office Star Wars (in millions!)
new_hope <- c(460.998, 314.4)
empire_strikes <- c(290.475, 247.900)
return_jedi <- c(309.306, 165.8)

Combine the three vectors into one long vector

# Create box_office
box_office <- c(new_hope,empire_strikes,return_jedi)
box_office
## [1] 460.998 314.400 290.475 247.900 309.306 165.800

Turn the long vector into a matrix like we did with the values 1:9

# Construct star_wars_matrix
star_wars_matrix <- matrix(box_office,byrow=TRUE,nrow=3)
star_wars_matrix
##         [,1]  [,2]
## [1,] 460.998 314.4
## [2,] 290.475 247.9
## [3,] 309.306 165.8

1.3.3 Naming a matrix

# Name the columns with region
colnames(star_wars_matrix) <- c("US", "non-US")
# Name the rows with titles
rownames(star_wars_matrix) <- c("A New Hope", "The Empire Strikes Back", "Return of the Jedi")
star_wars_matrix
##                              US non-US
## A New Hope              460.998  314.4
## The Empire Strikes Back 290.475  247.9
## Return of the Jedi      309.306  165.8

We could also do this all in one step with the help of the dimnames argument:

star_wars_matrix <- matrix(box_office, nrow = 3, byrow = TRUE,
                      dimnames = list(c("A New Hope", "The Empire Strikes Back", "Return of the Jedi"),
                      c("US", "non-US")))
star_wars_matrix
##                              US non-US
## A New Hope              460.998  314.4
## The Empire Strikes Back 290.475  247.9
## Return of the Jedi      309.306  165.8

1.3.4 Matrix arithmetic

1.3.4.1 Operations by row and column

rowSums() adds up the values across all columns for each row of a matrix. colSums() adds up the values across all rows for each column of a matrix. These are the matrix equivalents of how sum() adds up all elements of a vector.

# For each movie, calculate how much it made across all regions
rowSums(star_wars_matrix)
##              A New Hope The Empire Strikes Back      Return of the Jedi 
##                 775.398                 538.375                 475.106
# For each region, calculate how much all movies made
colSums(star_wars_matrix)
##       US   non-US 
## 1060.779  728.100

There are also rowMeans() and colMeans(), the matrix equivalents (by row and by column) of mean() for a vector.

# For each movie, calculate the average it made per region
rowMeans(star_wars_matrix)
##              A New Hope The Empire Strikes Back      Return of the Jedi 
##                387.6990                269.1875                237.5530
# For each region, calculate the average made per movie
colMeans(star_wars_matrix)
##      US  non-US 
## 353.593 242.700
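To make the relationship between the sum and mean functions concrete, the following sketch checks that a row mean is the row sum divided by the number of columns (and likewise for columns). The matrix is recreated so the code runs on its own; manual_row_means and manual_col_means are names chosen for this illustration.

```r
# Recreate the matrix from this section
box_office <- c(460.998, 314.4, 290.475, 247.900, 309.306, 165.8)
star_wars_matrix <- matrix(box_office, nrow = 3, byrow = TRUE,
                      dimnames = list(c("A New Hope", "The Empire Strikes Back", "Return of the Jedi"),
                      c("US", "non-US")))

# Row mean = row sum / number of columns
manual_row_means <- rowSums(star_wars_matrix) / ncol(star_wars_matrix)
all.equal(manual_row_means, rowMeans(star_wars_matrix))
## [1] TRUE

# Column mean = column sum / number of rows
manual_col_means <- colSums(star_wars_matrix) / nrow(star_wars_matrix)
all.equal(manual_col_means, colMeans(star_wars_matrix))
## [1] TRUE
```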

1.3.4.2 Element-wise operations

You can also do arithmetic on each element individually. For example, suppose we want to convert sales from millions to thousands. To do this, we multiply each value by 1000:

star_wars_matrix_Thousands <- star_wars_matrix * 1000
star_wars_matrix
##                              US non-US
## A New Hope              460.998  314.4
## The Empire Strikes Back 290.475  247.9
## Return of the Jedi      309.306  165.8
star_wars_matrix_Thousands
##                             US non-US
## A New Hope              460998 314400
## The Empire Strikes Back 290475 247900
## Return of the Jedi      309306 165800

You can also do arithmetic element-by-element using entire matrices as long as the dimensions are the same. Suppose we have a matrix of average ticket prices (created manually using DC output)

ticket_prices <- matrix(c(5, 5, 6, 6, 7, 7), nrow = 3, byrow = TRUE,
                      dimnames = list(c("A New Hope", "The Empire Strikes Back", "Return of the Jedi"),
                      c("US", "non-US")))

Then calculate the number of people who purchased tickets (in millions) by dividing the sales (in millions) by the ticket prices

star_wars_matrix
##                              US non-US
## A New Hope              460.998  314.4
## The Empire Strikes Back 290.475  247.9
## Return of the Jedi      309.306  165.8
ticket_prices
##                         US non-US
## A New Hope               5      5
## The Empire Strikes Back  6      6
## Return of the Jedi       7      7
star_wars_matrix / ticket_prices
##                               US   non-US
## A New Hope              92.19960 62.88000
## The Empire Strikes Back 48.41250 41.31667
## Return of the Jedi      44.18657 23.68571

(Note about approaching the BP…for the most part I went in the same order as DC, but this last example was actually given as the last exercise of the Matrix part. I thought it fit better here, so I created my own example very similar to theirs and put it here.)

1.3.5 Adding columns and rows to a matrix

1.3.5.1 Add column

For each movie, calculate how much it made across all regions. Then use cbind() to bind this column to the matrix (i.e., add a “totals” column to the right of the matrix). DataCamp named it worldwide_vector, but I’m going to name it Total so that the matrix column is named Total.

# For each movie, calculate how much it made across all regions
Total <- rowSums(star_wars_matrix)

# Bind the new variable Total as a column to star_wars_matrix
cbind(star_wars_matrix,Total)
##                              US non-US   Total
## A New Hope              460.998  314.4 775.398
## The Empire Strikes Back 290.475  247.9 538.375
## Return of the Jedi      309.306  165.8 475.106

1.3.5.2 Add rows (and stacking matrices)

Stack two matrices on top of each other (note that to create the star_wars_matrix2 matrix I copy/pasted values from the output, thereby also testing what I’ve learned about creating a matrix)

star_wars_matrix2 <- matrix(c(474.5, 552.5, 310.7, 338.7, 380.3, 468.5), nrow = 3, byrow = TRUE,
                      dimnames = list(c("The Phantom Menace", "Attack of the Clones", "Revenge of the Sith"),
                      c("US", "non-US")))
star_wars_matrix
##                              US non-US
## A New Hope              460.998  314.4
## The Empire Strikes Back 290.475  247.9
## Return of the Jedi      309.306  165.8
star_wars_matrix2
##                         US non-US
## The Phantom Menace   474.5  552.5
## Attack of the Clones 310.7  338.7
## Revenge of the Sith  380.3  468.5

Combine the original matrix with the second one by stacking them using rbind()

star_wars_all <- rbind(star_wars_matrix,star_wars_matrix2)
star_wars_all
##                              US non-US
## A New Hope              460.998  314.4
## The Empire Strikes Back 290.475  247.9
## Return of the Jedi      309.306  165.8
## The Phantom Menace      474.500  552.5
## Attack of the Clones    310.700  338.7
## Revenge of the Sith     380.300  468.5

You can also add a single row. Let’s add the total for each region for all movies.

For each region, calculate how much all movies made. Store this in a vector named Total so the row will have the name Total. Then use rbind() to bind it as the bottom row of the matrix

Total <- colSums(star_wars_all)
star_wars_all_WithTotals <- rbind(star_wars_all,Total)

Note that you can add both rows and columns by just doing one and then the other (once you’ve added rows to a matrix, it’s just a matrix, so you can add columns just like any other matrix)

As an example, let’s start with the star_wars_all_WithTotals matrix and add a Total column with the sum of each row across regions:

# For each movie, calculate how much it made across all regions
Total <- rowSums(star_wars_all_WithTotals)

# Bind the new variable Total as a column to star_wars_all_WithTotals
star_wars_all_WithTotals <- cbind(star_wars_all_WithTotals,Total)
star_wars_all_WithTotals
##                               US non-US    Total
## A New Hope               460.998  314.4  775.398
## The Empire Strikes Back  290.475  247.9  538.375
## Return of the Jedi       309.306  165.8  475.106
## The Phantom Menace       474.500  552.5 1027.000
## Attack of the Clones     310.700  338.7  649.400
## Revenge of the Sith      380.300  468.5  848.800
## Total                   2226.279 2087.8 4314.079

1.3.6 Selection of matrix elements

1.3.6.1 Select entire columns

To select an entire column or columns, leave the row part blank and select the column(s) like you do from a vector. We can type a specific number, use a range (2:3), or use the c() function.

# Full matrix
star_wars_all_WithTotals
##                               US non-US    Total
## A New Hope               460.998  314.4  775.398
## The Empire Strikes Back  290.475  247.9  538.375
## Return of the Jedi       309.306  165.8  475.106
## The Phantom Menace       474.500  552.5 1027.000
## Attack of the Clones     310.700  338.7  649.400
## Revenge of the Sith      380.300  468.5  848.800
## Total                   2226.279 2087.8 4314.079
# Just column 2
star_wars_all_WithTotals[,2]
##              A New Hope The Empire Strikes Back      Return of the Jedi 
##                   314.4                   247.9                   165.8 
##      The Phantom Menace    Attack of the Clones     Revenge of the Sith 
##                   552.5                   338.7                   468.5 
##                   Total 
##                  2087.8
# Just columns 2 and 3
star_wars_all_WithTotals[,2:3]
##                         non-US    Total
## A New Hope               314.4  775.398
## The Empire Strikes Back  247.9  538.375
## Return of the Jedi       165.8  475.106
## The Phantom Menace       552.5 1027.000
## Attack of the Clones     338.7  649.400
## Revenge of the Sith      468.5  848.800
## Total                   2087.8 4314.079
# Just columns 1 and 3
star_wars_all_WithTotals[,c(1,3)]
##                               US    Total
## A New Hope               460.998  775.398
## The Empire Strikes Back  290.475  538.375
## Return of the Jedi       309.306  475.106
## The Phantom Menace       474.500 1027.000
## Attack of the Clones     310.700  649.400
## Revenge of the Sith      380.300  848.800
## Total                   2226.279 4314.079

1.3.6.2 Select entire rows

To select an entire row or rows, leave the column part blank and select the row(s) like you select from a vector.

# Full matrix
star_wars_all_WithTotals
##                               US non-US    Total
## A New Hope               460.998  314.4  775.398
## The Empire Strikes Back  290.475  247.9  538.375
## Return of the Jedi       309.306  165.8  475.106
## The Phantom Menace       474.500  552.5 1027.000
## Attack of the Clones     310.700  338.7  649.400
## Revenge of the Sith      380.300  468.5  848.800
## Total                   2226.279 2087.8 4314.079
# Just row 2
star_wars_all_WithTotals[2,]
##      US  non-US   Total 
## 290.475 247.900 538.375
# Just rows 2 and 3
star_wars_all_WithTotals[2:3,]
##                              US non-US   Total
## The Empire Strikes Back 290.475  247.9 538.375
## Return of the Jedi      309.306  165.8 475.106
# Just rows 1 and 3
star_wars_all_WithTotals[c(1,3),]
##                         US non-US   Total
## A New Hope         460.998  314.4 775.398
## Return of the Jedi 309.306  165.8 475.106

1.3.6.3 Selecting specific rows and columns (and specific elements)

To select specific row(s) and column(s), just select from both instead of leaving one blank

# Full matrix
star_wars_all_WithTotals
##                               US non-US    Total
## A New Hope               460.998  314.4  775.398
## The Empire Strikes Back  290.475  247.9  538.375
## Return of the Jedi       309.306  165.8  475.106
## The Phantom Menace       474.500  552.5 1027.000
## Attack of the Clones     310.700  338.7  649.400
## Revenge of the Sith      380.300  468.5  848.800
## Total                   2226.279 2087.8 4314.079
# Just row 2 and column 1
star_wars_all_WithTotals[2,1]
## [1] 290.475
# Just rows 4 through 6 and column 1
star_wars_all_WithTotals[4:6,1]
##   The Phantom Menace Attack of the Clones  Revenge of the Sith 
##                474.5                310.7                380.3
# Just rows 1 and 3 and columns 1 and 2
star_wars_all_WithTotals[c(1,3),c(1,2)]
##                         US non-US
## A New Hope         460.998  314.4
## Return of the Jedi 309.306  165.8
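Because this matrix has row and column names, elements can also be selected by name, just as with named vectors. The following is a minimal sketch on the smaller star_wars_matrix, recreated here so the code is self-contained:

```r
# Recreate the named matrix from earlier in this section
box_office <- c(460.998, 314.4, 290.475, 247.900, 309.306, 165.8)
star_wars_matrix <- matrix(box_office, nrow = 3, byrow = TRUE,
                      dimnames = list(c("A New Hope", "The Empire Strikes Back", "Return of the Jedi"),
                      c("US", "non-US")))

# One element, selected by row name and column name
star_wars_matrix["A New Hope", "US"]
## [1] 460.998

# Several rows of one column, selected by name
non_us_selection <- star_wars_matrix[c("A New Hope", "Return of the Jedi"), "non-US"]
```

Selecting by name tends to be more robust than selecting by position, since it keeps working if rows or columns are later added or reordered.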

1.4 Factors

A factor variable is a variable with categorical data with a finite number of categories (as opposed to a continuous variable which can have an infinite number of values).

1.4.1 Nominal vs ordinal factor variables

Factor variables come in two types, “nominal” and “ordinal”.

A nominal factor variable is a categorical variable without an implied order. For example, a list of animals:

animals_vector <- c("Elephant", "Giraffe", "Donkey", "Horse")
factor_animals_vector <- factor(animals_vector)
factor_animals_vector
## [1] Elephant Giraffe  Donkey   Horse   
## Levels: Donkey Elephant Giraffe Horse

Note that it does list the levels in alphabetical order, but just because Donkey comes before Elephant alphabetically doesn’t make it better or worse, higher or lower.

In contrast, an ordinal factor variable does have a natural ordering. For example, weather reports about the temperature being “Low”, “Medium”, or “High”:

temperature_vector <- c("High", "Low", "High","Low", "Medium")
factor_temperature_vector <- factor(temperature_vector, order = TRUE, levels = c("Low", "Medium", "High"))
factor_temperature_vector
## [1] High   Low    High   Low    Medium
## Levels: Low < Medium < High
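One way to see how the order is stored is to look at the integer codes behind the factor. This sketch recreates the factor (writing the argument name out in full as ordered, which the shorter order above matches) and converts it with as.integer().

```r
temperature_vector <- c("High", "Low", "High", "Low", "Medium")
factor_temperature_vector <- factor(temperature_vector, ordered = TRUE,
                                    levels = c("Low", "Medium", "High"))

# Levels in the order we specified, not alphabetical order
levels(factor_temperature_vector)
## [1] "Low"    "Medium" "High"

# Each value is stored as the position of its level: Low = 1, Medium = 2, High = 3
as.integer(factor_temperature_vector)
## [1] 3 1 3 1 2
```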

For an ordinal factor variable, the order comes from external knowledge of what the levels mean rather than from alphabetical order.

1.4.2 Comparing factor values vs levels

1.4.2.1 Nominal factor variable comparison

For a nominal factor variable, you can’t compare levels (so you can’t compare elements of the vector). In fact, it displays a warning that “> not meaningful for factors.”

# comparing values of a nominal factor variable (you can't) 
factor_animals_vector[1]
## [1] Elephant
## Levels: Donkey Elephant Giraffe Horse
factor_animals_vector[2]
## [1] Giraffe
## Levels: Donkey Elephant Giraffe Horse
factor_animals_vector[1] > factor_animals_vector[2]
## Warning in Ops.factor(factor_animals_vector[1], factor_animals_vector[2]): '>'
## not meaningful for factors
## [1] NA

If you compare the text values instead, R will let you. However, the ordering is just alphabetical and doesn’t actually make sense.

# comparing text values (works, but doesn't make sense)
animals_vector[1]
## [1] "Elephant"
animals_vector[2]
## [1] "Giraffe"
animals_vector[1] > animals_vector[2]
## [1] FALSE

1.4.2.2 Ordinal factor variable comparison

For an ordinal factor variable, you can compare the levels (and can thus compare elements of the vector). It has an order, so you can compare them just like you can compare numbers.

# comparing values of an ordinal factor variable
factor_temperature_vector[1]
## [1] High
## Levels: Low < Medium < High
factor_temperature_vector[2]
## [1] Low
## Levels: Low < Medium < High
factor_temperature_vector[1] > factor_temperature_vector[2]
## [1] TRUE

You can also compare the values as text, but R does so alphabetically, with “High” being smaller than “Low” because “H” comes before “L”. In other words, the ordering will be wrong (unless the alphabetical ordering happens to match the correct ordering exactly).

# comparing text values (doesn't make sense though based on actual ordering)
temperature_vector[1]
## [1] "High"
temperature_vector[2]
## [1] "Low"
temperature_vector[1] > temperature_vector[2]
## [1] FALSE

1.4.3 Setting factor levels

Sometimes you will want to change the names of levels of a factor variable for clarity or other reasons. R allows you to do this with the function levels(): levels(factor_vector) <- c("name1", "name2",...).

The order in which you list the levels is very important. By default, R lists them alphabetically. You should look at the order in which they are displayed and re-assign the levels in that same order.

Here’s an example of survey data that recorded genders reported by respondents as “M” and “F”. But we might want these to be written out as “Male” and “Female”. Here is how you re-code the levels:

# Code to build factor_survey_vector
survey_vector <- c("M", "F", "F", "M", "M")
factor_survey_vector <- factor(survey_vector)
factor_survey_vector
## [1] M F F M M
## Levels: F M
# Specify the levels of factor_survey_vector
levels(factor_survey_vector) <- c("Female","Male")
factor_survey_vector
## [1] Male   Female Female Male   Male  
## Levels: Female Male
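To illustrate why the order matters, this sketch first assigns the new names in the wrong order, then shows an alternative that pairs each old value with its new label explicitly using the levels and labels arguments of factor(). The names wrong_order and labelled_survey are hypothetical.

```r
survey_vector <- c("M", "F", "F", "M", "M")

# Wrong order: the levels are stored as F, M, so "F" becomes "Male" and "M" becomes "Female"
wrong_order <- factor(survey_vector)
levels(wrong_order) <- c("Male", "Female")
as.character(wrong_order)
## [1] "Female" "Male"   "Male"   "Female" "Female"

# Alternative: state the old values and their new labels side by side
labelled_survey <- factor(survey_vector,
                          levels = c("F", "M"),
                          labels = c("Female", "Male"))
as.character(labelled_survey)
## [1] "Male"   "Female" "Female" "Male"   "Male"
```

Spelling out levels and labels together avoids relying on the default alphabetical order, at the cost of a slightly longer call.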

1.4.4 Summarizing a factor variable

Use the summary() function to get a summary of how many of each level you have. Here is an example with a nominal factor:

factor_survey_vector
## [1] Male   Female Female Male   Male  
## Levels: Female Male
summary(factor_survey_vector)
## Female   Male 
##      2      3

This also works for ordinal factors

factor_temperature_vector
## [1] High   Low    High   Low    Medium
## Levels: Low < Medium < High
summary(factor_temperature_vector)
##    Low Medium   High 
##      2      1      2

Note that this doesn’t work if the vector isn’t a factor

survey_vector
## [1] "M" "F" "F" "M" "M"
summary(survey_vector)
##    Length     Class      Mode 
##         5 character character
temperature_vector
## [1] "High"   "Low"    "High"   "Low"    "Medium"
summary(temperature_vector)
##    Length     Class      Mode 
##         5 character character
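If the vector has not been converted to a factor, one option for getting counts is the table() function, which works on character vectors as well. A minimal sketch (survey_counts is a name chosen for illustration):

```r
survey_vector <- c("M", "F", "F", "M", "M")

# Count how many times each distinct value appears
survey_counts <- table(survey_vector)
survey_counts[["F"]]
## [1] 2
survey_counts[["M"]]
## [1] 3
```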

1.5 Data frames

In a matrix all columns have to be the same type. Often datasets include variables with different types. Data frames are sort of like matrices, except the different columns can be of different data types. A data frame has the variables of a dataset as columns and the observations as rows.

An example is the mtcars data frame built into R:

mtcars
##                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
## Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
## Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
## Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
## Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
## Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
## AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
## Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
## Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
## Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
## Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
## Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
## Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

(Note that later in the BP you do NOT want to display an entire dataset in your book. It will greatly slow down how fast it builds and displays. Some datasets will have millions of rows, and you won’t be able to do anything in R for hours or build your book at all. Instead, save anything that creates hundreds of rows into a variable and only display its head() or str())

1.5.1 Displaying part of a dataset

Many datasets have hundreds (or thousands or millions) of rows. We clearly don’t want to display them like we just displayed all 32 rows of mtcars.

Display the first few rows using head() or the last few rows using tail()

head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
tail(mtcars)
##                 mpg cyl  disp  hp drat    wt qsec vs am gear carb
## Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.7  0  1    5    2
## Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.9  1  1    5    2
## Ford Pantera L 15.8   8 351.0 264 4.22 3.170 14.5  0  1    5    4
## Ferrari Dino   19.7   6 145.0 175 3.62 2.770 15.5  0  1    5    6
## Maserati Bora  15.0   8 301.0 335 3.54 3.570 14.6  0  1    5    8
## Volvo 142E     21.4   4 121.0 109 4.11 2.780 18.6  1  1    4    2
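Both functions show 6 rows by default. As a short sketch, the n argument changes how many rows are returned; the names first_three and last_two are chosen here for illustration.

```r
# First 3 rows instead of the default 6
first_three <- head(mtcars, n = 3)
nrow(first_three)
## [1] 3

# Last 2 rows
last_two <- tail(mtcars, n = 2)
rownames(last_two)
## [1] "Maserati Bora" "Volvo 142E"
```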

Display the first few values, along with info on variable types, and the total number of observations, using str() (str for structure):

str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

1.5.2 Constructing a data frame

# Definition of vectors
name <- c("Mercury", "Venus", "Earth", "Mars", "Jupiter", "Saturn", "Uranus", "Neptune")
type <- c("Terrestrial planet", "Terrestrial planet", "Terrestrial planet", 
          "Terrestrial planet", "Gas giant", "Gas giant", "Gas giant", "Gas giant")
diameter <- c(0.382, 0.949, 1, 0.532, 11.209, 9.449, 4.007, 3.883)
rotation <- c(58.64, -243.02, 1, 1.03, 0.41, 0.43, -0.72, 0.67)
rings <- c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE)

# Create a data frame from the vectors
planets_df <- data.frame(name,type,diameter,rotation,rings)

str(planets_df)
## 'data.frame':    8 obs. of  5 variables:
##  $ name    : chr  "Mercury" "Venus" "Earth" "Mars" ...
##  $ type    : chr  "Terrestrial planet" "Terrestrial planet" "Terrestrial planet" "Terrestrial planet" ...
##  $ diameter: num  0.382 0.949 1 0.532 11.209 ...
##  $ rotation: num  58.64 -243.02 1 1.03 0.41 ...
##  $ rings   : logi  FALSE FALSE FALSE FALSE TRUE TRUE ...
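
data.frame() also accepts name = value pairs, so the column names do not have to match the vector names. The sketch below uses hypothetical vectors (planet_names, relative_diameter) that are not part of the example above:

```r
# Hypothetical short vectors, used only for this illustration
planet_names <- c("Mercury", "Venus", "Earth")
relative_diameter <- c(0.382, 0.949, 1)

# name = value pairs set the column names explicitly
demo_df <- data.frame(planet = planet_names, diameter = relative_diameter)

names(demo_df)
## [1] "planet"   "diameter"
nrow(demo_df)
## [1] 3
```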

1.5.3 Selecting rows, columns, and values from data frame

Select rows and columns from a data frame with [row, column] indexing, just as you do for a matrix (explained in more detail in the Matrix section above). Recall that the colon selects a range (e.g., 2:4 for 2, 3, and 4), and c() selects specific positions (e.g., c(1,3,5)).

# Entire data frame
planets_df
##      name               type diameter rotation rings
## 1 Mercury Terrestrial planet    0.382    58.64 FALSE
## 2   Venus Terrestrial planet    0.949  -243.02 FALSE
## 3   Earth Terrestrial planet    1.000     1.00 FALSE
## 4    Mars Terrestrial planet    0.532     1.03 FALSE
## 5 Jupiter          Gas giant   11.209     0.41  TRUE
## 6  Saturn          Gas giant    9.449     0.43  TRUE
## 7  Uranus          Gas giant    4.007    -0.72  TRUE
## 8 Neptune          Gas giant    3.883     0.67  TRUE
# select entire 1st row
planets_df[1,]
##      name               type diameter rotation rings
## 1 Mercury Terrestrial planet    0.382    58.64 FALSE
# select entire 1st and 2nd rows
planets_df[1:2,]
##      name               type diameter rotation rings
## 1 Mercury Terrestrial planet    0.382    58.64 FALSE
## 2   Venus Terrestrial planet    0.949  -243.02 FALSE
# select entire 2nd column
planets_df[,2]
## [1] "Terrestrial planet" "Terrestrial planet" "Terrestrial planet"
## [4] "Terrestrial planet" "Gas giant"          "Gas giant"         
## [7] "Gas giant"          "Gas giant"
# select entire 1st and 2nd columns
planets_df[,1:2]
##      name               type
## 1 Mercury Terrestrial planet
## 2   Venus Terrestrial planet
## 3   Earth Terrestrial planet
## 4    Mars Terrestrial planet
## 5 Jupiter          Gas giant
## 6  Saturn          Gas giant
## 7  Uranus          Gas giant
## 8 Neptune          Gas giant
# select a few columns and a few rows
planets_df[c(2,4,5),1:3]
##      name               type diameter
## 2   Venus Terrestrial planet    0.949
## 4    Mars Terrestrial planet    0.532
## 5 Jupiter          Gas giant   11.209

1.5.3.1 Using column names

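You can also select columns by name using [] with the name in quotation marks. Note that a name on its own (e.g., planets_df["name"]) returns a one-column data frame, while a row, column selection of a single column (e.g., planets_df[c(2,3,5),"name"]) returns a vector.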

planets_df["name"]
##      name
## 1 Mercury
## 2   Venus
## 3   Earth
## 4    Mars
## 5 Jupiter
## 6  Saturn
## 7  Uranus
## 8 Neptune
planets_df[c(2,3,5),"name"]
## [1] "Venus"   "Earth"   "Jupiter"
planets_df[c(2,3,5),c("name","type")]
##      name               type
## 2   Venus Terrestrial planet
## 3   Earth Terrestrial planet
## 5 Jupiter          Gas giant
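
The difference between the data frame result and the vector result shown above can be checked with class(). This is a sketch on a small illustrative data frame (demo_df is a hypothetical cut-down copy of planets_df), assuming R 4.0 or later, where text columns are stored as character rather than factor:

```r
# Small illustrative data frame (hypothetical, not used elsewhere)
demo_df <- data.frame(
  name = c("Mercury", "Venus", "Earth"),
  diameter = c(0.382, 0.949, 1)
)

# A column name on its own keeps the data frame structure
class(demo_df["name"])
## [1] "data.frame"

# A row, column selection of a single column is simplified to a vector
class(demo_df[, "name"])
## [1] "character"

# drop = FALSE asks R not to simplify, so the result stays a data frame
class(demo_df[, "name", drop = FALSE])
## [1] "data.frame"
```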

1.5.3.2 Using dollar sign notation

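You can also select a single column with $ notation: the data frame name, a dollar sign, and the column name (no quotation marks needed). The result is a vector, which makes this a convenient way to pass one column to another function.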

planets_df$name
## [1] "Mercury" "Venus"   "Earth"   "Mars"    "Jupiter" "Saturn"  "Uranus" 
## [8] "Neptune"

Once you select an entire column using $ notation, the result is a vector, so you can select elements from it just as you do from any other vector:

planets_df$name[2]
## [1] "Venus"
planets_df$name[c(2,4,5)]
## [1] "Venus"   "Mars"    "Jupiter"
planets_df$name[1:3]
## [1] "Mercury" "Venus"   "Earth"
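
Because a column selected with $ is an ordinary vector, it can also be passed straight to functions. A minimal sketch on a hypothetical four-row copy of the planets data (demo_df is an illustration name only):

```r
# Hypothetical cut-down copy of the planets data
demo_df <- data.frame(
  name = c("Mercury", "Venus", "Earth", "Mars"),
  diameter = c(0.382, 0.949, 1, 0.532)
)

# Summarize a column like any numeric vector
mean(demo_df$diameter)

# which.max() returns the position of the largest value,
# which can then index another column
demo_df$name[which.max(demo_df$diameter)]
## [1] "Earth"
```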

1.5.3.3 Using selection vector

The rings variable is logical, with TRUE and FALSE values. It can therefore be used as a selection vector, as we did in the vector section above: placing it in the row position keeps only the rows where it is TRUE.

planets_df$rings
## [1] FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE
planets_df[planets_df$rings,]
##      name      type diameter rotation rings
## 5 Jupiter Gas giant   11.209     0.41  TRUE
## 6  Saturn Gas giant    9.449     0.43  TRUE
## 7  Uranus Gas giant    4.007    -0.72  TRUE
## 8 Neptune Gas giant    3.883     0.67  TRUE
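
A selection vector does not have to be an existing column. Any comparison on a column produces a logical vector that can be used the same way. A minimal sketch, using a hypothetical five-row copy of the planets data (demo_df and small are illustration names only):

```r
# Hypothetical cut-down copy of the planets data
demo_df <- data.frame(
  name = c("Mercury", "Venus", "Earth", "Mars", "Jupiter"),
  diameter = c(0.382, 0.949, 1, 0.532, 11.209),
  rings = c(FALSE, FALSE, FALSE, FALSE, TRUE)
)

# A comparison returns one TRUE/FALSE value per row
small <- demo_df$diameter < 1
small
## [1]  TRUE  TRUE FALSE  TRUE FALSE

# Use it in the row position to keep only the TRUE rows
demo_df[small, ]

# ! negates a logical vector: names of the planets without rings
# (Mercury, Venus, Earth, Mars)
demo_df[!demo_df$rings, "name"]
```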

1.5.3.4 Subsetting

Use the subset() function to select the observations that meet a specific condition, for example, every planet with a diameter less than 1:

subset(planets_df,subset = (diameter<1))
##      name               type diameter rotation rings
## 1 Mercury Terrestrial planet    0.382    58.64 FALSE
## 2   Venus Terrestrial planet    0.949  -243.02 FALSE
## 4    Mars Terrestrial planet    0.532     1.03 FALSE

(Note that we’ll later learn, and usually use, other methods such as filter() from the dplyr package, which is part of the tidyverse, but it’s occasionally useful to know this “base” approach too.)
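
subset() can also combine conditions with & and keep only some columns through its select argument. A minimal sketch on a hypothetical five-row copy of the planets data (demo_df and small_no_rings are illustration names only):

```r
# Hypothetical cut-down copy of the planets data
demo_df <- data.frame(
  name = c("Mercury", "Venus", "Earth", "Mars", "Jupiter"),
  diameter = c(0.382, 0.949, 1, 0.532, 11.209),
  rings = c(FALSE, FALSE, FALSE, FALSE, TRUE)
)

# Rows: diameter below 1 AND no rings; columns: name and diameter only
small_no_rings <- subset(demo_df,
                         subset = diameter < 1 & !rings,
                         select = c(name, diameter))

# Keeps Mercury, Venus, and Mars
small_no_rings
```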

1.5.4 Order and sorting

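The order() function returns the positions of the values in sorted order: the first number is the position of the smallest value, the second is the position of the next smallest, and so on. In the example below, the second number is 4 because the second-smallest diameter (0.532) is in position 4.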

planets_df$diameter
## [1]  0.382  0.949  1.000  0.532 11.209  9.449  4.007  3.883
order(planets_df$diameter)
## [1] 1 4 2 3 8 7 6 5

To sort a data frame, use the output of order() in the row position so that the rows are selected in sorted order:

planets_df
##      name               type diameter rotation rings
## 1 Mercury Terrestrial planet    0.382    58.64 FALSE
## 2   Venus Terrestrial planet    0.949  -243.02 FALSE
## 3   Earth Terrestrial planet    1.000     1.00 FALSE
## 4    Mars Terrestrial planet    0.532     1.03 FALSE
## 5 Jupiter          Gas giant   11.209     0.41  TRUE
## 6  Saturn          Gas giant    9.449     0.43  TRUE
## 7  Uranus          Gas giant    4.007    -0.72  TRUE
## 8 Neptune          Gas giant    3.883     0.67  TRUE
planets_df[order(planets_df$diameter),]
##      name               type diameter rotation rings
## 1 Mercury Terrestrial planet    0.382    58.64 FALSE
## 4    Mars Terrestrial planet    0.532     1.03 FALSE
## 2   Venus Terrestrial planet    0.949  -243.02 FALSE
## 3   Earth Terrestrial planet    1.000     1.00 FALSE
## 8 Neptune          Gas giant    3.883     0.67  TRUE
## 7  Uranus          Gas giant    4.007    -0.72  TRUE
## 6  Saturn          Gas giant    9.449     0.43  TRUE
## 5 Jupiter          Gas giant   11.209     0.41  TRUE

(Note that we’ll later learn, and usually use, other methods such as arrange() from the dplyr package, which is part of the tidyverse, but it’s occasionally useful to know this “base” approach too.)
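
It is easy to confuse order() with rank(): order() gives the positions to pick from, while rank() gives each value's place in the sorted sequence. The sketch below contrasts the two on a short hypothetical vector (demo_diameter) and also shows the decreasing argument:

```r
# Hypothetical short vector, used only for this illustration
demo_diameter <- c(0.382, 0.949, 1, 0.532)

# Positions of the values from smallest to largest
order(demo_diameter)
## [1] 1 4 2 3

# Rank of each value in its original position
rank(demo_diameter)
## [1] 1 3 4 2

# Largest first
order(demo_diameter, decreasing = TRUE)
## [1] 3 2 4 1

# Indexing with order() gives the same result as sort()
demo_diameter[order(demo_diameter)]
## [1] 0.382 0.532 0.949 1.000
```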

1.6 Lists

A list is a collection of R objects of any type. You can combine vectors, matrices, data frames, and other types of R objects in a single list.

1.6.1 Creating a list

Use the list() function to create a list:

# vector from above
my_fruit
## [1] 11
# matrix from above
star_wars_all
##                              US non-US
## A New Hope              460.998  314.4
## The Empire Strikes Back 290.475  247.9
## Return of the Jedi      309.306  165.8
## The Phantom Menace      474.500  552.5
## Attack of the Clones    310.700  338.7
## Revenge of the Sith     380.300  468.5
# data frame from above
planets_df
##      name               type diameter rotation rings
## 1 Mercury Terrestrial planet    0.382    58.64 FALSE
## 2   Venus Terrestrial planet    0.949  -243.02 FALSE
## 3   Earth Terrestrial planet    1.000     1.00 FALSE
## 4    Mars Terrestrial planet    0.532     1.03 FALSE
## 5 Jupiter          Gas giant   11.209     0.41  TRUE
## 6  Saturn          Gas giant    9.449     0.43  TRUE
## 7  Uranus          Gas giant    4.007    -0.72  TRUE
## 8 Neptune          Gas giant    3.883     0.67  TRUE
# Construct list with these different elements:
my_list <- list(my_fruit, star_wars_all, planets_df)
my_list
## [[1]]
## [1] 11
## 
## [[2]]
##                              US non-US
## A New Hope              460.998  314.4
## The Empire Strikes Back 290.475  247.9
## Return of the Jedi      309.306  165.8
## The Phantom Menace      474.500  552.5
## Attack of the Clones    310.700  338.7
## Revenge of the Sith     380.300  468.5
## 
## [[3]]
##      name               type diameter rotation rings
## 1 Mercury Terrestrial planet    0.382    58.64 FALSE
## 2   Venus Terrestrial planet    0.949  -243.02 FALSE
## 3   Earth Terrestrial planet    1.000     1.00 FALSE
## 4    Mars Terrestrial planet    0.532     1.03 FALSE
## 5 Jupiter          Gas giant   11.209     0.41  TRUE
## 6  Saturn          Gas giant    9.449     0.43  TRUE
## 7  Uranus          Gas giant    4.007    -0.72  TRUE
## 8 Neptune          Gas giant    3.883     0.67  TRUE
str(my_list)
## List of 3
##  $ : num 11
##  $ : num [1:6, 1:2] 461 290 309 474 311 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:6] "A New Hope" "The Empire Strikes Back" "Return of the Jedi" "The Phantom Menace" ...
##   .. ..$ : chr [1:2] "US" "non-US"
##  $ :'data.frame':    8 obs. of  5 variables:
##   ..$ name    : chr [1:8] "Mercury" "Venus" "Earth" "Mars" ...
##   ..$ type    : chr [1:8] "Terrestrial planet" "Terrestrial planet" "Terrestrial planet" "Terrestrial planet" ...
##   ..$ diameter: num [1:8] 0.382 0.949 1 0.532 11.209 ...
##   ..$ rotation: num [1:8] 58.64 -243.02 1 1.03 0.41 ...
##   ..$ rings   : logi [1:8] FALSE FALSE FALSE FALSE TRUE TRUE ...

Lists can also have names. Continuing the previous example:

my_list <- list(vec=my_fruit, mat=star_wars_all, df=planets_df)

# Print out my_list
my_list
## $vec
## [1] 11
## 
## $mat
##                              US non-US
## A New Hope              460.998  314.4
## The Empire Strikes Back 290.475  247.9
## Return of the Jedi      309.306  165.8
## The Phantom Menace      474.500  552.5
## Attack of the Clones    310.700  338.7
## Revenge of the Sith     380.300  468.5
## 
## $df
##      name               type diameter rotation rings
## 1 Mercury Terrestrial planet    0.382    58.64 FALSE
## 2   Venus Terrestrial planet    0.949  -243.02 FALSE
## 3   Earth Terrestrial planet    1.000     1.00 FALSE
## 4    Mars Terrestrial planet    0.532     1.03 FALSE
## 5 Jupiter          Gas giant   11.209     0.41  TRUE
## 6  Saturn          Gas giant    9.449     0.43  TRUE
## 7  Uranus          Gas giant    4.007    -0.72  TRUE
## 8 Neptune          Gas giant    3.883     0.67  TRUE
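
Names can also be inspected, or added after the list has been created, with the names() function. A minimal sketch with hypothetical lists (demo_list and unnamed are illustration names only):

```r
# A named list, built the same way as above
demo_list <- list(vec = 1:3, mat = matrix(1:4, nrow = 2))
names(demo_list)
## [1] "vec" "mat"

# An unnamed list has no names yet
unnamed <- list(1:3, "a")
names(unnamed)
## NULL

# Assign names afterwards, one per element
names(unnamed) <- c("numbers", "letter")
names(unnamed)
## [1] "numbers" "letter"
```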

1.6.2 Selecting elements of a list

1.6.2.1 Double brackets

To extract a single element from a list, use double square brackets: [[]]

This is different from vectors and matrices, which use single square brackets: []. (Single brackets on a list return a smaller list rather than the element itself.)

Once you’ve selected a single element, you can use it just like the underlying object (e.g., once you select a matrix, you can index it just like any other matrix).

# select the matrix that is spot 2 in the list
my_list[[2]]
##                              US non-US
## A New Hope              460.998  314.4
## The Empire Strikes Back 290.475  247.9
## Return of the Jedi      309.306  165.8
## The Phantom Menace      474.500  552.5
## Attack of the Clones    310.700  338.7
## Revenge of the Sith     380.300  468.5
# select the element in the 3rd row and 2nd column
my_list[[2]][3,2]
## [1] 165.8
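
Single brackets also work on a list, but they return a shorter list rather than the element itself, which is why the matrix indexing above needs double brackets first. A minimal sketch with a hypothetical list (demo_list is an illustration name only):

```r
# Hypothetical list holding a vector and a 2 x 2 matrix
demo_list <- list(vec = 1:3, mat = matrix(1:4, nrow = 2))

# Single brackets: still a list (of length 1)
is.list(demo_list[2])
## [1] TRUE

# Double brackets: the matrix itself
is.matrix(demo_list[[2]])
## [1] TRUE

# So matrix indexing works after [[ ]]
# (matrix() fills column by column, so row 1, column 2 holds 3)
demo_list[[2]][1, 2]
## [1] 3
```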

1.6.2.2 Dollar sign notation

If the list elements are named, you can also select them with $ notation:

# Select the data frame that we named "df"
my_list$df
##      name               type diameter rotation rings
## 1 Mercury Terrestrial planet    0.382    58.64 FALSE
## 2   Venus Terrestrial planet    0.949  -243.02 FALSE
## 3   Earth Terrestrial planet    1.000     1.00 FALSE
## 4    Mars Terrestrial planet    0.532     1.03 FALSE
## 5 Jupiter          Gas giant   11.209     0.41  TRUE
## 6  Saturn          Gas giant    9.449     0.43  TRUE
## 7  Uranus          Gas giant    4.007    -0.72  TRUE
## 8 Neptune          Gas giant    3.883     0.67  TRUE
# Select the 1st column of the data frame
my_list$df[,1]
## [1] "Mercury" "Venus"   "Earth"   "Mars"    "Jupiter" "Saturn"  "Uranus" 
## [8] "Neptune"
# Do the same thing using $ notation
my_list$df$name
## [1] "Mercury" "Venus"   "Earth"   "Mars"    "Jupiter" "Saturn"  "Uranus" 
## [8] "Neptune"
# Select specific parts of the data frame
my_list$df[c(2,3,5),c("name","type")]
##      name               type
## 2   Venus Terrestrial planet
## 3   Earth Terrestrial planet
## 5 Jupiter          Gas giant
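
Named elements can also be selected with double brackets and the name in quotation marks, which gives the same result as $ notation. This form is the one to use when the name is stored in a variable. A minimal sketch with a hypothetical list (demo_list and element are illustration names only):

```r
# Hypothetical named list
demo_list <- list(vec = 1:3, df = data.frame(x = 1:2, y = c("a", "b")))

# $ and [["name"]] select the same element
identical(demo_list$df, demo_list[["df"]])
## [1] TRUE

# [[ ]] accepts a name stored in a variable
element <- "vec"
demo_list[[element]]
## [1] 1 2 3

# $ looks for an element literally called "element", which does not exist
demo_list$element
## NULL
```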