1 Introduction to R
https://learn.datacamp.com/courses/free-introduction-to-r
Main functions and concepts covered in this BP chapter:
- Arithmetic
- Variables and assignment (<-)
- Data Types
- Vectors, matrices, data frames, lists
  - c()
  - names()
  - matrix()
  - rbind(), cbind()
  - rowSums(), colSums()
- Factors
  - levels()
  - head(), tail()
- Data frame $ notation
  - subset()
  - order()
  - list()
Packages used in this chapter:
Datasets used in this chapter:
## Load datasets used in this chapter
# The only dataset used is mtcars, which is built into R so we don't have to load it
1.1 Intro to basics
The # symbol adds comments to R code inside a code chunk. Outside a code chunk, # indicates a heading, so don’t use # outside code chunks except to create headings. More # symbols mean a smaller heading: 1 indicates BP chapters (e.g., “Introduction to R” above), 2 indicates the DataCamp course chapters (e.g., “Intro to basics” above), and 3 or more indicate subsections within each DataCamp chapter (e.g., “Arithmetic operators” below).
1.1.1 Arithmetic operators
In its most basic form, R can be used as a simple calculator:
Math | R Code | Output |
---|---|---|
Addition | 6 + 5 | 11 |
Subtraction | 6 - 5 | 1 |
Multiplication | 6 * 5 | 30 |
Division | 6/5 | 1.2 |
Exponentiation | 6^2 | 36 |
Modulo (remainder) | 6%%5 | 1 |
1.1.2 Variables
A variable allows you to store a value (e.g. 4) or an object (e.g. a function description) in R for later access. You assign to a variable with the <- operator. You can then use the variable anywhere you would use the value itself, such as in arithmetic, and you can change its value later.
# Assign a value to the variables my_apples and my_oranges
my_apples <- 5
my_oranges <- 6
# Add these two variables together
my_apples + my_oranges
## [1] 11
# Create the variable my_fruit and then display its value
my_fruit <- my_apples + my_oranges
my_fruit
## [1] 11
# Change value of my_apples to 7
my_apples <- 7
# Note how the sum changes, but the value of my_fruit hasn't changed
my_apples + my_oranges
## [1] 13
my_fruit
## [1] 11
You can also display a variable created in a previous code chunk:
my_apples
## [1] 7
If you try to use the name of a variable that doesn’t exist, you’ll get an error. Usually that will stop the output file from being created (“knit” for a single RMD file, or “built” when building a bookdown book). However, for this BP we’re telling R to keep going when there’s an error and instead display the error message with yellow font and crimson background. Unless an error is included to demonstrate code that produces an error, all errors should be fixed.
ERROR: in eval(expr, envir, enclos): object ‘my_pears’ not found
1.1.3 Data Types
Basic data types in R include:
- Decimal values like 4.5 are called numerics.
- Whole numbers like 4 are called integers. Integers are also numerics. (Strictly, R stores a bare 4 as a numeric; write 4L to get an integer.)
- Boolean values (TRUE or FALSE) are called logical.
- Text (or string) values are called characters. These are put in quotation marks.
You can check the type of a variable using the class() function.
# Declare variables of different types
my_numeric <- 42
my_character <- "universe"
my_logical <- FALSE
# Check class of my_numeric
class(my_numeric)
## [1] "numeric"
# Check class of my_character
class(my_character)
## [1] "character"
# Check class of my_logical
class(my_logical)
## [1] "logical"
Combining variables with different types usually results in an error. For example, my_numeric + my_character fails:
ERROR: in my_numeric + my_character: non-numeric argument to binary operator
1.2 Vectors
1.2.1 Create a vector
A vector stores elements of the same data type (numeric, character, logical). Use the combine function c() to create a vector: c(element1, element2, element3, ...).
1.2.2 Name elements of a vector
The names() function lets you assign names to the elements of a vector. If we want to assign the same names to multiple vectors, it’s best to store the names in a variable once and reuse that variable each time.
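Neither subsection above shows its code. As a minimal sketch, the poker and roulette vectors printed later in this chapter could be created and named like this. The values are taken from that later output; the variable name days_vector is an assumption.

```r
# Winnings per day (values copied from the output shown later in this section)
poker_vector <- c(140, -50, 20, -120, 240)
roulette_vector <- c(-24, -50, 100, -350, 10)

# Type the day names once (days_vector is a hypothetical name) and reuse them
days_vector <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")
names(poker_vector) <- days_vector
names(roulette_vector) <- days_vector

poker_vector
roulette_vector
```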
1.2.3 Selecting elements of a vector
Select elements of a vector with square brackets, either by position or by name:
## Tuesday
## -50
## Tuesday
## -50
Select multiple elements using c(). For example, select Monday, Wednesday, Friday:
## Monday Wednesday Friday
## 140 20 240
## Monday Wednesday Friday
## 140 20 240
Select everything between a start and an end point using a colon:
## Wednesday Thursday Friday
## 20 -120 240
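The selection code for this subsection is not shown. The outputs above are consistent with calls like the following, which is a sketch rather than the original chunk (the vector is rebuilt here so the block runs on its own).

```r
poker_vector <- c(Monday = 140, Tuesday = -50, Wednesday = 20,
                  Thursday = -120, Friday = 240)

poker_vector[2]                                   # by position
poker_vector["Tuesday"]                           # by name
poker_vector[c(1, 3, 5)]                          # several positions
poker_vector[c("Monday", "Wednesday", "Friday")]  # several names
poker_vector[3:5]                                 # a range of positions
```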
1.2.4 Vector arithmetic
You can do arithmetic on numeric vectors of the same length.
# Assign to total_daily how much you won/lost on each day
total_daily <- poker_vector + roulette_vector
poker_vector
## Monday Tuesday Wednesday Thursday Friday
## 140 -50 20 -120 240
roulette_vector
## Monday Tuesday Wednesday Thursday Friday
## -24 -50 100 -350 10
total_daily
## Monday Tuesday Wednesday Thursday Friday
## 116 -100 120 -470 250
Use the sum() function to add up all elements (here, all days of the week), and the mean() function to find the average.
## [1] -84
## [1] -16.8
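The two numbers above match the sum and mean of total_daily. A self-contained sketch, with the vectors rebuilt from the printed values:

```r
poker_vector <- c(Monday = 140, Tuesday = -50, Wednesday = 20,
                  Thursday = -120, Friday = 240)
roulette_vector <- c(Monday = -24, Tuesday = -50, Wednesday = 100,
                     Thursday = -350, Friday = 10)

# Element-wise addition keeps the day names
total_daily <- poker_vector + roulette_vector

sum(total_daily)   # total over the week: -84
mean(total_daily)  # average per day: -16.8
```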
1.2.5 Logical Comparisons
See section 2.1 for additional discussion.
- < less than
- > greater than
- <= less than or equal to
- >= greater than or equal to
- == equal
- != not equal
These work on individual values and on vectors
## Monday Tuesday Wednesday Thursday Friday
## 140 -50 20 -120 240
## Monday Tuesday Wednesday Thursday Friday
## -24 -50 100 -350 10
## Monday Tuesday Wednesday Thursday Friday
## 116 -100 120 -470 250
## Monday Tuesday Wednesday Thursday Friday
## TRUE TRUE FALSE TRUE TRUE
## Monday Tuesday Wednesday Thursday Friday
## TRUE FALSE TRUE FALSE TRUE
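The comparison code is not shown. The two logical outputs above are consistent with the comparisons sketched here; this is a guess at the original chunk. Note that poker_vector > 0 would print the same result as total_daily > 0 for these values.

```r
poker_vector <- c(Monday = 140, Tuesday = -50, Wednesday = 20,
                  Thursday = -120, Friday = 240)
roulette_vector <- c(Monday = -24, Tuesday = -50, Wednesday = 100,
                     Thursday = -350, Friday = 10)
total_daily <- poker_vector + roulette_vector

6 > 5                            # a single comparison gives one logical value
poker_vector >= roulette_vector  # vectors are compared element by element
total_daily > 0                  # each day is compared with the single value 0
```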
1.2.6 Selection by comparison
You can combine logical comparisons with selection to select elements that meet specific requirements. To do this, you use a selection vector.
The selection vector has the value TRUE when that day’s value is > 0 and FALSE otherwise:
## Monday Tuesday Wednesday Thursday Friday
## TRUE FALSE TRUE FALSE TRUE
Then use the selection vector in the [] to select the elements that say TRUE and not the ones that say FALSE
# Select from poker_vector these days
poker_winning_days <- poker_vector[selection_vector]
poker_winning_days
## Monday Wednesday Friday
## 140 20 240
You can also do it directly without creating the separate selection vector
## Monday Wednesday Friday
## 140 20 240
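The code that creates selection_vector and the one-step version are not shown above. A minimal sketch of both forms:

```r
poker_vector <- c(Monday = 140, Tuesday = -50, Wednesday = 20,
                  Thursday = -120, Friday = 240)

# Two steps: build the logical selection vector, then use it inside []
selection_vector <- poker_vector > 0
poker_winning_days <- poker_vector[selection_vector]

# One step: put the comparison directly inside []
poker_vector[poker_vector > 0]
```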
1.3 Matrices
A matrix is a collection of elements of the same data type (numeric, character, or logical) arranged into a fixed number of rows and columns. Matrices are created with matrix().
1.3.1 Construct a matrix from values
We can use the matrix() function. The first argument is the collection of elements that R will arrange into the rows and columns of the matrix. We can use any numbers we want, for example, c(7,1,1,4,6,2,1,7,7). Here, we’re going to use the numbers 1 to 9 in order. To do this, we could use c(1, 2, 3, 4, 5, 6, 7, 8, 9), but for numbers that go in order, a useful shortcut is 1:9.
The byrow=TRUE argument indicates that the matrix is filled by rows. If we want the matrix to be filled by columns, we just use byrow = FALSE.
The nrow=3 argument indicates that the matrix should have three rows.
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 4 5 6
## [3,] 7 8 9
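The output above corresponds to the first call in this sketch. The second call is added only to show what byrow = FALSE changes; the names by_row and by_col are assumptions.

```r
# Filled row by row: 1 2 3 on the first row
by_row <- matrix(1:9, byrow = TRUE, nrow = 3)
by_row

# Filled column by column: 1 2 3 down the first column
by_col <- matrix(1:9, byrow = FALSE, nrow = 3)
by_col
```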
1.3.2 Construct a matrix from vectors
Suppose we have 3 vectors (which we’ll create here, but we might have them from somewhere else)
# Box office Star Wars (in millions!)
new_hope <- c(460.998, 314.4)
empire_strikes <- c(290.475, 247.900)
return_jedi <- c(309.306, 165.8)
Combine the three vectors into one long vector
## [1] 460.998 314.400 290.475 247.900 309.306 165.800
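The line that builds box_office is not shown, although the next chunk uses it. A sketch that reproduces the output above:

```r
# Box office Star Wars (in millions), as defined earlier
new_hope <- c(460.998, 314.4)
empire_strikes <- c(290.475, 247.900)
return_jedi <- c(309.306, 165.8)

# c() also combines whole vectors into one longer vector
box_office <- c(new_hope, empire_strikes, return_jedi)
box_office
```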
Turn the long vector into a matrix like we did with the values 1:9
# Construct star_wars_matrix
star_wars_matrix <- matrix(box_office,byrow=TRUE,nrow=3)
star_wars_matrix
## [,1] [,2]
## [1,] 460.998 314.4
## [2,] 290.475 247.9
## [3,] 309.306 165.8
1.3.3 Naming a matrix
# Name the columns with region
colnames(star_wars_matrix) <- c("US", "non-US")
# Name the rows with titles
rownames(star_wars_matrix) <- c("A New Hope", "The Empire Strikes Back", "Return of the Jedi")
star_wars_matrix
## US non-US
## A New Hope 460.998 314.4
## The Empire Strikes Back 290.475 247.9
## Return of the Jedi 309.306 165.8
We could also do this all in one step with the help of the dimnames argument:
star_wars_matrix <- matrix(box_office, nrow = 3, byrow = TRUE,
dimnames = list(c("A New Hope", "The Empire Strikes Back", "Return of the Jedi"),
c("US", "non-US")))
star_wars_matrix
## US non-US
## A New Hope 460.998 314.4
## The Empire Strikes Back 290.475 247.9
## Return of the Jedi 309.306 165.8
1.3.4 Matrix arithmetic
1.3.4.1 Operations by row and column
rowSums() adds up all columns for each row of a matrix. colSums() adds up all rows for each column of a matrix. These are the matrix equivalents of how sum() added up all elements of a vector.
## A New Hope The Empire Strikes Back Return of the Jedi
## 775.398 538.375 475.106
## US non-US
## 1060.779 728.100
There are also rowMeans() and colMeans(), the matrix equivalents (by row and column) of mean() for a vector.
## A New Hope The Empire Strikes Back Return of the Jedi
## 387.6990 269.1875 237.5530
## US non-US
## 353.593 242.700
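The four outputs in this subsection can be reproduced with the sketch below. The matrix is rebuilt here so the block runs on its own.

```r
star_wars_matrix <- matrix(
  c(460.998, 314.4, 290.475, 247.9, 309.306, 165.8),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("A New Hope", "The Empire Strikes Back", "Return of the Jedi"),
                  c("US", "non-US")))

rowSums(star_wars_matrix)   # one total per movie
colSums(star_wars_matrix)   # one total per region
rowMeans(star_wars_matrix)  # one average per movie
colMeans(star_wars_matrix)  # one average per region
```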
1.3.4.2 Element-wise operations
You can also do arithmetic on each element individually. For example, suppose we want to convert sales from millions to thousands. To do this, we need to multiply each value by 1000:
## US non-US
## A New Hope 460.998 314.4
## The Empire Strikes Back 290.475 247.9
## Return of the Jedi 309.306 165.8
## US non-US
## A New Hope 460998 314400
## The Empire Strikes Back 290475 247900
## Return of the Jedi 309306 165800
You can also do arithmetic element-by-element using entire matrices as long as the dimensions are the same. Suppose we have a matrix of average ticket prices (created manually using DC output)
ticket_prices <- matrix(c(5, 5, 6, 6, 7, 7), nrow = 3, byrow = TRUE,
dimnames = list(c("A New Hope", "The Empire Strikes Back", "Return of the Jedi"),
c("US", "non-US")))
Then calculate the number of people who purchased tickets (in millions) by dividing the sales (in millions) by the ticket prices
## US non-US
## A New Hope 460.998 314.4
## The Empire Strikes Back 290.475 247.9
## Return of the Jedi 309.306 165.8
## US non-US
## A New Hope 5 5
## The Empire Strikes Back 6 6
## Return of the Jedi 7 7
## US non-US
## A New Hope 92.19960 62.88000
## The Empire Strikes Back 48.41250 41.31667
## Return of the Jedi 44.18657 23.68571
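Both element-wise calculations in this subsection are shown only as output. A sketch that reproduces them; the result names sales_thousands and visitors are assumptions.

```r
movie_names <- c("A New Hope", "The Empire Strikes Back", "Return of the Jedi")
regions <- c("US", "non-US")

star_wars_matrix <- matrix(c(460.998, 314.4, 290.475, 247.9, 309.306, 165.8),
                           nrow = 3, byrow = TRUE,
                           dimnames = list(movie_names, regions))
ticket_prices <- matrix(c(5, 5, 6, 6, 7, 7), nrow = 3, byrow = TRUE,
                        dimnames = list(movie_names, regions))

# A single number is applied to every element
sales_thousands <- star_wars_matrix * 1000
sales_thousands

# Two matrices with the same dimensions are combined element by element
visitors <- star_wars_matrix / ticket_prices
visitors
```

Note that * and / here work element by element; this is not matrix multiplication in the linear algebra sense, which R writes as %*%.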
(Note about approaching the BP…for the most part I went in the same order as DC, but this last example was actually given as the last exercise of the Matrix part. I thought it fit better here, so I created my own example very similar to theirs and put it here.)
1.3.5 Adding columns and rows to a matrix
1.3.5.1 Add column
For each movie, calculate how much it made across all regions. Then use cbind() to bind this column to the matrix (i.e., add a “totals” column to the right of the matrix). DataCamp named it worldwide_vector, but I’m going to name it Total so that the matrix column is named Total.
# For each movie, calculate how much it made across all regions
Total <- rowSums(star_wars_matrix)
# Bind the new variable Total as a column to star_wars_matrix
cbind(star_wars_matrix,Total)
## US non-US Total
## A New Hope 460.998 314.4 775.398
## The Empire Strikes Back 290.475 247.9 538.375
## Return of the Jedi 309.306 165.8 475.106
1.3.5.2 Add rows (and stacking matrices)
Stack two matrices on top of each other (note that to create the star_wars_matrix2 matrix I copy/pasted values from the output, thereby also testing what I’ve learned about creating a matrix).
star_wars_matrix2 <- matrix(c(474.5, 552.5, 310.7, 338.7, 380.3, 468.5), nrow = 3, byrow = TRUE,
dimnames = list(c("The Phantom Menace", "Attack of the Clones", "Revenge of the Sith"),
c("US", "non-US")))
star_wars_matrix
## US non-US
## A New Hope 460.998 314.4
## The Empire Strikes Back 290.475 247.9
## Return of the Jedi 309.306 165.8
## US non-US
## The Phantom Menace 474.5 552.5
## Attack of the Clones 310.7 338.7
## Revenge of the Sith 380.3 468.5
Combine the original matrix with the second one by stacking them using rbind()
## US non-US
## A New Hope 460.998 314.4
## The Empire Strikes Back 290.475 247.9
## Return of the Jedi 309.306 165.8
## The Phantom Menace 474.500 552.5
## Attack of the Clones 310.700 338.7
## Revenge of the Sith 380.300 468.5
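The rbind() call itself is not shown. A sketch that reproduces the stacked matrix above; the name star_wars_all is the one the Lists section later uses for this six-row matrix.

```r
regions <- c("US", "non-US")
star_wars_matrix <- matrix(
  c(460.998, 314.4, 290.475, 247.9, 309.306, 165.8), nrow = 3, byrow = TRUE,
  dimnames = list(c("A New Hope", "The Empire Strikes Back", "Return of the Jedi"),
                  regions))
star_wars_matrix2 <- matrix(
  c(474.5, 552.5, 310.7, 338.7, 380.3, 468.5), nrow = 3, byrow = TRUE,
  dimnames = list(c("The Phantom Menace", "Attack of the Clones", "Revenge of the Sith"),
                  regions))

# Stack the second matrix under the first; both must have the same columns
star_wars_all <- rbind(star_wars_matrix, star_wars_matrix2)
star_wars_all
```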
You can also add a single row. Let’s add the total for each region for all movies.
For each region, calculate how much all movies made. Store this in a vector named Total so the row will have the name Total. Then use rbind() to bind it as the bottom row of the matrix.
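The chunk for this step is missing. A sketch of what it plausibly looked like, producing the star_wars_all_WithTotals matrix used next (the six-movie matrix is rebuilt in one call to keep the block short):

```r
star_wars_all <- matrix(
  c(460.998, 314.4, 290.475, 247.9, 309.306, 165.8,
    474.5, 552.5, 310.7, 338.7, 380.3, 468.5),
  nrow = 6, byrow = TRUE,
  dimnames = list(c("A New Hope", "The Empire Strikes Back", "Return of the Jedi",
                    "The Phantom Menace", "Attack of the Clones", "Revenge of the Sith"),
                  c("US", "non-US")))

# For each region, calculate how much all movies made
Total <- colSums(star_wars_all)

# Bind Total as the bottom row; the row takes the name of the variable
star_wars_all_WithTotals <- rbind(star_wars_all, Total)
star_wars_all_WithTotals
```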
Note that you can add both rows and columns by just doing one and then the other (once you’ve added rows to a matrix, it’s just a matrix, so you can add columns just like any other matrix).
As an example, let’s start with the star_wars_all_WithTotals matrix and add a Total column:
# For each movie, calculate how much it made across all regions
Total <- rowSums(star_wars_all_WithTotals)
# Bind the new variable Total as a column to star_wars_all_WithTotals
star_wars_all_WithTotals <- cbind(star_wars_all_WithTotals,Total)
star_wars_all_WithTotals
## US non-US Total
## A New Hope 460.998 314.4 775.398
## The Empire Strikes Back 290.475 247.9 538.375
## Return of the Jedi 309.306 165.8 475.106
## The Phantom Menace 474.500 552.5 1027.000
## Attack of the Clones 310.700 338.7 649.400
## Revenge of the Sith 380.300 468.5 848.800
## Total 2226.279 2087.8 4314.079
1.3.6 Selection of matrix elements
1.3.6.1 Select entire columns
To select an entire column or columns, leave the row part blank and select the column(s) like you do from a vector. We can type a specific number, use a range (2:3), or use the c() function.
## US non-US Total
## A New Hope 460.998 314.4 775.398
## The Empire Strikes Back 290.475 247.9 538.375
## Return of the Jedi 309.306 165.8 475.106
## The Phantom Menace 474.500 552.5 1027.000
## Attack of the Clones 310.700 338.7 649.400
## Revenge of the Sith 380.300 468.5 848.800
## Total 2226.279 2087.8 4314.079
## A New Hope The Empire Strikes Back Return of the Jedi
## 314.4 247.9 165.8
## The Phantom Menace Attack of the Clones Revenge of the Sith
## 552.5 338.7 468.5
## Total
## 2087.8
## non-US Total
## A New Hope 314.4 775.398
## The Empire Strikes Back 247.9 538.375
## Return of the Jedi 165.8 475.106
## The Phantom Menace 552.5 1027.000
## Attack of the Clones 338.7 649.400
## Revenge of the Sith 468.5 848.800
## Total 2087.8 4314.079
## US Total
## A New Hope 460.998 775.398
## The Empire Strikes Back 290.475 538.375
## Return of the Jedi 309.306 475.106
## The Phantom Menace 474.500 1027.000
## Attack of the Clones 310.700 649.400
## Revenge of the Sith 380.300 848.800
## Total 2226.279 4314.079
1.3.6.2 Select entire rows
To select an entire row or rows, leave the column part blank and select the row(s) like you select from a vector.
## US non-US Total
## A New Hope 460.998 314.4 775.398
## The Empire Strikes Back 290.475 247.9 538.375
## Return of the Jedi 309.306 165.8 475.106
## The Phantom Menace 474.500 552.5 1027.000
## Attack of the Clones 310.700 338.7 649.400
## Revenge of the Sith 380.300 468.5 848.800
## Total 2226.279 2087.8 4314.079
## US non-US Total
## 290.475 247.900 538.375
## US non-US Total
## The Empire Strikes Back 290.475 247.9 538.375
## Return of the Jedi 309.306 165.8 475.106
## US non-US Total
## A New Hope 460.998 314.4 775.398
## Return of the Jedi 309.306 165.8 475.106
1.3.6.3 Selecting specific rows and columns (and specific elements)
To select specific row(s) and column(s), just select from both instead of leaving one blank
## US non-US Total
## A New Hope 460.998 314.4 775.398
## The Empire Strikes Back 290.475 247.9 538.375
## Return of the Jedi 309.306 165.8 475.106
## The Phantom Menace 474.500 552.5 1027.000
## Attack of the Clones 310.700 338.7 649.400
## Revenge of the Sith 380.300 468.5 848.800
## Total 2226.279 2087.8 4314.079
## [1] 290.475
## The Phantom Menace Attack of the Clones Revenge of the Sith
## 474.5 310.7 380.3
## US non-US
## A New Hope 460.998 314.4
## Return of the Jedi 309.306 165.8
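None of the selection code in section 1.3.6 is shown. The outputs for columns, rows, and specific elements are consistent with the calls sketched below; the matrix is rebuilt compactly so the block is self-contained.

```r
star_wars_all <- matrix(
  c(460.998, 314.4, 290.475, 247.9, 309.306, 165.8,
    474.5, 552.5, 310.7, 338.7, 380.3, 468.5),
  nrow = 6, byrow = TRUE,
  dimnames = list(c("A New Hope", "The Empire Strikes Back", "Return of the Jedi",
                    "The Phantom Menace", "Attack of the Clones", "Revenge of the Sith"),
                  c("US", "non-US")))
m <- rbind(star_wars_all, Total = colSums(star_wars_all))
star_wars_all_WithTotals <- cbind(m, Total = rowSums(m))

# Entire columns: leave the row part blank
star_wars_all_WithTotals[, 2]
star_wars_all_WithTotals[, 2:3]
star_wars_all_WithTotals[, c(1, 3)]

# Entire rows: leave the column part blank
star_wars_all_WithTotals[2, ]
star_wars_all_WithTotals[2:3, ]
star_wars_all_WithTotals[c(1, 3), ]

# Specific rows and columns: fill in both parts
star_wars_all_WithTotals[2, 1]
star_wars_all_WithTotals[4:6, 1]
star_wars_all_WithTotals[c(1, 3), 1:2]
```

Selecting a single row or column returns a named vector rather than a matrix, which is why those outputs print differently.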
1.4 Factors
A factor variable is a variable with categorical data with a finite number of categories (as opposed to a continuous variable which can have an infinite number of values).
1.4.1 Nominal vs ordinal factor variables
Factor variables come in two types, “nominal” and “ordinal”.
A nominal factor variable is a categorical variable without an implied order. For example, a list of animals:
animals_vector <- c("Elephant", "Giraffe", "Donkey", "Horse")
factor_animals_vector <- factor(animals_vector)
factor_animals_vector
## [1] Elephant Giraffe Donkey Horse
## Levels: Donkey Elephant Giraffe Horse
Note that it does list the levels in alphabetical order, but just because Donkey comes before Elephant alphabetically doesn’t make it better or worse, higher or lower.
In contrast, an ordinal factor variable does have a natural ordering. For example, weather reports about the temperature being “Low”, “Medium”, or “High”:
temperature_vector <- c("High", "Low", "High","Low", "Medium")
factor_temperature_vector <- factor(temperature_vector, order = TRUE, levels = c("Low", "Medium", "High"))
factor_temperature_vector
## [1] High Low High Low Medium
## Levels: Low < Medium < High
For an ordinal factor variable, the order comes from external knowledge of what the levels mean rather than from the alphabet.
1.4.2 Comparing factor values vs levels
1.4.2.1 Nominal factor variable comparison
For a nominal factor variable, you can’t compare levels (so you can’t compare elements of the vector). In fact, it displays a warning that “> not meaningful for factors.”
## [1] Elephant
## Levels: Donkey Elephant Giraffe Horse
## [1] Giraffe
## Levels: Donkey Elephant Giraffe Horse
## Warning in Ops.factor(factor_animals_vector[1], factor_animals_vector[2]): '>'
## not meaningful for factors
## [1] NA
If you compare the underlying text values instead, R will let you. However, the ordering is just alphabetical and isn’t meaningful here.
## [1] "Elephant"
## [1] "Giraffe"
## [1] FALSE
1.4.2.2 Ordinal factor variable comparison
For an ordinal factor variable, you can compare the levels (and can thus compare elements of the vector). It has an order, so you can compare them just like you can compare numbers.
## [1] High
## Levels: Low < Medium < High
## [1] Low
## Levels: Low < Medium < High
## [1] TRUE
You can also compare the values as text, but it does so alphabetically, with “High” being smaller than “Low” because “H” comes before “L” alphabetically. In other words, the ordering will be wrong (unless the alphabetical ordering happens to match exactly with the correct ordering).
## [1] "High"
## [1] "Low"
## [1] FALSE
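The comparison chunks in section 1.4.2 show only their output. A sketch of comparisons that give the same results for the nominal and ordinal cases:

```r
animals_vector <- c("Elephant", "Giraffe", "Donkey", "Horse")
factor_animals_vector <- factor(animals_vector)

temperature_vector <- c("High", "Low", "High", "Low", "Medium")
factor_temperature_vector <- factor(temperature_vector, ordered = TRUE,
                                    levels = c("Low", "Medium", "High"))

# Nominal factor: R warns that '>' is not meaningful and returns NA
factor_animals_vector[1] > factor_animals_vector[2]
# Plain text: compared alphabetically, so this is FALSE
animals_vector[1] > animals_vector[2]

# Ordinal factor: compared by level order (Low < Medium < High), so this is TRUE
factor_temperature_vector[1] > factor_temperature_vector[2]
# Plain text: "High" sorts before "Low" alphabetically, so this is FALSE
temperature_vector[1] > temperature_vector[2]
```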
1.4.3 Setting factor levels
Sometimes you will want to change the names of the levels of a factor variable for clarity or other reasons. R allows you to do this with the function levels(): levels(factor_vector) <- c("name1", "name2",...).
The order in which you list the levels is very important. By default, R orders the levels alphabetically. You should look at the order in which they are displayed and re-assign the levels in that same order.
Here’s an example of survey data that recorded genders reported by respondents as “M” and “F”. But we might want these to be written out as “Male” and “Female”. Here is how you re-code the levels:
# Code to build factor_survey_vector
survey_vector <- c("M", "F", "F", "M", "M")
factor_survey_vector <- factor(survey_vector)
factor_survey_vector
## [1] M F F M M
## Levels: F M
# Specify the levels of factor_survey_vector
levels(factor_survey_vector) <- c("Female","Male")
factor_survey_vector
## [1] Male Female Female Male Male
## Levels: Female Male
1.4.4 Summarizing a factor variable
Use the summary() function to get a summary of how many of each level you have. Here is an example with a nominal factor:
## [1] Male Female Female Male Male
## Levels: Female Male
## Female Male
## 2 3
This also works for ordinal factors
## [1] High Low High Low Medium
## Levels: Low < Medium < High
## Low Medium High
## 2 1 2
Note that summary() doesn’t give level counts if the vector isn’t a factor:
## [1] "M" "F" "F" "M" "M"
## Length Class Mode
## 5 character character
## [1] "High" "Low" "High" "Low" "Medium"
## Length Class Mode
## 5 character character
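The summary() calls are not shown. A sketch of the factor and non-factor cases using the survey data from the previous subsection:

```r
survey_vector <- c("M", "F", "F", "M", "M")
factor_survey_vector <- factor(survey_vector)
levels(factor_survey_vector) <- c("Female", "Male")

# On a factor, summary() counts the observations in each level
summary(factor_survey_vector)

# On a plain character vector, it only reports length, class, and mode
summary(survey_vector)
```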
1.5 Data frames
In a matrix all columns have to be the same type. Often datasets include variables with different types. Data frames are sort of like matrices, except the different columns can be of different data types. A data frame has the variables of a dataset as columns and the observations as rows.
An example is the mtcars data frame built into R:
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
## Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
## Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
## Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
## Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
## Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
## Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
## Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
## AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
## Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
## Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
(Note that later in the BP you do NOT want to display an entire dataset in your book. It will greatly slow down how fast it builds and displays. Some datasets will have millions of rows, and you won’t be able to do anything in R for hours or build your book at all. Instead, save anything that creates hundreds of rows into a variable and only display its head() or str().)
1.5.1 Displaying part of a dataset
Many datasets have hundreds (or thousands or millions) of rows. We clearly don’t want to display them like we just displayed all 32 rows of mtcars.
Display the first few rows using head() or the last few rows using tail():
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
## mpg cyl disp hp drat wt qsec vs am gear carb
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.7 0 1 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.9 1 1 5 2
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.5 0 1 5 4
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.5 0 1 5 6
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.6 0 1 5 8
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.6 1 1 4 2
Display the first few values, along with info on variable types and the total number of observations, using str() (str for structure):
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
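The three outputs in this subsection come from calls like these. mtcars ships with R, so nothing needs to be loaded.

```r
head(mtcars)  # first 6 rows by default
tail(mtcars)  # last 6 rows by default
str(mtcars)   # structure: dimensions, column types, first few values

# Both head() and tail() take n if you want a different number of rows
head(mtcars, n = 3)
```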
1.5.2 Constructing a data frame
# Definition of vectors
name <- c("Mercury", "Venus", "Earth", "Mars", "Jupiter", "Saturn", "Uranus", "Neptune")
type <- c("Terrestrial planet", "Terrestrial planet", "Terrestrial planet",
"Terrestrial planet", "Gas giant", "Gas giant", "Gas giant", "Gas giant")
diameter <- c(0.382, 0.949, 1, 0.532, 11.209, 9.449, 4.007, 3.883)
rotation <- c(58.64, -243.02, 1, 1.03, 0.41, 0.43, -0.72, 0.67)
rings <- c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE)
# Create a data frame from the vectors
planets_df <- data.frame(name,type,diameter,rotation,rings)
str(planets_df)
## 'data.frame': 8 obs. of 5 variables:
## $ name : chr "Mercury" "Venus" "Earth" "Mars" ...
## $ type : chr "Terrestrial planet" "Terrestrial planet" "Terrestrial planet" "Terrestrial planet" ...
## $ diameter: num 0.382 0.949 1 0.532 11.209 ...
## $ rotation: num 58.64 -243.02 1 1.03 0.41 ...
## $ rings : logi FALSE FALSE FALSE FALSE TRUE TRUE ...
1.5.3 Selecting rows, columns, and values from data frame
Select from a data frame just like you do from a matrix (as explained in more detail above in the Matrix section). Recall that you can select a range using the colon (e.g., 2:4 for 2, 3 and 4), and a list using c() (e.g., c(1,3,5)).
## name type diameter rotation rings
## 1 Mercury Terrestrial planet 0.382 58.64 FALSE
## 2 Venus Terrestrial planet 0.949 -243.02 FALSE
## 3 Earth Terrestrial planet 1.000 1.00 FALSE
## 4 Mars Terrestrial planet 0.532 1.03 FALSE
## 5 Jupiter Gas giant 11.209 0.41 TRUE
## 6 Saturn Gas giant 9.449 0.43 TRUE
## 7 Uranus Gas giant 4.007 -0.72 TRUE
## 8 Neptune Gas giant 3.883 0.67 TRUE
## name type diameter rotation rings
## 1 Mercury Terrestrial planet 0.382 58.64 FALSE
## name type diameter rotation rings
## 1 Mercury Terrestrial planet 0.382 58.64 FALSE
## 2 Venus Terrestrial planet 0.949 -243.02 FALSE
## [1] "Terrestrial planet" "Terrestrial planet" "Terrestrial planet"
## [4] "Terrestrial planet" "Gas giant" "Gas giant"
## [7] "Gas giant" "Gas giant"
## name type
## 1 Mercury Terrestrial planet
## 2 Venus Terrestrial planet
## 3 Earth Terrestrial planet
## 4 Mars Terrestrial planet
## 5 Jupiter Gas giant
## 6 Saturn Gas giant
## 7 Uranus Gas giant
## 8 Neptune Gas giant
## name type diameter
## 2 Venus Terrestrial planet 0.949
## 4 Mars Terrestrial planet 0.532
## 5 Jupiter Gas giant 11.209
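The selection calls are missing above. The outputs are consistent with the following sketch. The data frame is rebuilt compactly with rep(), and stringsAsFactors = FALSE is passed explicitly (an addition, so the text columns stay character in older R versions too).

```r
planets_df <- data.frame(
  name = c("Mercury", "Venus", "Earth", "Mars", "Jupiter", "Saturn", "Uranus", "Neptune"),
  type = rep(c("Terrestrial planet", "Gas giant"), each = 4),
  diameter = c(0.382, 0.949, 1, 0.532, 11.209, 9.449, 4.007, 3.883),
  rotation = c(58.64, -243.02, 1, 1.03, 0.41, 0.43, -0.72, 0.67),
  rings = rep(c(FALSE, TRUE), each = 4),
  stringsAsFactors = FALSE)

planets_df                   # the whole data frame
planets_df[1, ]              # first row, all columns
planets_df[1:2, ]            # first two rows
planets_df[, 2]              # second column, returned as a vector
planets_df[, 1:2]            # first two columns
planets_df[c(2, 4, 5), 1:3]  # rows 2, 4, 5 and columns 1 to 3
```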
1.5.3.1 Using column names
You can also select by name using [] and the name in quotation marks.
## name
## 1 Mercury
## 2 Venus
## 3 Earth
## 4 Mars
## 5 Jupiter
## 6 Saturn
## 7 Uranus
## 8 Neptune
## [1] "Venus" "Earth" "Jupiter"
## name type
## 2 Venus Terrestrial planet
## 3 Earth Terrestrial planet
## 5 Jupiter Gas giant
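The by-name selections are shown only as output. A sketch that matches them, rebuilding the data frame so it runs on its own (stringsAsFactors = FALSE is an explicit addition):

```r
planets_df <- data.frame(
  name = c("Mercury", "Venus", "Earth", "Mars", "Jupiter", "Saturn", "Uranus", "Neptune"),
  type = rep(c("Terrestrial planet", "Gas giant"), each = 4),
  diameter = c(0.382, 0.949, 1, 0.532, 11.209, 9.449, 4.007, 3.883),
  rotation = c(58.64, -243.02, 1, 1.03, 0.41, 0.43, -0.72, 0.67),
  rings = rep(c(FALSE, TRUE), each = 4),
  stringsAsFactors = FALSE)

planets_df["name"]                            # one column, still a data frame
planets_df[c(2, 3, 5), "name"]                # rows 2, 3, 5 of one column, as a vector
planets_df[c(2, 3, 5), c("name", "type")]     # rows 2, 3, 5 of two named columns
```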
1.5.3.2 Using dollar sign notation
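You can also use $ notation, which selects a single column by name and returns it as a vector. This method is particularly useful because you don’t need brackets or quotation marks.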
## [1] "Mercury" "Venus" "Earth" "Mars" "Jupiter" "Saturn" "Uranus"
## [8] "Neptune"
Once you select an entire column using $ notation, it is like a vector and you can select elements just like you do from a vector:
## [1] "Venus"
## [1] "Venus" "Mars" "Jupiter"
## [1] "Mercury" "Venus" "Earth"
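A sketch of $ selections that give the outputs above. Only the name and type columns are rebuilt here, as a trimmed stand-in for planets_df, since that is all these calls need.

```r
# Trimmed copy of planets_df with just the columns used below
planets_df <- data.frame(
  name = c("Mercury", "Venus", "Earth", "Mars", "Jupiter", "Saturn", "Uranus", "Neptune"),
  type = rep(c("Terrestrial planet", "Gas giant"), each = 4),
  stringsAsFactors = FALSE)

planets_df$name               # the whole column as a vector
planets_df$name[2]            # one element
planets_df$name[c(2, 4, 5)]   # several elements
planets_df$name[1:3]          # a range of elements
```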
1.5.3.3 Using selection vector
The rings variable is logical with TRUE and FALSE values. It can be used like a selection vector as we did above in the vector section.
## [1] FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE
## name type diameter rotation rings
## 5 Jupiter Gas giant 11.209 0.41 TRUE
## 6 Saturn Gas giant 9.449 0.43 TRUE
## 7 Uranus Gas giant 4.007 -0.72 TRUE
## 8 Neptune Gas giant 3.883 0.67 TRUE
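The two outputs above are consistent with this sketch: the logical column goes in the row position of [ , ] so that only rows marked TRUE are kept.

```r
planets_df <- data.frame(
  name = c("Mercury", "Venus", "Earth", "Mars", "Jupiter", "Saturn", "Uranus", "Neptune"),
  type = rep(c("Terrestrial planet", "Gas giant"), each = 4),
  diameter = c(0.382, 0.949, 1, 0.532, 11.209, 9.449, 4.007, 3.883),
  rotation = c(58.64, -243.02, 1, 1.03, 0.41, 0.43, -0.72, 0.67),
  rings = rep(c(FALSE, TRUE), each = 4),
  stringsAsFactors = FALSE)

planets_df$rings                 # the selection vector
planets_df[planets_df$rings, ]   # rows where rings is TRUE, all columns
```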
1.5.3.4 Subsetting
Use the subset() function to select observations that meet a specific criterion, for example, every planet with a diameter less than 1:
## name type diameter rotation rings
## 1 Mercury Terrestrial planet 0.382 58.64 FALSE
## 2 Venus Terrestrial planet 0.949 -243.02 FALSE
## 4 Mars Terrestrial planet 0.532 1.03 FALSE
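A sketch of the subset() call that matches the output above, next to the equivalent bracket form for comparison:

```r
planets_df <- data.frame(
  name = c("Mercury", "Venus", "Earth", "Mars", "Jupiter", "Saturn", "Uranus", "Neptune"),
  type = rep(c("Terrestrial planet", "Gas giant"), each = 4),
  diameter = c(0.382, 0.949, 1, 0.532, 11.209, 9.449, 4.007, 3.883),
  rotation = c(58.64, -243.02, 1, 1.03, 0.41, 0.43, -0.72, 0.67),
  rings = rep(c(FALSE, TRUE), each = 4),
  stringsAsFactors = FALSE)

# Column names can be used directly inside subset(); no planets_df$ needed
small_planets <- subset(planets_df, subset = diameter < 1)
small_planets

# The same rows using a selection vector
planets_df[planets_df$diameter < 1, ]
```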
(Note that we’ll learn and will usually use other methods like filter() from the dplyr package that is part of the tidyverse, but occasionally it’s useful to know this “base” approach too.)
1.5.4 Order and sorting
The order() function returns the positions of the elements in sorted order: the first value is the position of the smallest element, the second is the position of the next smallest, and so on.
## [1] 0.382 0.949 1.000 0.532 11.209 9.449 4.007 3.883
## [1] 1 4 2 3 8 7 6 5
You can sort a data frame by using the output of order() to select its rows:
## name type diameter rotation rings
## 1 Mercury Terrestrial planet 0.382 58.64 FALSE
## 2 Venus Terrestrial planet 0.949 -243.02 FALSE
## 3 Earth Terrestrial planet 1.000 1.00 FALSE
## 4 Mars Terrestrial planet 0.532 1.03 FALSE
## 5 Jupiter Gas giant 11.209 0.41 TRUE
## 6 Saturn Gas giant 9.449 0.43 TRUE
## 7 Uranus Gas giant 4.007 -0.72 TRUE
## 8 Neptune Gas giant 3.883 0.67 TRUE
## name type diameter rotation rings
## 1 Mercury Terrestrial planet 0.382 58.64 FALSE
## 4 Mars Terrestrial planet 0.532 1.03 FALSE
## 2 Venus Terrestrial planet 0.949 -243.02 FALSE
## 3 Earth Terrestrial planet 1.000 1.00 FALSE
## 8 Neptune Gas giant 3.883 0.67 TRUE
## 7 Uranus Gas giant 4.007 -0.72 TRUE
## 6 Saturn Gas giant 9.449 0.43 TRUE
## 5 Jupiter Gas giant 11.209 0.41 TRUE
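The ordering code is not shown. A sketch that reproduces the outputs in this subsection; the name positions is an assumption.

```r
planets_df <- data.frame(
  name = c("Mercury", "Venus", "Earth", "Mars", "Jupiter", "Saturn", "Uranus", "Neptune"),
  type = rep(c("Terrestrial planet", "Gas giant"), each = 4),
  diameter = c(0.382, 0.949, 1, 0.532, 11.209, 9.449, 4.007, 3.883),
  rotation = c(58.64, -243.02, 1, 1.03, 0.41, 0.43, -0.72, 0.67),
  rings = rep(c(FALSE, TRUE), each = 4),
  stringsAsFactors = FALSE)

planets_df$diameter

# Row positions from smallest to largest diameter: 1 4 2 3 8 7 6 5
positions <- order(planets_df$diameter)
positions

# Use those positions as the row selection to get the sorted data frame
planets_df[positions, ]

# decreasing = TRUE sorts from largest to smallest instead
planets_df[order(planets_df$diameter, decreasing = TRUE), ]
```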
(Note that we’ll learn and will usually use other methods like arrange() from the dplyr package that is part of the tidyverse, but occasionally it’s useful to know this “base” approach too.)
1.6 Lists
A list is a collection of R objects of any type. You can combine vectors, matrices, data frames, and other types of R objects in a single list.
1.6.1 Creating a list
Use the list() function to create a list:
## [1] 11
## US non-US
## A New Hope 460.998 314.4
## The Empire Strikes Back 290.475 247.9
## Return of the Jedi 309.306 165.8
## The Phantom Menace 474.500 552.5
## Attack of the Clones 310.700 338.7
## Revenge of the Sith 380.300 468.5
## name type diameter rotation rings
## 1 Mercury Terrestrial planet 0.382 58.64 FALSE
## 2 Venus Terrestrial planet 0.949 -243.02 FALSE
## 3 Earth Terrestrial planet 1.000 1.00 FALSE
## 4 Mars Terrestrial planet 0.532 1.03 FALSE
## 5 Jupiter Gas giant 11.209 0.41 TRUE
## 6 Saturn Gas giant 9.449 0.43 TRUE
## 7 Uranus Gas giant 4.007 -0.72 TRUE
## 8 Neptune Gas giant 3.883 0.67 TRUE
# Construct list with these different elements:
my_list <- list(my_fruit, star_wars_all, planets_df)
my_list
## [[1]]
## [1] 11
##
## [[2]]
## US non-US
## A New Hope 460.998 314.4
## The Empire Strikes Back 290.475 247.9
## Return of the Jedi 309.306 165.8
## The Phantom Menace 474.500 552.5
## Attack of the Clones 310.700 338.7
## Revenge of the Sith 380.300 468.5
##
## [[3]]
## name type diameter rotation rings
## 1 Mercury Terrestrial planet 0.382 58.64 FALSE
## 2 Venus Terrestrial planet 0.949 -243.02 FALSE
## 3 Earth Terrestrial planet 1.000 1.00 FALSE
## 4 Mars Terrestrial planet 0.532 1.03 FALSE
## 5 Jupiter Gas giant 11.209 0.41 TRUE
## 6 Saturn Gas giant 9.449 0.43 TRUE
## 7 Uranus Gas giant 4.007 -0.72 TRUE
## 8 Neptune Gas giant 3.883 0.67 TRUE
## List of 3
## $ : num 11
## $ : num [1:6, 1:2] 461 290 309 474 311 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:6] "A New Hope" "The Empire Strikes Back" "Return of the Jedi" "The Phantom Menace" ...
## .. ..$ : chr [1:2] "US" "non-US"
## $ :'data.frame': 8 obs. of 5 variables:
## ..$ name : chr [1:8] "Mercury" "Venus" "Earth" "Mars" ...
## ..$ type : chr [1:8] "Terrestrial planet" "Terrestrial planet" "Terrestrial planet" "Terrestrial planet" ...
## ..$ diameter: num [1:8] 0.382 0.949 1 0.532 11.209 ...
## ..$ rotation: num [1:8] 58.64 -243.02 1 1.03 0.41 ...
## ..$ rings : logi [1:8] FALSE FALSE FALSE FALSE TRUE TRUE ...
Lists can also have names. Continuing the previous example:
## $vec
## [1] 11
##
## $mat
## US non-US
## A New Hope 460.998 314.4
## The Empire Strikes Back 290.475 247.9
## Return of the Jedi 309.306 165.8
## The Phantom Menace 474.500 552.5
## Attack of the Clones 310.700 338.7
## Revenge of the Sith 380.300 468.5
##
## $df
## name type diameter rotation rings
## 1 Mercury Terrestrial planet 0.382 58.64 FALSE
## 2 Venus Terrestrial planet 0.949 -243.02 FALSE
## 3 Earth Terrestrial planet 1.000 1.00 FALSE
## 4 Mars Terrestrial planet 0.532 1.03 FALSE
## 5 Jupiter Gas giant 11.209 0.41 TRUE
## 6 Saturn Gas giant 9.449 0.43 TRUE
## 7 Uranus Gas giant 4.007 -0.72 TRUE
## 8 Neptune Gas giant 3.883 0.67 TRUE
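The code that names the list is not shown. The $vec, $mat, and $df labels in the output suggest something like the sketch below. Either form gives the same names; which one the original used is an assumption.

```r
my_fruit <- 11
star_wars_all <- matrix(
  c(460.998, 314.4, 290.475, 247.9, 309.306, 165.8,
    474.5, 552.5, 310.7, 338.7, 380.3, 468.5),
  nrow = 6, byrow = TRUE,
  dimnames = list(c("A New Hope", "The Empire Strikes Back", "Return of the Jedi",
                    "The Phantom Menace", "Attack of the Clones", "Revenge of the Sith"),
                  c("US", "non-US")))
planets_df <- data.frame(
  name = c("Mercury", "Venus", "Earth", "Mars", "Jupiter", "Saturn", "Uranus", "Neptune"),
  type = rep(c("Terrestrial planet", "Gas giant"), each = 4),
  diameter = c(0.382, 0.949, 1, 0.532, 11.209, 9.449, 4.007, 3.883),
  rotation = c(58.64, -243.02, 1, 1.03, 0.41, 0.43, -0.72, 0.67),
  rings = rep(c(FALSE, TRUE), each = 4),
  stringsAsFactors = FALSE)

# Option 1: name the elements when the list is created
my_list <- list(vec = my_fruit, mat = star_wars_all, df = planets_df)

# Option 2: name the elements of an existing list with names()
my_list2 <- list(my_fruit, star_wars_all, planets_df)
names(my_list2) <- c("vec", "mat", "df")

my_list
```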
1.6.2 Selecting elements of a list
1.6.2.1 Double brackets
To extract a single element from a list, use double square brackets: [[]].
This is different from matrices and vectors, which use single square brackets: [].
Once you’ve selected a single element, you can use it just like the underlying object (e.g., once you select a matrix, you can use it just like a matrix)
## US non-US
## A New Hope 460.998 314.4
## The Empire Strikes Back 290.475 247.9
## Return of the Jedi 309.306 165.8
## The Phantom Menace 474.500 552.5
## Attack of the Clones 310.700 338.7
## Revenge of the Sith 380.300 468.5
## [1] 165.8
1.6.2.2 Dollar sign notation
You can also use $ notation on a list
## name type diameter rotation rings
## 1 Mercury Terrestrial planet 0.382 58.64 FALSE
## 2 Venus Terrestrial planet 0.949 -243.02 FALSE
## 3 Earth Terrestrial planet 1.000 1.00 FALSE
## 4 Mars Terrestrial planet 0.532 1.03 FALSE
## 5 Jupiter Gas giant 11.209 0.41 TRUE
## 6 Saturn Gas giant 9.449 0.43 TRUE
## 7 Uranus Gas giant 4.007 -0.72 TRUE
## 8 Neptune Gas giant 3.883 0.67 TRUE
## name type diameter rotation rings
## 1 Mercury Terrestrial planet 0.382 58.64 FALSE
## [1] "Mercury" "Venus" "Earth" "Mars" "Jupiter" "Saturn" "Uranus"
## [8] "Neptune"
## name type
## 2 Venus Terrestrial planet
## 3 Earth Terrestrial planet
## 5 Jupiter Gas giant
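The selection code for both subsections of 1.6.2 is missing. The outputs are consistent with the calls sketched here, using the named list from the previous section (rebuilt so the block runs on its own).

```r
star_wars_all <- matrix(
  c(460.998, 314.4, 290.475, 247.9, 309.306, 165.8,
    474.5, 552.5, 310.7, 338.7, 380.3, 468.5),
  nrow = 6, byrow = TRUE,
  dimnames = list(c("A New Hope", "The Empire Strikes Back", "Return of the Jedi",
                    "The Phantom Menace", "Attack of the Clones", "Revenge of the Sith"),
                  c("US", "non-US")))
planets_df <- data.frame(
  name = c("Mercury", "Venus", "Earth", "Mars", "Jupiter", "Saturn", "Uranus", "Neptune"),
  type = rep(c("Terrestrial planet", "Gas giant"), each = 4),
  diameter = c(0.382, 0.949, 1, 0.532, 11.209, 9.449, 4.007, 3.883),
  rotation = c(58.64, -243.02, 1, 1.03, 0.41, 0.43, -0.72, 0.67),
  rings = rep(c(FALSE, TRUE), each = 4),
  stringsAsFactors = FALSE)
my_list <- list(vec = 11, mat = star_wars_all, df = planets_df)

# Double brackets: select by position, then use the result like a matrix
my_list[[2]]
my_list[[2]][3, 2]   # row 3, column 2 of the matrix: 165.8

# Dollar sign: select by name, then use the result like a data frame
my_list$df
my_list$df[1, ]
my_list$df$name
my_list$df[c(2, 3, 5), c("name", "type")]
```

For comparison, single brackets on a list (e.g., my_list[2]) return a shorter list rather than the element itself.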