8 Web Scraping in R

https://learn.datacamp.com/courses/web-scraping-in-r

Main functions and concepts covered in this BP chapter:

Read in HTML
Navigating HTML
- Select all children of a list
- Parse hyperlinks into a data frame
Scrape your first table
Turn a table into a data frame with html_table()
Introduction to CSS
- Select multiple HTML types
CSS classes and IDs
- Leverage the uniqueness of IDs
- Select the last child with a pseudo-class
CSS combinators
- Select direct descendants with the child combinator
- Simply the best!
- Not every sibling is the same
Introduction to XPATH
- Select by class and ID with XPATH
- Use predicates to select nodes based on their children
XPATH functions and advanced predicates
- Get to know the position() function
- Extract nodes based on the number of their children
The XPATH text() function
- The shortcomings of html_table() with badly structured tables
- Select directly from a parent element with XPATH’s text()
- Combine extracted data into a data frame
- Scrape an element based on its text
The nature of HTTP requests
- Do it the httr way
- Houston, we got a 404!
Telling who you are with custom user agents
- Check out your user agent
- Add a custom user agent
How to be gentle and slow down your requests
- Apply throttling to a multi-page crawler

Packages used in this chapter:

## Load all packages used in this chapter
library(tidyverse) #includes dplyr, ggplot2, and other common packages

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(rvest)

## 
## Attaching package: 'rvest'
## 
## The following object is masked from 'package:readr':
## 
##     guess_encoding

library(xml2)
library(httr)
library(purrr)

Datasets used in this chapter:

## Load datasets used in this chapter
## For this chapter, they're probably all coming from the web though

8.1 Introduction to HTML and Web Scraping

While there are some data sets that can be easily downloaded, many others can’t. These may be on a browser and spread across multiple pages, and not avaliable to download, but we can still use them through Web Scraping.

8.1.1 Read in HTML

Take the html_excerpt_raw variable and turn it into an HTML document that R understands using a function from the rvest package.

Use the xml_structure() function to get a better overview of the tag hierarchy of the HTML excerpt.

html_excerpt_raw <- '
<html> 
  <body> 
    <h1>Web scraping is cool</h1>
    <p>It involves writing code – be it R or Python.</p>
    <p><a href="https://datacamp.com">DataCamp</a> 
        has courses on it.</p>
  </body> 
</html>'
# Turn the raw excerpt into an HTML document R understands
html_excerpt <- read_html(html_excerpt_raw)
html_excerpt

## {html_document}
## <html>
## [1] <body> \n    <h1>Web scraping is cool</h1>\n    <p>It involves writing co ...

# Print the HTML excerpt with the xml_structure() function
xml_structure(html_excerpt)

## <html>
##   <body>
##     {text}
##     <h1>
##       {text}
##     {text}
##     <p>
##       {text}
##     {text}
##     <p>
##       <a [href]>
##         {text}
##       {text}
##     {text}

8.1.2 Navigating HTML

HTML is organized hierarchically in a tree structure. The tree data structure is like an actual tree turned upside down. There’s always a root node. The root has branches that lead to other nodes, or in the language of HTML, to children.

8.1.2.1 Select all children of a list

In this exercise, you’ll learn to apply the rvest function that allows you to directly select children of a certain node.

Turn the corresponding HTML string (list_raw_html) into an HTML document rvest can work with and call it list_html.

Extract the ol node from the list_html document, using the singular version of the rvest function that can be used to query nodes.

Lastly, extract all the children of the ol_node using yet another function from rvest, and print them directly to the console.

# Read in the corresponding HTML string
list_raw_html <- '

"\n<html>\n  <body>\n    <ol>\n      <li>Learn HTML</li>\n      <li>Learn CSS</li>\n      <li>Learn R</li>\n      <li>Scrape everything!*</li>\n    </ol>\n    <small>*Do it responsibly!</small>\n  </body>\n</html>”
'

list_html <- read_html(list_raw_html)
# Extract the ol node
ol_node <- list_html %>% 
    html_element('ol')
# Extract and print the nodeset of all the children of ol_node
ol_node %>%
    html_children ()

## {xml_nodeset (4)}
## [1] <li>Learn HTML</li>
## [2] <li>Learn CSS</li>
## [3] <li>Learn R</li>
## [4] <li>Scrape everything!*</li>

8.1.2.2 Parse hyperlinks into a data frame

In this exercise, you’ll parse these links into an R data frame by selecting only a elements that are within li elements.

Extract all the a nodes that are within the bulleted list, using html_elements().

Extract both the domain (href attribute) and the link name (text node) from links.

From them, construct a data frame with columns domain and name, respectively.

hyperlink_raw_html <- "\n<html>\n  <body>\n    <h3>Helpful links</h3>\n    <ul>\n      <li><a href=\"https://wikipedia.org\">Wikipedia</a></li>\n      <li><a href=\"https://dictionary.com\">Dictionary</a></li>\n      <li><a href=\"https://duckduckgo.com\">Search Engine</a></li>\n    </ul>\n    <small>\n      Compiled with help from <a href=\"https://google.com\">Google</a>.\n    </small>\n  </body>\n</html>"

# Extract all the a nodes from the bulleted list
links <- hyperlink_raw_html %>% 
  read_html() %>%
  html_elements('li a') # 'ul a' is also correct!

# Extract the needed values for the data frame
domain_value = links %>% html_attr('href')
name_value = links %>% html_text('href')

# Construct a data frame
link_df <- tibble(
  domain = domain_value,
  name = name_value
)

link_df

## # A tibble: 3 × 2
##   domain                 name         
##   <chr>                  <chr>        
## 1 https://wikipedia.org  Wikipedia    
## 2 https://dictionary.com Dictionary   
## 3 https://duckduckgo.com Search Engine

8.1.3 Scrape your first table

In its most basic form, a table may consist of only three different HTML tags: table, tr, and td. The table tag designates a table, as the name says. The tr tag designates rows and is wrapped around multiple td tags, which designate single cells. Normally, the number of td tags in each row should be identical. However, there is the colspan attribute for td, that allows a cell to span multiple columns.

Apart from using functions like html_element() and html_text(), rvest provides a helper function called html_table(). Note that the output of html_table() in this case is a list with one element, being the only table in the html document. If there were more than one table in the document, the list would have more entries. it can be explicitly told to regard the first row as the header row, even if it only consists of td tags.

8.1.4 Turn a table into a data frame with html_table()

If a table has a header row (with th elements) and no gaps, scraping it is straightforward, as with the table having ID "clean", opposed to having ID "dirty" For such cases, html_table() has an extra argument you can use to correctly parse the table, as shown in the video. Missing cells are automatically recognized and replaced with NA values.

Turn the table with ID "clean" into a data frame called mountains

mountains_raw_html <- "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">\n<html>\n<head><meta http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\"></head>\n<body> \n  <table id=\"clean\">\n<tr>\n<th>Mountain</th>\n      <th>Height [m]</th>\n      <th>First ascent</th>\n      <th>Country</th>\n    </tr>\n<tr>\n<td>Mount Everest</td>\n      <td>8848</td>\n      <td>1953</td>\n      <td>Nepal, China</td>\n    </tr>\n<tr>\n<td>K2</td>\n      <td>8611</td>\n      <td>1954</td>\n      <td>Pakistan, China</td>\n    </tr>\n<tr>\n<td>Kanchenjunga</td>\n      <td>8586</td>\n      <td>1955</td>\n      <td>Nepal, India</td>\n    </tr>\n</table>\n<table id=\"dirty\">\n<tr>\n<td>Mountain </td>\n      <td>Height [m]</td>\n      <td>First ascent</td>\n      <td>Country</td>\n    </tr>\n<tr>\n<td>Mount Everest</td>\n      <td>8848</td>\n      <td>1953</td>\n    </tr>\n<tr>\n<td>K2</td>\n      <td>8611</td>\n      <td>1954</td>\n      <td>Pakistan, China</td>\n    </tr>\n<tr>\n<td>Kanchenjunga</td>\n      <td>8586</td>\n      <td>1955</td>\n      <td>Nepal, India</td>\n    </tr>\n</table>\n</body>\n</html>\n"

# Extract the "clean" table into a data frame 
mountains <- read_html(mountains_raw_html) %>% 
  html_element("table#clean") %>% 
  html_table()

mountains

## # A tibble: 3 × 4
##   Mountain      `Height [m]` `First ascent` Country        
##   <chr>                <int>          <int> <chr>          
## 1 Mount Everest         8848           1953 Nepal, China   
## 2 K2                    8611           1954 Pakistan, China
## 3 Kanchenjunga          8586           1955 Nepal, India

Do the same with the "dirty" table, but designate the first line as header.

# Note, I don't have access to the mountains data, if I did, this would be the code to extract the dirty table into a data frame.
mountains <- read_html(mountains_raw_html) %>% 
  html_element("table#dirty") %>% 
  html_table(header=TRUE)

mountains

## # A tibble: 3 × 4
##   Mountain      `Height [m]` `First ascent` Country        
##   <chr>                <int>          <int> <chr>          
## 1 Mount Everest         8848           1953 <NA>           
## 2 K2                    8611           1954 Pakistan, China
## 3 Kanchenjunga          8586           1955 Nepal, India

8.2 Navigation and Selection with CSS

8.2.1 Intro to CSS

With CSS or Cascading Style Sheets, a website can be styled according to a certain specification. These are rules that specify how the HTML is rendered in the browser.

8.2.1.1 Select multiple HTML types

CSS can be used to style a web page. In the most basic form, this happens via type selectors, where styles are defined for and applied to all HTML elements of a certain type. In turn, you can also use type selectors to scrape pages for specific HTML elements. You can also combine multiple type selectors via a comma, i.e. with html_elements("type1, type2"). This selects all elements that have type1 or type2.

Read in languages_raw_html.

Using the method shown above, select all div and p elements in this HTML.

languages_raw_html <- 
"<html> 
  <body> 
    <div>Python is perfect for programming.</div>
    <p>Still, R might be better suited for data analysis.</p>
    <small>(And has prettier charts, too.)</small>
  </body> 
</html>"

# Read in the HTML
languages_html <- read_html(languages_raw_html)
# Select the div and p tags and print their text
languages_html %>%
    html_elements("div, p") %>%
    html_text()

## [1] "Python is perfect for programming."                
## [2] "Still, R might be better suited for data analysis."

8.2.2 CSS classes and IDs

Using type selectors might be enough for very simple web sites with only basic styling. However, modern web sites are much more complex and usually have hundreds, if not thousands of HTML elements. Therefore, classes and IDs were introduced to better identify certain parts of a page. The class is just another attribute of an HTML element, analogous to the href attribute of an a element. In CSS, classes are specified with a dot, and this of course also applies to selecting them with rvest. IDs are written with a pound sign at the beginning. Likewise, you select the node with a specific ID with the pound sign and the name of the ID.

8.2.2.1 Leverage the uniqueness of IDs

IDs should be unique across a web page. If you can make sure this is the case, it can reduce the complexity of your scraping selectors drastically.

Using html_elements(), find the shortest possible selector to select the first div in structured_html.

structured_raw_html <- 
" <html>
  <body>
    <div id = 'first'>
      <h1 class = 'big'>Joe Biden</h1>
      <p class = 'first blue'>Democrat</p>
      <p class = 'second blue'>Male</p>
    </div>
    <div id = 'second'>...</div>
    <div id = 'third'>
      <h1 class = 'big'>Donald Trump</h1>
      <p class = 'first red'>Republican</p>
      <p class = 'second red'>Male</p>
    </div>
  </body>
</html> "

structured_html <- read_html(structured_raw_html)
 
 # Select the first div
structured_html %>% 
  html_elements('#first')

## {xml_nodeset (1)}
## [1] <div id="first">\n      <h1 class="big">Joe Biden</h1>\n      <p class="f ...

# to select a div, use the # sign

8.2.2.2 Select the last child with a pseudo-class

In this exercise, your job is to select the last p node within the div.

As you learned in the video, pseudo-classes can help you whenever you don’t have other means of selecting a specific node of page, e.g., through an ID selector or a unique class.

In a first attempt, use the pseudo-class that selects the last child to scrape the last p in each group.

nested_raw_html <- "<html>
  <body>
    <div>
      <p class = 'text'>A sophisticated text [...]</p>
      <p class = 'text'>Another paragraph following [...]</p>
      <p class = 'text'>Author: T.G.</p>
    </div>
    <p>Copyright: DC</p>
  </body>
</html>"

nested_html <-  read_html(nested_raw_html)
  
# Select the last child of each p group
nested_html %>% 
    html_elements('p:last-child')

## {xml_nodeset (2)}
## [1] <p class="text">Author: T.G.</p>
## [2] <p>Copyright: DC</p>

As this selected the last p node from both groups, make use of the text class to get only the authorship information.

# This time for real: Select only the last node of the p's wrapped by the div
nested_html  %>% 
    html_elements('p.text:last-child')

## {xml_nodeset (1)}
## [1] <p class="text">Author: T.G.</p>

8.2.3 CSS combinators

There are four different types of commonly used combinators. They always have the same structure: A CSS selector, followed by a combinator character or operator, followed by another CSS selector. The first operator is a space, which is the descendant combinator. The second operator is a “greater than” sign, which forms the child combinator. The third operator is a plus sign, which designates the so-called adjacent sibling combinator. The last operator is a tilde, which defines another type of sibling combinator, the general sibling combinator.

8.2.3.1 Select direct descendants with the child combinator

By now, you surely know how to select elements by type, class, or ID. However, there are cases where these selectors won’t work, for example, if you only want to extract direct descendants of the top ul element. For that, you will use the child combinator (>) introduced in the video.

Here, your goal is to scrape a list (contained in the languages_html document) of all mentioned computer languages

First, gather all the li elements in the nested list shown above and print their text.

languages_raw_html <- " <ul id = 'languages'>
    <li>SQL</li>
    <ul>    
      <li>Databases</li>
      <li>Query Language</li>
    </ul>
    <li>R</li>
    <ul>
      <li>Collection</li>
      <li>Analysis</li>
      <li>Visualization</li>
    </ul>
    <li>Python</li>
  </ul>"

languages_html <- read_html(languages_raw_html)

# Extract the text of all list elements
languages_html %>% 
    html_elements('li') %>% 
    html_text('.text')

## [1] "SQL"            "Databases"      "Query Language" "R"             
## [5] "Collection"     "Analysis"       "Visualization"  "Python"

Unlike before, try to extract only direct descendants of the top-level ul element, using the child combinator.

# Extract only the text of the computer languages (without the sub lists)
languages_html %>% 
    html_elements('ul#languages > li') %>% 
    html_text()

## [1] "SQL"    "R"      "Python"

8.2.3.2 Simply the best!

complicated_raw_html <- '
"<html>
  <body>
    <div class="first section">
      A text with a <a href="#">link</a>.
    </div>
    <div class="second section">
      Some text with <a href="#">another link</a>.
      <div class="first paragraph">Some text.</div>
      <div class="second paragraph">Some more text.
        <div>...</div>
      </div>
    </div>
  </body>
</html>"
'

complicated_html <- read_html(complicated_raw_html)

# Select the three divs with a simple selector
complicated_html %>%
    html_elements('div div')

## {xml_nodeset (3)}
## [1] <div class="first paragraph">Some text.</div>
## [2] <div class="second paragraph">Some more text.\n        <div>...</div>\n   ...
## [3] <div>...</div>

8.2.3.3 Not every sibling is the same

In the video, you got to know the adjacent and general sibling combinator (+ and ~). Note that the two code examples here are not hierarchically organized. The only obvious difference is the class of the h2 element that precedes each example.

Select the first code tag after the second example’s h2. For this, use html_elements() with the correct sibling combinator.

code_raw_html <- "<html> 
<body> 
  <h2 class = 'first'>First example:</h2>
  <code>some = code(2)</code>
  <span>will compile to...</span>
  <code>some = more_code()</code>
  <h2 class = 'second'>Second example:</h2>
  <code>another = code(3)</code>
  <span>will compile to...</span>
  <code>another = more_code()</code>
</body> 
</html>"

code_html <- read_html(code_raw_html)

# Select only the first code element in the second example
code_html %>% 
    html_elements('h2.second + code')

## {xml_nodeset (1)}
## [1] <code>another = code(3)</code>

Now select all code elements that come after the second example’s h2. This time, use another type of sibling combinator.

# Select all code elements in the second example
code_html %>% 
    html_elements('h2.second ~ code')

## {xml_nodeset (2)}
## [1] <code>another = code(3)</code>
## [2] <code>another = more_code()</code>

8.3 Advanced Selection with XPATH

8.3.1 Introduction to XPATH

XPATH stands for XML Path Language. With this language, a so-called path through an HTML tree can be formulated, which is a slightly different approach than the one with CSS selectors.

8.3.1.1 Select by class and ID with XPATH

For this chapter, this code looks a bit more like real life. Your goal is to extract the precipitation reading from this weather station. Unfortunately, it can’t be directly referenced through an ID.

Let’s do this by setting up the building blocks step by step and then using them in combination!

In order to warm up, start by selecting all p tags in the above HTML using XPATH.

weather_raw_html <- "<html>
  <body>
    <div id = 'first'>
      <h1 class = 'big'>Berlin Weather Station</h1>
      <p class = 'first'>Temperature: 20°C</p>
      <p class = 'second'>Humidity: 45%</p>
    </div>
    <div id = 'second'>...</div>
    <div id = 'third'>
      <p class = 'first'>Sunshine: 5hrs</p>
      <p class = 'second'>Precipitation: 0mm</p>
    </div>
  </body>
</html>"

weather_html <- read_html(weather_raw_html)

# Select all p elements
weather_html %>%
    html_elements(xpath = '//p')

## {xml_nodeset (4)}
## [1] <p class="first">Temperature: 20°C</p>
## [2] <p class="second">Humidity: 45%</p>
## [3] <p class="first">Sunshine: 5hrs</p>
## [4] <p class="second">Precipitation: 0mm</p>

Now select only the p elements with class second.

# Select p elements with the second class
weather_html %>%
    html_elements(xpath = '//p[@class = "second"]')

## {xml_nodeset (2)}
## [1] <p class="second">Humidity: 45%</p>
## [2] <p class="second">Precipitation: 0mm</p>

Now select all p elements that are children of the element with ID third.

# Select p elements that are children of "#third"
weather_html %>%
    html_elements(xpath = '//*[@id = "third"]/p')

## {xml_nodeset (2)}
## [1] <p class="first">Sunshine: 5hrs</p>
## [2] <p class="second">Precipitation: 0mm</p>

Now select only the p element with class second that is a direct child of #third, again using XPATH.

# Select p elements with class "second" that are children of "#third"
weather_html %>%
    html_elements(xpath = '//*[@id = "third"]/p[@class = "second"]')

## {xml_nodeset (1)}
## [1] <p class="second">Precipitation: 0mm</p>

8.3.1.2 Use predicates to select nodes based on their children

With XPATH, something that’s not possible with CSS can be done: selecting elements based on the properties of their descendants. For this, predicates may be used. Here, your eventual goal is to select only div elements that enclose a p element with the third class. For that, you’ll need to select only the div that matches a certain predicate — having the respective descendant (it needn’t be a direct child).

Using XPATH, select all the div elements.

weather_raw_html_2 <- "<html>
  <body>
    <div id = 'first'>
      <h1 class = 'big'>Berlin Weather Station</h1>
      <p class = 'first'>Temperature: 20°C</p>
      <p class = 'second'>Humidity: 45%</p>
    </div>
    <div id = 'second'>...</div>
    <div id = 'third'>
      <p class = 'first'>Sunshine: 5hrs</p>
      <p class = 'second'>Precipitation: 0mm</p>
      <p class = 'third'>Snowfall: 0mm</p>
    </div>
  </body>
</html>"

weather_html2 <- read_html(weather_raw_html_2)

# Select all divs
weather_html2 %>% 
  html_elements(xpath = '//div')

## {xml_nodeset (3)}
## [1] <div id="first">\n      <h1 class="big">Berlin Weather Station</h1>\n     ...
## [2] <div id="second">...</div>
## [3] <div id="third">\n      <p class="first">Sunshine: 5hrs</p>\n      <p cla ...

Select all divs with p descendants using the predicate notation.

# Select all divs with p descendants
weather_html2 %>% 
  html_elements(xpath = '//div[p]')

## {xml_nodeset (2)}
## [1] <div id="first">\n      <h1 class="big">Berlin Weather Station</h1>\n     ...
## [2] <div id="third">\n      <p class="first">Sunshine: 5hrs</p>\n      <p cla ...

Select divs with p descendants which have the third class.

# Select all divs with p descendants having the "third" class
weather_html2 %>% 
  html_elements(xpath = '//div[p[@class = "third"]]')

## {xml_nodeset (1)}
## [1] <div id="third">\n      <p class="first">Sunshine: 5hrs</p>\n      <p cla ...

8.3.2 XPATH functions and advanced predicates

Besides axes, steps, and predicates, another building block of the XPATH notation are functions. With these, querying a website for specific elements becomes even easier. One of the most important - if not the most important - functions is position(). With it you can reference the current position of each element in your path selection, and then use that in a predicate.

8.3.2.1 Get to know the position() function

the position() function is very powerful when used within a predicate. Together with operators, you can basically select any node from those that match a certain path.

extract the text of the second p in every div using XPATH.

rules_raw_html <- "...
<div>
  <h2>Today's rules</h2>
  <p>Wear a mask</p>
  <p>Wash your hands</p>
</div>
<div>
  <h2>Tomorrow's rules</h2>
  <p>Wear a mask</p>
  <p>Wash your hands</p>
  <small>Bring hand sanitizer with you</small>
</div>
..."

rules_html <- read_html(rules_raw_html)

# Select the text of the second p in every div
rules_html %>% 
  html_elements(xpath = '//div/p[position() = 2]') %>%
  html_text()

## [1] "Wash your hands" "Wash your hands"

Now extract the text of every p (except the second) in every div.

# Select every p except the second from every div
rules_html %>% 
  html_elements(xpath = '//div/p[position() != 2]') %>%
  html_text()

## [1] "Wear a mask" "Wear a mask"

Extract the text of the last three children of the second div. Use the >= operator for selecting these children nodes.

# Select the text of the last three nodes of the second div
rules_html %>% 
  html_elements(xpath = '//div[position() = 2]/*[position() >= 2]') %>%
  html_text()

## [1] "Wear a mask"                   "Wash your hands"              
## [3] "Bring hand sanitizer with you"

8.3.2.2 Extract nodes based on the number of their children

As shown in the video, the XPATH count() function can be used within a predicate to narrow down a selection to these nodes that match a certain children count. This is especially helpful if your scraper depends on some nodes having a minimum amount of children. You’re only interested in divs that have exactly one h2 header and at least two paragraphs, because your application can’t really deal with incomplete weather forecasts.

Select the desired divs with the appropriate XPATH selector, making use of the count() function.

forecast_raw_html <- "...
<div>
  <h1>Tomorrow</h1>
</div>
<div>
  <h2>Berlin</h2>
  <p>Temperature: 20°C</p>
  <p>Humidity: 50%</p>
</div>
<div>
  <h2>London</h2>
  <p>Temperature: 15°C</p>
</div>
<div>
  <h2>Zurich</h2>
  <p>Temperature: 22°C</p>
  <p>Humidity: 60%</p>
</div>
..."

forecast_html <- read_html(forecast_raw_html)

# Select only divs with one header and at least two paragraphs
forecast_html %>%
    html_elements(xpath = '//div[count(h2) = 1 and count(p) > 1]')

## {xml_nodeset (2)}
## [1] <div>\n  <h2>Berlin</h2>\n  <p>Temperature: 20°C</p>\n  <p>Humidity: 50%< ...
## [2] <div>\n  <h2>Zurich</h2>\n  <p>Temperature: 22°C</p>\n  <p>Humidity: 60%< ...

8.3.3 The XPATH text() function

You have learned how CSS can be translated to XPATH and how you can query web pages using one or another. Also, you’ve been introduced to XPATH functions. An especially helpful one is the text() function.

8.3.3.1 The shortcomings of html_table() with badly structured tables

Sometimes, you only want to select text that’s a direct descendant of a parent element. In the following example table, however, the name of the role itself is wrapped in an em tag. But its function, e.g. “Voice”, is also contained in the same td element as the em part, which is not optimal for querying the data. In this exercise, you will try and scrape the table using a known rvest function. By doing so, you will recognize its limits.

Try to extract a data frame from the table with a function you have learned in the first chapter.

Have a look at the resulting data frame.

roles_raw_html <- '
"<table>
 <tr>
  <th>Actor</th>
  <th>Role</th>
 </tr>
 <tr>
  <td class = "actor">Jayden Carpenter</td>
  <td class = "role"><em>Mickey Mouse</em> (Voice)</td>
 </tr>
 ...
</table>"
'

roles_html <- read_html(roles_raw_html)

# Extract the data frame from the table using a known function from rvest
roles <- roles_html %>% 
  html_element(xpath = "//table") %>% 
  html_table()
# Print the contents of the role data frame
roles

## # A tibble: 1 × 2
##   Actor            Role                
##   <chr>            <chr>               
## 1 Jayden Carpenter Mickey Mouse (Voice)

8.3.3.2 Select directly from a parent element with XPATH’s text()

In this exercise, you’ll deal with the same table. This time, you’ll extract the function information in parentheses into their own column, so you are required to extract a data frame with not two, but three columns: actors, roles, and functions.

To do this, you’ll need to apply the specific XPATH function that was introduced in the video instead of html_table(), which often does not work in practice if the HTML table element is not well structured, as it is the case here.

First extract the actors and roles from the table using XPATH.

# Extract the actors in the cells having class "actor"
actors <- roles_html %>% 
  html_elements(xpath = '//table//td[@class = "actor"]') %>%
  html_text()
actors

## [1] "Jayden Carpenter"

# Extract the roles in the cells having class "role"
roles <- roles_html %>% 
  html_elements(xpath = '//table//td[@class = "role"]/em') %>% 
  html_text()
roles

## [1] "Mickey Mouse"

Then, extract the function using the XPATH text() function. Extract only the text with the parentheses, which is contained within the same cell as the corresponding role, and trim leading spaces.

# Extract the actors in the cells having class "actor"
actors <- roles_html %>% 
  html_elements(xpath = '//table//td[@class = "actor"]') %>%
  html_text()
actors

## [1] "Jayden Carpenter"

# Extract the roles in the cells having class "role"
roles <- roles_html %>% 
  html_elements(xpath = '//table//td[@class = "role"]/em') %>% 
  html_text()
roles

## [1] "Mickey Mouse"

# Extract the functions using the appropriate XPATH function
functions <- roles_html %>% 
  html_elements(xpath = '//table//td[@class = "role"]/text()') %>%
  html_text(trim = TRUE)
functions

## [1] "(Voice)"

8.3.3.3 Combine extracted data into a data frame

Extracting data like this is not as straightforward as with html_table(). So far, you’ve only extracted vectors of data. Now it’s time to combine them into their own data frame.

Combine the three vectors actors, roles, and functions into a data frame called cast (with columns Actor, Role and Function, respectively).

# Create a new data frame from the extracted vectors
cast <- tibble(
  Actor = actors, 
  Role = roles, 
  Function = functions)

cast

## # A tibble: 1 × 3
##   Actor            Role         Function
##   <chr>            <chr>        <chr>   
## 1 Jayden Carpenter Mickey Mouse (Voice)

8.3.3.4 Scrape an element based on its text

The text() function also allows you to select elements (and their parents) based on their text.

In this exercise, your goal is to extract the li element where “twice” is emphasized.

You might think that, here, it would be much easier to apply a CSS selector like li:last-child, but wait until you finish this exercise…

To start, select all li elements using XPATH

programming_raw_html <- "<h3>The rules of programming</h3>
<ol>
  <li>Have <em>fun</em>.</li>
  <li><strong>Don't</strong> repeat yourself.</li>
  <li>Think <em>twice</em> when naming variables.</li>
</ol>"

programming_html <- read_html(programming_raw_html)

# Select all li elements
programming_html %>%
    html_elements(xpath = '//li')

## {xml_nodeset (3)}
## [1] <li>Have <em>fun</em>.</li>
## [2] <li>\n<strong>Don't</strong> repeat yourself.</li>
## [3] <li>Think <em>twice</em> when naming variables.</li>

Secondly, add a function call that selects all em tags that contain “twice” as text within these li elements.

As you do that second selection based on the selection of //li in the first function, you don’t need to specify // or / before em.

# Select all li elements
programming_html %>%
    html_elements(xpath = '//li') %>%
    # Select all em elements within li elements that have "twice" as text
    html_elements(xpath = 'em[text() = "twice"]')

## {xml_nodeset (1)}
## [1] <em>twice</em>

Lastly, select the parent node of the selected em element.

# Select all li elements
programming_html %>%
    html_elements(xpath = '//li') %>%
    # Select all em elements within li elements that have "twice" as text
    html_elements(xpath = 'em[text() = "twice"]') %>%
    # Wander up the tree to select the parent of the em 
    html_element(xpath = '..')

## {xml_nodeset (1)}
## [1] <li>Think <em>twice</em> when naming variables.</li>

8.4 Scraping Best Practices

In the last chapter of this course, we’ll look a bit behind the curtains and see what’s at the foundation of scraping: So-called HTTP requests.

8.4.1 The nature of HTTP requests

HTTP stands for Hypertext Transfer Protocol and is a relatively simple set of rules that dictate how modern web browsers, or clients, communicate with a web server. The most common request methods, or at least those that will become relevant when you scrape a page, are GET and POST. GET is always used when a resource, be it an HTML page or a mere image, is to be fetched without submitting any user data. POST, on the other hand, is used when you need to submit some data to a web server. This most often is the result of a form that was filled out by the user.

8.4.1.1 Do it the httr way

As you have learned in the video, read_html() actually issues an HTTP GET request if provided with a URL, like in this case.

The goal of this exercise is to replicate the same query without read_html(), but with httr methods instead.

Note: Usually rvest does the job, but if you want to customize requests like you’ll be shown later in this chapter, you’ll need to know the httr way.

For a little repetition, you’ll also translate the CSS selector used in html_elements() into an XPATH query.

Use only httr functions to replicate the behavior of read_html(), including getting the response from Wikipedia and parsing the response object into an HTML document.

Check the resulting HTTP status code with the appropriate httr function.

# Get the HTML document from Wikipedia using httr
wikipedia_response <- GET('https://en.wikipedia.org/wiki/Varigotti')
# Parse the response into an HTML doc
wikipedia_page <- content(wikipedia_response)

Now parse the page to extract the elevation, but using XPATH instead of the CSS selector I specified above (table tr:nth-child(9) > td).

Make sure to correctly translate every element of the above CSS selector into an XPATH selector.

wikipedia_page %>% 
    html_elements(xpath = '//table//tr[position() = 9]/td') %>% 
    html_text()

## [1] "0 m (0 ft)"

8.4.1.2 Houston, we got a 404!

A fundamental part of the HTTP system are status codes: They tell you if everything is okay or if there is a problem with your request.

It is good practice to always check the status code of a response before you start working with the downloaded page. For this, you can use the status_code() function from the httr() package. It takes as an argument a response object that results from a request method.

Now let’s assume you’re trying to scrape the same page as before, but somehow you got the URL wrong (Varigott instead of Varigotti).

Read out the status code of the response object from the GET request.

response <- GET('https://en.wikipedia.org/wiki/Varigott')
# Print status code of inexistent page
status_code(response)

## [1] 404

8.4.2 Telling who you are with custom user agents

One big advantage of working directly with the httr package is that you can customize your requests. A best practice is to explicitly tell the web server your name, perhaps an e-mail address, and the purpose of the request. It’s not something you’d do when normally surfing the web, of course. But when scraping a page intensively, it is actually good practice. If the owners of the web server notice an unusual spike in traffic, it might be helpful for them to know who they can contact.

8.4.2.1 Check out your user agent

Normally when sending out requests, you don’t get to see the headers that accompany them.

The test platform httpbin.org has got you covered: it has a special address that returns the headers of each request that it reaches. This address is: https://httpbin.org/headers.

Check out the headers that are returned when accessing the above URL in R via the GET() method.

# Access https://httpbin.org/headers with httr
response <- GET('https://httpbin.org/headers')
# Print its content
content(response)

## $headers
## $headers$Accept
## [1] "application/json, text/xml, application/xml, */*"
## 
## $headers$`Accept-Encoding`
## [1] "deflate, gzip, br"
## 
## $headers$Host
## [1] "httpbin.org"
## 
## $headers$`User-Agent`
## [1] "libcurl/7.68.0 r-curl/5.1.0 httr/1.4.7"
## 
## $headers$`X-Amzn-Trace-Id`
## [1] "Root=1-65ee60ad-569a628874f1f6f772d89181"

8.4.2.2 Add a custom user agent

There’s also a httpbin.org address that only returns the current user agent (https://httpbin.org/user-agent). You’ll use this for the current exercise, where you’ll manipulate your own user agent to turn it into something meaningful (for the owners of the website you’re scraping, that is).

There are two ways of customizing your user agent when using httr for fetching web resources:

Locally, i.e. as an argument to the current request method.
Globally via set_config().

Send a GET request to https://httpbin.org/user-agent with a custom user agent that says “A request from a DataCamp course on scraping” and print the response.

In this step, set the user agent locally.

# Pass a custom user agent to a GET query to the mentioned URL
response <- GET('https://httpbin.org/user-agent', user_agent("A request from a DataCamp course on scraping"))
# Print the response content
content(response)

## $`user-agent`
## [1] "A request from a DataCamp course on scraping"

Now, make that custom user agent (“A request from a DataCamp course on scraping”) globally available across all future requests with set_config().

Test it out with another GET request.

set_config(add_headers(`User-Agent` = "A request from a DataCamp course on scraping"))
# Pass a custom user agent to a GET query to the mentioned URL
response <- GET('https://httpbin.org/user-agent', user_agent("A request from a DataCamp course on scraping"))
# Print the response content
content(response)

## $`user-agent`
## [1] "A request from a DataCamp course on scraping"

8.4.3 How to be gentle and slow down your requests

Besides telling who you are with custom user agents or other HTTP headers, another thing you can do is throttle your requests. This greatly reduces the load on the scraped website. Throttling becomes relevant if you are scraping a lot of pages in succession.

8.4.3.1 Apply throttling to a multi-page crawler

The goal of this exercise is to get the coordinates of earth’s three highest mountain peaks, together with their names.

You’ll get this information from their corresponding Wikipedia pages, in real-time. In order not to stress Wikipedia too much, you’ll apply throttling using the slowly() function. After each call to a Wikipedia page, your program should wait a small amount of time. Three pages of Wikipedia might not be that much, but the principle holds for any amount of scraping: be gentle and add wait time between requests.

You’ll find the name of the peak within an element with the ID “firstHeading”, while the coordinates are inside an element with class “geo-dms”, which is a descendant of an element with ID “coordinates”.

mountain_wiki_pages <- "https://en.wikipedia.org/w/index.php?title=Mount_Everest&oldid=958643874"
"https://en.wikipedia.org/w/index.php?title=K2&oldid=956671989"

## [1] "https://en.wikipedia.org/w/index.php?title=K2&oldid=956671989"

"https://en.wikipedia.org/w/index.php?title=Kangchenjunga&oldid=957008408"

## [1] "https://en.wikipedia.org/w/index.php?title=Kangchenjunga&oldid=957008408"

Construct a read_html() function that executes with a delay of a half second when executed in a loop.

# Define a throttled read_html() function with a delay of 0.5s
read_html_delayed <- slowly(read_html, 
                            rate = rate_delay(0.5))

Now write a for loop that goes over every page URL in the prepared variable mountain_wiki_pages and stores the HTML available at the corresponding Wikipedia URL into the html variable

# Define a throttled read_html() function with a delay of 0.5s
read_html_delayed <- slowly(read_html, 
                            rate = rate_delay(0.5))
# Construct a loop that goes over all page urls
for(page_url in mountain_wiki_pages){
  # Read in the html of each URL with the function defined above
  html <- read_html_delayed(page_url)
}

extract the name of the peak (available in #firstHeading)

extract the name of the coordinates (available in .geo-dms, which is a descendant of #coordinates) (be patient, this may take a few seconds).

# Define a throttled read_html() function with a delay of 0.5s
read_html_delayed <- slowly(read_html, 
                            rate = rate_delay(0.5))
# Construct a loop that goes over all page urls
for(page_url in mountain_wiki_pages){
  # Read in the html of each URL with a delay of 0.5s
  html <- read_html_delayed(page_url)
  # Extract the name of the peak and its coordinates
  peak <- html %>% 
    html_element("#firstHeading") %>% html_text()
  coords <- html %>% 
    html_element("#coordinates .geo-dms") %>% html_text()
  print(paste(peak, coords, sep = ": "))
}

## [1] "Mount Everest: 27°59′17″N 86°55′31″E"