8 Web Scraping in R

https://learn.datacamp.com/courses/web-scraping-in-r

Main functions and concepts covered in this BP chapter:

  1. Read in HTML
  2. Navigating HTML
    • Select all children of a list
    • Parse hyperlinks into a data frame
  3. Scrape your first table
  4. Turn a table into a data frame with html_table()
  5. Introduction to CSS
    • Select multiple HTML types
  6. CSS classes and IDs
    • Leverage the uniqueness of IDs
    • Select the last child with a pseudo-class
  7. CSS combinators
    • Select direct descendants with the child combinator
    • Simply the best!
    • Not every sibling is the same
  8. Introduction to XPATH
    • Select by class and ID with XPATH
    • Use predicates to select nodes based on their children
  9. XPATH functions and advanced predicates
    • Get to know the position() function
    • Extract nodes based on the number of their children
  10. The XPATH text() function
    • The shortcomings of html_table() with badly structured tables
    • Select directly from a parent element with XPATH’s text()
    • Combine extracted data into a data frame
    • Scrape an element based on its text
  11. The nature of HTTP requests
    • Do it the httr way
    • Houston, we got a 404!
  12. Telling who you are with custom user agents
    • Check out your user agent
    • Add a custom user agent
  13. How to be gentle and slow down your requests
    • Apply throttling to a multi-page crawler

Packages used in this chapter:

## Load all packages used in this chapter
library(tidyverse) #includes dplyr, ggplot2, and other common packages
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(rvest)
## 
## Attaching package: 'rvest'
## 
## The following object is masked from 'package:readr':
## 
##     guess_encoding
library(xml2)
library(httr)
library(purrr)

Datasets used in this chapter:

## Load datasets used in this chapter
## For this chapter, they're probably all coming from the web though

8.1 Introduction to HTML and Web Scraping

Some data sets can be downloaded directly, but many cannot. The data may only be viewable in a browser, possibly spread across multiple pages, with no download option. Web scraping lets us collect such data anyway.

8.1.1 Read in HTML

Take the html_excerpt_raw variable and turn it into an HTML document that R understands using a function from the rvest package.

Use the xml_structure() function to get a better overview of the tag hierarchy of the HTML excerpt.

html_excerpt_raw <- '
<html> 
  <body> 
    <h1>Web scraping is cool</h1>
    <p>It involves writing code – be it R or Python.</p>
    <p><a href="https://datacamp.com">DataCamp</a> 
        has courses on it.</p>
  </body> 
</html>'
# Turn the raw excerpt into an HTML document R understands
html_excerpt <- read_html(html_excerpt_raw)
html_excerpt
## {html_document}
## <html>
## [1] <body> \n    <h1>Web scraping is cool</h1>\n    <p>It involves writing co ...
# Print the HTML excerpt with the xml_structure() function
xml_structure(html_excerpt)
## <html>
##   <body>
##     {text}
##     <h1>
##       {text}
##     {text}
##     <p>
##       {text}
##     {text}
##     <p>
##       <a [href]>
##         {text}
##       {text}
##     {text}
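Once the excerpt is parsed, individual elements can be pulled out of the tree. The following is a minimal sketch, assuming a small made-up excerpt (the variable names excerpt, heading, and link are illustrative only); it uses html_element(), html_text(), and html_attr() from rvest.

```r
library(rvest)

# Hypothetical excerpt, similar in shape to the one above
excerpt <- read_html('<html><body>
  <h1>Web scraping is cool</h1>
  <p><a href="https://datacamp.com">DataCamp</a> has courses on it.</p>
</body></html>')

# Text of the first h1 element
heading <- excerpt %>% html_element("h1") %>% html_text()

# Target of the first hyperlink, read from its href attribute
link <- excerpt %>% html_element("a") %>% html_attr("href")

heading
link
```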

8.1.3 Scrape your first table

In its most basic form, a table may consist of only three different HTML tags: table, tr, and td. The table tag designates a table, as the name says. The tr tag designates rows and is wrapped around multiple td tags, which designate single cells. Normally, the number of td tags in each row should be identical. However, there is the colspan attribute for td, that allows a cell to span multiple columns.

Apart from functions like html_element() and html_text(), rvest provides a helper function called html_table(). Note that when html_table() is applied to a whole document, as in this case, the output is a list with one element per table; here that is a single element, because the document contains only one table. If there were more than one table in the document, the list would have more entries. html_table() can also be told explicitly to treat the first row as the header row, even if it consists only of td tags.
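As a rough illustration of the points above, here is a sketch with a made-up table that uses only table, tr, and td tags (cities_raw_html and its contents are assumptions, not course data). Calling html_table() on the whole document returns a list, and header = TRUE promotes the first row to column names.

```r
library(rvest)

# Hypothetical table built from only table, tr, and td tags
cities_raw_html <- "<table>
  <tr><td>City</td><td>Country</td></tr>
  <tr><td>Berlin</td><td>Germany</td></tr>
  <tr><td>Zurich</td><td>Switzerland</td></tr>
</table>"

cities_html <- read_html(cities_raw_html)

# On a whole document, html_table() returns a list with one entry per table
tables <- html_table(cities_html, header = TRUE)
length(tables)

# Pull out the only table as a tibble
cities <- tables[[1]]
cities
```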

8.1.4 Turn a table into a data frame with html_table()

If a table has a header row (with th elements) and no gaps, scraping it is straightforward, as with the table having ID "clean" below. The table with ID "dirty" is less tidy: its header row consists of td cells, and one of its rows is missing a cell. For such cases, html_table() has an extra argument you can use to correctly parse the table, as shown in the video. Missing cells are automatically recognized and replaced with NA values.

Turn the table with ID "clean" into a data frame called mountains.

mountains_raw_html <- "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">\n<html>\n<head><meta http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\"></head>\n<body> \n  <table id=\"clean\">\n<tr>\n<th>Mountain</th>\n      <th>Height [m]</th>\n      <th>First ascent</th>\n      <th>Country</th>\n    </tr>\n<tr>\n<td>Mount Everest</td>\n      <td>8848</td>\n      <td>1953</td>\n      <td>Nepal, China</td>\n    </tr>\n<tr>\n<td>K2</td>\n      <td>8611</td>\n      <td>1954</td>\n      <td>Pakistan, China</td>\n    </tr>\n<tr>\n<td>Kanchenjunga</td>\n      <td>8586</td>\n      <td>1955</td>\n      <td>Nepal, India</td>\n    </tr>\n</table>\n<table id=\"dirty\">\n<tr>\n<td>Mountain </td>\n      <td>Height [m]</td>\n      <td>First ascent</td>\n      <td>Country</td>\n    </tr>\n<tr>\n<td>Mount Everest</td>\n      <td>8848</td>\n      <td>1953</td>\n    </tr>\n<tr>\n<td>K2</td>\n      <td>8611</td>\n      <td>1954</td>\n      <td>Pakistan, China</td>\n    </tr>\n<tr>\n<td>Kanchenjunga</td>\n      <td>8586</td>\n      <td>1955</td>\n      <td>Nepal, India</td>\n    </tr>\n</table>\n</body>\n</html>\n"

# Extract the "clean" table into a data frame 
mountains <- read_html(mountains_raw_html) %>% 
  html_element("table#clean") %>% 
  html_table()

mountains
## # A tibble: 3 × 4
##   Mountain      `Height [m]` `First ascent` Country        
##   <chr>                <int>          <int> <chr>          
## 1 Mount Everest         8848           1953 Nepal, China   
## 2 K2                    8611           1954 Pakistan, China
## 3 Kanchenjunga          8586           1955 Nepal, India

Do the same with the "dirty" table, but designate the first line as header.

# Extract the "dirty" table into a data frame, treating the first row as the header
mountains <- read_html(mountains_raw_html) %>% 
  html_element("table#dirty") %>% 
  html_table(header=TRUE)

mountains
## # A tibble: 3 × 4
##   Mountain      `Height [m]` `First ascent` Country        
##   <chr>                <int>          <int> <chr>          
## 1 Mount Everest         8848           1953 <NA>           
## 2 K2                    8611           1954 Pakistan, China
## 3 Kanchenjunga          8586           1955 Nepal, India

8.3 Advanced Selection with XPATH

8.3.1 Introduction to XPATH

XPATH stands for XML Path Language. With this language, a so-called path through an HTML tree can be formulated, which is a slightly different approach than the one with CSS selectors.
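To make the difference in notation concrete, the sketch below runs a CSS selector and a hand-translated XPATH query against the same made-up snippet (station_html and both result names are illustrative). One caveat to keep in mind: the XPATH test @class = "second" compares the whole attribute value, whereas the CSS selector .second also matches elements that carry additional classes.

```r
library(rvest)

# Hypothetical snippet with one div and two paragraphs
station_html <- read_html("<div id='third'>
  <p class='first'>Sunshine: 5hrs</p>
  <p class='second'>Precipitation: 0mm</p>
</div>")

# CSS: p elements with class second that are children of #third
css_result <- station_html %>%
  html_elements("div#third > p.second") %>%
  html_text()

# XPATH: the same selection written as a path with predicates
xpath_result <- station_html %>%
  html_elements(xpath = '//div[@id = "third"]/p[@class = "second"]') %>%
  html_text()

identical(css_result, xpath_result)
```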

8.3.1.1 Select by class and ID with XPATH

The HTML in this section looks a bit more like what you would encounter in real life. Your goal is to extract the precipitation reading from this weather station. Unfortunately, it can’t be directly referenced through an ID.

Let’s do this by setting up the building blocks step by step and then using them in combination!

To warm up, start by selecting all p tags in the HTML below using XPATH.

weather_raw_html <- "<html>
  <body>
    <div id = 'first'>
      <h1 class = 'big'>Berlin Weather Station</h1>
      <p class = 'first'>Temperature: 20°C</p>
      <p class = 'second'>Humidity: 45%</p>
    </div>
    <div id = 'second'>...</div>
    <div id = 'third'>
      <p class = 'first'>Sunshine: 5hrs</p>
      <p class = 'second'>Precipitation: 0mm</p>
    </div>
  </body>
</html>"

weather_html <- read_html(weather_raw_html)

# Select all p elements
weather_html %>%
    html_elements(xpath = '//p')
## {xml_nodeset (4)}
## [1] <p class="first">Temperature: 20°C</p>
## [2] <p class="second">Humidity: 45%</p>
## [3] <p class="first">Sunshine: 5hrs</p>
## [4] <p class="second">Precipitation: 0mm</p>

Now select only the p elements with class second.

# Select p elements with the second class
weather_html %>%
    html_elements(xpath = '//p[@class = "second"]')
## {xml_nodeset (2)}
## [1] <p class="second">Humidity: 45%</p>
## [2] <p class="second">Precipitation: 0mm</p>

Now select all p elements that are children of the element with ID third.

# Select p elements that are children of "#third"
weather_html %>%
    html_elements(xpath = '//*[@id = "third"]/p')
## {xml_nodeset (2)}
## [1] <p class="first">Sunshine: 5hrs</p>
## [2] <p class="second">Precipitation: 0mm</p>

Now select only the p element with class second that is a direct child of #third, again using XPATH.

# Select p elements with class "second" that are children of "#third"
weather_html %>%
    html_elements(xpath = '//*[@id = "third"]/p[@class = "second"]')
## {xml_nodeset (1)}
## [1] <p class="second">Precipitation: 0mm</p>

8.3.1.2 Use predicates to select nodes based on their children

With XPATH, you can do something that’s not possible with the CSS selectors covered so far: select elements based on the properties of the elements they contain. For this, predicates may be used. Here, your eventual goal is to select only div elements that enclose a p element with the third class. For that, you’ll need to select only the div that matches a certain predicate: having the respective p element. Note that a predicate such as [p] tests direct children only; to match a descendant at any depth, write [.//p] instead.

Using XPATH, select all the div elements.

weather_raw_html_2 <- "<html>
  <body>
    <div id = 'first'>
      <h1 class = 'big'>Berlin Weather Station</h1>
      <p class = 'first'>Temperature: 20°C</p>
      <p class = 'second'>Humidity: 45%</p>
    </div>
    <div id = 'second'>...</div>
    <div id = 'third'>
      <p class = 'first'>Sunshine: 5hrs</p>
      <p class = 'second'>Precipitation: 0mm</p>
      <p class = 'third'>Snowfall: 0mm</p>
    </div>
  </body>
</html>"

weather_html2 <- read_html(weather_raw_html_2)

# Select all divs
weather_html2 %>% 
  html_elements(xpath = '//div')
## {xml_nodeset (3)}
## [1] <div id="first">\n      <h1 class="big">Berlin Weather Station</h1>\n     ...
## [2] <div id="second">...</div>
## [3] <div id="third">\n      <p class="first">Sunshine: 5hrs</p>\n      <p cla ...

Select all divs with p descendants using the predicate notation.

# Select all divs with p descendants
weather_html2 %>% 
  html_elements(xpath = '//div[p]')
## {xml_nodeset (2)}
## [1] <div id="first">\n      <h1 class="big">Berlin Weather Station</h1>\n     ...
## [2] <div id="third">\n      <p class="first">Sunshine: 5hrs</p>\n      <p cla ...

Select divs with p descendants which have the third class.

# Select all divs with p descendants having the "third" class
weather_html2 %>% 
  html_elements(xpath = '//div[p[@class = "third"]]')
## {xml_nodeset (1)}
## [1] <div id="third">\n      <p class="first">Sunshine: 5hrs</p>\n      <p cla ...
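A predicate written as [p] only looks at direct children, while [.//p] looks at descendants at any depth. The sketch below shows the difference on a made-up snippet (nested_html and the two ID vectors are illustrative names, not part of the course material).

```r
library(rvest)

# Hypothetical snippet: one p is a direct child, the other is nested deeper
nested_html <- read_html("<body>
  <div id='direct'><p>child</p></div>
  <div id='nested'><blockquote><p>grandchild</p></blockquote></div>
</body>")

# [p] keeps only divs that have a p as a direct child
direct_ids <- nested_html %>%
  html_elements(xpath = '//div[p]') %>%
  html_attr("id")

# [.//p] keeps divs that have a p anywhere below them
any_depth_ids <- nested_html %>%
  html_elements(xpath = '//div[.//p]') %>%
  html_attr("id")

direct_ids
any_depth_ids
```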

8.3.2 XPATH functions and advanced predicates

Besides axes, steps, and predicates, functions are another building block of the XPATH notation. With these, querying a website for specific elements becomes even easier. One of the most important functions, if not the most important, is position(). With it you can reference the current position of each element in your path selection, and then use that in a predicate.

8.3.2.1 Get to know the position() function

The position() function is very powerful when used within a predicate. Together with comparison operators, it lets you select practically any node from those that match a certain path.

Extract the text of the second p in every div using XPATH.

rules_raw_html <- "...
<div>
  <h2>Today's rules</h2>
  <p>Wear a mask</p>
  <p>Wash your hands</p>
</div>
<div>
  <h2>Tomorrow's rules</h2>
  <p>Wear a mask</p>
  <p>Wash your hands</p>
  <small>Bring hand sanitizer with you</small>
</div>
..."

rules_html <- read_html(rules_raw_html)

# Select the text of the second p in every div
rules_html %>% 
  html_elements(xpath = '//div/p[position() = 2]') %>%
  html_text()
## [1] "Wash your hands" "Wash your hands"

Now extract the text of every p (except the second) in every div.

# Select every p except the second from every div
rules_html %>% 
  html_elements(xpath = '//div/p[position() != 2]') %>%
  html_text()
## [1] "Wear a mask" "Wear a mask"

Extract the text of the last three children of the second div. Use the >= operator for selecting these children nodes.

# Select the text of the last three nodes of the second div
rules_html %>% 
  html_elements(xpath = '//div[position() = 2]/*[position() >= 2]') %>%
  html_text()
## [1] "Wear a mask"                   "Wash your hands"              
## [3] "Bring hand sanitizer with you"
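XPATH also has a shorthand for the most common case: a bare number in a predicate, such as [2], means the same as [position() = 2]. The following sketch checks this on a small made-up snippet (rules_demo_html, long_form, and short_form are illustrative names).

```r
library(rvest)

# Hypothetical snippet with one heading and two paragraphs
rules_demo_html <- read_html("<div>
  <h2>Rules</h2>
  <p>Wear a mask</p>
  <p>Wash your hands</p>
</div>")

# Explicit form using the position() function
long_form <- rules_demo_html %>%
  html_elements(xpath = '//div/p[position() = 2]') %>%
  html_text()

# Abbreviated form with a bare index
short_form <- rules_demo_html %>%
  html_elements(xpath = '//div/p[2]') %>%
  html_text()

identical(long_form, short_form)
```

The explicit form is still needed as soon as an operator other than equality is involved, as in position() >= 2.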

8.3.2.2 Extract nodes based on the number of their children

As shown in the video, the XPATH count() function can be used within a predicate to narrow down a selection to those nodes that have a certain number of children. This is especially helpful if your scraper depends on some nodes having a minimum number of children. Here, you’re only interested in divs that have exactly one h2 header and at least two paragraphs, because your application can’t really deal with incomplete weather forecasts.

Select the desired divs with the appropriate XPATH selector, making use of the count() function.

forecast_raw_html <- "...
<div>
  <h1>Tomorrow</h1>
</div>
<div>
  <h2>Berlin</h2>
  <p>Temperature: 20°C</p>
  <p>Humidity: 50%</p>
</div>
<div>
  <h2>London</h2>
  <p>Temperature: 15°C</p>
</div>
<div>
  <h2>Zurich</h2>
  <p>Temperature: 22°C</p>
  <p>Humidity: 60%</p>
</div>
..."

forecast_html <- read_html(forecast_raw_html)

# Select only divs with one header and at least two paragraphs
forecast_html %>%
    html_elements(xpath = '//div[count(h2) = 1 and count(p) > 1]')
## {xml_nodeset (2)}
## [1] <div>\n  <h2>Berlin</h2>\n  <p>Temperature: 20°C</p>\n  <p>Humidity: 50%< ...
## [2] <div>\n  <h2>Zurich</h2>\n  <p>Temperature: 22°C</p>\n  <p>Humidity: 60%< ...

8.3.3 The XPATH text() function

You have learned how CSS can be translated to XPATH and how you can query web pages using either one. Also, you’ve been introduced to XPATH functions. An especially helpful one is the text() function.

8.3.3.1 The shortcomings of html_table() with badly structured tables

Sometimes, you only want to select text that’s a direct descendant of a parent element. In the following example table, however, the name of the role itself is wrapped in an em tag. But its function, e.g. “Voice”, is also contained in the same td element as the em part, which is not optimal for querying the data. In this exercise, you will try and scrape the table using a known rvest function. By doing so, you will recognize its limits.

Try to extract a data frame from the table with a function you have learned in the first chapter.

Have a look at the resulting data frame.

roles_raw_html <- '
<table>
 <tr>
  <th>Actor</th>
  <th>Role</th>
 </tr>
 <tr>
  <td class = "actor">Jayden Carpenter</td>
  <td class = "role"><em>Mickey Mouse</em> (Voice)</td>
 </tr>
 ...
</table>
'

roles_html <- read_html(roles_raw_html)

# Extract the data frame from the table using a known function from rvest
roles <- roles_html %>% 
  html_element(xpath = "//table") %>% 
  html_table()
# Print the contents of the role data frame
roles
## # A tibble: 1 × 2
##   Actor            Role                
##   <chr>            <chr>               
## 1 Jayden Carpenter Mickey Mouse (Voice)

8.3.3.2 Select directly from a parent element with XPATH’s text()

In this exercise, you’ll deal with the same table. This time, you’ll extract the function information in parentheses into its own column, so you are required to extract a data frame with not two, but three columns: actors, roles, and functions.

To do this, you’ll need to apply the specific XPATH function that was introduced in the video instead of html_table(), which often does not work in practice if the HTML table element is not well structured, as is the case here.

First extract the actors and roles from the table using XPATH.

# Extract the actors in the cells having class "actor"
actors <- roles_html %>% 
  html_elements(xpath = '//table//td[@class = "actor"]') %>%
  html_text()
actors
## [1] "Jayden Carpenter"
# Extract the roles in the cells having class "role"
roles <- roles_html %>% 
  html_elements(xpath = '//table//td[@class = "role"]/em') %>% 
  html_text()
roles
## [1] "Mickey Mouse"

Then, extract the function using the XPATH text() function. Extract only the text with the parentheses, which is contained within the same cell as the corresponding role, and trim leading spaces.

# Extract the actors in the cells having class "actor"
actors <- roles_html %>% 
  html_elements(xpath = '//table//td[@class = "actor"]') %>%
  html_text()
actors
## [1] "Jayden Carpenter"
# Extract the roles in the cells having class "role"
roles <- roles_html %>% 
  html_elements(xpath = '//table//td[@class = "role"]/em') %>% 
  html_text()
roles
## [1] "Mickey Mouse"
# Extract the functions using the appropriate XPATH function
functions <- roles_html %>% 
  html_elements(xpath = '//table//td[@class = "role"]/text()') %>%
  html_text(trim = TRUE)
functions
## [1] "(Voice)"

8.3.3.3 Combine extracted data into a data frame

Extracting data like this is not as straightforward as with html_table(). So far, you’ve only extracted vectors of data. Now it’s time to combine them into their own data frame.

Combine the three vectors actors, roles, and functions into a data frame called cast (with columns Actor, Role and Function, respectively).

# Create a new data frame from the extracted vectors
cast <- tibble(
  Actor = actors, 
  Role = roles, 
  Function = functions)

cast
## # A tibble: 1 × 3
##   Actor            Role         Function
##   <chr>            <chr>        <chr>   
## 1 Jayden Carpenter Mickey Mouse (Voice)
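Combining separately extracted vectors only works when they all have the same length, which can fail if a cell is missing in some row. One possible way around this, sketched below under the assumption of a made-up two-row table (cast_raw_html, rows, and cast_demo are illustrative names), is to select the rows first and then call html_element() once per column, since html_element() returns one result per input node and yields NA where nothing matches.

```r
library(rvest)
library(tibble)

# Hypothetical table: the second row has no em-wrapped role name
cast_raw_html <- '<table>
 <tr><td class="actor">Actor A</td><td class="role"><em>Role A</em> (Voice)</td></tr>
 <tr><td class="actor">Actor B</td><td class="role">(Narrator)</td></tr>
</table>'

# Select the rows once, then query each row relative to itself
rows <- read_html(cast_raw_html) %>%
  html_elements(xpath = '//table//tr')

cast_demo <- tibble(
  Actor = rows %>% html_element(xpath = 'td[@class = "actor"]') %>% html_text(),
  Role = rows %>% html_element(xpath = 'td[@class = "role"]/em') %>% html_text(),
  Function = rows %>% html_element(xpath = 'td[@class = "role"]/text()') %>%
    html_text(trim = TRUE)
)

cast_demo
```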

8.3.3.4 Scrape an element based on its text

The text() function also allows you to select elements (and their parents) based on their text.

In this exercise, your goal is to extract the li element where “twice” is emphasized.

You might think that, here, it would be much easier to apply a CSS selector like li:last-child, but wait until you finish this exercise…

To start, select all li elements using XPATH.

programming_raw_html <- "<h3>The rules of programming</h3>
<ol>
  <li>Have <em>fun</em>.</li>
  <li><strong>Don't</strong> repeat yourself.</li>
  <li>Think <em>twice</em> when naming variables.</li>
</ol>"

programming_html <- read_html(programming_raw_html)

# Select all li elements
programming_html %>%
    html_elements(xpath = '//li')
## {xml_nodeset (3)}
## [1] <li>Have <em>fun</em>.</li>
## [2] <li>\n<strong>Don't</strong> repeat yourself.</li>
## [3] <li>Think <em>twice</em> when naming variables.</li>

Secondly, add a function call that selects all em tags that contain “twice” as text within these li elements.

Because the second selection is made relative to the li nodes returned by the first call, you don’t need to specify // or / before em.

# Select all li elements
programming_html %>%
    html_elements(xpath = '//li') %>%
    # Select all em elements within li elements that have "twice" as text
    html_elements(xpath = 'em[text() = "twice"]')
## {xml_nodeset (1)}
## [1] <em>twice</em>

Lastly, select the parent node of the selected em element.

# Select all li elements
programming_html %>%
    html_elements(xpath = '//li') %>%
    # Select all em elements within li elements that have "twice" as text
    html_elements(xpath = 'em[text() = "twice"]') %>%
    # Wander up the tree to select the parent of the em 
    html_element(xpath = '..')
## {xml_nodeset (1)}
## [1] <li>Think <em>twice</em> when naming variables.</li>
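The same li can also be reached in a single query by nesting the text test inside a predicate on li. This is only a sketch of an alternative, using a shortened made-up list (rules_list_html and twice_item are illustrative names).

```r
library(rvest)

# Hypothetical, shortened version of the list of rules
rules_list_html <- read_html("<ol>
  <li>Have <em>fun</em>.</li>
  <li>Think <em>twice</em> when naming variables.</li>
</ol>")

# Select the li that has an em child whose text is exactly "twice"
twice_item <- rules_list_html %>%
  html_elements(xpath = '//li[em[text() = "twice"]]') %>%
  html_text()

twice_item
```

Unlike a positional selector such as li:last-child, this query keeps working if the order of the list items changes.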

8.4 Scraping Best Practices

In the last chapter of this course, we’ll look behind the curtain and see what lies at the foundation of scraping: so-called HTTP requests.

8.4.1 The nature of HTTP requests

HTTP stands for Hypertext Transfer Protocol and is a relatively simple set of rules that dictate how modern web browsers, or clients, communicate with a web server. The most common request methods, or at least those that will become relevant when you scrape a page, are GET and POST. GET is always used when a resource, be it an HTML page or a mere image, is to be fetched without submitting any user data. POST, on the other hand, is used when you need to submit some data to a web server. This most often is the result of a form that was filled out by the user.

8.4.1.1 Do it the httr way

As you have learned in the video, read_html() actually issues an HTTP GET request if provided with a URL, like in this case.

The goal of this exercise is to replicate the same query without read_html(), but with httr methods instead.

Note: Usually rvest does the job, but if you want to customize requests like you’ll be shown later in this chapter, you’ll need to know the httr way.

For a little repetition, you’ll also translate the CSS selector used in html_elements() into an XPATH query.

Use only httr functions to replicate the behavior of read_html(), including getting the response from Wikipedia and parsing the response object into an HTML document.

Check the resulting HTTP status code with the appropriate httr function.

# Get the HTML document from Wikipedia using httr
wikipedia_response <- GET('https://en.wikipedia.org/wiki/Varigotti')
# Parse the response into an HTML doc
wikipedia_page <- content(wikipedia_response)

Now parse the page to extract the elevation, but using XPATH instead of the CSS selector table tr:nth-child(9) > td.

Make sure to correctly translate every element of this CSS selector into an XPATH selector.

wikipedia_page %>% 
    html_elements(xpath = '//table//tr[position() = 9]/td') %>% 
    html_text()
## [1] "0 m (0 ft)"

8.4.1.2 Houston, we got a 404!

Status codes are a fundamental part of HTTP: they tell you if everything is okay or if there is a problem with your request.

It is good practice to always check the status code of a response before you start working with the downloaded page. For this, you can use the status_code() function from the httr package. It takes as an argument a response object that results from a request method.

Now let’s assume you’re trying to scrape the same page as before, but somehow you got the URL wrong (Varigott instead of Varigotti).

Read out the status code of the response object from the GET request.

response <- GET('https://en.wikipedia.org/wiki/Varigott')
# Print status code of nonexistent page
status_code(response)
## [1] 404
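Status codes are grouped by their first digit: 2xx means success, 3xx redirection, 4xx a client error (such as the 404 above), and 5xx a server error. The helper below is a hypothetical sketch (status_class() is not part of httr) that maps a numeric code to its class, so a scraper can decide whether to continue.

```r
# Hypothetical helper: map an HTTP status code to its class
status_class <- function(code) {
  if (code >= 100 && code < 200) {
    "informational"
  } else if (code >= 200 && code < 300) {
    "success"
  } else if (code >= 300 && code < 400) {
    "redirection"
  } else if (code >= 400 && code < 500) {
    "client error"
  } else if (code >= 500 && code < 600) {
    "server error"
  } else {
    "unknown"
  }
}

status_class(404)
status_class(200)
```

With a real response object, status_class(status_code(response)) would give the class; httr also offers http_error() and stop_for_status() for the common case of stopping on 4xx and 5xx codes.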

8.4.2 Telling who you are with custom user agents

One big advantage of working directly with the httr package is that you can customize your requests. A best practice is to explicitly tell the web server your name, perhaps an e-mail address, and the purpose of the request. It’s not something you’d do when normally surfing the web, of course. But when scraping a page intensively, it is actually good practice. If the owners of the web server notice an unusual spike in traffic, it might be helpful for them to know who they can contact.
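As a sketch of what such a self-description might look like, the code below assembles a user agent string from a name, a contact address, and a purpose, and wraps it with httr's user_agent(). The name, address, and wording are placeholders, and nothing is sent over the network here.

```r
library(httr)

# Placeholder identity and purpose; replace with your own details
ua_string <- paste(
  "Jane Doe",                                  # hypothetical name
  "jane.doe@example.com",                      # hypothetical contact address
  "collecting mountain data for a course exercise",
  sep = "; "
)

# user_agent() builds a request configuration that can be passed to GET()
ua <- user_agent(ua_string)
ua
```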

8.4.2.1 Check out your user agent

Normally when sending out requests, you don’t get to see the headers that accompany them.

The test platform httpbin.org has got you covered: it has a special address that returns the headers of each request that it reaches. This address is: https://httpbin.org/headers.

Check out the headers that are returned when accessing the above URL in R via the GET() method.

# Access https://httpbin.org/headers with httr
response <- GET('https://httpbin.org/headers')
# Print its content
content(response)
## $headers
## $headers$Accept
## [1] "application/json, text/xml, application/xml, */*"
## 
## $headers$`Accept-Encoding`
## [1] "deflate, gzip, br"
## 
## $headers$Host
## [1] "httpbin.org"
## 
## $headers$`User-Agent`
## [1] "libcurl/7.68.0 r-curl/5.1.0 httr/1.4.7"
## 
## $headers$`X-Amzn-Trace-Id`
## [1] "Root=1-65ee60ad-569a628874f1f6f772d89181"

8.4.2.2 Add a custom user agent

There’s also an httpbin.org address that only returns the current user agent (https://httpbin.org/user-agent). You’ll use this for the current exercise, where you’ll manipulate your own user agent to turn it into something meaningful (for the owners of the website you’re scraping, that is).

There are two ways of customizing your user agent when using httr for fetching web resources:

  1. Locally, i.e. as an argument to the current request method.
  2. Globally via set_config().

Send a GET request to https://httpbin.org/user-agent with a custom user agent that says “A request from a DataCamp course on scraping” and print the response.

In this step, set the user agent locally.

# Pass a custom user agent to a GET query to the mentioned URL
response <- GET('https://httpbin.org/user-agent', user_agent("A request from a DataCamp course on scraping"))
# Print the response content
content(response)
## $`user-agent`
## [1] "A request from a DataCamp course on scraping"

Now, make that custom user agent (“A request from a DataCamp course on scraping”) globally available across all future requests with set_config().

Test it out with another GET request.

# Globally set the custom user agent for all subsequent requests
set_config(add_headers(`User-Agent` = "A request from a DataCamp course on scraping"))
# Send a GET request without a local user agent; the global setting applies
response <- GET('https://httpbin.org/user-agent')
# Print the response content
content(response)
## $`user-agent`
## [1] "A request from a DataCamp course on scraping"

8.4.3 How to be gentle and slow down your requests

Besides telling who you are with custom user agents or other HTTP headers, another thing you can do is throttle your requests. This greatly reduces the load on the scraped website. Throttling becomes relevant if you are scraping a lot of pages in succession.
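The effect of throttling can be tried out without touching any website. In the sketch below, fake_fetch() is a made-up stand-in for read_html() and the URLs are placeholders; purrr's slowly() wraps the function so that successive calls are spaced out by the delay given to rate_delay().

```r
library(purrr)

# Stand-in for read_html(): a cheap function, so the example runs offline
fake_fetch <- function(url) paste("fetched", url)

# Throttled version that pauses 0.2 seconds between successive calls
fetch_slowly <- slowly(fake_fetch, rate = rate_delay(0.2))

urls <- c("page-1", "page-2", "page-3")  # hypothetical page identifiers

# Time three calls in a row
start <- Sys.time()
results <- map_chr(urls, fetch_slowly)
elapsed <- as.numeric(difftime(Sys.time(), start, units = "secs"))

results
elapsed
```

The elapsed time grows with the number of calls, which is the price paid for being gentle with the server.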

8.4.3.1 Apply throttling to a multi-page crawler

The goal of this exercise is to get the coordinates of Earth’s three highest mountain peaks, together with their names.

You’ll get this information from their corresponding Wikipedia pages, in real-time. In order not to stress Wikipedia too much, you’ll apply throttling using the slowly() function. After each call to a Wikipedia page, your program should wait a small amount of time. Three pages of Wikipedia might not be that much, but the principle holds for any amount of scraping: be gentle and add wait time between requests.

You’ll find the name of the peak within an element with the ID “firstHeading”, while the coordinates are inside an element with class “geo-dms”, which is a descendant of an element with ID “coordinates”.

mountain_wiki_pages <- c(
  "https://en.wikipedia.org/w/index.php?title=Mount_Everest&oldid=958643874",
  "https://en.wikipedia.org/w/index.php?title=K2&oldid=956671989",
  "https://en.wikipedia.org/w/index.php?title=Kangchenjunga&oldid=957008408"
)

Construct a throttled version of read_html() that waits half a second between calls when executed in a loop.

# Define a throttled read_html() function with a delay of 0.5s
read_html_delayed <- slowly(read_html, 
                            rate = rate_delay(0.5))

Now write a for loop that goes over every page URL in the prepared variable mountain_wiki_pages and stores the HTML available at the corresponding Wikipedia URL in the html variable.

# Define a throttled read_html() function with a delay of 0.5s
read_html_delayed <- slowly(read_html, 
                            rate = rate_delay(0.5))
# Construct a loop that goes over all page urls
for(page_url in mountain_wiki_pages){
  # Read in the html of each URL with the function defined above
  html <- read_html_delayed(page_url)
}

Finally, extract the name of the peak (available in #firstHeading).

Also extract the coordinates (available in .geo-dms, which is a descendant of #coordinates). Be patient, this may take a few seconds.

# Define a throttled read_html() function with a delay of 0.5s
read_html_delayed <- slowly(read_html, 
                            rate = rate_delay(0.5))
# Construct a loop that goes over all page urls
for(page_url in mountain_wiki_pages){
  # Read in the html of each URL with a delay of 0.5s
  html <- read_html_delayed(page_url)
  # Extract the name of the peak and its coordinates
  peak <- html %>% 
    html_element("#firstHeading") %>% html_text()
  coords <- html %>% 
    html_element("#coordinates .geo-dms") %>% html_text()
  print(paste(peak, coords, sep = ": "))
}
## [1] "Mount Everest: 27°59′17″N 86°55′31″E"
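Printing inside the loop is fine for a quick check, but for further analysis it is usually handier to collect the results in a data frame. The sketch below shows one way to do that with purrr; to stay runnable offline it uses made-up HTML strings in place of downloaded pages (peak_pages, extract_peak(), and the peak names and coordinates are all placeholders). In a real crawl, the throttled reader would take the place of read_html() inside the helper.

```r
library(rvest)
library(purrr)
library(tibble)

# Hypothetical stand-ins for downloaded pages (names and coordinates are made up)
peak_pages <- c(
  '<h1 id="firstHeading">Peak A</h1><span id="coordinates"><span class="geo-dms">10 N 20 E</span></span>',
  '<h1 id="firstHeading">Peak B</h1><span id="coordinates"><span class="geo-dms">30 N 40 E</span></span>'
)

# Parse one page and return its peak name and coordinates as a one-row tibble
extract_peak <- function(page) {
  html <- read_html(page)
  tibble(
    peak = html %>% html_element("#firstHeading") %>% html_text(),
    coords = html %>% html_element("#coordinates .geo-dms") %>% html_text()
  )
}

# Apply the helper to every page and stack the rows
peaks <- map(peak_pages, extract_peak) %>% list_rbind()
peaks
```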