Scraping and Cleaning Data From the Web

Goals

Use a reproducible workflow to extract data tables from HTML, clean it, and make some visualizations. This will involve the following steps:

  • Pulling the HTML from the site
  • Extracting the tables
  • Converting a table to a data frame
  • Cleaning the data in the table (for example, quantitative data is numeric, dates are parsed, undesireable symbols are removed, etc)
  • Other Wrangling and Visualization

Moreover we want to be able to do all of these steps with R code so that our visualization is fully reproducible. That means no manual editing, copy-pasting, or Excel fixes.

The Data

We’ll look at the data on the all time biggest grossing opening weekends for movies Box Office Mojo

Load the tidyverse:

Pulling th HTML from the Web

The rvest package has a read_html() function that reads HTML directly. It turns out we need to load this package separately.

Code:

## {html_document}
## <html class="a-no-js" data-19ax5a9jf="dingo">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset= ...
## [2] <body id="body" class="mojo-page-id-csw a-m-us a-aui_72554-c a-aui_a ...

Well, we’ve got the HTML into R. That’s step 1 down. But this isn’t in a format that we can use yet.

Extracting the tables

The HTML is structured as a collection of “nodes” (think page elements: text blocks, tables, etc.), which are arranged hierarchically in a tree (nodes are composed of other nodes). We don’t need to know the fine details, but the main point is that somewhere within the mess of HTML that we just read in are some tables that can be converted into data frames in R.

Fortunately, the good people behind rvest have figured out how to do this for us, and so all we need to do is call their nice function html_nodes() to get just the tables.

Code:

## {xml_nodeset (1)}
## [1] <table class="a-bordered a-horizontal-stripes a-size-base a-span12 m ...

It’s pretty hard to make sense of what is in this object, but if you inspect the tables object in your environment, you will see that it is a “list of length 1” with the xml_nodeset object class, and that its sole element is itself a “list of length 2” which is an xml_node. Looking at the first part of the markup for that first node, it looks like it has the table tag, which is promising.

Probably we want the actual content of that first entry in the tables list. This isn’t a tibble or data frame object, so we can’t extract elements by name using functions like pull(), but we can use pluck() as follows to extract the first entry.

Code:

## {html_node}
## <table class="a-bordered a-horizontal-stripes a-size-base a-span12 mojo-body-table mojo-table-annotated">
##  [1] <tr>\n<th class="a-text-right mojo-field-type-rank a-nowrap">\n<spa ...
##  [2] <tr>\n<td class="a-text-right mojo-header-column mojo-truncate mojo ...
##  [3] <tr>\n<td class="a-text-right mojo-header-column mojo-truncate mojo ...
##  [4] <tr>\n<td class="a-text-right mojo-header-column mojo-truncate mojo ...
##  [5] <tr>\n<td class="a-text-right mojo-header-column mojo-truncate mojo ...
##  [6] <tr>\n<td class="a-text-right mojo-header-column mojo-truncate mojo ...
##  [7] <tr>\n<td class="a-text-right mojo-header-column mojo-truncate mojo ...
##  [8] <tr>\n<td class="a-text-right mojo-header-column mojo-truncate mojo ...
##  [9] <tr>\n<td class="a-text-right mojo-header-column mojo-truncate mojo ...
## [10] <tr>\n<td class="a-text-right mojo-header-column mojo-truncate mojo ...
## [11] <tr>\n<td class="a-text-right mojo-header-column mojo-truncate mojo ...
## [12] <tr>\n<td class="a-text-right mojo-header-column mojo-truncate mojo ...
## [13] <tr>\n<td class="a-text-right mojo-header-column mojo-truncate mojo ...
## [14] <tr>\n<td class="a-text-right mojo-header-column mojo-truncate mojo ...
## [15] <tr>\n<td class="a-text-right mojo-header-column mojo-truncate mojo ...
## [16] <tr>\n<td class="a-text-right mojo-header-column mojo-truncate mojo ...
## [17] <tr>\n<td class="a-text-right mojo-header-column mojo-truncate mojo ...
## [18] <tr>\n<td class="a-text-right mojo-header-column mojo-truncate mojo ...
## [19] <tr>\n<td class="a-text-right mojo-header-column mojo-truncate mojo ...
## [20] <tr>\n<td class="a-text-right mojo-header-column mojo-truncate mojo ...
## ...

This still has a lot of HTML markup, but it’s at least designated as a table in the HTML.

OK, we have the table we want. Step 2, extracting the table done.

Converting the HTML table to an R data frame object

Obviously this is still not a data frame. But getting one is easy (for us; as always there’s a lot of complexity wrapped in these simple functions). Just pass the HTML table to the html_table() function, which will return a tibble if the input has the right format.

Using the header = TRUE argument will cause html_table() to treat the first row of the table as listing variable names. One clue that this is what we want is that the HTML table starts with a node whose class seems to be called mojo-field-type-rank followed by the text Rank. Subsequent rows contain numerical ranks.

Code:

## # A tibble: 5 x 9
##    Rank Release   Opening `Total Gross` `% of Total` Theaters Average Date 
##   <int> <chr>     <chr>   <chr>         <chr>        <chr>    <chr>   <chr>
## 1     1 Avengers… $357,1… $858,373,000  41.6%        4,662    $76,601 Apr …
## 2     2 Avengers… $257,6… $678,815,482  38%          4,474    $57,599 Apr …
## 3     3 Star War… $247,9… $936,662,225  26.5%        4,134    $59,982 Dec …
## 4     4 Star War… $220,0… $620,181,382  35.5%        4,232    $51,987 Dec …
## 5     5 Jurassic… $208,8… $652,270,625  32%          4,274    $48,855 Jun …
## # … with 1 more variable: Distributor <chr>

Now we’re getting somewhere!

Aside: Finding the right table

In some cases you will get more than one table in your list of tables. How do we know which one is the one we’re trying to get?

We could just converting all of the tables in the list and see which one is what we’re looking for. We can do this efficiently by calling lapply().

Below is some code to extract the tables from the Wikipedia page listing all songs recorded by the Beatles.

Code:

  1. Use lapply() to take the beatles_tables list, which should contain 11 tables, and produce a list of R datasets (tibbles). Then use lapply() again to get the dimensions of each table using the dim() function. Based on the dimensions, identify which one is the big table under the “Main Songs” section on the page.

SOLUTION
## [[1]]
## [1] 1 1
## 
## [[2]]
## [1] 1 2
## 
## [[3]]
## [1] 213   6
## 
## [[4]]
## [1] 1 2
## 
## [[5]]
## [1] 82  7
## 
## [[6]]
## [1] 10  2
## 
## [[7]]
## [1] 36 26
## 
## [[8]]
## [1] 11  2
## 
## [[9]]
## [1] 3 2
## 
## [[10]]
## [1] 5 2
## 
## [[11]]
## [1] 8 2

Looking at the Wikipedia page in the browser, the table under “Main songs” has 6 columns and dozens orf rows. So it must be the third table in the list.


Cleaning the data

Fixing variable names

Let’s look at the variable names in the box office data set.

Code:

## [1] "Rank"        "Release"     "Opening"     "Total Gross" "% of Total" 
## [6] "Theaters"    "Average"     "Date"        "Distributor"

Most of these are fine single-word variable names. However, a couple of them have spaces in the names, which is a pain to work with, and one has a special character (the % sign). Let’s “fix” these column names with a rename() operation.

Note the backquotes around the existing variable names. This is needed when the name is not a valid R variable name (due to spaces, punctuation, etc.)

Code:

## # A tibble: 5 x 9
##    Rank Release   Opening `Total Gross` `% of Total` Theaters Average Date 
##   <int> <chr>     <chr>   <chr>         <chr>        <chr>    <chr>   <chr>
## 1     1 Avengers… $357,1… $858,373,000  41.6%        4,662    $76,601 Apr …
## 2     2 Avengers… $257,6… $678,815,482  38%          4,474    $57,599 Apr …
## 3     3 Star War… $247,9… $936,662,225  26.5%        4,134    $59,982 Dec …
## 4     4 Star War… $220,0… $620,181,382  35.5%        4,232    $51,987 Dec …
## 5     5 Jurassic… $208,8… $652,270,625  32%          4,274    $48,855 Jun …
## # … with 1 more variable: Distributor <chr>

Much better.

Plotting: Attempt 1

Ok, let’s try a plot. Let’s see whether the biggest movies have gotten more dominant in the market over time by plotting the PctOfTotal variable, which tells us what share of movie receipts a particular film commanded, against the date the movie opened (Date), using a geom_point() to look at the trend over time. We’ll map the movie’s opening weekend haul to the size and color of the dot.

Code:

## Warning: Using size for a discrete variable is not advised.