Extracting Data from a Website

Goal

Use a reproducible workflow to extract data tables from HTML, clean it, and make some visualizations. This will involve the following steps:

  • Pulling the HTML from the site
  • Extracting the tables
  • Converting a table to a data frame
  • Fixing variable formats (quantitative data is numeric, dates are parsed)
  • Visualization

Moreover we want to be able to do all of these steps with R code so that our visualization is fully reproducible. That means no manual editing, copy-pasting, or Excel fixes.

The Data

We’ll look at the data on the all time biggest grossing opening weekends for movies Box Office Mojo

Load the tidyverse:

library(tidyverse)

Reading the Data from the Web

The rvest package has a read_html() function that reads HTML directly.

Code:

library(rvest)
## Note: the trailing slash is required
url <- "http://www.boxofficemojo.com/alltime/weekends/"
raw_html <- read_html(url)
raw_html
## {xml_document}
## <html lang="en">
## [1] <head>\n<meta http-equiv="Content-type" content="text/html; charset= ...
## [2] <body>\n<iframe id="sis_pixel_sitewide" width="1" height="1" framebo ...

Well, we’ve got the HTML into R. That’s step 1 down. But this isn’t in a format that we can use yet.

Extracting the tables

The HTML is structured as a collection of “nodes” (think page elements: text blocks, tables, etc.), which are arranged hierarchically in a tree (nodes are composed of other nodes). We don’t need to know the fine details, but the main point is that somewhere within the mess of HTML that we just read in are some tables that can be converted into data frames in R. Fortunately, the good people behind rvest have figured out how to do this for us, and so all we need to do is call their nice function html_nodes() to get the tables.

Code:

tables <- raw_html %>%
  html_nodes("table")
tables
## {xml_nodeset (6)}
## [1] <table border="0" cellpadding="0" cellspacing="0" width="100%"><tr>\ ...
## [2] <table border="0" cellpadding="0" cellspacing="0" width="100%">\n<tr ...
## [3] <table border="0" cellspacing="0" cellpadding="0"><tr><form method=" ...
## [4] <table border="0" cellspacing="0" cellpadding="0" width="100%"><tr>\ ...
## [5] <table border="0" cellspacing="1" cellpadding="5" width="98%">\n<tr  ...
## [6] <table border="0" cellspacing="0" cellpadding="0" width="100%"><tr>\ ...

Looks like there are six tables at that URL. By trial and error, we can discover that the table we care about, with the actual box office data, is fifth in this list.

A piece of R arcana (Rcana?): the usual square bracket notation for extracting a subset that we use to get an entry in a particular position in a vector (think myvariable[3] to get the third entry in myvariable) doesn’t quite work as expected when the data structure we start with is a list instead of a vector.

When we start with a list:

  • Single square brackets (mylist[3] or mylist[1:3]) return a sub-list
  • Double square brackets (mylist[[3]]) return a single element
  • This is a syntax error: mylist[[1:3]]

Since tables is a list of tables, to get a table out (instead of a list of one table), we need double brackets.

Code:

box_office_table <- tables[[5]]
box_office_table
## {xml_node}
## <table border="0" cellspacing="1" cellpadding="5" width="98%">
##  [1] <tr bgcolor="#dcdcdc">\n<td align="center"><font size="2"><a href=" ...
##  [2] <tr bgcolor="#ffffff">\n<td align="right"><font size="2">1</font></ ...
##  [3] <tr bgcolor="#ffff99">\n<td align="right"><font size="2">2</font></ ...
##  [4] <tr bgcolor="#ffffff">\n<td align="right"><font size="2">3</font></ ...
##  [5] <tr bgcolor="#f4f4ff">\n<td align="right"><font size="2">4</font></ ...
##  [6] <tr bgcolor="#ffff99">\n<td align="right"><font size="2">5</font></ ...
##  [7] <tr bgcolor="#f4f4ff">\n<td align="right"><font size="2">6</font></ ...
##  [8] <tr bgcolor="#ffffff">\n<td align="right"><font size="2">7</font></ ...
##  [9] <tr bgcolor="#f4f4ff">\n<td align="right"><font size="2">8</font></ ...
## [10] <tr bgcolor="#ffffff">\n<td align="right"><font size="2">9</font></ ...
## [11] <tr bgcolor="#f4f4ff">\n<td align="right"><font size="2">10</font>< ...
## [12] <tr bgcolor="#ffffff">\n<td align="right"><font size="2">11</font>< ...
## [13] <tr bgcolor="#f4f4ff">\n<td align="right"><font size="2">12</font>< ...
## [14] <tr bgcolor="#ffffff">\n<td align="right"><font size="2">13</font>< ...
## [15] <tr bgcolor="#f4f4ff">\n<td align="right"><font size="2">14</font>< ...
## [16] <tr bgcolor="#ffffff">\n<td align="right"><font size="2">15</font>< ...
## [17] <tr bgcolor="#f4f4ff">\n<td align="right"><font size="2">16</font>< ...
## [18] <tr bgcolor="#ffffff">\n<td align="right"><font size="2">17</font>< ...
## [19] <tr bgcolor="#f4f4ff">\n<td align="right"><font size="2">18</font>< ...
## [20] <tr bgcolor="#ffffff">\n<td align="right"><font size="2">19</font>< ...
## ...

OK, we have the table we want. Step 2 done.

Converting the table to a data frame

Obviously this is still not a data frame. But getting one is easy (for us; as always there’s a lot of complexity wrapped in these simple functions). Just call html_table().

## header = TRUE preserves the column headings
box_office_data <- box_office_table %>% html_table(header = TRUE)
head(box_office_data)
##   Rank        Title (click to view) Studio     Opening* % of Total
## 1    1 Star Wars: The Force Awakens     BV $247,966,675      26.5%
## 2    2     Star Wars: The Last Jedi     BV $220,009,584      35.5%
## 3    3               Jurassic World   Uni. $208,806,270      32.0%
## 4    4        Marvel's The Avengers     BV $207,438,708      33.3%
## 5    5                Black Panther     BV $202,003,951      31.9%
## 6    6      Avengers: Age of Ultron     BV $191,271,109      41.7%
##   Theaters    Avg. Total Gross^     Date**
## 1    4,134 $59,982 $936,662,225 12/18/2015
## 2    4,232 $51,987 $619,982,340 12/15/2017
## 3    4,274 $48,855 $652,270,625  6/12/2015
## 4    4,349 $47,698 $623,357,910   5/4/2012
## 5    4,020 $50,250 $633,177,452  2/16/2018
## 6    4,276 $44,731 $459,005,868   5/1/2015

Now we’re getting somewhere!

Now I cheated a little bit by telling you that the right table was the fifth one. How could we have found that out?

We just need to try converting all of the tables in the list and see which one is what we’re looking for. We can do this efficiently by calling lapply().

  1. Use lapply() to take our original list of tables (tables), and produce a list of data frames. Then use lapply() again to get the dimensions of each table. It will be clear that there’s only one that could possibly be the one we want.

Sample solution:

## If we try this without fill = TRUE we will get an error
## that the number of columns changes across rows, along
## with a suggestion to use fill = TRUE
list_of_dfs <- lapply(tables, html_table, header = TRUE, fill = TRUE)
lapply(list_of_dfs, dim)
## [[1]]
## [1]  225 1989
## 
## [[2]]
## [1] 2 3
## 
## [[3]]
## [1] 0 1
## 
## [[4]]
## [1]  220 1982
## 
## [[5]]
## [1] 219   9
## 
## [[6]]
## [1] 0 1

Glancing at the web page with the data, we see that there should be nine variables, not 1, 3, or 1900-some. So table 5 must be what we want, by process of elimination.

Cleaning the data

Fixing variable names

Let’s look at the variable names in this data set.

Code:

names(box_office_data)
## [1] "Rank"                  "Title (click to view)" "Studio"               
## [4] "Opening*"              "% of Total"            "Theaters"             
## [7] "Avg."                  "Total Gross^"          "Date**"

Most of these are … not ideal: They’re full of spaces and special characters. Let’s fix them with a rename() operation.

Code:

## Note that since we're overwriting box_office_data
## if you run this chunk twice without rerunning the ones
## above you'll get an error the second time
box_office_data <- box_office_data %>%
  rename(
    rank           = `Rank`,
    title          = `Title (click to view)`,
    studio         = `Studio`,
    opening        = `Opening*`,
    pct_of_total   = `% of Total`,
    theaters       = `Theaters`,
    average_gross  = `Avg.`,
    total_gross    = `Total Gross^`,
    date           = `Date**`
  )
head(box_office_data)
##   rank                        title studio      opening pct_of_total
## 1    1 Star Wars: The Force Awakens     BV $247,966,675        26.5%
## 2    2     Star Wars: The Last Jedi     BV $220,009,584        35.5%
## 3    3               Jurassic World   Uni. $208,806,270        32.0%
## 4    4        Marvel's The Avengers     BV $207,438,708        33.3%
## 5    5                Black Panther     BV $202,003,951        31.9%
## 6    6      Avengers: Age of Ultron     BV $191,271,109        41.7%
##   theaters average_gross  total_gross       date
## 1    4,134       $59,982 $936,662,225 12/18/2015
## 2    4,232       $51,987 $619,982,340 12/15/2017
## 3    4,274       $48,855 $652,270,625  6/12/2015
## 4    4,349       $47,698 $623,357,910   5/4/2012
## 5    4,020       $50,250 $633,177,452  2/16/2018
## 6    4,276       $44,731 $459,005,868   5/1/2015

Much better. I can breathe now.

Plotting: Attempt 1

Ok, let’s try a plot. We’ll plot the opening box office total for each movie against the date when that movie opened using a geom_point(), to look at the trend over time, and we’ll map the movie’s share of the total box office returns for all movies to the size and color of the dot, to get a sense of whether the big blockbusters have become more financially dominant relative to other movies over time.

Code:

box_office_data %>%
  ggplot(aes(x = date, y = opening)) +
  geom_point(aes(size = pct_of_total, color = pct_of_total)) +
  xlab("Opening Date") +
  ylab("Opening Day Gross")
## Warning: Using size for a discrete variable is not advised.