Scraping and Cleaning Data From the Web

Goals

Use a reproducible workflow to extract data tables from HTML, clean it, and make some visualizations. This will involve the following steps:

  • Pulling the HTML from the site
  • Extracting the tables
  • Converting a table to a data frame
  • Cleaning the data in the table (for example, quantitative data is numeric, dates are parsed, undesireable symbols are removed, etc)
  • Other Wrangling and Visualization

Moreover we want to be able to do all of these steps with R code so that our visualization is fully reproducible. That means no manual editing, copy-pasting, or Excel fixes.

The Data

We’ll look at the data on the all time biggest grossing opening weekends for movies Box Office Mojo

Load the tidyverse:

Pulling th HTML from the Web

The rvest package has a read_html() function that reads HTML directly. It turns out we need to load this package separately.

Code:

## {html_document}
## <html class="a-no-js" data-19ax5a9jf="dingo">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset= ...
## [2] <body id="body" class="mojo-page-id-csw a-m-us a-aui_72554-c a-aui_a ...

Well, we’ve got the HTML into R. That’s step 1 down. But this isn’t in a format that we can use yet.

Extracting the tables

The HTML is structured as a collection of “nodes” (think page elements: text blocks, tables, etc.), which are arranged hierarchically in a tree (nodes are composed of other nodes). We don’t need to know the fine details, but the main point is that somewhere within the mess of HTML that we just read in are some tables that can be converted into data frames in R.

Fortunately, the good people behind rvest have figured out how to do this for us, and so all we need to do is call their nice function html_nodes() to get just the tables.

Code:

## {xml_nodeset (1)}
## [1] <table class="a-bordered a-horizontal-stripes a-size-base a-span12 m ...

It’s pretty hard to make sense of what is in this object, but if you inspect the tables object in your environment, you will see that it is a “list of length 1” with the xml_nodeset object class, and that its sole element is itself a “list of length 2” which is an xml_node. Looking at the first part of the markup for that first node, it looks like it has the table tag, which is promising.

Probably we want the actual content of that first entry in the tables list. This isn’t a tibble or data frame object, so we can’t extract elements by name using functions like pull(), but we can use pluck() as follows to extract the first entry.

Code:

## {html_node}
## <table class="a-bordered a-horizontal-stripes a-size-base a-span12 mojo-body-table mojo-table-annotated">
##  [1] <tr>\n<th class="a-text-right mojo-field-type-rank a-nowrap">\n<spa ...
##  [2] <tr>\n<td class="a-text-right mojo-header-column mojo-truncate mojo ...
##  [3] <tr>\n<td class="a-text-right mojo-header-column mojo-truncate mojo ...
##  [4] <tr>\n<td class="a-text-right mojo-header-column mojo-truncate mojo ...
##  [5] <tr>\n<td class="a-text-right mojo-header-column mojo-truncate mojo ...
##  [6] <tr>\n<td class="a-text-right mojo-header-column mojo-truncate mojo ...
##  [7] <tr>\n<td class="a-text-right mojo-header-column mojo-truncate mojo ...
##  [8] <tr>\n<td class="a-text-right mojo-header-column mojo-truncate mojo ...
##  [9] <tr>\n<td class="a-text-right mojo-header-column mojo-truncate mojo ...
## [10] <tr>\n<td class="a-text-right mojo-header-column mojo-truncate mojo ...
## [11] <tr>\n<td class="a-text-right mojo-header-column mojo-truncate mojo ...
## [12] <tr>\n<td class="a-text-right mojo-header-column mojo-truncate mojo ...
## [13] <tr>\n<td class="a-text-right mojo-header-column mojo-truncate mojo ...
## [14] <tr>\n<td class="a-text-right mojo-header-column mojo-truncate mojo ...
## [15] <tr>\n<td class="a-text-right mojo-header-column mojo-truncate mojo ...
## [16] <tr>\n<td class="a-text-right mojo-header-column mojo-truncate mojo ...
## [17] <tr>\n<td class="a-text-right mojo-header-column mojo-truncate mojo ...
## [18] <tr>\n<td class="a-text-right mojo-header-column mojo-truncate mojo ...
## [19] <tr>\n<td class="a-text-right mojo-header-column mojo-truncate mojo ...
## [20] <tr>\n<td class="a-text-right mojo-header-column mojo-truncate mojo ...
## ...

This still has a lot of HTML markup, but it’s at least designated as a table in the HTML.

OK, we have the table we want. Step 2, extracting the table done.

Converting the HTML table to an R data frame object

Obviously this is still not a data frame. But getting one is easy (for us; as always there’s a lot of complexity wrapped in these simple functions). Just pass the HTML table to the html_table() function, which will return a tibble if the input has the right format.

Using the header = TRUE argument will cause html_table() to treat the first row of the table as listing variable names. One clue that this is what we want is that the HTML table starts with a node whose class seems to be called mojo-field-type-rank followed by the text Rank. Subsequent rows contain numerical ranks.

Code:

## # A tibble: 5 x 9
##    Rank Release   Opening `Total Gross` `% of Total` Theaters Average Date 
##   <int> <chr>     <chr>   <chr>         <chr>        <chr>    <chr>   <chr>
## 1     1 Avengers… $357,1… $858,373,000  41.6%        4,662    $76,601 Apr …
## 2     2 Avengers… $257,6… $678,815,482  38%          4,474    $57,599 Apr …
## 3     3 Star War… $247,9… $936,662,225  26.5%        4,134    $59,982 Dec …
## 4     4 Star War… $220,0… $620,181,382  35.5%        4,232    $51,987 Dec …
## 5     5 Jurassic… $208,8… $652,270,625  32%          4,274    $48,855 Jun …
## # … with 1 more variable: Distributor <chr>

Now we’re getting somewhere!

Aside: Finding the right table

In some cases you will get more than one table in your list of tables. How do we know which one is the one we’re trying to get?

We could just converting all of the tables in the list and see which one is what we’re looking for. We can do this efficiently by calling lapply().

Below is some code to extract the tables from the Wikipedia page listing all songs recorded by the Beatles.

Code:

  1. Use lapply() to take the beatles_tables list, which should contain 11 tables, and produce a list of R datasets (tibbles). Then use lapply() again to get the dimensions of each table using the dim() function. Based on the dimensions, identify which one is the big table under the “Main Songs” section on the page.

SOLUTION
## [[1]]
## [1] 1 1
## 
## [[2]]
## [1] 1 2
## 
## [[3]]
## [1] 213   6
## 
## [[4]]
## [1] 1 2
## 
## [[5]]
## [1] 82  7
## 
## [[6]]
## [1] 10  2
## 
## [[7]]
## [1] 36 26
## 
## [[8]]
## [1] 11  2
## 
## [[9]]
## [1] 3 2
## 
## [[10]]
## [1] 5 2
## 
## [[11]]
## [1] 8 2

Looking at the Wikipedia page in the browser, the table under “Main songs” has 6 columns and dozens orf rows. So it must be the third table in the list.


Cleaning the data

Fixing variable names

Let’s look at the variable names in the box office data set.

Code:

## [1] "Rank"        "Release"     "Opening"     "Total Gross" "% of Total" 
## [6] "Theaters"    "Average"     "Date"        "Distributor"

Most of these are fine single-word variable names. However, a couple of them have spaces in the names, which is a pain to work with, and one has a special character (the % sign). Let’s “fix” these column names with a rename() operation.

Note the backquotes around the existing variable names. This is needed when the name is not a valid R variable name (due to spaces, punctuation, etc.)

Code:

## # A tibble: 5 x 9
##    Rank Release   Opening `Total Gross` `% of Total` Theaters Average Date 
##   <int> <chr>     <chr>   <chr>         <chr>        <chr>    <chr>   <chr>
## 1     1 Avengers… $357,1… $858,373,000  41.6%        4,662    $76,601 Apr …
## 2     2 Avengers… $257,6… $678,815,482  38%          4,474    $57,599 Apr …
## 3     3 Star War… $247,9… $936,662,225  26.5%        4,134    $59,982 Dec …
## 4     4 Star War… $220,0… $620,181,382  35.5%        4,232    $51,987 Dec …
## 5     5 Jurassic… $208,8… $652,270,625  32%          4,274    $48,855 Jun …
## # … with 1 more variable: Distributor <chr>

Much better.

Plotting: Attempt 1

Ok, let’s try a plot. Let’s see whether the biggest movies have gotten more dominant in the market over time by plotting the PctOfTotal variable, which tells us what share of movie receipts a particular film commanded, against the date the movie opened (Date), using a geom_point() to look at the trend over time. We’ll map the movie’s opening weekend haul to the size and color of the dot.

Code:

## Warning: Using size for a discrete variable is not advised.

Ohhhh nooooooo…. this plot could be its own disaster movie.

(Note: I had to set fig.width = 20 and fig.height = 7 in my chunk options to even get the plot to display)

Fixing variable formats

It looks like many of the variables which we intended to be quantitative, are actually stored as categorical. Let’s take a glimpse() at the data again to quickly see the data types of each column.

Code:

## Rows: 200
## Columns: 9
## $ Rank        <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16…
## $ Release     <chr> "Avengers: Endgame", "Avengers: Infinity War", "Star …
## $ Opening     <chr> "$357,115,007", "$257,698,183", "$247,966,675", "$220…
## $ TotalGross  <chr> "$858,373,000", "$678,815,482", "$936,662,225", "$620…
## $ PctOfTotal  <chr> "41.6%", "38%", "26.5%", "35.5%", "32%", "33.3%", "28…
## $ Theaters    <chr> "4,662", "4,474", "4,134", "4,232", "4,274", "4,349",…
## $ Average     <chr> "$76,601", "$57,599", "$59,982", "$51,987", "$48,855"…
## $ Date        <chr> "Apr 26, 2019", "Apr 27, 2018", "Dec 18, 2015", "Dec …
## $ Distributor <chr> "Walt Disney Studios Motion Pictures", "Walt Disney S…

Yup: Opening, PctOfTotal, and Date all have the <chr> (character) type, which means they’re stored as text strings.

No problem. We’ve seen this issue before. We can use parse_number() to fix it.

  1. Use parse_number() together with the appropriate wrangling verb to convert all variables that should be quantitative into numbers. (Skip Date for now; we’ll handle that separately)

SOLUTION
## Rows: 200
## Columns: 9
## $ Rank        <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16…
## $ Release     <chr> "Avengers: Endgame", "Avengers: Infinity War", "Star …
## $ Opening     <dbl> 357115007, 257698183, 247966675, 220009584, 208806270…
## $ TotalGross  <dbl> 858373000, 678815482, 936662225, 620181382, 652270625…
## $ PctOfTotal  <dbl> 41.6, 38.0, 26.5, 35.5, 32.0, 33.3, 28.9, 35.3, 41.7,…
## $ Theaters    <dbl> 4662, 4474, 4134, 4232, 4274, 4349, 4020, 4725, 4276,…
## $ Average     <dbl> 76601, 57599, 59982, 51987, 48855, 47698, 50249, 4058…
## $ Date        <chr> "Apr 26, 2019", "Apr 27, 2018", "Dec 18, 2015", "Dec …
## $ Distributor <chr> "Walt Disney Studios Motion Pictures", "Walt Disney S…

The Opening, TotalGross, PctOfTotal, Theaters and Average variables should now have the <dbl> (double) type, which indicates that they’re “double precision floating point” values. This is a computer jargon way of saying that they represent potentially non-integer numbers. That’s better.

Handling the dates

The dates in the Date column are written as Month Day, Year format, where the month is written out as a text abbreviation (rather than a number). Perfectly understandable by a person, but not by ggplot by default. The parse_date() function can read it and convert it, but we need to tell it what format the existing data is in.

It turns out the shorthand for a month abbreviation is %b% (%B would be the whole month name spelled out; %m is the number of the month; %d is the number of the day within the month, %y is a two-digit year, and %Y is a four digit year).

Code

## Rows: 200
## Columns: 9
## $ Rank        <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16…
## $ Release     <chr> "Avengers: Endgame", "Avengers: Infinity War", "Star …
## $ Opening     <dbl> 357115007, 257698183, 247966675, 220009584, 208806270…
## $ TotalGross  <dbl> 858373000, 678815482, 936662225, 620181382, 652270625…
## $ PctOfTotal  <dbl> 41.6, 38.0, 26.5, 35.5, 32.0, 33.3, 28.9, 35.3, 41.7,…
## $ Theaters    <dbl> 4662, 4474, 4134, 4232, 4274, 4349, 4020, 4725, 4276,…
## $ Average     <dbl> 76601, 57599, 59982, 51987, 48855, 47698, 50249, 4058…
## $ Date        <date> 2019-04-26, 2018-04-27, 2015-12-18, 2017-12-15, 2015…
## $ Distributor <chr> "Walt Disney Studios Motion Pictures", "Walt Disney S…

Plotting: Attempt 2

Now the same code we used before will produce a more readable plot now.

Code:

Note that since Opening is now quantitative, ggplot uses a sequential color palette, which makes a lot more sense for this variable than the categorical palette it was using before.

Refining the plot: Customizing Legends

There are still some issues: the dollar amounts are displayed in scientific notation, which is awkward, and the legend shows the variable name that we created because it was nice to work with in code; but it’s not so nice on a plot.

We can use the guides() element to customize the legend titles, the scales library to have the y axis display dollar amounts, and the scale_x_date() element to have the \(x\)-axis labels format as dates.

Code:

Setting the legend titles this way has the nice effect that the two legends get merged into one, which is good since color and size correspond to the same variable.

And we’re done!

Additional Web Scraping Practice

Another thing we could do with this movie data is look at the actual dollar amounts commanded by the top movies over time. However, the meaning of a dollar is not constant over time, and some of any increases we see are attributable to simple inflation. Fortunately, we can adjust for inflation. The website at https://inflationdata.com/Inflation/Consumer_Price_Index/HistoricalCPI.aspx contains a table showing the consumer price index by month since 1913. To find out how many of todays dollars are equivalent to $1 at a given point in the past, we can find the ratio between the CPI now and the CPI in the year when the data comes from. If we multiply a past dollar amount by this ratio, we can express it in terms of today’s dollars.

  1. Scrape the inflation data from the site and use it together with the relevant _join() and mutate() operations to convert the box office returns to today’s dollars (for simplicity you can just use the Year and the Jan columns). In order to match up the dates, you’ll need to extract the year from the Date column of the box office data. You can use the year() function from the lubridate package to do this. Then plot Opening Day receipts by release date, adjusted for inflation, adjusting things like axis labels, etc as appropriate to improve the presentation of the plot.

SOLUTION

  1. Here is some data about the same thing (movie opening weekends), but from a different website (and so some formats may be different), and restricted to U.S. ticket sales. Create two plots using this data: one that is structured the same way as the one above, and one that shows something else you find interesting. Post the second plot in Slack at #lab12.

SOLUTION

Some Simple Text Manipulation

Let’s revisit the table of Beatles songs. Recall that it was in the third spot in the list of tables that we read in from the web.

Code:

Some of the fields in this data have odd formatting. For example, the song titles have quotation marks in them. How could we go about cleaning these values to remove the quotes?

To modify each entry in a variable we can use mutate()… but what operation will take a string with quotes and return the same string with no quotes? One option is to use the gsub() function. The first argument specifies a pattern we want to find in each string. The second argument specifies what we would like to put in place of any substring that matches the first pattern. And the third argument is a vector of strings.

So, to remove quotes (and while we’re at it, remove the parentheses from some of the variable names), we can do

Code:

Notice that to refer to a quotation mark in a pattern string, we need to precede it with two backslashes. This tells the parser to treat the following character as a literal instance of that character and not use any special syntactic meaning it would normally have.

Let’s take a look at how many songs were written by each Beatle.

Code:

## # A tibble: 28 x 2
##    Songwriters                                                    num_songs
##    <chr>                                                              <int>
##  1 Arthur Alexander                                                       1
##  2 Berry GordyJanie Bradford                                              1
##  3 Bobby ScottRic Marlow                                                  1
##  4 Buddy Holly                                                            1
##  5 Burt BacharachMack DavidLuther Dixon                                   1
##  6 Carl Perkins                                                           3
##  7 Chuck Berry                                                            2
##  8 George Harrison                                                        1
##  9 Georgia DobbinsWilliam GarrettBrian HollandRobert BatemanFred…         1
## 10 Gerry GoffinCarole King                                                1
## # … with 18 more rows

Early in the bands career they recorded a lot of covers, so we see one song credited to a lot of different people. Let’s try to filter() the data to include only Beatles originals.

One option would be to simply list all of the different ways that songs are credited that include members of the band, and use an %in% operator to filter. But since the members cowrote songs in a number of different combinations, this is a bit awkward. A more elegant solution would be to find the entries where the string includes one of the members’ names (Lennon, McCartney, Harrison, or Starkey – Ringo Starr’s real name is Richard Starkey).

We can create a filter that checks whether a string contains a particular pattern using the grepl() function. For example, to find all songs that Paul McCartney wrote or co-wrote:

## # A tibble: 6 x 2
##   Songwriters                                     num_songs
##   <chr>                                               <int>
## 1 LennonMcCartney                                       159
## 2 LennonMcCartneyHarrisonStarkey                          2
## 3 LennonMcCartneyJean NicolasHeinz Hellmer                1
## 4 LennonMcCartneyJean NicolasLee Montogue                 1
## 5 LennonMcCartneyStarkey                                  1
## 6 Traditional, arr.LennonMcCartneyHarrisonStarkey         1

To find instances where any of a set of substrings is contained in a string, we can use the following syntax:

## # A tibble: 9 x 2
##   Songwriters                                     num_songs
##   <chr>                                               <int>
## 1 George Harrison                                         1
## 2 Harrison                                               21
## 3 LennonMcCartney                                       159
## 4 LennonMcCartneyHarrisonStarkey                          2
## 5 LennonMcCartneyJean NicolasHeinz Hellmer                1
## 6 LennonMcCartneyJean NicolasLee Montogue                 1
## 7 LennonMcCartneyStarkey                                  1
## 8 Starkey                                                 2
## 9 Traditional, arr.LennonMcCartneyHarrisonStarkey         1

Looks like there’s one entry credited to “George Harrison”, whereas the rest of the credits use only the songwriter’s last name. Let’s clean this up with a gsub()

Code:

We’ll come back to text manipulation later on and see how to do more sophisticated things, but if you want to learn more on your own about the way patterns are specified for functions like gsub() and grepl(), you can read the documentation for those two functions, as well as the regex help page.

Text Manipulation Practice

  1. Find out how many of the Beatles’ songs contain personal pronouns such as “I”, “you”, “me”, “my”, “your”, “she”, “he”. How many also contain the word “love”? Is the proportion of songs that contain “love” that also have personal pronouns higher than the proportion out of those that don’t contain the word “love” that have personal pronouns? (Hint: Use grepl() in a mutate() instead of a filter() to create a binary variable that is TRUE if the song title contains (one of) the strings in question, and FALSE otherwise, and use the and (&) and/or or (|) operators to create binary variables that check for conjunctions or disjunctions of other binary variables).

SOLUTION

  1. Make a bar graph comparing the proportion of song titles containing the word “love” out of those containing a personal pronoun to the proportion containing “love” out of those not containing a personal pronoun.

SOLUTION