Data-Wrangling: Putting it All Together

Goal

Refresh your memory about the main data-wrangling building blocks, and work through an extended example that combines them in various ways to achieve a goal.

The Data

This lab is not about baby names! Woohoo!

Instead, we’ll look at some baseball data from the Lahman package.

This lab is modified from section 4.4 in your textbook.

Our overarching goal is the following: for a few star major league hitters, create a plot that shows the hitting statistic OPS+ (On-Base-Plus-Slugging average for the player relative to league average in that year) for each player in each season that they played in the majors, as a function of their age. This will enable us to see at a glance at what age various star players “peaked” (at least, according to this one measure).

First, we’ll need to load the Lahman package and a couple of datasets.

Let’s see what the data in each of these two tables looks like.

##    playerID yearID stint teamID lgID  G  AB  R  H X2B X3B HR RBI SB CS BB
## 1 abercda01   1871     1    TRO   NA  1   4  0  0   0   0  0   0  0  0  0
## 2  addybo01   1871     1    RC1   NA 25 118 30 32   6   0  0  13  8  1  4
## 3 allisar01   1871     1    CL1   NA 29 137 28 40   4   5  0  19  3  1  2
## 4 allisdo01   1871     1    WS3   NA 27 133 28 44  10   2  2  27  1  1  0
## 5 ansonca01   1871     1    RC1   NA 25 120 29 39  11   3  0  16  6  2  2
##   SO IBB HBP SH SF GIDP
## 1  0  NA  NA NA NA    0
## 2  0  NA  NA NA NA    0
## 3  5  NA  NA NA NA    1
## 4  2  NA  NA NA NA    0
## 5  1  NA  NA NA NA    0
##    playerID birthYear birthMonth birthDay birthCountry birthState
## 1 aardsda01      1981         12       27          USA         CO
## 2 aaronha01      1934          2        5          USA         AL
## 3 aaronto01      1939          8        5          USA         AL
## 4  aasedo01      1954          9        8          USA         CA
## 5  abadan01      1972          8       25          USA         FL
##    birthCity deathYear deathMonth deathDay deathCountry deathState
## 1     Denver        NA         NA       NA         <NA>       <NA>
## 2     Mobile        NA         NA       NA         <NA>       <NA>
## 3     Mobile      1984          8       16          USA         GA
## 4     Orange        NA         NA       NA         <NA>       <NA>
## 5 Palm Beach        NA         NA       NA         <NA>       <NA>
##   deathCity nameFirst nameLast      nameGiven weight height bats throws
## 1      <NA>     David  Aardsma    David Allan    215     75    R      R
## 2      <NA>      Hank    Aaron    Henry Louis    180     72    R      R
## 3   Atlanta    Tommie    Aaron     Tommie Lee    190     75    R      R
## 4      <NA>       Don     Aase Donald William    190     75    R      R
## 5      <NA>      Andy     Abad  Fausto Andres    184     73    L      L
##        debut  finalGame  retroID   bbrefID  deathDate  birthDate
## 1 2004-04-06 2015-08-23 aardd001 aardsda01       <NA> 1981-12-27
## 2 1954-04-13 1976-10-03 aaroh101 aaronha01       <NA> 1934-02-05
## 3 1962-04-10 1971-09-26 aarot101 aaronto01 1984-08-16 1939-08-05
## 4 1977-07-26 1990-10-03 aased001  aasedo01       <NA> 1954-09-08
## 5 2001-09-10 2006-04-13 abada001  abadan01       <NA> 1972-08-25

Focusing on a single player: Manny Ramirez

Notice that in both of these data tables, players are indexed not by their names, but by unique player IDs. Manny Ramirez, for example, is "ramirma02".

How would we know this? We can get it from the Master data table which records players’ first and last names along with their unique IDs.

Code:

##    playerID birthYear birthMonth birthDay birthCountry        birthState
## 1 ramirma02      1972          5       30         D.R. Distrito Nacional
##       birthCity deathYear deathMonth deathDay deathCountry deathState
## 1 Santo Domingo        NA         NA       NA         <NA>       <NA>
##   deathCity nameFirst nameLast        nameGiven weight height bats throws
## 1      <NA>     Manny  Ramirez Manuel Aristides    225     72    R      R
##        debut  finalGame  retroID   bbrefID deathDate  birthDate
## 1 1993-09-02 2011-04-06 ramim002 ramirma02      <NA> 1972-05-30

(In this case, there were no other major leaguers with that name, so there’s no ambiguity; but for other names there might be more than one player with that name, hence the need for a unique ID.)

Having done that, we can filter the batting data to look at Ramirez’s season-by-season stats.

Code:

##     playerID yearID stint teamID lgID   G  AB   R   H X2B X3B HR RBI SB CS
## 1  ramirma02   1993     1    CLE   AL  22  53   5   9   1   0  2   5  0  0
## 2  ramirma02   1994     1    CLE   AL  91 290  51  78  22   0 17  60  4  2
## 3  ramirma02   1995     1    CLE   AL 137 484  85 149  26   1 31 107  6  6
## 4  ramirma02   1996     1    CLE   AL 152 550  94 170  45   3 33 112  8  5
## 5  ramirma02   1997     1    CLE   AL 150 561  99 184  40   0 26  88  2  3
## 6  ramirma02   1998     1    CLE   AL 150 571 108 168  35   2 45 145  5  3
## 7  ramirma02   1999     1    CLE   AL 147 522 131 174  34   3 44 165  2  4
## 8  ramirma02   2000     1    CLE   AL 118 439  92 154  34   2 38 122  1  1
## 9  ramirma02   2001     1    BOS   AL 142 529  93 162  33   2 41 125  0  1
## 10 ramirma02   2002     1    BOS   AL 120 436  84 152  31   0 33 107  0  0
## 11 ramirma02   2003     1    BOS   AL 154 569 117 185  36   1 37 104  3  1
## 12 ramirma02   2004     1    BOS   AL 152 568 108 175  44   0 43 130  2  4
## 13 ramirma02   2005     1    BOS   AL 152 554 112 162  30   1 45 144  1  0
## 14 ramirma02   2006     1    BOS   AL 130 449  79 144  27   1 35 102  0  1
## 15 ramirma02   2007     1    BOS   AL 133 483  84 143  33   1 20  88  0  0
## 16 ramirma02   2008     1    BOS   AL 100 365  66 109  22   1 20  68  1  0
## 17 ramirma02   2008     2    LAN   NL  53 187  36  74  14   0 17  53  2  0
## 18 ramirma02   2009     1    LAN   NL 104 352  62 102  24   2 19  63  0  1
## 19 ramirma02   2010     1    LAN   NL  66 196  32  61  15   0  8  40  1  1
## 20 ramirma02   2010     2    CHA   AL  24  69   6  18   1   0  1   2  0  0
## 21 ramirma02   2011     1    TBA   AL   5  17   0   1   0   0  0   1  0  0
##     BB  SO IBB HBP SH SF GIDP
## 1    2   8   0   0  0  0    3
## 2   42  72   4   0  0  4    6
## 3   75 112   6   5  2  5   13
## 4   85 104   8   3  0  9   18
## 5   79 115   5   7  0  4   19
## 6   76 121   6   6  0 10   18
## 7   96 131   9  13  0  9   12
## 8   86 117   9   3  0  4    9
## 9   81 147  25   8  0  2    9
## 10  73  85  14   8  0  1   13
## 11  97  94  28   8  0  5   22
## 12  82 124  15   6  0  7   17
## 13  80 119   9  10  0  6   20
## 14 100 102  16   1  0  8   13
## 15  71  92  13   7  0  8   21
## 16  52  86   8   8  0  0   12
## 17  35  38  16   3  0  4    5
## 18  71  81  21   7  0  1    7
## 19  32  38   4   1  0  3    4
## 20  14  23   0   5  0  0    2
## 21   0   4   0   0  0  0    0

Which seasons was he active? (summarize())

We can see from this data that each “case” in the Batting dataset appears to consist of a season’s worth of hitting data for a single player. Manny Ramirez has 21 entries. Was he active for 21 seasons? Let’s check.

Code

##   rookie_year final_year num_seasons num_teams
## 1        1993       2011          19         5

Note that I used a new function, n_distinct(), to return the number of distinct values of a variable in a column.

Hmm… looks like he only played in 19 seasons (for a total of 5 different teams); not 21. What’s happening here?

It turns out that the rows of the Batting table are not necessarily a full season’s worth of data. If a player was traded during a season, then they played for two different teams that year, and so there are two different entries in the data. There must be two years when that happened for Ramirez. Let’s find out which they are.

When was he traded? (group_by() and summarize())

  1. Group MannyData by season, and use summarize() to return the number of different teams recorded in each season. You may want to filter() the data to show only those years when he played for more than one team.

SOLUTION
## # A tibble: 2 x 2
##   yearID num_teams
##    <int>     <int>
## 1   2008         2
## 2   2010         2

If we want to compute Manny’s batting statistics by full season, we’re going to have to do some aggregation to combine the two rows in these years.

  1. Compute and plot the total number of home runs hit by Ramirez in each season (making sure to correctly handle seasons when he was traded, summing his home runs across teams)

SOLUTION


  1. Ramirez hit the most home runs in 1998 and 2005. Of course, new players may not play as often as players that have been around for a few years, so let’s instead plot Ramirez’s “projected” home runs extrapolating to 162 games. Use mutate() to create a variable that consists of home runs per 162 games. You may want to revise your group_by() and summarize() code from the previous exercise to include the number of games played to make this easier (you’ll want to sum actual home runs and games over records first before mutating so that the home run rate is computed correctly).

Adding Age to the Data (join())

Our original goal was to construct a plot similar to this by the player’s age, instead of by season. Can we find age in the Batting table?

##    playerID yearID stint teamID lgID  G  AB  R  H X2B X3B HR RBI SB CS BB
## 1 abercda01   1871     1    TRO   NA  1   4  0  0   0   0  0   0  0  0  0
## 2  addybo01   1871     1    RC1   NA 25 118 30 32   6   0  0  13  8  1  4
## 3 allisar01   1871     1    CL1   NA 29 137 28 40   4   5  0  19  3  1  2
## 4 allisdo01   1871     1    WS3   NA 27 133 28 44  10   2  2  27  1  1  0
## 5 ansonca01   1871     1    RC1   NA 25 120 29 39  11   3  0  16  6  2  2
##   SO IBB HBP SH SF GIDP
## 1  0  NA  NA NA NA    0
## 2  0  NA  NA NA NA    0
## 3  5  NA  NA NA NA    1
## 4  2  NA  NA NA NA    0
## 5  1  NA  NA NA NA    0

Hmm… nope. Nor is there anything in this table that we can use to calculate age. What about in the biographical table, Master?

##    playerID birthYear birthMonth birthDay birthCountry birthState
## 1 aardsda01      1981         12       27          USA         CO
## 2 aaronha01      1934          2        5          USA         AL
## 3 aaronto01      1939          8        5          USA         AL
## 4  aasedo01      1954          9        8          USA         CA
## 5  abadan01      1972          8       25          USA         FL
##    birthCity deathYear deathMonth deathDay deathCountry deathState
## 1     Denver        NA         NA       NA         <NA>       <NA>
## 2     Mobile        NA         NA       NA         <NA>       <NA>
## 3     Mobile      1984          8       16          USA         GA
## 4     Orange        NA         NA       NA         <NA>       <NA>
## 5 Palm Beach        NA         NA       NA         <NA>       <NA>
##   deathCity nameFirst nameLast      nameGiven weight height bats throws
## 1      <NA>     David  Aardsma    David Allan    215     75    R      R
## 2      <NA>      Hank    Aaron    Henry Louis    180     72    R      R
## 3   Atlanta    Tommie    Aaron     Tommie Lee    190     75    R      R
## 4      <NA>       Don     Aase Donald William    190     75    R      R
## 5      <NA>      Andy     Abad  Fausto Andres    184     73    L      L
##        debut  finalGame  retroID   bbrefID  deathDate  birthDate
## 1 2004-04-06 2015-08-23 aardd001 aardsda01       <NA> 1981-12-27
## 2 1954-04-13 1976-10-03 aaroh101 aaronha01       <NA> 1934-02-05
## 3 1962-04-10 1971-09-26 aarot101 aaronto01 1984-08-16 1939-08-05
## 4 1977-07-26 1990-10-03 aased001  aasedo01       <NA> 1954-09-08
## 5 2001-09-10 2006-04-13 abada001  abadan01       <NA> 1972-08-25

Aha, this table gives us players’ birth years.

To a first approximation (that is, ignoring birth month and day), we can represent age using the formula yearID - birthYear. Only problem is, these two variables come from different data tables.

This sounds like a job for a join!

  1. Use a suitable join operation, together with whatever other verbs are needed, to get Age added to Manny’s year-by-year data table, and plot the total number of homeruns/162 by his age instead of by season.

Putting together some pieces

  1. Produce analogous plots using On-Base Percentage (OBP) and Slugging Average (SLG), and OPS (On-Base plus Slugging) instead of HR/162. If you don’t know what these statistics are, that’s ok; you can calculate them as follows. The formula for OBP is: \(OBP = (H + BB + HBP) / (AB + BB + HBP + SF)\), where \(H\) stands for “hits” (an at-bat in which the player did not reach base safely after hitting the ball in play), \(BB\) stands for “base-on-balls” (a walk), \(HBP\) is a “hit by pitch”. Slugging average is the average number of bases per at bat, and is calculated as a weighted sum of the number of singles, doubles, triples and home runs, divided by the number of at-bats: \(SLG = (1 * X1B + 2 * X2B + 3 * X3B + 4 * HR) / AB\). Note that the data table does not include a separate column for singles, but every hit is either a single, a double, a triple or a home run, so it can be worked out from those counts and the total number of hits. Finally, \(OPS = OBP + SLG\). Before you calculate these statistics, you’ll want to combine the count statistics across teams within a season. Since we’re applying the same operation (summing) across several variables, this is a good time to use summarize_at() to specify a set of variables and an operation, rather than writing each one out individually with a regular summarize().

SOLUTION


For Reference: Baseball Hitting Statistics in a Nutshell

You don’t necessarily need to know all the details about baseball to create visualizations, but it’s never a bad idea to learn something about the measures you’re plotting. So if you aren’t familiar with baseball, here’s a quick “glossary”.

Each time a hitter appears at home plate to hit, the outcome is either

  • a hit (the player reached base safely after hitting the ball in play or out of the park as a home run)
  • a walk (the pitcher missed the strike zone four times without the batter swinging the bat)
  • a “hit by pitch” (the batter got hit by the pitch, and gets to go to first base)
  • “sacrifice fly” (the defending team catches the ball in the air so that the batter is out, but the ball is hit deep enough that a runner already on base can advance), or
  • some other kind of out.

Among hits * a single means the runner reached first base on the play * a double or a triple mean the batter reached second or third, respectively, and * a home run means the batter made it all the way around the bases (for a total of four bases on the play).

For the purposes of recording statistics, walks, being hit by a pitch, and sacrifice flies are not recorded as “at bats”, and so for the purposes of calculating the traditional “batting average”, they don’t factor in to either the numerator or the denominator: batting average (BA) is simply “hits” (H) divided by “at bats” (AB).

Unlike batting average, “on-base percentage” counts any outcome in which the player ends up on the bases (hits, walks, hit by pitch) in the numerator, and counts every plate appearance (all of these things plus outs including sacrifice flies) in the denominator.

If you’ve ever read the book or seen the movie Moneyball about the 2002 Oakland A’s, one of the key insights that Billy Beane and his analysts had was that the league as a whole had been undervaluing outcomes in which the player reached base not via a hit, and so they tried to sign players with high OBPs relative to their batting average.

Comparing a hitter’s stats to league average (wrangling together tables to be join()ed)

Around the turn of the millenium (coinciding with Manny Ramirez’s peak), many hitters were putting up off the charts hitting numbers, which can at least in part be attributed to a high rate of steroid abuse during that time. If we want to know when a player (Ramirez, for example) provided the greatest “added offensive value” to his team, it would be instructive to know how well he was hitting in each season relative to the rest of the league.

The statistic OPS+ is defined as a player’s OPS divided by the league average in that year, times 100 (so that 100 is league average, 150 means the player’s OPS was 50% higher than league average, etc.). Like age, however, OPS+ depends on information from two different data tables; or at least two views of the Batting data: the numerator comes from data for a specific player, and the denominator comes from data aggregated over players.

We can use a join to deal with this as well, but we first need to create the two tables (with the right type of aggregation) that we want to join.

We already know how to compute an individual player’s OPS. To compute the average for the whole league, we can do the same thing but without filtering first.

Code:

## # A tibble: 5 x 2
##   yearID leagueOPS
##    <int>     <dbl>
## 1   2014     0.700
## 2   2015     0.721
## 3   2016     0.739
## 4   2017     0.750
## 5   2018     0.728

You might notice that very long-ago years are missing some of the components needed to compute OPS, and so we get “not a number” for league OPS since in some cases we end up trying to divide zero by zero. We could have filtered out these early years, but they won’t cause a problem for what we want to do, since in joining, they’ll be left out anyway.

  1. Now that we have a dataset with league average OPS for each year, use an appropriate join to get Manny Ramirez’s OPS+ for each season, and plot it as a time series. Add a horizontal reference line at OPS+ = 100, since this is the league average.

SOLUTION

Beyond Manny Ramirez

  1. Produce a similar plot for another well known player of your choice (or if you’re not familiar with baseball players, take, say, Cal Ripken). Does the career arc look similar to that of Manny Ramirez? Note: Some names have more than one player in the database with that name. Cal Ripken is one (the most famous one is Cal Ripken, Jr.; his father also was an MLB player). To filter the data to include just one player, we’ll want to extract the player ID for Cal Ripken, Jr. After filtering the Master table, to a table called, say, RipkenBioData, we can use the following syntax to get just the first entry in the playerID column: RipkenBioData %>% slice_head(n = 1) %>% pull(playerID).

SOLUTION