Data-Wrangling: Putting it All Together

Goal

Refresh your memory about the main data-wrangling building blocks, and work through an extended example that combines them in various ways to achieve a goal.

The Data

This lab is not about baby names! Woohoo!

Instead, we’ll look at some baseball data from the Lahman package.

This lab is modified from section 4.4 in your textbook.

Our overarching goal is the following: for a few star major league hitters, create a plot that shows the hitting statistic OPS+ (On-Base-Plus-Slugging average for the player relative to league average in that year) for each player in each season that they played in the majors, as a function of their age. This will enable us to see at a glance at what age various star players “peaked” (at least, according to this one measure).

First, we’ll need to load the Lahman package and a couple of datasets.

Let’s see what the data in each of these two tables looks like.

##    playerID yearID stint teamID lgID  G  AB  R  H X2B X3B HR RBI SB CS BB
## 1 abercda01   1871     1    TRO   NA  1   4  0  0   0   0  0   0  0  0  0
## 2  addybo01   1871     1    RC1   NA 25 118 30 32   6   0  0  13  8  1  4
## 3 allisar01   1871     1    CL1   NA 29 137 28 40   4   5  0  19  3  1  2
## 4 allisdo01   1871     1    WS3   NA 27 133 28 44  10   2  2  27  1  1  0
## 5 ansonca01   1871     1    RC1   NA 25 120 29 39  11   3  0  16  6  2  2
##   SO IBB HBP SH SF GIDP
## 1  0  NA  NA NA NA    0
## 2  0  NA  NA NA NA    0
## 3  5  NA  NA NA NA    1
## 4  2  NA  NA NA NA    0
## 5  1  NA  NA NA NA    0
##    playerID birthYear birthMonth birthDay birthCountry birthState
## 1 aardsda01      1981         12       27          USA         CO
## 2 aaronha01      1934          2        5          USA         AL
## 3 aaronto01      1939          8        5          USA         AL
## 4  aasedo01      1954          9        8          USA         CA
## 5  abadan01      1972          8       25          USA         FL
##    birthCity deathYear deathMonth deathDay deathCountry deathState
## 1     Denver        NA         NA       NA         <NA>       <NA>
## 2     Mobile        NA         NA       NA         <NA>       <NA>
## 3     Mobile      1984          8       16          USA         GA
## 4     Orange        NA         NA       NA         <NA>       <NA>
## 5 Palm Beach        NA         NA       NA         <NA>       <NA>
##   deathCity nameFirst nameLast      nameGiven weight height bats throws
## 1      <NA>     David  Aardsma    David Allan    215     75    R      R
## 2      <NA>      Hank    Aaron    Henry Louis    180     72    R      R
## 3   Atlanta    Tommie    Aaron     Tommie Lee    190     75    R      R
## 4      <NA>       Don     Aase Donald William    190     75    R      R
## 5      <NA>      Andy     Abad  Fausto Andres    184     73    L      L
##        debut  finalGame  retroID   bbrefID  deathDate  birthDate
## 1 2004-04-06 2015-08-23 aardd001 aardsda01       <NA> 1981-12-27
## 2 1954-04-13 1976-10-03 aaroh101 aaronha01       <NA> 1934-02-05
## 3 1962-04-10 1971-09-26 aarot101 aaronto01 1984-08-16 1939-08-05
## 4 1977-07-26 1990-10-03 aased001  aasedo01       <NA> 1954-09-08
## 5 2001-09-10 2006-04-13 abada001  abadan01       <NA> 1972-08-25

Focusing on a single player: Manny Ramirez

Notice that in both of these data tables, players are indexed not by their names, but by unique player IDs. Manny Ramirez, for example, is "ramirma02".

How would we know this? We can get it from the Master data table which records players’ first and last names along with their unique IDs.

Code:

##    playerID birthYear birthMonth birthDay birthCountry        birthState
## 1 ramirma02      1972          5       30         D.R. Distrito Nacional
##       birthCity deathYear deathMonth deathDay deathCountry deathState
## 1 Santo Domingo        NA         NA       NA         <NA>       <NA>
##   deathCity nameFirst nameLast        nameGiven weight height bats throws
## 1      <NA>     Manny  Ramirez Manuel Aristides    225     72    R      R
##        debut  finalGame  retroID   bbrefID deathDate  birthDate
## 1 1993-09-02 2011-04-06 ramim002 ramirma02      <NA> 1972-05-30

(In this case, there were no other major leaguers with that name, so there’s no ambiguity; but for other names there might be more than one player with that name, hence the need for a unique ID.)

Having done that, we can filter the batting data to look at Ramirez’s season-by-season stats.

Code:

##     playerID yearID stint teamID lgID   G  AB   R   H X2B X3B HR RBI SB CS
## 1  ramirma02   1993     1    CLE   AL  22  53   5   9   1   0  2   5  0  0
## 2  ramirma02   1994     1    CLE   AL  91 290  51  78  22   0 17  60  4  2
## 3  ramirma02   1995     1    CLE   AL 137 484  85 149  26   1 31 107  6  6
## 4  ramirma02   1996     1    CLE   AL 152 550  94 170  45   3 33 112  8  5
## 5  ramirma02   1997     1    CLE   AL 150 561  99 184  40   0 26  88  2  3
## 6  ramirma02   1998     1    CLE   AL 150 571 108 168  35   2 45 145  5  3
## 7  ramirma02   1999     1    CLE   AL 147 522 131 174  34   3 44 165  2  4
## 8  ramirma02   2000     1    CLE   AL 118 439  92 154  34   2 38 122  1  1
## 9  ramirma02   2001     1    BOS   AL 142 529  93 162  33   2 41 125  0  1
## 10 ramirma02   2002     1    BOS   AL 120 436  84 152  31   0 33 107  0  0
## 11 ramirma02   2003     1    BOS   AL 154 569 117 185  36   1 37 104  3  1
## 12 ramirma02   2004     1    BOS   AL 152 568 108 175  44   0 43 130  2  4
## 13 ramirma02   2005     1    BOS   AL 152 554 112 162  30   1 45 144  1  0
## 14 ramirma02   2006     1    BOS   AL 130 449  79 144  27   1 35 102  0  1
## 15 ramirma02   2007     1    BOS   AL 133 483  84 143  33   1 20  88  0  0
## 16 ramirma02   2008     1    BOS   AL 100 365  66 109  22   1 20  68  1  0
## 17 ramirma02   2008     2    LAN   NL  53 187  36  74  14   0 17  53  2  0
## 18 ramirma02   2009     1    LAN   NL 104 352  62 102  24   2 19  63  0  1
## 19 ramirma02   2010     1    LAN   NL  66 196  32  61  15   0  8  40  1  1
## 20 ramirma02   2010     2    CHA   AL  24  69   6  18   1   0  1   2  0  0
## 21 ramirma02   2011     1    TBA   AL   5  17   0   1   0   0  0   1  0  0
##     BB  SO IBB HBP SH SF GIDP
## 1    2   8   0   0  0  0    3
## 2   42  72   4   0  0  4    6
## 3   75 112   6   5  2  5   13
## 4   85 104   8   3  0  9   18
## 5   79 115   5   7  0  4   19
## 6   76 121   6   6  0 10   18
## 7   96 131   9  13  0  9   12
## 8   86 117   9   3  0  4    9
## 9   81 147  25   8  0  2    9
## 10  73  85  14   8  0  1   13
## 11  97  94  28   8  0  5   22
## 12  82 124  15   6  0  7   17
## 13  80 119   9  10  0  6   20
## 14 100 102  16   1  0  8   13
## 15  71  92  13   7  0  8   21
## 16  52  86   8   8  0  0   12
## 17  35  38  16   3  0  4    5
## 18  71  81  21   7  0  1    7
## 19  32  38   4   1  0  3    4
## 20  14  23   0   5  0  0    2
## 21   0   4   0   0  0  0    0

Which seasons was he active? (summarize())

We can see from this data that each “case” in the Batting dataset appears to consist of a season’s worth of hitting data for a single player. Manny Ramirez has 21 entries. Was he active for 21 seasons? Let’s check.

Code

##   rookie_year final_year num_seasons num_teams
## 1        1993       2011          19         5

Note that I used a new function, n_distinct(), to return the number of distinct values of a variable in a column.

Hmm… looks like he only played in 19 seasons (for a total of 5 different teams); not 21. What’s happening here?

It turns out that the rows of the Batting table are not necessarily a full season’s worth of data. If a player was traded during a season, then they played for two different teams that year, and so there are two different entries in the data. There must be two years when that happened for Ramirez. Let’s find out which they are.

When was he traded? (group_by() and summarize())

  1. Group MannyData by season, and use summarize() to return the number of different teams recorded in each season. You may want to filter() the data to show only those years when he played for more than one team.

SOLUTION
## # A tibble: 2 x 2
##   yearID num_teams
##    <int>     <int>
## 1   2008         2
## 2   2010         2

If we want to compute Manny’s batting statistics by full season, we’re going to have to do some aggregation to combine the two rows in these years.

  1. Compute and plot the total number of home runs hit by Ramirez in each season (making sure to correctly handle seasons when he was traded, summing his home runs across teams)

SOLUTION


  1. Ramirez hit the most home runs in 1998 and 2005. Of course, new players may not play as often as players that have been around for a few years, so let’s instead plot Ramirez’s “projected” home runs extrapolating to 162 games. Use mutate() to create a variable that consists of home runs per 162 games. You may want to revise your group_by() and summarize() code from the previous exercise to include the number of games played to make this easier (you’ll want to sum actual home runs and games over records first before mutating so that the home run rate is computed correctly).

Adding Age to the Data (join())

Our original goal was to construct a plot similar to this by the player’s age, instead of by season. Can we find age in the Batting table?

##    playerID yearID stint teamID lgID  G  AB  R  H X2B X3B HR RBI SB CS BB
## 1 abercda01   1871     1    TRO   NA  1   4  0  0   0   0  0   0  0  0  0
## 2  addybo01   1871     1    RC1   NA 25 118 30 32   6   0  0  13  8  1  4
## 3 allisar01   1871     1    CL1   NA 29 137 28 40   4   5  0  19  3  1  2
## 4 allisdo01   1871     1    WS3   NA 27 133 28 44  10   2  2  27  1  1  0
## 5 ansonca01   1871     1    RC1   NA 25 120 29 39  11   3  0  16  6  2  2
##   SO IBB HBP SH SF GIDP
## 1  0  NA  NA NA NA    0
## 2  0  NA  NA NA NA    0
## 3  5  NA  NA NA NA    1
## 4  2  NA  NA NA NA    0
## 5  1  NA  NA NA NA    0

Hmm… nope. Nor is there anything in this table that we can use to calculate age. What about in the biographical table, Master?

##    playerID birthYear birthMonth birthDay birthCountry birthState
## 1 aardsda01      1981         12       27          USA         CO
## 2 aaronha01      1934          2        5          USA         AL
## 3 aaronto01      1939          8        5          USA         AL
## 4  aasedo01      1954          9        8          USA         CA
## 5  abadan01      1972          8       25          USA         FL
##    birthCity deathYear deathMonth deathDay deathCountry deathState
## 1     Denver        NA         NA       NA         <NA>       <NA>
## 2     Mobile        NA         NA       NA         <NA>       <NA>
## 3     Mobile      1984          8       16          USA         GA
## 4     Orange        NA         NA       NA         <NA>       <NA>
## 5 Palm Beach        NA         NA       NA         <NA>       <NA>
##   deathCity nameFirst nameLast      nameGiven weight height bats throws
## 1      <NA>     David  Aardsma    David Allan    215     75    R      R
## 2      <NA>      Hank    Aaron    Henry Louis    180     72    R      R
## 3   Atlanta    Tommie    Aaron     Tommie Lee    190     75    R      R
## 4      <NA>       Don     Aase Donald William    190     75    R      R
## 5      <NA>      Andy     Abad  Fausto Andres    184     73    L      L
##        debut  finalGame  retroID   bbrefID  deathDate  birthDate
## 1 2004-04-06 2015-08-23 aardd001 aardsda01       <NA> 1981-12-27
## 2 1954-04-13 1976-10-03 aaroh101 aaronha01       <NA> 1934-02-05
## 3 1962-04-10 1971-09-26 aarot101 aaronto01 1984-08-16 1939-08-05
## 4 1977-07-26 1990-10-03 aased001  aasedo01       <NA> 1954-09-08
## 5 2001-09-10 2006-04-13 abada001  abadan01       <NA> 1972-08-25

Aha, this table gives us players’ birth years.

To a first approximation (that is, ignoring birth month and day), we can represent age using the formula yearID - birthYear. Only problem is, these two variables come from different data tables.

This sounds like a job for a join!

  1. Use a suitable join operation, together with whatever other verbs are needed, to get Age added to Manny’s year-by-year data table, and plot the total number of homeruns/162 by his age instead of by season.

Putting together some pieces

  1. Produce analogous plots using On-Base Percentage (OBP) and Slugging Average (SLG), and OPS (On-Base plus Slugging) instead of HR/162. If you don’t know what these statistics are, that’s ok; you can calculate them as follows. The formula for OBP is: \(OBP = (H + BB + HBP) / (AB + BB + HBP + SF)\), where \(H\) stands for “hits” (an at-bat in which the player did not reach base safely after hitting the ball in play), \(BB\) stands for “base-on-balls” (a walk), \(HBP\) is a “hit by pitch”. Slugging average is the average number of bases per at bat, and is calculated as a weighted sum of the number of singles, doubles, triples and home runs, divided by the number of at-bats: \(SLG = (1 * X1B + 2 * X2B + 3 * X3B + 4 * HR) / AB\). Note that the data table does not include a separate column for singles, but every hit is either a single, a double, a triple or a home run, so it can be worked out from those counts and the total number of hits. Finally, \(OPS = OBP + SLG\). Before you calculate these statistics, you’ll want to combine the count statistics across teams within a season. Since we’re applying the same operation (summing) across several variables, this is a good time to use summarize_at() to specify a set of variables and an operation, rather than writing each one out individually with a regular summarize().