Data-Wrangling: Putting it All Together

Goal

Refresh your memory about the main data-wrangling building blocks, and work through an extended example that combines them in various ways to achieve a goal.

The Data

This lab is not about baby names! Woohoo!

Instead, we’ll look at some baseball data from the Lahman package.

This lab is modified from section 4.4 in your textbook.

Our overarching goal is the following: for a few star major league hitters, create a plot that shows the hitting statistic OPS+ (On-Base-Plus-Slugging average for the player relative to league average in that year) for each player in each season that they played in the majors, as a function of their age. This will enable us to see at a glance at what age various star players “peaked” (at least, according to this one measure).

**Load the packages and data:**

library(tidyverse)
library(Lahman)
data(Batting)  # Batting statistics by player-season-team combination
data(Master)   # Biographical data by player

Peek at the data

head(Batting)
##    playerID yearID stint teamID lgID  G  AB  R  H X2B X3B HR RBI SB CS BB
## 1 abercda01   1871     1    TRO   NA  1   4  0  0   0   0  0   0  0  0  0
## 2  addybo01   1871     1    RC1   NA 25 118 30 32   6   0  0  13  8  1  4
## 3 allisar01   1871     1    CL1   NA 29 137 28 40   4   5  0  19  3  1  2
## 4 allisdo01   1871     1    WS3   NA 27 133 28 44  10   2  2  27  1  1  0
## 5 ansonca01   1871     1    RC1   NA 25 120 29 39  11   3  0  16  6  2  2
## 6 armstbo01   1871     1    FW1   NA 12  49  9 11   2   1  0   5  0  1  0
##   SO IBB HBP SH SF GIDP
## 1  0  NA  NA NA NA   NA
## 2  0  NA  NA NA NA   NA
## 3  5  NA  NA NA NA   NA
## 4  2  NA  NA NA NA   NA
## 5  1  NA  NA NA NA   NA
## 6  1  NA  NA NA NA   NA
head(Master)
##    playerID birthYear birthMonth birthDay birthCountry birthState
## 1 aardsda01      1981         12       27          USA         CO
## 2 aaronha01      1934          2        5          USA         AL
## 3 aaronto01      1939          8        5          USA         AL
## 4  aasedo01      1954          9        8          USA         CA
## 5  abadan01      1972          8       25          USA         FL
## 6  abadfe01      1985         12       17         D.R.  La Romana
##    birthCity deathYear deathMonth deathDay deathCountry deathState
## 1     Denver        NA         NA       NA         <NA>       <NA>
## 2     Mobile        NA         NA       NA         <NA>       <NA>
## 3     Mobile      1984          8       16          USA         GA
## 4     Orange        NA         NA       NA         <NA>       <NA>
## 5 Palm Beach        NA         NA       NA         <NA>       <NA>
## 6  La Romana        NA         NA       NA         <NA>       <NA>
##   deathCity nameFirst nameLast        nameGiven weight height bats throws
## 1      <NA>     David  Aardsma      David Allan    215     75    R      R
## 2      <NA>      Hank    Aaron      Henry Louis    180     72    R      R
## 3   Atlanta    Tommie    Aaron       Tommie Lee    190     75    R      R
## 4      <NA>       Don     Aase   Donald William    190     75    R      R
## 5      <NA>      Andy     Abad    Fausto Andres    184     73    L      L
## 6      <NA>  Fernando     Abad Fernando Antonio    220     73    L      L
##        debut  finalGame  retroID   bbrefID  deathDate  birthDate
## 1 2004-04-06 2015-08-23 aardd001 aardsda01       <NA> 1981-12-27
## 2 1954-04-13 1976-10-03 aaroh101 aaronha01       <NA> 1934-02-05
## 3 1962-04-10 1971-09-26 aarot101 aaronto01 1984-08-16 1939-08-05
## 4 1977-07-26 1990-10-03 aased001  aasedo01       <NA> 1954-09-08
## 5 2001-09-10 2006-04-13 abada001  abadan01       <NA> 1972-08-25
## 6 2010-07-28 2016-09-25 abadf001  abadfe01       <NA> 1985-12-17

Focusing on a single player: Manny Ramirez

Notice that in both of these data tables, players are indexed not by their names, but by unique player IDs. Manny Ramirez, for example, is "ramirma02".

How would we know this? We can get it from the Master data table which records players’ first and last names along with their unique IDs.

Code:

filter(Master, nameFirst == "Manny", nameLast == "Ramirez")
##    playerID birthYear birthMonth birthDay birthCountry        birthState
## 1 ramirma02      1972          5       30         D.R. Distrito Nacional
##       birthCity deathYear deathMonth deathDay deathCountry deathState
## 1 Santo Domingo        NA         NA       NA         <NA>       <NA>
##   deathCity nameFirst nameLast        nameGiven weight height bats throws
## 1      <NA>     Manny  Ramirez Manuel Aristides    225     72    R      R
##        debut  finalGame  retroID   bbrefID deathDate  birthDate
## 1 1993-09-02 2011-04-06 ramim002 ramirma02      <NA> 1972-05-30

(In this case, there were no other major leaguers with that name, so there’s no ambiguity; but for other names there might be more than one player with that name, hence the need for a unique ID.)

Having done that, we can filter the batting data to look at Ramirez’s season-by-season stats.

Code:

manny <- filter(Batting, playerID == "ramirma02")
head(manny)
##    playerID yearID stint teamID lgID   G  AB   R   H X2B X3B HR RBI SB CS
## 1 ramirma02   1993     1    CLE   AL  22  53   5   9   1   0  2   5  0  0
## 2 ramirma02   1994     1    CLE   AL  91 290  51  78  22   0 17  60  4  2
## 3 ramirma02   1995     1    CLE   AL 137 484  85 149  26   1 31 107  6  6
## 4 ramirma02   1996     1    CLE   AL 152 550  94 170  45   3 33 112  8  5
## 5 ramirma02   1997     1    CLE   AL 150 561  99 184  40   0 26  88  2  3
## 6 ramirma02   1998     1    CLE   AL 150 571 108 168  35   2 45 145  5  3
##   BB  SO IBB HBP SH SF GIDP
## 1  2   8   0   0  0  0    3
## 2 42  72   4   0  0  4    6
## 3 75 112   6   5  2  5   13
## 4 85 104   8   3  0  9   18
## 5 79 115   5   7  0  4   19
## 6 76 121   6   6  0 10   18
nrow(manny)
## [1] 21

We can see from this data that each “case” in the Batting dataset appears to consist of a season’s worth of hitting data for a single player. Manny Ramirez has 21 entries. Was he active for 21 seasons? Let’s check.

Code

manny %>% 
  summarize(
    rookie_year = min(yearID), 
    final_year = max(yearID),
    num_seasons = n_distinct(yearID),
    num_teams = n_distinct(teamID)
    )
##   rookie_year final_year num_seasons num_teams
## 1        1993       2011          19         5

Note that I used a new function, n_distinct(), to return the number of distinct values of a variable in a column.

Hmm… looks like he only played in 19 seasons (for a total of 5 different teams); not 21. What’s happening here?

It turns out that the rows of the Batting table are not necessarily a full season’s worth of data. If a player was traded during a season, then they played for two different teams that year, and so there are two different entries in the data. There must be two years when that happened for Ramirez. Let’s find out which they are.

Code:

manny %>%
  group_by(yearID) %>%
  summarize(num_teams = n_distinct(teamID)) %>%
  filter(num_teams > 1)
## # A tibble: 2 x 2
##   yearID num_teams
##    <int>     <int>
## 1   2008         2
## 2   2010         2

If we want to compute Manny’s batting statistics by full season, we’re going to have to do some aggregation to combine the two rows in these years.

  1. Compute and plot the number of home runs hit by Ramirez in each season.

Sample Solution:

library(ggthemes)
library(scales)
hr.totals <- manny %>% 
  group_by(yearID) %>%
  summarize(tHR = sum(HR))
hr.totals %>%
  ggplot(aes(x = yearID, y = tHR)) +
  geom_line() +
  xlab("Season") +
  ylab("Total Home Runs") +
  theme_fivethirtyeight()

It might be nice to have a few more years labeled on the x axis. Let’s modify our plot using scale_x_continuous():

Sample solution (modified)

hr.totals %>%
  ggplot(aes(x = yearID, y = tHR)) +
  geom_line() +
  xlab("Season") +
  ylab("Total Home Runs") +
  scale_x_continuous(breaks = seq(1993, 2011, by = 2))

Ramirez hit the most home runs in 1998 and 2005. Of course, new players may not play as often as players that have been around for a few years, so let’s instead plot Ramirez’s “projected” home runs extrapolating to 162 games.

Sample solution

hr.totals <- manny %>%
  group_by(yearID) %>%
  summarize(
    tHR = sum(HR),
    tGames = sum(G)
  ) %>%
  mutate(tHRprojected = tHR / tGames * 162)
hr.totals %>%
  ggplot(aes(x = yearID, y = tHRprojected)) +
  geom_line() +
  xlab("Season") +
  ylab("Home Runs per 162 games") +
  scale_x_continuous(breaks = seq(1993, 2011, by = 2))