Refresh your memory about the main data-wrangling building blocks, and work through an extended example that combines them in various ways to achieve a goal.
This lab is not about baby names! Woohoo!
Instead, we’ll look at some baseball data from the Lahman
package.
This lab is modified from section 4.4 in your textbook.
Our overarching goal is the following: for a few star major league hitters, create a plot that shows the hitting statistic OPS+ (On-Base-Plus-Slugging average for the player relative to league average in that year) for each player in each season that they played in the majors, as a function of their age. This will enable us to see at a glance at what age various star players “peaked” (at least, according to this one measure).
First, we’ll need to load the Lahman
package and a couple of datasets.
library(tidyverse)
library(Lahman)
data(Batting) # Batting statistics by player-season-team combination
data(Master) # Biographical data by player
Let’s see what the data in each of these two tables looks like.
## playerID yearID stint teamID lgID G AB R H X2B X3B HR RBI SB CS BB
## 1 abercda01 1871 1 TRO NA 1 4 0 0 0 0 0 0 0 0 0
## 2 addybo01 1871 1 RC1 NA 25 118 30 32 6 0 0 13 8 1 4
## 3 allisar01 1871 1 CL1 NA 29 137 28 40 4 5 0 19 3 1 2
## 4 allisdo01 1871 1 WS3 NA 27 133 28 44 10 2 2 27 1 1 0
## 5 ansonca01 1871 1 RC1 NA 25 120 29 39 11 3 0 16 6 2 2
## SO IBB HBP SH SF GIDP
## 1 0 NA NA NA NA 0
## 2 0 NA NA NA NA 0
## 3 5 NA NA NA NA 1
## 4 2 NA NA NA NA 0
## 5 1 NA NA NA NA 0
## playerID birthYear birthMonth birthDay birthCountry birthState
## 1 aardsda01 1981 12 27 USA CO
## 2 aaronha01 1934 2 5 USA AL
## 3 aaronto01 1939 8 5 USA AL
## 4 aasedo01 1954 9 8 USA CA
## 5 abadan01 1972 8 25 USA FL
## birthCity deathYear deathMonth deathDay deathCountry deathState
## 1 Denver NA NA NA <NA> <NA>
## 2 Mobile NA NA NA <NA> <NA>
## 3 Mobile 1984 8 16 USA GA
## 4 Orange NA NA NA <NA> <NA>
## 5 Palm Beach NA NA NA <NA> <NA>
## deathCity nameFirst nameLast nameGiven weight height bats throws
## 1 <NA> David Aardsma David Allan 215 75 R R
## 2 <NA> Hank Aaron Henry Louis 180 72 R R
## 3 Atlanta Tommie Aaron Tommie Lee 190 75 R R
## 4 <NA> Don Aase Donald William 190 75 R R
## 5 <NA> Andy Abad Fausto Andres 184 73 L L
## debut finalGame retroID bbrefID deathDate birthDate
## 1 2004-04-06 2015-08-23 aardd001 aardsda01 <NA> 1981-12-27
## 2 1954-04-13 1976-10-03 aaroh101 aaronha01 <NA> 1934-02-05
## 3 1962-04-10 1971-09-26 aarot101 aaronto01 1984-08-16 1939-08-05
## 4 1977-07-26 1990-10-03 aased001 aasedo01 <NA> 1954-09-08
## 5 2001-09-10 2006-04-13 abada001 abadan01 <NA> 1972-08-25
Notice that in both of these data tables, players are indexed not by their names, but by unique player IDs. Manny Ramirez, for example, is "ramirma02"
.
How would we know this? We can get it from the Master
data table which records players’ first and last names along with their unique IDs.
Code:
## playerID birthYear birthMonth birthDay birthCountry birthState
## 1 ramirma02 1972 5 30 D.R. Distrito Nacional
## birthCity deathYear deathMonth deathDay deathCountry deathState
## 1 Santo Domingo NA NA NA <NA> <NA>
## deathCity nameFirst nameLast nameGiven weight height bats throws
## 1 <NA> Manny Ramirez Manuel Aristides 225 72 R R
## debut finalGame retroID bbrefID deathDate birthDate
## 1 1993-09-02 2011-04-06 ramim002 ramirma02 <NA> 1972-05-30
(In this case, there were no other major leaguers with that name, so there’s no ambiguity; but for other names there might be more than one player with that name, hence the need for a unique ID.)
Having done that, we can filter the batting data to look at Ramirez’s season-by-season stats.
Code:
## playerID yearID stint teamID lgID G AB R H X2B X3B HR RBI SB CS
## 1 ramirma02 1993 1 CLE AL 22 53 5 9 1 0 2 5 0 0
## 2 ramirma02 1994 1 CLE AL 91 290 51 78 22 0 17 60 4 2
## 3 ramirma02 1995 1 CLE AL 137 484 85 149 26 1 31 107 6 6
## 4 ramirma02 1996 1 CLE AL 152 550 94 170 45 3 33 112 8 5
## 5 ramirma02 1997 1 CLE AL 150 561 99 184 40 0 26 88 2 3
## 6 ramirma02 1998 1 CLE AL 150 571 108 168 35 2 45 145 5 3
## 7 ramirma02 1999 1 CLE AL 147 522 131 174 34 3 44 165 2 4
## 8 ramirma02 2000 1 CLE AL 118 439 92 154 34 2 38 122 1 1
## 9 ramirma02 2001 1 BOS AL 142 529 93 162 33 2 41 125 0 1
## 10 ramirma02 2002 1 BOS AL 120 436 84 152 31 0 33 107 0 0
## 11 ramirma02 2003 1 BOS AL 154 569 117 185 36 1 37 104 3 1
## 12 ramirma02 2004 1 BOS AL 152 568 108 175 44 0 43 130 2 4
## 13 ramirma02 2005 1 BOS AL 152 554 112 162 30 1 45 144 1 0
## 14 ramirma02 2006 1 BOS AL 130 449 79 144 27 1 35 102 0 1
## 15 ramirma02 2007 1 BOS AL 133 483 84 143 33 1 20 88 0 0
## 16 ramirma02 2008 1 BOS AL 100 365 66 109 22 1 20 68 1 0
## 17 ramirma02 2008 2 LAN NL 53 187 36 74 14 0 17 53 2 0
## 18 ramirma02 2009 1 LAN NL 104 352 62 102 24 2 19 63 0 1
## 19 ramirma02 2010 1 LAN NL 66 196 32 61 15 0 8 40 1 1
## 20 ramirma02 2010 2 CHA AL 24 69 6 18 1 0 1 2 0 0
## 21 ramirma02 2011 1 TBA AL 5 17 0 1 0 0 0 1 0 0
## BB SO IBB HBP SH SF GIDP
## 1 2 8 0 0 0 0 3
## 2 42 72 4 0 0 4 6
## 3 75 112 6 5 2 5 13
## 4 85 104 8 3 0 9 18
## 5 79 115 5 7 0 4 19
## 6 76 121 6 6 0 10 18
## 7 96 131 9 13 0 9 12
## 8 86 117 9 3 0 4 9
## 9 81 147 25 8 0 2 9
## 10 73 85 14 8 0 1 13
## 11 97 94 28 8 0 5 22
## 12 82 124 15 6 0 7 17
## 13 80 119 9 10 0 6 20
## 14 100 102 16 1 0 8 13
## 15 71 92 13 7 0 8 21
## 16 52 86 8 8 0 0 12
## 17 35 38 16 3 0 4 5
## 18 71 81 21 7 0 1 7
## 19 32 38 4 1 0 3 4
## 20 14 23 0 5 0 0 2
## 21 0 4 0 0 0 0 0
summarize()
)We can see from this data that each “case” in the Batting
dataset appears to consist of a season’s worth of hitting data for a single player. Manny Ramirez has 21 entries. Was he active for 21 seasons? Let’s check.
Code
MannyData %>%
summarize(
rookie_year = min(yearID),
final_year = max(yearID),
num_seasons = n_distinct(yearID),
num_teams = n_distinct(teamID))
## rookie_year final_year num_seasons num_teams
## 1 1993 2011 19 5
Note that I used a new function, n_distinct()
, to return the number of distinct values of a variable in a column.
Hmm… looks like he only played in 19 seasons (for a total of 5 different teams); not 21. What’s happening here?
It turns out that the rows of the Batting
table are not necessarily a full season’s worth of data. If a player was traded during a season, then they played for two different teams that year, and so there are two different entries in the data. There must be two years when that happened for Ramirez. Let’s find out which they are.
group_by()
and summarize()
)MannyData
by season, and use summarize()
to return the number of different teams recorded in each season. You may want to filter()
the data to show only those years when he played for more than one team.MannyData %>%
group_by(yearID) %>%
summarize(num_teams = n_distinct(teamID)) %>%
filter(num_teams > 1)
## # A tibble: 2 x 2
## yearID num_teams
## <int> <int>
## 1 2008 2
## 2 2010 2
If we want to compute Manny’s batting statistics by full season, we’re going to have to do some aggregation to combine the two rows in these years.
MannyHRs <- MannyData %>%
group_by(yearID) %>%
summarize(tHR = sum(HR))
MannyHRs %>%
ggplot(aes(x = yearID, y = tHR)) +
geom_line() +
scale_x_continuous(
name = "Season",
breaks = seq(1990, 2012, by = 2),
limits = c(1990, 2012)) +
scale_y_continuous(
name = "Total Home Runs",
limits = c(0, 50),
breaks = seq(0,50,by = 2))
mutate()
to create a variable that consists of home runs per 162 games. You may want to revise your group_by()
and summarize()
code from the previous exercise to include the number of games played to make this easier (you’ll want to sum actual home runs and games over records first before mutating so that the home run rate is computed correctly).MannyHRs <- MannyData %>%
group_by(yearID) %>%
summarize(
tHR = sum(HR),
tGames = sum(G)) %>%
mutate(tHRper162 = tHR / tGames * 162)
MannyHRs %>%
ggplot(aes(x = yearID, y = tHRper162)) +
geom_line() +
scale_x_continuous(
name = "Season",
limits = c(1990, 2012),
breaks = seq(1990, 2012, by = 2)) +
scale_y_continuous(
name = "Home Runs per 162 games",
limits = c(0,54),
breaks = seq(0,54, by = 2))
The shape is basically the same, but now we can see that although Ramirez hit the same number of home runs in 2000 and 2005, he did it in fewer games in 2000.
join()
)Our original goal was to construct a plot similar to this by the player’s age, instead of by season. Can we find age in the Batting
table?
## playerID yearID stint teamID lgID G AB R H X2B X3B HR RBI SB CS BB
## 1 abercda01 1871 1 TRO NA 1 4 0 0 0 0 0 0 0 0 0
## 2 addybo01 1871 1 RC1 NA 25 118 30 32 6 0 0 13 8 1 4
## 3 allisar01 1871 1 CL1 NA 29 137 28 40 4 5 0 19 3 1 2
## 4 allisdo01 1871 1 WS3 NA 27 133 28 44 10 2 2 27 1 1 0
## 5 ansonca01 1871 1 RC1 NA 25 120 29 39 11 3 0 16 6 2 2
## SO IBB HBP SH SF GIDP
## 1 0 NA NA NA NA 0
## 2 0 NA NA NA NA 0
## 3 5 NA NA NA NA 1
## 4 2 NA NA NA NA 0
## 5 1 NA NA NA NA 0
Hmm… nope. Nor is there anything in this table that we can use to calculate age. What about in the biographical table, Master
?
## playerID birthYear birthMonth birthDay birthCountry birthState
## 1 aardsda01 1981 12 27 USA CO
## 2 aaronha01 1934 2 5 USA AL
## 3 aaronto01 1939 8 5 USA AL
## 4 aasedo01 1954 9 8 USA CA
## 5 abadan01 1972 8 25 USA FL
## birthCity deathYear deathMonth deathDay deathCountry deathState
## 1 Denver NA NA NA <NA> <NA>
## 2 Mobile NA NA NA <NA> <NA>
## 3 Mobile 1984 8 16 USA GA
## 4 Orange NA NA NA <NA> <NA>
## 5 Palm Beach NA NA NA <NA> <NA>
## deathCity nameFirst nameLast nameGiven weight height bats throws
## 1 <NA> David Aardsma David Allan 215 75 R R
## 2 <NA> Hank Aaron Henry Louis 180 72 R R
## 3 Atlanta Tommie Aaron Tommie Lee 190 75 R R
## 4 <NA> Don Aase Donald William 190 75 R R
## 5 <NA> Andy Abad Fausto Andres 184 73 L L
## debut finalGame retroID bbrefID deathDate birthDate
## 1 2004-04-06 2015-08-23 aardd001 aardsda01 <NA> 1981-12-27
## 2 1954-04-13 1976-10-03 aaroh101 aaronha01 <NA> 1934-02-05
## 3 1962-04-10 1971-09-26 aarot101 aaronto01 1984-08-16 1939-08-05
## 4 1977-07-26 1990-10-03 aased001 aasedo01 <NA> 1954-09-08
## 5 2001-09-10 2006-04-13 abada001 abadan01 <NA> 1972-08-25
Aha, this table gives us players’ birth years.
To a first approximation (that is, ignoring birth month and day), we can represent age using the formula yearID - birthYear
. Only problem is, these two variables come from different data tables.
This sounds like a job for a join
!
join
operation, together with whatever other verbs are needed, to get Age
added to Manny’s year-by-year data table, and plot the total number of homeruns/162 by his age instead of by season.MannyData <- Batting %>%
filter(playerID == "ramirma02") %>%
left_join(Master, by = "playerID") %>%
mutate(Age = yearID - birthYear) %>%
group_by(Age) %>%
summarize(
tHR = sum(HR),
tGames = sum(G),
tHRper162 = tHR / tGames * 162)
MannyData %>%
ggplot(aes(x = Age, y = tHRper162)) +
geom_line() +
xlab("Age") +
ylab("Home Runs per 162 games") +
scale_x_continuous(breaks = seq(18, 50, by = 2))
summarize_at()
to specify a set of variables and an operation, rather than writing each one out individually with a regular summarize()
.## Note that the additional variables in group_by aren't actually needed
## to get what we want because they either have only one value or are redundant
## with other variables, but including them allows us to retain them in the
## summarized output
MannyData <- Batting %>%
filter(playerID == "ramirma02") %>%
left_join(Master, by = c("playerID" = "playerID")) %>%
mutate(Age = yearID - birthYear) %>%
group_by(Age, playerID, yearID) %>%
summarize_at(
vars(AB, H, BB, HBP, SF, X2B, X3B, HR),
list(sum) # Apply the functions in this list to each variable
) %>%
mutate(
X1B = H - X2B - X3B - HR,
OBP = (H + BB + HBP) / (AB + BB + HBP + SF),
SLG = (1*X1B + 2*X2B + 3*X3B + 4*HR) / AB,
OPS = OBP + SLG)
## Note: Because we only applied one function to each listed
## variable, it didn't append anything to the variable names.
## If we had wanted to apply more than one function, the resulting
## names in the summarized data table would have the label we gave to that
## function in our function list appended to them. For example, if we
## had said list(total = sum, avg = mean), we'd get variables with
## names like H_total, X2B_total, H_avg, X2B_avg, etc.
MannyData %>%
ggplot(aes(x = Age, y = OBP)) +
geom_line() +
xlab("Age") +
ylab("OBP") +
scale_x_continuous(breaks = seq(18, 50, by = 2))
MannyData %>%
ggplot(aes(x = Age, y = SLG)) +
geom_line() +
xlab("Age") +
ylab("SLG") +
scale_x_continuous(breaks = seq(18, 50, by = 2))
MannyData %>%
ggplot(aes(x = Age, y = OPS)) +
geom_line() +
xlab("Age") +
ylab("OPS") +
scale_x_continuous(breaks = seq(18, 50, by = 2))
You don’t necessarily need to know all the details about baseball to create visualizations, but it’s never a bad idea to learn something about the measures you’re plotting. So if you aren’t familiar with baseball, here’s a quick “glossary”.
Each time a hitter appears at home plate to hit, the outcome is either
Among hits * a single means the runner reached first base on the play * a double or a triple mean the batter reached second or third, respectively, and * a home run means the batter made it all the way around the bases (for a total of four bases on the play).
For the purposes of recording statistics, walks, being hit by a pitch, and sacrifice flies are not recorded as “at bats”, and so for the purposes of calculating the traditional “batting average”, they don’t factor in to either the numerator or the denominator: batting average (BA) is simply “hits” (H) divided by “at bats” (AB).
Unlike batting average, “on-base percentage” counts any outcome in which the player ends up on the bases (hits, walks, hit by pitch) in the numerator, and counts every plate appearance (all of these things plus outs including sacrifice flies) in the denominator.
If you’ve ever read the book or seen the movie Moneyball about the 2002 Oakland A’s, one of the key insights that Billy Beane and his analysts had was that the league as a whole had been undervaluing outcomes in which the player reached base not via a hit, and so they tried to sign players with high OBPs relative to their batting average.
join()
ed)Around the turn of the millenium (coinciding with Manny Ramirez’s peak), many hitters were putting up off the charts hitting numbers, which can at least in part be attributed to a high rate of steroid abuse during that time. If we want to know when a player (Ramirez, for example) provided the greatest “added offensive value” to his team, it would be instructive to know how well he was hitting in each season relative to the rest of the league.
The statistic OPS+ is defined as a player’s OPS divided by the league average in that year, times 100 (so that 100 is league average, 150 means the player’s OPS was 50% higher than league average, etc.). Like age, however, OPS+ depends on information from two different data tables; or at least two views of the Batting
data: the numerator comes from data for a specific player, and the denominator comes from data aggregated over players.
We can use a join
to deal with this as well, but we first need to create the two tables (with the right type of aggregation) that we want to join.
We already know how to compute an individual player’s OPS. To compute the average for the whole league, we can do the same thing but without filtering first.
Code:
LeagueData <- Batting %>%
mutate(X1B = H - X2B - X3B - HR) %>%
group_by(yearID) %>%
summarize(
leagueOPS =
sum(H + BB + HBP, na.rm = TRUE) / sum(AB + BB + HBP + SF, na.rm = TRUE) +
sum(X1B + 2*X2B + 3*X3B + 4*HR, na.rm = TRUE) / sum(AB, na.rm = TRUE))
LeagueData %>% slice_tail(n = 5)
## # A tibble: 5 x 2
## yearID leagueOPS
## <int> <dbl>
## 1 2014 0.700
## 2 2015 0.721
## 3 2016 0.739
## 4 2017 0.750
## 5 2018 0.728
You might notice that very long-ago years are missing some of the components needed to compute OPS, and so we get “not a number” for league OPS since in some cases we end up trying to divide zero by zero. We could have filtered out these early years, but they won’t cause a problem for what we want to do, since in join
ing, they’ll be left out anyway.
join
to get Manny Ramirez’s OPS+ for each season, and plot it as a time series. Add a horizontal reference line at OPS+ = 100, since this is the league average.Master
table, to a table called, say, RipkenBioData
, we can use the following syntax to get just the first entry in the playerID
column: RipkenBioData %>% slice_head(n = 1) %>% pull(playerID)
.