The goal of this lab is to do the wrangling necessary to create the following two figures, starting with just the raw script of the Shakespeare play Macbeth.
The first plot shows what share of lines each character in the play has, collapsing infrequent speakers into a single “OTHERS” bin.
The second shows, for each character with at least 1% of total lines, when they first speak, when they have uttered half their lines, and when they last speak.
First we need to pull the text into R. We’ll make use of the RCurl package here.
Code:
library(RCurl)
library(tidyverse)
library(magrittr)
macbeth_url <- "http://www.gutenberg.org/cache/epub/1129/pg1129.txt"
macbeth_raw <- getURL(macbeth_url)
Then we’ll take the text and break it up into lines, splitting on linebreaks in the original document.
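A minimal sketch of this splitting step, shown on a toy string rather than the real download. The name toy_raw is made up for illustration, and the assumption that lines end in "\r\n" or "\n" should be checked against the actual file:

```r
# Toy stand-in for macbeth_raw: one long string with embedded line breaks
toy_raw <- paste(
  " FIRST WITCH. When shall we three meet again?",
  " SECOND WITCH. When the hurlyburly's done,",
  " THIRD WITCH. That will be ere the set of sun.",
  sep = "\r\n"
)

# strsplit() returns a list with one element per input string,
# so [[1]] pulls out the character vector of lines.
# The pattern "\r?\n" accepts both Windows and Unix line endings.
toy_lines <- strsplit(toy_raw, "\r?\n")[[1]]
length(toy_lines)
```

On the real text the analogous call would be macbeth_lines <- strsplit(macbeth_raw, "\r?\n")[[1]], which produces the macbeth_lines vector that the later code expects.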
Note: Below many of the exercises I show a sample of what your output might look like. To get a representative cross-section without printing out hundreds of lines, I’ve taken a random sample of 10 lines to display.
Sample solution:
set.seed(42)
## Lots of ways you could do this; here's one
## Would also match, e.g., GRANNY MACBETH, if
## such a character existed
grep("^ ([A-Z][A-Z]* )?MACBETH\\.", macbeth_lines, value = TRUE) %>%
  sample(10) # random sample, since head() only shows MACBETH lines
## [1] " LADY MACBETH. Alack, I am afraid they have awaked"
## [2] " LADY MACBETH. These deeds must not be thought"
## [3] " MACBETH. How say'st thou, that Macduff denies his person"
## [4] " MACBETH. To know my deed, 'twere best not know myself."
## [5] " MACBETH. What man dare, I dare."
## [6] " MACBETH. See, they encounter thee with their hearts' thanks."
## [7] " MACBETH. I will not yield,"
## [8] " MACBETH. Sweet remembrancer!"
## [9] " LADY MACBETH. That which hath made them drunk hath made me"
## [10] " MACBETH. We will speak further."
## Here's one that is more specific:
grep("^ (LADY )*MACBETH", macbeth_lines, value = TRUE) %>%
  sample(10)
## [1] " LADY MACBETH. Who was it that thus cried? Why, worthy Thane,"
## [2] " MACBETH. I wish your horses swift and sure of foot,"
## [3] " MACBETH. O, yet I do repent me of my fury,"
## [4] " MACBETH. Had I three ears, I'd hear thee."
## [5] " LADY MACBETH. Say to the King I would attend his leisure"
## [6] " LADY MACBETH. Thou'rt mad to say it!"
## [7] " LADY MACBETH. Almost at odds with morning, which is which."
## [8] " MACBETH. So shall I, love, and so, I pray, be you."
## [9] " LADY MACBETH. Nought's had, all's spent,"
## [10] " MACBETH. Here had we now our country's honor roof'd,"
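To see how the two patterns above differ, here is a small sketch on made-up lines. The vector toy is hypothetical test input (GRANNY MACBETH is not in the play, and the MACDUFF line is a placeholder), and the single leading space follows the patterns as displayed here:

```r
# Hypothetical input lines; only the speaker prefix matters here
toy <- c(
  " MACBETH. I will not yield,",
  " LADY MACBETH. Thou'rt mad to say it!",
  " GRANNY MACBETH. (made-up speaker)",
  " MACDUFF. (placeholder text)"
)

# Looser pattern: an optional single capitalized word before MACBETH
loose <- grepl("^ ([A-Z][A-Z]* )?MACBETH\\.", toy)

# More specific pattern: only LADY may precede MACBETH
specific <- grepl("^ (LADY )*MACBETH", toy)

data.frame(toy, loose, specific)
```

The looser pattern accepts the GRANNY MACBETH line while the more specific one rejects it; both reject MACDUFF.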
Sample solution:
line_numbers <-
  grep("^ [A-Z][A-Z ]+\\.", macbeth_lines)
line_text <- macbeth_lines %>% extract(line_numbers)
data.frame(line_numbers, line_text) %>% head()
## line_numbers line_text
## 1 148 ASCII.
## 2 290 FIRST WITCH. When shall we three meet again?
## 3 292 SECOND WITCH. When the hurlyburly's done,
## 4 294 THIRD WITCH. That will be ere the set of sun.
## 5 295 FIRST WITCH. Where the place?
## 6 296 SECOND WITCH. Upon the heath.
Use str_extract() from the stringr library to pull out just the character name from each line. The syntax is str_extract(SOURCE_TEXT, QUOTED_REGEX_PATTERN) (or you can use the pipe syntax).
Sample solution:
## Note again the space in the brackets. This is needed to match character
## names that contain spaces (such as LADY MACBETH)
characters <- line_text %>% str_extract("^ [A-Z][A-Z ]+\\.")
characters %>% sample(10)
## [1] " BANQUO." " MESSENGER." " FIRST MURTHERER."
## [4] " SIWARD." " MALCOLM." " ALL."
## [7] " SECOND MURTHERER." " MURTHERER." " MALCOLM."
## [10] " THIRD MURTHERER."
Use gsub() to remove the character name from each line of text
Having extracted the full text of each line, we may want to pull out just the actual dialogue. We won’t need this for our plots, but it’s a useful thing to know how to do.
The mapply() function is helpful here. Like lapply(), it allows us to call a function on a list of arguments, but instead of varying just the first argument, we can vary any number of arguments over multiple lists of the same length and, if we want, hold other arguments fixed. It works like this:
Code:
spoken_text <-
  mapply(
    FUN = gsub,
    pattern = characters, ## vary the pattern= argument over elements of `characters`
    x = line_text, ## vary the x= argument over elements of `line_text`
    MoreArgs = list(replacement = "") ## hold the replacement= argument fixed
  )
spoken_text %>% sample(10)
## DUNCAN.
## " No more that Thane of Cawdor shall deceive"
## LADY MACBETH.
## " Who was it that thus cried? Why, worthy Thane,"
## FIRST MURTHERER.
## " Wast not the way?"
## FIRST WITCH.
## " Ay, sir, all this is so. But why"
## MURTHERER.
## " Ay, my good lord. Safe in a ditch he bides,"
## SECOND MURTHERER.
## " A light, a light!"
## LADY MACBETH.
## " Donalbain."
## SIWARD.
## " Enter, sir, the castle."
## LADY MACBETH.
## " A kind good night to all!"
## MACBETH.
## " You are, and do not know't."
The lists supplied to each argument name will be stepped through together: the first elements of each one will be used together, the second of each used together, etc.
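The stepping-through behavior is easier to see on a tiny example. This is only a sketch; the vectors patterns and texts are made up for illustration:

```r
# Two parallel vectors of the same length, stepped through together
patterns <- c("cat", "dog", "bird")
texts <- c("the cat sat", "the dog ran", "the bird flew")

result <- mapply(
  FUN = gsub,
  pattern = patterns,                       # varies: element i goes with ...
  x = texts,                                # ... element i of texts
  MoreArgs = list(replacement = "ANIMAL"),  # held fixed for every call
  USE.NAMES = FALSE                         # drop the names mapply would add
)
result
```

Each call pairs the i-th pattern with the i-th text, so "cat" is only ever searched for in "the cat sat", and so on. Leaving USE.NAMES at its default would name each result after the corresponding pattern, which is why the output above is labeled with character names.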
Now we’ll glue our various variables into a data frame.
Code:
frame <-
  data.frame(
    line = line_numbers,
    character = characters,
    line_text = spoken_text)
head(frame)
## line character line_text
## 1 148 ASCII.
## 2 290 FIRST WITCH. When shall we three meet again?
## 3 292 SECOND WITCH. When the hurlyburly's done,
## 4 294 THIRD WITCH. That will be ere the set of sun.
## 5 295 FIRST WITCH. Where the place?
## 6 296 SECOND WITCH. Upon the heath.
My favorite character from Macbeth was definitely ASCII.
We’d like our graph not to include extraneous spaces or periods in the character names. We could pull these out using gsub() by filtering out each kind of extra stuff; alternatively, we can use a capture group in the regular expression to replace each character name string with just the actual name. It works like this:
Code:
## Put parens around the part of the pattern you care about
## Refer to "whatever is matched" with \\1 in the replacement string
## Each parenthetical expression gets an index, \\1, \\2, etc.
frame <- frame %>%
  mutate(
    character = gsub("^ ([A-Z ]+)\\.", "\\1", character))
head(frame)
## line character line_text
## 1 148 ASCII
## 2 290 FIRST WITCH When shall we three meet again?
## 3 292 SECOND WITCH When the hurlyburly's done,
## 4 294 THIRD WITCH That will be ere the set of sun.
## 5 295 FIRST WITCH Where the place?
## 6 296 SECOND WITCH Upon the heath.
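As a small illustration of how numbered groups behave, here is a sketch on toy strings. The vector raw_names and the name-swapping example are made up for illustration and are not part of the lab's pipeline:

```r
# Toy speaker labels with a leading space and a trailing period
raw_names <- c(" LADY MACBETH.", " ROSS.")

# One group: keep only what the parentheses matched
clean <- gsub("^ *([A-Z][A-Z ]*)\\.$", "\\1", raw_names)

# Two groups: \\1 and \\2 refer to them in order of their opening
# parentheses, so they can be rearranged in the replacement
swapped <- gsub("^([A-Z]+) ([A-Z]+)$", "\\2, \\1", "LADY MACBETH")

clean
swapped
```

Anything matched outside the parentheses (the leading space and the period) is discarded, because the replacement string mentions only the captured group.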
Sample solution:
library(mosaic) # alters the functionality of sample()
character_stats <- frame %>%
  group_by(character) %>%
  summarize(
    num_lines = n(),
    pct_lines = 100 * n() / nrow(frame),
    first_appearance = min(line),
    halfway_point = round(median(line)),
    last_appearance = max(line)) %>%
  arrange(desc(num_lines))
character_stats %>% sample(10)
## # A tibble: 10 x 7
## character num_lines pct_lines first_appearance halfway_point
## <chr> <int> <dbl> <int> <dbl>
## 1 ROSS 39 6.06 369 2240
## 2 YOUNG SI… 3 0.466 3030 3032
## 3 BOTH MUR… 2 0.311 1557 1571
## 4 SEYTON 5 0.776 2842 2847
## 5 SERVANT 5 0.776 1600 2819
## 6 MURTHERER 4 0.621 1736 1741
## 7 BANQUO 33 5.12 444 942
## 8 SECOND A… 2 0.311 2127 2128
## 9 LADY MAC… 59 9.16 668 1253
## 10 MACDUFF 58 9.01 1159 2436
## # … with 2 more variables: last_appearance <int>, orig.id <chr>
So that the bar graph we produce actually respects the ordering of the characters, we’ll use factor() to encode the ordering in the character variable.
Code:
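One way this step could look, offered as a sketch rather than the lab's official code. It assumes the rows are already sorted by num_lines (as arrange(desc(num_lines)) leaves them), so the current row order can serve directly as the factor level order. The data frame toy_stats is a made-up subset using counts from the table above:

```r
# Toy stand-in for character_stats, already sorted by num_lines (descending)
toy_stats <- data.frame(
  character = c("LADY MACBETH", "MACDUFF", "ROSS", "BANQUO"),
  num_lines = c(59, 58, 39, 33),
  stringsAsFactors = FALSE
)

# Without explicit levels, factor() would sort the names alphabetically.
# Passing the column itself as levels preserves the current row order.
toy_stats$character <- factor(toy_stats$character, levels = toy_stats$character)
levels(toy_stats$character)
```

In the pipeline used here, the equivalent would presumably be character_stats %>% mutate(character = factor(character, levels = character)), which works because each character name appears only once after group_by() and summarize().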
We haven’t consolidated minor characters into OTHERS yet, but let’s see what our bar graph would look like so far.
Code:
character_stats %>%
  ggplot(aes(x = character, y = pct_lines)) +
  geom_bar(stat = "identity") +
  scale_x_discrete(name = "Character") +
  scale_y_continuous(name = "% of Total Lines") +
  theme(
    axis.text.x = element_text(angle = 60, hjust = 1))
Use filter() and summarize() to count up the lines, and find the first and last appearance, of any character with less than 1% of total lines. (For the median, just input NA.) Bind this new summary data to the original data, having excluded the individual low-activity characters.
Sample solution:
### Combine infrequent speakers into an "OTHERS" bin
low_freq_chars <- character_stats %>%
  filter(pct_lines < 1) %>%
  summarize(
    character = "OTHERS",
    num_lines = sum(num_lines),
    pct_lines = sum(pct_lines),
    first_appearance = min(first_appearance),
    halfway_point = NA,
    last_appearance = max(last_appearance))
### Replacing the lower activity characters with "OTHERS"
character_stats <-
  character_stats %>%
  filter(pct_lines >= 1) %>%
  rbind(low_freq_chars)
Sample solution (bar graph):
### Bar plot v. 2
character_stats %>%
  ggplot(aes(x = character, y = pct_lines)) +
  geom_bar(stat = "identity") +
  scale_x_discrete(name = "Character") +
  scale_y_continuous(name = "% of Total Lines") +
  theme(
    axis.text.x = element_text(angle = 60, hjust = 1))
Sample solution (point graph):
### Graphing each character's first, middle and last lines
character_stats %>%
  filter(character != "OTHERS") %>%
  rename(
    first = first_appearance,
    middle = halfway_point,
    last = last_appearance) %>%
  mutate(
    character = reorder(character, first, min)) %>%
  gather(
    key = label,
    value = line,
    first, middle, last) %>%
  ggplot(
    aes(
      x = character,
      y = line)) +
  geom_point(aes(size = pct_lines)) +
  geom_line() +
  coord_flip() +
  scale_y_continuous(name = "Line #") +
  scale_x_discrete(name = "Character") +
  guides(size = guide_legend(title = "% of Total Lines"))