The goal of this lab is to produce the following two figures. The first shows what share of lines each character in Macbeth has, collapsing infrequent speakers into a single “OTHERS” bin.
The second shows, for each character with at least 1% of total lines, when they first speak, when they have uttered half their lines, and when they last speak.
Code:
macbeth_url <- "http://www.gutenberg.org/cache/epub/1129/pg1129.txt"
macbeth_raw <- RCurl::getURL(macbeth_url)
macbeth_lines <- strsplit(macbeth_raw, "\r\n")[[1]]
Sample solution:
library(dplyr) # provides the %>% pipe used throughout
set.seed(42)
## Lots of ways you could do this; here's one
## Would also match, e.g., GRANNY MACBETH, if
## such a character existed
grep("^ [A-Z]* ?MACBETH.", macbeth_lines, value = TRUE) %>%
  sample(10) # random sample, since head() only shows MACBETH lines
## [1] " MACBETH. Throw physic to the dogs, I'll none of it."
## [2] " MACBETH. She should have died hereafter;"
## [3] " LADY MACBETH. Donalbain."
## [4] " MACBETH. are they? Gone? Let this pernicious hour"
## [5] " MACBETH. Here had we now our country's honor roof'd,"
## [6] " MACBETH. Both of you"
## [7] " LADY MACBETH. You have displaced the mirth, broke the good"
## [8] " LADY MACBETH. Only look up clear;"
## [9] " MACBETH. Why should I play the Roman fool and die"
## [10] " LADY MACBETH. What, quite unmann'd in folly?"
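In a regular expression, an unescaped . matches any single character, while \\. matches only a literal period. The sketch below shows the difference on a few made-up lines (they are illustrative assumptions, not quoted from the downloaded file), using base R only:

```r
# Toy lines, made up for illustration
toy_lines <- c(
  " MACBETH. So foul and fair a day I have not seen.",
  " LADY MACBETH. Give me the daggers.",
  " MACBETH, Thane of Glamis"
)

# Unescaped "." matches any character, so the comma in line 3 is accepted
loose <- grepl("^ [A-Z]* ?MACBETH.", toy_lines)

# Escaped "\\." requires a literal period, so line 3 is rejected
strict <- grepl("^ [A-Z]* ?MACBETH\\.", toy_lines)

print(loose)
print(strict)
```

If only spoken lines are wanted, the escaped form is the stricter choice.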
Sample solution:
## Note: previous version of the lab left out the space in the brackets
## causing lines spoken by multi-word characters to be omitted
line_numbers <-
  grep("^ [A-Z ]+\\.", macbeth_lines)
line_text <- macbeth_lines[line_numbers]
data.frame(line_numbers, line_text) %>% head()
## line_numbers line_text
## 1 138 ASCII.
## 2 193 P.O. Box 2782
## 3 280 FIRST WITCH. When shall we three meet again?
## 4 282 SECOND WITCH. When the hurlyburly's done,
## 5 284 THIRD WITCH. That will be ere the set of sun.
## 6 285 FIRST WITCH. Where the place?
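To see what the space inside the brackets changes, here is a minimal sketch on made-up lines (the toy vector is an assumption for illustration). It also shows that any line consisting of capital letters followed by a period is matched, whether or not it is dialogue:

```r
# Toy lines, made up for illustration
toy_lines <- c(
  " FIRST WITCH. When shall we three meet again?",
  " MACBETH. So foul and fair a day I have not seen.",
  " ASCII."
)

# Without a space in the class, two-word names cannot match
no_space <- grep("^ [A-Z]+\\.", toy_lines)

# With a space in the class, FIRST WITCH is matched too
with_space <- grep("^ [A-Z ]+\\.", toy_lines)

print(no_space)
print(with_space)
```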
Use str_extract() from the stringr library to pull out just the character name from each line. The syntax is str_extract(SOURCE_TEXT, QUOTED_REGEX_PATTERN).
Sample solution:
## Note again the space in the brackets. This is needed to match character
## names that contain spaces (such as LADY MACBETH)
library(stringr)
characters <- str_extract(line_text, "^ [A-Z ]+\\.")
Use gsub() to remove the character name from each line of text
Having extracted the full text of each line, we may want to pull out just the actual dialogue. We won’t need this for our plots, but it’s a useful thing to know how to do.
The mapply() function is helpful here. Like lapply(), it lets us call a function over a list of arguments, but instead of varying only the first argument, we can vary any number of arguments over multiple lists of the same length and, if we want, hold other arguments fixed. It works like this:
Code:
spoken_text <-
  mapply(
    FUN = gsub,
    pattern = characters,             ## vary the pattern= argument over elements of `characters`
    x = line_text,                    ## vary the x= argument over elements of `line_text`
    MoreArgs = list(replacement = "") ## hold the replacement= argument fixed
  )
head(spoken_text)
## ASCII. P.
## "" "O. Box 2782"
## FIRST WITCH. SECOND WITCH.
## " When shall we three meet again?" " When the hurlyburly's done,"
## THIRD WITCH. FIRST WITCH.
## " That will be ere the set of sun." " Where the place?"
The lists supplied to each argument name will be stepped through together: the first elements of each one will be used together, the second of each used together, etc.
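The element-by-element stepping can be checked on a tiny example. This is a sketch with made-up vectors (patterns and texts are hypothetical names, not part of the lab data):

```r
# Made-up inputs: one pattern per text
patterns <- c("a", "b", "c")
texts    <- c("banana", "bubble", "cactus")

# Call 1: gsub(pattern = "a", x = "banana", replacement = "")
# Call 2: gsub(pattern = "b", x = "bubble", replacement = "")
# Call 3: gsub(pattern = "c", x = "cactus", replacement = "")
result <- mapply(
  FUN = gsub,
  pattern = patterns,
  x = texts,
  MoreArgs = list(replacement = "")
)

print(result)
```

Because the first varying argument is a character vector, mapply() uses it to name the result, which is why the output above is labelled with the character names. Wrapping the result in unname() drops those labels if they are not wanted.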
Now we’ll glue our various variables into a data frame.
Code:
frame <-
  data.frame(
    line = line_numbers,
    character = characters,
    line_text = spoken_text)
head(frame)
## line character line_text
## 1 138 ASCII.
## 2 193 P. O. Box 2782
## 3 280 FIRST WITCH. When shall we three meet again?
## 4 282 SECOND WITCH. When the hurlyburly's done,
## 5 284 THIRD WITCH. That will be ere the set of sun.
## 6 285 FIRST WITCH. Where the place?
My favorite character from Macbeth was definitely ASCII.
We’d like our graph not to include extraneous spaces or periods in the character names. We could strip these out with gsub(), removing each kind of extra character in turn; alternatively, we can use a capture group in the regular expression to replace each character-name string with just the name itself. It works like this:
Code:
## Put parens around the part of the pattern you care about
## Refer to "whatever is matched" with \\1 in the replacement string
## Each parenthetical expression gets an index, \\1, \\2, etc.
frame <- frame %>%
  mutate(
    character = gsub("^ ([A-Z ]+)\\.", "\\1", character))
head(frame)
## line character line_text
## 1 138 ASCII
## 2 193 P O. Box 2782
## 3 280 FIRST WITCH When shall we three meet again?
## 4 282 SECOND WITCH When the hurlyburly's done,
## 5 284 THIRD WITCH That will be ere the set of sun.
## 6 285 FIRST WITCH Where the place?
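Here is a small sketch of capture groups on made-up name strings (raw_names is a hypothetical vector). The second call uses two groups to show how \\1 and \\2 are indexed:

```r
# Made-up name strings in the same shape as the extracted ones
raw_names <- c(" FIRST WITCH.", " LADY MACBETH.", " ROSS.")

# One group: keep only what the parentheses matched
clean_names <- gsub("^ ([A-Z ]+)\\.", "\\1", raw_names)

# Two groups: swap the two words of a two-word name;
# strings that do not match (ROSS) are returned unchanged
swapped <- gsub("^ ([A-Z]+) ([A-Z]+)\\.", "\\2, \\1", raw_names)

print(clean_names)
print(swapped)
```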
Sample solution:
library(readr) # for parse_integer()
character_stats <- frame %>%
  group_by(character) %>%
  summarize(
    num_lines = n(),
    prop_lines = n() / nrow(frame),
    first_appearance = min(line), # line is already an integer, so no parsing is needed
    halfway_point = median(line),
    last_appearance = max(line)) %>%
  arrange(desc(num_lines))
character_stats
## # A tibble: 46 x 6
## character num_lines prop_lines first_appearance halfway_point
## <chr> <int> <dbl> <int> <dbl>
## 1 MACBETH 146 0.226 433 1569.
## 2 LADY MACBETH 59 0.0913 658 1243.
## 3 MACDUFF 58 0.0898 1149 2426.
## 4 ROSS 39 0.0604 359 2230.
## 5 MALCOLM 38 0.0588 305 2505.
## 6 BANQUO 33 0.0511 434 932.
## 7 FIRST WITCH 23 0.0356 280 465.
## 8 LENNOX 21 0.0325 357 1861.
## 9 DOCTOR 20 0.0310 2499 2700.
## 10 LADY MACDUFF 19 0.0294 2229 2275.
## # ... with 36 more rows, and 1 more variable: last_appearance <int>
So that the bar graph we produce actually respects the ordering of the characters, we’ll use factor() to encode the ordering in the character variable.
Code:
character_stats <- character_stats %>%
  mutate(
    character = factor(character, levels = character))
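To see why the levels= argument matters, here is a small sketch on a made-up vector of names. By default factor() sorts its levels alphabetically; passing levels explicitly keeps the order given, which is the order a plot will then use:

```r
# Made-up names, already sorted by number of lines (most first)
chars <- c("MACBETH", "LADY MACBETH", "MACDUFF")

# Default: levels are sorted alphabetically
default_factor <- factor(chars)

# Explicit levels: the existing order is preserved
ordered_factor <- factor(chars, levels = chars)

print(levels(default_factor))
print(levels(ordered_factor))
```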
Use filter() and summarize() to count up the lines, and find the first and last appearances, of all characters with less than 1% of total lines. (For the median, just input NA.) Bind this new summary data to the original data, having excluded the individual low-activity characters.
Sample solution:
low_freq_chars <- character_stats %>%
  filter(prop_lines < 0.01) %>%
  summarize(
    character = "OTHERS",
    num_lines = sum(num_lines),
    prop_lines = sum(prop_lines),
    first_appearance = min(first_appearance), # already an integer
    halfway_point = NA_real_,                 # no meaningful median for the pooled group
    last_appearance = max(last_appearance))   # already an integer
character_stats <-
  character_stats %>%
  filter(prop_lines >= 0.01) %>%
  rbind(low_freq_chars)
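The same split, pool, and bind pattern can be sketched in base R on made-up counts. The data frame and its values are assumptions for illustration only; the 1% threshold mirrors the one used above:

```r
# Made-up line counts for four hypothetical characters
stats <- data.frame(
  character = c("A", "B", "C", "D"),
  num_lines = c(600L, 390L, 6L, 4L)
)
stats$prop_lines <- stats$num_lines / sum(stats$num_lines)

# Flag the characters below the 1% threshold
is_rare <- stats$prop_lines < 0.01

# Pool the rare characters into a single OTHERS row
others <- data.frame(
  character = "OTHERS",
  num_lines = sum(stats$num_lines[is_rare]),
  prop_lines = sum(stats$prop_lines[is_rare])
)

# Keep the frequent characters and append the pooled row
collapsed <- rbind(stats[!is_rare, ], others)
print(collapsed)
```

Summing the proportions rather than recomputing them keeps the total at 1, so the bar heights still add up to the whole play.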
Sample solution (bar graph):
### Bar plot v. 2
library(ggplot2)
character_stats %>%
  ggplot(aes(x = character, y = prop_lines)) +
  geom_bar(stat = "identity") # bar heights come from prop_lines; equivalent to geom_col()
Sample solution (point graph):
### Graphing each character's first, middle and last lines
library(tidyr)
character_stats %>%
  rename(first = first_appearance, middle = halfway_point, last = last_appearance) %>%
  gather(key = label, value = line, first, middle, last) %>%
  filter(character != "OTHERS") %>%
  ggplot(aes(x = character, y = line, color = label, size = prop_lines)) +
  geom_point() +
  coord_flip()