Working With Text Data

The goal of this lab is to produce two figures. The first shows what share of the lines each character in Macbeth speaks, collapsing infrequent speakers into a single “OTHERS” bin.

The second shows, for each character with at least 1% of total lines, when they first speak, when they have uttered half their lines, and when they last speak.

Reading in the text, and splitting it into lines

Code:

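## Download the Project Gutenberg text as one long string
macbeth_url <- "http://www.gutenberg.org/cache/epub/1129/pg1129.txt"
macbeth_raw <- RCurl::getURL(macbeth_url)
## Split on the \r\n line endings; strsplit() returns a list, hence [[1]]
macbeth_lines <- strsplit(macbeth_raw, "\r\n")[[1]]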
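Why the [[1]]? strsplit() always returns a list with one element per input string, even when there is only one input. A minimal sketch on a made-up string (not the Gutenberg text):

```r
## Toy input, invented for illustration
toy_raw <- "When shall we three meet again?\r\nIn thunder, lightning, or in rain?\r\nWhen the hurlyburly's done,"

pieces <- strsplit(toy_raw, "\r\n")
class(pieces)             # a list, here of length 1

toy_lines <- pieces[[1]]  # pull out the character vector inside the list
length(toy_lines)         # one element per line of text
```

Without the [[1]], later calls such as grep() would be handed a list rather than a character vector of lines.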

Use regular expressions to find every line that starts with two spaces, followed by either MACBETH or LADY MACBETH, followed by a period.

Sample solution:

library(dplyr) # for %>%
set.seed(42)
## Lots of ways you could do this; here's one
## Would also match, e.g., GRANNY MACBETH, if
## such a character existed
## The period is escaped so it matches a literal "." rather than any character
grep("^  [A-Z]* ?MACBETH\\.", macbeth_lines, value = TRUE) %>% 
  sample(10) # random sample, since head() only shows MACBETH lines

##  [1] "  MACBETH. Throw physic to the dogs, I'll none of it."       
##  [2] "  MACBETH. She should have died hereafter;"                  
##  [3] "  LADY MACBETH. Donalbain."                                  
##  [4] "  MACBETH. are they? Gone? Let this pernicious hour"         
##  [5] "  MACBETH. Here had we now our country's honor roof'd,"      
##  [6] "  MACBETH. Both of you"                                      
##  [7] "  LADY MACBETH. You have displaced the mirth, broke the good"
##  [8] "  LADY MACBETH. Only look up clear;"                         
##  [9] "  MACBETH. Why should I play the Roman fool and die"         
## [10] "  LADY MACBETH. What, quite unmann'd in folly?"
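A quick way to see what a pattern like this accepts is to try it on a few made-up lines. The sketch below uses toy strings, not lines from the play, and compares an escaped period, which matches only a literal ".", with an unescaped one, which matches any character.

```r
## Toy lines, invented for illustration
toy <- c(
  "  MACBETH. So foul and fair a day I have not seen.",
  "  LADY MACBETH. Give me the daggers.",
  "  MACBETHS are many",            # name followed by a letter, not a period
  "MACBETH. no leading spaces",
  "  BANQUO. How far is't call'd to Forres?"
)

strict <- grepl("^  [A-Z]* ?MACBETH\\.", toy) # literal period required
loose  <- grepl("^  [A-Z]* ?MACBETH.",   toy) # "." accepts any character

data.frame(toy, strict, loose)
```

Only the third toy line is treated differently: the loose pattern accepts it because the "S" after MACBETH satisfies the unescaped ".".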

Get the line numbers and text of all spoken lines (i.e., lines starting with two spaces, then a character name in all caps, followed by a period).

Sample solution:

## Note: previous version of the lab left out the space in the brackets
## causing lines spoken by multi-word characters to be omitted
line_numbers <-
  grep("^  [A-Z ]+\\.", macbeth_lines)
line_text <- macbeth_lines[line_numbers]
data.frame(line_numbers, line_text) %>% head()

##   line_numbers                                       line_text
## 1          138                                          ASCII.
## 2          193                                   P.O. Box 2782
## 3          280    FIRST WITCH. When shall we three meet again?
## 4          282       SECOND WITCH. When the hurlyburly's done,
## 5          284   THIRD WITCH. That will be ere the set of sun.
## 6          285                   FIRST WITCH. Where the place?
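The comment about the space in the brackets is easy to check on toy input. A small sketch, again with invented lines, comparing the pattern with and without the space:

```r
## Toy lines, invented for illustration
toy <- c(
  "  FIRST WITCH. When shall we three meet again?",
  "  ROSS. God save the King!",
  "    Enter three Witches.",
  "  P.O. Box 2782"
)

no_space   <- grepl("^  [A-Z]+\\.",  toy) # single-word names only
with_space <- grepl("^  [A-Z ]+\\.", toy) # multi-word names allowed

data.frame(toy, no_space, with_space)
```

The multi-word speaker is found only when the space is inside the brackets. Both versions also accept the address line, which is how non-dialogue rows can sneak into the result.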

Using the line text you extracted above, use str_extract() from the stringr library to pull out just the character name from each line. The syntax is str_extract(SOURCE_TEXT, QUOTED_REGEX_PATTERN).

Sample solution:

## Note again the space in the brackets.  This is needed to match character
## names that contain spaces (such as LADY MACBETH)
library(stringr)
characters <- str_extract(line_text, "^  [A-Z ]+\\.")
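For comparison, here is a rough base-R counterpart, using regexpr() and regmatches() in place of str_extract(). It is a sketch on toy lines, not the lab's method. One difference worth knowing: str_extract() returns NA for an element with no match, while regmatches() silently drops it, so the result can be shorter than the input.

```r
## Toy lines, invented for illustration; the middle one has no speaker
toy <- c(
  "  FIRST WITCH. Where the place?",
  "    Enter three Witches.",
  "  ROSS. God save the King!"
)
pattern <- "^  [A-Z ]+\\."

m <- regexpr(pattern, toy)       # match positions; -1 where nothing matched
matched <- regmatches(toy, m)    # only the elements that matched

## Rebuild a vector aligned with the input, with NA for non-matches
aligned <- rep(NA_character_, length(toy))
aligned[m != -1] <- matched
aligned
```

Since line_text contains only lines that already matched the pattern, either approach gives one name per line in this lab.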

Aside: Using gsub() to remove the character name from each line of text

Having extracted the full text of each line, we may want to pull out just the actual dialogue. We won’t need this for our plots, but it’s a useful thing to know how to do.

The mapply() function is helpful here. Like lapply(), it calls a function repeatedly, but instead of varying only the first argument, it can vary any number of arguments over multiple vectors or lists of the same length, while holding other arguments fixed if we want. It works like this:

Code:

spoken_text <- 
  mapply(
    FUN      = gsub, 
    pattern  = characters,  ## vary the pattern= argument over elements of `characters`
    x        = line_text,   ## vary the x= argument over elements of `line_text`
    MoreArgs = list(replacement = "") ## hold the replacement= argument fixed
    )
head(spoken_text)

##                              ASCII.                                  P. 
##                                  ""                       "O. Box 2782" 
##                        FIRST WITCH.                       SECOND WITCH. 
##  " When shall we three meet again?"      " When the hurlyburly's done," 
##                        THIRD WITCH.                        FIRST WITCH. 
## " That will be ere the set of sun."                 " Where the place?"

The lists supplied to each argument name will be stepped through together: the first elements of each one will be used together, the second of each used together, etc.
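To make the pairing concrete, here is a minimal sketch on two toy lines. The fixed = TRUE argument is an addition here, not part of the code above: it asks gsub() to treat each name as a literal string, so the period in the name is not read as a wildcard.

```r
## Toy inputs, invented for illustration
toy_names <- c("  FIRST WITCH.", "  ROSS.")
toy_text  <- c("  FIRST WITCH. Where the place?",
               "  ROSS. God save the King!")

stripped <- mapply(
  FUN       = gsub,
  pattern   = toy_names,   # first name goes with first line, and so on
  x         = toy_text,
  MoreArgs  = list(replacement = "", fixed = TRUE), # same for every call
  USE.NAMES = FALSE        # drop the names mapply() would otherwise attach
)
stripped
```

Each call sees one element of toy_names and the matching element of toy_text, which is equivalent to calling gsub() once per line with that line's own pattern.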

Now we’ll glue our various variables into a data frame.

Code:

frame <- 
  data.frame(
    line = line_numbers, 
    character = characters,
    line_text = spoken_text)
head(frame)

##   line         character                         line_text
## 1  138            ASCII.                                  
## 2  193                P.                       O. Box 2782
## 3  280      FIRST WITCH.   When shall we three meet again?
## 4  282     SECOND WITCH.       When the hurlyburly's done,
## 5  284      THIRD WITCH.  That will be ere the set of sun.
## 6  285      FIRST WITCH.                  Where the place?

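My favorite character from Macbeth was definitely ASCII. (ASCII and P are false matches, of course: both rows come before the play's first speech at line 280, and the second is part of a postal address.)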

Pulling out just the character name itself

We’d like our graph not to include extraneous spaces or periods in the character names. We could strip these out with gsub(), one kind of extra stuff at a time; alternatively, we can use a fancier form of regular expression, with a capture group, to replace each character name string with just the actual name. It works like this:

Code:

## Put parens around the part of the pattern you care about
## Refer to "whatever is matched" with \\1 in the replacement string
## Each parenthetical expression gets an index, \\1, \\2, etc.
frame <- frame %>%
  mutate(
    character = gsub("^  ([A-Z ]+)\\.", "\\1", character))
head(frame)

##   line      character                         line_text
## 1  138          ASCII                                  
## 2  193              P                       O. Box 2782
## 3  280    FIRST WITCH   When shall we three meet again?
## 4  282   SECOND WITCH       When the hurlyburly's done,
## 5  284    THIRD WITCH  That will be ere the set of sun.
## 6  285    FIRST WITCH                  Where the place?
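Capture groups can also be combined and reordered in the replacement. A small sketch on toy lines, using sub() with one group and then two:

```r
## Toy lines, invented for illustration
toy <- c("  LADY MACBETH. Out, damned spot!",
         "  ROSS. God save the King!")

## One group: keep only the name, discard everything else on the line
name_only <- sub("^  ([A-Z ]+)\\..*$", "\\1", toy)

## Two groups: \\1 is the name, \\2 is the speech; put them back in a new order
swapped <- sub("^  ([A-Z ]+)\\. (.*)$", "\\2 [\\1]", toy)

name_only
swapped
```

Groups are numbered by the position of their opening parenthesis, left to right.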

Use your data wrangling skillz to compute the key summary statistics for each character: first line, median line, last line, total number of lines, and proportion of total lines. Sort the results in decreasing order of number of lines.

Sample solution:

library(readr) # for parse_integer()
character_stats <- frame %>%
  group_by(character) %>%
  summarize(
    num_lines = n(),
    prop_lines = n() / nrow(frame),
    first_appearance = min(line), # line is already an integer; no parsing needed
    halfway_point = median(line),
    last_appearance = max(line)) %>%
  arrange(desc(num_lines))
character_stats

## # A tibble: 46 x 6
##    character    num_lines prop_lines first_appearance halfway_point
##    <chr>            <int>      <dbl>            <int>         <dbl>
##  1 MACBETH            146     0.226               433         1569.
##  2 LADY MACBETH        59     0.0913              658         1243.
##  3 MACDUFF             58     0.0898             1149         2426.
##  4 ROSS                39     0.0604              359         2230.
##  5 MALCOLM             38     0.0588              305         2505.
##  6 BANQUO              33     0.0511              434          932.
##  7 FIRST WITCH         23     0.0356              280          465.
##  8 LENNOX              21     0.0325              357         1861.
##  9 DOCTOR              20     0.0310             2499         2700.
## 10 LADY MACDUFF        19     0.0294             2229         2275.
## # ... with 36 more rows, and 1 more variable: last_appearance <int>
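One detail in the output above: first_appearance is an integer column while halfway_point is a double. The sketch below, on made-up line numbers, shows where that comes from. min() and max() of an integer vector stay integers, whereas median() averages the two middle values when the count is even.

```r
## Toy line numbers, invented for illustration
toy_line <- c(280L, 285L, 290L, 300L)

first  <- min(toy_line)     # still an integer
middle <- median(toy_line)  # mean of 285 and 290, so a double
last   <- max(toy_line)     # still an integer

c(class(first), class(middle), class(last))
```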

So that the bar graph we produce actually respects the ordering of the characters, we’ll use factor() to encode the ordering in the character variable.

Code:

character_stats <- character_stats %>%
  mutate(
    character = factor(character, levels = character))
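To see what the levels= argument buys us, compare the default level order with the one we get by passing the already-sorted names. A minimal sketch with three names:

```r
## Names in decreasing order of lines spoken, as in character_stats
toy_chars <- c("MACBETH", "LADY MACBETH", "MACDUFF")

default_levels <- levels(factor(toy_chars))                     # alphabetical
kept_levels    <- levels(factor(toy_chars, levels = toy_chars)) # order as given

default_levels
kept_levels
```

Plotting functions lay out a discrete axis in level order, so the second version keeps the characters sorted by number of lines instead of alphabetically.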

Use filter() and summarize() to count up the lines and find the first and last appearance of any character with less than 1% of total lines. (For the median, just enter NA.) Bind this new summary row to the original data, after excluding the individual low-activity characters.

Sample solution:

low_freq_chars <- character_stats %>%
  filter(prop_lines < 0.01) %>%
  summarize(
    character = "OTHERS",
    num_lines = sum(num_lines),
    prop_lines = sum(prop_lines),
    first_appearance = min(first_appearance),
    halfway_point = NA_real_, # typed NA, matching the numeric median column
    last_appearance = max(last_appearance))
character_stats <- 
  character_stats %>% 
  filter(prop_lines >= 0.01) %>%
  rbind(low_freq_chars)
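What happens to the factor when we rbind() a row whose character value is a plain string? A sketch with plain data frames and invented counts (the lab's object is a tibble, so treat this as an illustration of base rbind() behavior rather than a trace of the code above):

```r
## Toy data, invented for illustration
top <- data.frame(
  character = factor(c("MACBETH", "ROSS"), levels = c("MACBETH", "ROSS")),
  num_lines = c(146L, 39L)
)
others <- data.frame(character = "OTHERS", num_lines = 12L)

combined <- rbind(top, others)
levels(combined$character) # existing levels first, new value appended
```

Because the new value is added as the last level, the OTHERS bin ends up after all the individually named characters when the factor is used on an axis.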

Now we’re ready to plot!

Sample solution (bar graph):

### Bar plot v. 2
library(ggplot2)
character_stats %>% 
  ggplot(aes(x = character, y = prop_lines)) +
  geom_bar(stat = "identity") # bar heights come straight from prop_lines

Sample solution (point graph):

### Graphing each character's first, middle and last lines
library(tidyr)
character_stats %>%
  rename(first = first_appearance, middle = halfway_point, last = last_appearance) %>%
  gather(key = label, value = line, first, middle, last) %>%
  filter(character != "OTHERS") %>%
  ggplot(aes(x = character, y = line, color = label, size = prop_lines)) +
  geom_point() + 
  coord_flip()

STAT 209: Lab 15
