Working With Text Data

Goal

The goal of this lab is to do the wrangling necessary to create the following two figures, starting with just the raw script of the Shakespeare play Macbeth.

The first plot shows what share of lines each character in the play has, collapsing infrequent speakers into a single “OTHERS” bin.

The second shows, for each character with at least 1% of total lines, when they first speak, when they have uttered half their lines, and when they last speak.

Reading in the text, and splitting it into lines

First we need to pull the text into R. We’ll make use of the RCurl package here.

Code:

library(RCurl)
library(tidyverse)
library(magrittr)

macbeth_url <- "http://www.gutenberg.org/cache/epub/1129/pg1129.txt"
macbeth_raw <- getURL(macbeth_url)

Then we’ll take the text and break it up into lines, splitting on linebreaks in the original document.

macbeth_lines <- macbeth_raw %>% 
  strsplit("\r\n") %>%
  extract2(1)
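
As a minimal sketch of what this step does, the toy string below (made up for illustration, not taken from the play) is split on Windows-style line breaks using base R only. strsplit() returns a list with one element per input string, which is why the code above follows it with extract2(1); `[[1]]` is the base R spelling of the same thing.

```r
# Toy input (hypothetical): "lines" separated by carriage return + line feed.
toy_raw <- "First line\r\nSecond line\r\n\r\nFourth line"

# strsplit() returns a list, one element per input string.
toy_split <- strsplit(toy_raw, "\r\n")

# Pull out the first (and only) element to get a plain character vector.
toy_lines <- toy_split[[1]]

toy_lines          # blank lines in the source become empty strings
length(toy_lines)
```

If a file turns out to use "\n" alone as its line break, splitting on "\r\n" returns the whole text as one element; a pattern such as "\r?\n" would handle both cases.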

Using RegEx to extract key patterns from the text

Use Regular Expressions to find all occurrences of either MACBETH or LADY MACBETH preceded by two spaces and followed by a period, at the start of a line.

Note: Below many of the exercises I’m showing you a sample of what your output might look like. To get a representative cross-section without printing out hundreds of lines, I’ve just taken a random sample of 10 lines to display.

Sample solution:

set.seed(42)
## Lots of ways you could do this; here's one
## Would also match, e.g., GRANNY MACBETH, if
## such a character existed
grep("^  ([A-Z][A-Z]* )?MACBETH\\.", macbeth_lines, value = TRUE) %>% 
  sample(10) # random sample, since head() only shows MACBETH lines

##  [1] "  LADY MACBETH. Alack, I am afraid they have awaked"           
##  [2] "  LADY MACBETH. These deeds must not be thought"               
##  [3] "  MACBETH. How say'st thou, that Macduff denies his person"    
##  [4] "  MACBETH. To know my deed, 'twere best not know myself."      
##  [5] "  MACBETH. What man dare, I dare."                             
##  [6] "  MACBETH. See, they encounter thee with their hearts' thanks."
##  [7] "  MACBETH. I will not yield,"                                  
##  [8] "  MACBETH. Sweet remembrancer!"                                
##  [9] "  LADY MACBETH. That which hath made them drunk hath made me"  
## [10] "  MACBETH. We will speak further."

## Here's one that is more specific: only an optional "LADY " prefix,
## and the period after the name is required
grep("^  (LADY )?MACBETH\\.", macbeth_lines, value = TRUE) %>% 
  sample(10)

##  [1] "  LADY MACBETH. Who was it that thus cried? Why, worthy Thane,"
##  [2] "  MACBETH. I wish your horses swift and sure of foot,"         
##  [3] "  MACBETH. O, yet I do repent me of my fury,"                  
##  [4] "  MACBETH. Had I three ears, I'd hear thee."                   
##  [5] "  LADY MACBETH. Say to the King I would attend his leisure"    
##  [6] "  LADY MACBETH. Thou'rt mad to say it!"                        
##  [7] "  LADY MACBETH. Almost at odds with morning, which is which."  
##  [8] "  MACBETH. So shall I, love, and so, I pray, be you."          
##  [9] "  LADY MACBETH. Nought's had, all's spent,"                    
## [10] "  MACBETH. Here had we now our country's honor roof'd,"
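
To see how the pieces of such a pattern behave, here is a small sketch on a hand-typed toy vector (the lines and their indentation are assumptions for illustration, not the downloaded text). It uses an optional group, (LADY )?, and an escaped period, \\., so that only MACBETH or LADY MACBETH followed by a literal period matches.

```r
# Toy lines (hypothetical formatting, typed by hand for this sketch).
toy <- c(
  "  MACBETH. So foul and fair a day I have not seen.",
  "  LADY MACBETH. Give me the daggers.",
  "  BANQUO. How far is't call'd to Forres?",
  "    Enter MACBETH.",                # four leading spaces, not two
  "  MACBETHS. (not a real speaker)"  # no period right after MACBETH
)

# ^        start of line
# (LADY )? the group "LADY " zero or one times
# \\.      a literal period (an unescaped . would match any character)
pattern <- "^  (LADY )?MACBETH\\."

grepl(pattern, toy)                # one TRUE/FALSE per line
grep(pattern, toy, value = TRUE)   # the matching lines themselves
```

Only the first two toy lines match: the stage direction fails on the indentation, and MACBETHS fails because the character after MACBETH is not a period.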

Get the line numbers and text for all lines that begin a speech (i.e., lines starting with two spaces, then a character name in all caps, followed by a period).

Sample solution:

line_numbers <-
  grep("^  [A-Z][A-Z ]+\\.", macbeth_lines)
line_text <- macbeth_lines %>% extract(line_numbers)
data.frame(line_numbers, line_text) %>% head()

##   line_numbers                                       line_text
## 1          148                                          ASCII.
## 2          290    FIRST WITCH. When shall we three meet again?
## 3          292       SECOND WITCH. When the hurlyburly's done,
## 4          294   THIRD WITCH. That will be ere the set of sun.
## 5          295                   FIRST WITCH. Where the place?
## 6          296                   SECOND WITCH. Upon the heath.
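
The difference between getting positions and getting text can be sketched on a toy vector (hand-typed lines with assumed indentation, not the real file). Without value = TRUE, grep() returns the indices of the matching elements, which can then be used to subset the original vector. The toy data also shows why a header-style line such as "  ASCII." slips through: it fits the pattern just as well as a speaker name does.

```r
# Toy lines (hypothetical); comments say what each one is meant to represent.
toy <- c(
  "  ASCII.",                                       # header-style line
  "ACT I. SCENE I.",                                # no leading spaces
  "  FIRST WITCH. When shall we three meet again?",
  "    In thunder, lightning, or in rain?",         # continuation line
  "  SECOND WITCH. When the hurlyburly's done,"
)

# Indices of elements that start with two spaces, a capital letter,
# one or more capitals or spaces, and then a literal period.
idx <- grep("^  [A-Z][A-Z ]+\\.", toy)
idx

# Use the indices to pull out the matching lines.
toy[idx]
```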

Using the line text you extracted above, use str_extract() from the stringr library to pull out just the character name from each line. The syntax is str_extract(SOURCE_TEXT, QUOTED_REGEX_PATTERN) (or you can use the pipe syntax).

Sample solution:

## Note again the space in the brackets.  This is needed to match character
## names that contain spaces (such as LADY MACBETH)
characters <- line_text %>% str_extract("^  [A-Z][A-Z ]+\\.")
characters %>% sample(10)

##  [1] "  BANQUO."           "  MESSENGER."        "  FIRST MURTHERER." 
##  [4] "  SIWARD."           "  MALCOLM."          "  ALL."             
##  [7] "  SECOND MURTHERER." "  MURTHERER."        "  MALCOLM."         
## [10] "  THIRD MURTHERER."
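
For comparison, here is a rough base R analogue of the extraction step, using regexpr() and regmatches() on toy lines (hand-typed for this sketch). One difference to be aware of: str_extract() returns NA for elements with no match, while regmatches() silently drops them, so the sketch fills in the NA values by hand.

```r
# Toy lines (hypothetical); the last one has no speaker prefix.
x <- c(
  "  LADY MACBETH. Donalbain.",
  "  ROSS. God save the King!",
  "no speaker here"
)

# regexpr() gives the position of the first match in each element (-1 if none).
m <- regexpr("^  [A-Z][A-Z ]+\\.", x)

# regmatches() returns only the matched substrings, skipping non-matches,
# so place them into a vector pre-filled with NA to keep positions aligned.
found <- rep(NA_character_, length(x))
found[m != -1] <- regmatches(x, m)
found
```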

Aside: Using gsub() to remove the character name from each line of text

Having extracted the full text of each line, we may want to pull out just the actual dialogue. We won’t need this for our plots, but it’s a useful thing to know how to do.

The mapply() function is helpful here because gsub() accepts only a single pattern per call, while we need a different pattern (the character name) for each line. Like lapply(), mapply() calls a function repeatedly, but instead of varying just the first argument, it can vary any number of arguments over multiple lists or vectors of the same length, and hold other arguments fixed if we want. It works like this:

Code:

spoken_text <- 
  mapply(
    FUN      = gsub, 
    pattern  = characters,  ## vary the pattern= argument over elements of `characters`
    x        = line_text,   ## vary the x= argument over elements of `line_text`
    MoreArgs = list(replacement = "") ## hold the replacement= argument fixed
    )
spoken_text %>% sample(10)

##                                           DUNCAN. 
##     " No more that Thane of Cawdor shall deceive" 
##                                     LADY MACBETH. 
## " Who was it that thus cried? Why, worthy Thane," 
##                                  FIRST MURTHERER. 
##                              " Wast not the way?" 
##                                      FIRST WITCH. 
##               " Ay, sir, all this is so. But why" 
##                                        MURTHERER. 
##    " Ay, my good lord. Safe in a ditch he bides," 
##                                 SECOND MURTHERER. 
##                              " A light, a light!" 
##                                     LADY MACBETH. 
##                                     " Donalbain." 
##                                           SIWARD. 
##                        " Enter, sir, the castle." 
##                                     LADY MACBETH. 
##                      " A kind good night to all!" 
##                                          MACBETH. 
##                    " You are, and do not know't."

The lists supplied to each argument name will be stepped through together: the first elements of each one will be used together, the second of each used together, etc.
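
A two-element sketch makes the pairing visible. The vectors below are toy data typed in for illustration, and fixed = TRUE is an addition for this sketch, not part of the code above: it makes gsub() treat each pattern as a literal string, so the period in a name is not read as the regex "any character" symbol.

```r
# Toy inputs (hypothetical): one speaker prefix per line, in the same order.
speakers <- c("  ROSS.", "  LADY MACBETH.")
lines    <- c("  ROSS. God save the King!", "  LADY MACBETH. Donalbain.")

spoken <- mapply(
  FUN      = gsub,
  pattern  = speakers,   # element i of `speakers` ...
  x        = lines,      # ... is paired with element i of `lines`
  MoreArgs = list(replacement = "", fixed = TRUE)  # same for every call
)

# The result is named after the first varying argument (the patterns),
# which is why the output above prints a name over each value.
spoken
unname(spoken)
```

On these toy lines, dropping fixed = TRUE gives the same result, since a regex period also matches a literal period.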

Now we’ll glue our various variables into a data frame.

Code:

frame <- 
  data.frame(
    line = line_numbers, 
    character = characters,
    line_text = spoken_text)
head(frame)

##   line       character                         line_text
## 1  148          ASCII.                                  
## 2  290    FIRST WITCH.   When shall we three meet again?
## 3  292   SECOND WITCH.       When the hurlyburly's done,
## 4  294    THIRD WITCH.  That will be ere the set of sun.
## 5  295    FIRST WITCH.                  Where the place?
## 6  296   SECOND WITCH.                   Upon the heath.

My favorite character from Macbeth was definitely ASCII.

Cleaning the data: Pulling out just character names

We’d like our graph not to include extraneous spaces or periods in the character names. We could strip these out with several gsub() calls, one for each kind of extra stuff; alternatively, we can use a capture group in the regular expression and replace each character name string with just the captured name. It works like this:

Code:

## Put parens around the part of the pattern you care about
## Refer to "whatever is matched" with \\1 in the replacement string
## Each parenthetical expression gets an index, \\1, \\2, etc.
frame <- frame %>%
  mutate(
    character = gsub("^  ([A-Z ]+)\\.", "\\1", character))
head(frame)

##   line    character                         line_text
## 1  148        ASCII                                  
## 2  290  FIRST WITCH   When shall we three meet again?
## 3  292 SECOND WITCH       When the hurlyburly's done,
## 4  294  THIRD WITCH  That will be ere the set of sun.
## 5  295  FIRST WITCH                  Where the place?
## 6  296 SECOND WITCH                   Upon the heath.
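
Here is a small standalone sketch of capture groups on toy strings (typed in for illustration). The first call keeps only what the single group matched; the second, which is an extra example and not something the lab needs, uses two groups to show how \\1 and \\2 refer to them in order.

```r
# Toy name strings (hypothetical), shaped like the extracted speaker prefixes.
raw_names <- c("  FIRST WITCH.", "  LADY MACBETH.", "  ROSS.")

# One group: keep the name, drop the leading spaces and trailing period.
clean <- gsub("^  ([A-Z ]+)\\.", "\\1", raw_names)
clean

# Two groups: \\1 is the first word, \\2 the second; swap them.
# Strings that do not match the pattern are returned unchanged.
swapped <- gsub("^([A-Z]+) ([A-Z]+)$", "\\2, \\1", c("LADY MACBETH", "ROSS"))
swapped
```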

Computing some summary statistics

Use your data wrangling skillz to compute the key summary statistics for each character: first line, median line, last line, total number of lines, and percentage of total lines. Sort the results in decreasing order of number of lines. There’s no text manipulation needed here, but it’s a necessary step for our graphs.

Sample solution:

library(mosaic) # mosaic's sample() can sample rows of a data frame; it adds an orig.id column
character_stats <- frame %>%
  group_by(character) %>%
  summarize(
    num_lines = n(),
    pct_lines = 100 * n() / nrow(frame),
    first_appearance = min(line),
    halfway_point = round(median(line)),
    last_appearance = max(line)) %>%
  arrange(desc(num_lines))
character_stats %>% sample(10)

## # A tibble: 10 x 7
##    character num_lines pct_lines first_appearance halfway_point
##    <chr>         <int>     <dbl>            <int>         <dbl>
##  1 ROSS             39     6.06               369          2240
##  2 YOUNG SI…         3     0.466             3030          3032
##  3 BOTH MUR…         2     0.311             1557          1571
##  4 SEYTON            5     0.776             2842          2847
##  5 SERVANT           5     0.776             1600          2819
##  6 MURTHERER         4     0.621             1736          1741
##  7 BANQUO           33     5.12               444           942
##  8 SECOND A…         2     0.311             2127          2128
##  9 LADY MAC…        59     9.16               668          1253
## 10 MACDUFF          58     9.01              1159          2436
## # … with 2 more variables: last_appearance <int>, orig.id <chr>

So that the bar graph we produce actually respects the ordering of the characters, we’ll use factor() to encode the ordering in the character variable.

Code:

character_stats <- character_stats %>%
  mutate(
    character = factor(character, levels = character))
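
The effect of the levels= argument can be checked on a toy data frame (the names are real but the counts are made up for this sketch). By default factor() sorts its levels alphabetically, which is the order a bar graph would use; passing the already-sorted column as levels= keeps the order of the rows instead.

```r
# Toy summary table (hypothetical counts), already sorted by num_lines.
stats <- data.frame(
  character = c("MACBETH", "LADY MACBETH", "MACDUFF"),
  num_lines = c(30, 20, 10),
  stringsAsFactors = FALSE
)

# Default: levels are sorted alphabetically.
default_levels <- levels(factor(stats$character))
default_levels

# With levels = the column itself: levels follow the row order.
ordered_levels <- levels(factor(stats$character, levels = stats$character))
ordered_levels
```

This only works when the values are unique, as they are after one row per character has been computed.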

We haven’t consolidated minor characters into OTHERS yet, but let’s see what our bar graph would look like so far.

Code:

character_stats %>% 
  ggplot(aes(x = character, y = pct_lines)) +
  geom_bar(stat = "identity") +
  scale_x_discrete(name = "Character") +
  scale_y_continuous(name = "% of Total Lines") +
  theme(
    axis.text.x = element_text(angle = 60, hjust = 1))

Consolidating infrequent speakers

Use filter() and summarize() to count up the lines and find the first and last appearance of any character with less than 1% of total lines, treated as a single group. (For the median, just input NA.) Bind this new summary row to the original data, having excluded the individual low-activity characters.

Sample solution:

### Combine infrequent speakers into an "OTHERS" bin
low_freq_chars <- character_stats %>%
  filter(pct_lines < 1) %>%
  summarize(
    character = "OTHERS",
    num_lines = sum(num_lines),
    pct_lines = sum(pct_lines),
    first_appearance = min(first_appearance),
    halfway_point = NA,
    last_appearance = max(last_appearance))

### Replacing the lower activity characters with "OTHERS"
character_stats <- 
  character_stats %>% 
  filter(pct_lines >= 1) %>%
  rbind(low_freq_chars)
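
The same consolidation logic can be traced by hand on a tiny table. This is a base R sketch on made-up numbers (the characters A to D and all counts are hypothetical), not a replacement for the dplyr solution above: flag the rows under the 1% threshold, collapse them into one OTHERS row, and bind that row onto the remaining ones.

```r
# Toy per-character summary (hypothetical values).
toy_stats <- data.frame(
  character        = c("A", "B", "C", "D"),
  num_lines        = c(120, 70, 1, 1),
  first_appearance = c(10, 40, 5, 90),
  last_appearance  = c(300, 280, 6, 95),
  stringsAsFactors = FALSE
)
toy_stats$pct_lines <- 100 * toy_stats$num_lines / sum(toy_stats$num_lines)

# Rows below the 1% threshold get folded together.
minor <- toy_stats$pct_lines < 1

others <- data.frame(
  character        = "OTHERS",
  num_lines        = sum(toy_stats$num_lines[minor]),
  first_appearance = min(toy_stats$first_appearance[minor]),
  last_appearance  = max(toy_stats$last_appearance[minor]),
  pct_lines        = sum(toy_stats$pct_lines[minor]),
  stringsAsFactors = FALSE
)

# Keep the major rows and append the single OTHERS row.
combined <- rbind(toy_stats[!minor, ], others)
combined
```

A quick sanity check after a step like this is that the percentages still sum to 100.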

Now we’re ready to plot!

Sample solution (bar graph):

### Bar plot v. 2
character_stats %>% 
  ggplot(aes(x = character, y = pct_lines)) +
  geom_bar(stat = "identity") +
  scale_x_discrete(name = "Character") +
  scale_y_continuous(name = "% of Total Lines") +
  theme(
    axis.text.x = element_text(angle = 60, hjust = 1))

See if you can produce the following graph, showing, for each character that has at least 1% of total lines, when they first speak, when they have spoken half their lines, and when they last speak.

Sample solution (point graph):

### Graphing each character's first, middle and last lines
character_stats %>%
  filter(character != "OTHERS") %>%  
  rename(
    first  = first_appearance, 
    middle = halfway_point, 
    last   = last_appearance) %>%
  mutate(
    character = reorder(character, first, min)) %>%
  gather(
    key   = label, 
    value = line, 
    first, middle, last) %>%
  ggplot(
    aes(
      x     = character, 
      y     = line)) +
  geom_point(aes(size  = pct_lines)) + 
  geom_line() +
  coord_flip() +
  scale_y_continuous(name = "Line #") +
  scale_x_discrete(name = "Character") +
  guides(size = guide_legend(title = "% of Total Lines"))

Getting Credit

Investigate some other aspect of this or another text, producing a visualization to illustrate your findings. Even if you use Macbeth again, your investigation should not just be a data-wrangling and visualization exercise starting with the cleaned data we already have: find something to explore that requires some text manipulation involving regular expressions to extract things or clean the data. If you use another text, make sure it has enough structure in it that you can get something data-frame like using the tools you know, but not so much that doing this is trivial. Post a link to the text, a snippet containing your code (use the “attach” button on Slack and select “Create New Code Snippet”), your graph, and the honor pledge to #lab14.

STAT 209: Lab 14
