Working With Text Data

Goal

The goal of this lab is to do the wrangling necessary to create the following two figures, starting with just the raw script of the Shakespeare play Macbeth.

The first plot shows what share of lines each character in the play has, collapsing infrequent speakers into a single “OTHERS” bin.

Figure: Share of lines spoken by each character

The second plot shows, for each character that has at least 1% of total lines, when they first speak, when they have spoken half their lines, and when they last speak.

Figure: First, middle and last line spoken by each character

Reading in the text, and splitting it into lines

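First we need to pull the text into R. We’ll make use of the RCurl package here for the getURL() function. The later steps also rely on the tidyverse (for %>%, pluck(), tibble(), str_extract() and the dplyr verbs), so we load that as well.

library(RCurl)
library(tidyverse)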

macbeth_url <- "https://www.gutenberg.org/cache/epub/1129/pg1129.txt"
macbeth_raw <- getURL(macbeth_url)

Then we’ll take the text and break it up into lines, splitting on linebreaks in the original document. The special characters written \r and \n are the “carriage return” and “newline” characters, respectively. Different systems mark the end of a line differently; here we split on the two-character sequence \r\n (the Windows convention).

macbeth_lines <- macbeth_raw %>% 
  strsplit("\r\n") %>%
  pluck(1)
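To see what the split does without downloading anything, here is a minimal sketch on a made-up three-line string (the toy_raw and toy_lines names are just for illustration). It uses base R’s [[1]] in place of pluck(1), and also shows a pattern that should cope with files that mix \r\n and plain \n endings.

```r
# A tiny stand-in for the downloaded text (made-up excerpt),
# using "\r\n" line endings.
toy_raw <- "ACT I\r\nSCENE I.\r\n  FIRST WITCH. When shall we three meet again?"

# strsplit() returns a list with one element per input string,
# so [[1]] pulls out the character vector of lines.
toy_lines <- strsplit(toy_raw, "\r\n")[[1]]
toy_lines
length(toy_lines)   # 3

# "\r?\n" means "an optional carriage return, then a newline",
# so it splits on either style of line ending.
mixed <- strsplit("a\r\nb\nc", "\r?\n")[[1]]
mixed               # "a" "b" "c"
```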

This returns a vector of the individual lines from the text. Let’s put that vector into a data table along with a column that has the numbers of the lines (this will just consist of numbers from 1 to however many lines there are).

macbeth_lines_table <- tibble(
  line_number = seq_along(macbeth_lines),
  text        = macbeth_lines)

Using Regular Expressions to extract key patterns from the text

Regular expressions are representations of patterns in text that can be used to find parts of a text that fit a particular format, as well as to perform find and replace operations. We used some simple forms of regular expressions a couple times: we used them with the dataset of Beatles songs to clean up the formatting of names, as well as to find song titles that had a certain word in them. We also used them to simplify the names of airlines in the last lab.

The following table lists some of the basic syntactic elements of regular expressions.

Here are some examples using the grep() function to find lines that contain a particular pattern as specified by a regular expression. In each case I’m using sample() just to return a random subset of the results. The argument value = TRUE causes grep() to return the content of the lines that contain matches. Without this we’d just get the line numbers that have a match.

Note that the second argument of grep is a vector. In many data-wrangling contexts this will be the name of a column in a dataset, and the whole grep() expression will be used within another wrangling verb.

The related function grepl() returns TRUE or FALSE for every entry being checked, which is often more useful in data-wrangling, since these can be used with filter(), or with mutate() to create a binary variable.
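Here is a small sketch of the difference between the three forms, using a made-up vector of three lines rather than the full play (toy_lines is just an illustrative name).

```r
toy_lines <- c("  MACBETH. Where?",
               "    Come in, without there!",
               "  MACDUFF. O Scotland, Scotland!")

idx   <- grep("MAC", toy_lines)                # positions of the matches: 1 3
hits  <- grep("MAC", toy_lines, value = TRUE)  # the matching elements themselves
flags <- grepl("MAC", toy_lines)               # TRUE FALSE TRUE

# The logical vector from grepl() can be used directly as an index,
# which is the same idea as using it inside filter().
toy_lines[flags]
```

Since grepl() returns one TRUE or FALSE per element, its output always has the same length as the input, which is what makes it fit naturally inside mutate() and filter().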

The `.` symbol: “Match any one character”

grep("MAC.", macbeth_lines, value = TRUE) %>% 
  sample(size = 10)

##  [1] "  LADY MACBETH. I heard the owl scream and the crickets cry."
##  [2] "  MACDUFF. I believe drink gave thee the lie last night."    
##  [3] "  MACBETH. Where?"                                           
##  [4] "  LADY MACBETH. If he had been forgotten,"                   
##  [5] "  MACDUFF. Boundless intemperance"                           
##  [6] "  MACBETH. I wish your horses swift and sure of foot,"       
##  [7] "  MACDUFF. Hail, King, for so thou art. Behold where stands" 
##  [8] "  MACBETH. [Aside.] Come what come may,"                     
##  [9] "  LADY MACBETH. Nought's had, all's spent,"                  
## [10] "  LADY MACBETH. We fail?"

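In order to look for a literal ., we need to precede it with two backslashes. The regular expression itself only needs one backslash (\.), but R also treats the backslash as an escape character inside quoted strings, so we have to type \\ to get a single backslash into the pattern. This tells the regular expression engine to ignore the special meaning of the period and treat it as a literal character. The same is needed when looking for other characters that have special meanings.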

grep("MACBETH\\.", macbeth_lines, value = TRUE) %>% 
  sample(size = 10)

##  [1] "  MACBETH. The service and the loyalty lowe,"                  
##  [2] "  MACBETH. Had I three ears, I'd hear thee."                   
##  [3] "  MACBETH. Thou wast born of woman."                           
##  [4] "  MACBETH. Avaunt, and quit my sight! Let the earth hide thee!"
##  [5] "  MACBETH. Give me your favor; my dull brain was wrought"      
##  [6] "  MACBETH. Here's our chief guest."                            
##  [7] "  MACBETH. No, nor more fearful."                              
##  [8] "  MACBETH. We have scotch'd the snake, not kill'd it."         
##  [9] "  MACBETH. morrow, both."                                      
## [10] "  LADY MACBETH. Woe, alas!"
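A quick way to convince yourself of the difference is to try both patterns on a line where MACBETH is followed by something other than a period. This is a sketch on three hand-picked lines; the fixed = TRUE variant is an alternative not used elsewhere in this lab.

```r
toy_lines <- c("  MACBETH. Where?",
               "  MACBETH, Thane of Glamis and Cawdor, a general in the King's",
               "  LADY MACBETH. We fail?")

any_char <- grepl("MACBETH.", toy_lines)     # "." matches the comma too
literal  <- grepl("MACBETH\\.", toy_lines)   # only a real period will do
any_char   # TRUE  TRUE TRUE
literal    # TRUE FALSE TRUE

# The R string "\\." contains just two characters: a backslash and a period.
nchar("\\.")   # 2

# fixed = TRUE switches off regular expressions entirely,
# so the pattern is matched character for character.
grepl("MACBETH.", toy_lines, fixed = TRUE)   # TRUE FALSE TRUE
```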

Square brackets: Match any character in a set

We can use a pair of square brackets to give a set of characters we want to match. To specify any capital letter we can write A-Z (similarly, any lowercase letter will be matched with a-z, and any numerical digit with 0-9).

grep("MAC[A-Z]", macbeth_lines, value = TRUE) %>% 
  sample(size = 10)

##  [1] "  LADY MACDUFF. Yes, he is dead. How wilt thou do for father?"   
##  [2] "  MACBETH. Ride you this afternoon?"                             
##  [3] "  LADY MACDUFF. Why, I can buy me twenty at any market."         
##  [4] "  MACBETH. [Aside.] Two truths are told,"                        
##  [5] "  MACBETH. Came they not by you?"                                
##  [6] "  LADY MACDUFF. Father'd he is, and yet he's fatherless."        
##  [7] "  MACBETH. Accursed be that tongue that tells me so,"            
##  [8] "  MACBETH. She should have died hereafter;"                      
##  [9] "  MACDUFF. Is thy master stirring?"                              
## [10] "  LADY MACBETH. Wash your hands, put on your nightgown, look not"

We can give multiple sets in the same set of brackets:

grep("MAC[A-Za-z]", macbeth_lines, value = TRUE) %>% 
  sample(size = 10)

##  [1] "  LADY MACBETH. What do you mean?"                               
##  [2] "  LADY MACBETH. Come on,"                                        
##  [3] "  MACBETH. Ay, and a bold one, that dare look on that"           
##  [4] "  MACBETH. Till then, enough. Come, friends.             Exeunt."
##  [5] "  MACDUFF. And all my children?"                                 
##  [6] "  LADY MACBETH. A kind good night to all!"                       
##  [7] "  MACBETH. Ay, in the catalogue ye go for men,"                  
##  [8] "  MACBETH. [Aside.] Time, thou anticipatest my dread exploits."  
##  [9] "  LADY MACDUFF. Ay, that he was."                                
## [10] "  MACBETH. Good repose the while."
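In the play itself these two patterns return much the same lines, because MAC is almost always followed by a capital letter. A made-up vector shows the difference between the sets more clearly (the strings below are invented for illustration, not taken from the text).

```r
toy <- c("MACBETH", "MACbeth", "MAC1", "MAC.")

upper_only <- grepl("MAC[A-Z]", toy)         # TRUE FALSE FALSE FALSE
any_letter <- grepl("MAC[A-Za-z]", toy)      # TRUE  TRUE FALSE FALSE
letter_num <- grepl("MAC[A-Za-z0-9]", toy)   # TRUE  TRUE  TRUE FALSE
```

R’s documentation notes that ranges like A-Z can depend on the locale; the POSIX class [[:upper:]] is a more portable way to say “any capital letter” if that ever causes trouble.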

Parentheticals with alternatives separated by the `|` character

The following pattern will require the specific sequence MAC to be present in the line but allow it to be followed by either BETH or DUFF.

grep("MAC(BETH|DUFF)", macbeth_lines, value = TRUE) %>% 
  sample(size = 10)

##  [1] "  MACBETH. Ay, in the catalogue ye go for men,"                
##  [2] "  MACBETH. You are, and do not know't."                        
##  [3] "  LADY MACDUFF. Wisdom? To leave his wife, to leave his babes,"
##  [4] "  MACBETH. Bring me no more reports; let them fly all!"        
##  [5] "  MACBETH. I will not yield,"                                  
##  [6] "  MACDUFF. O Scotland, Scotland!"                              
##  [7] "  LADY MACBETH. Know you not he has?"                          
##  [8] "  LADY MACBETH. Is Banquo gone from court?"                    
##  [9] "  MACBETH. Into the air, and what seem'd corporal melted"      
## [10] "  MACBETH. Geese, villain?"
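The parentheses matter here: they limit how far the | reaches. A minimal sketch, where the last test string is invented purely to show the difference:

```r
toy <- c("  MACBETH. Where?",
         "  MACDUFF. O Scotland, Scotland!",
         "MACHINE READABLE COPIES",
         "PLAIN DUFF")   # made-up string, not from the play

grouped   <- grepl("MAC(BETH|DUFF)", toy)  # MAC, then BETH or DUFF
ungrouped <- grepl("MACBETH|DUFF", toy)    # either MACBETH, or DUFF anywhere
grouped     # TRUE TRUE FALSE FALSE
ungrouped   # TRUE TRUE FALSE  TRUE
```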

Using the `^` and `$` characters to indicate the beginning and end of a line

The following will require the line to start with two spaces followed immediately by MAC (and then either BETH or DUFF). This will prevent matching LADY MACBETH or LADY MACDUFF.

grep("^  MAC(BETH|DUFF)", macbeth_lines, value = TRUE) %>% 
  sample(size = 10)

##  [1] "  MACBETH. Blood hath been shed ere now, i' the olden time,"   
##  [2] "  MACBETH. There's one did laugh in 's sleep, and one cried,"  
##  [3] "  MACBETH. Come, we'll to sleep. My strange and self-abuse"    
##  [4] "  MACDUFF. He has no children. All my pretty ones?"            
##  [5] "  MACBETH. Your spirits shine through you. Within this hour at"
##  [6] "  MACDUFF I'll make so bold to call,"                          
##  [7] "  MACBETH. 'Twas a rough fight."                               
##  [8] "  MACBETH. You know your own degrees; sit down. At first"      
##  [9] "  MACBETH. [Aside.] Two truths are told,"                      
## [10] "  MACBETH. [Aside.] Time, thou anticipatest my dread exploits."

The following will find only lines ending with an exclamation point. Unlike the period and question mark, the exclamation point doesn’t have a special meaning in regular expressions, so we don’t need backslashes for it to be interpreted literally.

grep("!$", macbeth_lines, value = TRUE) %>% 
  sample(size = 10)

##  [1] "    Come in, without there!"                                      
##  [2] "  ROSS. 'Gainst nature still!"                                    
##  [3] "    Which is too nigh your person. Heaven preserve you!"          
##  [4] "    For it hath cow'd my better part of man!"                     
##  [5] "  DUNCAN. O valiant cousin! Worthy gentleman!"                    
##  [6] "    Young fry of treachery!"                                      
##  [7] "    come in, equivocator. [Knocking within.] Knock, knock, knock!"
##  [8] "    Thyself and office deftly show!"                              
##  [9] "    And health on both!"                                          
## [10] "  MACDUFF. That way the noise is. Tyrant, show thy face!"
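The two anchors can also be combined in one pattern. The sketch below, on four lines taken from the output above, uses .* (“any character, any number of times”) to bridge the gap between the start of the line and the end; the last pattern is my own illustration rather than something used later in the lab.

```r
toy <- c("  MACBETH. Where?",
         "  LADY MACBETH. We fail?",
         "    And health on both!",
         "  ROSS. 'Gainst nature still!")

starts_mac <- grepl("^  MAC", toy)   # TRUE FALSE FALSE FALSE
ends_bang  <- grepl("!$", toy)       # FALSE FALSE TRUE TRUE

# Starts with two spaces and a one-word name in caps plus a period,
# AND ends with an exclamation point.
both <- grepl("^  [A-Z]+\\..*!$", toy)
both                                 # FALSE FALSE FALSE TRUE
```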

Wildcards

A ? is an example of a “wildcard” (more formally called a quantifier), which indicates that the preceding character may or may not be present in the pattern.

The following says that the line might start with a single space, but it might also just start with MAC. Note that this will fail to match lines of dialogue, since these start with two spaces.

grep("^ ?MAC", macbeth_lines, value = TRUE)

## [1] "MACHINE READABLE COPIES MAY BE DISTRIBUTED SO LONG AS SUCH COPIES"
## [2] "MACHINE READABLE COPIES OF THIS ETEXT, SO LONG AS SUCH COPIES"

An asterisk (*) in a regular expression is another wildcard, which indicates that the preceding character can occur any number of times in a row (including zero times).

The following says that the line can start with any number of spaces, but possibly none, followed by MAC.

grep("^ *MAC", macbeth_lines, value = TRUE) %>% 
  head(n = 10)

##  [1] "MACHINE READABLE COPIES MAY BE DISTRIBUTED SO LONG AS SUCH COPIES"
##  [2] "MACHINE READABLE COPIES OF THIS ETEXT, SO LONG AS SUCH COPIES"    
##  [3] "  MACBETH, Thane of Glamis and Cawdor, a general in the King's"   
##  [4] "  MACDUFF, Thane of Fife, a nobleman of Scotland"                 
##  [5] "  MACBETH. So foul and fair a day I have not seen."               
##  [6] "  MACBETH. Speak, if you can. What are you?"                      
##  [7] "  MACBETH. Stay, you imperfect speakers, tell me more."           
##  [8] "  MACBETH. Into the air, and what seem'd corporal melted"         
##  [9] "  MACBETH. Your children shall be kings."                         
## [10] "  MACBETH. And Thane of Cawdor too. Went it not so?"

The + character is similar to *, but it requires that there be at least one instance of the preceding character.

This will cause the results to exclude the lines that start with MACHINE since these don’t have any spaces at the beginning.

grep("^ +MAC", macbeth_lines, value = TRUE) %>% 
  head(n = 10)

##  [1] "  MACBETH, Thane of Glamis and Cawdor, a general in the King's"
##  [2] "  MACDUFF, Thane of Fife, a nobleman of Scotland"              
##  [3] "  MACBETH. So foul and fair a day I have not seen."            
##  [4] "  MACBETH. Speak, if you can. What are you?"                   
##  [5] "  MACBETH. Stay, you imperfect speakers, tell me more."        
##  [6] "  MACBETH. Into the air, and what seem'd corporal melted"      
##  [7] "  MACBETH. Your children shall be kings."                      
##  [8] "  MACBETH. And Thane of Cawdor too. Went it not so?"           
##  [9] "  MACBETH. The Thane of Cawdor lives. Why do you dress me"     
## [10] "  MACBETH. [Aside.] Glamis, and Thane of Cawdor!"
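To compare ?, * and + side by side, here is a sketch on four made-up strings that differ only in how many spaces come before MAC. The {2} form at the end is an extra that the examples above don’t use: it asks for exactly two repetitions.

```r
toy <- c("MAC", " MAC", "  MAC", "   MAC")   # 0, 1, 2 and 3 leading spaces

zero_or_one  <- grepl("^ ?MAC", toy)     # TRUE  TRUE FALSE FALSE
zero_or_more <- grepl("^ *MAC", toy)     # TRUE  TRUE  TRUE  TRUE
one_or_more  <- grepl("^ +MAC", toy)     # FALSE TRUE  TRUE  TRUE
exactly_two  <- grepl("^ {2}MAC", toy)   # FALSE FALSE TRUE FALSE
```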

Combining elements

We can use these wildcards to modify sequences longer than a single character by putting the preceding sequence in parentheses.

This will find all the occurrences of either MACBETH or LADY MACBETH at the start of a line beginning with at least one space. By saying (LADY )? we’re saying that the sequence LADY may or may not be present.

grep("^ +(LADY )?MACBETH", macbeth_lines, value = TRUE) %>% 
  head(n = 10)

##  [1] "  MACBETH, Thane of Glamis and Cawdor, a general in the King's"
##  [2] "  LADY MACBETH, his wife"                                      
##  [3] "  MACBETH. So foul and fair a day I have not seen."            
##  [4] "  MACBETH. Speak, if you can. What are you?"                   
##  [5] "  MACBETH. Stay, you imperfect speakers, tell me more."        
##  [6] "  MACBETH. Into the air, and what seem'd corporal melted"      
##  [7] "  MACBETH. Your children shall be kings."                      
##  [8] "  MACBETH. And Thane of Cawdor too. Went it not so?"           
##  [9] "  MACBETH. The Thane of Cawdor lives. Why do you dress me"     
## [10] "  MACBETH. [Aside.] Glamis, and Thane of Cawdor!"
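It’s worth seeing what happens if the parentheses are left out, since then the ? applies only to the single character before it. A small sketch (the last test string is invented to make the point):

```r
toy <- c("  MACBETH. Where?",
         "  LADY MACBETH. We fail?",
         "  LADY MACDUFF. Ay, that he was.",
         "  LADYMACBETH.")   # made-up string, not from the play

# The whole sequence "LADY " is optional.
with_group <- grepl("^ +(LADY )?MACBETH", toy)
with_group      # TRUE TRUE FALSE FALSE

# Without parentheses only the space is optional; LADY is now required.
without_group <- grepl("^ +LADY ?MACBETH", toy)
without_group   # FALSE TRUE FALSE TRUE
```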

Use grep() to find all occurrences of either MACBETH or LADY MACBETH preceded by two spaces and followed by a period, at the start of a line.

SOLUTION

There are lots of ways you could do this; here’s one that matches both MACBETH and LADY MACBETH, though it would also match, say, GRANNY MACBETH, if such a character existed.

set.seed(42)
grep("^  ([A-Z][A-Z]* )?MACBETH\\.", macbeth_lines, value = TRUE) %>% 
  sample(size = 10)

##  [1] "  LADY MACBETH. Alack, I am afraid they have awaked"           
##  [2] "  LADY MACBETH. These deeds must not be thought"               
##  [3] "  MACBETH. How say'st thou, that Macduff denies his person"    
##  [4] "  MACBETH. To know my deed, 'twere best not know myself."      
##  [5] "  MACBETH. What man dare, I dare."                             
##  [6] "  MACBETH. See, they encounter thee with their hearts' thanks."
##  [7] "  MACBETH. I will not yield,"                                  
##  [8] "  MACBETH. Sweet remembrancer!"                                
##  [9] "  LADY MACBETH. That which hath made them drunk hath made me"  
## [10] "  MACBETH. We will speak further."

Here’s one that is more specific.

grep("^  (LADY )?MACBETH\\.", macbeth_lines, value = TRUE) %>% 
  sample(10)

##  [1] "  LADY MACBETH. Infirm of purpose!"                    
##  [2] "  MACBETH. Well then, now"                             
##  [3] "  LADY MACBETH. Help me hence, ho!"                    
##  [4] "  MACBETH. That will never be."                        
##  [5] "  MACBETH. We have scotch'd the snake, not kill'd it." 
##  [6] "  MACBETH. My dearest love,"                           
##  [7] "  LADY MACBETH. Did you send to him, sir?"             
##  [8] "  MACBETH. O, full of scorpions is my mind, dear wife!"
##  [9] "  LADY MACBETH. Come on,"                              
## [10] "  MACBETH. Where?"

Get the line numbers and lines for all lines of spoken dialogue, which always start with two spaces, then a character name in all caps, followed by a period. (It’s ok if your code picks up a few lines fitting this pattern that aren’t dialogue.)

SOLUTION

dialogue_lines <-
  grep("^  [A-Z][A-Z ]+\\.", macbeth_lines)
lines_table <- macbeth_lines_table %>% 
  filter(line_number %in% dialogue_lines)

Using the line text you extracted above, use the str_extract() function to pull out just the character name from each line, putting it in a new column called character. The syntax is str_extract(SOURCE_TEXT, QUOTED_REGEX_PATTERN) (or you can use the pipe syntax).

SOLUTION

Note the space in the brackets. This is needed to match character names that contain spaces (such as LADY MACBETH).

lines_table <- lines_table %>%
  mutate(
    character = str_extract(text, "^  [A-Z][A-Z ]+\\."))
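To see what str_extract() gives back, including what happens when a line doesn’t match at all, here is a sketch on three lines from the play (this assumes the stringr package, which is loaded as part of the tidyverse).

```r
library(stringr)

toy <- c("  MACBETH. Where?",
         "  LADY MACBETH. We fail?",
         "    And health on both!")

speaker <- str_extract(toy, "^  [A-Z][A-Z ]+\\.")
speaker
# "  MACBETH."  "  LADY MACBETH."  NA
```

Only the part of the string that matched the pattern is returned, leading spaces and final period included, and a line with no match comes back as NA rather than being dropped.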

Using `gsub()` to remove the character name from each line of text

Having extracted the full text of each line, we may want to pull out just the actual dialogue. We won’t need this for our plots, but it’s a useful thing to know how to do.

The mapply() function is helpful here. Like lapply() it allows us to call a function on a list of arguments, but instead of just varying the first argument, we can vary arbitrary numbers of arguments over multiple lists of the same length, and then if we want, hold other arguments fixed. It works like this:

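lines_table <- lines_table %>%
  mutate(
    # Step through character and text in parallel, calling
    # gsub(pattern = character[i], replacement = "", x = text[i]) for each row
    spoken_text = mapply(
      FUN         = gsub,
      pattern     = character,
      x           = text,
      MoreArgs    = list(replacement = ""),
      USE.NAMES   = FALSE))  # don't name each result after its pattern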

The lists supplied to each argument name will be stepped through together: the first elements of each one will be used together, the second of each used together, etc.
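Here is the same idea on two rows’ worth of made-up input, outside of mutate(), so you can see the pairing directly (patterns and texts are illustrative names standing in for the character and text columns).

```r
patterns <- c("  MACBETH.", "  ROSS.")
texts    <- c("  MACBETH. Where?", "  ROSS. 'Gainst nature still!")

# Call 1: gsub(pattern = "  MACBETH.", replacement = "", x = "  MACBETH. Where?")
# Call 2: gsub(pattern = "  ROSS.",    replacement = "", x = "  ROSS. 'Gainst nature still!")
spoken <- mapply(
  FUN      = gsub,
  pattern  = patterns,
  x        = texts,
  MoreArgs = list(replacement = ""))

# By default mapply() names each result after the first varying argument,
# so unname() gives a cleaner look at the values.
unname(spoken)           # " Where?" " 'Gainst nature still!"

# A leading space is left behind; trimws() removes it.
trimws(unname(spoken))   # "Where?" "'Gainst nature still!"
```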

My favorite character from Macbeth was definitely ASCII.

Cleaning the data: Pulling out just character names

We’d like our graph not to include extraneous spaces or periods in the character names. We could pull these out using gsub() by filtering out each kind of extra stuff; alternatively we can use a fancified form of regular expression to replace each character name string with just the actual name. This works as follows.

We put parens around parts of the pattern we want to be able to refer back to. We can then use \1, \2, etc. in the replacement string to refer to whatever is matched by the first parenthetical, the second parenthetical, etc. (As with the escaped period, the backslash has to be doubled inside an R string, so in code we type "\\1".)

lines_table <- lines_table %>%
  mutate(
    character = gsub("^  ([A-Z][A-Z ]+)\\.", "\\1", character))

We can read this as saying: “Find instances of two spaces at the start of a line, followed by one capital letter, then one or more additional capital letters or spaces, followed by a period. Replace all of this with just the sequence of capital letters and/or spaces (which we designate as \1 by putting parentheses around it).”
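A short sketch of backreferences on three name strings of the kind str_extract() produces. The second call, with two groups, is just an extra illustration of \2 and isn’t needed for the lab.

```r
labels <- c("  MACBETH.", "  LADY MACBETH.", "  FIRST WITCH.")

# Keep only what the parenthesized part matched.
clean <- gsub("^  ([A-Z][A-Z ]+)\\.", "\\1", labels)
clean     # "MACBETH" "LADY MACBETH" "FIRST WITCH"

# With two groups we can rearrange the pieces in the replacement.
swapped <- gsub("^  ([A-Z]+) ([A-Z]+)\\.", "\\2, \\1", "  LADY MACBETH.")
swapped   # "MACBETH, LADY"
```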

Computing some summary statistics

Use your data wrangling skillz to compute the key summary statistics for each character: first line, median line, last line, total number of lines, and percentage of total lines. Call the resulting table character_stats, and sort the results in decreasing order of number of lines. There’s no text manipulation needed here, but it’s a necessary step for our graphs.

SOLUTION

So that the bar graph we produce has the characters arranged in order of the number of lines they have, we’ll use factor() to encode the order the characters appear in the sorted table in the character variable itself. (Remove eval = FALSE below once you have created character_stats)

character_stats <- character_stats %>%
  mutate(
    character = factor(character, levels = character))
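To see why this matters, compare the default level order with the one we get by passing the already-sorted names as levels. This sketch uses a small base R data frame with made-up line counts (toy_stats is a hypothetical stand-in for character_stats).

```r
toy_stats <- data.frame(
  character = c("MACBETH", "LADY MACBETH", "MACDUFF"),
  n_lines   = c(30, 20, 10),   # made-up counts, already sorted
  stringsAsFactors = FALSE)

# By default, factor levels are put in alphabetical order...
default_levels <- levels(factor(toy_stats$character))
default_levels   # "LADY MACBETH" "MACBETH" "MACDUFF"

# ...but we can ask for the order the names appear in the sorted table.
toy_stats$character <- factor(toy_stats$character,
                              levels = toy_stats$character)
levels(toy_stats$character)   # "MACBETH" "LADY MACBETH" "MACDUFF"
```

ggplot2 arranges a discrete axis by factor level, so the second ordering is the one that puts the tallest bar first.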

We haven’t consolidated minor characters into OTHERS yet, but let’s see what our bar graph would look like so far. Make a bar graph with character names on the x axis and the percentage of lines spoken by that character on the y axis. The bars should be arranged from tallest to shortest (this should happen automatically due to the reordering of the character column that we did). To get the character names to be angled a bit, you can use theme(axis.text.x = element_text(angle = 60, hjust = 1)) (this is a component of the plot of the same kind as geom_bar(), etc.; that is, add it to the plot code with +).

SOLUTION

Consolidating infrequent speakers

Use filter() and summarize() to count up the total lines, and find the first and last appearance, across all characters who each have less than 1% of total lines. Have your summarize() return the same set of columns that we have in the original data, with the character column set to "OTHERS". (For the median line just input NA.) Using bind_rows(), append this new summary row to the original data, having excluded the individual low-activity characters.

SOLUTION

Now we’re ready to plot! Reproduce the bar plot using this modified data. You’ll need to repeat the mutate step of setting the levels of the character column: because we’ve introduced a new “character” in OTHERS, the column will be converted back to an unordered text column.
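Here is a sketch of that conversion on made-up data, so you can see the column change type and then get its ordering back. The data frames and counts are invented; this assumes the dplyr package, and the behavior shown is what I’d expect from current versions of bind_rows().

```r
library(dplyr)

major <- data.frame(
  character = factor(c("MACBETH", "LADY MACBETH"),
                     levels = c("MACBETH", "LADY MACBETH")),
  n_lines   = c(30, 20))   # made-up counts

others <- data.frame(character = "OTHERS", n_lines = 5,
                     stringsAsFactors = FALSE)

combined <- bind_rows(major, others)
class(combined$character)   # "character": the factor ordering is gone

# Re-apply the same mutate step to restore the row order as the level order.
combined <- combined %>%
  mutate(character = factor(character, levels = character))
levels(combined$character)  # "MACBETH" "LADY MACBETH" "OTHERS"
```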

SOLUTION

See if you can produce the second graph on your own, showing, for each character that has at least 1% of total lines, when they first speak, when they have spoken half their lines, and when they last speak. You’ll need to do a bit of additional data-wrangling before plotting.

SOLUTION

Investigate some other aspect of this or another text, producing a visualization to illustrate your findings. Even if you use Macbeth again, your investigation should not just be a data-wrangling and visualization exercise starting with the cleaned data we already have: find something to explore that requires some text manipulation involving regular expressions to extract things or clean the data. If you use another text, make sure it has enough structure in it that you can get something data-frame like using the tools you know, but not so much that doing this is trivial. Post a link to the text, a snippet containing your code, and your graph to #lab16.

STAT 209: Lab 16
