Working With Text Data

Goal

The goal of this lab is to do the wrangling necessary to create the following two figures, starting with just the raw script of the Shakespeare play Macbeth.

The first plot shows what share of lines each character in the play has, collapsing infrequent speakers into a single “OTHERS” bin.

Figure: Share of lines spoken by each character

The second plot shows, for each character that has at least 1% of total lines, when they first speak, when they have spoken half their lines, and when they last speak.

Figure: First, middle and last line spoken by each character

Reading in the text, and splitting it into lines

First we need to pull the text into R. We’ll make use of the RCurl package here for the getURL() function.

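library(RCurl)      # getURL()
library(tidyverse)  # %>%, pluck(), tibble(), plus the dplyr, stringr and ggplot2 functions used later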

macbeth_url <- "https://www.gutenberg.org/cache/epub/1129/pg1129.txt"
macbeth_raw <- getURL(macbeth_url)

Then we’ll take the text and break it up into lines, splitting on the line breaks in the original document. The special characters written \r and \n are the “carriage return” and “newline” characters, respectively; this file ends each line with the pair \r\n, so that is the pattern we split on.

macbeth_lines <- macbeth_raw %>% 
  strsplit("\r\n") %>%
  pluck(1)
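
To see what the split is doing, here is a minimal sketch on a made-up three-line string. The names toy_raw, toy_lines and mixed_lines are illustrative and not part of the lab, and [[1]] is used as a base R stand-in for pluck(1). The second pattern, "\r?\n", is my own suggestion for files whose line endings are not consistently \r\n.

```r
# Toy string standing in for the downloaded text (hypothetical data)
toy_raw <- "First line\r\nSecond line\r\nThird line"

# strsplit() returns a list with one element per input string;
# [[1]] pulls out the character vector of lines, much like pluck(1)
toy_lines <- strsplit(toy_raw, "\r\n")[[1]]
print(toy_lines)

# "\r?\n" makes the carriage return optional, so the same call also
# splits text that uses a bare "\n" for some of its line breaks
mixed_lines <- strsplit("First line\r\nSecond line\nThird line", "\r?\n")[[1]]
print(mixed_lines)
```

Both calls give the same three-element vector.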

This returns a vector of the individual lines from the text. Let’s put that vector into a data table along with a column that has the numbers of the lines (this will just consist of numbers from 1 to however many lines there are).

macbeth_lines_table <- tibble(
  line_number = seq_along(macbeth_lines),  # 1, 2, ..., number of lines
  text        = macbeth_lines)

Using Regular Expressions to extract key patterns from the text

Regular expressions are representations of patterns in text that can be used to find parts of a text that fit a particular format, as well as to perform find and replace operations. We used some simple forms of regular expressions a couple times: we used them with the dataset of Beatles songs to clean up the formatting of names, as well as to find song titles that had a certain word in them. We also used them to simplify the names of airlines in the last lab.

The following table lists some of the basic syntactic elements of regular expressions.

Here are some examples using the grep() function to find lines that contain a particular pattern as specified by a regular expression. In most cases I’m using sample() just to return a random subset of the results. The argument value = TRUE causes grep() to return the content of the lines that contain matches. Without this we’d just get the positions (indices) of the matching lines within the vector.

Note that the second argument of grep is a vector. In many data-wrangling contexts this will be the name of a column in a dataset, and the whole grep() expression will be used within another wrangling verb.

The related function grepl() returns TRUE or FALSE for every entry being checked, which is often more useful in data-wrangling, since these can be used with filter(), or with mutate() to create a binary variable.
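
Here is a small sketch of the difference between the three forms, using three made-up lines in the style of the script. The names toy_lines and toy_table are illustrative, not objects from the lab.

```r
library(dplyr)

# Three made-up lines in the style of the script (hypothetical data)
toy_lines <- c("  MACBETH. Speak, if you can.",
               "    Awake, awake!",
               "  MACDUFF. O horror, horror, horror!")

match_index <- grep("MAC", toy_lines)                # positions: 1 3
match_text  <- grep("MAC", toy_lines, value = TRUE)  # the matching lines themselves
match_flag  <- grepl("MAC", toy_lines)               # TRUE FALSE TRUE

# Because grepl() returns one TRUE/FALSE per row, it slots into filter() and mutate()
toy_table <- tibble(line_number = seq_along(toy_lines), text = toy_lines)
mac_rows  <- toy_table %>% filter(grepl("MAC", text))
flagged   <- toy_table %>% mutate(mentions_mac = grepl("MAC", text))
print(flagged)
```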

The `.` symbol: “Match any one character”

grep("MAC.", macbeth_lines, value = TRUE) %>% 
  sample(size = 10)

##  [1] "  MACBETH. This is a sorry sight.           [Looks on his hands."
##  [2] "  LADY MACBETH. Infirm of purpose!"                              
##  [3] "  LADY MACDUFF. Wisdom? To leave his wife, to leave his babes,"  
##  [4] "  MACDUFF. Wherefore did you so?"                                
##  [5] "  LADY MACBETH. My hands are of your color, but I shame"         
##  [6] "  LADY MACBETH. Fie, for shame!"                                 
##  [7] "  MACDUFF. How does my wife?"                                    
##  [8] "  MACDUFF. I know this is a joyful trouble to you,"              
##  [9] "  MACBETH. So shall I, love, and so, I pray, be you."            
## [10] "  LADY MACBETH. These deeds must not be thought"

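In order to look for a literal ., we need to precede it with two backslashes. The regular expression itself only needs one backslash, but the backslash is also the escape character in R strings, so we type \\ to get a single \ into the pattern. That backslash tells the regex engine to ignore the special meaning of the period and treat it as a literal character. The same trick is needed when looking for other characters that have special meanings.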

grep("MACBETH\\.", macbeth_lines, value = TRUE) %>% 
  sample(size = 10)

##  [1] "  LADY MACBETH. He has almost supp'd. Why have you left the"     
##  [2] "  MACBETH. Here's our chief guest."                              
##  [3] "  LADY MACBETH. Thou'rt mad to say it!"                          
##  [4] "  MACBETH. How now, you secret, black, and midnight hags?"       
##  [5] "  MACBETH. Your children shall be kings."                        
##  [6] "  MACBETH. Take thy face hence.                    Exit Servant."
##  [7] "  MACBETH. Avaunt, and quit my sight! Let the earth hide thee!"  
##  [8] "  MACBETH. Tonight we hold a solemn supper, sir,"                
##  [9] "  MACBETH. Tell me, thou unknown power-"                         
## [10] "  MACBETH. You are, and do not know't."
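
A quick sketch of what the escaping changes, on two toy lines (the second is invented so that something other than a period follows MACBETH). The fixed = TRUE variant is an alternative I am adding for comparison; it is not used elsewhere in this lab.

```r
toy_lines <- c("  MACBETH. Where?", "  MACBETHS castle")  # second line is made up

# Unescaped: "." matches any character, so both lines match
any_char <- grepl("MACBETH.", toy_lines)        # TRUE TRUE

# Escaped: the R string "\\." holds the two characters \ and .
# which the regex engine reads as "a literal period"
literal_dot <- grepl("MACBETH\\.", toy_lines)   # TRUE FALSE

# fixed = TRUE switches off regex interpretation for the whole pattern
fixed_dot <- grepl("MACBETH.", toy_lines, fixed = TRUE)  # TRUE FALSE

print(nchar("\\."))  # 2: one backslash plus one period
```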

Square brackets: Match any character in a set

We can use a pair of square brackets to give a set of characters we want to match. To specify any capital letter we can write A-Z (similarly, any lowercase letter will be matched with a-z, and any numerical digit with 0-9).

grep("MAC[A-Z]", macbeth_lines, value = TRUE) %>% 
  sample(size = 10)

##  [1] "  MACBETH. The rest is labor, which is not used for you."        
##  [2] "  MACBETH. There's one did laugh in 's sleep, and one cried,"    
##  [3] "  LADY MACDUFF. Whither should I fly?"                           
##  [4] "  LADY MACBETH. Out, damned spot! Out, I say! One- two -why then"
##  [5] "  MACBETH. Blood hath been shed ere now, i' the olden time,"     
##  [6] "  MACBETH. The labor we delight in physics pain."                
##  [7] "  MACBETH. They have tied me to a stake; I cannot fly,"          
##  [8] "  MACBETH. Avaunt, and quit my sight! Let the earth hide thee!"  
##  [9] "  MACBETH. Had I three ears, I'd hear thee."                     
## [10] "THE TRAGEDY OF MACBETH"

We can give multiple ranges inside the same pair of brackets; a character matches if it falls in any of them:

grep("MAC[A-Za-z]", macbeth_lines, value = TRUE) %>% 
  sample(size = 10)

##  [1] "  LADY MACBETH. That which hath made them drunk hath made me"    
##  [2] "  MACBETH. So shall I, love, and so, I pray, be you."            
##  [3] "  MACBETH. Good repose the while."                               
##  [4] "  MACBETH. Till then, enough. Come, friends.             Exeunt."
##  [5] "  MACBETH. The table's full."                                    
##  [6] "  MACDUFF. Wherefore did you so?"                                
##  [7] "  LADY MACBETH. O, never"                                        
##  [8] "  MACDUFF. Turn, hell hound, turn!"                              
##  [9] "  MACBETH. [Aside.] Time, thou anticipatest my dread exploits."  
## [10] "  LADY MACBETH. What beast wast then"
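
To make the effect of each range visible, here is a sketch on four toy strings (made up for illustration) where the character after MAC differs.

```r
toy <- c("MACBETH", "MACbeth", "MAC11", "MAC")

upper_only   <- grepl("MAC[A-Z]", toy)        # next character must be a capital
either_case  <- grepl("MAC[A-Za-z]", toy)     # capital or lowercase letter
letter_digit <- grepl("MAC[A-Za-z0-9]", toy)  # any letter or digit

print(data.frame(toy, upper_only, either_case, letter_digit))
```

The bare string "MAC" never matches, because the brackets always require one more character to be present.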

Parentheticals with alternatives separated by the `|` character

The following pattern will require the specific sequence MAC to be present in the line but allow it to be followed by either BETH or DUFF.

grep("MAC(BETH|DUFF)", macbeth_lines, value = TRUE) %>% 
  sample(size = 10)

##  [1] "  LADY MACBETH. A foolish thought, to say a sorry sight."  
##  [2] "  LADY MACBETH. Consider it not so deeply."                
##  [3] "  MACDUFF. See, who comes here?"                           
##  [4] "  LADY MACBETH. What beast wast then"                      
##  [5] "  MACBETH. I have almost forgot the taste of fears:"       
##  [6] "  MACBETH. My name's Macbeth."                             
##  [7] "  MACBETH. We will speak further."                         
##  [8] "  LADY MACBETH. Say to the King I would attend his leisure"
##  [9] "  MACBETH. What news more?"                                
## [10] "  MACDUFF. What should he be?"
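
The parentheses matter here. As a sketch on a few toy strings (invented for illustration), compare the grouped pattern with the same alternatives written without parentheses:

```r
toy <- c("MACBETH", "MACDUFF", "MACHINE", "DUFFEL BAG")

# Grouped: MAC followed by either BETH or DUFF
grouped <- grepl("MAC(BETH|DUFF)", toy)    # TRUE TRUE FALSE FALSE

# Ungrouped: | splits the whole pattern, so this means "MACBETH" or "DUFF"
ungrouped <- grepl("MACBETH|DUFF", toy)    # TRUE TRUE FALSE TRUE

print(data.frame(toy, grouped, ungrouped))
```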

Using the `^` and `$` characters to indicate the beginning and end of a line

The following will require the line to start with two spaces followed immediately by MAC (and then either BETH or DUFF). This will prevent matching LADY MACBETH or LADY MACDUFF.

grep("^  MAC(BETH|DUFF)", macbeth_lines, value = TRUE) %>% 
  sample(size = 10)

##  [1] "  MACBETH. Where?"                                        
##  [2] "  MACDUFF. Is thy master stirring?"                       
##  [3] "  MACBETH. Thou are too like the spirit of Banquo Down!"  
##  [4] "  MACBETH. How now, you secret, black, and midnight hags?"
##  [5] "  MACDUFF. This avarice"                                  
##  [6] "  MACBETH. If you shall cleave to my consent, when 'tis," 
##  [7] "  MACDUFF. O, relation"                                   
##  [8] "  MACBETH. Ride you this afternoon?"                      
##  [9] "  MACDUFF. I have no words."                              
## [10] "  MACBETH. Call 'em, let me see 'em."

The following will find only lines ending with an exclamation point. Unlike the period and question mark, the exclamation point doesn’t have a special meaning in regular expressions, so we don’t need backslashes for it to be interpreted literally.

grep("!$", macbeth_lines, value = TRUE) %>% 
  sample(size = 10)

##  [1] "    Awake, awake!"                                      
##  [2] "    To make them kings -the seed of Banquo kings!"      
##  [3] "  LADY MACDUFF. Poor prattler, how thou talk'st!"       
##  [4] "    Stand ay accursed in the calendar!"                 
##  [5] "  MACBETH. Thou are too like the spirit of Banquo Down!"
##  [6] "  DUNCAN. O valiant cousin! Worthy gentleman!"          
##  [7] "  FIRST MURTHERER. What, you egg!"                      
##  [8] "  DUNCAN. See, see, our honor'd hostess!"               
##  [9] "    Under a hand accursed!"                             
## [10] "    So all hail, Macbeth and Banquo!"
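
Here is a sketch of both anchors on three toy lines, including a pattern that uses them together to pin down an entire line. The object names are illustrative.

```r
toy <- c("  MACBETH. Where?",
         "  LADY MACBETH. Fie, for shame!",
         "    Awake, awake!")

starts_with_mac <- grepl("^  MAC", toy)              # TRUE FALSE FALSE
ends_with_bang  <- grepl("!$", toy)                  # FALSE TRUE TRUE

# With both anchors, the pattern has to account for the whole line
whole_line <- grepl("^    Awake, awake!$", toy)      # FALSE FALSE TRUE

print(data.frame(toy, starts_with_mac, ends_with_bang, whole_line))
```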

Wildcards

A ? is an example of what this lab calls a “wildcard” (the more formal term is “quantifier”), which indicates that the preceding character may or may not be present in the pattern.

The following says that the line might start with a single space, but it might also just start with MAC. Note that this will fail to match lines of dialogue, since these start with two spaces.

grep("^ ?MAC", macbeth_lines, value = TRUE)

## [1] "MACHINE READABLE COPIES MAY BE DISTRIBUTED SO LONG AS SUCH COPIES"
## [2] "MACHINE READABLE COPIES OF THIS ETEXT, SO LONG AS SUCH COPIES"

An asterisk (*) in a regular expression is another wildcard, which indicates that the preceding character can occur any number of times in a row (including zero times).

The following says that the line can start with any number of spaces, but possibly none, followed by MAC.

grep("^ *MAC", macbeth_lines, value = TRUE) %>% 
  head(n = 10)

##  [1] "MACHINE READABLE COPIES MAY BE DISTRIBUTED SO LONG AS SUCH COPIES"
##  [2] "MACHINE READABLE COPIES OF THIS ETEXT, SO LONG AS SUCH COPIES"    
##  [3] "  MACBETH, Thane of Glamis and Cawdor, a general in the King's"   
##  [4] "  MACDUFF, Thane of Fife, a nobleman of Scotland"                 
##  [5] "  MACBETH. So foul and fair a day I have not seen."               
##  [6] "  MACBETH. Speak, if you can. What are you?"                      
##  [7] "  MACBETH. Stay, you imperfect speakers, tell me more."           
##  [8] "  MACBETH. Into the air, and what seem'd corporal melted"         
##  [9] "  MACBETH. Your children shall be kings."                         
## [10] "  MACBETH. And Thane of Cawdor too. Went it not so?"

The + character is similar to *, but it requires that there be at least one instance of the preceding character.

This will cause the results to exclude the lines that start with MACHINE since these don’t have any spaces at the beginning.

grep("^ +MAC", macbeth_lines, value = TRUE) %>% 
  head(n = 10)

##  [1] "  MACBETH, Thane of Glamis and Cawdor, a general in the King's"
##  [2] "  MACDUFF, Thane of Fife, a nobleman of Scotland"              
##  [3] "  MACBETH. So foul and fair a day I have not seen."            
##  [4] "  MACBETH. Speak, if you can. What are you?"                   
##  [5] "  MACBETH. Stay, you imperfect speakers, tell me more."        
##  [6] "  MACBETH. Into the air, and what seem'd corporal melted"      
##  [7] "  MACBETH. Your children shall be kings."                      
##  [8] "  MACBETH. And Thane of Cawdor too. Went it not so?"           
##  [9] "  MACBETH. The Thane of Cawdor lives. Why do you dress me"     
## [10] "  MACBETH. [Aside.] Glamis, and Thane of Cawdor!"
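
The three wildcards are easiest to compare side by side. This sketch uses toy strings with zero, one, two and three leading spaces (made up for illustration):

```r
toy <- c("MAC", " MAC", "  MAC", "   MAC")

zero_or_one  <- grepl("^ ?MAC", toy)   # TRUE  TRUE FALSE FALSE
zero_or_more <- grepl("^ *MAC", toy)   # TRUE  TRUE  TRUE  TRUE
one_or_more  <- grepl("^ +MAC", toy)   # FALSE TRUE  TRUE  TRUE

print(data.frame(leading_spaces = 0:3, zero_or_one, zero_or_more, one_or_more))
```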

Combining elements

We can use these wildcards to modify sequences longer than a single character by putting the preceding sequence in parentheses.

This will find all the occurrences of either MACBETH or LADY MACBETH at the start of a line beginning with at least one space. By saying (LADY )? we’re saying that the sequence LADY may or may not be present.

grep("^ +(LADY )?MACBETH", macbeth_lines, value = TRUE) %>% 
  head(n = 10)

##  [1] "  MACBETH, Thane of Glamis and Cawdor, a general in the King's"
##  [2] "  LADY MACBETH, his wife"                                      
##  [3] "  MACBETH. So foul and fair a day I have not seen."            
##  [4] "  MACBETH. Speak, if you can. What are you?"                   
##  [5] "  MACBETH. Stay, you imperfect speakers, tell me more."        
##  [6] "  MACBETH. Into the air, and what seem'd corporal melted"      
##  [7] "  MACBETH. Your children shall be kings."                      
##  [8] "  MACBETH. And Thane of Cawdor too. Went it not so?"           
##  [9] "  MACBETH. The Thane of Cawdor lives. Why do you dress me"     
## [10] "  MACBETH. [Aside.] Glamis, and Thane of Cawdor!"
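
To see why the parentheses are needed, here is a sketch comparing (LADY )? with the same pattern minus the parentheses, on toy strings. The last string is deliberately malformed and does not occur in the play.

```r
toy <- c("  MACBETH. x",
         "  LADY MACBETH. y",
         "  LADY MACDUFF. z",
         "  LADYMACBETH")      # malformed on purpose

# ? applies to the whole group "LADY ", so the group is optional
group_optional <- grepl("^ +(LADY )?MACBETH", toy)   # TRUE TRUE FALSE FALSE

# Without parentheses, ? applies only to the single space before it,
# so LADY itself becomes required
space_optional <- grepl("^ +LADY ?MACBETH", toy)     # FALSE TRUE FALSE TRUE

print(data.frame(toy, group_optional, space_optional))
```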

Use grep() to find all occurrences of either MACBETH or LADY MACBETH preceded by two spaces and followed by a period, at the start of a line.

Get the line numbers and lines for all lines of spoken dialogue, which always start with two spaces, a character name in all caps, followed by a period. (It’s ok if your code picks up a few lines fitting this pattern that aren’t dialogue)

Using the line text you extracted above, use the str_extract() function to pull out just the character name from each line, putting it in a new column called character. The syntax is str_extract(SOURCE_TEXT, QUOTED_REGEX_PATTERN) (or you can use the pipe syntax).
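
As a sketch of how str_extract() behaves (not a solution to the exercise), here it is on three toy lines with a deliberately too-simple pattern. It returns the first match in each string, or NA when there is none.

```r
library(stringr)

toy <- c("  MACBETH. Where?",
         "  LADY MACBETH. Fie, for shame!",
         "    Awake, awake!")

# A run of capital letters followed by a literal period
simple <- str_extract(toy, "[A-Z]+\\.")
print(simple)   # "MACBETH." "MACBETH." NA

# The same call in pipe syntax
piped <- toy %>% str_extract("[A-Z]+\\.")
```

Notice that the second line comes back as just "MACBETH.": this toy pattern does not allow spaces inside the name, which is something your own pattern will need to handle.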

Using `gsub()` to remove the character name from each line of text

Having extracted the full text of each line, we may want to pull out just the actual dialogue. We won’t need this for our plots, but it’s a useful thing to know how to do.

The mapply() function is helpful here. Like lapply() it allows us to call a function on a list of arguments, but instead of just varying the first argument, we can vary arbitrary numbers of arguments over multiple lists of the same length, and then if we want, hold other arguments fixed. It works like this:

(Remove eval = FALSE below when you have done the previous exercises)

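lines_table <- lines_table %>%
  mutate(
    spoken_text = mapply(
      FUN         = gsub,
      pattern     = character,                # varies: the speaker prefix for each row
      x           = text,                     # varies: the full text of the same row
      MoreArgs    = list(replacement = ""),   # held fixed across all calls
      USE.NAMES   = FALSE))                   # return a plain, unnamed vector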

The lists supplied to each argument name will be stepped through together: the first elements of each one will be used together, the second of each used together, etc.
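
A toy version makes the stepping-through easier to see. In this sketch (the vectors patterns and inputs are made up), each call removes a different letter from a different word:

```r
patterns <- c("a", "b", "c")
inputs   <- c("banana", "abba", "cocoa")

# Call 1: gsub("a", "", "banana"); call 2: gsub("b", "", "abba"); and so on
stepped <- mapply(
  FUN       = gsub,
  pattern   = patterns,
  x         = inputs,
  MoreArgs  = list(replacement = ""),
  USE.NAMES = FALSE)
print(stepped)   # "bnn" "aa" "ooa"

# By default mapply() names the result after its first varying argument
named <- mapply(FUN = gsub, pattern = patterns, x = inputs,
                MoreArgs = list(replacement = ""))
print(names(named))   # "a" "b" "c"
```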

My favorite character from Macbeth was definitely ASCII.

Cleaning the data: Pulling out just character names

We’d like our graph not to include extraneous spaces or periods in the character names. We could pull these out using gsub() by filtering out each kind of extra stuff; alternatively we can use a fancified form of regular expression to replace each character name string with just the actual name. This works as follows.

We put parens around parts of the pattern we want to be able to refer back to. We can then use \1, \2, etc. in the replacement string to refer to whatever is matched by the first parenthetical, the second parenthetical, etc. Inside an R string these are typed with a doubled backslash, as "\\1", "\\2", and so on. (Remove eval = FALSE below when you have done the previous exercises)

lines_table <- lines_table %>%
  mutate(
    character = gsub("^  ([A-Z][A-Z ]+)\\.", "\\1", character))

We can read this as saying: “Find instances of two spaces at the start of a line which are followed by a capital letter, and then one or more additional capital letters or spaces, followed by a period. Replace this by just the sequence of capital letters and/or spaces (which we designate as \1 by putting parentheses around it).”
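
Here is the same replacement as a sketch on a few toy name strings, followed by a second, made-up example with two parentheticals to show \2 in action.

```r
raw_names <- c("  MACBETH.", "  LADY MACBETH.", "  FIRST WITCH.")

# Keep only what the parenthetical matched; drop the spaces and the period
clean_names <- gsub("^  ([A-Z][A-Z ]+)\\.", "\\1", raw_names)
print(clean_names)   # "MACBETH" "LADY MACBETH" "FIRST WITCH"

# Two groups: swap their order in the replacement
swapped <- sub("^([A-Z]+) ([A-Z]+)$", "\\2, \\1", "LADY MACBETH")
print(swapped)       # "MACBETH, LADY"
```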

Computing some summary statistics

Use your data wrangling skillz to compute the key summary statistics for each character: first line, median line, last line, total number of lines, and percentage of total lines. Call the resulting table character_stats, and sort the results in decreasing order of number of lines. There’s no text manipulation needed here, but it’s a necessary step for our graphs.
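
If you are unsure what shape the result should have, here is a sketch on a six-row toy table. The table and the column names first_line, median_line, last_line, num_lines and pct_lines are my own placeholders; use whatever names you like in your version.

```r
library(dplyr)

# Toy stand-in for the table of dialogue lines (hypothetical data)
toy_lines_table <- tibble(
  line_number = c(3, 5, 8, 12, 20, 21),
  character   = c("MACBETH", "BANQUO", "MACBETH", "MACBETH", "BANQUO", "ROSS"))

toy_stats <- toy_lines_table %>%
  group_by(character) %>%
  summarize(
    first_line  = min(line_number),
    median_line = median(line_number),
    last_line   = max(line_number),
    num_lines   = n()) %>%
  # summarize() has dropped the grouping, so sum() here is over all characters
  mutate(pct_lines = 100 * num_lines / sum(num_lines)) %>%
  arrange(desc(num_lines))

print(toy_stats)
```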

So that the bar graph we produce has the characters arranged in order of the number of lines they have, we’ll use factor() to encode the order the characters appear in the sorted table in the character variable itself. (Remove eval = FALSE below once you have created character_stats)

character_stats <- character_stats %>%
  mutate(
    character = factor(character, levels = character))
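
The reason this works is that a factor stores its own level order, and ggplot2 arranges a discrete axis by level order. A sketch on three toy names (assumed to be already sorted by number of lines) shows the difference from the default:

```r
toy_names <- c("MACBETH", "LADY MACBETH", "BANQUO")  # assumed sorted by line count

default_factor <- factor(toy_names)                      # levels sorted alphabetically
ordered_factor <- factor(toy_names, levels = toy_names)  # levels kept in table order

print(levels(default_factor))  # "BANQUO" "LADY MACBETH" "MACBETH"
print(levels(ordered_factor))  # "MACBETH" "LADY MACBETH" "BANQUO"
```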

We haven’t consolidated minor characters into OTHERS yet, but let’s see what our bar graph would look like so far. Make a bar graph with character names on the x axis and the percentage of lines spoken by that character on the y axis. The bars should be arranged from tallest to shortest (this should happen automatically due to the reordering of the character column that we did). To get the character names to be angled a bit, you can use theme(axis.text.x = element_text(angle = 60, hjust = 1)). This is a component of the plot of the same kind as geom_bar(), etc.; that is, add it to the plot code with +.

Consolidating infrequent speakers

Use filter() and summarize() to count up the lines, and find the first and last appearance, across all characters with less than 1% of total lines. Have your summarize() return the same set of columns that we have in the original data, with the character column set to "OTHERS". (For the median line just input NA). Using bind_rows(), append this new summary row to the original data, having excluded the individual low-activity characters.
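
The general pattern of summarizing one subset and binding it back on can be sketched on a four-row toy table. All the values and the column names here are placeholders of mine, chosen to mirror the summary statistics described above.

```r
library(dplyr)

# Toy summary table (hypothetical data); the last two rows fall under 1%
toy_stats <- tibble(
  character   = c("MACBETH", "BANQUO", "ROSS", "PORTER"),
  first_line  = c(3, 5, 21, 40),
  median_line = c(8, 12.5, 21, 40),
  last_line   = c(12, 20, 21, 40),
  num_lines   = c(150L, 48L, 1L, 1L),
  pct_lines   = 100 * num_lines / sum(num_lines))

# Collapse the infrequent speakers into one row with matching columns
others <- toy_stats %>%
  filter(pct_lines < 1) %>%
  summarize(
    character   = "OTHERS",
    first_line  = min(first_line),
    median_line = NA_real_,       # not meaningful for the combined bin
    last_line   = max(last_line),
    num_lines   = sum(num_lines),
    pct_lines   = sum(pct_lines))

# Keep the frequent speakers and append the combined row
consolidated <- toy_stats %>%
  filter(pct_lines >= 1) %>%
  bind_rows(others)

print(consolidated)
```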

Now we’re ready to plot! Reproduce the bar plot using this modified data. You’ll need to repeat the mutate step of setting the levels of the character column: because we’ve introduced a new “character” in OTHERS, the column will be converted back to an unordered text column.

See if you can produce the second graph on your own, showing, for each character that has at least 1% of total lines, when they first speak, when they have spoken half their lines, and when they last speak. You’ll need to do a bit of additional data-wrangling before plotting.

Investigate some other aspect of this or another text, producing a visualization to illustrate your findings. Even if you use Macbeth again, your investigation should not just be a data-wrangling and visualization exercise starting with the cleaned data we already have: find something to explore that requires some text manipulation involving regular expressions to extract things or clean the data. If you use another text, make sure it has enough structure in it that you can get something data-frame like using the tools you know, but not so much that doing this is trivial. Post a link to the text, a snippet containing your code, and your graph to #lab16.

STAT 209: Lab 16
