Accessing a database

We’ll need the following packages

The dbConnect() function allows us to open a connection to a remote database. The dbConnect_scidb() function is a wrapper to quickly access the scidb database at Smith College, which the examples (and lab) will use.

The db object is a connection: it doesn’t contain data itself, rather it’s like a portal through which we can bring data from the remote database.

Browsing the available data

In order to know what data is available, we can list the datasets (tables) that are visible through the connection.

## [1] "airports" "carriers" "flights"  "planes"

This is the collection of flights datasets but in database form.

Translating SQL data to an R “view”

We can create an R object that we can then interact with as if it were a dataset we had read in (it hasn’t actually read in the data to memory though, so you can’t open it in the RStudio data viewer)

Using dplyr wrangling with the R view of the data

We can use this flights with dplyr wrangling functions much like we could if it were an actual R dataset. For example:

## Warning: Missing values are always removed in SQL.
## Use `SUM(x, na.rm = TRUE)` to silence this warning
## This warning is displayed only once per session.
## Warning in .local(conn, statement, ...): Decimal MySQL column 2 imported as
## numeric
## # Source:     lazy query [?? x 3]
## # Database:   mysql 5.7.33-log
## #   [@mdsr.cdc7tgkkqd0n.us-east-1.rds.amazonaws.com:/airlines]
## # Ordered by: desc(pct_ontime)
##   name                       N pct_ontime
##   <chr>                  <dbl>      <dbl>
## 1 Delta Air Lines Inc.    1770      0.876
## 2 Endeavor Air Inc.       1067      0.848
## 3 American Airlines Inc.  1055      0.846
## 4 Virgin America           291      0.845
## 5 JetBlue Airways         3341      0.826

Note that not all dplyr functions will work because they require a tibble. For example, the slice_*() functions don’t work.

The equivalent SQL code

The wrangling we did above could be done by writing an SQL query. In fact, R has to translate our dplyr pipeline into SQL before it can be executed.

We can see the auto-generated SQL code, though because it’s generated algorithmically, it’s often a lot less intelligible than actual SQL code written directly would be.

## <SQL>
## SELECT *
## FROM (SELECT `name`, `N`, `pct_ontime`
## FROM (SELECT `LHS`.`carrier` AS `carrier`, `N`, `pct_ontime`, `name`
## FROM (SELECT `carrier`, COUNT(*) AS `N`, SUM(`arr_delay` <= 15.0) / COUNT(*) AS `pct_ontime`
## FROM (SELECT *
## FROM `flights`
## WHERE (`year` = 2013.0 AND `month` = 9.0)) `q01`
## WHERE (`dest` = 'JFK')
## GROUP BY `carrier`) `LHS`
## INNER JOIN `carriers` AS `RHS`
## ON (`LHS`.`carrier` = `RHS`.`carrier`)
## ) `q01`) `q02`
## WHERE (`N` >= 100.0)
## ORDER BY `pct_ontime` DESC

Here’s what the query (with head(n=5)) might look like if we actually wrote it in SQL ourselves. (This will make more sense after you do the lab)

5 records
name N pct_ontime
Delta Air Lines Inc. 1770 0.8757
Endeavor Air Inc. 1067 0.8482
American Airlines Inc. 1055 0.8455
Virgin America 291 0.8454
JetBlue Airways 3341 0.8264

Sometimes we can get the SQL code equivalent to a simple R command by making use of the algorithmic translation.

## Warning: Missing values are always removed in SQL.
## Use `AVG(x, na.rm = TRUE)` to silence this warning
## This warning is displayed only once per session.
## <SQL> AVG(`arr_delay`) OVER ()

Though the above isn’t quite equivalent to what the R function would do. It’s actually more like the R code:

## <SQL> AVG(`arr_delay`) OVER ()

Reading in data as an actual R dataset

If the data isn’t too big, we can convert the whole thing into an actual R dataset (rather than an R view of the SQL data, which is what we did above) with collect(). This will let us use some R code that might not work on the “view”.

But this should be done with caution, because bigger datasets can overload the system (probably don’t want to do this with flights, for instance)

## # A tibble: 1,610 x 2
##    carrier name                                        
##    <chr>   <chr>                                       
##  1 02Q     Titan Airways                               
##  2 04Q     Tradewind Aviation                          
##  3 05Q     Comlux Aviation, AG                         
##  4 06Q     Master Top Linhas Aereas Ltd.               
##  5 07Q     Flair Airlines Ltd.                         
##  6 09Q     Swift Air, LLC                              
##  7 0BQ     DCA                                         
##  8 0CQ     ACM AIR CHARTER GmbH                        
##  9 0GQ     Inter Island Airways, d/b/a Inter Island Air
## 10 0HQ     Polar Airlines de Mexico d/b/a Nova Air     
## # … with 1,600 more rows

To see the difference, we can look at how much memory the “view” object takes up, compared to the collect()ed version

## 3.6 Kb
## 234.8 Kb

It’s about 70 times bigger.

Imagine we had a dataset with a million cases and 100 variables…

##             V1        V2        V3        V4         V5         V6
## 1 9.582709e-01 0.9922512 0.1110860 0.2018943 0.92692726 0.37770663
## 2 8.275782e-01 0.2445587 0.9042847 0.4021900 0.40541440 0.87778504
## 3 5.211565e-05 0.7543153 0.8484690 0.2576616 0.14036474 0.33231068
## 4 5.238584e-01 0.7341216 0.9096582 0.9411570 0.08149133 0.70059507
## 5 1.857047e-01 0.9998941 0.4997949 0.8504782 0.35023587 0.92871656
## 6 3.931427e-01 0.1029113 0.7763494 0.4322868 0.06987871 0.02113734
##          V7        V8        V9        V10        V11        V12       V13
## 1 0.9241682 0.7321434 0.6457445 0.84090746 0.37699683 0.11303564 0.8445392
## 2 0.9409175 0.8463810 0.6960195 0.28127747 0.33710478 0.09618753 0.8602856
## 3 0.4061917 0.6593970 0.4816522 0.04773262 0.60816545 0.93996437 0.8947656
## 4 0.2996085 0.4513319 0.8648686 0.91937293 0.63258345 0.96702538 0.5577387
## 5 0.8847899 0.5804747 0.4582482 0.43036180 0.07115559 0.38610356 0.6712138
## 6 0.1112059 0.9062408 0.6533471 0.23267113 0.95912483 0.74110738 0.5601601
##         V14       V15       V16       V17        V18       V19        V20
## 1 0.4215485 0.5024041 0.3324905 0.5798861 0.73656287 0.9058820 0.06553702
## 2 0.7967206 0.5880284 0.6873303 0.5340180 0.99820694 0.5521920 0.73798591
## 3 0.3784760 0.7905586 0.8359667 0.1974253 0.08187754 0.3770132 0.56443771
## 4 0.5431132 0.8198253 0.4652053 0.6069765 0.38944993 0.2278911 0.28811750
## 5 0.1293676 0.2073670 0.3025314 0.9324739 0.36084299 0.9156316 0.92158419
## 6 0.2737923 0.6725885 0.9131288 0.5731909 0.88643173 0.9495621 0.79468594
##         V21        V22       V23       V24       V25        V26       V27
## 1 0.5268952 0.22102920 0.1614168 0.5398129 0.5534918 0.63053215 0.9138582
## 2 0.7184214 0.47497728 0.7540147 0.5170090 0.7088770 0.66099985 0.6457488
## 3 0.4521553 0.10054078 0.7048759 0.8284640 0.2597262 0.05071979 0.1015634
## 4 0.3425029 0.03766112 0.7062673 0.3940008 0.9554836 0.06397756 0.1072469
## 5 0.0808702 0.71542450 0.2866293 0.3385621 0.8423259 0.70070237 0.4397458
## 6 0.9134128 0.10396219 0.5619998 0.8792314 0.9389477 0.08229026 0.9443304
##          V28       V29       V30        V31       V32       V33        V34
## 1 0.95372513 0.8196622 0.3350209 0.11008057 0.5885140 0.2824735 0.46979932
## 2 0.83527093 0.3982447 0.3380713 0.99021123 0.4240839 0.6628323 0.38366060
## 3 0.65671663 0.8255118 0.8969452 0.65022025 0.4359607 0.7181876 0.22302089
## 4 0.77768441 0.3026732 0.4322319 0.06919407 0.4425501 0.7780290 0.62058674
## 5 0.07671021 0.1579417 0.4917127 0.52673065 0.9817886 0.4669908 0.43678329
## 6 0.56186048 0.5146895 0.5882684 0.66306858 0.6195238 0.3501962 0.04590658
##         V35       V36       V37       V38        V39       V40       V41
## 1 0.3124830 0.5470640 0.1808295 0.7330374 0.22604512 0.1875547 0.8345588
## 2 0.5881805 0.7342948 0.3853905 0.3758414 0.02682627 0.6894175 0.7174112
## 3 0.4542990 0.9491330 0.2006372 0.9563640 0.66758427 0.3178381 0.9217305
## 4 0.5084065 0.9848869 0.7816868 0.3391856 0.79361847 0.4551915 0.2143525
## 5 0.7209006 0.3152856 0.7292400 0.6094060 0.24921339 0.1156424 0.5158954
## 6 0.4309317 0.4723814 0.6922514 0.5415480 0.20744677 0.8530895 0.2648333
##          V42        V43       V44       V45       V46       V47        V48
## 1 0.08691973 0.45780129 0.3639167 0.2329356 0.6524616 0.3093971 0.85254289
## 2 0.14015128 0.82960765 0.6484462 0.5121706 0.1763923 0.4618049 0.97822864
## 3 0.34644847 0.11776351 0.8821448 0.2468129 0.5940904 0.8928482 0.72971691
## 4 0.56506518 0.05162098 0.9412839 0.8606908 0.6702378 0.2151544 0.68620139
## 5 0.41518479 0.57804145 0.8208939 0.8159484 0.2133537 0.7794204 0.04970226
## 6 0.54492781 0.77628252 0.4784446 0.7489294 0.4179912 0.1018776 0.08423770
##          V49        V50         V51       V52       V53       V54
## 1 0.05011054 0.19976264 0.346614554 0.9770470 0.8267348 0.6334371
## 2 0.31570993 0.58907720 0.781645620 0.8480106 0.6347199 0.6175607
## 3 0.57858472 0.79191134 0.515945126 0.3766002 0.3026915 0.2345405
## 4 0.70633191 0.82395169 0.377163650 0.9984267 0.6800605 0.1634893
## 5 0.28389753 0.58162969 0.920986567 0.5618137 0.9910437 0.6067510
## 6 0.12919771 0.03711531 0.001498685 0.2329754 0.4613257 0.5295072
##          V55        V56       V57        V58       V59       V60
## 1 0.53094921 0.21623788 0.8614596 0.33171237 0.9341563 0.2200101
## 2 0.47861894 0.73678839 0.7281260 0.14987313 0.5004925 0.6787636
## 3 0.72074411 0.68317191 0.2861561 0.53523440 0.3076221 0.1334552
## 4 0.86068967 0.04197453 0.1593689 0.52065730 0.9146829 0.6291156
## 5 0.04583296 0.37435635 0.8326694 0.06079883 0.6847007 0.2338096
## 6 0.83359670 0.97424794 0.2428933 0.78732705 0.5660282 0.1880929
##          V61       V62         V63        V64        V65       V66
## 1 0.47981659 0.3152615 0.234957625 0.20673160 0.28288922 0.4930318
## 2 0.02402208 0.3130100 0.007754755 0.78323134 0.83275916 0.8413838
## 3 0.35898984 0.7538553 0.332491866 0.22549944 0.89242656 0.8668341
## 4 0.94779962 0.9199844 0.369502428 0.41298377 0.35668981 0.2780697
## 5 0.86410667 0.8637177 0.976842775 0.03755392 0.09185347 0.2204420
## 6 0.28906298 0.3919817 0.528322783 0.15433444 0.15536724 0.8300934
##         V67       V68         V69        V70        V71       V72
## 1 0.4252947 0.7002006 0.777502397 0.21700370 0.17315534 0.3743983
## 2 0.1952382 0.1936738 0.007402987 0.09291439 0.05342468 0.7761527
## 3 0.5683557 0.5022794 0.421756913 0.37493134 0.20992576 0.9337587
## 4 0.6870402 0.6021943 0.477648763 0.15935415 0.01790243 0.3987472
## 5 0.4998683 0.2721193 0.574701922 0.09861315 0.19274351 0.7465204
## 6 0.7169159 0.9765102 0.516474072 0.74254187 0.18122967 0.7793678
##         V73        V74       V75       V76        V77       V78       V79
## 1 0.5974588 0.76641946 0.1153148 0.6061773 0.84756187 0.4838876 0.8577100
## 2 0.3950155 0.15711291 0.2940018 0.4688531 0.05814825 0.4487619 0.4451039
## 3 0.0402423 0.58409636 0.3314035 0.8811726 0.86637747 0.3525458 0.1333492
## 4 0.2260923 0.44701136 0.2872220 0.3407173 0.59933234 0.3292218 0.1050910
## 5 0.6262719 0.01958447 0.8956871 0.5171751 0.37101680 0.2528866 0.4862446
## 6 0.8462474 0.27717063 0.7396417 0.1456409 0.48084612 0.5423240 0.1755146
##          V80        V81       V82        V83       V84       V85       V86
## 1 0.48708573 0.26643755 0.6820924 0.61044474 0.5608281 0.5623676 0.4567782
## 2 0.50336126 0.55115940 0.5023934 0.87433810 0.1477371 0.6742229 0.1601285
## 3 0.55604233 0.28791968 0.1651268 0.09728313 0.4839350 0.3967569 0.3787584
## 4 0.84355108 0.26052866 0.8070973 0.17624895 0.2043278 0.1271272 0.0144499
## 5 0.81585905 0.04895874 0.1797900 0.29220557 0.3791552 0.5182834 0.3905862
## 6 0.03522949 0.21229430 0.0849170 0.49969182 0.4004684 0.9415648 0.1258903
##         V87       V88       V89        V90       V91       V92       V93
## 1 0.7626591 0.2180036 0.7442870 0.41256313 0.7501933 0.4564545 0.9580733
## 2 0.8328556 0.7185586 0.8620664 0.47658683 0.8160589 0.4328343 0.3926681
## 3 0.1813921 0.4468990 0.4924655 0.16937879 0.9484113 0.6650279 0.6605460
## 4 0.5160538 0.1310464 0.7629739 0.22056204 0.9960883 0.2635354 0.3749227
## 5 0.3757253 0.1558633 0.8154368 0.07778942 0.3577418 0.5637030 0.5066815
## 6 0.6351672 0.9344551 0.6718389 0.13685002 0.8191125 0.6294466 0.6056130
##          V94       V95       V96       V97         V98       V99      V100
## 1 0.74299622 0.9372414 0.2990555 0.3055707 0.851845863 0.9236000 0.3213596
## 2 0.63276982 0.4216996 0.5496787 0.6490230 0.239391752 0.6148555 0.1894100
## 3 0.76141462 0.1507744 0.9086228 0.9663571 0.014460469 0.9062310 0.7897107
## 4 0.36047587 0.5312079 0.9025777 0.2441219 0.005079025 0.8869425 0.1086141
## 5 0.59693755 0.5834280 0.4751858 0.1520234 0.920813971 0.2201902 0.4662787
## 6 0.07630926 0.7267419 0.7488450 0.2114582 0.908823544 0.4001999 0.3564068
## 763 Mb

Nearly a gigabyte. This still fits in RAM, but 10 million cases with 100 variables may not (depending on the system), and 100M almost certainly wouldn’t.

Let’s free that memory by purging the data from our environment

Hybrid dplyr and SQL code

Instead of reading the whole dataset into memory, we can often find SQL equivalents for the things that we might not be able to do with an SQL view using dplyr code, and use it within dplyr code to make a sort of hybrid command:

## # Source:   lazy query [?? x 3]
## # Database: mysql 5.7.33-log
## #   [@mdsr.cdc7tgkkqd0n.us-east-1.rds.amazonaws.com:/airlines]
##    carrier name                           name_code                        
##    <chr>   <chr>                          <chr>                            
##  1 02Q     Titan Airways                  Titan Airways(02Q)               
##  2 04Q     Tradewind Aviation             Tradewind Aviation(04Q)          
##  3 05Q     Comlux Aviation, AG            Comlux Aviation, AG(05Q)         
##  4 06Q     Master Top Linhas Aereas Ltd.  Master Top Linhas Aereas Ltd.(06…
##  5 07Q     Flair Airlines Ltd.            Flair Airlines Ltd.(07Q)         
##  6 09Q     Swift Air, LLC                 Swift Air, LLC(09Q)              
##  7 0BQ     DCA                            DCA(0BQ)                         
##  8 0CQ     ACM AIR CHARTER GmbH           ACM AIR CHARTER GmbH(0CQ)        
##  9 0GQ     Inter Island Airways, d/b/a I… Inter Island Airways, d/b/a Inte…
## 10 0HQ     Polar Airlines de Mexico d/b/… Polar Airlines de Mexico d/b/a N…
## # … with more rows

Notice that CONCAT is not a valid R function

Going (almost) all the way to “raw” SQL

The scidb database also has a bunch of data from IMDB.

##  [1] "aka_name"        "aka_title"       "cast_info"      
##  [4] "char_name"       "comp_cast_type"  "company_name"   
##  [7] "company_type"    "complete_cast"   "info_type"      
## [10] "keyword"         "kind_type"       "link_type"      
## [13] "movie_companies" "movie_info"      "movie_info_idx" 
## [16] "movie_keyword"   "movie_link"      "name"           
## [19] "person_info"     "role_type"       "title"

If we know the SQL query we want, we can make an R string from it and sent it do the database with dbGetQuery(). The following query returns a list of the different kinds of “entities” that the imdb database has information about (represented in the kind_type table)

##   id           kind
## 1  1          movie
## 2  2      tv series
## 3  3       tv movie
## 4  4    video movie
## 5  5 tv mini series
## 6  6     video game
## 7  7        episode

This one gives us the table that lists the kinds of ‘properties’ of these entities that the data includes.

##      id                            info
## 1     1                        runtimes
## 2     2                      color info
## 3     3                          genres
## 4     4                       languages
## 5     5                    certificates
## 6     6                       sound mix
## 7     7                       tech info
## 8     8                       countries
## 9     9                        taglines
## 10   10                        keywords
## 11   11              alternate versions
## 12   12                   crazy credits
## 13   13                           goofs
## 14   14                      soundtrack
## 15   15                          quotes
## 16   16                   release dates
## 17   17                          trivia
## 18   18                       locations
## 19   19                  mini biography
## 20   20                     birth notes
## 21   21                      birth date
## 22   22                          height
## 23   23                      death date
## 24   24                          spouse
## 25   25                     other works
## 26   26                      birth name
## 27   27                  salary history
## 28   28                      nick names
## 29   29                           books
## 30   30                   agent address
## 31   31             biographical movies
## 32   32                    portrayed in
## 33   33                       where now
## 34   34                      trade mark
## 35   35                      interviews
## 36   36                         article
## 37   37            magazine cover photo
## 38   38                       pictorial
## 39   39                     death notes
## 40   40                  LD disc format
## 41   41                         LD year
## 42   42                LD digital sound
## 43   43        LD official retail price
## 44   44           LD frequency response
## 45   45               LD pressing plant
## 46   46                       LD length
## 47   47                     LD language
## 48   48                       LD review
## 49   49                   LD spaciality
## 50   50                 LD release date
## 51   51           LD production country
## 52   52                     LD contrast
## 53   53              LD color rendition
## 54   54               LD picture format
## 55   55                  LD video noise
## 56   56              LD video artifacts
## 57   57              LD release country
## 58   58                    LD sharpness
## 59   59                LD dynamic range
## 60   60                  LD audio noise
## 61   61            LD color information
## 62   62                  LD group genre
## 63   63              LD quality program
## 64   64 LD close captions-teletext-ld-g
## 65   65                     LD category
## 66   66                  LD analog left
## 67   67                LD certification
## 68   68                LD audio quality
## 69   69                LD video quality
## 70   70                 LD aspect ratio
## 71   71                 LD analog right
## 72   72       LD additional information
## 73   73      LD number of chapter stops
## 74   74     LD dialogue intellegibility
## 75   75                    LD disc size
## 76   76                LD master format
## 77   77                    LD subtitles
## 78   78      LD status of availablility
## 79   79            LD quality of source
## 80   80              LD number of sides
## 81   81               LD video standard
## 82   82                   LD supplement
## 83   83               LD original title
## 84   84               LD sound encoding
## 85   85                       LD number
## 86   86                        LD label
## 87   87               LD catalog number
## 88   88              LD laserdisc title
## 89   89             screenplay-teleplay
## 90   90                           novel
## 91   91                        adaption
## 92   92                            book
## 93   93     production process protocol
## 94   94           printed media reviews
## 95   95                          essays
## 96   96                other literature
## 97   97                            mpaa
## 98   98                            plot
## 99   99              votes distribution
## 100 100                           votes
## 101 101                          rating
## 102 102                production dates
## 103 103                copyright holder
## 104 104                   filming dates
## 105 105                          budget
## 106 106                   weekend gross
## 107 107                           gross
## 108 108                 opening weekend
## 109 109                         rentals
## 110 110                      admissions
## 111 111                         studios
## 112 112                    top 250 rank
## 113 113                  bottom 10 rank

This one uses WHERE, which is the SQL equivalent of filter().

(Note: title is the name of both a table and a variable in that table, annoyingly)

(Note 2: since the whole query is quoted with double quotes, we need single quotes for quoted expressions within the query)

##         id                   title imdb_index kind_id production_year
## 1   476823 The Empire Strikes Back       <NA>       7            1980
## 2   461950 The Empire Strikes Back       <NA>       7            2015
## 3   919851 The Empire Strikes Back       <NA>       7            2015
## 4   924929 The Empire Strikes Back       <NA>       7            2017
## 5  1751199 The Empire Strikes Back       <NA>       7            2003
## 6  1890901 The Empire Strikes Back       <NA>       7            2016
## 7  2530173 The Empire Strikes Back       <NA>       7            2005
## 8  2780701 The Empire Strikes Back       <NA>       7            1986
## 9  2916719 The Empire Strikes Back       <NA>       7            2012
## 10 4346960 The Empire Strikes Back       <NA>       6            1992
##    imdb_id phonetic_code episode_of_id season_nr episode_nr series_years
## 1       NA         E5162        476562        NA         NA         <NA>
## 2       NA         E5162        461323         5         91         <NA>
## 3       NA         E5162        919801         1          7         <NA>
## 4       NA         E5162        924922         1          1         <NA>
## 5       NA         E5162       1751194         1          2         <NA>
## 6       NA         E5162       1890747         7          1         <NA>
## 7       NA         E5162       2530161         1         12         <NA>
## 8       NA         E5162       2780696         1          9         <NA>
## 9       NA         E5162       2916717         1          3         <NA>
## 10      NA         E5162            NA        NA         NA         <NA>
##                              md5sum
## 1  ec3a52f2c4ad8513889060bfc28871ad
## 2  0fa9a1c1f9581128e420e19f51fe2689
## 3  0a40b8ea1fce567f98bc5c8be3d695a6
## 4  98f5d44fdec7f4bd441353ebc014850f
## 5  ba276f18d6d8e4fb7e13c82493e0af8a
## 6  639bde13d7843adfaab1ed154cef949a
## 7  4fc2a75a237ba38bdc71c1add6dae7ff
## 8  d7de5425004a87ef0adf739cc6a5ef59
## 9  430d9e1af0842d60e501e86de44e05f6
## 10 8f45ef75543e1ad1173b5d6bfc0d6c12

To give more than one filter condition, we can use AND and/or OR. This one should restrict entries to movies titled “The Empire Strikes Back”…

##  [1] id              title           imdb_index      kind_id        
##  [5] production_year imdb_id         phonetic_code   episode_of_id  
##  [9] season_nr       episode_nr      series_years    md5sum         
## <0 rows> (or 0-length row.names)

Oops, nothing matches. Any idea why?

We can use “wildcards” to search for titles that contain a phrase

##        id                                          title imdb_index
## 1 4260164 Star Wars: Episode V - The Empire Strikes Back       <NA>
## 2 4346961       The Empire Strikes Back Part II (Parody)       <NA>
##   kind_id production_year imdb_id phonetic_code episode_of_id season_nr
## 1       1            1980      NA         S3621            NA        NA
## 2       1            2016      NA         E5162            NA        NA
##   episode_nr series_years                           md5sum
## 1         NA         <NA> 99c3b6266cd096147b1d7af2b8353edb
## 2         NA         <NA> 759372cfa023e3f7485dfe0db5bdc13d

The following is equivalent to a select(), filter() and arrange()

##                                            title production_year
## 1 Star Wars: Episode V - The Empire Strikes Back            1980
## 2       The Empire Strikes Back Part II (Parody)            2016

This one involves a join, along with a select and filter to get the cast of The Empire Strikes Back (to specify it uniquely we use the id we found for it above)

##                     name role_id
## 1          Anderson, Bob       1
## 2           Austen, Alan       1
## 3           Baker, Kenny       1
## 4        Bear, Lightning       1
## 5             Boa, Bruce       1
## 6      Bonehill, Richard       1
## 7      Bonehill, Richard       1
## 8      Bonehill, Richard       1
## 9      Bonehill, Richard       1
## 10     Buchanan, Stephen       1
## 11       Bulloch, Jeremy       1
## 12           Bunn, Chris       1
## 13          Bush, Morris       1
## 14   Butterfield, Trevor       1
## 15   Butterfield, Trevor       1
## 16          Cannon, John       1
## 17           Capri, Mark       1
## 18       Chancer, Norman       1
## 19         Clarkin, Tony       1
## 20       Colley, Kenneth       1
## 21       Culver, Michael       1
## 22          Curry, Shaun       1
## 23      Daniels, Anthony       1
## 24           Dew, Martin       1
## 25        Diamond, Peter       1
## 26           Dicks, John       1
## 27          Dowdall, Jim       1
## 28         Duff, Norwich       1
## 29          Durrant, Ian       1
## 30          Durrant, Ian       1
## 31         Edmonds, Mike       1
## 32          Egeland, Tom       1
## 33          Fell, Stuart       1
## 34            Flood, Tom       1
## 35           Flyng, Alan       1
## 36        Ford, Harrison       1
## 37    Frandy, Michael A.       1
## 38       Ginter, Patrick       1
## 39        Glover, Julian       1
## 40        Guinness, Alec       1
## 41          Hamill, Mark       1
## 42          Harris, Alan       1
## 43          Harris, Alan       1
## 44          Harte, Jerry       1
## 45          Hassett, Ray       1
## 46         Henry, Walter       1
## 47          Hollis, John       1
## 48        Jerricho, Paul       1
## 49         Johns, Milton       1
## 50         Johnston, Joe       1
## 51     Jones, James Earl       1
## 52           Jones, Mark       1
## 53        Juritzen, Arve       1
## 54           Klein, Paul       1
## 55         Lawson, Denis       1
## 56           Liston, Ian       1
## 57       Maguire, Oliver       1
## 58  Malcolm, Christopher       1
## 59         Mayhew, Peter       1
## 60        McDiarmid, Ian       1
## 61         McDonald, Mac       1
## 62        McKenzie, Jack       1
## 63      McQuarrie, Ralph       1
## 64           Meek, Steve       1
## 65           Meek, Steve       1
## 66           Meek, Steve       1
## 67           Meek, Steve       1
## 68     Morrison, Temuera       1
## 69          Morse, Ralph       1
## 70          Morse, Ralph       1
## 71          Morse, Ralph       1
## 72          Morton, John       1
## 73     Nelson, C. Andrew       1
## 74     Oldfield, Richard       1
## 75             Oz, Frank       1
## 76        Parsons, Chris       1
## 77        Parsons, Chris       1
## 78        Parsons, Chris       1
## 79       Pierre, Quentin       1
## 80         Prowse, David       1
## 81          Purvis, Jack       1
## 82    Ratzenberger, John       1
## 83         Revill, Clive       1
## 84       Richards, Terry       1
## 85        Robinson, Doug       1
## 86            Roy, Peter       1
## 87     Santiago, Michael       1
## 88         Scobey, Robin       1
## 89       Sheard, Michael       1
## 90       Sidoli, Richard       1
## 91           Smart, Tony       1
## 92          Swaden, Alan       1
## 93       Tucker, Burnell       1
## 94             Webb, Des       1
## 95          Weed, Harold       1
## 96   Williams, Billy Dee       1
## 97       Williams, Treat       1
## 98       Wingreen, Jason       1
## 99    Burleigh, Mercedes       2
## 100      Eaton, Marjorie       2
## 101   English, Stephanie       2
## 102       Fisher, Carrie       2
## 103   Hillkurtz, Tiffany       2
## 104        Hudson, Susie       2
## 105       Kahn, Brigitte       2
## 106        Munroe, Cathy       2
## 107        Munroe, Cathy       2
## 108        Turk, Marolyn       2
## 109           Bloom, Jim       3
## 110 Kazanjian, Howard G.       3
## 111          Kurtz, Gary       3
## 112        Lucas, George       3
## 113       McCallum, Rick       3
## 114        Watts, Robert       3
## 115      Brackett, Leigh       4
## 116     Kasdan, Lawrence       4
## 117        Lucas, George       4
## 118    Suschitzky, Peter       5
## 119       Williams, John       6
## 120          Mollo, John       7
## 121      Kershner, Irvin       8
## 122    Christopher, T.M.       9
## 123         Hirsch, Paul       9
## 124        Lucas, George       9
## 125        Lucas, Marcia       9
## 126         Alsup, Bunny      10
## 127         Arnold, Alan      10
## 128  Barclay, David Alan      10
## 129            Bay, Jane      10
## 130         Butcher, Ron      10
## 131     Calcutt, Stephen      10
## 132           Carey, Des      10
## 133            Coke, Liz      10
## 134            Cook, Ron      10
## 135       Dearlove, Jack      10
## 136      Farrar, Kathryn      10
## 137             Fox, Tim      10
## 138      Ginter, Patrick      10
## 139      Ginter, Patrick      10
## 140          Gordon, Ken      10
## 141      Harley, Barbara      10
## 142         Harris, Alan      10
## 143        Harris, James      10
## 144         Herman, Miki      10
## 145     Larkins, Michael      10
## 146           Laws, Nick      10
## 147         Mann, Pamela      10
## 148       Marshall, Mark      10
## 149          Meek, Steve      10
## 150       Midener, Wendy      10
## 151        Miller, Craig      10
## 152         Morton, John      10
## 153      Mullen, Kathryn      10
## 154       Parsons, Chris      10
## 155          Phipps, Ron      10
## 156        Rawlings, Kay      10
## 157            Roy, Deep      10
## 158         Shaw, Debbie      10
## 159  Wessler, Charles B.      10
## 160        Wing, Kristen      10
## 161          Wolff, Marc      10
## 162     Reynolds, Norman      11