Merging Data from Two Tables

Goal

To become comfortable with the so-called mutating join operations: verbs that create one table out of two, where the result table may have more variables (columns) than either component table. (This is in contrast to filtering joins, which use information in a second table to extract a subset of cases from the first)

The Data

We’ll work (surprise, surprise) with the babynames data, but this time in addition to the Social Security database, we’ll also use data from the U.S. Census.

Code:

library(babynames)
data(babynames)    ## SSA data
data(births)       ## Census data

Let’s peek at the births table, since we haven’t used it before:

births %>% slice_head(n = 10)

## # A tibble: 10 x 2
##     year  births
##    <int>   <int>
##  1  1909 2718000
##  2  1910 2777000
##  3  1911 2809000
##  4  1912 2840000
##  5  1913 2869000
##  6  1914 2966000
##  7  1915 2965000
##  8  1916 2964000
##  9  1917 2944000
## 10  1918 2948000

This is just the total number of births in the U.S. for each year, going back to 1909.

Joining the Datasets

Our first goal will be to create a single dataset, indexed by year, that contains one column for the total number of births that year as recorded by the census, and another column that contains the total number of births that year as recorded by the Social Security Administration.

First, let’s create a summarized version of the SSA data which is indexed by year, and records the total number of births that year. To make the differences between the join operations clearer, let’s also omit the data after 2012.

Code:

ssa_births <- babynames %>%
  rename(num_births = n) %>%   # just to disambiguate the ns below
  group_by(year) %>%
  summarize(
    distinct_name_sex_combos = n(), # this is just the number of entries that year
    births                   = sum(num_births)) %>%
  filter(year <= 2012)         # Removing data after 2012
ssa_births %>% slice_head(n = 10)

## # A tibble: 10 x 3
##     year distinct_name_sex_combos births
##    <dbl>                    <int>  <int>
##  1  1880                     2000 201484
##  2  1881                     1935 192696
##  3  1882                     2127 221533
##  4  1883                     2084 216946
##  5  1884                     2297 243462
##  6  1885                     2294 240854
##  7  1886                     2392 255317
##  8  1887                     2373 247394
##  9  1888                     2651 299473
## 10  1889                     2590 288946

Just for clarity (and parallel naming), let’s make a copy of the births dataset called census_births.

Code:

census_births <- births

The four “mutating” `join` operations

Note that both ssa_births and census_births have a year column, with one entry per year. However, they cover different sets of years.

## Starts in 1880 and ends (because we modified it) in 2012
ssa_births %>% slice_head(n = 5)

## # A tibble: 5 x 3
##    year distinct_name_sex_combos births
##   <dbl>                    <int>  <int>
## 1  1880                     2000 201484
## 2  1881                     1935 192696
## 3  1882                     2127 221533
## 4  1883                     2084 216946
## 5  1884                     2297 243462

ssa_births %>% slice_tail(n = 5)

## # A tibble: 5 x 3
##    year distinct_name_sex_combos  births
##   <dbl>                    <int>   <int>
## 1  2008                    35070 3926358
## 2  2009                    34702 3815638
## 3  2010                    34067 3690700
## 4  2011                    33903 3651914
## 5  2012                    33732 3650462

## Starts in 1909, ends in 2017
census_births %>% slice_head(n = 5)

## # A tibble: 5 x 2
##    year  births
##   <int>   <int>
## 1  1909 2718000
## 2  1910 2777000
## 3  1911 2809000
## 4  1912 2840000
## 5  1913 2869000

census_births %>% slice_tail(n = 5)

## # A tibble: 5 x 2
##    year  births
##   <int>   <int>
## 1  2013 3932181
## 2  2014 3988076
## 3  2015 3978497
## 4  2016 3945875
## 5  2017 3855500

If we want to combine them into a single table, one decision we have to make is what to do with the years that are in one table but not the other?

The way we answer this question determines the type of join we will use.

Inner join (`inner_join()`)

The inner_join() operation matches entries based on a “key” variable, and retains a case only if the key exists in both the left and the right table.

Compare the years in the ssa_births data:

ssa_births %>% pull(year) %>% unique()

##   [1] 1880 1881 1882 1883 1884 1885 1886 1887 1888 1889 1890 1891 1892 1893
##  [15] 1894 1895 1896 1897 1898 1899 1900 1901 1902 1903 1904 1905 1906 1907
##  [29] 1908 1909 1910 1911 1912 1913 1914 1915 1916 1917 1918 1919 1920 1921
##  [43] 1922 1923 1924 1925 1926 1927 1928 1929 1930 1931 1932 1933 1934 1935
##  [57] 1936 1937 1938 1939 1940 1941 1942 1943 1944 1945 1946 1947 1948 1949
##  [71] 1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 1960 1961 1962 1963
##  [85] 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977
##  [99] 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991
## [113] 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005
## [127] 2006 2007 2008 2009 2010 2011 2012

to the years in the census_births data:

census_births %>% pull(year) %>% unique()

##   [1] 1909 1910 1911 1912 1913 1914 1915 1916 1917 1918 1919 1920 1921 1922
##  [15] 1923 1924 1925 1926 1927 1928 1929 1930 1931 1932 1933 1934 1935 1936
##  [29] 1937 1938 1939 1940 1941 1942 1943 1944 1945 1946 1947 1948 1949 1950
##  [43] 1951 1952 1953 1954 1955 1956 1957 1958 1959 1960 1961 1962 1963 1964
##  [57] 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978
##  [71] 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992
##  [85] 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006
##  [99] 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017

to the years in the data resulting from an inner_join():

joined_via_inner <- ssa_births %>% 
  inner_join(census_births, by = "year")  # quotes are required in the variable name

joined_via_inner %>% slice_head(n = 5)

## # A tibble: 5 x 4
##    year distinct_name_sex_combos births.x births.y
##   <dbl>                    <int>    <int>    <int>
## 1  1909                     4227   511228  2718000
## 2  1910                     4629   590715  2777000
## 3  1911                     4867   644279  2809000
## 4  1912                     6351   988064  2840000
## 5  1913                     6968  1137111  2869000

joined_via_inner %>% pull(year) %>% unique()

##   [1] 1909 1910 1911 1912 1913 1914 1915 1916 1917 1918 1919 1920 1921 1922
##  [15] 1923 1924 1925 1926 1927 1928 1929 1930 1931 1932 1933 1934 1935 1936
##  [29] 1937 1938 1939 1940 1941 1942 1943 1944 1945 1946 1947 1948 1949 1950
##  [43] 1951 1952 1953 1954 1955 1956 1957 1958 1959 1960 1961 1962 1963 1964
##  [57] 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978
##  [71] 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992
##  [85] 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006
##  [99] 2007 2008 2009 2010 2011 2012

Only the years that are recorded in both tables are included. Note that since both tables had a variable called births, R created new variable names to disambiguate. The one with the suffix .x comes from the left-hand table (ssa_births in this case), and the one with the suffix .y comes from the right-hand table (census_births).

Left join (`left_join()`)

In contrast, if we use left_join() then we will keep every entry from the left table, and record data from the variables added by the right table as NA for “missing”.

joined_via_left <- ssa_births %>%
  left_join(census_births, by = "year")

joined_via_left %>% filter(year < 1919)

## # A tibble: 39 x 4
##     year distinct_name_sex_combos births.x births.y
##    <dbl>                    <int>    <int>    <int>
##  1  1880                     2000   201484       NA
##  2  1881                     1935   192696       NA
##  3  1882                     2127   221533       NA
##  4  1883                     2084   216946       NA
##  5  1884                     2297   243462       NA
##  6  1885                     2294   240854       NA
##  7  1886                     2392   255317       NA
##  8  1887                     2373   247394       NA
##  9  1888                     2651   299473       NA
## 10  1889                     2590   288946       NA
## # … with 29 more rows

joined_via_left %>% filter(year > 2008)

## # A tibble: 4 x 4
##    year distinct_name_sex_combos births.x births.y
##   <dbl>                    <int>    <int>    <int>
## 1  2009                    34702  3815638  4130665
## 2  2010                    34067  3690700  3999386
## 3  2011                    33903  3651914  3953590
## 4  2012                    33732  3650462  3952841

joined_via_left %>% pull(year) %>% unique()

##   [1] 1880 1881 1882 1883 1884 1885 1886 1887 1888 1889 1890 1891 1892 1893
##  [15] 1894 1895 1896 1897 1898 1899 1900 1901 1902 1903 1904 1905 1906 1907
##  [29] 1908 1909 1910 1911 1912 1913 1914 1915 1916 1917 1918 1919 1920 1921
##  [43] 1922 1923 1924 1925 1926 1927 1928 1929 1930 1931 1932 1933 1934 1935
##  [57] 1936 1937 1938 1939 1940 1941 1942 1943 1944 1945 1946 1947 1948 1949
##  [71] 1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 1960 1961 1962 1963
##  [85] 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977
##  [99] 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991
## [113] 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005
## [127] 2006 2007 2008 2009 2010 2011 2012

Notice that missing values are recorded for years that are included in ssa_births but not in census_births; but for years included in census_births that are not in ssa_births (that is, years after 2012), the entries from census_births are omitted entirely.

We can examine this explicitly by filtering to see only those rows for which one or the other variable has missing data, using the is.na() function:

Code

## Returns nothing; x is never missing in a left join, unless
## it was missing in the original left-hand table
joined_via_left %>% 
  filter(is.na(births.x))

## # A tibble: 0 x 4
## # … with 4 variables: year <dbl>, distinct_name_sex_combos <int>,
## #   births.x <int>, births.y <int>

## Returns years in the left but not the right table
joined_via_left %>%
  filter(is.na(births.y))

## # A tibble: 29 x 4
##     year distinct_name_sex_combos births.x births.y
##    <dbl>                    <int>    <int>    <int>
##  1  1880                     2000   201484       NA
##  2  1881                     1935   192696       NA
##  3  1882                     2127   221533       NA
##  4  1883                     2084   216946       NA
##  5  1884                     2297   243462       NA
##  6  1885                     2294   240854       NA
##  7  1886                     2392   255317       NA
##  8  1887                     2373   247394       NA
##  9  1888                     2651   299473       NA
## 10  1889                     2590   288946       NA
## # … with 19 more rows

Right join (`right_join()`)

Right join (right_join()) is just the opposite of left join, with the roles switched.

In fact, if our datasets are called df1 and df2, the command

df2 %>% right_join(df1)

does the same thing as

df1 %>% left_join(df2)

Full join (`full_join()`)

If we want to keep entries for years that appear in either dataset, we can perform a full join. In this case, whichever years are missing in one or the other dataset will have an NA for that variable.

joined_via_full <- ssa_births %>%
  full_join(census_births, by = "year")

joined_via_full %>% pull(year)

##   [1] 1880 1881 1882 1883 1884 1885 1886 1887 1888 1889 1890 1891 1892 1893
##  [15] 1894 1895 1896 1897 1898 1899 1900 1901 1902 1903 1904 1905 1906 1907
##  [29] 1908 1909 1910 1911 1912 1913 1914 1915 1916 1917 1918 1919 1920 1921
##  [43] 1922 1923 1924 1925 1926 1927 1928 1929 1930 1931 1932 1933 1934 1935
##  [57] 1936 1937 1938 1939 1940 1941 1942 1943 1944 1945 1946 1947 1948 1949
##  [71] 1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 1960 1961 1962 1963
##  [85] 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977
##  [99] 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991
## [113] 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005
## [127] 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017

joined_via_full %>%
  filter(is.na(births.x))

## # A tibble: 5 x 4
##    year distinct_name_sex_combos births.x births.y
##   <dbl>                    <int>    <int>    <int>
## 1  2013                       NA       NA  3932181
## 2  2014                       NA       NA  3988076
## 3  2015                       NA       NA  3978497
## 4  2016                       NA       NA  3945875
## 5  2017                       NA       NA  3855500

joined_via_full %>%
  filter(is.na(births.y))

## # A tibble: 29 x 4
##     year distinct_name_sex_combos births.x births.y
##    <dbl>                    <int>    <int>    <int>
##  1  1880                     2000   201484       NA
##  2  1881                     1935   192696       NA
##  3  1882                     2127   221533       NA
##  4  1883                     2084   216946       NA
##  5  1884                     2297   243462       NA
##  6  1885                     2294   240854       NA
##  7  1886                     2392   255317       NA
##  8  1887                     2373   247394       NA
##  9  1888                     2651   299473       NA
## 10  1889                     2590   288946       NA
## # … with 19 more rows

Notice that the full set of years is the union of the first two sets.

Comparison Between the Two Sources

Having joined the data from the two sources, let’s see to what extent two sources agree. Create a scatterplot of the births with the data from ssa_births on the x axis and the data from census_births on the y axis. Rename the variables first so it’s clear which one is which.

SOLUTION

births_data <- joined_via_inner %>%
  rename(
    births_ssa    = births.x,
    births_census = births.y
  )

births_data %>%  
  ggplot(aes(x = births_ssa / 1000000, y = births_census / 1000000)) +
  geom_point() +
  scale_x_continuous(
    name   = "Millions of Births as Recorded by the SSA") +
  scale_y_continuous(
    name   = "Millions of Births as Recorded by the Census")

How closely do the two sources agree? Describe the nature of the discrepancies you see.

RESPONSE

Examine the documentation for the two original datasets to see whether you can account for the discrepancies.

RESPONSE

Joining flight data

The following use the nycflights13 package.

Compute the average arrival delay time (arr_delay) for each carrier (this requires a group_by() and summarize()), and include the full name of the carrier in your result set. Some flights may have missing data (NA) in the arr_delay column. You will want to set na.rm = TRUE in the mean() function to drop these cases before calculating the mean (otherwise the mean itself will be returned as missing). Sort the results in ascending order of arrival delay.

SOLUTION

library(nycflights13)
flights %>%
  group_by(carrier) %>%
  summarize(avg_delay = mean(arr_delay, na.rm = TRUE)) %>%
  arrange(avg_delay) %>%
  left_join(airlines, by = "carrier")

What was the full name of the airport that was the most common destination from NYC in 2013? How many flights went there that year? First group by destination and count the number of flights to each destination. Then return the destination that had the highest number of flights, as well as the actual number of flights to that destination. Finally, join the resulting one-line table with the airports table to add on the actual name of the airport. Make sure you are using a type of join that keeps the final result at just one entry. Restrict the columns in the final output to just the airport code, airport name, and number of flights. NOTE: The dest column in the flights data contains the airport code of the destination. In the airports data, the airport code is stored in a different variable, called faa. In order to join these we need to set by = c("dest" = "faa") in our join operation, where name on the left of the = is the column name in the left-hand dataset and the name on the right is the corresponding column name in the right-hand dataset. (The answer should be “Chicago Ohare Intl”, with code ORD, which had 17283 flights from NYC)

SOLUTION

flights %>%
  group_by(dest) %>%
  summarize(num_flights = n()) %>%
  slice_max(
    order_by = num_flights,
    n        = 1) %>%
  left_join(airports, by = c("dest" = "faa")) %>%
  select(dest, name, num_flights)

What model of plane had the most total flights in 2013? The frequency data comes from the flights table, and the model of each plane comes from the planes table. First, join the flights table with the planes table in a way such that only planes whose tail number corresponds to a known model are included. Then, group the resulting table by model. Then count the number of flights for each model using summarize(). Finally, find the model with the highest number of flights, as well as the number of flights itself. (The answer should be model A320-232, with 45831 flights)

SOLUTION

Were there any flights that went to “mystery” airports (i.e., airports whose full names don’t appear in the airports table)? What were they and how many flights went to each of those airports? Perform a join to get the airport name for each destination in the flights table, producing NA for any flights whose destination code does not appear in the airports table. Filter the results to retain only those flights heading to these “mystery” airports. Then group the data by destination and count the number of flights going to each mystery airport. Sort the final results to show the most popular mystery airports first (i.e., in descending order of number of flights). (There should be four results: SJU, BQN, STT, and PSE, in that order)

SOLUTION

Were there any “mystery” planes (i.e., planes that don’t appear in the planes table)? Produce output in the same form as for the previous exercise, with the tail numbers of the mystery planes, and the number of flights flown by those planes, sorted in descending order based on which planes took the most flights. There should be 722 rows in the result. No hints this time!

STAT 209: Lab 8

Colin Reimer Dawson

July 1, 2021

Merging Data from Two Tables

Goal

The Data

Joining the Datasets

The four “mutating” `join` operations

Inner join (`inner_join()`)

Left join (`left_join()`)

Right join (`right_join()`)

Full join (`full_join()`)

Comparison Between the Two Sources

SOLUTION

RESPONSE

RESPONSE

Joining flight data

SOLUTION

SOLUTION

SOLUTION

SOLUTION

SOLUTION

STAT 209: Lab 8

Colin Reimer Dawson

July 1, 2021

Merging Data from Two Tables

Goal

The Data

Joining the Datasets

The four “mutating” join operations

Inner join (inner_join())

Left join (left_join())

Right join (right_join())

Full join (full_join())

Comparison Between the Two Sources

SOLUTION

RESPONSE

RESPONSE

Joining flight data

SOLUTION

SOLUTION

SOLUTION

SOLUTION

SOLUTION

The four “mutating” `join` operations

Inner join (`inner_join()`)

Left join (`left_join()`)

Right join (`right_join()`)

Full join (`full_join()`)