Merging Data from Two Tables

Goal

To become comfortable with the so-called mutating join operations: verbs that create one table out of two, where the result table may have more variables (columns) than either component table. (This is in contrast to filtering joins, which use information in a second table to extract a subset of cases from the first)

The Data

We’ll work (surprise, surprise) with data on babynames, but this time in addition to the Social Security database, we’ll also use data from the U.S. Census.

Preliminaries

Load libraries and data

Code:

library(tidyverse)
library(babynames)
data(babynames)    ## SSA data
data(births)       ## Census data

Let’s peek at the births table, since we haven’t used it before:

head(births)
## # A tibble: 6 x 2
##    year  births
##   <int>   <int>
## 1  1909 2718000
## 2  1910 2777000
## 3  1911 2809000
## 4  1912 2840000
## 5  1913 2869000
## 6  1914 2966000

This is just the total number of births in the U.S. for each year, going back to 1909.

Our goal will be to create a single dataset indexed by year that contains one column for the total number of births that year as recorded by the census, and another column that contains the total number of births that year as recorded by the Social Security Administration.

First, let’s create a summarized version of the SSA data which is indexed by year, and records the total number of births that year. To make the differences between the join operations clearer, let’s also chop off the data after 2012.

Code:

ssa_births <- babynames %>%
  rename(num_births = n) %>%   # just to disambiguate the ns below
  group_by(year) %>%
  summarize(
    distinct_name_sex_combos = n(), # this is just the number of entries that year
    births = sum(num_births)) %>%
  filter(year <= 2012)
head(ssa_births)
## # A tibble: 6 x 3
##    year distinct_name_sex_combos births
##   <dbl>                    <int>  <int>
## 1  1880                     2000 201482
## 2  1881                     1935 192696
## 3  1882                     2127 221534
## 4  1883                     2084 216945
## 5  1884                     2297 243463
## 6  1885                     2294 240854

Just for clarity (and parallel naming), let’s make a copy of the births dataset called census_births.

Code:

census_births <- births

Joining the tables

Note that both ssa_births and census_births have a year column, with one entry per year. However, they cover different sets of years.

head(ssa_births)
## # A tibble: 6 x 3
##    year distinct_name_sex_combos births
##   <dbl>                    <int>  <int>
## 1  1880                     2000 201482
## 2  1881                     1935 192696
## 3  1882                     2127 221534
## 4  1883                     2084 216945
## 5  1884                     2297 243463
## 6  1885                     2294 240854
head(census_births)
## # A tibble: 6 x 2
##    year  births
##   <int>   <int>
## 1  1909 2718000
## 2  1910 2777000
## 3  1911 2809000
## 4  1912 2840000
## 5  1913 2869000
## 6  1914 2966000

If we want to combine them into a single table, one decision we have to make is what to do with the years that are in one table but not the other? The way we answer this question determines the type of join we will use.

Inner join

The inner_join() operation matches entries based on a “key” variable, and retains a case only if the key exists in both the left and the right table.

Compare the years in the ssa_births data:

ssa_births$year
##   [1] 1880 1881 1882 1883 1884 1885 1886 1887 1888 1889 1890 1891 1892 1893
##  [15] 1894 1895 1896 1897 1898 1899 1900 1901 1902 1903 1904 1905 1906 1907
##  [29] 1908 1909 1910 1911 1912 1913 1914 1915 1916 1917 1918 1919 1920 1921
##  [43] 1922 1923 1924 1925 1926 1927 1928 1929 1930 1931 1932 1933 1934 1935
##  [57] 1936 1937 1938 1939 1940 1941 1942 1943 1944 1945 1946 1947 1948 1949
##  [71] 1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 1960 1961 1962 1963
##  [85] 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977
##  [99] 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991
## [113] 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005
## [127] 2006 2007 2008 2009 2010 2011 2012

to the years in the census_births data:

census_births$year
##   [1] 1909 1910 1911 1912 1913 1914 1915 1916 1917 1918 1919 1920 1921 1922
##  [15] 1923 1924 1925 1926 1927 1928 1929 1930 1931 1932 1933 1934 1935 1936
##  [29] 1937 1938 1939 1940 1941 1942 1943 1944 1945 1946 1947 1948 1949 1950
##  [43] 1951 1952 1953 1954 1955 1956 1957 1958 1959 1960 1961 1962 1963 1964
##  [57] 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978
##  [71] 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992
##  [85] 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006
##  [99] 2007 2008 2009 2010 2011 2012 2013 2014 2002 2003 2004 2005 2006 2007
## [113] 2008 2009 2010 2011 2012 2013 2014

to the years in the data resulting from an inner_join():

joined_via_inner <- 
  ssa_births %>% 
  inner_join(census_births, by = "year")  # quotes are required in the variable name
head(joined_via_inner)
## # A tibble: 6 x 4
##    year distinct_name_sex_combos births.x births.y
##   <dbl>                    <int>    <int>    <int>
## 1  1909                     4227   511231  2718000
## 2  1910                     4629   590719  2777000
## 3  1911                     4867   644271  2809000
## 4  1912                     6351   988059  2840000
## 5  1913                     6967  1137094  2869000
## 6  1914                     7964  1416325  2966000
joined_via_inner$year
##   [1] 1909 1910 1911 1912 1913 1914 1915 1916 1917 1918 1919 1920 1921 1922
##  [15] 1923 1924 1925 1926 1927 1928 1929 1930 1931 1932 1933 1934 1935 1936
##  [29] 1937 1938 1939 1940 1941 1942 1943 1944 1945 1946 1947 1948 1949 1950
##  [43] 1951 1952 1953 1954 1955 1956 1957 1958 1959 1960 1961 1962 1963 1964
##  [57] 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978
##  [71] 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992
##  [85] 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2002 2003 2003 2004
##  [99] 2004 2005 2005 2006 2006 2007 2007 2008 2008 2009 2009 2010 2010 2011
## [113] 2011 2012 2012

Only the years in both tables are included. Note that since both tables had a variable called births, but since this was not the key used in the join operation, R created new variable names to disambiguate. The one with the suffix .x comes from the left-hand table (ssa_births in this case), and the one with the suffix .y comes from the right-hand table (census_births).

Left join

In contrast, if we use left_join() then we will keep every entry from the left table, and record data from the variables added by the right table as NA for “missing”.

joined_via_left <-
  ssa_births %>%
  left_join(census_births, by = "year")
joined_via_left %>% filter(year < 1919)
## # A tibble: 39 x 4
##     year distinct_name_sex_combos births.x births.y
##    <dbl>                    <int>    <int>    <int>
##  1  1880                     2000   201482       NA
##  2  1881                     1935   192696       NA
##  3  1882                     2127   221534       NA
##  4  1883                     2084   216945       NA
##  5  1884                     2297   243463       NA
##  6  1885                     2294   240854       NA
##  7  1886                     2392   255319       NA
##  8  1887                     2373   247396       NA
##  9  1888                     2651   299474       NA
## 10  1889                     2590   288948       NA
## # ... with 29 more rows
joined_via_left %>% filter(year > 2008)
## # A tibble: 8 x 4
##    year distinct_name_sex_combos births.x births.y
##   <dbl>                    <int>    <int>    <int>
## 1  2009                    34689  3813356  4130665
## 2  2009                    34689  3813356  4130665
## 3  2010                    34050  3688069  3999386
## 4  2010                    34050  3688069  3999386
## 5  2011                    33880  3648686  3953590
## 6  2011                    33880  3648686  3953590
## 7  2012                    33697  3646143  3952841
## 8  2012                    33697  3646143  3952841
unique(joined_via_left$year)
##   [1] 1880 1881 1882 1883 1884 1885 1886 1887 1888 1889 1890 1891 1892 1893
##  [15] 1894 1895 1896 1897 1898 1899 1900 1901 1902 1903 1904 1905 1906 1907
##  [29] 1908 1909 1910 1911 1912 1913 1914 1915 1916 1917 1918 1919 1920 1921
##  [43] 1922 1923 1924 1925 1926 1927 1928 1929 1930 1931 1932 1933 1934 1935
##  [57] 1936 1937 1938 1939 1940 1941 1942 1943 1944 1945 1946 1947 1948 1949
##  [71] 1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 1960 1961 1962 1963
##  [85] 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977
##  [99] 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991
## [113] 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005
## [127] 2006 2007 2008 2009 2010 2011 2012

Notice that missing values are recorded for years in ssa_births but not census_births, but for years in census_births but not ssa_births (that is, years after 2012), the data from census_births is not included at all.

We can examine this explicitly by filtering to see only those rows for which one or the other variable has missing data, using the is.na() function:

Code

## Returns nothing; x is never missing in a left join, unless
## it was missing in the original left-hand table
joined_via_left %>% 
  filter(is.na(births.x))  
## # A tibble: 0 x 4
## # ... with 4 variables: year <dbl>, distinct_name_sex_combos <int>,
## #   births.x <int>, births.y <int>
## Returns years in the left but not the right table
joined_via_left %>%
  filter(is.na(births.y))
## # A tibble: 29 x 4
##     year distinct_name_sex_combos births.x births.y
##    <dbl>                    <int>    <int>    <int>
##  1  1880                     2000   201482       NA
##  2  1881                     1935   192696       NA
##  3  1882                     2127   221534       NA
##  4  1883                     2084   216945       NA
##  5  1884                     2297   243463       NA
##  6  1885                     2294   240854       NA
##  7  1886                     2392   255319       NA
##  8  1887                     2373   247396       NA
##  9  1888                     2651   299474       NA
## 10  1889                     2590   288948       NA
## # ... with 19 more rows

It seems that there are duplicate entries for the years from 2002 on. This is due to duplicate entries in the original births dataset. I’m not sure why this happened since the entries are identical, but as it happens it illustrates an aspect of joins: if a key is found multiple times in one dataset, then the data from the other dataset is duplicated.

Right join

Right join (right_join()) is just the opposite of left join, with the roles switched. In fact, right_join(df1, df2) is the same as left_join(df2, df1).

Full join

If we want to keep entries for years that appear in either dataset, we can perform a full join. In this case, whichever years are missing in one or the other dataset will have an NA for that variable.

joined_via_full <- 
  ssa_births %>%
  full_join(census_births, by = "year")
joined_via_full$year
##   [1] 1880 1881 1882 1883 1884 1885 1886 1887 1888 1889 1890 1891 1892 1893
##  [15] 1894 1895 1896 1897 1898 1899 1900 1901 1902 1903 1904 1905 1906 1907
##  [29] 1908 1909 1910 1911 1912 1913 1914 1915 1916 1917 1918 1919 1920 1921
##  [43] 1922 1923 1924 1925 1926 1927 1928 1929 1930 1931 1932 1933 1934 1935
##  [57] 1936 1937 1938 1939 1940 1941 1942 1943 1944 1945 1946 1947 1948 1949
##  [71] 1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 1960 1961 1962 1963
##  [85] 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977
##  [99] 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991
## [113] 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2002 2003 2003
## [127] 2004 2004 2005 2005 2006 2006 2007 2007 2008 2008 2009 2009 2010 2010
## [141] 2011 2011 2012 2012 2013 2014 2013 2014
joined_via_full %>%
  filter(is.na(births.x))
## # A tibble: 4 x 4
##    year distinct_name_sex_combos births.x births.y
##   <dbl>                    <int>    <int>    <int>
## 1  2013                       NA       NA  3932181
## 2  2014                       NA       NA  3988076
## 3  2013                       NA       NA  3932181
## 4  2014                       NA       NA  3988076
joined_via_full %>%
  filter(is.na(births.y))
## # A tibble: 29 x 4
##     year distinct_name_sex_combos births.x births.y
##    <dbl>                    <int>    <int>    <int>
##  1  1880                     2000   201482       NA
##  2  1881                     1935   192696       NA
##  3  1882                     2127   221534       NA
##  4  1883                     2084   216945       NA
##  5  1884                     2297   243463       NA
##  6  1885                     2294   240854       NA
##  7  1886                     2392   255319       NA
##  8  1887                     2373   247396       NA
##  9  1888                     2651   299474       NA
## 10  1889                     2590   288948       NA
## # ... with 19 more rows

Notice that the full set of years is the union of the first two sets.

Comparison

Having joined the data from the two sources, let’s see to what extent two sources agree by creating a scatterplot of the births with the data from ssa_births on the x axis and the data from census_births on the y.

Code

joined_via_full %>%
  ggplot(aes(x = births.x, y = births.y)) +
  geom_point()
## Warning: Removed 33 rows containing missing values (geom_point).

  1. Examine the documentation for the two original datasets to see whether you can account for the discrepancies.

Exercises

The following use the nycflights13 package.

  1. Compute the average arrival delay time for each carrier, and include the full name of the carrier in your result set.

Sample solution:

library(nycflights13)
flights %>%
  group_by(carrier) %>%
  summarize(avg_delay = mean(arr_delay, na.rm = TRUE)) %>%
  left_join(airlines, by = c("carrier" = "carrier"))
  1. What was the full name of the airport that was the most common destination from NYC in 2013?

Sample solution:

flights %>%
  group_by(dest) %>%
  summarize(N = n()) %>%
  arrange(desc(N)) %>%
  left_join(airports, by = c("dest" = "faa"))
  1. What is the most commonly used plane? Find it’s manufacturer and model.

Sample solution:

flights %>%
  left_join(planes, by = c("tailnum" = "tailnum")) %>%
  group_by(manufacturer, model) %>%
  summarize(N = n()) %>%
  arrange(desc(N))
  1. Were there any flights that went to “mystery” airports (i.e., airports that don’t appear in the airports table)?

Sample solution:

flights %>%
  left_join(airports, by = c("dest" = "faa")) %>%
  filter(is.na(name)) %>%
  group_by(dest) %>%
  summarize(N = n())
  1. Were there any “mystery” planes (i.e., planes that don’t appear in the planes table)?

Sample solution:

flights %>%
  left_join(planes, by = c("tailnum" = "tailnum")) %>%
  filter(is.na(model)) %>%
  group_by(tailnum) %>%
  summarize(N = n()) %>%
  arrange(desc(N))

Getting credit

DM me on Slack answering the following prompt: You have now learned the basics of data wrangling. What remains unclear?