Spatial Data

Cholera Outbreaks

A partial list of patients dying of cholera in an 1854 outbreak (Source: MDSR p. 318

A partial list of patients dying of cholera in an 1854 outbreak (Source: MDSR p. 318

The prevailing wisdom at the time was that cholera was transmitted through the air.

Can we use data visualization to find patterns in this outbreak?

## Simple feature collection with 250 features and 2 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: 529160.3 ymin: 180857.9 xmax: 529655.9 ymax: 181306.2
## CRS:           +proj=tmerc +lat_0=49 +lon_0=-2 +k=0.9996012717 +x_0=400000 +y_0=-100000 +ellps=airy +units=m +no_defs
## First 10 features:
##    Id Count                  geometry
## 1   0     3 POINT (529308.7 181031.4)
## 2   0     2 POINT (529312.2 181025.2)
## 3   0     1 POINT (529314.4 181020.3)
## 4   0     1 POINT (529317.4 181014.3)
## 5   0     4 POINT (529320.7 181007.9)
## 6   0     2   POINT (529336.7 181006)
## 7   0     2 POINT (529290.1 181024.4)
## 8   0     2   POINT (529301 181021.2)
## 9   0     3   POINT (529285 181020.2)
## 10  0     2 POINT (529288.4 181031.8)

That’s… not that comprehensible

We’d like something more like this…

A hand-drawn map of cholera cases by John Snow, overlaying cases on streets and water pumps

A hand-drawn map of cholera cases by John Snow, overlaying cases on streets and water pumps

Spatial Layers

##  [1] "Cholera_Deaths.dbf"          "Cholera_Deaths.prj"         
##  [3] "Cholera_Deaths.sbn"          "Cholera_Deaths.sbx"         
##  [5] "Cholera_Deaths.shp"          "Cholera_Deaths.shx"         
##  [7] "OSMap_Grayscale.tfw"         "OSMap_Grayscale.tif"        
##  [9] "OSMap_Grayscale.tif.aux.xml" "OSMap_Grayscale.tif.ovr"    
## [11] "OSMap.tfw"                   "OSMap.tif"                  
## [13] "Pumps.dbf"                   "Pumps.prj"                  
## [15] "Pumps.sbx"                   "Pumps.shp"                  
## [17] "Pumps.shx"                   "README.txt"                 
## [19] "SnowMap.tfw"                 "SnowMap.tif"                
## [21] "SnowMap.tif.aux.xml"         "SnowMap.tif.ovr"

Note two groups: Cholera_Deaths.* and Pumps.*. These are “layers” of spatial data.

Listing Layers

## [1] "Cholera_Deaths" "Pumps"         
## attr(,"driver")
## [1] "ESRI Shapefile"
## attr(,"nlayers")
## [1] 2

Viewing Layer Info

## Source: "/home/cdawson/shared/teaching/stat209/data/SnowGIS_SHP", layer: "Cholera_Deaths"
## Driver: ESRI Shapefile; number of rows: 250 
## Feature type: wkbPoint with 2 dimensions
## Extent: (529160.3 180857.9) - (529655.9 181306.2)
## CRS: +proj=tmerc +lat_0=49 +lon_0=-2 +k=0.9996012717 +x_0=400000 +y_0=-100000 +ellps=airy +units=m +no_defs  
## LDID: 87 
## Number of fields: 2 
##    name type length typeName
## 1    Id    0      6  Integer
## 2 Count    0      4  Integer

The data contains 250 rows, each with a count (number of cases) and an Id.

We can read the actual data into R as a Spatial Data Frame. Some of the functions that work on regular data frames will work on it; others won’t.

## Reading layer `Cholera_Deaths' from data source 
##   `/home/cdawson/shared/teaching/stat209/data/SnowGIS_SHP/Cholera_Deaths.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 250 features and 2 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: 529160.3 ymin: 180857.9 xmax: 529655.9 ymax: 181306.2
## proj4string:   +proj=tmerc +lat_0=49 +lon_0=-2 +k=0.9996012717 +x_0=400000 +y_0=-100000 +ellps=airy +units=m +no_defs
## Simple feature collection with 250 features and 2 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: 529160.3 ymin: 180857.9 xmax: 529655.9 ymax: 181306.2
## proj4string:   +proj=tmerc +lat_0=49 +lon_0=-2 +k=0.9996012717 +x_0=400000 +y_0=-100000 +ellps=airy +units=m +no_defs 
## First 10 features:
##    Id Count                  geometry
## 1   0     3 POINT (529308.7 181031.4)
## 2   0     2 POINT (529312.2 181025.2)
## 3   0     1 POINT (529314.4 181020.3)
## 4   0     1 POINT (529317.4 181014.3)
## 5   0     4 POINT (529320.7 181007.9)
## 6   0     2   POINT (529336.7 181006)
## 7   0     2 POINT (529290.1 181024.4)
## 8   0     2   POINT (529301 181021.2)
## 9   0     3   POINT (529285 181020.2)
## 10  0     2 POINT (529288.4 181031.8)

Mapping the Cholera Cases

Here’s a quick and dirty map of the cases as a basic scatterplot.

Adding Context from Google Maps

Unfortunately Google has restricted their Maps API and now requires a Google Cloud account to access it, so you won’t be able to run this code unless you create one here. It’s free for 90 days or up to a certain cumulative data bandwidth, but requires a credit card on file.

An open source alternative is available in the OpenStreetMaps package, but I have not had a chance to investigate this thoroughly yet.

Overlaying the cholera cases

Hmm… looks like we need to create columns corresponding to lon and lat to align the aesthetic mapping of the Cholera data with the background plot. The sf_coordinates() function will extract these.

Let’s try the plot again.

## [[1]]
## mapping: size = ~Count 
## geom_sf: na.rm = FALSE
## stat_sf: na.rm = FALSE
## position_identity 
## 
## [[2]]
## <ggproto object: Class CoordSf, CoordCartesian, Coord, gg>
##     aspect: function
##     backtransform_range: function
##     clip: on
##     crs: NULL
##     datum: crs
##     default: TRUE
##     default_crs: NULL
##     determine_crs: function
##     distance: function
##     expand: TRUE
##     fixup_graticule_labels: function
##     get_default_crs: function
##     is_free: function
##     is_linear: function
##     label_axes: list
##     label_graticule: 
##     labels: function
##     limits: list
##     lims_method: cross
##     modify_scales: function
##     ndiscr: 100
##     params: list
##     range: function
##     record_bbox: function
##     render_axis_h: function
##     render_axis_v: function
##     render_bg: function
##     render_fg: function
##     setup_data: function
##     setup_layout: function
##     setup_panel_guides: function
##     setup_panel_params: function
##     setup_params: function
##     train_panel_guides: function
##     transform: function
##     super:  <ggproto object: Class CoordSf, CoordCartesian, Coord, gg>

Where’s the cholera data?

Let’s take a closer look at the coordinates used in each dataset.

## Simple feature collection with 5 features and 4 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: 529308.7 ymin: 181007.9 xmax: 529320.7 ymax: 181031.4
## proj4string:   +proj=tmerc +lat_0=49 +lon_0=-2 +k=0.9996012717 +x_0=400000 +y_0=-100000 +ellps=airy +units=m +no_defs 
##   Id Count      lon      lat                  geometry
## 1  0     3 529308.7 181031.4 POINT (529308.7 181031.4)
## 2  0     2 529312.2 181025.2 POINT (529312.2 181025.2)
## 3  0     1 529314.4 181020.3 POINT (529314.4 181020.3)
## 4  0     1 529317.4 181014.3 POINT (529317.4 181014.3)
## 5  0     4 529320.7 181007.9 POINT (529320.7 181007.9)
## # A tibble: 1 x 4
##   ll.lat ll.lon ur.lat ur.lon
##    <dbl>  <dbl>  <dbl>  <dbl>
## 1   51.5 -0.140   51.5 -0.133

(The bb is short for “bounding box”)

It seems the coordinate systems are completely different. It doesn’t seem to be a simple matter of rescaling the units either…

Projections

Geographic locations need to be projected from “globe” locations to a 2D coordinate system.

There are multiple such projections, with different properties.

The Mercator projection preserves angles, but distorts areas:

Gall-Peters projection preserves areas, but distorts angles:

The Lambert projection is angle-preserving, similar to Mercator:

The Albers projection distorts both area and angles a little, trying to minimize the overall distortion:

How are Cholera Deaths Projected?

## Coordinate Reference System:
##   No user input
##   wkt:
## PROJCS["OSGB 1936 / British National Grid",
##     GEOGCS["OSGB 1936",
##         DATUM["OSGB 1936",
##             SPHEROID["Airy 1830",6377563.396,299.3249646]],
##         PRIMEM["Greenwich",0.0],
##         UNIT["degree",0.0174532925199433]],
##     PROJECTION["Transverse_Mercator"],
##     PARAMETER["false_easting",400000.0],
##     PARAMETER["false_northing",-100000.0],
##     PARAMETER["central_meridian",-2.0],
##     PARAMETER["scale_factor",0.9996012717],
##     PARAMETER["latitude_of_origin",49.0],
##     UNIT["m",1.0]]

Coordinate Reference Systems

Transforming the Cholera Data to latitude and longitude

## Simple feature collection with 250 features and 4 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: -0.1384685 ymin: 51.51135 xmax: -0.1313274 ymax: 51.51533
## CRS:           EPSG:4326
## First 10 features:
##    Id Count                    geometry        lon      lat
## 1   0     3 POINT (-0.1363245 51.51291) -0.1363245 51.51291
## 2   0     2 POINT (-0.1362775 51.51285) -0.1362775 51.51285
## 3   0     1 POINT (-0.1362473 51.51281) -0.1362473 51.51281
## 4   0     1 POINT (-0.1362063 51.51275) -0.1362063 51.51275
## 5   0     4 POINT (-0.1361612 51.51269) -0.1361612 51.51269
## 6   0     2 POINT (-0.1359313 51.51267) -0.1359313 51.51267
## 7   0     2 POINT (-0.1365949 51.51285) -0.1365949 51.51285
## 8   0     2 POINT (-0.1364394 51.51282) -0.1364394 51.51282
## 9   0     3 POINT (-0.1366706 51.51281) -0.1366706 51.51281
## 10  0     2 POINT (-0.1366179 51.51292) -0.1366179 51.51292
## # A tibble: 1 x 4
##   ll.lat ll.lon ur.lat ur.lon
##    <dbl>  <dbl>  <dbl>  <dbl>
## 1   51.5 -0.140   51.5 -0.133

That looks a lot better. Let’s try plotting the cases on the street map again.

## Warning: Removed 41 rows containing missing values (geom_sf).

It looks like they’re offset somewhat, not quite following the street contours.

By diving into the documentation of the dataset, we find out that we need an extra transformation, due to the particular coordinate system the Cholera data was recorded in.

## Simple feature collection with 250 features and 4 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: -0.1400738 ymin: 51.51186 xmax: -0.1329335 ymax: 51.51583
## CRS:           EPSG:4326
## First 10 features:
##    Id Count                    geometry        lon      lat
## 1   0     3 POINT (-0.1379301 51.51342) -0.1379301 51.51342
## 2   0     2  POINT (-0.137883 51.51336) -0.1378830 51.51336
## 3   0     1 POINT (-0.1378529 51.51332) -0.1378529 51.51332
## 4   0     1 POINT (-0.1378119 51.51326) -0.1378119 51.51326
## 5   0     4  POINT (-0.1377668 51.5132) -0.1377668 51.51320
## 6   0     2 POINT (-0.1375369 51.51318) -0.1375369 51.51318
## 7   0     2 POINT (-0.1382004 51.51336) -0.1382004 51.51336
## 8   0     2  POINT (-0.138045 51.51333) -0.1380450 51.51333
## 9   0     3 POINT (-0.1382761 51.51332) -0.1382761 51.51332
## 10  0     2 POINT (-0.1382234 51.51343) -0.1382234 51.51343
## Warning: Removed 3 rows containing missing values (geom_sf).

That looks more reasonable. Let’s try adding the water pumps from the 1850s. We’ll need to read in the data, and apply the same two projections to it to get it on the same coordinate system.

Now let’s try visualizing the cases with the locations of water pumps in red:

Lo and behold, the cases of cholera radiate from a particular water pump in the center! Maybe it’s transmitted through the water supply!