STAT 209: Lab 2

Loading Data

Since we want to be working with data, let’s load a dataset into our environment.

There are three main ways to load a dataset for use in R:

Accessing data that is bundled with an R package
Reading in data that is posted on the web
Reading in data from a file on your computer/server account

Accessing data from a package

Some R packages come with example datasets. One such package, Lock5Data, was developed as a companion to our textbook for this course. I will also use examples from the Stat2Data package.

Create a new code chunk, in which you load the Lock5Data package using the library() command. Run the line. Note: R is case-sensitive, so commands do not work (or worse, may run but do something different) if you use different capitalization. Pay attention to detail!

Once we have loaded the package that contains the dataset we want, we then (usually) have to load the dataset. We can do this using the data() command.

For example, the "Pollster08" dataset provided by the Stat2Data package contains data about some political polls taken during the 2008 U.S. Presidential campaign.

We can load the package and then the dataset as follows:

library(Stat2Data) #loads the package

data(Pollster08)   #loads the dataset

Notice that Pollster08 now appears in our Environment.

In the future, any time you want to use a dataset from an R package, you will first need to load the package, then the dataset, as above.
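If you are unsure which datasets a package provides, the data() command can also list them. The sketch below uses the datasets package, which ships with R, only so that it runs without installing anything; substituting "Stat2Data" or "Lock5Data" should work the same way once those packages are installed. The name listing is just a label chosen for this example.

```r
# List the datasets bundled with a package.
# "datasets" ships with R; swap in "Stat2Data" once it is installed.
listing <- data(package = "datasets")

# The result stores one row per dataset; show the first few names and titles
head(listing$results[, c("Item", "Title")])
```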

Accessing Documentation

For packaged datasets like this, as well as for R functions, we can get some information about the data using R’s help interface. At the console, type

help(Pollster08)

This will pop up a documentation window in the Help tab in the lower right. Here you can read about the source of this data, what each case is, what each variable means, how it is measured, etc.

This documentation of a dataset is what’s called the code book that accompanies the data. When collecting your own data, you should also create an accompanying code book (which might just be a text file) to give context to anyone using your dataset.

Reading data directly from the web

Reading data bundled with an R package is nice for a course, but in “real life” the data we want is usually not so conveniently packaged. More often it is stored in a file somewhere, either on the web or on our own computer.

We can load a data set from a file on the web if we know the URL.

The read.file() command, which is provided by the mosaic package, makes this simple for well-formatted data files.

The command below instructs R to fetch the data from the given URL and put it in an object in our environment called Depression, which will then appear in the Environment pane.

Depression <- read.file("http://colindawson.net/data/depression.csv")

Note that read.file() produces a result, namely a data frame object. We store that result in a variable, and we give the variable a name that tells us something about what the data frame contains.

This is a bit different from the way the data() command worked for loading data from an R package. There, we didn’t have to create a variable in R, because one already existed; we just had to tell R that we were going to be using it and that it should be added to our environment.
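The role of the assignment is easy to see with a small experiment. The sketch below uses base R's read.csv() as a stand-in for read.file(), and a throwaway file so that it runs anywhere; tmp and Scores are names made up for this illustration.

```r
# Write a tiny CSV to a temporary file so the example is self-contained
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,score", "1,10", "2,12"), tmp)

# Without assignment, the data frame is printed and then discarded
read.csv(tmp)

# With assignment, the data frame is stored under a name we choose
Scores <- read.csv(tmp)
class(Scores)   # "data.frame"
```

After the last line runs, Scores appears in the Environment pane, while the result of the unassigned call does not.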

Reading data from a file on your computer

You can also point to a file on your computer by supplying the file’s location on the computer as a path, in place of the URL.

Absolute paths

In your “Files” tab you should see a folder called stat209. Double click it to see its contents in the Files tab. It should have several subdirectories, one of which is called data. If you double click on data you should see the file depression.csv inside.

If, instead of a URL, you put the path to this file, which should be "~/stat209/data/depression.csv", inside the parentheses in read.file(), R would read in this file. The ~ symbol represents your “Home” folder.

A location that starts with a forward slash / or a tilde ~ is an absolute path. That is, it doesn’t matter what directory (folder) you’re currently working in; it will look in that specific location.
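To see what the ~ shorthand stands for on your own account, you can ask R to expand it. This sketch assumes the stat209 folder described above; if the file is not where R expects, file.exists() simply returns FALSE rather than raising an error.

```r
path <- "~/stat209/data/depression.csv"

# Replace ~ with the full location of your home folder
path.expand(path)

# TRUE if a file exists at that location, FALSE otherwise
file.exists(path)
```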

Relative paths

Sometimes it is convenient to specify a file using a relative path instead, particularly when you want your code to run on another computer, where your files will not be in exactly the same location as on the computer you were working on.

With a relative path, we tell R to look in a location that is relative to the one you’re currently working in (which might be the location where your script is).

For example, if I am currently working in my home directory (abbreviated in paths as ~), I could replace the absolute path above with "stat209/data/depression.csv". This tells R to look for a folder called stat209 inside the current working directory; a folder called data inside that one; and a file called depression.csv inside that.

If my file were directly inside my home folder, I could just type "depression.csv" inside read.file().

If you have been following along, the code below will probably give you an error:

## Probably generates an error
Depression <- read.file("stat209/data/depression.csv")

That’s because R is not looking inside your home directory, but inside your current working directory, which should be where your project is located.

Your working directory

R interprets relative paths relative to a starting point called your working directory. It is not, unfortunately, necessarily the same as the directory you are currently viewing in the Files tab.

You can see what directory is currently set as your working directory (relative to which all relative paths are interpreted) by typing getwd() in the console.

You can change your working directory (for example, to ~/stat209) by typing setwd("~/stat209") (with the quotes).
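The effect of the working directory on a relative path can be seen in a small, self-contained experiment. The sketch below deliberately uses a throwaway folder rather than stat209; demo_dir, old_wd, and tiny.csv are names invented for the illustration.

```r
# Remember the current working directory so we can restore it afterwards
old_wd <- getwd()

# Build a throwaway folder containing one small CSV file
demo_dir <- file.path(tempdir(), "wd_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines(c("x,y", "1,2"), file.path(demo_dir, "tiny.csv"))

# The relative path "tiny.csv" means: working directory + "tiny.csv"
file.path(getwd(), "tiny.csv")
file.exists("tiny.csv")   # FALSE unless your working directory already holds a tiny.csv

# After changing the working directory, the same relative path is found
setwd(demo_dir)
found <- file.exists("tiny.csv")
found                     # TRUE

# Put the working directory back the way it was
setwd(old_wd)
```

The relative path never changed; only the starting point R used to interpret it did.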

Make a code chunk in which you change your working directory to stat209, and then write a command using a relative path to read in the depression.csv dataset using the read.file() command. Don’t forget to store the resulting data frame in a named variable.

Interacting with data sets

The Pollster08 data that we read in earlier consists of several variables about various opinion polls taken during the 2008 U.S. Presidential election.

You should see in the Environment tab that Pollster08 consists of 102 observations (cases), each with 11 variables recorded.

We could have R print out the entire data table by simply typing the name of the data object:

Pollster08

However, printing the whole dataset is not that useful, especially if the data contains a lot of cases.
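If you only want a quick look, the head() function prints just the first few rows. The sketch below uses mtcars, a data frame that ships with R, so that it runs without any extra packages; head(Pollster08) works the same way once that dataset is loaded. The name first_three is chosen just for this example.

```r
# Print only the first six rows (the default) of a data frame
head(mtcars)

# Ask for a different number of rows with the n argument
first_three <- head(mtcars, n = 3)
first_three
```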

One advantage of RStudio is that it comes with a built-in data viewer.

Click on the name Pollster08 in the Environment pane. This will bring up a “spreadsheet”-style display of the data set in the Data Viewer (upper left window). You should see 11 columns. Each row represents a poll, and each column represents a variable (PollTaker, PollDates, etc.). The first entry in each row is simply the row number. Use the scrollbar on the right side of the Data Viewer window to examine the complete data set.

You can close the data viewer by clicking on the “X” in the upper lefthand corner.

R has stored this data in a data frame. This is the sort of table we have sketched in class: each row is a case, each column is a variable.

Most of what we will do in R will consist, in one way or another, of taking actions on data frames.
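A compact way to confirm this row-and-column structure from the console is str(), which lists each variable along with its type and first few values. As a sketch, the built-in mtcars data frame is used here as a stand-in so the code runs without extra packages; the same calls apply to Pollster08. The name var_names is chosen just for this example.

```r
# One line per column: its name, type, and first few values
str(mtcars)

# Just the variable (column) names
var_names <- names(mtcars)
var_names
```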

Run the command below, and then examine its structure. Identify all of the function names, argument names, and argument values. (Note that sometimes argument values, particularly for the first argument of a function, are supplied without an argument name: R can assign values to names according to the order they’re given in. However, the reverse is never true: if you give an argument name you need to supply a value.) What does this command appear to do?

pull(Pollster08, var = PollTaker)
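The distinction between named and unnamed arguments is not specific to pull(). As a separate illustration, base R's round() takes a first argument called x and a second called digits, and the three calls sketched below are equivalent. The variable names a, b, and c_val are chosen just for this example.

```r
# Both arguments named
a <- round(x = 3.14159, digits = 2)

# First argument matched by position, second by name
b <- round(3.14159, digits = 2)

# Both arguments matched by position
c_val <- round(3.14159, 2)

# All three calls give the same value
c(a, b, c_val)
```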

Getting Help

The same help interface works for functions just as for data frames. We can get more information about the pull() or head() functions by typing

help(pull)
help(head)

at the console. We could also have done

?pull
?head

to achieve the same result.

Look up the nrow and ncol functions using the Help interface. Add a sentence to this document briefly describing what they do.

On Your Own (Homework)

Load the HoneybeeCircuits dataset from the Lock5Data package (see “Accessing data from a package”). Use the Help interface to look up what the dataset is about and add a brief summary in the text of this document below this question. Then, write an R command that returns the number of cases in the data (hint: use what you found in Exercise 9).
Make a code chunk below and add code to read in the women_in_labor_force.csv dataset from the data directory on your server account using read.file() with an absolute path, storing the result as WomenInLabor. Make a second chunk to read the same dataset with a relative path from your project directory, storing the result as WomenInLabor2. What are the cases and variables in this dataset?
Knit your modified .Rmd file. Save a copy of your edited version of this lab (both the .Rmd and Knitted .html files) in the hw2 turnin folder.

Environment and Session Information

File creation date: 2021-06-03
R version 3.6.0 (2019-04-26)
R version (short form): 3.6.0
mosaic package version: 1.5.0
tidyverse package version: 1.3.1
Additional session information

## R version 3.6.0 (2019-04-26)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.5 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] Stat2Data_2.0.0   mosaic_1.5.0      Matrix_1.2-17    
##  [4] mosaicData_0.17.0 ggformula_0.9.1   ggstance_0.3.1   
##  [7] lattice_0.20-38   forcats_0.5.1     stringr_1.4.0    
## [10] dplyr_1.0.5       purrr_0.3.4       readr_1.4.0      
## [13] tidyr_1.1.3       tibble_3.1.1      ggplot2_3.3.3    
## [16] tidyverse_1.3.1  
## 
## loaded via a namespace (and not attached):
##  [1] ggdendro_0.1-20  httr_1.4.2       jsonlite_1.7.2   splines_3.6.0   
##  [5] modelr_0.1.8     shiny_1.3.2      assertthat_0.2.1 cellranger_1.1.0
##  [9] yaml_2.2.0       ggrepel_0.8.1    pillar_1.6.0     backports_1.1.4 
## [13] glue_1.4.2       digest_0.6.21    promises_1.0.1   rvest_1.0.0     
## [17] colorspace_1.4-1 htmltools_0.3.6  httpuv_1.5.1     pkgconfig_2.0.3 
## [21] broom_0.7.6      haven_2.4.1      xtable_1.8-4     scales_1.0.0    
## [25] later_0.8.0      generics_0.0.2   ellipsis_0.3.0   withr_2.4.2     
## [29] lazyeval_0.2.2   cli_2.5.0        magrittr_2.0.1   crayon_1.4.1    
## [33] readxl_1.3.1     mime_0.7         evaluate_0.14    fs_1.3.1        
## [37] fansi_0.4.0      MASS_7.3-51.4    xml2_1.3.2       tools_3.6.0     
## [41] hms_1.0.0        lifecycle_1.0.0  munsell_0.5.0    reprex_2.0.0    
## [45] compiler_3.6.0   rlang_0.4.11     grid_3.6.0       rstudioapi_0.13 
## [49] htmlwidgets_1.3  crosstalk_1.0.0  mosaicCore_0.6.0 rmarkdown_2.5   
## [53] gtable_0.3.0     DBI_1.0.0        R6_2.4.0         gridExtra_2.3   
## [57] lubridate_1.7.10 knitr_1.25       utf8_1.1.4       stringi_1.4.3   
## [61] Rcpp_1.0.2       vctrs_0.3.8      leaflet_2.0.2    dbplyr_2.1.1    
## [65] tidyselect_1.1.1 xfun_0.19