Loading Data

Since we want to be working with data, let’s load a dataset into our environment.

There are three main ways to load a dataset for use in R:

  • Accessing data that is bundled with an R package
  • Reading in data that is posted on the web
  • Reading in data from a file that you have locally

Accessing data from a package

Some R packages come with example datasets. Two such packages are Lock5Data and Stat2Data, which consist of datasets used in two statistics textbooks.

  1. Create a new code chunk, in which you load the Stat2Data package with the library() command. Type the name of the package inside the parentheses. Note: R is case-sensitive, so commands do not work (or worse, may run but do something different) if you use different capitalization. Pay attention to detail! Run the chunk.

YOUR CODE

Once we have loaded the package that contains the dataset we want, we then (usually) want to load the dataset. We can do this using the data() command.

For example, the "Pollster08" dataset provided by the Stat2Data package contains data about some political polls taken during the 2008 U.S. Presidential campaign.

We can load the package and then the dataset as follows:

library(Stat2Data) #loads the package (you already did this)
data(Pollster08)   #loads the specific dataset

Notice that Pollster08 now appears in our Environment.

In the future, any time you want to use a dataset which is included in an R package, you will first need to load the package with library(), then the dataset with data(), as above.
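If you are not sure which datasets a package provides, data() can also be called with a package argument to list them. The sketch below uses the datasets package and its mtcars data only as stand-ins, because they are installed with R itself; assuming Stat2Data is installed, the same calls should work with "Stat2Data" and Pollster08 substituted.

```r
# List the datasets bundled with a package. The 'datasets' package is a
# stand-in here because it ships with every R installation.
available <- data(package = "datasets")$results[, "Item"]
head(available)

# Load one dataset by name; it then appears in the Environment pane.
data(mtcars)
class(mtcars)
```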

Accessing Documentation

For packaged datasets like this, as with functions, we can get some information about the data using R’s help interface. This is better done at the console, because we don’t really want the help page to pop up when we Knit the document (I’ve set the following chunks not to run when Knitting). Type the following at the console:

help(Pollster08)

or, equivalently:

?Pollster08

This will pop up a documentation window in the Help tab in the lower right. Here you can read about the source of this data, what each case is, what each variable means, how it is measured, etc.

This documentation of a dataset is what’s called the code book that accompanies the data. When collecting your own data, you should also create an accompanying code book (which might just be a text file) to give context to anyone using your dataset. It should include a brief description of how the data was collected and what each variable (column) means.
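As a rough sketch of what a minimal code book might look like, the code below writes one to a plain text file. The dataset name, collection method, and variables are all hypothetical and invented for illustration; a real code book would describe your own data.

```r
# A hypothetical code book: one line per piece of context a reader needs.
codebook <- c(
  "Dataset: toy_survey.csv (hypothetical example)",
  "Collection: in-class paper survey, one row per student",
  "Variables:",
  "  id     - anonymous student identifier",
  "  sleep  - hours of sleep last night (self-reported)",
  "  coffee - cups of coffee per day (self-reported)"
)

# Write it to a text file (in a temporary folder here, so nothing is overwritten).
codebook_path <- file.path(tempdir(), "toy_survey_codebook.txt")
writeLines(codebook, codebook_path)

# Read it back to check the contents.
readLines(codebook_path)
```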

Reading data directly from a file

Reading data bundled with an R package is nice for a course, but in “real life” the data we want is usually not so conveniently packaged. More often it is stored in a file somewhere, either on the web or on our own computer.

If the data is on the web somewhere

We can load a data set from a file on the web if we know the URL.

The read_csv() command makes this simple for datasets that are represented as .csv (comma-separated values) text files (such as those exported from Excel or Google Sheets).

The command below instructs R to fetch the data from the given URL, and put it in an object in our environment called Depression, which will now appear in the Environment pane.

Make sure you click “Run Previous Chunks” before running it, because the read_csv() function is provided by the tidyverse package, which needs to have been loaded before you can use the function. This happens in the setup chunk above with the library(tidyverse) command.

Depression <- read_csv("http://colindawson.net/data/depression.csv")

Note that read_csv() produces a result, namely a data frame, which we store in an object whose name tells us something about what the data is.

This is a bit different from the way the data() command worked for loading data from an R package. There, we didn’t have to create an object in R because one already existed; we just had to tell R that we were going to be using it, and that it should be added to our environment.
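The difference can be seen in a small sketch that does not need a network connection. The file and its contents are hypothetical toy data written to a temporary location; the point is only that read_csv() hands back a data frame, which is kept only if we assign it to a name.

```r
library(readr)  # read_csv() lives in readr, which library(tidyverse) also loads

# Hypothetical toy data written to a temporary file, standing in for a real .csv
toy_path <- tempfile(fileext = ".csv")
writeLines(c("id,score", "1,10", "2,14", "3,9"), toy_path)

# read_csv() returns a data frame; we choose the name it is stored under
Toy <- read_csv(toy_path)

is.data.frame(Toy)
names(Toy)
```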

Before you try to read in a dataset, make sure you know whether it ‘lives’ in an R package, on the web, or in a file, and use the appropriate method to read it in.

Reading data from a file on the computer

You can also point to a file on the computer you are running RStudio on (in this case the server) by supplying the file’s location on the computer as a path, in place of the URL.

Note: RStudio as run on the server cannot read files directly from your personal machine. If you want to read in data from a file that you have created (for example, by making the data frame in Excel and exporting it to a .csv file) you will need to Upload the file first, as we have done with homework assignments and quizzes.

Relative Paths for Portability

In your “Files” tab you should see a folder called stat113. Double click it to see its contents in the Files tab. It should have several sub-directories, one of which is called data. If you click on data you should see the file depression.csv inside.

We could point RStudio directly to the dataset in this folder with a command like

## This is an example that works, but not what we want to do
Depression <- read_csv("~/stat113/data/depression.csv")

The ~ symbol is a shorthand that the computer will interpret as referring to your “Home” folder, and is typed literally. The forward slashes are used to indicate that the preceding thing is the name of a folder. So we can read this path as saying "go to the Home folder, then find the stat113 folder there, then the data folder within that, and finally look for a file called depression.csv in that folder."
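One way to see this structure is to have R take the path apart. This is only an illustrative sketch using base R functions; none of these calls require the file to actually exist.

```r
path <- "~/stat113/data/depression.csv"

# Split the path at each forward slash to see the folders in order,
# ending with the file name.
parts <- strsplit(path, "/", fixed = TRUE)[[1]]
parts

# The last piece is the file itself.
basename(path)

# Ask R what ~ stands for on this computer (the result depends on the machine).
path.expand(path)
```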

However, if we are sharing our work with someone else, they might not have the same arrangement of files and folders on their machine that we do. For portability, if the data cannot be hosted on the web, it is usually better to include a copy of the data somewhere within our project folder, and then point to its location relative to that folder.

In an RMarkdown document, if a path doesn’t start with a / or ~, then the computer will look for files and folders within the folder where the .Rmd file is located. The simplest case is when the file you need is in that same folder. In that case you can just give the filename (in quotes).

  2. Copy the depression.csv file to the project folder where your lab3.Rmd file is located. Then try reading it in using read_csv(), using only the file name (not the directory) in quotes inside the parentheses.

YOUR CODE

Note: If you save your .Rmd file somewhere else later, you’ll also need to copy any data files there so that when the .Rmd is Knit from there it can find the data in its new context.
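If a relative path is not found, it can help to check where R is actually looking. The sketch below is a self-contained illustration: the lab_demo folder and toy.csv file are hypothetical and are created in a temporary location, and setwd() is used only to imitate working from the folder where an .Rmd file lives.

```r
# Hypothetical project folder with one small data file in it
project_dir <- file.path(tempdir(), "lab_demo")
dir.create(project_dir, showWarnings = FALSE)
writeLines(c("x,y", "1,2", "3,4"), file.path(project_dir, "toy.csv"))

# Pretend the .Rmd lives in project_dir by making it the working directory
old_dir <- setwd(project_dir)

# A bare file name is a relative path: R looks in the working directory
found_here <- file.exists("toy.csv")
full_path  <- normalizePath("toy.csv")  # the full location it resolves to

setwd(old_dir)  # put the working directory back the way it was

found_here
full_path
```

When working in your own project, getwd() and file.exists("depression.csv") are usually enough to check whether the file is where R expects it.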

Interacting with data sets

The Pollster08 data that we read in earlier consists of several variables about various opinion polls taken during the 2008 U.S. Presidential election.

You should see in the Environment tab that Pollster08 consists of 102 observations (cases), each with 11 variables recorded.

We could have R print out the entire data table by simply typing the name of the data object:

Pollster08

However, printing the whole dataset is usually not that useful, especially if the data contains a lot of cases.
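A few base R functions give a more compact look at a data frame than printing all of it. The sketch below uses a small made-up data frame called ToyPolls (its values are hypothetical) so that it runs without any packages; the same functions can be applied to Pollster08 once it is loaded.

```r
# Hypothetical toy data frame standing in for a larger dataset
ToyPolls <- data.frame(
  PollTaker = c("A", "B", "C", "D", "E", "F", "G", "H"),
  Margin    = c(3, -1, 5, 2, 0, 4, -2, 6)
)

head(ToyPolls, n = 3)  # only the first three rows
names(ToyPolls)        # the variable (column) names
str(ToyPolls)          # a compact summary of each variable's type and values
```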

One advantage of RStudio is that it comes with a built-in data viewer.

  3. Click on the name Pollster08 in the Environment pane. This will bring up a “spreadsheet”-style display of the data set in the Data Viewer (upper left window). What you should see are 11 columns, each row representing a poll, and each column representing a variable (PollTaker, PollDates, etc.). The first entry in each row is simply the row number. Use the scrollbar on the right side of the Data Viewer window to examine the complete data set.

You can close the data viewer by closing the tab it’s in.

R has stored this data in a data frame. This is the sort of table we have sketched in class: each row is a case, each column is a variable.

Most of what we will do in R will consist, in one way or another, of taking actions on data frames.

Getting Help

The following command performs an operation on the Pollster08 dataset.

pull(Pollster08, var = PollTaker)

If you have not seen the pull() function before, then it might not be clear what it is for or what it is doing.

In the same way that we used help(Pollster08) or ?Pollster08 to access documentation about the dataset, we can access documentation about a function like pull by typing help(pull) or ?pull at the console.

Try it:

help(pull)
?pull

What does this function appear to do?

  4. Look up the nrow and ncol functions using the Help interface. Add a sentence to this document briefly describing what they do.

YOUR SUMMARY

HOMEWORK

  5. Load the HoneybeeCircuits dataset from the Lock5Data package (see “Accessing data from a package”). Use the Help interface to look up what the dataset is about and add a brief summary in the text of this document below this question. Then, write an R command that returns the number of cases in the data (Hint: Use what you found in Exercise 4).

YOUR CODE AND RESPONSE

  6. Make a code chunk below that reads in the women_in_labor_force.csv dataset as an object called WomenInLaborForce. This data is also in the data directory on your server account, but as before, you should copy the data file to your project directory first and then point R to it there. You should see WomenInLaborForce in your Environment.

YOUR CODE AND RESPONSE

  7. Knit your modified .Rmd file. Save a copy of your edited version of this lab (both the .Rmd and Knitted .html files) in the hw3 turnin folder. You should also copy the two .csv datasets we read in to the turnin folder so that the document Knits there.

Environment and Session Information

  • File creation date: 2022-03-10
  • R version 4.1.2 (2021-11-01)
  • R version (short form): 4.1.2
  • mosaic package version: 1.8.3
  • tidyverse package version: 1.3.1
  • Additional session information
## R version 4.1.2 (2021-11-01)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.4 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] Stat2Data_2.0.0   mosaic_1.8.3      ggridges_0.5.3    mosaicData_0.20.2
##  [5] ggformula_0.10.1  ggstance_0.3.5    Matrix_1.3-4      lattice_0.20-44  
##  [9] forcats_0.5.1     stringr_1.4.0     dplyr_1.0.7       purrr_0.3.4      
## [13] readr_2.0.1       tidyr_1.1.3       tibble_3.1.6      ggplot2_3.3.5    
## [17] tidyverse_1.3.1  
## 
## loaded via a namespace (and not attached):
##  [1] fs_1.5.0          lubridate_1.7.10  bit64_4.0.5       httr_1.4.2       
##  [5] tools_4.1.2       backports_1.2.1   bslib_0.3.0       utf8_1.2.2       
##  [9] R6_2.5.1          DBI_1.1.1         colorspace_2.0-2  withr_2.4.3      
## [13] tidyselect_1.1.1  gridExtra_2.3     leaflet_2.0.4.1   curl_4.3.2       
## [17] bit_4.0.4         compiler_4.1.2    cli_3.1.0         rvest_1.0.1      
## [21] xml2_1.3.2        ggdendro_0.1.22   sass_0.4.0        mosaicCore_0.9.0 
## [25] scales_1.1.1      digest_0.6.29     rmarkdown_2.10    pkgconfig_2.0.3  
## [29] htmltools_0.5.2   labelled_2.8.0    dbplyr_2.1.1      fastmap_1.1.0    
## [33] htmlwidgets_1.5.4 rlang_0.4.12      readxl_1.3.1      rstudioapi_0.13  
## [37] jquerylib_0.1.4   farver_2.1.0      generics_0.1.0    jsonlite_1.7.2   
## [41] crosstalk_1.1.1   vroom_1.5.4       magrittr_2.0.1    Rcpp_1.0.8       
## [45] munsell_0.5.0     fansi_0.5.0       lifecycle_1.0.1   stringi_1.7.4    
## [49] yaml_2.2.1        MASS_7.3-54       plyr_1.8.6        grid_4.1.2       
## [53] parallel_4.1.2    ggrepel_0.9.1     crayon_1.4.2      haven_2.4.3      
## [57] splines_4.1.2     hms_1.1.0         knitr_1.34        pillar_1.6.4     
## [61] reprex_2.0.1      glue_1.5.1        evaluate_0.14     modelr_0.1.8     
## [65] vctrs_0.3.8       tzdb_0.1.2        tweenr_1.0.2      cellranger_1.1.0 
## [69] gtable_0.3.0      polyclip_1.10-0   assertthat_0.2.1  xfun_0.25        
## [73] ggforce_0.3.3     broom_0.7.9       ellipsis_0.3.2