Since we want to be working with data, let’s load a dataset into our environment.
There are three main ways to load a dataset for use in R: from an R package, from a file on the web, or from a file on your own computer.
Some R packages come with example datasets. One such package, Lock5Data, was developed as a companion to our textbook for this course. I will also use examples from the Stat2Data package.
Load the Lock5Data package using the library() command. Run the line. Note: R is case-sensitive, so commands do not work (or worse, may run but do something different) if you use different capitalization. Pay attention to detail!
Once we have loaded the package that contains the dataset we want, we then (usually) have to load the dataset. We can do this using the data() command.
For example, the "Pollster08" dataset provided by the Stat2Data package contains data about some political polls taken during the 2008 U.S. Presidential campaign.
We can load the package and then the dataset as follows:
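A minimal sketch of those two steps, assuming the Stat2Data package is already installed on your system (the final dim() call is an extra check, not a required step):

```r
# Step 1: load the package that ships the dataset
library(Stat2Data)

# Step 2: add the Pollster08 data frame to the environment
data("Pollster08")

# Optional check: number of rows (cases) and columns (variables)
dim(Pollster08)
```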
Notice that Pollster08 now appears in our Environment.
In the future, any time you want to use a dataset from an R package, you will first need to load the package, then the dataset, as above.
For packaged datasets like this, as well as for R functions, we can get some information about the data using R’s help interface. At the console, type ?Pollster08 (a question mark followed by the name of the dataset).
This will pop up a documentation window in the Help tab in the lower right. Here you can read about the source of this data, what each case is, what each variable means, how it is measured, etc.
This documentation of a dataset is what’s called the code book that accompanies the data. When collecting your own data, you should also create an accompanying code book (which might just be a text file) to give context to anyone using your dataset.
Reading data bundled with an R package is nice for a course, but in “real life” the data we want is usually not so conveniently packaged. More often it is stored in a file somewhere, either on the web or on our own computer.
We can load a data set from a file on the web if we know the URL.
The read.file() command (from the mosaic package) makes this simple for well-formatted data files.
A command of the form Depression <- read.file("URL"), with the file’s web address in place of URL, instructs R to fetch the data from the given URL and put it in an object in our environment called Depression, which will then appear in the Environment pane.
Note that read.file() produces a result, namely a data frame object, which we store in a variable whose name tells us something about what the data frame contains.
This is a bit different from the way the data() command worked for loading data from an R package. There, we didn’t have to create a variable in R: one already existed, and we just had to tell R that we were going to be using it and that it should be added to our environment.
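To make the contrast concrete, here is a small sketch. It makes two assumptions so that it runs without extra packages or a network connection: base R’s read.csv() stands in for read.file(), and the file being read is a throwaway CSV created on the spot. The file contents and the names toy_path and Toy are made up for illustration.

```r
# Create a small CSV file in a temporary location (hypothetical example data)
toy_path <- tempfile(fileext = ".csv")
writeLines(c("Subject,Score", "A,10", "B,12", "C,9"), toy_path)

# A reading function returns a data frame; it is kept only if we assign it a name
Toy <- read.csv(toy_path)

# data() works differently: it takes the name of a dataset that already exists
# in a package and adds it to the environment, with no assignment needed
data("mtcars")   # mtcars ships with base R's datasets package

# Both objects are now available
exists("Toy")
exists("mtcars")
```

Calling read.csv(toy_path) without the assignment would simply print the data and then discard it, which is why the reminder to store the result in a named variable matters.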
You can also point to a file on your computer by supplying the file’s location on the computer as a path, in place of the URL.
In your “Files” tab you should see a folder called stat209. Double click it to see its contents in the Files tab. It should have several subdirectories, one of which is called data. If you double click on data you should see the file depression.csv inside.
If instead of a URL you put the path to this file, which should be "~/stat209/data/depression.csv", inside the parentheses in read.file(), it would read in this file. The ~ symbol represents your “Home” folder.
A location that starts with a forward slash / or a tilde ~ is an absolute path. That is, it doesn’t matter what directory (folder) you’re currently working in; R will look in that specific location.
Sometimes it is convenient to specify a file using a relative path instead, particularly when you want your code to run on another computer, where your files will not be in the exact same location they’re in on the computer you were working on.
With a relative path, we tell R to look in a location which is relative to the location you’re currently working in (which might be the location where your script is).
For example, if I am currently working in my home directory (abbreviated in paths as ~), I could replace the absolute path above with "stat209/data/depression.csv". This tells R to look for a folder called stat209 inside the current working directory, a folder called data inside that one, and a file called depression.csv inside that.
If my file were directly inside my home folder, I could just type "depression.csv" inside read.file().
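The difference between the two kinds of path can be inspected without reading any file. The sketch below uses only base R functions (path.expand(), file.path(), getwd()); the variable names are made up for illustration, and the file does not need to exist for the code to run.

```r
# The two ways of writing the location discussed above
abs_path <- "~/stat209/data/depression.csv"   # absolute: starts with ~
rel_path <- "stat209/data/depression.csv"     # relative: no leading / or ~

# path.expand() replaces ~ with the full location of the home folder
expanded <- path.expand(abs_path)
expanded

# A relative path is interpreted starting from the working directory,
# so R effectively looks here:
resolved <- file.path(getwd(), rel_path)
resolved
```

The two results name the same file only when the working directory happens to be the home folder.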
If you have been following along, reading the file with one of these relative paths will probably give you an error.
That’s because R is not looking inside your home directory, but inside your current working directory, which should be where your project is located.
R interprets relative paths relative to a starting point called your working directory. It is not, unfortunately, necessarily the same as the directory you are currently viewing in the Files tab.
You can see what directory is currently set as your working directory (relative to which all relative paths are interpreted) by typing getwd() in the console.
You can change your working directory (for example, to ~/stat209) by typing setwd("~/stat209") (with the quotes).
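Here is a short sketch of how getwd() and setwd() interact with relative paths. To be runnable on any machine it switches to R’s temporary folder (tempdir()) rather than ~/stat209, and the file name example.txt is made up for illustration.

```r
# Remember where we started so we can return later
old_wd <- getwd()

# Switch to a directory that is guaranteed to exist
setwd(tempdir())
getwd()

# A relative path now refers to a file inside the new working directory
writeLines("hello", "example.txt")
found <- file.exists(file.path(tempdir(), "example.txt"))
found

# Tidy up and go back to the original working directory
file.remove("example.txt")
setwd(old_wd)
```

Saving the old working directory before changing it is a small habit that makes it easy to undo the change.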
Change your working directory to stat209, and then write a command using a relative path to read in the depression.csv dataset using the read.file() command. Don’t forget to store the resulting data frame in a named variable.
The Pollster08 data that we read in earlier consists of several variables about various opinion polls taken during the 2008 U.S. Presidential election.
You should see in the Environment tab that Pollster08 consists of 102 observations (cases), each with 11 variables recorded.
We could have R print out the entire data table by simply typing the name of the data object, Pollster08, at the console.
However, printing the whole dataset is not that useful, especially if the data contains a lot of cases.
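A lighter-weight alternative is to print only part of the data. The sketch below uses head() for this. As an assumption to keep it runnable without any add-on package, it uses the mtcars data frame that ships with base R as a stand-in for Pollster08; str() is an additional base R function included for illustration.

```r
# mtcars is a small data frame that ships with base R (stand-in for Pollster08)
data("mtcars")

# Print only the first six rows
head(mtcars)

# Print only the first three rows
head(mtcars, n = 3)

# Compact overview: one line per variable, with its type and first few values
str(mtcars)
```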
One advantage of RStudio is that it comes with a built-in data viewer.
Click on Pollster08 in the Environment pane. This will bring up a “spreadsheet”-style display of the data set in the Data Viewer (upper left window). What you should see are 11 columns, each row representing a poll, and each column representing a variable (PollTaker, PollDates, etc.). The first entry in each row is simply the row number. Use the scrollbar on the right side of the console window to examine the complete data set.
You can close the data viewer by clicking on the “X” in the upper lefthand corner.
R has stored this data in a data frame. This is the sort of table we have sketched in class: each row is a case, each column is a variable.
Most of what we will do in R will consist, in one way or another, of taking actions on data frames.
The same help interface works for functions just as for data frames. We can get more information about the pull() or head() functions by typing a question mark followed by the function name, for example ?head, at the console. We could also have typed help(head) to achieve the same result.
Look up the nrow and ncol functions using the Help interface. Add a sentence to this document briefly describing what they do.
Load the HoneybeeCircuits dataset from the Lock5Data package (see “Reading in data from a package”). Use the Help interface to look up what the dataset is about and add a brief summary in the text of this document below this question. Then, write an R command that returns the number of cases in the data (hint: use what you found in Exercise 9).
Make a code chunk below and add code to read in the women_in_labor_force.csv dataset from the data directory on your server account using the read.file() command with an absolute path, storing the result as WomenInLabor. Make a second chunk to read the same dataset with a relative path from your project directory, storing the result as WomenInLabor2. What are the cases and variables in this dataset?
Knit your modified .Rmd file. Save a copy of your edited version of this lab (both the .Rmd and Knitted .html files) in the hw2 turnin folder.
mosaic package version: 1.5.0
tidyverse package version: 1.3.1
## R version 3.6.0 (2019-04-26)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.5 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] Stat2Data_2.0.0 mosaic_1.5.0 Matrix_1.2-17
## [4] mosaicData_0.17.0 ggformula_0.9.1 ggstance_0.3.1
## [7] lattice_0.20-38 forcats_0.5.1 stringr_1.4.0
## [10] dplyr_1.0.5 purrr_0.3.4 readr_1.4.0
## [13] tidyr_1.1.3 tibble_3.1.1 ggplot2_3.3.3
## [16] tidyverse_1.3.1
##
## loaded via a namespace (and not attached):
## [1] ggdendro_0.1-20 httr_1.4.2 jsonlite_1.7.2 splines_3.6.0
## [5] modelr_0.1.8 shiny_1.3.2 assertthat_0.2.1 cellranger_1.1.0
## [9] yaml_2.2.0 ggrepel_0.8.1 pillar_1.6.0 backports_1.1.4
## [13] glue_1.4.2 digest_0.6.21 promises_1.0.1 rvest_1.0.0
## [17] colorspace_1.4-1 htmltools_0.3.6 httpuv_1.5.1 pkgconfig_2.0.3
## [21] broom_0.7.6 haven_2.4.1 xtable_1.8-4 scales_1.0.0
## [25] later_0.8.0 generics_0.0.2 ellipsis_0.3.0 withr_2.4.2
## [29] lazyeval_0.2.2 cli_2.5.0 magrittr_2.0.1 crayon_1.4.1
## [33] readxl_1.3.1 mime_0.7 evaluate_0.14 fs_1.3.1
## [37] fansi_0.4.0 MASS_7.3-51.4 xml2_1.3.2 tools_3.6.0
## [41] hms_1.0.0 lifecycle_1.0.0 munsell_0.5.0 reprex_2.0.0
## [45] compiler_3.6.0 rlang_0.4.11 grid_3.6.0 rstudioapi_0.13
## [49] htmlwidgets_1.3 crosstalk_1.0.0 mosaicCore_0.6.0 rmarkdown_2.5
## [53] gtable_0.3.0 DBI_1.0.0 R6_2.4.0 gridExtra_2.3
## [57] lubridate_1.7.10 knitr_1.25 utf8_1.1.4 stringi_1.4.3
## [61] Rcpp_1.0.2 vctrs_0.3.8 leaflet_2.0.2 dbplyr_2.1.1
## [65] tidyselect_1.1.1 xfun_0.19