Since we want to be working with data, let’s load a dataset into our environment.
There are three main ways to load a dataset for use in R: from an R package, from a file on the web, and from a file on your own computer.
Some R packages come with example datasets. One such package, Lock5Data, was developed as a companion to our textbook for this course. I will also use examples from the Stat2Data package.
Write a line of code that loads the Lock5Data package using the library() command, then run the line. Note: R is case-sensitive, so commands will not work (or worse, may run but do something different) if you use different capitalization. Pay attention to detail!
Once we have loaded the package that contains the dataset we want, we then (usually) have to load the dataset itself. We can do this using the data() command.
For example, the Pollster08 dataset provided by the Stat2Data package contains data about some political polls taken during the 2008 U.S. Presidential campaign.
We can load the package and then the dataset as follows:
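A minimal sketch of those two steps is given here. The requireNamespace() guard is an addition for illustration only: it makes the chunk do nothing, rather than stop with an error, on a machine where Stat2Data is not installed. On the course server the two inner lines should be all you need.

```r
# Sketch: attach the package, then load one dataset from it.
# The requireNamespace() check is an assumption added for safety;
# it skips both steps if Stat2Data is not installed.
if (requireNamespace("Stat2Data", quietly = TRUE)) {
  library(Stat2Data)   # make the package's contents available
  data("Pollster08")   # add the dataset to our environment
}
```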
Pollster08 now appears in our Environment.
In the future, any time you want to use a dataset from an R package, you will first need to load the package, then the dataset, as above.
For packaged datasets like this, as well as for R functions, we can get some information about the data using R’s help interface. At the console, type ?Pollster08 (a question mark followed by the name of the dataset).
This will pop up a documentation window in the Help tab in the lower right. Here you can read about the source of this data, what each case is, what each variable means, how it is measured, etc.
This documentation of a dataset is what’s called the code book that accompanies the data. When collecting your own data, you should also create an accompanying code book (which might just be a text file) to give context to anyone using your dataset.
Reading data bundled with an R package is nice for a course, but in “real life” the data we want is usually not so conveniently packaged. More often it is stored in a file somewhere, either on the web or on our own computer.
We can load a dataset from a file on the web if we know the URL. The read.file() command (from the mosaic package) makes this simple for well-formatted data files.
A command of the form Depression <- read.file(URL), with the file’s URL in quotes inside the parentheses, instructs R to fetch the data from that URL and put it in an object in our environment called Depression, which will now appear in the Environment pane.
Note that read.file() produces a result, namely a data frame object, which we store in a variable. We give that variable a name that tells us something about what the data frame contains.
This is a bit different from the way the data() command worked for loading data from an R package. There, we didn’t have to create a variable in R because one already existed; we just had to tell R that we were going to be using it, and that it should be added to our environment.
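The "read, then store under a name" pattern can be sketched without the course URL, which is not reproduced here. The example below writes a tiny made-up CSV to a temporary file and reads it back with base R's read.csv(), used only as a stand-in; read.file() likewise takes the file's location (a path or URL) as its first argument. The names toy_file and ToyScores and the file contents are hypothetical.

```r
# Toy stand-in for a real data file: a tiny CSV in a temporary location.
# The contents are made up purely for this illustration.
toy_file <- tempfile(fileext = ".csv")
writeLines(c("id,score", "1,10", "2,14", "3,9"), toy_file)

# Reading produces a data frame, which we store under a name we choose.
# Without the assignment, the result would be printed and then lost.
ToyScores <- read.csv(toy_file)

# Confirm what kind of object we got back.
class(ToyScores)
```

After this runs, ToyScores shows up in the Environment pane, just as Depression would.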
You can also point to a file on your computer by supplying the file’s location on the computer as a path, in place of the URL.
In your “Files” tab you should see a folder called stat209. Double click it to see its contents in the Files tab. It should have several subdirectories, one of which is called data. If you double click on data you should see the file depression.csv.
If, instead of a URL, you put the path to this file, which should be "~/stat209/data/depression.csv", inside the parentheses in read.file(), R would read in this file. The ~ symbol represents your “Home” folder.
A location that starts with a forward slash (/) or a tilde (~) is an absolute path. That is, it doesn’t matter what directory (folder) you’re currently working in; R will look in that specific location.
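To see what the tilde stands for on a particular machine, one option is base R's path.expand(), sketched below. The variable names home and abs_path are chosen only for this illustration, and whether the file actually exists depends on your account.

```r
# path.expand() replaces a leading ~ with the full location of the Home folder.
home <- path.expand("~")
abs_path <- path.expand("~/stat209/data/depression.csv")

home       # the folder that ~ stands for on this machine
abs_path   # the same file location, written out in full

# file.exists() reports whether anything is actually stored at that location.
file.exists(abs_path)
```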
Sometimes it is convenient to specify a file using a relative path instead, particularly when you want your code to run on another computer, where your files will not be in exactly the same location as on the computer you were working on.
With a relative path, we tell R to look in a location that is relative to the location you’re currently working in (which might be the location where your script is).
For example, if I am currently working in my home directory (abbreviated in paths as ~), I could replace the absolute path above with "stat209/data/depression.csv". This tells R to look for a folder called stat209 inside the current working directory, a folder called data inside that one, and a file called depression.csv inside that.
If my file were directly inside my home folder, I could just type "depression.csv".
If you have been following along, trying to read the file using the relative path "stat209/data/depression.csv" will probably give you an error.
That’s because R is not looking inside your home directory, but inside your current working directory, which should be where your project is located.
R interprets relative paths relative to a starting point called your working directory. It is not, unfortunately, necessarily the same as the directory you are currently viewing in the Files tab.
You can see which directory is currently set as your working directory (relative to which all relative paths are interpreted) by typing getwd() in the console.
You can change your working directory (for example, to ~/stat209) by typing setwd("~/stat209") (with the quotes).
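The interaction between the working directory and a relative path can be sketched with a throwaway folder. Everything named here (demo_project, toy.csv, Toy) is hypothetical and created in a temporary location, and base R's read.csv() stands in for read.file().

```r
# Build a throwaway folder layout: <temp>/demo_project/data/toy.csv
# (all of these names are hypothetical, chosen only for this sketch).
root <- file.path(tempdir(), "demo_project")
dir.create(file.path(root, "data"), recursive = TRUE, showWarnings = FALSE)
writeLines(c("x,y", "1,2", "3,4"), file.path(root, "data", "toy.csv"))

old_wd <- getwd()   # remember where we started
setwd(root)         # make demo_project the working directory

# This relative path is now interpreted starting from demo_project.
Toy <- read.csv("data/toy.csv")

setwd(old_wd)       # restore the original working directory
```

Running the same read.csv("data/toy.csv") line before the setwd(root) call would most likely fail, because R would look for a data folder inside whatever the working directory was at the time.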
Set your working directory to stat209, and then write a command using a relative path to read in the depression.csv dataset using the read.file() command. Don’t forget to store the resulting data frame in a named variable.
The Pollster08 data that we read in earlier consists of several variables about various opinion polls taken during the 2008 U.S. Presidential election.
You should see in the Environment tab that Pollster08 consists of 102 observations (cases), each with 11 variables recorded.
We could have R print out the entire data table by simply typing the name of the data object, Pollster08, at the console.
However, printing the whole dataset is not that useful, especially if the data contains a lot of cases.
One advantage of RStudio is that it comes with a built-in data viewer.
Click on the name Pollster08 in the Environment pane. This will bring up a “spreadsheet”-style display of the dataset in the Data Viewer (upper left window). You should see 11 columns, with each row representing a poll and each column representing a variable (PollDates, etc.). The first entry in each row is simply the row number. Use the scrollbar on the right side of the window to examine the complete dataset.
You can close the data viewer by clicking on the “X” in the upper left-hand corner.
R has stored this data in a data frame. This is the sort of table we have sketched in class: each row is a case, each column is a variable.
Most of what we will do in R will consist, in one way or another, of taking actions on data frames.
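To make the row-and-column structure concrete without printing a whole table, here is a small sketch using mtcars, a data frame that ships with base R. It is chosen only because it is always available; it is not one of the course datasets, and the name first_rows is made up for this example.

```r
# mtcars is a built-in data frame: each row is a car (a case),
# and each column is a variable recorded for that car.
is.data.frame(mtcars)

# head() returns just the first few rows instead of the whole table.
first_rows <- head(mtcars, 3)
first_rows

# names() lists the variables (columns).
names(mtcars)
```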
The same help interface works for functions just as for data frames. We can get more information about the head() function by typing ?head at the console. We could also have typed help(head) to achieve the same result.
Look up the nrow and ncol functions using the Help interface. Add a sentence to this document briefly describing what they do.
Load the HoneybeeCircuits dataset from the Lock5Data package (see “Reading in data from a package”). Use the Help interface to look up what the dataset is about and add a brief summary in the text of this document below this question. Then, write an R command that returns the number of cases in the data (hint: use what you found in Exercise 9).
Make a code chunk below and add code to read in the women_in_labor_force.csv dataset from the data directory on your server account using read.file() with an absolute path, storing the result as WomenInLabor. Make a second chunk to read the same dataset with a relative path from your project directory, storing the result as WomenInLabor2. What are the cases and variables in this dataset?
Knit your modified .Rmd file. Save a copy of your edited version of this lab (both the .Rmd and knitted .html files) in the hw2 turnin folder.
mosaic package version: 1.5.0
tidyverse package version: 1.3.1
## R version 3.6.0 (2019-04-26)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.5 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
##
## locale:
##  LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
##  LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
##  LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
##  LC_PAPER=en_US.UTF-8 LC_NAME=C
##  LC_ADDRESS=C LC_TELEPHONE=C
##  LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
##  stats graphics grDevices utils datasets methods base
##
## other attached packages:
##  Stat2Data_2.0.0 mosaic_1.5.0 Matrix_1.2-17
##  mosaicData_0.17.0 ggformula_0.9.1 ggstance_0.3.1
##  lattice_0.20-38 forcats_0.5.1 stringr_1.4.0
##  dplyr_1.0.5 purrr_0.3.4 readr_1.4.0
##  tidyr_1.1.3 tibble_3.1.1 ggplot2_3.3.3
##  tidyverse_1.3.1
##
## loaded via a namespace (and not attached):
##  ggdendro_0.1-20 httr_1.4.2 jsonlite_1.7.2 splines_3.6.0
##  modelr_0.1.8 shiny_1.3.2 assertthat_0.2.1 cellranger_1.1.0
##  yaml_2.2.0 ggrepel_0.8.1 pillar_1.6.0 backports_1.1.4
##  glue_1.4.2 digest_0.6.21 promises_1.0.1 rvest_1.0.0
##  colorspace_1.4-1 htmltools_0.3.6 httpuv_1.5.1 pkgconfig_2.0.3
##  broom_0.7.6 haven_2.4.1 xtable_1.8-4 scales_1.0.0
##  later_0.8.0 generics_0.0.2 ellipsis_0.3.0 withr_2.4.2
##  lazyeval_0.2.2 cli_2.5.0 magrittr_2.0.1 crayon_1.4.1
##  readxl_1.3.1 mime_0.7 evaluate_0.14 fs_1.3.1
##  fansi_0.4.0 MASS_7.3-51.4 xml2_1.3.2 tools_3.6.0
##  hms_1.0.0 lifecycle_1.0.0 munsell_0.5.0 reprex_2.0.0
##  compiler_3.6.0 rlang_0.4.11 grid_3.6.0 rstudioapi_0.13
##  htmlwidgets_1.3 crosstalk_1.0.0 mosaicCore_0.6.0 rmarkdown_2.5
##  gtable_0.3.0 DBI_1.0.0 R6_2.4.0 gridExtra_2.3
##  lubridate_1.7.10 knitr_1.25 utf8_1.1.4 stringi_1.4.3
##  Rcpp_1.0.2 vctrs_0.3.8 leaflet_2.0.2 dbplyr_2.1.1
##  tidyselect_1.1.1 xfun_0.19