Since we want to be working with data, let’s load a dataset into our environment.
There are three main ways to load a dataset for use in R: from an R package, from a file on the web, or from a file on the computer.
Some R packages come with example datasets. Two such packages are Lock5Data and Stat2Data, which consist of datasets used in two statistics textbooks.
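If you are curious which datasets a package provides, one way to check is to call data() with only a package argument. The sketch below uses the datasets package that ships with R as a stand-in (an assumption made so the example runs even where Stat2Data and Lock5Data are not installed); putting "Stat2Data" in its place should work the same way once that package is installed.

```r
# List the datasets bundled with a package.
# "datasets" is a stand-in here; it is always installed with R.
available <- data(package = "datasets")

# The result contains a table with one row per dataset;
# show the name and short description of the first few.
head(available$results[, c("Item", "Title")])
```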
1. Load the Stat2Data package with the library() command. Type the name of the package inside the parentheses. Note: R is case-sensitive, so commands do not work (or worse, may run but do something different) if you use different capitalization. Pay attention to detail! Run the chunk.
Once we have loaded the package that contains the dataset we want, we then (usually) want to load the dataset. We can do this using the data() command.
For example, the "Pollster08" dataset provided by the Stat2Data package contains data about some political polls taken during the 2008 U.S. Presidential campaign.
We can load the package and then the dataset as follows:
library(Stat2Data) #loads the package (you already did this)
data(Pollster08) #loads the specific dataset
Notice that Pollster08 now appears in our Environment.
In the future, any time you want to use a dataset which is included in an R package, you will first need to load the package with library(), then the dataset with data(), as above.
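The same pattern can be tried without installing anything. The sketch below uses women, a small dataset from the datasets package that ships with R, as a stand-in for Pollster08. Because that package is attached automatically, the library() step is not needed in this one case; for Stat2Data or Lock5Data it still is.

```r
# "women" is a stand-in dataset from R's built-in datasets package,
# so there is no library() step before data() in this particular case.
data("women", package = "datasets")

# The dataset is now an object in our environment, just as Pollster08 was
"women" %in% ls()
is.data.frame(women)
```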
For packaged datasets like this, as with functions, we can get some information about the data using R’s help interface. This is better done at the console, because we don’t really want the help page to pop up when we Knit the document (I’ve set the following chunks not to run when Knitting). Type the following at the console:
help(Pollster08)
or, equivalently:
?Pollster08
This will pop up a documentation window in the Help tab in the lower right. Here you can read about the source of this data, what each case is, what each variable means, how it is measured, etc.
This documentation of a dataset is what’s called the code book that accompanies the data. When collecting your own data, you should also create an accompanying code book (which might just be a text file) to give context to anyone using your dataset. It should include a brief description of how the data was collected and what each variable (column) means.
Reading data bundled with an R package is nice for a course, but in “real life” the data we want is usually not so conveniently packaged. More often it is stored in a file somewhere, either on the web or on our computer itself.
We can load a data set from a file on the web if we know the URL.
The read_csv() command makes this simple for datasets that are represented as .csv (comma-separated values) text files (such as those exported from Excel or Google Sheets).
The command below instructs R to fetch the data from the given URL, and put it in an object in our environment called Depression, which will now appear in the Environment pane.
Make sure you click “Run Previous Chunks” before running it, because the read_csv() function is provided by the tidyverse package, which needs to have been loaded before you can use the function. This happens in the setup chunk above with the library(tidyverse) command.
Depression <- read_csv("http://colindawson.net/data/depression.csv")
Note that read_csv() produces a result, namely a data frame object, which we store in an object whose name tells us something about what the data is.
This is a bit different from the way the data() command worked for loading data from an R package. There, we didn’t have to create an object in R: one already existed, and we just had to tell R that we were going to be using it and that it should be added to our environment.
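To see why the assignment matters, here is a minimal sketch that runs without a web connection. It uses base R's read.csv() as a stand-in for read_csv() (the two are close relatives, though read_csv() returns a tibble, a variant of a data frame), and the file contents and the name ToyMood are made up for illustration.

```r
# A tiny made-up CSV, written to a temporary file so the example is self-contained
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,mood", "1,3", "2,5", "3,4"), csv_path)

# Without an assignment, the result is printed and then discarded
read.csv(csv_path)

# With an assignment, the data frame is stored under a name we choose
ToyMood <- read.csv(csv_path)
class(ToyMood)
```

After the last two lines, ToyMood is available for later commands, whereas the result of the unassigned call is gone once it has been printed.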
Before you try to read in a dataset, make sure you know whether it ‘lives’ in an R package, on the web, or in a file, and use the appropriate method to read it in.
You can also point to a file on the computer you are running RStudio on (in this case the server) by supplying the file’s location on the computer as a path, in place of the URL.
Note: RStudio as run on the server cannot read files directly from your personal machine. If you want to read in data from a file that you have created (for example, by making the data frame in Excel and exporting it to a .csv file) you will need to Upload the file first, as we have done with homework assignments and quizzes.
In your “Files” tab you should see a folder called stat113. Double click it to see its contents in the Files tab. It should have several sub-directories, one of which is called data. If you click on data you should see the file depression.csv inside.
We could point RStudio directly to the dataset in this folder with a command like:
## This is an example that works, but not what we want to do
Depression <- read_csv("~/stat113/data/depression.csv")
The ~ symbol is a shorthand that the computer will interpret as referring to your “Home” folder, and is typed literally. The forward slashes are used to indicate that the preceding thing is the name of a folder. So we can read this path as saying “go to the Home folder, then find the stat113 folder there, the data folder within that, and finally look for a file called depression.csv in that folder.”
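The pieces of such a path can be explored with a few base R helpers. This is only an illustrative sketch: the name data_path is made up, and the folders do not need to exist for these commands to run, since they only manipulate the text of the path.

```r
# Build the path from its pieces; file.path() joins them with forward slashes
data_path <- file.path("~", "stat113", "data", "depression.csv")
data_path

# Show what ~ stands for on the machine running this code
path.expand(data_path)

# Pull out just the file name at the end of the path
basename(data_path)
```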
However, if we are sharing our work with someone else, they might not have the same arrangement of files and folders on their machine that we do. For portability, if the data cannot be hosted on the web, it is usually better to include a copy of the data somewhere within our project folder, and then point to its location relative to that folder.
In an RMarkdown document, if a path doesn’t start with a / or ~, then the computer will look for files and folders within the folder where the .Rmd file is located. The simplest case is when the file you need is in that same folder. In that case you can just give the filename (in quotes).
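The following sketch imitates that situation with a throwaway folder. Everything in it is made up for illustration (the folder toy_project, the file toy.csv, and its contents), it uses base R's read.csv() rather than read_csv() so that it runs without the tidyverse, and setwd() appears only to mimic what Knitting does automatically for the folder containing the .Rmd file.

```r
# Create a throwaway "project folder" with one small made-up data file in it
project_dir <- file.path(tempdir(), "toy_project")
dir.create(project_dir, showWarnings = FALSE)
write.csv(data.frame(id = 1:3, score = c(10, 12, 9)),
          file.path(project_dir, "toy.csv"), row.names = FALSE)

# Temporarily treat that folder as the place R looks for files
old_dir <- setwd(project_dir)

# A bare file name is now looked up inside project_dir
file.exists("toy.csv")
Toy <- read.csv("toy.csv")

# Restore the previous working directory
setwd(old_dir)
```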
2. Copy the depression.csv file to the project folder where your lab3.Rmd file is located. Then try reading it in using read_csv(), using only the file name (not the directory) in quotes inside the parentheses.
Note: If you save your .Rmd file somewhere else later, you’ll also need to copy any data files there so that when the .Rmd is Knit from there it can find the data in its new context.
The Pollster08 data that we loaded earlier consists of several variables about various opinion polls taken during the 2008 U.S. Presidential election.
You should see in the Environment tab that Pollster08 consists of 102 observations (cases), each with 11 variables recorded.
We could have R print out the entire data table by simply typing the name of the data object:
Pollster08
However, printing the whole dataset is usually not that useful, especially if the data contains a lot of cases.
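A common alternative, not otherwise covered in this lab, is to print only the first few rows or a compact summary of the columns. The sketch below uses the built-in women dataset as a stand-in so that it runs without Stat2Data; the same two commands should work on Pollster08 once it is loaded.

```r
# "women" is a small stand-in dataset that ships with R
head(women, n = 5)  # print only the first five rows

str(women)          # one line per variable: its name, type, and first few values
```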
One advantage of RStudio is that it comes with a built-in data viewer.
3. Click on Pollster08 in the Environment pane. This will bring up a “spreadsheet”-style display of the data set in the Data Viewer (upper left window). You should see 11 columns, with each row representing a poll and each column representing a variable (PollTaker, PollDates, etc.). The first entry in each row is simply the row number. Use the scrollbar on the right side of the Data Viewer window to examine the complete data set.
You can close the data viewer by closing the tab it’s in.
R has stored this data in a data frame. This is the sort of table we have sketched in class: each row is a case, each column is a variable.
Most of what we will do in R will consist, in one way or another, of taking actions on data frames.
The following command performs an operation on the Pollster08 dataset.
pull(Pollster08, var = PollTaker)
If you have not seen the pull() function before, then it might not be clear what it is for or what it is doing.
In the same way that we used help(Pollster08) or ?Pollster08 to access documentation about the dataset, we can access documentation about a function like pull by typing help(pull) or ?pull at the console.
Try it:
help(pull)
?pull
What does this function appear to do?
4. Look up the nrow and ncol functions using the Help interface. Add a sentence to this document briefly describing what they do.
5. Load the HoneybeeCircuits dataset from the Lock5Data package (see “Reading in data from a package”). Use the Help interface to look up what the dataset is about and add a brief summary in the text of this document below this question. Then, write an R command that returns the number of cases in the data (Hint: Use what you found in Exercise 4).
6. Read in the women_in_labor_force.csv dataset as an object called WomenInLaborForce. This data is also in the data directory on your server account, but as before, you should copy the data file to your project directory first and then point R to it there. You should see WomenInLaborForce in your Environment.
7. Knit your .Rmd file. Save a copy of your edited version of this lab (both the .Rmd and Knitted .html files) in the hw3 turnin folder. You should also copy the two .csv datasets we read in to the turnin folder so that the document Knits there.
mosaic package version: 1.8.3
tidyverse package version: 1.3.1
## R version 4.1.2 (2021-11-01)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.4 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] Stat2Data_2.0.0 mosaic_1.8.3 ggridges_0.5.3 mosaicData_0.20.2
## [5] ggformula_0.10.1 ggstance_0.3.5 Matrix_1.3-4 lattice_0.20-44
## [9] forcats_0.5.1 stringr_1.4.0 dplyr_1.0.7 purrr_0.3.4
## [13] readr_2.0.1 tidyr_1.1.3 tibble_3.1.6 ggplot2_3.3.5
## [17] tidyverse_1.3.1
##
## loaded via a namespace (and not attached):
## [1] fs_1.5.0 lubridate_1.7.10 bit64_4.0.5 httr_1.4.2
## [5] tools_4.1.2 backports_1.2.1 bslib_0.3.0 utf8_1.2.2
## [9] R6_2.5.1 DBI_1.1.1 colorspace_2.0-2 withr_2.4.3
## [13] tidyselect_1.1.1 gridExtra_2.3 leaflet_2.0.4.1 curl_4.3.2
## [17] bit_4.0.4 compiler_4.1.2 cli_3.1.0 rvest_1.0.1
## [21] xml2_1.3.2 ggdendro_0.1.22 sass_0.4.0 mosaicCore_0.9.0
## [25] scales_1.1.1 digest_0.6.29 rmarkdown_2.10 pkgconfig_2.0.3
## [29] htmltools_0.5.2 labelled_2.8.0 dbplyr_2.1.1 fastmap_1.1.0
## [33] htmlwidgets_1.5.4 rlang_0.4.12 readxl_1.3.1 rstudioapi_0.13
## [37] jquerylib_0.1.4 farver_2.1.0 generics_0.1.0 jsonlite_1.7.2
## [41] crosstalk_1.1.1 vroom_1.5.4 magrittr_2.0.1 Rcpp_1.0.8
## [45] munsell_0.5.0 fansi_0.5.0 lifecycle_1.0.1 stringi_1.7.4
## [49] yaml_2.2.1 MASS_7.3-54 plyr_1.8.6 grid_4.1.2
## [53] parallel_4.1.2 ggrepel_0.9.1 crayon_1.4.2 haven_2.4.3
## [57] splines_4.1.2 hms_1.1.0 knitr_1.34 pillar_1.6.4
## [61] reprex_2.0.1 glue_1.5.1 evaluate_0.14 modelr_0.1.8
## [65] vctrs_0.3.8 tzdb_0.1.2 tweenr_1.0.2 cellranger_1.1.0
## [69] gtable_0.3.0 polyclip_1.10-0 assertthat_0.2.1 xfun_0.25
## [73] ggforce_0.3.3 broom_0.7.9 ellipsis_0.3.2