STAT 113: Lab 2

Objects and Variables (copied over and slightly modified from Lab 1)

R, like other programming languages, stores information (such as data) in objects, which are given labels so that we can refer to them as we are working.

Some kinds of objects are

text strings, which are used for labels, or the values of categorical variables (like “blue”), among other things.
numbers, on which arithmetic can be performed
logical values, TRUE and FALSE (note the all caps).

There are also objects called vectors, which are like lists whose entries can be text strings, numbers, or logical values.
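
As a minimal sketch of these kinds of objects, the lines below ask R what type each value is. The base R functions class(), c() and length() are not introduced elsewhere in this lab; they are used here only for illustration.

```r
# class() reports what kind of object a value is
class("blue")   # "character" (a text string)
class(38)       # "numeric"
class(TRUE)     # "logical"

# c() combines several values into a vector
c("red", "green", "blue")
class(c(TRUE, FALSE, TRUE))   # "logical": a vector of logical values
length(c(10, 20, 30))         # 3 entries
```

Note that all entries of a vector built this way share one type.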

When we want to use an object a lot (such as a numeric value, like a mean, from some statistical computation), it is helpful to give it a name so we can refer to it by what it represents, instead of by its value.

Assignment

We can give a name to an object using an expression of the form

name <- value

This process is called assignment, because we are “assigning” the value to a container with the name we’ve chosen. The named thing is called a variable (which means something a bit different than a variable in the statistical sense, although a variable in code can refer to a statistical variable).

For example:

myName <- "Colin Dawson"
myAge <- 38

You can read the <- symbol as “gets”, as in “(the name) myName gets (the value) "Colin Dawson"”. Notice that there is just one hyphen in the arrow. A common error is to add an extra hyphen to the arrow, which R will misinterpret as a minus sign.

Variable names may also contain underscores and digits, but a name cannot begin with either of these.
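
The naming rules and the extra-hyphen pitfall can be sketched as follows. The variable names here (poll_2008, myAge2, x, y) are made up for illustration.

```r
# Valid names: letters, digits and underscores, starting with a letter
poll_2008 <- 102
myAge2 <- 38

# Invalid names (uncommenting either line produces a syntax error):
# 2008_poll <- 102
# _myAge <- 38

# One hyphen: x gets the value 5
x <- 5

# Two hyphens: R reads this as y <- (-5), so y gets negative 5
y <-- 5
y
```

With a number on the right-hand side, the extra hyphen silently changes the sign rather than producing an error, which is what makes this mistake easy to miss.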

Assigning the Result of a Command

We can also store the result of a command in a named variable. A simple example is the following:

myResult <- sqrt(25)

Now if I type the name of the new variable at the console, or refer to it by itself in a chunk, R will print out its contents:

myResult

## [1] 5

The 1 in brackets indicates that the next value shown is the first entry in the variable called myResult. (Note that if you try to access the variable MyResult, you will get an error, because you defined it with a lower case “m”.) In this case the variable has only one entry, but sometimes we will hold lists of data or other values in a variable.
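
As a small illustration, the sketch below stores several results in one variable and checks which names exist. The name manyResults and the functions c(), length() and exists() are additions for this example and are not used elsewhere in the lab.

```r
myResult <- sqrt(25)

# A variable holding five entries instead of one
manyResults <- sqrt(c(1, 4, 9, 16, 25))
manyResults           # the [1] marks the first of the five entries

length(myResult)      # 1
length(manyResults)   # 5

# Names are case-sensitive: only the lower case "m" version was defined
exists("myResult", inherits = FALSE)   # TRUE
exists("MyResult", inherits = FALSE)   # FALSE, unless you defined MyResult yourself
```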

We can also use variables in place of values in later expressions, such as in:

a_squared <- 3^2
b_squared <- 4^2
a_squared_plus_b_squared <- a_squared + b_squared

Notice that if we run the chunk that defines these variables, we will see them appear in the Environment tab in the upper right pane. This shows us everything we’ve defined.
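
A short sketch of how these variables can feed into further computation is shown below. The name c_side is made up for this example, and ls() is a base R function (not covered elsewhere in this lab) that lists the names currently defined, much like the Environment tab.

```r
a_squared <- 3^2
b_squared <- 4^2
a_squared_plus_b_squared <- a_squared + b_squared

# The value stored in one variable can be used to define another
c_side <- sqrt(a_squared_plus_b_squared)
c_side   # 5

# List the names of the variables defined so far
ls()
```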

Make a new code chunk. In it, define variables with your name, your birth year, and the current year. Run the chunk and verify that the variables you created appear in the Environment tab.
Now, in a new chunk, create a variable that calculates your age by doing arithmetic with the variable for your birth year and the one for the current year.

The Global Environment vs the Knitting Environment

The Environment tab will also contain any variables that we defined in other documents, at the console, or in chunks that we’ve since deleted. This can cause problems, because variables can wind up referring to things that don’t exist in the current document, or to things that should have a different value in the current document.

Fortunately, when we Knit our document, the rendering program ignores the interactive environment and creates its own encapsulated environment that only contains variables we’ve defined in the current document (and similarly, only allows us to use datasets and functionality from packages that have been loaded in our document).

This means that we can only use variables in our document that have been defined prior to the point when we refer to them. If we try to use a variable above the chunk where it’s defined, it may work when we’re running chunks interactively (provided we’ve previously run the chunk where it’s defined), but it won’t work when we try to Knit. This is another reason why Knitting every so often is a good idea, since it helps us catch errors in our document that we might otherwise miss.
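
The order-of-definition rule can be sketched in a single chunk. This assumes that classSize (a hypothetical name) has not been defined anywhere yet; tryCatch() and conditionMessage() are base R functions used here only so that the error is captured as text instead of stopping the code.

```r
# Using a variable before it is defined produces an error.
# tryCatch() captures the error message so the rest of the code keeps running.
too_early <- tryCatch(
  classSize * 2,
  error = function(e) conditionMessage(e)
)
too_early   # the text of R's error message

# Once the variable is defined, the same expression works
classSize <- 30
classSize * 2   # 60
```

When Knitting, chunks run from top to bottom in a fresh environment, so the first expression above corresponds to what happens when a chunk refers to a variable that is only defined further down.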

Try defining a variable called theAnswer at the console (rather than in a code chunk), and assign it the value 42. Then, create a code chunk that refers to theAnswer in an expression that computes twice the answer. What should happen is that the chunk will run fine when you just try to run it by itself, but if you try to Knit you’ll get an error.
Fix the error by adding a code chunk in an appropriate spot that defines theAnswer within the document.

Functions, arguments and commands

Most of what we do in R consists of applying functions to data objects, specifying some options for the function, which are called arguments. The application of a function, together with its arguments, is called a command.

Many of the commands in R look a lot like functions from math class; that is, invoking R commands means supplying a function with some number of “inputs” (the arguments), which yields some kind of output, much as the sin() function in math takes a number as input and returns another number corresponding to its trigonometric sine.

A useful analogy is that commands are like sentences, where the function is the verb, and the arguments (one of which usually specifies the data object) are the nouns.

There is often a “main” argument that comes first. This is like the direct object of the command.

For example, in the English command, “Draw a picture for me with some paint”, the verb “draw” acts like the function (what is the listener supposed to do?); the noun “picture” is the direct object (draw what?), and “me” and “paint” are extra (in this case, optional) details, that we might call the “recipient” and the “material”.

In the grammar of R, I could write this sentence like:

## Note: this is not real R code
draw("picture", recipient = "me", material = "paint")

We are applying the function draw() to the object "picture", and adding some additional detail about the recipient and material. Here the function is called draw, and we have a main argument with the value "picture", and additional arguments recipient and material with the values "me", and "paint", respectively.

Technically speaking, "picture" is the value of an argument too; we might have written

## Note: this is not real R code
draw(object = "picture", recipient = "me", material = "paint")

However, in practice, there is often a required first “main” argument whose name is left out of the command.
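
To make the analogy runnable, here is a sketch that defines a hypothetical draw() function. R has no such built-in function; the definition, its argument names, and its default values are invented for this example, and writing your own functions is beyond the scope of this lab.

```r
# A hypothetical draw() function, defined only to make the analogy runnable
draw <- function(object, recipient = "someone", material = "pencil") {
  paste("Drawing a", object, "for", recipient, "with", material)
}

# Main argument given by position, the others by name
draw("picture", recipient = "me", material = "paint")

# The same command with every argument named
draw(object = "picture", recipient = "me", material = "paint")

# Optional arguments can be left out, in which case their defaults are used
draw("picture")
```

The first two calls return the same text, which shows that leaving out the name of the main argument does not change the meaning of the command.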

In R, arguments always go inside parentheses, and are separated by commas when there is more than one. For arguments whose names are explicitly given, the name goes to the left of the =, and the value goes to the right.

The command

log(100, base = 10)

finds the logarithm of the number 100, using base 10 log.

We are applying the log() function to the value 100 and modifying the behavior of log() through the optional argument base, which in this case specifies what kind of logarithm we want.
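
The following sketch shows a few equivalent ways of writing this command, along with what happens when the optional argument is left out. The argument name x for the main argument comes from R's documentation for log().

```r
log(100, base = 10)       # 2
log(100, 10)              # 2: the second unnamed value is matched to base
log(x = 100, base = 10)   # 2: every argument named
log(base = 10, x = 100)   # 2: named arguments can appear in any order

# Without base, log() uses the natural logarithm (base e)
log(100)                  # about 4.605
```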

Assigning the results of a function

As we have seen, when we apply a function to some arguments, it produces a result. If we simply call the function, most of the time the result is just printed out. But often we want to refer to or use that result later. In this case we can assign the result of the function call to a “container”; that is, to a named variable.

For example, if I have a variable called Income, I might want to compute and store the log of that income variable:

Income <- 42000
logIncome <- log(Income, base = 10)

Now logIncome is a variable whose value is the log (base 10) of the Income value.
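
As a quick check of what was stored, the sketch below prints the new variable and then undoes the logarithm. The only assumption is the arithmetic fact that raising 10 to a base 10 log recovers the original number.

```r
Income <- 42000
logIncome <- log(Income, base = 10)

logIncome       # about 4.62

# Raising 10 to the stored power recovers the original value (up to rounding)
10^logIncome

# The original variable is unchanged by the computation
Income
```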

Loading Data

Since we want to be working with data, let’s load a dataset into our environment.

There are three main ways to load a dataset for use in R:

Accessing data that is bundled with an R package
Reading in data that is posted on the web
Reading in data from a file on your computer/server account

Accessing data from a package

Some R packages come with example datasets. One such package, Lock5Data, was developed as a companion to our textbook for this course. I will also use examples from the Stat2Data package.

Create a new code chunk, in which you load the Lock5Data package using the library() command. Run the line. Note: R is case-sensitive, so commands do not work (or worse, may run but do something different) if you use different capitalization. Pay attention to detail!

Once we have loaded the package that contains the dataset we want, we then (usually) have to load the dataset. We can do this using the data() command.

For example, the "Pollster08" dataset provided by the Stat2Data package contains data about some political polls taken during the 2008 U.S. Presidential campaign.

We can load the package and then the dataset as follows:

library(Stat2Data) #loads the package

data(Pollster08)   #loads the dataset

Notice that Pollster08 now appears in our Environment.

In the future, any time you want to use a dataset from an R package, you will first need to load the package, then the dataset, as above.
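
If you are working somewhere other than the course server, the package may not be installed. The sketch below guards against that case; requireNamespace() and install.packages() are standard R functions, but whether you are allowed to install packages depends on your setup, so treat this as one possible approach.

```r
# requireNamespace() checks whether a package is installed without loading it
if (requireNamespace("Stat2Data", quietly = TRUE)) {
  library(Stat2Data)          # loads the package
  data(Pollster08)            # loads the dataset
  is.data.frame(Pollster08)   # confirms the dataset arrived as a data frame
} else {
  message("Stat2Data is not installed; install.packages(\"Stat2Data\") would install it")
}
```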

Accessing Documentation

For packaged datasets like this, as well as for R functions, we can get some information about the data using R’s help interface. At the console, type

help(Pollster08)

This will pop up a documentation window in the Help tab in the lower right. Here you can read about the source of this data, what each case is, what each variable means, how it is measured, etc.

This documentation of a dataset is what’s called the code book that accompanies the data. When collecting your own data, you should also create an accompanying code book (which might just be a text file) to give context to anyone using your dataset.

Reading data directly from the web

Reading data bundled with an R package is nice for a course, but in “real life” the data we want is usually not so conveniently packaged. More often it is stored in a file somewhere, either on the web or on our own computer.

We can load a data set from a file on the web if we know the URL.

The read.file() command (provided by the mosaic package) makes this simple for well formatted data files.

The command below instructs R to fetch the data from the given URL, and put it in an object in our environment called Depression, which will now appear in the Environment pane.

Depression <- read.file("http://colindawson.net/data/depression.csv")

Note that read.file() produces a result, namely a data frame object. We store that result in a variable, and we choose a name that tells us something about what the data frame contains.

This is a bit different from the way the data() command worked for loading data from an R package. There, we didn’t have to create a variable in R because one already existed; we just had to tell R that we were going to be using it, and that it should be added to our environment.
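
The same read-and-assign pattern can be sketched without an internet connection using base R's read.csv(), which also reads CSV files and also returns a data frame. The file contents, the column names id and score, and the variable name Scores are all made up for this example; they are not the contents of depression.csv.

```r
# Write a tiny made-up CSV to a temporary file (hypothetical data)
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,score", "1,10", "2,14", "3,9"), csv_path)

# read.csv() produces a data frame, which we store under a descriptive name
Scores <- read.csv(csv_path)

Scores          # prints the three rows
class(Scores)   # "data.frame"
```

As with read.file(), nothing is kept unless the result is assigned to a variable; calling read.csv(csv_path) on its own would only print the data.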

Reading data from a file on your computer

You can also point to a file on your computer by supplying the file’s location on the computer as a path, in place of the URL.

Absolute paths

In your “Files” tab you should see a folder called stat113. Double click it to see its contents in the Files tab. It should have several subdirectories, one of which is called data. If you double click on data you should see the file depression.csv inside.

If, instead of a URL, you put the path to this file (which should be "~/stat113/data/depression.csv") inside the parentheses in read.file(), R would read in this file. The ~ symbol represents your “Home” folder.

A location that starts with a forward slash / or a tilde ~ is an absolute path. That is, it doesn’t matter what directory (folder) you’re currently working in; it will look in that specific location.
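
To see what the ~ shorthand stands for on your own account, you can try the sketch below. path.expand() and file.exists() are base R functions not otherwise used in this lab, and the variable name depression_path is made up for the example.

```r
# path.expand() replaces ~ with the full location of your home folder
path.expand("~")

depression_path <- "~/stat113/data/depression.csv"
path.expand(depression_path)

# file.exists() returns TRUE only if a file is really at that location
file.exists(path.expand(depression_path))
```

Checking a path this way before reading from it can make a "file not found" problem easier to diagnose.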

Relative paths

Sometimes it is convenient to specify a file using a relative path instead, particularly when you want your code to run on another computer, where your files will not be in the exact same location they’re in on the computer you were working on.

With a relative path, we tell R to look in a location which is relative to the location you’re currently working in (which might be the location where your script is).

For example, if I am currently working in my home directory (abbreviated in paths as ~), I could replace the absolute path above with "stat113/data/depression.csv". This tells R to look for a folder called stat113 inside the current working directory; a folder called data inside that one; and a file called depression.csv inside that.

If my file were directly inside my home folder, I could just type "depression.csv" inside read.file().

If you have been following along, the code below will probably give you an error:

## Probably generates an error
Depression <- read.file("stat113/data/depression.csv")

That’s because R is not looking inside your home directory, but inside your current working directory, which should be where your project is located.

Your working directory

R interprets relative paths relative to a starting point called your working directory. It is not, unfortunately, necessarily the same as the directory you are currently viewing in the Files tab.

You can see what directory is currently set as your working directory (relative to which all relative paths are interpreted) by typing getwd() in the console.

You can change your working directory (for example, to ~/stat113) by typing setwd("~/stat113") (with the quotes).
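
The sketch below shows how the same relative path points to different places as the working directory changes. It uses tempdir(), a temporary folder that always exists, as a stand-in for a real folder, and file.path() to glue pieces of a path together; the variable names are made up for this example.

```r
old_wd <- getwd()   # remember the current working directory

relative_path <- "stat113/data/depression.csv"

# Where R would look for the file right now
file.path(old_wd, relative_path)

# Change the working directory (tempdir() stands in for a real folder)
setwd(tempdir())

# The same relative path now refers to a different location
file.path(getwd(), relative_path)

# Restore the original working directory
setwd(old_wd)
```

Saving the old working directory and restoring it afterwards keeps later relative paths from silently pointing somewhere unexpected.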

Make a code chunk in which you change your working directory to stat113, and then write a command using a relative path to read in the depression.csv dataset using the read.file() command. Don’t forget to store the resulting data frame in a named variable.

Interacting with data sets

The Pollster08 data that we read in earlier consists of several variables about various opinion polls taken during the 2008 U.S. Presidential election.

You should see in the Environment tab that Pollster08 consists of 102 observations (cases), each with 11 variables recorded.

We could have R print out the entire data table by simply typing the name of the data object:

Pollster08

However, printing the whole dataset is not that useful, especially if the data contains a lot of cases.

One advantage of RStudio is that it comes with a built-in data viewer.

Click on the name Pollster08 in the Environment pane. This will bring up a “spreadsheet”-style display of the data set in the Data Viewer (upper left window). What you should see are 11 columns, with each row representing a poll and each column representing a variable (PollTaker, PollDates, etc.). The first entry in each row is simply the row number. Use the scrollbar on the right side of the viewer window to examine the complete data set.

You can close the data viewer by clicking on the “X” in the upper lefthand corner.

R has stored this data in a data frame. This is the sort of table we have sketched in class: each row is a case, each column is a variable.

Most of what we will do in R will consist, in one way or another, of taking actions on data frames.
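
As a small illustration of the case-by-variable layout, the sketch below builds a tiny data frame by hand and looks at it in a few ways. The data and the names polls, pollster and sample_size are invented for this example (they are not taken from Pollster08), and data.frame(), names() and str() are base R functions not otherwise covered in this lab.

```r
# A made-up data frame: each row is a case, each column is a variable
polls <- data.frame(
  pollster = c("Pollster A", "Pollster B", "Pollster C"),
  sample_size = c(800, 1200, 950)
)

polls                # print the whole (small) table
names(polls)         # the variable names
head(polls, n = 2)   # just the first two cases
str(polls)           # a compact summary of each variable
```

For a large dataset, head() and str() give a quick overview without printing every case.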

Run the command below, and then examine its structure. Identify all of the function names, argument names, and argument values. (Note that sometimes argument values, particularly for the first argument of a function, are supplied without an argument name: R can assign values to names according to the order they’re given in. However, the reverse is never true: if you give an argument name you need to supply a value.) What does this command appear to do?

pull(Pollster08, var = PollTaker)

Getting Help

The same help interface works for functions just as for data frames. We can get more information about the pull() or head() functions by typing

help(pull)
help(head)

at the console. We could also have done

?pull
?head

to achieve the same result.
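
The same two forms work for any function. The sketch below uses base R functions so that it runs without loading any packages; args() is an additional base R function, not covered elsewhere in this lab, that prints just a function's argument names and default values.

```r
help(sqrt)   # opens the documentation for sqrt()
?sqrt        # shorthand for the same thing

# args() gives a quick reminder of a function's arguments
args(log)    # shows the arguments x and base, and the default for base
```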

Look up the nrow and ncol functions using the Help interface. Add a sentence to this document briefly describing what they do.

On Your Own (Homework)

Load the HoneybeeCircuits dataset from the Lock5Data package (see “Accessing data from a package”). Use the Help interface to look up what the dataset is about and add a brief summary in the text of this document below this question. Then, write an R command that returns the number of cases in the data (hint: use what you found in Exercise 9, where you looked up nrow and ncol).
Make a code chunk below and add code to read in the women_in_labor_force.csv dataset from the data directory on your server account using read.file() with an absolute path, storing the result as WomenInLabor. Make a second chunk to read the same dataset with a relative path from your project directory, storing the result as WomenInLabor2. What are the cases and variables in this dataset?
Knit your modified .Rmd file. Save a copy of your edited version of this lab (both the .Rmd and Knitted .html files) in the hw2 turnin folder.

Environment and Session Information

File creation date: 2020-09-09
R version 3.6.3 (2020-02-29)
R version (short form): 3.6.3
mosaic package version: 1.7.0
tidyverse package version: 1.3.0
Additional session information

## R version 3.6.3 (2020-02-29)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.5 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] Stat2Data_2.0.0   mosaic_1.7.0      Matrix_1.2-18     mosaicData_0.18.0
##  [5] ggformula_0.9.4   ggstance_0.3.4    lattice_0.20-41   forcats_0.5.0    
##  [9] stringr_1.4.0     dplyr_1.0.2       purrr_0.3.4       readr_1.3.1      
## [13] tidyr_1.1.1       tibble_3.0.3      ggplot2_3.3.2     tidyverse_1.3.0  
## 
## loaded via a namespace (and not attached):
##  [1] ggrepel_0.8.2     Rcpp_1.0.5        lubridate_1.7.9   assertthat_0.2.1 
##  [5] digest_0.6.25     ggforce_0.3.2     R6_2.4.1          cellranger_1.1.0 
##  [9] backports_1.1.8   reprex_0.3.0      evaluate_0.14     httr_1.4.2       
## [13] pillar_1.4.6      rlang_0.4.7       lazyeval_0.2.2    readxl_1.3.1     
## [17] rstudioapi_0.11   blob_1.2.1        rmarkdown_2.3     splines_3.6.3    
## [21] htmlwidgets_1.5.1 polyclip_1.10-0   munsell_0.5.0     broom_0.7.0      
## [25] compiler_3.6.3    modelr_0.1.8      xfun_0.16         pkgconfig_2.0.3  
## [29] htmltools_0.5.0   tidyselect_1.1.0  gridExtra_2.3     mosaicCore_0.6.0 
## [33] fansi_0.4.1       crayon_1.3.4      dbplyr_1.4.4      withr_2.2.0      
## [37] MASS_7.3-52       grid_3.6.3        jsonlite_1.7.0    gtable_0.3.0     
## [41] lifecycle_0.2.0   DBI_1.1.0         magrittr_1.5      scales_1.1.1     
## [45] cli_2.0.2         stringi_1.4.6     farver_2.0.3      fs_1.5.0         
## [49] leaflet_2.0.3     xml2_1.3.2        ggdendro_0.1.21   ellipsis_0.3.1   
## [53] generics_0.0.2    vctrs_0.3.2       tools_3.6.3       glue_1.4.1       
## [57] tweenr_1.0.1      crosstalk_1.1.0.1 hms_0.5.3         yaml_2.2.1       
## [61] colorspace_1.4-1  rvest_0.3.6       knitr_1.29        haven_2.3.1