Recall our example from class where we had a bag containing four marbles each either black or white, and a data-generation process where we would reach in the bag, pull out a marble, write down its color, and then put it back.
Having obtained a dataset consisting of a sequence “black, white, black” we asked ourselves how plausible it was that the bag contained exactly 1 black marble.
To do this we assumed that each of the five possible bags was equally plausible going in, and that each time we reached into the bag, each of the four marbles inside had an equal chance of being selected. We constructed a “tree universe” where each “path” consisted of a possible data sequence, and where the universe was divided into “possible worlds” corresponding to hypotheses about the contents of the bag.
Let’s try simulating this process in R.
We’ll generate a data sequence in two steps:
First, let’s define the universe of possible worlds:
worlds <- tribble(
  ~hypothesis, ~contents,
  "H1", "WWWW",
  "H2", "BWWW",
  "H3", "BBWW",
  "H4", "BBBW",
  "H5", "BBBB")
R Note: Data Frames
The object we’ve just constructed is a version of what’s called a “data frame”: a table in which rows represent instances (or “cases”) and columns represent characteristics (or “variables”). Often we’ll read in data from a file, but here the tribble() function lets us fill in entries manually, first specifying column names and then filling in the table by rows. Click on worlds in the Environment tab to see a “spreadsheet” view of the resulting data frame.
R Tip: The Random Number Generator “Seed”
When we use a computer to do something “random”, the computer needs to “mimic” true random number generation by doing a calculation involving some arbitrary input (like the current time in milliseconds) to produce a deterministic output. Each time we run the code, the input has changed (because by default it doesn’t come from our code), and so the output changes, but there is in fact a fixed relationship between input and output. When doing a simulation, however, it’s inconvenient if the output of a chunk of code changes each time we run it, and so it is often useful to “force” the input to a particular value before a sequence of random numbers is generated. In R we can do this with the set.seed() function, putting some arbitrary number inside the parentheses.
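As a quick standalone check of this behavior (not part of the lab's pipeline; the seed value 123 is arbitrary), setting the same seed twice produces the same "random" numbers both times:

```r
set.seed(123)   # fix the generator's starting state
a <- runif(3)   # three "random" uniform draws
set.seed(123)   # reset to the same starting state
b <- runif(3)   # ...and we get the exact same draws
identical(a, b) # TRUE
```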
In the chunk below I first set the seed and then pick one “world” from the “pool of worlds” at random. By default each row in the table of worlds has an equal chance of being selected.
Before you run it, change the seed to something unique to you, such as your T number.
set.seed(8675309)
current_world <- worlds %>%
  slice_sample(n = 1)
R Note: “Pipe” syntax
The %>% symbol is called a “pipe”. It takes the result of whatever code precedes it and passes that result on as the first argument of the function that follows. In this case we’re just passing the worlds object on to the slice_sample() function, whose first argument specifies the data frame to sample rows from. Writing worlds %>% slice_sample(n = 1) is equivalent to writing slice_sample(worlds, n = 1), but as expressions get more complex it will generally be easier to read code structured with the left-to-right pipe syntax.
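To see the reading-order difference in isolation, here is a tiny standalone sketch (the vector x is made up for illustration) of the same computation written nested and then piped:

```r
library(magrittr) # provides %>% (also loaded as part of the tidyverse)

x <- c(4, 9, 16)
sqrt(sum(x))           # nested: read inside-out (sum first, then sqrt)
x %>% sum() %>% sqrt() # piped: read left-to-right; same result
```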
Now we’ll simulate drawing 3 marbles from the bag, replacing the marble between draws. To access individual marbles we’ll first need to split the string representing the bag’s contents into a list of individual characters. Then we’ll sample with replacement from that list, producing a list of observations. Finally, for ease of readability, we’ll “smoosh” the individual characters back into a single string.
set.seed(29747)
results <- current_world %>%
  mutate(
    content_list = contents %>% str_split(pattern = ""),
    observation_list = content_list %>% map(sample, size = 3, replace = TRUE),
    observations = observation_list %>% map_chr(str_flatten)
  )
R Note: The mutate() function
The mutate() function is what lets us define new columns in our data frame. We pass in an existing data frame, and then provide the names of the new columns we want to create, whose values are calculated via the expression to the right of the =. This expression can include the names of any columns that have been defined up to that point.
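A minimal standalone sketch of this behavior (the data frame d and its columns are made up for illustration):

```r
library(dplyr)

d <- tibble(x = 1:3)
d %>% mutate(
  y = x^2,   # a new column computed from an existing one
  z = y + x) # ...which can in turn use the column defined just above
```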
R Note: Applying a function to each entry in a list with map() and map_chr()
The map() function lets us take a function and apply it to each entry in a list, returning the results as a list of the same length. Here, str_split() returns a list with one entry, which is a vector of the color labels of the individual marbles in the bag we chose at random. We want to call sample() on that vector, not on the list containing that vector, so we use map() to tell R to call sample() on each entry in the list and return a list of the same length consisting of the results, which we then put in the column observation_list. The map_chr() function does essentially the same thing as map(), except that it tries to turn the results into a “character vector” instead of a generic list. We use this when we pass the output list from sample() to the str_flatten() function, so that the results can be put into a dataset as a column.
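Here is a small standalone sketch of the split / map / flatten pattern, using a made-up bag string (and rev() in place of sample() so the result is deterministic):

```r
library(purrr)
library(stringr)

bag <- "BWWW"
pieces <- str_split(bag, pattern = "") # a list of length 1: c("B","W","W","W")
map(pieces, rev)                       # still a list: c("W","W","W","B")
map_chr(pieces, str_flatten)           # a character vector: "BWWW"
```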
I’ve stored each intermediate result of this process in a column so you can see what R is doing at each step. But we probably don’t actually care about the intermediate results; we could instead have just done:
set.seed(29747)
results <- current_world %>%
  mutate(
    observations = contents %>%
      str_split(pattern = "") %>%
      map(sample, size = 3, replace = TRUE) %>%
      map_chr(str_flatten))
which is equivalent except that it doesn’t keep the list versions of things.
A “brute force” way to estimate the posterior probability that we are in a particular world, given a set of observations, is to repeat the above process of sampling a world using the prior weights and then generating data from that world. If we do this enough times, we will have several instances of the data sequence we saw, and can simply count what proportion of those came from a particular world.
First, we’ll define our prior weights on bags, adding them as a column to our original worlds
data frame.
worlds <- worlds %>%
  mutate(
    prior = c(
      H1 = 1/5,
      H2 = 1/5,
      H3 = 1/5,
      H4 = 1/5,
      H5 = 1/5))
Next, we’ll sample 5000 ‘bags’ with replacement, with probabilities as defined by the prior.
worlds_sample <- worlds %>%
  slice_sample(
    n = 5000,
    replace = TRUE,
    weight_by = prior)
Now, draw 3 marbles with replacement from each bag via the sample() function, just as we did above. Other than using worlds_sample instead of current_world, we can use identical code, since everything we did was operating in parallel on each row in the data table.
set.seed(29747)
results <- worlds_sample %>%
  mutate(
    observations = contents %>%
      str_split(pattern = "") %>%
      map(sample, size = 3, replace = TRUE) %>%
      map_chr(str_flatten)
  )
To verify that our simulation worked as intended, let’s check that we actually did select each bag roughly an equal number of times.
tally(~hypothesis, data = results)
## hypothesis
## H1 H2 H3 H4 H5
## 990 999 1000 1033 978
R Note: “Formula” syntax
The ~ symbol is used to indicate that we’re talking about column names in a dataset, rather than “free-standing” objects in the environment. This command says, “apply the tally() function to the hypothesis column of the dataset supplied as the data= argument.”
It should be pretty close to an even split; of course there is some variation because this is a simulation: if we changed the random seed, the counts would deviate in a different pattern.
Let’s check the distribution of different observations as well.
tally(~observations, data = results)
## observations
## BBB BBW BWB BWW WBB WBW WWB WWW
## 1544 320 327 317 299 337 298 1558
We can count combinations as well:
tally(hypothesis ~ observations, data = results)
## observations
## hypothesis BBB BBW BWB BWW WBB WBW WWB WWW
## H1 0 0 0 0 0 0 0 990
## H2 19 42 46 145 41 152 124 430
## H3 123 143 128 124 115 123 122 122
## H4 424 135 153 48 143 62 52 16
## H5 978 0 0 0 0 0 0 0
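The brute-force counting step described earlier can then be done by filtering to the rows whose simulated data matches what we observed ("BWB", i.e. black, white, black) and tabulating the hypotheses. A sketch, assuming the results data frame from the chunk above (with the tidyverse loaded):

```r
results %>%
  filter(observations == "BWB") %>%       # keep only runs that produced our data
  count(hypothesis) %>%                   # how many of those came from each world?
  mutate(posterior_estimate = n / sum(n)) # proportions approximate the posterior
```

The resulting proportions should be close to the posterior probabilities we computed by hand from the “tree universe”.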
In most settings involving real data, even if we restrict ourselves to a binary outcome variable, the “world” that gave us our data is bigger than a bag of four marbles, and the “success chance” isn’t limited to just five possible values. We have used the “Beta-Bernoulli” model in class to represent a process with two possible outcomes, where the probability of each outcome is unknown and can be any real number between 0 and 1.
There, we used Bayes’ rule and algebraic manipulation of PMFs and PDFs to find posterior distributions. But we could use simulation here as well.
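As a reminder of the analytic version (assuming, as in class, a Beta(\(a\), \(b\)) prior on \(\mu\) and \(k\) ones among \(n\) observations):
\[
p(\mu \mid \text{data}) \propto \mu^{k}(1-\mu)^{n-k} \cdot \mu^{a-1}(1-\mu)^{b-1} = \mu^{a+k-1}(1-\mu)^{b+n-k-1},
\]
so the posterior is again a Beta distribution, with parameters \(a + k\) and \(b + n - k\).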
Rather than a bag with four marbles let’s suppose we have a process (such as a weighted coin, or a political poll) with two possible outcomes, which we’ll label 0 and 1. We get three (conditionally) independent observations from this process, resulting in the observation sequence 1,0,1.
Let’s simulate this process 100000 times, letting \(\mu\) represent the probability of seeing a 1. We’ll start by using a Uniform prior on \(\mu\).
SimulationData <- tibble(
  mu = runif(n = 100000, min = 0, max = 1))
We can plot the resulting collection of \(\mu\) values using a “density histogram”: a version of a histogram where the area of each bar corresponds to the proportion of values in that bin (in other words, it is an “empirical distribution” which is comparable to a density curve).
gf_dhistogram(~mu, data = SimulationData, breaks = seq(0,1,0.1))
For each of these simulated mu values, we can simulate an observation sequence of three values.
SimulationData <- SimulationData %>%
  mutate(
    observations = mu %>%
      map(
        # function to sample number of 'successes' among independent Bernoullis
        rbinom, n = 3, # How many runs to sample
        size = 1       # How many independent Bernoullis per run
        # prob = .     # The prob= argument would control the success chance, but
        #              # this is getting filled in via the values of mu coming in
        #              # through the pipe
      ) %>%
      map_chr(str_flatten))
Let’s see how often we got each sequence:
tally(~observations, data = SimulationData)
## observations
## 000 001 010 011 100 101 110 111
## 24970 8421 8280 8411 8332 8300 8321 24965
Now let’s “filter” our simulation results to retain only those iterations that produced “101”. That is, let’s “condition” on the data.
FilteredSimulation <- SimulationData %>%
  filter(observations == "101")
Let’s plot the “empirical density” of the \(\mu\) values associated with this conditional world using a density histogram. Analogous to a PDF, areas represent proportions, and heights represent proportions per unit width.
gf_dhistogram(~mu, data = FilteredSimulation, breaks = seq(0,1,0.05))
Change the shape1 and shape2 arguments to match what \(a\) and \(b\) ought to be for this posterior distribution, and then run the chunk below to make the graph of the simulated and analytic posterior densities. If your parameters are correct, the heights of the blue dots representing the analytic posterior should follow the histogram.
FilteredSimulation <- FilteredSimulation %>%
  mutate(
    analytic_prior = dbeta(mu, shape1 = 1, shape2 = 1), # computes beta densities
    analytic_posterior = dbeta(mu, shape1 = 1, shape2 = 1))
gf_dhistogram(~mu, data = FilteredSimulation, breaks = seq(0,1,0.05)) %>%
  gf_point(analytic_prior ~ mu, size = 0.5, color = "red") %>%
  gf_point(analytic_posterior ~ mu, size = 0.5, color = "blue")
Knit this document to an .html file (the default format). Then, save both this .Rmd file and the Knitted .html file to the ~/stat113/turnin/hw4/ folder.

mosaic package version: 1.8.3
tidyverse package version: 1.3.1

## R version 4.1.2 (2021-11-01)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.4 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] mosaic_1.8.3 ggridges_0.5.3 mosaicData_0.20.2 ggformula_0.10.1
## [5] ggstance_0.3.5 Matrix_1.3-4 lattice_0.20-44 forcats_0.5.1
## [9] stringr_1.4.0 dplyr_1.0.7 purrr_0.3.4 readr_2.0.1
## [13] tidyr_1.1.3 tibble_3.1.6 ggplot2_3.3.5 tidyverse_1.3.1
##
## loaded via a namespace (and not attached):
## [1] ggdendro_0.1.22 httr_1.4.2 sass_0.4.0 jsonlite_1.7.2
## [5] splines_4.1.2 modelr_0.1.8 bslib_0.3.0 assertthat_0.2.1
## [9] highr_0.9 cellranger_1.1.0 ggrepel_0.9.1 yaml_2.2.1
## [13] pillar_1.6.4 backports_1.2.1 glue_1.5.1 digest_0.6.29
## [17] polyclip_1.10-0 rvest_1.0.1 colorspace_2.0-2 htmltools_0.5.2
## [21] plyr_1.8.6 pkgconfig_2.0.3 broom_0.7.9 labelled_2.8.0
## [25] haven_2.4.3 scales_1.1.1 tweenr_1.0.2 tzdb_0.1.2
## [29] ggforce_0.3.3 generics_0.1.0 farver_2.1.0 ellipsis_0.3.2
## [33] withr_2.4.3 cli_3.1.0 magrittr_2.0.1 crayon_1.4.2
## [37] readxl_1.3.1 evaluate_0.14 fs_1.5.0 fansi_0.5.0
## [41] MASS_7.3-54 xml2_1.3.2 tools_4.1.2 hms_1.1.0
## [45] lifecycle_1.0.1 munsell_0.5.0 reprex_2.0.1 compiler_4.1.2
## [49] jquerylib_0.1.4 rlang_0.4.12 grid_4.1.2 rstudioapi_0.13
## [53] htmlwidgets_1.5.4 crosstalk_1.1.1 labeling_0.4.2 mosaicCore_0.9.0
## [57] rmarkdown_2.10 gtable_0.3.0 DBI_1.1.1 R6_2.5.1
## [61] gridExtra_2.3 lubridate_1.7.10 knitr_1.34 fastmap_1.1.0
## [65] utf8_1.2.2 stringi_1.7.4 Rcpp_1.0.8 vctrs_0.3.8
## [69] leaflet_2.0.4.1 dbplyr_2.1.1 tidyselect_1.1.1 xfun_0.25