Recall our example from class where we had a bag containing four marbles each either black or white, and a data-generation process where we would reach in the bag, pull out a marble, write down its color, and then put it back.
Having obtained a dataset consisting of a sequence “black, white, black” we asked ourselves how plausible it was that the bag contained exactly 1 black marble.
To do this we assumed that each of the five possible bags was equally plausible going in, and that each time we reached into the bag, each of the four marbles inside had an equal chance of being selected. We constructed a “tree universe” where each “path” consisted of a possible data sequence, and where the universe was divided into “possible worlds” corresponding to hypotheses about the contents of the bag.
Let’s try simulating this process in R.
We’ll generate a data sequence in two steps:
First, let’s define the universe of possible worlds:
worlds <- tribble(
  ~hypothesis, ~contents,
  "H1", "WWWW",
  "H2", "BWWW",
  "H3", "BBWW",
  "H4", "BBBW",
  "H5", "BBBB")
R Note: Data Frames
The object we’ve just constructed is a version of what’s called a “data frame”: a table in which rows represent instances (or “cases”) and columns represent characteristics (or “variables”). Often we’ll read in data from a file, but here the tribble() function lets us fill in entries manually, first specifying column names and then filling in the table by rows. Click on worlds in the Environment tab to see a “spreadsheet” view of the resulting data frame.
R Tip: The Random Number Generator “Seed”
When we use a computer to do something “random”, the computer needs to “mimic” true random number generation by doing a calculation involving some arbitrary input (like the current time in milliseconds) to produce a deterministic output. Each time we run the code, the input has changed (because by default it doesn’t come from our code), and so the output changes, but there is in fact a fixed relationship between input and output. When doing a simulation, however, it’s inconvenient if the output of a chunk of code changes each time we run it, and so it is often useful to “force” the input to a particular value before a sequence of random numbers is generated. In R we can do this with the set.seed() function, putting some arbitrary number inside the parentheses.
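As a quick standalone check of this behavior (not part of the lab's pipeline; the seed value 123 is arbitrary), setting the same seed twice produces the same "random" numbers both times:

```r
set.seed(123)   # fix the generator's starting state
a <- runif(3)   # three "random" uniform draws
set.seed(123)   # reset to the same starting state
b <- runif(3)   # ...and we get the exact same draws
identical(a, b) # TRUE
```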
In the chunk below I first set the seed and then pick one “world” from the “pool of worlds” at random. By default each row in the table of worlds has an equal chance of being selected.
Before you run it, change the seed to something unique to you, such as your T number.
set.seed(8675309)
current_world <- worlds %>%
  slice_sample(n = 1)
R Note: “Pipe” syntax
The %>% symbol is called a “pipe”. It takes the result of whatever code precedes it and passes that result on as the first argument of the function that follows. In this case we’re just passing the worlds object on to the slice_sample() function, whose first argument specifies the data frame to sample rows from. Writing worlds %>% slice_sample(n = 1) is equivalent to writing slice_sample(worlds, n = 1), but as expressions get more complex it will generally be easier to read code structured with the left-to-right pipe syntax.
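To see the reading-order difference in isolation, here is a tiny standalone sketch (the vector x is made up for illustration) of the same computation written nested and then piped:

```r
library(magrittr) # provides %>% (also loaded as part of the tidyverse)

x <- c(4, 9, 16)
sqrt(sum(x))           # nested: read inside-out (sum first, then sqrt)
x %>% sum() %>% sqrt() # piped: read left-to-right; same result
```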
Now we’ll simulate drawing 3 marbles from the bag, replacing the marble between draws. To access individual marbles we’ll first need to split the string representing the bag’s contents into a list of individual characters. Then we’ll sample with replacement from that list, producing a list of observations. Finally, for ease of readability, we’ll “smoosh” the individual characters back into a single string.
set.seed(29747)
results <- current_world %>%
  mutate(
    content_list = contents %>% str_split(pattern = ""),
    observation_list = content_list %>% map(sample, size = 3, replace = TRUE),
    observations = observation_list %>% map_chr(str_flatten)
  )
R Note: The mutate() function
The mutate() function is what lets us define new columns in our data frame. We pass in an existing data frame, and then provide the names of the new columns we want to create, whose values are calculated via the expression to the right of the =. This expression can include the names of any columns that have been defined up to that point.
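A minimal standalone sketch of this behavior (the data frame d and its columns are made up for illustration):

```r
library(dplyr)

d <- tibble(x = 1:3)
d %>% mutate(
  y = x^2,   # a new column computed from an existing one
  z = y + x) # ...which can in turn use the column defined just above
```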
R Note: Applying a function to each entry in a list with map() and map_chr()
The map() function lets us take a function and apply it to each entry in a list, returning the results as a list of the same length. Here, str_split() returns a list with one entry, which is a vector of the color labels of the individual marbles in the bag we chose at random. We want to call sample() on that vector, not on the list containing that vector, so we use map() to tell R to call sample() on each entry in the list and return a list of the same length consisting of the results, which we then put in the column observation_list. The map_chr() function does essentially the same thing as map(), except that it tries to turn the results into a “character vector” instead of a generic list. We use this when we pass the output list from sample() to the str_flatten() function, so that the results can be put into a dataset as a column.
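Here is a small standalone sketch of the split / map / flatten pattern, using a made-up bag string (and rev() in place of sample() so the result is deterministic):

```r
library(purrr)
library(stringr)

bag <- "BWWW"
pieces <- str_split(bag, pattern = "") # a list of length 1: c("B","W","W","W")
map(pieces, rev)                       # still a list: c("W","W","W","B")
map_chr(pieces, str_flatten)           # a character vector: "BWWW"
```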
I’ve stored each intermediate result of this process in a column so you can see what R is doing at each step. But we probably don’t actually care about the intermediate results; we could instead have just done:
set.seed(29747)
results <- current_world %>%
  mutate(
    observations = contents %>%
      str_split(pattern = "") %>%
      map(sample, size = 3, replace = TRUE) %>%
      map_chr(str_flatten))
which is equivalent except that it doesn’t keep the list versions of things.
A “brute force” way to estimate the posterior probability that we are in a particular world, given a set of observations, is to repeat the above process of sampling a world using the prior weights and then generating data from that world. If we do this enough times, we will have several instances of the data sequence we saw, and can simply count what proportion of those came from a particular world.
First, we’ll define our prior weights on bags, adding them as a column to our original worlds
data frame.
worlds <- worlds %>%
  mutate(
    prior = c(
      H1 = 1/5,
      H2 = 1/5,
      H3 = 1/5,
      H4 = 1/5,
      H5 = 1/5))
Next, we’ll sample 5000 ‘bags’ with replacement, with probabilities as defined by the prior.
worlds_sample <- worlds %>%
  slice_sample(
    n = 5000,
    replace = TRUE,
    weight_by = prior)
Now, draw 3 marbles with replacement from each bag via the sample() function, just as we did above. Other than using worlds_sample instead of current_world, we can use identical code, since everything we did was operating in parallel on each row in the data table.
set.seed(29747)
results <- worlds_sample %>%
  mutate(
    observations = contents %>%
      str_split(pattern = "") %>%
      map(sample, size = 3, replace = TRUE) %>%
      map_chr(str_flatten)
  )
To verify that our simulation worked as intended, let’s check that we actually did select each bag roughly an equal number of times.
tally(~hypothesis, data = results)
## hypothesis
## H1 H2 H3 H4 H5
## 990 999 1000 1033 978
R Note: “Formula” syntax
The ~ symbol is used to indicate that we’re talking about column names in a dataset, rather than “free-standing” objects in the environment. This command says, “apply the tally() function to the hypothesis column of the dataset supplied as the data= argument.”
It should be pretty close to an even split; of course there is some variation because this is a simulation: if we changed the random seed, the counts would deviate in a different pattern.
Let’s check the distribution of different observations as well.
tally(~observations, data = results)
## observations
## BBB BBW BWB BWW WBB WBW WWB WWW
## 1544 320 327 317 299 337 298 1558
We can count combinations as well:
tally(hypothesis ~ observations, data = results)
## observations
## hypothesis BBB BBW BWB BWW WBB WBW WWB WWW
## H1 0 0 0 0 0 0 0 990
## H2 19 42 46 145 41 152 124 430
## H3 123 143 128 124 115 123 122 122
## H4 424 135 153 48 143 62 52 16
## H5 978 0 0 0 0 0 0 0
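The brute-force counting step described earlier can then be done by filtering to the rows whose simulated data matches what we observed ("BWB", i.e. black, white, black) and tabulating the hypotheses. A sketch, assuming the results data frame from the chunk above (with the tidyverse loaded):

```r
results %>%
  filter(observations == "BWB") %>%       # keep only runs that produced our data
  count(hypothesis) %>%                   # how many of those came from each world?
  mutate(posterior_estimate = n / sum(n)) # proportions approximate the posterior
```

The resulting proportions should be close to the posterior probabilities we computed by hand from the “tree universe”.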
In most settings involving real data, even if we restrict ourselves to a binary outcome variable, the “world” that gave us our data is bigger than a bag of four marbles, and the “success chance” isn’t limited to just five possible values. We have used the “Beta-Bernoulli” model in class to represent a process with two possible outcomes, where the probability of each outcome is unknown and can be any real number between 0 and 1.
There, we used Bayes’ rule and algebraic manipulation of PMFs and PDFs to find posterior distributions. But we could use simulation here as well.
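As a reminder of the analytic version (assuming, as in class, a Beta(\(a\), \(b\)) prior on \(\mu\) and \(k\) ones among \(n\) observations):
\[
p(\mu \mid \text{data}) \propto \mu^{k}(1-\mu)^{n-k} \cdot \mu^{a-1}(1-\mu)^{b-1} = \mu^{a+k-1}(1-\mu)^{b+n-k-1},
\]
so the posterior is again a Beta distribution, with parameters \(a + k\) and \(b + n - k\).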
Rather than a bag with four marbles let’s suppose we have a process (such as a weighted coin, or a political poll) with two possible outcomes, which we’ll label 0 and 1. We get three (conditionally) independent observations from this process, resulting in the observation sequence 1,0,1.
Let’s simulate this process 100000 times, letting \(\mu\) represent the probability of seeing a 1. We’ll start by using a Uniform prior on \(\mu\).
SimulationData <- tibble(
  mu = runif(n = 100000, min = 0, max = 1))
We can plot the resulting collection of \(\mu\) values using a “density histogram”: a version of a histogram where the area of each bar corresponds to the proportion of values in that bin (in other words, it is an “empirical distribution” which is comparable to a density curve).
gf_dhistogram(~mu, data = SimulationData, breaks = seq(0,1,0.1))
For each of these simulated mu values, we can simulate an observation sequence of three values.
SimulationData <- SimulationData %>%
  mutate(
    observations = mu %>%
      map(
        # function to sample number of 'successes' among independent Bernoullis
        rbinom, n = 3, # How many runs to sample
        size = 1       # How many independent Bernoullis per run
        # prob = .     # The prob= argument would control the success chance, but
        #              # this is getting filled in via the values of mu coming in
        #              # through the pipe
      ) %>%
      map_chr(str_flatten))
Let’s see how often we got each sequence:
tally(~observations, data = SimulationData)
## observations
## 000 001 010 011 100 101 110 111
## 24970 8421 8280 8411 8332 8300 8321 24965
Now let’s “filter” our simulation results to retain only those iterations that produced “101”. That is, let’s “condition” on the data.
FilteredSimulation <- SimulationData %>%
  filter(observations == "101")
Let’s plot the “empirical density” of the \(\mu\) values associated with this conditional world using a density histogram. Analogous to a PDF, areas represent proportions, and heights represent proportions per unit width.
gf_dhistogram(~mu, data = FilteredSimulation, breaks = seq(0,1,0.05))
Change the shape1 and shape2 arguments to match what \(a\) and \(b\) ought to be for this posterior distribution, and then run the chunk below to make the graph of the simulated and analytic posterior densities. If your parameters are correct, the heights of the blue dots representing the analytic posterior should follow the histogram.
FilteredSimulation <- FilteredSimulation %>%
  mutate(
    analytic_prior = dbeta(mu, shape1 = 1, shape2 = 1), # computes beta densities
    analytic_posterior = dbeta(mu, shape1 = 1, shape2 = 1))
gf_dhistogram(~mu, data = FilteredSimulation, breaks = seq(0,1,0.05)) %>%
  gf_point(analytic_prior ~ mu, size = 0.5, color = "red") %>%
  gf_point(analytic_posterior ~ mu, size = 0.5, color = "blue")
Knit this document to an .html file (the default format). Then, save both this .Rmd file and the Knitted .html file to the ~/stat113/turnin/hw4/ folder.

mosaic package version: 1.8.3
tidyverse package version: 1.3.1

## R version 4.1.2 (2021-11-01)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.4 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] mosaic_1.8.3 ggridges_0.5.3 mosaicData_0.20.2 ggformula_0.10.1
## [5] ggstance_0.3.5 Matrix_1.3-4 lattice_0.20-44 forcats_0.5.1
## [9] stringr_1.4.0 dplyr_1.0.7 purrr_0.3.4 readr_2.0.1
## [13] tidyr_1.1.3 tibble_3.1.6 ggplot2_3.3.5 tidyverse_1.3.1
##
## loaded via a namespace (and not attached):
## [1] ggdendro_0.1.22 httr_1.4.2 sass_0.4.0 jsonlite_1.7.2
## [5] splines_4.1.2 modelr_0.1.8 bslib_0.3.0 assertthat_0.2.1
## [9] highr_0.9 cellranger_1.1.0 ggrepel_0.9.1 yaml_2.2.1
## [13] pillar_1.6.4 backports_1.2.1 glue_1.5.1 digest_0.6.29
## [17] polyclip_1.10-0 rvest_1.0.1 colorspace_2.0-2 htmltools_0.5.2
## [21] plyr_1.8.6 pkgconfig_2.0.3 broom_0.7.9 labelled_2.8.0
## [25] haven_2.4.3 scales_1.1.1 tweenr_1.0.2 tzdb_0.1.2
## [29] ggforce_0.3.3 generics_0.1.0 farver_2.1.0 ellipsis_0.3.2
## [33] withr_2.4.3 cli_3.1.0 magrittr_2.0.1 crayon_1.4.2
## [37] readxl_1.3.1 evaluate_0.14 fs_1.5.0 fansi_0.5.0
## [41] MASS_7.3-54 xml2_1.3.2 tools_4.1.2 hms_1.1.0
## [45] lifecycle_1.0.1 munsell_0.5.0 reprex_2.0.1 compiler_4.1.2
## [49] jquerylib_0.1.4 rlang_0.4.12 grid_4.1.2 rstudioapi_0.13
## [53] htmlwidgets_1.5.4 crosstalk_1.1.1 labeling_0.4.2 mosaicCore_0.9.0
## [57] rmarkdown_2.10 gtable_0.3.0 DBI_1.1.1 R6_2.5.1
## [61] gridExtra_2.3 lubridate_1.7.10 knitr_1.34 fastmap_1.1.0
## [65] utf8_1.2.2 stringi_1.7.4 Rcpp_1.0.8 vctrs_0.3.8
## [69] leaflet_2.0.4.1 dbplyr_2.1.1 tidyselect_1.1.1 xfun_0.25