The Prior and Posterior Predictive Distributions

Mathematically, the prior and posterior are relatively simple objects. But as probability distributions over an unknown quantity representing a property of a process we can't observe directly, they are somewhat abstract in connection with the questions we care about. What does it mean to have a Beta(3,3) prior for \(\mu\), or a Beta(8,4) posterior, for example?

In this case we were able to get some intuition about what these parameters represent: \(a = 3\) acts as though we’d seen 3 “successes” from this process and \(b = 3\) is as though we’d also seen 3 “failures”. If we combine these “virtual” observations with 5 real successes and 1 real failure, we get the Beta(8,4) posterior.
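This conjugate updating is just arithmetic on the counts. A minimal sketch (the variable names here are mine, for illustration):

```r
## Beta prior + Bernoulli data: add the real counts to the "virtual" counts
a_prior <- 3; b_prior <- 3     # Beta(3,3): 3 virtual successes, 3 virtual failures
successes <- 5; failures <- 1  # the observed data

a_post <- a_prior + successes  # 8
b_post <- b_prior + failures   # 4
c(a_post, b_post)              # the Beta(8,4) posterior parameters
```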

On its own this is of limited use, however. In most settings, the reason we are analyzing data in the first place is so it can guide some sort of decision-making. For example, we want to predict whether the actions we take will be successful, with some sense of how confident we should be.

How can we decide how much and what kind of “virtual data” to put into our priors? And how can we use a posterior distribution over some abstract parameter to make real predictions?

The Prior Encodes Predictions About the Data

It’s all well and good that a Beta(1,1) (uniform) prior encodes 1 virtual success and 1 virtual failure, or that a Beta(3,3) encodes 3 of each, but what is this actually saying?

We can get a more concrete sense of this by investigating the predictive distribution that the prior creates. That is, if we use a Uniform prior for \(\mu\), say, what does that say about the kinds of datasets we expect to see?

The following code chunk replicates the simulation we did in the last lab in a consolidated form:

  • Sample \(\mu\) from the prior, \(p(\mu)\)
  • Sample 3 observations from the distribution that \(\mu\) defines: \(p(x \mid \mu)\)
  • Repeat thousands of times to obtain a joint distribution of \(\mu\) and \(x\)s

In addition, I’ve created a column that simply counts the number of successes in each set of 3 observations; this is a useful summary statistic, since we usually don’t care much about the difference between, say, 101 and 011.

library(mosaic)     ## provides gf_props(), gf_col(), and the formula interface to mean()
library(tidyverse)  ## provides tibble(), %>%, map(), and str_flatten()

set.seed(8675309)
SimulationData <- tibble(
  mu = runif(n = 100000, min = 0, max = 1)) %>%
  mutate(
      obs_vector = mu %>%
        map(
          rbinom,    # function to sample number of 'successes' among independent Bernoullis
          n    = 3,  # How many runs to sample
          size = 1   # How many independent Bernoullis per run
          # prob = . # The prob= argument would control the success chance, but this is getting
                     # filled in via the values of mu coming in through the pipe
          ),
      obs_string  = obs_vector %>% map_chr(str_flatten), ## collect results into a string
      n_successes = obs_vector %>% map_int(sum)          ## count 1s in each result
      )

Each row in the resulting data frame consists of one element of the Bayesian universe (sample space), and the columns represent properties of that element. With this representation of the universe, we can do all sorts of things.

For example, we can graph the distribution of the number of successes, aggregating over all possible values of \(\mu\) (as weighted by the chosen prior); in other words, we can estimate the marginal PMF of this random variable.

gf_props(~n_successes, data = SimulationData)

This makes the prior a bit more concrete: it’s evidently saying that we think any number of successes from 0 to 3 is equally likely before we see any data.

We can repeat this process with a Beta(3,3) prior and see how the results differ.

SimulationData2 <- tibble(
  mu = rbeta(n = 100000, shape1 = 3, shape2 = 3)) %>%
  mutate(
      obs_vector = mu %>%
        map(
          rbinom,    # function to sample number of 'successes' among independent Bernoullis
          n    = 3,  # How many runs to sample
          size = 1   # How many independent Bernoullis per run
          # prob = . # The prob= argument would control the success chance, but this is getting
                     # filled in via the values of mu coming in through the pipe
          ),
      obs_string  = obs_vector %>% map_chr(str_flatten), ## collect results into a string
      n_successes = obs_vector %>% map_int(sum)          ## count 1s in each result
      )
gf_props(~n_successes, data = SimulationData2)

We can see that this prior corresponds to an expectation that we will see 1 or 2 successes about twice as often as we’ll see 0 or 3.
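These simulated proportions can be checked against the exact prior predictive distribution, which for a Beta(a, b) prior over \(\mu\) is the Beta-Binomial PMF. Here is a sketch using base R's beta() function (beta_binom_pmf is a helper name I'm introducing, not something defined in this lab):

```r
## Exact prior predictive PMF for k successes in `size` trials under a Beta(a, b) prior:
## integrates choose(size, k) * mu^k * (1 - mu)^(size - k) against the prior density
beta_binom_pmf <- function(k, size, a, b) {
  choose(size, k) * beta(k + a, size - k + b) / beta(a, b)
}

beta_binom_pmf(0:3, size = 3, a = 1, b = 1)  # uniform prior: exactly 1/4 for each count
beta_binom_pmf(0:3, size = 3, a = 3, b = 3)  # Beta(3,3) prior: 5/28, 9/28, 9/28, 5/28
```

Under the Beta(3,3) prior the exact ratio is (9/28)/(5/28) = 1.8, which is the "about twice as often" visible in the simulation.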

For comparison, what do the predictions look like for some specific values of \(\mu\)?

Conditioned on \(\mu\), the number of successes has a Binomial distribution. We can graph its PMF directly. Here’s the PMF of the number of successes in 3 trials with \(\mu = 0.5\):

ConditionalData_mu0.5 <- tibble(
  n_successes = 0:3,
  pmf         = dbinom(n_successes, size = 3, prob = 0.5)
)

gf_col(pmf ~ n_successes, data = ConditionalData_mu0.5)

Notice that qualitatively it has a similar shape to the previous graph, which aggregated over all possible values of \(\mu\) weighted by the prior; but whereas there, 1s and 2s were about twice as likely as 0s and 3s, here they are three times as likely.
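This three-to-one ratio is exact, not approximate: with \(\mu = 0.5\), every sequence of 3 Bernoulli outcomes is equally likely, so the ratio of probabilities is just the ratio of sequence counts, \(\binom{3}{1} / \binom{3}{0} = 3\). A quick check:

```r
## With mu = 0.5 all 2^3 sequences have probability 1/8, so P(1 success) / P(0 successes)
## reduces to the ratio of binomial coefficients, 3/1
dbinom(1, size = 3, prob = 0.5) / dbinom(0, size = 3, prob = 0.5)  # exactly 3
```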

  1. Repeat the simulation above for different prior parameters. What happens to the prior predictive distribution as we increase \(a\) and \(b\)? What kind of prior produces a predictive distribution that looks like the conditional distribution we just made? Why do you think this is?

YOUR CODE AND RESPONSE

The Posterior Encodes Predictions About Future Data

There is no mathematical difference between the role the prior plays in predicting data and the role the posterior plays in predicting data: both simply define a way to weight possible worlds, each of which has a specific description of how the data-generating process works.

After seeing three observations, two of which were 1s, suppose we want to estimate how likely it is that the next observation will be a 1. We could actually find this analytically in this case, but in more complex models the integrals involved will not be tractable, and so we will need to rely on computational approximations such as simulation.

  1. If we used the Uniform (Beta(1,1)) prior, then our posterior is a Beta(3,2) distribution. Modify the code below to simulate the next observation, again aggregating over possible worlds, but this time weighting them according to this posterior. Give an estimate of \(p(x_{next} = 1 | x_{old})\) from this simulation.

SOLUTION
PosteriorPredictive <- tibble(
  mu = rbeta(n = 100000, shape1 = 3, shape2 = 2)) %>%  ## sample mu from the Beta(3,2) posterior
  mutate(
      x_next = mu %>%
        map_int(
          rbinom,    # draw the next observation: a single Bernoulli trial
          n    = 1,  # one future observation per sampled mu
          size = 1   # each observation is a single Bernoulli (0 or 1)
          # prob = . # filled in by the values of mu coming through the pipe
          )
      )
gf_props(~x_next, data = PosteriorPredictive)
mean(~x_next, data = PosteriorPredictive)  ## estimate of p(x_next = 1 | x_old); should be near 3/5

Quantities of Interest as Expected Values of Functions of Parameters

In many cases we can represent our “bottom line” quantities as expected values of some function of the unknown parameters.

For instance, if what we’re interested in is \(p(x_{next} = 1 | x_{old})\) as above, we can represent this via marginalization as

\[ p(x_{next} = 1 | x_{old}) = \int_{0}^1 p(x_{next} = 1, \mu | x_{old}) \, d\mu \]

and rewrite the integrand using the product rule as

\[ p(x_{next} = 1, \mu | x_{old}) = p(x_{next} = 1 | \mu, x_{old}) \, p(\mu | x_{old}) \] yielding

\[ p(x_{next} = 1 | x_{old}) = \int_{0}^1 p(x_{next} = 1 | \mu, x_{old}) p(\mu | x_{old}) d\mu \]

Notice that this has the form of an Expected Value of a function of \(\mu\), where the function is the predictive probability \(p(x_{next} = 1 | \mu, x_{old})\) which depends on \(\mu\) (in fact, in this case it simply is \(\mu\)), and where we are taking the expected value with respect to the posterior distribution of \(\mu\).
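Since the posterior here is a Beta distribution, this expected value has a closed form that provides a useful check on the simulation: for a Beta(a, b) posterior,

\[ E[\mu | x_{old}] = \int_{0}^1 \mu \, p(\mu | x_{old}) \, d\mu = \frac{a}{a + b} \]

so with the Beta(3,2) posterior, \(p(x_{next} = 1 | x_{old}) = 3/(3+2) = 3/5\).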

In other words, the quantity of interest is

\[ E[f(\mu) | x_{old}], \] which is simply the posterior mean of \(f(\mu)\), with \(f(\mu) = p(x_{next} = 1 | \mu, x_{old})\).

If we calculate this quantity for each value of \(\mu\) sampled from its posterior distribution in our simulation, we can estimate the expected value by simply finding the average in the simulation.

  1. Modify the code below, which currently calculates an estimate of the expected value of \(p(x_{next} = 1 | \mu)\) using the prior distribution for \(\mu\), to find \(E[p(x_{next} = 1 | \mu, x_{old})]\) with respect to the posterior distribution of \(\mu\). Does your answer match the one you got by simulating the next observation?

SOLUTION

PosteriorSimulation <- tibble(
  mu = rbeta(n = 100000, shape1 = 3, shape2 = 2)) %>%  ## sample mu from the Beta(3,2) posterior
  mutate(
    f_mu = dbinom(x = 1, size = 1, prob = mu)  ## calculate p(x_next = 1 | mu) (here just mu itself)
      )
estimated_px <- mean(~f_mu, data = PosteriorSimulation)
estimated_px  ## should be close to the simulation-based estimate (about 3/5)

Turning in Your Work

  1. When finished, Knit your Markdown document to an .html file (the default format). Then, save both this .Rmd file and the Knitted .html file to the ~/stat113/turnin/hw4/ folder.

Environment and Session Information

  • File creation date: 2022-03-21
  • R version 4.1.2 (2021-11-01)
  • R version (short form): 4.1.2
  • mosaic package version: 1.8.3
  • tidyverse package version: 1.3.1
  • Additional session information
## R version 4.1.2 (2021-11-01)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.4 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] mosaic_1.8.3      ggridges_0.5.3    mosaicData_0.20.2 ggformula_0.10.1 
##  [5] ggstance_0.3.5    Matrix_1.3-4      lattice_0.20-44   forcats_0.5.1    
##  [9] stringr_1.4.0     dplyr_1.0.7       purrr_0.3.4       readr_2.0.1      
## [13] tidyr_1.1.3       tibble_3.1.6      ggplot2_3.3.5     tidyverse_1.3.1  
## 
## loaded via a namespace (and not attached):
##  [1] ggdendro_0.1.22   httr_1.4.2        sass_0.4.0        jsonlite_1.7.2   
##  [5] splines_4.1.2     modelr_0.1.8      bslib_0.3.0       assertthat_0.2.1 
##  [9] highr_0.9         cellranger_1.1.0  ggrepel_0.9.1     yaml_2.2.1       
## [13] pillar_1.6.4      backports_1.2.1   glue_1.5.1        digest_0.6.29    
## [17] polyclip_1.10-0   rvest_1.0.1       colorspace_2.0-2  htmltools_0.5.2  
## [21] plyr_1.8.6        pkgconfig_2.0.3   broom_0.7.9       labelled_2.8.0   
## [25] haven_2.4.3       scales_1.1.1      tweenr_1.0.2      tzdb_0.1.2       
## [29] ggforce_0.3.3     generics_0.1.0    farver_2.1.0      ellipsis_0.3.2   
## [33] withr_2.4.3       cli_3.1.0         magrittr_2.0.1    crayon_1.4.2     
## [37] readxl_1.3.1      evaluate_0.14     fs_1.5.0          fansi_0.5.0      
## [41] MASS_7.3-54       xml2_1.3.2        tools_4.1.2       hms_1.1.0        
## [45] lifecycle_1.0.1   munsell_0.5.0     reprex_2.0.1      compiler_4.1.2   
## [49] jquerylib_0.1.4   rlang_0.4.12      grid_4.1.2        rstudioapi_0.13  
## [53] htmlwidgets_1.5.4 crosstalk_1.1.1   labeling_0.4.2    mosaicCore_0.9.0 
## [57] rmarkdown_2.10    gtable_0.3.0      DBI_1.1.1         R6_2.5.1         
## [61] gridExtra_2.3     lubridate_1.7.10  knitr_1.34        fastmap_1.1.0    
## [65] utf8_1.2.2        stringi_1.7.4     Rcpp_1.0.8        vctrs_0.3.8      
## [69] leaflet_2.0.4.1   dbplyr_2.1.1      tidyselect_1.1.1  xfun_0.25