The prior and posterior are relatively simple objects, mathematically, but as probability distributions over an unknown quantity that represents a property of a process that we can’t observe directly, they’re a bit abstract in connection to the questions we care about. What does it mean if we have a Beta(3,3) prior, or a Beta(8,4) distribution for \(\mu\), for example?
In this case we were able to get some intuition about what these parameters represent: \(a = 3\) acts as though we’d seen 3 “successes” from this process and \(b = 3\) is as though we’d also seen 3 “failures”. If we combine these “virtual” observations with 5 real successes and 1 real failure, we get the Beta(8,4) posterior.
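To make this arithmetic concrete, here is the update written out in R (a quick sketch; the variable names below are just for illustration and are not part of the lab code):

a_prior <- 3; b_prior <- 3      # Beta(3,3) prior: 3 virtual successes, 3 virtual failures
successes <- 5; failures <- 1   # real observations
a_post <- a_prior + successes   # 3 + 5 = 8
b_post <- b_prior + failures    # 3 + 1 = 4
c(a_post, b_post)               # parameters of the Beta(8,4) posterior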
On its own this is of limited use, however. In most settings, the reason we are analyzing data in the first place is so it can guide some sort of decision-making. For example, we want to predict whether the actions we take will be successful, with some sense of how confident we should be.
How can we decide how much and what kind of “virtual data” to put into our priors? And how can we use a posterior distribution over some abstract parameter to make real predictions?
It’s all well and good that a Beta(1,1) (uniform) prior encodes 1 virtual success and 1 virtual failure, or that a Beta(3,3) encodes 3 of each, but what is this actually saying?
We can get a more concrete sense of this by investigating the predictive distribution that the prior creates. That is, if we use a Uniform prior for \(\mu\), say, what does that say about the kinds of datasets we expect to see?
The following code chunk replicates the simulation we did in the last lab in a consolidated form:
In addition, I’ve created a column that simply counts the number of successes in each set of 3 observations; this is a useful summary statistic, since we usually don’t care much about the difference between, say, 101 and 011.
set.seed(8675309)
SimulationData <- tibble(
  mu = runif(n = 100000, min = 0, max = 1)
) %>%
  mutate(
    obs_vector = mu %>%
      map(
        rbinom,    # function to sample the number of 'successes' among independent Bernoullis
        n = 3,     # how many runs to sample
        size = 1   # how many independent Bernoullis per run
        # prob = . # the prob= argument would control the success chance, but this is
        #            getting filled in via the values of mu coming in through the pipe
      ),
    obs_string  = obs_vector %>% map_chr(str_flatten),  ## collect results into a string
    n_successes = obs_vector %>% map_int(sum)           ## count 1s in each result
  )
Each row in the resulting data frame consists of one element of the Bayesian universe (sample space), and the columns represent properties of that element. With this representation of the universe, we can do all sorts of things.
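One simple illustration (not part of the original lab code): we can tabulate how often each possible sequence of observations occurs across the simulated worlds, using mosaic’s tally():

tally(~obs_string, data = SimulationData, format = "proportion")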
For example, we can graph the distribution of the number of successes, aggregating over all possible values of \(\mu\) (as weighted by the chosen prior). In other words, we can estimate the marginal PMF of this random variable.
gf_props(~n_successes, data = SimulationData)
This makes the prior a bit more concrete: it’s evidently saying that we think any number of successes from 0 to 3 is equally likely before we see any data.
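We can corroborate this with a quick numerical check (a sketch, not part of the lab code): for each possible count \(k\), the marginal probability is \(\int_0^1 \binom{3}{k} \mu^k (1-\mu)^{3-k} \, d\mu\), which works out to 1/4 for every \(k\) from 0 to 3.

sapply(0:3, function(k) {
  # marginal probability of k successes under a Uniform(0,1) prior, by numerical integration
  integrate(function(mu) dbinom(k, size = 3, prob = mu), lower = 0, upper = 1)$value
})
## each value is approximately 0.25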
We can repeat this process with a Beta(3,3) prior and see how the results differ.
SimulationData2 <- tibble(
  mu = rbeta(n = 100000, shape1 = 3, shape2 = 3)
) %>%
  mutate(
    obs_vector = mu %>%
      map(
        rbinom,    # function to sample the number of 'successes' among independent Bernoullis
        n = 3,     # how many runs to sample
        size = 1   # how many independent Bernoullis per run
        # prob = . # filled in via the values of mu coming in through the pipe
      ),
    obs_string  = obs_vector %>% map_chr(str_flatten),  ## collect results into a string
    n_successes = obs_vector %>% map_int(sum)           ## count 1s in each result
  )

gf_props(~n_successes, data = SimulationData2)
We can see that this prior corresponds to an expectation that we will see 1 or 2 successes about twice as often as we’ll see 0 or 3.
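To make “about twice as often” precise, the exact prior predictive probabilities have a closed form via the Beta function, \(p(k) = \binom{3}{k} B(k+3, 3-k+3) / B(3,3)\). A quick sketch of the calculation (not part of the lab code):

k <- 0:3
p_k <- choose(3, k) * beta(k + 3, 3 - k + 3) / beta(3, 3)
p_k   # 5/28, 9/28, 9/28, 5/28: so 1s and 2s are exactly 1.8 times as likely as 0s and 3s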
For comparison, what do the predictions look like for some specific values of \(\mu\)?
Conditioned on \(\mu\), the number of successes has a Binomial distribution. We can graph its PMF directly. Here’s the PMF of the number of successes in 3 trials with \(\mu = 0.5\):
ConditionalData_mu0.5 <- tibble(
  n_successes = 0:3,
  pmf = dbinom(n_successes, size = 3, prob = 0.5)
)

gf_col(pmf ~ n_successes, data = ConditionalData_mu0.5)
Notice that qualitatively it has a similar shape to the previous graph, which aggregated over all possible values of \(\mu\) weighted by the prior; but whereas there 1s and 2s were about twice as likely as 0s and 3s, here they’re exactly three times as likely (3/8 versus 1/8).
There is no mathematical difference between the role the prior plays in predicting data and the role the posterior plays in predicting data: both simply define a way to weight possible worlds, each of which has a specific description of how the data-generating process works.
After seeing three observations, two of which were 1s, suppose we want to estimate how likely it is that the next observation will be a 1. We could actually find this analytically in this case, but in more complex models the integrals involved will not be tractable, and so we will need to rely on computational approximations such as simulation.
PosteriorPredictive <- tibble(
  # posterior after 2 successes and 1 failure, starting from a uniform Beta(1,1) prior: Beta(3, 2)
  mu = rbeta(n = 100000, shape1 = 3, shape2 = 2)
) %>%
  mutate(
    obs_vector = mu %>%
      map(
        rbinom,    # function to sample the number of 'successes' among independent Bernoullis
        n = 3,     # how many runs to sample
        size = 1   # how many independent Bernoullis per run
        # prob = . # filled in via the values of mu coming in through the pipe
      ),
    obs_string  = obs_vector %>% map_chr(str_flatten),  ## collect results into a string
    n_successes = obs_vector %>% map_int(sum)           ## count 1s in each result
  )

gf_props(~n_successes, data = PosteriorPredictive)
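As a side check (not part of the original lab code), we could also estimate the probability that the next observation is a 1 directly, by simulating a single additional observation for each posterior draw of \(\mu\); the data frame name NextObsSim below is just for illustration:

NextObsSim <- tibble(
  mu = rbeta(n = 100000, shape1 = 3, shape2 = 2)   # same Beta(3,2) posterior as above
) %>%
  mutate(
    x_next = rbinom(n = n(), size = 1, prob = mu)  # one new observation per posterior draw
  )

mean(~x_next, data = NextObsSim)  # estimates the probability that the next observation is a 1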
In many cases we can represent our “bottom line” quantities as expected values of some function of the unknown parameters.
For instance, if what we’re interested in is \(p(x_{next} = 1 | x_{old})\) as above, we can represent this via marginalization as
\[ p(x_{next} = 1 | x_{old}) = \int_{0}^1 p(x_{next} = 1, \mu | x_{old}) \, d\mu \]
and rewrite the integrand using the product rule as
\[ p(x_{next} = 1, \mu | x_{old}) = p(x_{next} = 1 | \mu, x_{old}) \, p(\mu | x_{old}) \] yielding
\[ p(x_{next} = 1 | x_{old}) = \int_{0}^1 p(x_{next} = 1 | \mu, x_{old}) p(\mu | x_{old}) d\mu \]
Notice that this has the form of an Expected Value of a function of \(\mu\), where the function is the predictive probability \(p(x_{next} = 1 | \mu, x_{old})\) which depends on \(\mu\) (in fact, in this case it simply is \(\mu\)), and where we are taking the expected value with respect to the posterior distribution of \(\mu\).
In other words, the quantity of interest is
\[ E[f(\mu) | x_{old}], \] that is, simply the posterior mean of \(f(\mu)\), where \(f(\mu) = p(x_{next} = 1 | \mu, x_{old})\).
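In our conjugate example this expectation has a simple closed form (using the Beta(3, 2) posterior that follows from a uniform prior plus 2 successes and 1 failure):

\[ E[f(\mu) | x_{old}] = E[\mu | x_{old}] = \int_0^1 \mu \, p(\mu | x_{old}) \, d\mu = \frac{3}{3 + 2} = 0.6 \]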
If we calculate this quantity for each value of \(\mu\) sampled from its posterior distribution in our simulation, we can estimate the expected value by simply finding the average in the simulation.
PosteriorSimulation <- tibble(
  # posterior after 2 successes and 1 failure, starting from a uniform Beta(1,1) prior: Beta(3, 2)
  mu = rbeta(n = 100000, shape1 = 3, shape2 = 2)
) %>%
  mutate(
    f_mu = dbinom(x = 1, size = 1, prob = mu)  ## calculate p(x = 1 | mu), which here is simply mu
  )

estimated_px <- mean(~f_mu, data = PosteriorSimulation)
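We can then compare the simulation-based estimate against the exact value computed above (a quick sanity check):

estimated_px   # Monte Carlo estimate of p(x_next = 1 | x_old)
3 / (3 + 2)    # exact posterior mean of Beta(3, 2), i.e. 0.6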
Knit this document to an .html file (the default format). Then, save both this .Rmd file and the knitted .html file to the ~/stat113/turnin/hw4/ folder.

mosaic package version: 1.8.3
tidyverse package version: 1.3.1

## R version 4.1.2 (2021-11-01)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.4 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] mosaic_1.8.3 ggridges_0.5.3 mosaicData_0.20.2 ggformula_0.10.1
## [5] ggstance_0.3.5 Matrix_1.3-4 lattice_0.20-44 forcats_0.5.1
## [9] stringr_1.4.0 dplyr_1.0.7 purrr_0.3.4 readr_2.0.1
## [13] tidyr_1.1.3 tibble_3.1.6 ggplot2_3.3.5 tidyverse_1.3.1
##
## loaded via a namespace (and not attached):
## [1] ggdendro_0.1.22 httr_1.4.2 sass_0.4.0 jsonlite_1.7.2
## [5] splines_4.1.2 modelr_0.1.8 bslib_0.3.0 assertthat_0.2.1
## [9] highr_0.9 cellranger_1.1.0 ggrepel_0.9.1 yaml_2.2.1
## [13] pillar_1.6.4 backports_1.2.1 glue_1.5.1 digest_0.6.29
## [17] polyclip_1.10-0 rvest_1.0.1 colorspace_2.0-2 htmltools_0.5.2
## [21] plyr_1.8.6 pkgconfig_2.0.3 broom_0.7.9 labelled_2.8.0
## [25] haven_2.4.3 scales_1.1.1 tweenr_1.0.2 tzdb_0.1.2
## [29] ggforce_0.3.3 generics_0.1.0 farver_2.1.0 ellipsis_0.3.2
## [33] withr_2.4.3 cli_3.1.0 magrittr_2.0.1 crayon_1.4.2
## [37] readxl_1.3.1 evaluate_0.14 fs_1.5.0 fansi_0.5.0
## [41] MASS_7.3-54 xml2_1.3.2 tools_4.1.2 hms_1.1.0
## [45] lifecycle_1.0.1 munsell_0.5.0 reprex_2.0.1 compiler_4.1.2
## [49] jquerylib_0.1.4 rlang_0.4.12 grid_4.1.2 rstudioapi_0.13
## [53] htmlwidgets_1.5.4 crosstalk_1.1.1 labeling_0.4.2 mosaicCore_0.9.0
## [57] rmarkdown_2.10 gtable_0.3.0 DBI_1.1.1 R6_2.5.1
## [61] gridExtra_2.3 lubridate_1.7.10 knitr_1.34 fastmap_1.1.0
## [65] utf8_1.2.2 stringi_1.7.4 Rcpp_1.0.8 vctrs_0.3.8
## [69] leaflet_2.0.4.1 dbplyr_2.1.1 tidyselect_1.1.1 xfun_0.25