If you are in some field that has data (which is a lot of fields these days), you will have undoubtly encountered the term Bayesian statistics at some point. When I first encountered it, I did what most people probably do. I googled “What is Bayesian statistics?”. After reading through some resources and getting through the idiosynatric terms/concepts (e.g. conjugate priors, posteriors, Markov Chain Monte Carlo), I still went away not really understanding what was so important about Bayesian statistics.

It wasn’t until I took the course “Introduction to Bayesian statistics using BUGS” (offered by the MRC Biostatisics Unit from the University of Cambridge) did an “Ah ha” moment hit me. Suddenly, it became clear to me what the hype was all about. Once I got the “crux” of it, I felt like this set me up for the rest of Bayesian thinking. As such, I hope to be able to share this same “Ah ha” moment I had in this post.

I’m all ‘bout that uncertainty, ‘bout that uncertainty, no certainity

That was my lame and sad attempt at trying to come up with some catching section name (inspired by Meghan Trainor’s “All About That Bass”) to describe the crux of Bayesian statistics.

meghan-trainor-bayes-stats

But joking aside, the one concept that is fundamental to Bayesian statistics is that it’s all about representing uncertainty about an unknown quantity. To illustrate this, imagine I had a coin and I asked you what is the probability that the coin gives a head if I flip it? You most likely guessed 0.5, which is a reasonable guess given your prior knowledge on how coins work. But what if I told you that I got this coin from a magic shop? You’ll probably have some doubts that it is 0.5 now. It could be anything now. Maybe it’s a trick coin that always gives you head (i.e. 1 probability) ? Or maybe it always gives you a tail (i.e. 0 probability)? Or maybe it is biased towards towards some value (e.g. 0.75 probability). The point is there is some uncertainty in your estimation of this unknown quantity.

So let’s say I wanted to flip this coin n times. We can represent the number of expected heads ($r_{n}$) as follows:

$$r_{n} \sim Binomial(\theta, n)$$

Here we are just saying that the ($r_{n}$) is distributed as a binomial distribution, which is parameterized by $\theta$ (probability of getting a head) and $n$ (number of flips/trials).

Now let’s consider the following question:

$$P(\theta_{1} < \theta < \theta_{2} | n, r_{n})$$

Verbosely put, what’s the probability that this coin gives a head ($\theta$) is between $\theta_{1}$ and $\theta_{2}$ given you have $n$ flips and $r_{n}$ heads. In classical/frequentist statistics, this question actually makes no sense. This is because in classical statistics, parameters (unknown quantities) are fixed and have no uncertainty in their value; They are either that value or they are not. But in a Bayesian world, we are never completely certain about any estimations. As such, all estimations of an unknown quantity (e.g. the probability that a coin gives a head) have some degree of uncertain.

How do we represent these uncertainities?

We have just learned that it’s all about the uncertainty in our estimations. So how do we actually represent these uncertainities? They are expressed as a probability distributions. For instance, imagine you had the following probability distribution:

library("magrittr")
library("ggplot2")
library("glue")

data.frame(x = c(0, 1)) %>%
  ggplot(aes(x)) +
  stat_function(fun = dbeta, n = 101, args = list(shape1 = 5, shape2 = 5)) +
  ylim(c(0, 2.8)) +
  xlab(expression("Probability of a head ("*theta*")")) +
  ylab("Density") +
  ggtitle(
    expression("Uncertainity of a coin's probability of a head ("*theta*")")
  ) +
  # Credible interval arrow
  geom_segment(
    aes(x = 0.26, xend = 0.74, y = 0.75, yend = 0.75),
    arrow = arrow(length = unit(0.2, "cm"))
  ) +
  geom_segment(
    aes(x = 0.74, xend = 0.26, y = 0.75, yend = 0.75),
    arrow = arrow(length = unit(0.2, "cm"))
  ) +
  geom_label(aes(x = 0.5, y = 0.9, label =  "90% credible interval")) +
  # 95% area < 0.75
  geom_segment(
    aes(x = 0.74, xend = 0.125, y = 0.05, yend = 0.05),
    arrow = arrow(length = unit(0.2, "cm"))
  ) +
  geom_label(aes(x = 0.5, y = 0.2, label =  "95% area < 0.75")) +
  # MAP value
  geom_segment(
    aes(x = 0.6, xend = 0.5, y = 2.6, yend = 2.5),
    arrow = arrow(length = unit(0.2, "cm"))
  ) +
  geom_label(aes(x = 0.73, y = 2.6, label = "Most likely value"))

plot of chunk uncertainty_as_prob_distr

The x-axis represents the plausible values that $\theta$ could take. The y-axis represents the “confidence” (this isn’t entirely accurate in mathematical terms, but will suffice for this example) that the probability of a head takes this value. By expressing our uncertainty as a probability distribution, we get these benefits:

The x value with the highest density peak represents the most likely value. In this case, that would be 0.5.
Credible intervals (CI) can be formed. For instance, [0.25 - 0.75] forms a 90% CI that tells us we are 90% confident that the parameter is in this interval. This is quite different from a confidence interval in classical statistics, which is actually quite a counter-inituitive statistic (see my “How do I Interpret a Confidence Interval?” post).
No “p-value” calculations. Just calculate the relevant tail areas. For instance, what is $P(\theta) < 0.75$? We just look at the area left of 0.75, which ends up being 0.95 of the total area. So we can say there is a 95% chance of $P(\theta) < 0.75$.
There is a technique called Bayesian inference that allows us to adapt the distribution in light of additional evidence (see my “How to Do Bayesian Inference 101” post). This ultimately means we can update our estimation of our quantity when we get more data while still accounting for our prior information on the quantity. post for more details on this)

Where do these probability distributions for representing uncertainty come from?

While you could theoreticaly make your own probability distributions, in practice people use established probability distributions (e.g. beta, normal). For instance, here are 6 different beta distributions:

library("tibble")
library("cowplot")
library("glue")

# Cowplot overrides the default ggplot2 theme. This sets it back to the default
# theme.
theme_set(theme_grey())

beta_params_df <-
  tribble(
    ~ beta_distr_num, ~ shape1, ~ shape2,
                   1,      0.5,      0.5, 
                   2,        1,        1, 
                   3,        5,        1, 
                   4,        5,        5, 
                   5,        5,       20, 
                   6,        50,     200
  )

#' Plots a beta distribution 
#'
#' @param in_data List of arguments for the beta density function (dbeta)
#' @return ggplot plot
plot_beta <- function(in_data) {
  # Create aliases for easier reference below
  cur_shape_1 <- in_data[["shape1"]]
  cur_shape_2 <- in_data[["shape2"]]

  data.frame(x = c(0, 1)) %>%
    ggplot(aes(x)) +
    stat_function(
      fun = dbeta, 
      n = 101, 
      args = list(shape1 = cur_shape_1, shape2 = cur_shape_2)
    ) +
    ylab("Density") +
    xlab("Probability of a head") +
    ggtitle(glue("Beta({cur_shape_1}, {cur_shape_2})"))
}

beta_plots <- 
  split(beta_params_df, seq_len(nrow(beta_params_df))) %>%
  lapply(plot_beta)

plot_grid(
  plotlist = beta_plots, 
  ncol = 3, nrow = 2, 
  labels = LETTERS[1:6],
  align = "v"
)

Different beta distributions

Each of these beta distributions represents a different prior knowledge. For instance, Beta(0.5, 0.5) represents a belief that the coin always gives heads or tails. Beta(1, 1) represents a global uncertainty in that the probability of a head could be any value (this is often referred to as an uniform prior). Beta(50, 200) represents a strong belief that the probability is 0.2 for getting a head.

Which type of distribution (e.g. beta, normal) depends on the type data you are modeling.

So what do I do with this uncertainty probability distribution?

If we go back to the original scenario with the coin and recall this equation representing the number of expected heads:

$$r_{n} \sim Binomial(\theta, n)$$

Rather than having our $\theta$ value as a single point estimate, we have it represented as a probability distribution representing all possible estimates with associated levels of uncertainity. This is effectively what Bayesian statisticans mean when they say setting a prior on a uncertain parameter. In this case, we will use a beta distribution as our prior.

Heads Up!

It's worth noting that in theory you can use any distribution. However in practice, certain prior distributions are used for specific models because it makes the math easier. These are called conjugate priors. I won't go into the details in this post, but is worth knowing this if you come across this term.

So what we will literally do here is substitute the beta distribution into the binomial distribution (this forms what is called a beta-binomial) and then we can generate some expected outcomes. As you might guess, the shape of your beta will influence your expected outcomes. For instance, here we test the different effects of the 6 beta distributions on the distribution of the expected number of heads if we were to flip a coin 20 times across 1000 simulations (i.e. Monte Carlo simulation):

set.seed(1)
library("dplyr")

plot_prior_prediction_distr <- function(in_data) {
  # Create aliases for easier reference below
  cur_shape_1 <- in_data[["shape1"]]
  cur_shape_2 <- in_data[["shape2"]]

  num_sims <- 1000 
  num_trials <- 20

  thetas <- rbeta(num_sims, shape1 = cur_shape_1, shape = cur_shape_2)
  coin_flip_df <- 
    tibble(
      trial_no = seq_len(num_sims),
      theta = thetas,
      heads = rbinom(num_sims, num_trials, thetas)
    ) %>%
    mutate(heads = factor(heads, levels = 0:num_trials))

  prior_p <- plot_beta(in_data)
  predictive_p <- 
    coin_flip_df %>%
    ggplot(aes(x = heads)) +
    geom_bar() +
    xlab("Number of heads") +
    ylab("Number of simulations") +
    # So that the x-scale includes 0 - num_trials so that we can compare across
    # plots
    scale_x_discrete(drop = FALSE)

  plot_grid(prior_p, predictive_p, nrow = 2)
}

beta_predictive_plots <- 
  split(beta_params_df, seq_len(nrow(beta_params_df))) %>%
  lapply(plot_prior_prediction_distr)

plot_grid(
  plotlist = beta_predictive_plots, 
  ncol = 3, nrow = 2, 
  labels = LETTERS[1:6],
  align = "v")

plot of chunk implications-of-a-prior

In each panel, the top plot is the beta prior distribution representing the uncertainty in your $\theta$ parameter for your binomial. Then the bottom plot is approximate beta-binomial distribution of heads you would get if you use the corresponding beta prior (its an approximation because we are running a Monte Carlo simulation to estimate the distribution). This bottom plot is formally called a prior predictive distribution. As you can see, the bottom plots end up following the shape of the beta prior, which is actually what you would expect.

Conclusions

You’ve made it to the end of the post! Hopefully this post helped illuminate the key concept of Bayesian statistics. Remember that it is all about representing uncertainty regarding an unknown quantity. This uncertainty is expression as a probability distribution and ultimately impacts your expected outcome. As probability distributions are used, this is what makes Bayesian statistics fundamentally probabilistic.

References

Introduction to Bayesian statistics using BUGS