Probabilities represent the chance of an event x occurring. In the classic interpretation, a probability is measured by the number of times event x occurs divided by the total number of trials; in other words, the relative frequency of the event. There are three types of probabilities:

  1. Joint Probabilities
  2. Marginal Probabilities
  3. Conditional Probabilities

In this post, we will discuss each of these probabilities. Here is an overview of what will be covered.

Table of Contents

  1. Joint Probabilities
  2. Marginal Probabilities
  3. Conditional Probability
  4. Defining a Joint Probability Equation
  5. What about Continuous Random Variables?
  6. Frequentist vs. Bayesian View

    Joint Probabilities

    The first type of probability we will discuss is the joint probability, which is the probability of two different events occurring at the same time. Let’s use the diamonds dataset, from ggplot2, as our example. The two variables we are interested in are diamond color and cut. First, we will measure the frequency of each diamond color-cut combination. We can represent these data using a “two-way table”:

    library("ggplot2")
    library("dplyr")
    library("reshape2")
    library("knitr")
    
    # Count the number of diamonds in each color-cut combination
    diamonds.color.cut.df <-
      diamonds %>%
      group_by(color, cut) %>%
      summarize(n = n())
    
    # Spread the counts into a color-by-cut two-way table
    diamonds.color.cut.df %>%
      dcast(color ~ cut, value.var = "n") %>%
      kable(align = "l", format = "html",
            table.attr = 'class="table table-striped table-hover"')
    
    | color | Fair | Good | Very Good | Premium | Ideal |
    |-------|------|------|-----------|---------|-------|
    | D     | 163  | 662  | 1513      | 1603    | 2834  |
    | E     | 224  | 933  | 2400      | 2337    | 3903  |
    | F     | 312  | 909  | 2164      | 2331    | 3826  |
    | G     | 314  | 871  | 2299      | 2924    | 4884  |
    | H     | 303  | 702  | 1824      | 2360    | 3115  |
    | I     | 175  | 522  | 1204      | 1428    | 2093  |
    | J     | 119  | 307  | 678       | 808     | 896   |

    Table 1: Color-Cut Two-Way Frequency Table.

    Joint probabilities can be calculated by dividing the number of times a specific color-cut combination occurs (its frequency) by the total number of diamonds:

    # Convert counts to proportions of the total, i.e. joint probabilities
    diamonds.color.cut.prop.df <- 
      diamonds.color.cut.df %>%
      ungroup() %>%
      mutate(prop = n / sum(n))
    
    diamonds.color.cut.prop.df %>%
      dcast(color ~ cut, value.var = "prop") %>%
      kable(align = "l", format = "html", 
            table.attr = 'class="table table-striped table-hover"')
    
    | color | Fair      | Good      | Very Good | Premium   | Ideal     |
    |-------|-----------|-----------|-----------|-----------|-----------|
    | D     | 0.0030219 | 0.0122729 | 0.0280497 | 0.0297182 | 0.0525399 |
    | E     | 0.0041528 | 0.0172970 | 0.0444939 | 0.0433259 | 0.0723582 |
    | F     | 0.0057842 | 0.0168521 | 0.0401187 | 0.0432147 | 0.0709307 |
    | G     | 0.0058213 | 0.0161476 | 0.0426214 | 0.0542084 | 0.0905451 |
    | H     | 0.0056174 | 0.0130145 | 0.0338154 | 0.0437523 | 0.0577494 |
    | I     | 0.0032443 | 0.0096774 | 0.0223211 | 0.0264739 | 0.0388024 |
    | J     | 0.0022062 | 0.0056915 | 0.0125695 | 0.0149796 | 0.0166110 |

    Table 2: Color-Cut Two-Way Probability Table.

    Based on Table 2, we can see that certain color-cut combinations are more probable than others.

    Heads Up!

    For brevity and simpler mathematical notation, for the rest of this post we will use the random variable X to represent color and Y to represent cut.

    For instance, a diamond with X = G and Y = ideal, $P(X = G, Y = ideal) = 0.09$, is much more probable than a diamond with X = D and Y = fair, $P(X = D, Y = fair) = 0.003$.
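
    As a quick check, we can pull those two joint probabilities straight out of the data frame we computed above (a minimal sketch, assuming diamonds.color.cut.prop.df is still in the session):

    diamonds.color.cut.prop.df %>%
      filter((color == "G" & cut == "Ideal") |
             (color == "D" & cut == "Fair"))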

    Marginal Probabilities

    The second type of probability is the marginal probability. The interesting thing about a marginal probability is that the term sounds complicated, but it’s actually the probability that we are most familiar with. Basically, any time you are interested in a single event irrespective of any other event (i.e. “marginalizing the other event”), it is a marginal probability. For instance, the probability of a coin flip landing heads is considered a marginal probability because we aren’t considering any other events. Typically, we just say probability and drop the “marginal” part, because this part only comes into play when we have to factor in a second event.
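
    As a quick aside, the frequency-based view of such a marginal probability is easy to simulate (a minimal sketch, unrelated to the diamonds data):

    set.seed(1)
    flips <- sample(c("heads", "tails"), size = 10000, replace = TRUE)
    mean(flips == "heads")  # approximates the marginal probability P(heads) = 0.5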

    To illustrate this, we will go back to our diamond color-cut combination two-way table (Table 2). If we are interested in, say, the marginal probability $P(X = D)$, then basically we are asking “what is the probability of getting a diamond that is color D irrespective of its cut?” It should be intuitive that we can calculate this by simply summing up the joint probabilities of the row for color D. Mathematically:

    $$P(X = D) = \sum_{y \in S_{Y}}P(X = D, Y = y)$$

    Where $S_{Y}$ represents all the possible values of the random variable Y. In other words, we are holding X constant ($X = D$) while iterating over all the possible Y values and summing up the joint probabilities.
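
    To make this concrete, here is a minimal sketch (assuming diamonds.color.cut.prop.df from above) that computes $P(X = D)$ by holding the color fixed and summing the joint probabilities over all cuts:

    diamonds.color.cut.prop.df %>%
      filter(color == "D") %>%            # hold X constant at D
      summarize(marginal = sum(prop))     # sum over all values of Y (cut)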

    We can calculate the marginal probability of all the different colors. We can also calculate the marginal probability of cut by using the same logic and summing up the joint probabilities of the columns. For instance, to calculate $P(Y = Fair)$,

    $$P(Y = Fair) = \sum_{x \in S_{X}}P(X = x, Y = Fair)$$

    Let’s add the marginal probabilities to the two-way table now:

    # Marginal probability of each color: sum joint probabilities across cuts
    color.marginal.df <- 
      diamonds.color.cut.prop.df %>%
      group_by(color) %>%
      summarize(marginal = sum(prop))
    
    # Marginal probability of each cut: sum joint probabilities across colors
    cut.marginal.df <- 
      diamonds.color.cut.prop.df %>%
      group_by(cut) %>%
      summarize(marginal = sum(prop))
    
    # Add the color marginals as a final column and the cut marginals as a final row
    diamonds.color.cut.prop.df %>%
      dcast(color ~ cut, value.var = "prop") %>%
      left_join(color.marginal.df, by = "color") %>%
      bind_rows(
        cut.marginal.df %>%
        mutate(color = "marginal") %>%
        dcast(color ~ cut, value.var = "marginal")
      ) %>%
      kable(align = "l", format = "html",
            table.attr = 'class="table table-striped table-hover"')
    
    | color    | Fair      | Good      | Very Good | Premium   | Ideal     | marginal  |
    |----------|-----------|-----------|-----------|-----------|-----------|-----------|
    | D        | 0.0030219 | 0.0122729 | 0.0280497 | 0.0297182 | 0.0525399 | 0.1256025 |
    | E        | 0.0041528 | 0.0172970 | 0.0444939 | 0.0433259 | 0.0723582 | 0.1816277 |
    | F        | 0.0057842 | 0.0168521 | 0.0401187 | 0.0432147 | 0.0709307 | 0.1769003 |
    | G        | 0.0058213 | 0.0161476 | 0.0426214 | 0.0542084 | 0.0905451 | 0.2093437 |
    | H        | 0.0056174 | 0.0130145 | 0.0338154 | 0.0437523 | 0.0577494 | 0.1539488 |
    | I        | 0.0032443 | 0.0096774 | 0.0223211 | 0.0264739 | 0.0388024 | 0.1005191 |
    | J        | 0.0022062 | 0.0056915 | 0.0125695 | 0.0149796 | 0.0166110 | 0.0520578 |
    | marginal | 0.0298480 | 0.0909529 | 0.2239896 | 0.2556730 | 0.3995365 | NA        |

    Table 3: Color-Cut Two-Way Probability Table with Marginal Probabilities. The marginal probability of each cut is represented in the last row whereas the marginal probability of each color is represented in the last column.

    Conditional Probability

    The final type of probability is the conditional probability. A conditional probability is the probability of an event X occurring given that a second event Y has occurred. Mathematically, it is represented as $P(X\ |\ Y)$. This is read as “probability of X given/conditioned on Y”.

    For example, if someone asked you the probability of getting a diamond of color G, $P(X = G)$, we could use Table 3 to find the marginal probability of this event. But what if you had an additional piece of information and knew that the diamond was also of ideal cut? This becomes a conditional probability since we are conditioning on an event that is already known to be true. A conditional probability can be calculated as follows:

    $$P(X\ |\ Y) = \frac{P(X, Y)}{P(Y)}$$

    Recall that a marginal probability is simply the sum of the joint probabilities over one variable while holding the other constant. So we can further break down this equation as follows:

    $$P(X\ |\ Y) = \frac{P(X, Y)}{\sum_{x \in S_{X}}P(X = x, Y)}$$

    So for us to work this out for our particular question, we need two pieces of information:

    1. $P(Y = ideal)$: Marginal probability of Y = ideal.
    2. $P(X = G, Y = ideal)$: Joint probability of X = G and Y = ideal.

    So we can calculate the conditional probability as follows:

    $$P(X = G\ |\ Y = ideal) = \frac{P(X = G, Y = ideal)}{\sum_{x \in S_{X}}P(X = x, Y = ideal)}$$

    In this case, the conditional probability works out to:

    # Joint probability P(X = G, Y = ideal)
    joint.prob <- 
      diamonds.color.cut.prop.df %>%
      filter(color == "G", cut == "Ideal") %>%
      .$prop
    
    # Marginal probability P(Y = ideal)
    marg.prob <- 
      cut.marginal.df %>%
      filter(cut == "Ideal") %>%
      .$marginal
    
    # Conditional probability P(X = G | Y = ideal)
    cond.prob <- joint.prob / marg.prob
    cond.prob
    
    ## [1] 0.2266252
    

    So if we didn’t factor in any other information, our $P(X = G)$ was 0.2093437. But once we factored in an additional piece of information, namely Y = ideal, our probability changed to 0.2266252. Put another way, there was a “reallocation of our belief” in the event once we factored in additional information.

    Defining a Joint Probability Equation

    In the conditional and marginal probability sections, we defined their mathematical equations. We can now define a mathematical equation for joint probabilities that uses both the conditional and marginal probability equations. Starting with the conditional probability equation, a bit of algebraic manipulation gives us an equation for joint probabilities:

    $$\begin{align} P(X\ |\ Y) &= \frac{P(X, Y)}{P(Y)} \\ P(X\ |\ Y)\ P(Y) &= P(X, Y) \\ P(X\ |\ Y)\ \sum_{x \in S_{X}}P(X = x, Y) &= P(X, Y) \\ \end{align}$$
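
    As a sanity check, we can verify this identity numerically with the quantities computed earlier (a minimal sketch, assuming joint.prob, marg.prob, and cond.prob are still in the session):

    # P(X = G | Y = ideal) * P(Y = ideal) should recover P(X = G, Y = ideal)
    cond.prob * marg.prob
    joint.prob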

    What about Continuous Random Variables?

    In this post’s example dataset of diamonds, we used the random variables X and Y to represent diamond colors and cuts respectively, both of which are discrete random variables. If we were dealing with continuous random variables, these probabilities still apply, except that summations are replaced by integrals. For instance, the mathematical representation of a marginal probability for continuous variables becomes an integral:

    $$P(X = D) = \int_{y \in S_{Y}}P(X = D, Y = y)\ dy$$
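
    For intuition, here is a minimal sketch of marginalizing by integration using a simple joint density (chosen purely for illustration, not derived from the diamonds data): two independent standard normal variables, where integrating the joint density over all y recovers the marginal density of X.

    # Joint density of two independent standard normal random variables
    joint.density <- function(x, y) dnorm(x) * dnorm(y)
    
    # Marginal density of X at x = 0.5: integrate the joint density over all y
    integrate(function(y) joint.density(0.5, y), lower = -Inf, upper = Inf)$value
    
    # Should match dnorm(0.5), since X and Y are independent
    dnorm(0.5)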

    Frequentist vs. Bayesian View

    One last thing worth mentioning is that in the introduction of this post I made a statement regarding the “classic interpretation” of probability. Specifically, this “classic interpretation” is referred to as the frequentist view of probability. In this view, probabilities are based purely on objective, random experiments, with the assumption that, given enough trials (the “long run”), the relative frequency of event x will equal the true probability of x. Notice how all of the probabilities we reported in this post were based purely on frequencies.

    If you’ve done any statistics or analytics, you’ll likely have come across the term “Bayesian statistics”. In brief, Bayesian statistics differs from the frequentist view in that it incorporates subjective probability, which is the “degree of belief” in an event. This degree of belief is called the “prior probability distribution” and is incorporated along with the data from random experiments when determining probabilities. Bayesian statistics will be discussed in a separate post.

    R Session

    ## Session info --------------------------------------------------------------
    
    ##  setting  value                       
    ##  version  R version 3.2.2 (2015-08-14)
    ##  system   x86_64, darwin13.4.0        
    ##  ui       unknown                     
    ##  language (EN)                        
    ##  collate  en_CA.UTF-8                 
    ##  tz       America/Vancouver           
    ##  date     2016-03-22
    
    ## Packages ------------------------------------------------------------------
    
    ##  package    * version    date       source        
    ##  argparse   * 1.0.1      2014-04-05 CRAN (R 3.2.2)
    ##  assertthat   0.1        2013-12-06 CRAN (R 3.2.2)
    ##  captioner  * 2.2.3.9000 2015-09-16 local         
    ##  colorspace   1.2-6      2015-03-11 CRAN (R 3.2.2)
    ##  DBI          0.3.1      2014-09-24 CRAN (R 3.2.2)
    ##  devtools     1.9.1      2015-09-11 CRAN (R 3.2.2)
    ##  digest       0.6.9      2016-01-08 CRAN (R 3.2.2)
    ##  dplyr      * 0.4.3      2015-09-01 CRAN (R 3.2.2)
    ##  evaluate     0.8        2015-09-18 CRAN (R 3.2.2)
    ##  findpython   1.0.1      2014-04-03 CRAN (R 3.2.2)
    ##  formatR      1.2.1      2015-09-18 CRAN (R 3.2.2)
    ##  getopt       1.20.0     2013-08-30 CRAN (R 3.2.2)
    ##  ggplot2    * 2.0.0      2015-12-18 CRAN (R 3.2.2)
    ##  gtable       0.1.2      2012-12-05 CRAN (R 3.2.2)
    ##  highr        0.5.1      2015-09-18 CRAN (R 3.2.2)
    ##  knitr      * 1.12.7     2016-02-09 local         
    ##  lazyeval     0.1.10     2015-01-02 CRAN (R 3.2.2)
    ##  magrittr     1.5        2014-11-22 CRAN (R 3.2.2)
    ##  memoise      0.2.1      2014-04-22 CRAN (R 3.2.2)
    ##  munsell      0.4.3      2016-02-13 CRAN (R 3.2.2)
    ##  plyr         1.8.3      2015-06-12 CRAN (R 3.2.2)
    ##  proto      * 0.3-10     2012-12-22 CRAN (R 3.2.2)
    ##  R6           2.1.2      2016-01-26 CRAN (R 3.2.2)
    ##  Rcpp         0.12.3     2016-01-10 CRAN (R 3.2.2)
    ##  reshape2   * 1.4.1      2014-12-06 CRAN (R 3.2.2)
    ##  rjson        0.2.15     2014-11-03 CRAN (R 3.2.2)
    ##  scales       0.3.0      2015-08-25 CRAN (R 3.2.2)
    ##  stringi      1.0-1      2015-10-22 CRAN (R 3.2.2)
    ##  stringr      1.0.0      2015-04-30 CRAN (R 3.2.2)