In my previous post “Using Mixture Models for Clustering in R”, I covered the concept of mixture models and how one could use a gaussian mixture model (GMM), one type of mixture model, for clustering. If you are like me, not knowing what is happening “under the hood” may bug you. What is actually happening when I run the normalmixEM function from mixtools? How does it know where to “put” the components? In this post, I will cover how you can implement your own GMM in R.

We will start this post by discussing a bit about parameter estimation. We will then introduce the expectation-maximization (EM) algorithm and how it can be used for parameter estimation, and finally walk through how to implement the EM algorithm.

    Parameter Estimation in the “Complete Data” Scenario

    Let’s start with the scenario where you are given the following 1-dimensional data with the colors representing the “source” they are from:

    library("ggplot2")
    library("dplyr")
    library("reshape2")
    
    options(scipen = 999)
    set.seed(1)
    
    comp1.vals <- data_frame(comp = "A", 
                             vals = rnorm(50, mean = 1, sd = 0.5))
    comp2.vals <- data_frame(comp = "B", 
                             vals = rnorm(50, mean = 1.5, sd = 0.5))
    
    vals.df <- bind_rows(comp1.vals, comp2.vals)
    
    vals.df %>%
      ggplot(aes(x = vals, y = "A", color = factor(comp))) +
      geom_point(alpha = 0.4) +
      scale_color_discrete(name = "Source of Data") +
      xlab("Values") +
      theme(axis.ticks.y = element_blank(),
            axis.text.y = element_blank(),
            axis.title.y = element_blank(),
            legend.position = "top")
    

    [Plot: the simulated values colored by their source component]

    Let’s say you believe that each data source is actually a gaussian distribution. Then with these values and knowledge of their data source, how could you estimate the parameters of the gaussian distributions ($\mathcal{N}(X|\mu,\sigma^{2})$)? In other words, what parameters most likely gave rise to the data we are observing?

    We can use maximum likelihood estimation (MLE) to estimate the mean and variance:

    • $\mu_{k} = \frac{\sum_{i}^{N_{k}}x_{i,k}}{N_{k}}$
    • $\sigma_{k}^{2} = \frac{\sum_{i}^{N_{k}}(x_{i,k} - \mu_{k})^2}{N_{k}}$

    Let’s give this a try and see what values we get:

    vals.df %>%
      group_by(comp) %>%
      summarize(mean_vals = mean(vals),
                sd_vals = sd(vals))
    
    ## # A tibble: 2 × 3
    ##    comp mean_vals  sd_vals
    ##   <chr>     <dbl>    <dbl>
    ## 1     A  1.050224 0.415697
    ## 2     B  1.558663 0.484414
    

    These estimates are very close to the true parameters of the gaussians, which were a mean and standard deviation of (1, 0.5) and (1.5, 0.5) respectively (NOTE: the more data we sample, the closer the estimates get to the true parameters).
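
    As a side note, the sd() function used above divides by N − 1 (the unbiased estimator), whereas the MLE formulas above divide by N; with 50 points per component the difference is negligible. Here is a minimal sketch of computing the MLE estimates directly from the formulas, using the same vals.df:

    vals.df %>%
      group_by(comp) %>%
      summarize(mu_mle = sum(vals) / n(),               # MLE mean
                var_mle = sum((vals - mu_mle)^2) / n(), # MLE variance (divides by N)
                sd_mle = sqrt(var_mle))
    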

    Parameter Estimation in the “Incomplete Data” Scenario

    Now let us consider the scenario where you have the same data, but you no longer know the source of each data point.

    vals.df %>%
      ggplot(aes(x = vals, y = 0)) +
      geom_point(alpha = 0.4) +
      xlab("Values") +
      theme(axis.ticks.y = element_blank(),
            axis.text.y = element_blank(),
            axis.title.y = element_blank())
    

    [Plot: the same values shown without their source labels]

    Without the labels on the data (often called latent variables) indicating their source, we cannot apply MLE directly. But if we somehow find a way to “complete” the data (i.e. find the labels for each data point), then we can go back to using MLE. One way around this could be the following (a small code sketch of the idea appears after the list):

    1. Set some initial parameter estimates on your gaussians.
    2. Assign (label) the data to one of the gaussians based on which one most likely generated the data.
    3. Treat the labels as being correct and then use MLE to re-estimate the parameters for the different gaussians.
    4. Repeat steps 2 and 3 until there is convergence.
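
    Here is a minimal sketch of this hard-assignment loop, assuming two components and the unlabelled values from vals.df above (for simplicity, step 4 is replaced by a fixed number of iterations):

    x <- vals.df$vals
    
    # Step 1: initial parameter guesses for the two gaussians
    mu <- c(min(x), max(x))
    sigma <- c(sd(x), sd(x))
    
    for (iter in 1:20) {
      # Step 2: hard-assign each point to the gaussian most likely to have generated it
      labels <- ifelse(dnorm(x, mu[1], sigma[1]) >= dnorm(x, mu[2], sigma[2]), 1, 2)
    
      # Step 3: treat the labels as correct and re-estimate each gaussian by MLE
      mu <- tapply(x, labels, mean)
      sigma <- tapply(x, labels, sd)
    }
    
    mu
    sigma
    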

    This iterative approach is the basis for something called the expectation-maximization (EM) algorithm. The main difference is that rather than assigning each data point to one of the possible gaussians, EM considers the possibility of the data belonging to all of the possible gaussians. This is analogous to the concept of “hard” and “soft” labeling that I mentioned in my previous post “Using Mixture Models for Clustering in R”. Working in a probabilistic framework allows us to assign probabilities (responsibilities) of data belonging to a gaussian so that we don’t have to “commit” the data to any one gaussian.

    The best way to remember what EM is used for is as follows:

    The expectation maximization algorithm is a natural generalization of maximum likelihood estimation to the incomplete data case. – Chuong B Do & Serafim Batzoglou. What is the expectation maximization algorithm? Nature Biotechnology. 2008.

    In this post, we will use the EM algorithm to fit our GMM.

    Fitting a GMM using Expectation Maximization

    The EM algorithm consists of 3 major steps:

    1. Initialization
    2. Expectation (E-step)
    3. Maximization (M-step)

    Steps 2 and 3 are repeated until convergence. We will cover each of these steps and how convergence is reached below. But first we must understand how to mathematically represent a GMM:

    $$P(X|\mu,\sigma,\alpha) = \sum_{k=1}^{K}\alpha_k\mathcal{N}(X|\mu_{k},\sigma_{k}^{2})$$
    • X = Dataset of n elements ($x_{1}, …, x_{n}$).
    • $\alpha_{k}$ = Mixing weight of the kth component. $\sum_{k=1}^{K}\alpha_{k} = 1$.
    • $\mathcal{N}(X|\mu_{k},\sigma_{k}^{2})$ = Gaussian probability density function (pdf) of the kth component, defined by the parameters $\mu_{k}$ and $\sigma_{k}^{2}$.
    • $\mu_{k}$ = Mean of the kth component.
    • $\sigma_{k}^{2}$ = Variance of the kth component.

    So for a two component GMM, we would mathematically represent this as:

    $$P(X|\mu,\sigma,\alpha) = \alpha_1\mathcal{N}(X|\mu_{1},\sigma_{1}^{2}) + \alpha_2\mathcal{N}(X|\mu_{2},\sigma_{2}^{2})$$

    In case you were wondering what the $P(X|\mu,\sigma,\alpha)$ means, don’t worry about that for now. We will explain exactly what this means later on in the post.
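
    As a quick illustration, here is a minimal sketch (with made-up parameter values, not fitted to anything) of evaluating a two-component mixture density in R:

    # Density of a two-component GMM:
    # alpha1 * N(x | mu1, sd1^2) + alpha2 * N(x | mu2, sd2^2)
    gmm_density <- function(x, mu, sd, alpha) {
      alpha[1] * dnorm(x, mu[1], sd[1]) + alpha[2] * dnorm(x, mu[2], sd[2])
    }
    
    # Evaluate the mixture density at a few example points
    gmm_density(c(0, 1, 2), mu = c(1, 1.5), sd = c(0.5, 0.5), alpha = c(0.5, 0.5))
    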

    Initialization: Determining the Initial GMM Parameters

    When it comes to initialization of a GMM, we are asking the fundamental question: what model parameters do we assign first? This can be done in different ways, but for GMMs it’s very common to first run k-means on your data to get some hard labels on the data. With these hard labels, we can use MLE to estimate the component parameters for our initialization (remember, MLE works in the “Complete Data” scenario):

    • $\mu_{k} = \frac{\sum_{i}^{N_{k}}x_{i,k}}{N_{k}}$
    • $\sigma_{k}^{2} = \frac{\sum_{i}^{N_{k}}(x_{i,k} - \mu_{k})^2}{N_{k}}$
    • $\alpha_{k} = \frac{N_{k}}{N}$

    Where $N_{k}$ indicates the number of data points in the kth component. Let’s try that here, using the waiting times between eruptions from the built-in faithful dataset:

    wait <- faithful$waiting
    
    wait.kmeans <- kmeans(wait, 2)
    wait.kmeans.cluster <- wait.kmeans$cluster
    
    wait.df <- data_frame(x = wait, cluster = wait.kmeans.cluster)
    
    wait.df %>%
      mutate(num = row_number()) %>%
      ggplot(aes(y = num, x = x, color = factor(cluster))) +
      geom_point() +
      ylab("Values") +
      ylab("Data Point Number") +
      scale_color_discrete(name = "Cluster") +
      ggtitle("K-means Clustering")
    

    [Plot: k-means clustering of the faithful waiting times into 2 clusters]

    Since we specified 2 clusters, k-means nicely splits the data into 2 clusters with the following means, variances, and standard deviations:

    wait.summary.df <- wait.df %>%
      group_by(cluster) %>%
      summarize(mu = mean(x), variance = var(x), std = sd(x), size = n())
    
    wait.summary.df %>%
      select(cluster, mu, variance, std)
    
    ## # A tibble: 2 × 4
    ##   cluster       mu variance      std
    ##     <int>    <dbl>    <dbl>    <dbl>
    ## 1       1 54.75000 34.75505 5.895341
    ## 2       2 80.28488 31.66690 5.627335
    

    We can also generate the initial mixing weights as follows:

    wait.summary.df <- wait.summary.df %>%
      mutate(alpha = size / sum(size))
    
    wait.summary.df %>%
      select(cluster, size, alpha)
    
    ## # A tibble: 2 × 3
    ##   cluster  size     alpha
    ##     <int> <int>     <dbl>
    ## 1       1   100 0.3676471
    ## 2       2   172 0.6323529
    

    Expectation: Calculating the “Soft Labels” of Each Data Point (E-step)

    Now that we have the initial parameters of our GMM, we have to determine the probability (soft label; responsibility) that a data point ($x_{i}$) belongs to component ($k_{j}$). This is the expectation step (E-step) of the EM algorithm, where we calculate the expected values of the soft labels for each data point.

    Mathematically, the question can be posed like this $P(x_{i} \in k_{j} | x_{i})$. How do we actually solve this equation? To help us, we can apply Bayes’ rule here:

    $$P(x_{i} \in k_{j} | x_{i}) = \frac{P(x_{i} | x_{i} \in k_{j})P(k_{j})}{P(x_{i})}$$

    The parts of this equation are related to the GMM equation above as follows:

    • $P(x_{i} | x_{i} \in k_{j}) = \mathcal{N}(x_{i}|\mu_{k_{j}},\sigma_{k_{j}}^{2})$
    • $P(k_{j}) = \alpha_{k_{j}}$
    • $P(x_{i}) = \sum_{k=1}^{K}\alpha_k\mathcal{N}(x_{i}|\mu_{k},\sigma_{k}^{2})$

    What we are interested in is $P(x_{i} \in k_{j} | x_{i})$, which is called the posterior probability. Knowing these equations, we can easily calculate this. For instance, what is the posterior probability of x = 66 belonging to the first component? We can first calculate the top part of the equation in R like this:

    comp1.prod <- 
      dnorm(66, wait.summary.df$mu[1], wait.summary.df$std[1]) *
      wait.summary.df$alpha[1]
    

    Here we are using the dnorm function from R to make use of the gaussian pdf.

    To calculate the bottom part of the equation, we actually need to calculate this value for both components and sum them up:

    comp2.prod <- 
      dnorm(66, wait.summary.df$mu[2], wait.summary.df$std[2]) *
      wait.summary.df$alpha[2]
    
    normalizer <- comp1.prod + comp2.prod
    

    Now that we have all the components of the equation, let’s plug in the values and solve:

    comp1.prod / normalizer
    
    ## [1] 0.6926023
    

    We can easily calculate this for every data point as follows:

    comp1.prod <- dnorm(x = wait, mean = wait.summary.df$mu[1], 
                        sd = wait.summary.df$std[1]) * wait.summary.df$alpha[1]
    
    comp2.prod <- dnorm(x = wait, mean = wait.summary.df$mu[2], 
                        sd = wait.summary.df$std[2]) * wait.summary.df$alpha[2]
    
    normalizer <- comp1.prod + comp2.prod
    
    comp1.post <- comp1.prod / normalizer
    comp2.post <- comp2.prod / normalizer
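
    # Sanity check (not in the original post): for every data point, the
    # responsibilities across the two components should sum to 1, since both
    # are normalized by the same denominator
    all.equal(comp1.post + comp2.post, rep(1, length(wait)))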
    

    Maximization: Re-estimate the Component Parameters (M-step)

    Now that we have posterior probabilities (i.e. soft labels), we can re-estimate our component parameters. We simply have to make a little adjustment to the MLE equations that we specified earlier. Specifically, since there are no hard labels, the hard counts $N_{k}$ are replaced with sums of the posterior probabilities $P(x_{i} \in k_{j} | x_{i})$ in each equation:

    • $\mu_{k} = \frac{\sum_{i}^{N}P(x_{i} \in k_{j} | x_{i})x_{i}}{\sum_{i}^{N}P(x_{i} \in k_{j} | x_{i})}$
    • $\sigma_{k}^{2} = \frac{\sum_{i}^{N}P(x_{i} \in k_{j} | x_{i})(x_{i} - \mu_{k})^2}{\sum_{i}^{N}P(x_{i} \in k_{j} | x_{i})}$
    • $\alpha_{k} = \frac{\sum_{i}^{N}P(x_{i} \in k_{j} | x_{i})}{N}$

    With these equations, we can now plug in our values and calculate the component parameters using our example from above:

    comp1.n <- sum(comp1.post)
    comp2.n <- sum(comp2.post)
    
    comp1.mu <- 1/comp1.n * sum(comp1.post * wait)
    comp2.mu <- 1/comp2.n * sum(comp2.post * wait)
    
    comp1.var <- sum(comp1.post * (wait - comp1.mu)^2) * 1/comp1.n
    comp2.var <- sum(comp2.post * (wait - comp2.mu)^2) * 1/comp2.n
    
    comp1.alpha <- comp1.n / length(wait)
    comp2.alpha <- comp2.n / length(wait)
    
    comp.params.df <- data.frame(comp = c("comp1", "comp2"),
                                 comp.mu = c(comp1.mu, comp2.mu),
                                 comp.var = c(comp1.var, comp2.var),
                                 comp.alpha = c(comp1.alpha, comp2.alpha),
                                 comp.cal = c("self", "self"))
    

    Checking for Convergence

    As mentioned above, we repeat the expectation and maximization steps until we reach “convergence”. But what exactly is convergence? Convergence means that the change between iterations is small enough for us to consider it negligible and stop running EM. So the question becomes: what is the change we are measuring?

    Well, since we are trying to fit a GMM to our data, intuitively we should have something that measures the fit of our GMM! This is actually the final piece of the puzzle. Formally speaking, what we are looking for is called a cost function (a.k.a. objective function, loss function).

    As it turns out, we’ve already seen the cost function that we need (See “Fitting a GMM using Expectation Maximization”):

    $$ P(X|\mu,\sigma,\alpha) = \sum_{k=1}^{K}\alpha_k\mathcal{N}(X|\mu_{k},\sigma_{k}^{2}) $$

    This is called the likelihood (see this post for a good explanation of what a likelihood is) and is essentially the fit of your model. In layman’s terms, what we are really asking is: given these model parameters ($\mu,\sigma,\alpha$), what is the probability that our data X was generated by them? A slight modification of this is the log likelihood, which equates to:

    $$ \ln P(X|\mu,\sigma,\alpha) = \sum_{n=1}^{N}\ln \sum_{k=1}^{K}\alpha_k\mathcal{N}(x_{n}|\mu_{k},\sigma_{k}^{2}) $$

    The reason we do this is that if we simply calculated the likelihood, we would end up dealing with very small values (the product of many probabilities), which can be numerically problematic. So we take the natural logarithm of the likelihood to circumvent this.
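
    A minimal sketch of why this matters, reusing the comp1.prod and comp2.prod vectors computed above for all 272 waiting times:

    # The likelihood is the product of the per-point mixture densities; multiplying
    # 272 small numbers underflows toward zero
    prod(comp1.prod + comp2.prod)
    
    # The log likelihood sums the logs of the same densities and stays in a
    # comfortable numeric range
    sum(log(comp1.prod + comp2.prod))
    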

    The larger the log likelihood, the better the model parameters fit the data.

    For instance, the log likelihood of our first EM step:

    # comp1.prod and comp2.prod (the weighted component densities) were already
    # calculated for each data point above
    sum.of.comps <- comp1.prod + comp2.prod
    sum.of.comps.ln <- log(sum.of.comps, base = exp(1))
    sum(sum.of.comps.ln)
    
    ## [1] -1034.246
    

    So to test for convergence, we can calculate the log likelihood at the end of each EM step (i.e. the model fit with these parameters) and then test whether it has changed “significantly” (by a threshold defined by the user) since the last EM step. If it has, then we run another step of EM. If not, then we consider EM to have converged and these are our final parameters.

    Putting it All Together

    Now that we have all these pieces of information, let’s put it all together:

    #' Expectation Step of the EM Algorithm
    #'
    #' Calculate the posterior probabilities (soft labels) that each component
    #' has to each data point.
    #'
    #' @param x Input data
    #' @param mu.vector Vector containing the means of each component
    #' @param sd.vector Vector containing the standard deviations of each component
    #' @param alpha.vector Vector containing the mixing weights of each component
    #' @return Named list containing the loglik and posterior.df
    e_step <- function(x, mu.vector, sd.vector, alpha.vector) {
      comp1.prod <- dnorm(x, mu.vector[1], sd.vector[1]) * alpha.vector[1]
      comp2.prod <- dnorm(x, mu.vector[2], sd.vector[2]) * alpha.vector[2]
      sum.of.comps <- comp1.prod + comp2.prod
      comp1.post <- comp1.prod / sum.of.comps
      comp2.post <- comp2.prod / sum.of.comps
    
      sum.of.comps.ln <- log(sum.of.comps, base = exp(1))
      sum.of.comps.ln.sum <- sum(sum.of.comps.ln)
    
      list("loglik" = sum.of.comps.ln.sum,
           "posterior.df" = cbind(comp1.post, comp2.post))
    }
    
    #' Maximization Step of the EM Algorithm
    #'
    #' Update the Component Parameters
    #'
    #' @param x Input data.
    #' @param posterior.df Posterior probability data.frame.
    #' @return Named list containing the mean (mu), variance (var), and mixing
    #'   weights (alpha) for each component.
    m_step <- function(x, posterior.df) {
      comp1.n <- sum(posterior.df[, 1])
      comp2.n <- sum(posterior.df[, 2])
    
      comp1.mu <- 1/comp1.n * sum(posterior.df[, 1] * x)
      comp2.mu <- 1/comp2.n * sum(posterior.df[, 2] * x)
    
      comp1.var <- sum(posterior.df[, 1] * (x - comp1.mu)^2) * 1/comp1.n
      comp2.var <- sum(posterior.df[, 2] * (x - comp2.mu)^2) * 1/comp2.n
    
      comp1.alpha <- comp1.n / length(x)
      comp2.alpha <- comp2.n / length(x)
    
      list("mu" = c(comp1.mu, comp2.mu),
           "var" = c(comp1.var, comp2.var),
           "alpha" = c(comp1.alpha, comp2.alpha))
    }
    

    Now we just need to write a loop to alternate between the two functions for each EM step. Each iteration will consist of first calling the e_step function and then calling the m_step function (if needed). We will run this for 50 iterations or until the log likelihood difference between two iterations is less than 1e-6 (whichever comes first):

    for (i in 1:50) {
      if (i == 1) {
        # Initialization
        e.step <- e_step(wait, wait.summary.df[["mu"]], wait.summary.df[["std"]],
                         wait.summary.df[["alpha"]])
        m.step <- m_step(wait, e.step[["posterior.df"]])
        cur.loglik <- e.step[["loglik"]]
        loglik.vector <- e.step[["loglik"]]
      } else {
        # Repeat E and M steps till convergence
        e.step <- e_step(wait, m.step[["mu"]], sqrt(m.step[["var"]]), 
                         m.step[["alpha"]])
        m.step <- m_step(wait, e.step[["posterior.df"]])
        loglik.vector <- c(loglik.vector, e.step[["loglik"]])
    
        loglik.diff <- abs((cur.loglik - e.step[["loglik"]]))
        if(loglik.diff < 1e-6) {
          break
        } else {
          cur.loglik <- e.step[["loglik"]]
        }
      }
    }
    loglik.vector
    
    ##  [1] -1034.246 -1034.047 -1034.020 -1034.010 -1034.005 -1034.003 -1034.002
    ##  [8] -1034.002 -1034.002 -1034.002 -1034.002 -1034.002 -1034.002 -1034.002
    ## [15] -1034.002 -1034.002
    

    As you can see, we actually stopped running EM after 16 iterations because the log likelihood didn’t change much (specifically, the difference between the 15th and 16th iterations was < 1e-6). We classify this as convergence of the algorithm, and this represents our final fit.
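
    If you want to see the convergence visually (a small addition to the workflow above), you can plot the stored log likelihoods against the iteration number:

    data.frame(iteration = seq_along(loglik.vector),
               loglik = loglik.vector) %>%
      ggplot(aes(x = iteration, y = loglik)) +
      geom_line() +
      geom_point() +
      xlab("EM Iteration") +
      ylab("Log Likelihood") +
      ggtitle("Convergence of the EM Algorithm")
    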

    So our final component parameters are as follows:

    m.step
    
    ## $mu
    ## [1] 54.61510 80.09122
    ## 
    ## $var
    ## [1] 34.47368 34.42849
    ## 
    ## $alpha
    ## [1] 0.3608934 0.6391066
    

    Which produces the following mixture model:

    #' Plot a Mixture Component
    #' 
    #' @param x Input data.
    #' @param mu Mean of component.
    #' @param sigma Standard deviation of component.
    #' @param lam Mixture weight of component.
    plot_mix_comps <- function(x, mu, sigma, lam) {
      lam * dnorm(x, mu, sigma)
    }
    
    data.frame(x = wait) %>%
      ggplot() +
      geom_histogram(aes(x, ..density..), binwidth = 1, colour = "black", 
                     fill = "white") +
      stat_function(geom = "line", fun = plot_mix_comps,
                    args = list(m.step$mu[1], sqrt(m.step$var[1]), 
                               lam = m.step$alpha[1]),
                    colour = "red", lwd = 1.5) +
      stat_function(geom = "line", fun = plot_mix_comps,
                    args = list(m.step$mu[2], sqrt(m.step$var[2]), 
                               lam = m.step$alpha[2]),
                    colour = "blue", lwd = 1.5) +
      ylab("Density") +
      xlab("Values") +
      ggtitle("Final GMM Fit")
    

    [Plot: histogram of the waiting times with the two fitted gaussian components overlaid]
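
    As a sanity check (assuming you have the mixtools package installed), we can compare these estimates against what mixtools’ normalmixEM function produces for the same data. The values should be very close, although not necessarily identical, since normalmixEM uses its own (random) initialization:

    library("mixtools")
    
    set.seed(1)
    wait.mixtools <- normalmixEM(wait, k = 2)
    
    # Compare with our m.step results: mu, sqrt(var), and alpha
    wait.mixtools$mu      # component means
    wait.mixtools$sigma   # component standard deviations
    wait.mixtools$lambda  # mixing weights
    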

    Summary

    In this post, I have demonstrated how to implement your own mixture model where each component is a gaussian. The logic can easily be extended to other types of mixture models by simply substituting the gaussian components with the appropriate statistical distribution. We used the well-known EM algorithm to allow us to use MLE in the incomplete data scenario.

    I hope this post gives you some insight into the inner workings of how to fit a gaussian mixture model!

    • 13-03-2017 - Fixed maximization equations. Used the likelihood instead of posterior equation by accident. Thanks Koustuv Sinha for pointing this out!
    • 14-12-2017 - Fixed math symbol for variance ($\sigma^{2}$) across various equations. Was accidentally using $\sigma$ instead. Thanks Eric for pointing this out!

    R Session

    devtools::session_info()
    
    ## Session info --------------------------------------------------------------
    
    ##  setting  value                       
    ##  version  R version 3.3.2 (2016-10-31)
    ##  system   x86_64, darwin11.4.2        
    ##  ui       unknown                     
    ##  language (EN)                        
    ##  collate  en_CA.UTF-8                 
    ##  tz       Europe/London               
    ##  date     2017-12-14
    
    ## Packages ------------------------------------------------------------------
    
    ##  package    * version date       source         
    ##  argparse   * 1.0.4   2016-10-28 CRAN (R 3.3.2) 
    ##  assertthat   0.1     2013-12-06 CRAN (R 3.3.2) 
    ##  colorspace   1.3-1   2016-11-18 CRAN (R 3.3.2) 
    ##  DBI          0.5-1   2016-09-10 CRAN (R 3.3.2) 
    ##  devtools     1.12.0  2016-12-05 CRAN (R 3.3.2) 
    ##  digest       0.6.10  2016-08-02 CRAN (R 3.3.2) 
    ##  dplyr      * 0.5.0   2016-06-24 CRAN (R 3.3.2) 
    ##  evaluate     0.10    2016-10-11 CRAN (R 3.3.2) 
    ##  findpython   1.0.1   2014-04-03 CRAN (R 3.3.2) 
    ##  gdtools    * 0.1.3   2016-11-11 CRAN (R 3.3.2) 
    ##  getopt       1.20.0  2013-08-30 CRAN (R 3.3.2) 
    ##  ggplot2    * 2.2.0   2016-11-11 CRAN (R 3.3.2) 
    ##  gtable       0.2.0   2016-02-26 CRAN (R 3.3.2) 
    ##  highr        0.6     2016-05-09 CRAN (R 3.3.2) 
    ##  knitr      * 1.15.1  2016-11-22 CRAN (R 3.3.2) 
    ##  labeling     0.3     2014-08-23 CRAN (R 3.3.2) 
    ##  lazyeval     0.2.0   2016-06-12 CRAN (R 3.3.2) 
    ##  magrittr     1.5     2014-11-22 CRAN (R 3.3.2) 
    ##  memoise      1.0.0   2016-01-29 CRAN (R 3.3.2) 
    ##  munsell      0.4.3   2016-02-13 CRAN (R 3.3.2) 
    ##  nvimcom    * 0.9-14  2017-03-08 local (@0.9-14)
    ##  plyr         1.8.4   2016-06-08 CRAN (R 3.3.2) 
    ##  proto      * 1.0.0   2016-10-29 CRAN (R 3.3.2) 
    ##  R6           2.2.0   2016-10-05 CRAN (R 3.3.2) 
    ##  Rcpp         0.12.8  2016-11-17 CRAN (R 3.3.2) 
    ##  reshape2   * 1.4.2   2016-10-22 CRAN (R 3.3.2) 
    ##  rjson        0.2.15  2014-11-03 CRAN (R 3.3.2) 
    ##  scales       0.4.1   2016-11-09 CRAN (R 3.3.2) 
    ##  stringi      1.1.2   2016-10-01 CRAN (R 3.3.2) 
    ##  stringr      1.1.0   2016-08-19 CRAN (R 3.3.2) 
    ##  svglite    * 1.2.0   2016-11-04 CRAN (R 3.3.2) 
    ##  tibble       1.2     2016-08-26 CRAN (R 3.3.2) 
    ##  withr        1.0.2   2016-06-20 CRAN (R 3.3.2)
    

    References