The goal of this notebook is to gain intuition for Expectation Maximization using a simple example involving coin tosses.
In Section XIV, we introduce Expectation-Maximization (EM) as a practical way to perform maximum likelihood estimation (MLE) even when some of the data is hidden (i.e in the presence of latent or hidden variables). To better understand EM, in this short notebook we'll explore a very simple coin-tossing example adapted from Do and Batzoglou, Nat. Biotechnol. (2008).
Suppose that we are given two coins A and B with unkown bias $\theta_A$ and $\theta_B$, respectively. Our goal is to estimate the bias vector $\boldsymbol{\theta}= (\theta_A, \theta_B)$ from the outcomes of the following experiment:
First choose one coin at random. Then toss the selected coin 10 times independently and record the number of heads observed. Repeat this procedure 5 times.
Formally, let $z_i\in\{A,B\}$ be the coin selected in experiment $i$ and $x_i\in\{0,1,\cdots 10\}$ be the number heads recorded by tossing $z_i$ 10 times. Since we conduct $n=5$ such experiments, we can summarize the outcomes of these 50 tosses by two vectors: $\boldsymbol{x}=(x_1,x_2\cdots, x_5)$ and $\boldsymbol{z}=(z_1,z_2,\cdots, z_5)$.
Consider first the case where we have complete knowledge of the experiment, namely, both $\boldsymbol{x}$ and $\boldsymbol{z}$ are known. How would you intuitively estimate the biases of the two coins $\boldsymbol{\theta}= (\theta_A, \theta_B)$ ?
What's the likelihood of observing the complete outcomes of these experiments? In other words, what is $P(\boldsymbol{x},\boldsymbol{z}| n,\boldsymbol{\theta} )$? You may assume this is a Bernoulli trial. Namely, every time coin A(B) is tossed, we have, with probability $\theta_A$($\theta_B$), that the outcome is heads.
What's the Maximum Likelihood Estimator (MLE)? Is this consistent with your intuition?
To test your answer, let's do some numerics! We will compare the MLE estimates of biases with an Expectation Maximization procedure where we do not know ${\bf z}$. The following code computes our best guess for the biases using MLE -- assuming we know the identity of the coin used -- and compares it estimates arrived at using an EM procedure where we have no knowledge about which coin was being tossed (though we know the same coin was tossed 10 times).
import numpy as np
from scipy.special import comb
import math
def compute_likelihood(obs, n, pheads): # No surprise, it's Binomial!!!
likelihood = comb(n, obs, exact=True)*(pheads**obs)*(1.0-pheads)**(n-obs)
return likelihood
# generate experiments
num_coin_toss = 10 # each experiment contains num_coin_toss tosses
num_exp = 5 # we perform 5 such experiments
theta_A_true = 0.8
theta_B_true = 0.4
coin_choice = np.zeros(num_exp) # initialize: 0 for A and 1 for B
head_counts = np.zeros(num_exp)
# MLE
MLE_A = 0.0
MLE_B = 0.0
# generate the outcomes of experiment
for i in np.arange(num_exp):
if np.random.randint(2) == 0: # coin A is selected
head_counts[i] = np.random.binomial(num_coin_toss , theta_A_true, 1) # toss coin A num_coin_toss times
MLE_A = MLE_A + head_counts[i] # add the number of heads observed to total headcounts
else: # coin B is selected
head_counts[i] = np.random.binomial(num_coin_toss , theta_B_true, 1) # toss coin B num_coin_toss times
coin_choice[i] = 1 # record the selection of coin B during experiment i
MLE_B = MLE_B + head_counts[i] # add the number of heads observed to total headcounts
tail_counts = num_coin_toss - head_counts
# MLE is merely the proportion of heads for each coin toss
MLE_A = MLE_A / ((num_exp - np.count_nonzero(coin_choice))*num_coin_toss)
MLE_B = MLE_B / (np.count_nonzero(coin_choice)*num_coin_toss)
# initialize the pA(heads) and pB(heads), namely, coin biases
pA_heads = np.zeros(100);
pB_heads = np.zeros(100);
pA_heads[0] = 0.60 # initial guess
pB_heads[0] = 0.50 # initial guess
# E-M begins!
epsilon = 0.001 # error threshold
j = 0 # iteration counter
improvement = float('inf')
while (improvement > epsilon):
expectation_A = np.zeros((num_exp,2), dtype=float)
expectation_B = np.zeros((num_exp,2), dtype=float)
for i in np.arange(min(len(head_counts),len(tail_counts))):
eH = head_counts[i]
eT = tail_counts[i]
# E step:
lA = compute_likelihood(eH, num_coin_toss, pA_heads[j])
lB = compute_likelihood(eH, num_coin_toss, pB_heads[j])
weightA = lA / (lA + lB)
weightB = lB / (lA + lB)
expectation_A[i] = weightA*np.array([eH, eT])
expectation_B[i] = weightB*np.array([eH, eT])
# M step
theta_A = np.sum(expectation_A, axis = 0)[0] / np.sum(expectation_A)
theta_B = np.sum(expectation_B, axis = 0)[0] / np.sum(expectation_B)
print('At iteration %d, theta_A = %2f, theta_B = %2f' % (j, theta_A, theta_B))
pA_heads[j+1] = sum(expectation_A)[0] / sum(sum(expectation_A));
pB_heads[j+1] = sum(expectation_B)[0] / sum(sum(expectation_B));
improvement = max( abs(np.array([pA_heads[j+1],pB_heads[j+1]]) - np.array([pA_heads[j],pB_heads[j]]) ))
j = j+1
# END of E-M, print the outcome
print('E-M converges at iteration %d' %j)
print('RESULT:')
print('E-M: theta_A = %2f, theta_B = %2f' % (theta_A, theta_B))
print('MLE with complete data: theta_A = %2f, theta_B = %2f' % (MLE_A, MLE_B))
How fast does EM converge? Is the converged result close to what you'd get from MLE?
Following Exercise 1, what's the objective function we're optimizing in the E-step? Does this function have a unique global maximum?
Compare both the results of MLE and EM to the actual bias (i.e. theta_A_true and theta_B_true in the snippet above), comment on their performance.
From Exercise 2 and Section XIV, we know that the E-M algorithm often approximates the MLE even in the presence of latent (hidden variables). Like with most optimization methods for nonconcave functions, E-M only guarantees convergence to a local maximum of the objective function. For this reason, its performance can be boosted by running the EM procedure starting with multiple initial parameters.
Now instead of having a fixed initial guess of coin biases (i.e. pA_heads[0] and pB_heads[0] in the snippet), draw these values uniformly at random from $[0,1]$ and run the E-M algorithm. Repeat this twenty times and report what you observed. What's the best initial guess that gives the closest estimate to the true parameters?
As we discussed in Section X (LinReg), Maximum a posteriori (MAP) estimation differs from MLE in that it employs an augmented objective function which incorporates a prior distribution over the quantities we want to estimate, and the prior distribution can be think of as a regularizer for the objective fuction used in MLE. Here we will explore how to extend E-M to MAP estimation.
(1) First derive the MAP estimate for the one-coin-flipping example, namely, $$ \hat{{\theta}}_{MAP}(\boldsymbol{x}) = \arg\max_{\theta\in[0,1]} \log P(\boldsymbol{x}|n,{\theta} ) + \log P({\theta}), $$ where $$P(\boldsymbol{x}|n,{\theta}) = \prod_{i=1}^{10} \text{Binomial} (x_i|n,\theta)$$
$$P({\theta})=\mathcal{N}(\theta|\mu, \sigma)$$
(2) Based on (1), now modify the E-M snippet above to incorporate this prior distribution into the M-step. Comment on the performance. For the prior choice, try $P(\boldsymbol{\theta})=\mathcal{N}(\theta_A|0.83, 1)\mathcal{N}(\theta_B|0.37, 1)$.