The expectation-maximization (EM) algorithm is an unsupervised learning method that uses observable data to infer latent (unobserved) variables.

Real-world machine learning applications often involve many relevant variables, of which only a fraction may be observable. When a variable is visible in some instances and hidden in others, the instances in which it is observed can be used to learn about it and to predict its value in the instances where it is not observed.

This situation is commonly referred to as dealing with missing data. Machine learning algorithms can learn patterns and correlations from the instances in which the variable is observed, and these patterns can then be used to predict the variable's value when it is missing or unobservable.

The EM algorithm is one way to deal with variables that are only partially observable: the available observations are used to learn and estimate the model, and the fitted model can then predict the values of these variables even when they are not observed.

EM technique

The EM method is an iterative optimization procedure, used in many unsupervised machine learning techniques, for finding maximum likelihood or maximum a posteriori estimates of parameters in statistical models with unobserved latent variables. Because it can handle missing data, the EM approach is widely used for latent variable models. It comprises an expectation step (E-step) and a maximization step (M-step), which together form an iterative method for improving model fit.

  • In the E-step, the algorithm uses the current parameter estimates to compute the posterior distribution of the latent variables and, from it, the expected log-likelihood.
  • In the M-step, the algorithm finds the parameters that maximize the expected log-likelihood obtained in the E-step, and the model parameters are updated accordingly.

By repeating these two steps, the EM method iteratively increases the likelihood of the observed data. It is often used for unsupervised learning tasks such as clustering, where latent variables must be inferred, and it has applications in domains such as machine learning, computer vision, and natural language processing.
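As a concrete illustration (the notation here is ours, matching the two-component, one-dimensional Gaussian mixture fitted in the code in the next section), a single EM iteration updates the responsibilities and parameters as follows, where $\gamma_{ik}$ is the responsibility of component $k$ for observation $x_i$ and $n$ is the number of observations:

E-step:

$$\gamma_{i1} = \frac{\pi_1\,\mathcal{N}(x_i \mid \mu_1, \sigma_1^2)}{\pi_1\,\mathcal{N}(x_i \mid \mu_1, \sigma_1^2) + \pi_2\,\mathcal{N}(x_i \mid \mu_2, \sigma_2^2)}, \qquad \gamma_{i2} = 1 - \gamma_{i1}$$

M-step:

$$\mu_k = \frac{\sum_{i} \gamma_{ik}\, x_i}{\sum_{i} \gamma_{ik}}, \qquad \sigma_k^2 = \frac{\sum_{i} \gamma_{ik}\,(x_i - \mu_k)^2}{\sum_{i} \gamma_{ik}}, \qquad \pi_k = \frac{1}{n}\sum_{i} \gamma_{ik}$$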

Performing the EM algorithm

  • The loop iterates for the specified number of epochs (20 in this case).
  • In each epoch, the E-step calculates the responsibilities (gamma values) by evaluating the Gaussian probability density of each component and weighting it by that component's mixing proportion.
  • The M-step then updates the parameters, computing each component's responsibility-weighted mean and standard deviation as well as its mixing proportion.
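The loop assumes that the data X and the initial parameter estimates (mu1_hat, mu2_hat, sigma1_hat, sigma2_hat, pi1_hat, pi2_hat) have already been defined. A minimal setup sketch is shown first; the synthetic data and starting values below are illustrative assumptions, not part of the original example.

# Illustrative setup (assumed): imports, synthetic 1-D data, and initial guesses
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

np.random.seed(0)
# Synthetic data drawn from a two-component Gaussian mixture (assumption)
X = np.concatenate([np.random.normal(2.0, 1.0, 200),
                    np.random.normal(7.0, 1.5, 300)])

# Initial guesses (assumption): means, standard deviations, mixing proportions
mu1_hat, mu2_hat = 1.0, 8.0
sigma1_hat, sigma2_hat = 1.0, 1.0
pi1_hat, pi2_hat = 0.5, 0.5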
# Perform EM algorithm for 20 epochs
num_epochs = 20
log_likelihoods = []
for epoch in range(num_epochs):
	# E-step: Compute responsibilities
	gamma1 = pi1_hat * norm.pdf(X, mu1_hat, sigma1_hat)
	gamma2 = pi2_hat * norm.pdf(X, mu2_hat, sigma2_hat)
	total = gamma1 + gamma2
	gamma1 /= total
	gamma2 /= total
	
	# M-step: Update parameters
	mu1_hat = np.sum(gamma1 * X) / np.sum(gamma1)
	mu2_hat = np.sum(gamma2 * X) / np.sum(gamma2)
	sigma1_hat = np.sqrt(np.sum(gamma1 * (X - mu1_hat)**2) / np.sum(gamma1))
	sigma2_hat = np.sqrt(np.sum(gamma2 * (X - mu2_hat)**2) / np.sum(gamma2))
	pi1_hat = np.mean(gamma1)
	pi2_hat = np.mean(gamma2)
	
	# Compute log-likelihood
	log_likelihood = np.sum(np.log(pi1_hat * norm.pdf(X, mu1_hat, sigma1_hat)
								+ pi2_hat * norm.pdf(X, mu2_hat, sigma2_hat)))
	log_likelihoods.append(log_likelihood)
# Plot log-likelihood values over epochs
plt.plot(range(1, num_epochs+1), log_likelihoods)
plt.xlabel('Epoch')
plt.ylabel('Log-Likelihood')
plt.title('Log-Likelihood vs. Epoch')
plt.show()
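If everything is working correctly, the plotted curve should be non-decreasing: each EM iteration is guaranteed not to decrease the observed-data log-likelihood, so the values typically rise quickly in the first few epochs and then flatten out as the parameter estimates converge.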

Code source: GeeksforGeeks

Conclusion

EM generally converges to a local optimum rather than the global optimum, and there is no general guarantee on its rate of convergence. In high dimensions, convergence can be arbitrarily slow, and the likelihood surface can contain a very large number of local optima. As a result, there is a need for alternative techniques with stronger learning guarantees, particularly in high-dimensional settings. Moment-based approaches and spectral techniques are EM alternatives that provide stronger consistency guarantees.
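As a rough illustration of this sensitivity to initialization, the sketch below wraps the same two-component EM loop as above in a helper function and runs it from several different starting means. The data, the starting values, and the run_em helper are assumptions made purely for demonstration.

# A minimal sketch (assumed, not from the original article): run the same
# two-component EM from several starting points and compare the final
# log-likelihoods to illustrate convergence to different local optima.
import numpy as np
from scipy.stats import norm

np.random.seed(0)
# Synthetic 1-D data from a two-component Gaussian mixture (assumption)
X = np.concatenate([np.random.normal(2.0, 1.0, 200),
                    np.random.normal(7.0, 1.5, 300)])

def run_em(X, mu1, mu2, sigma1=1.0, sigma2=1.0, pi1=0.5, pi2=0.5, epochs=50):
    """Run EM for a two-component 1-D Gaussian mixture; return the final log-likelihood."""
    for _ in range(epochs):
        # E-step: responsibilities
        g1 = pi1 * norm.pdf(X, mu1, sigma1)
        g2 = pi2 * norm.pdf(X, mu2, sigma2)
        total = g1 + g2
        g1, g2 = g1 / total, g2 / total
        # M-step: update means, standard deviations, and mixing proportions
        mu1, mu2 = np.sum(g1 * X) / np.sum(g1), np.sum(g2 * X) / np.sum(g2)
        sigma1 = np.sqrt(np.sum(g1 * (X - mu1) ** 2) / np.sum(g1))
        sigma2 = np.sqrt(np.sum(g2 * (X - mu2) ** 2) / np.sum(g2))
        pi1, pi2 = np.mean(g1), np.mean(g2)
    return np.sum(np.log(pi1 * norm.pdf(X, mu1, sigma1)
                         + pi2 * norm.pdf(X, mu2, sigma2)))

# Different starting means can end at different final log-likelihoods; a common
# practical mitigation is to run several restarts and keep the best result.
starts = [(1.0, 8.0), (4.0, 4.2), (-3.0, -2.5)]
results = [run_em(X, m1, m2) for m1, m2 in starts]
for (m1, m2), ll in zip(starts, results):
    print(f"start means ({m1}, {m2}): final log-likelihood = {ll:.2f}")
print(f"best restart: {max(results):.2f}")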

Moment-based techniques for learning the parameters of a probabilistic model have gained popularity because they offer guarantees such as global convergence under certain conditions, in contrast to EM, which frequently gets stuck in local optima. Algorithms with such learning guarantees can be derived for a range of relevant models, including mixture models and hidden Markov models (HMMs).
