CMU 10601 的课程笔记。EM 算法计算含有隐含变量的概率模型参数估计,能使用在一些无监督的聚类方法上。在 EM 算法总结提出以前就有该算法思想的方法提出,例如 HMM 中用的 Baum-Welch 算法。
Mixtures of Gaussians
MLE 的通用步骤:
Steps:
- 写出似然函数;
- 对似然函数取对数,并整理;
- 求导数,令导数为0,得到似然方程;
- 解似然方程,得到的参数即为所求;
然而有的时候我们会陷入困境,因为似然方程没法求解,比如 Mixtures of Gaussians。
One Gaussian
Given data: X1,…Xn
Modeling assumption: Xi(independently) ~ Gaussian, $\sigma^2$=1/2
LE
$$P(x_1,..,x_n|\mu)={1 \over \sqrt{2 \pi \sigma^2}} e^{-(x_i-\mu)^2 \over 2 \sigma^2} $$
MLE: $\hat \mu_{ML}= argmax \ L(x_1,..,x_n|\mu)$
$\mu_{ML} = {\sum_{i=1}^n \ x_i \over n} = \overline x_i$
K Gaussians
Given data: X1,…Xn
Modeling assumption: Xi(independently) ~ Mixture of K Gaussians $\sigma^2$=1/2, $\mu_1,…\mu_k$
Prior probability over Gaussians: $\lambda_1,…,\lambda_k \ 0 \le \lambda_j \le 1 \ \sum \lambda_j=1$, distribution of j is unknown, $\lambda_j, \mu_j$ is unknown
LE
$$
\begin{aligned}
P(x_1,..,x_n|\hat \lambda,\hat \mu,\sigma^2=0.5) & = \prod_i^n[\sum_j^k P(Gaussian \ j \ was \ chosen)P(xi|Gaussian \ j \ was \ chosen)] \\
& = \prod_i^n[\sum_j^k \lambda_j{1 \over \sqrt{2 \pi \sigma^2}} e^{-(x_i-\mu_j)^2 \over 2 \sigma_j^2}]
\end{aligned}
$$
MLE: $(\hat \mu, \hat \lambda)_{ML}= argmax \ L(x1,..,xn|\hat \lambda,\hat \mu,\sigma^2=0.5)$
How to find MLE?
“SOFT” K-MEANS
soft: our guesses are probabilities and taking values in [0,1]
hard: represents a single best guess (taking values in {0,1} or {1,…k})
Special case of the EM Algorithm
$\sigma^2$=1/2
初始化
初始化分布参数θ, $\mu^{(0)}$,$\lambda^{(0)}$
重复 EM 步骤直到收敛
1.E-step
Fill in the missing variables with the expected values
根据参数初始值或上一次迭代的模型参数来计算出隐性变量的后验概率,其实就是隐性变量的期望。作为隐藏变量的现估计值:
$$w_j^{(i)}=P(z^{(i)}=j|x^{(i)};\mu,\lambda,\sigma^2=0.5) $$
2.M-Step
Regular maximum likelihood estimation (MLE) using the values computed in the E step and the values of the other variables
将似然函数最大化以获得新的参数值 $\mu^{(l+1)}$,$\lambda^{(l+1)}$
- l: index over EM iterations
- j: index of Gaussians
- i: index of datapoints
$$\hat \lambda_j={\# j \ was \ chosen \over n}={\sum_{i=1}^n w_j^{(i)} \over n}$$
$$\hat \mu_j={\sum_{i=1}^n w_j^{(i)} x^{(i)} \over \sum_{i=1}^n w_j^{(i)}}$$
EM Algorithm guarantees:
$$L(D|\theta^{[0]}) \le L(D|\theta^{[1]}) \le L(D|\theta^{[2]})…$$
This is a Non-decreasing function. And if the likelihood function is bounded, the sequence will converge. Here the example is bounded, because $\sigma$ is fixed.
It might end up with local optimal.
例题
第一次 iteration 的表格表示
General EM algorithm
X observed data
Z unobserved ‘data’
Complete data Y=(X,Z)
Likelihood function $L(X,Z|\theta)$ known
Problem: find MLE
$$
\begin{aligned}
\hat \theta_{ML} & = argmax \ L(X|\theta)= argmax \sum_Z P(X,Z|\theta) \\
& = argmax \sum_Z P(Z|\theta)P(X|Z,\theta) \\
& = argmax \sum_Z P(X|\theta)P(Z|X,\theta)P(X|Z,\theta)
\end{aligned}
$$
Chicken-egg Problem with regard to P(Z),$\theta$
先有鸡还是先有蛋的问题。当我们知道了哪些 datapoints 属于同一个高斯分布的时候,我们才能够对这个分布的参数作出靠谱的预测,然而现在 datapoints 混在了一起,我们不知道哪个属于哪个,也就没办法估计参数。反过来,只有当我们对 K 个分布的参数作了准确的估计的时候,才知道哪些 datapoints 属于同一个高斯分布。
所以陷入了僵局,怎么办?不如就先随便整一个值出来,看你怎么变,然后我再根据你的变化调整我的变化,然后如此迭代着不断互相推导,最终就会收敛到一个解。这就是EM算法的基本思想。 -> chicken stay, egg change
EM solution
- Initialize $\theta$ to some value
- EM 方程
$$\theta^{[l+1]}=argmax \ E_{Z|\theta^l}[logP(Z,X|\theta)]$$
E 就是 expectation 的 E, write expression for E as function of $\theta$, expression: auxiliary function $Q(\theta^{[l]}|\theta)$
M 就是 argmax 的 M, find the MAX over $\theta$
=>
Simple EM solution:- E-step: express MLE as function of X’s and Z’s
know the value of Z, what would the ML solution - M-step: replace each Z with $E[Z|\theta]$
- E-step: express MLE as function of X’s and Z’s
EM guarantees
$$L(X|\theta^{[l]}) \le L(X|\theta^{l+1})$$
If likelihood is bounded, the sequence will converge.
Special but most commonly use:
$L(X,Z|\theta)$ is a member of the exponential family
适用情景
When to use:
- Data is only partially observable
- Unsupervised clustering(target value unobservable)
- Supervised learning(some instance attributes unobservable)
Some uses:
- Train Bayesian Belief Networks
- Unsupervised clustering(AUTOCLASS)
- Learning Hidden Markov Models
参考链接:
从最大似然到EM算法浅解
(EM算法)The EM Algorithm
EM(Expectation-Maximization)算法