## Maximum Likelihood Estimation of the Multivariate Normal Distribution

Let $\mathbf{x}$ be a $p$-dimensional continuous random vector. If $\mathbf{x}$ follows a (nondegenerate) multivariate normal distribution, its probability density function is completely determined by the $p$-dimensional mean vector $\boldsymbol{\mu}=E[\mathbf{x}]$ and the $p\times p$ covariance matrix $\Sigma=\hbox{cov}[\mathbf{x}]$, as follows:

$\displaystyle \mathcal{N}(\mathbf{x}\vert\boldsymbol{\mu},\Sigma)=\frac{1}{(2\pi)^{p/2}\vert\Sigma\vert^{1/2}}\exp\left\{-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu})\right\}$

Given $n$ independent samples $\mathcal{X}=\{\mathbf{x}_1,\ldots,\mathbf{x}_n\}$ drawn from this distribution, the likelihood function is

$\displaystyle\begin{aligned} L(\mathcal{X}\vert\boldsymbol{\mu},\Sigma)&=\prod_{k=1}^n\mathcal{N}(\mathbf{x}_k\vert\boldsymbol{\mu},\Sigma)\\ &=\prod_{k=1}^n\frac{1}{(2\pi)^{p/2}\vert\Sigma\vert^{1/2}}\exp\left\{-\frac{1}{2}(\mathbf{x}_k-\boldsymbol{\mu})^T\Sigma^{-1}(\mathbf{x}_k-\boldsymbol{\mu})\right\}. \end{aligned}$

Taking the logarithm turns the product into a sum:

$\displaystyle\begin{aligned} \ln L(\mathcal{X}\vert\boldsymbol{\mu},\Sigma)&=\sum_{k=1}^n\ln \mathcal{N}(\mathbf{x}_k\vert\boldsymbol{\mu},\Sigma)\\ &=\sum_{k=1}^n\left(-\frac{p}{2}\ln(2\pi)-\frac{1}{2}\ln\vert\Sigma\vert-\frac{1}{2}(\mathbf{x}_k-\boldsymbol{\mu})^T\Sigma^{-1}(\mathbf{x}_k-\boldsymbol{\mu})\right). \end{aligned}$

To estimate $\boldsymbol{\mu}$, differentiate the log-likelihood with respect to $\boldsymbol{\mu}$; the terms not involving $\boldsymbol{\mu}$ vanish:

$\displaystyle\begin{aligned} \frac{\partial \ln L}{\partial\boldsymbol{\mu}}&=\frac{\partial}{\partial\boldsymbol{\mu}}\sum_{k=1}^n\left(-\frac{p}{2}\ln(2\pi)-\frac{1}{2}\ln\vert\Sigma\vert-\frac{1}{2}(\mathbf{x}_k-\boldsymbol{\mu})^T\Sigma^{-1}(\mathbf{x}_k-\boldsymbol{\mu})\right) \\ &=\sum_{k=1}^n\frac{\partial}{\partial\boldsymbol{\mu}}\left(-\frac{p}{2}\ln(2\pi)-\frac{1}{2}\ln\vert\Sigma\vert-\frac{1}{2}(\mathbf{x}_k-\boldsymbol{\mu})^T\Sigma^{-1}(\mathbf{x}_k-\boldsymbol{\mu})\right) \\ &=\sum_{k=1}^n\frac{\partial}{\partial\boldsymbol{\mu}}\left(-\frac{1}{2}(\mathbf{x}_k-\boldsymbol{\mu})^T\Sigma^{-1}(\mathbf{x}_k-\boldsymbol{\mu})\right) \\ &=\sum_{k=1}^n\frac{\partial}{\partial\boldsymbol{\mu}}\left(-\frac{1}{2}\boldsymbol{\mu}^T\Sigma^{-1}\boldsymbol{\mu}+\boldsymbol{\mu}^T\Sigma^{-1}\mathbf{x}_k -\frac{1}{2}\mathbf{x}_k^T\Sigma^{-1}\mathbf{x}_k\right) \\ &=\sum_{k=1}^n\left(-\Sigma^{-1}\boldsymbol{\mu}+\Sigma^{-1}\mathbf{x}_k\right)\\ &=\Sigma^{-1}\sum_{k=1}^n(\mathbf{x}_k-\boldsymbol{\mu}). \end{aligned}$

Setting $\frac{\partial \ln L}{\partial\boldsymbol{\mu}}=\mathbf{0}$ and solving shows that the maximum likelihood estimate of $\boldsymbol{\mu}$ is the sample mean vector, written as

$\displaystyle \hat{\boldsymbol{\mu}}=\overline{\mathbf{x}}=\frac{1}{n}\sum_{k=1}^n\mathbf{x}_k$

To estimate $\Sigma$, first use the trace to simplify the quadratic-form sum in the log-likelihood. A quadratic form is a scalar and equals its own trace, so the cyclic property $\text{trace}(AB)=\text{trace}(BA)$ gives

$\displaystyle\begin{aligned} \sum_{k=1}^n(\mathbf{x}_k-\boldsymbol{\mu})^T\Sigma^{-1}(\mathbf{x}_k-\boldsymbol{\mu}) &=\sum_{k=1}^n\text{trace}\left((\mathbf{x}_k-\boldsymbol{\mu})^T\Sigma^{-1}(\mathbf{x}_k-\boldsymbol{\mu})\right)\\ &=\sum_{k=1}^n\text{trace}\left(\Sigma^{-1}(\mathbf{x}_k-\boldsymbol{\mu})(\mathbf{x}_k-\boldsymbol{\mu})^T\right)\\ &=\text{trace}\left(\Sigma^{-1}\sum_{k=1}^n(\mathbf{x}_k-\boldsymbol{\mu})(\mathbf{x}_k-\boldsymbol{\mu})^T\right). \end{aligned}$
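As a quick numerical sanity check of this trace identity, the following NumPy sketch compares the two sides on arbitrary made-up data (none of the numbers come from the article):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 6, 3
X = rng.standard_normal((n, p))          # n samples of dimension p, one per row
mu = rng.standard_normal(p)
M = rng.standard_normal((p, p))
Sigma = M @ M.T + p * np.eye(p)          # an arbitrary positive definite covariance
Sinv = np.linalg.inv(Sigma)

# Left side: sum of scalar quadratic forms
lhs = sum((x - mu) @ Sinv @ (x - mu) for x in X)

# Right side: trace of Sigma^{-1} times the scatter matrix
scatter = sum(np.outer(x - mu, x - mu) for x in X)
rhs = np.trace(Sinv @ scatter)

print(np.isclose(lhs, rhs))
```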

Define the (unbiased) sample covariance matrix

$\displaystyle S=\frac{1}{n-1}\sum_{k=1}^n(\mathbf{x}_k-\overline{\mathbf{x}})(\mathbf{x}_k-\overline{\mathbf{x}})^T$

whose $(i,j)$ entry is

$\displaystyle s_{ij}=\frac{1}{n-1}\sum_{k=1}^n(x_{ki}-\overline{x}_i)(x_{kj}-\overline{x}_j)$

Substituting $\hat{\boldsymbol{\mu}}=\overline{\mathbf{x}}$ and applying the trace identity, the log-likelihood becomes

$\displaystyle \ln L(\mathcal{X}\vert\overline{\mathbf{x}},\Sigma)=-\frac{np}{2}\ln(2\pi)-\frac{n}{2}\ln\vert\Sigma\vert-\frac{n-1}{2}\text{trace}\left(\Sigma^{-1}S\right)$
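This trace form can be checked against the direct sum of log-densities; the factor $\frac{n-1}{2}$ pairs with the unbiased divisor $n-1$ in $S$. A NumPy sketch with made-up data:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 8, 2
X = rng.standard_normal((n, p))
xbar = X.mean(axis=0)
B = rng.standard_normal((p, p))
Sigma = B @ B.T + np.eye(p)              # an arbitrary positive definite covariance
Sinv = np.linalg.inv(Sigma)
_, logdet = np.linalg.slogdet(Sigma)

# Direct sum of log-densities, evaluated at mu = xbar
def log_pdf(x):
    d = x - xbar
    return -0.5 * (p * np.log(2 * np.pi) + logdet + d @ Sinv @ d)

direct = sum(log_pdf(x) for x in X)

# Trace form with the unbiased sample covariance S
S = sum(np.outer(x - xbar, x - xbar) for x in X) / (n - 1)
trace_form = (-n * p / 2 * np.log(2 * np.pi) - n / 2 * logdet
              - (n - 1) / 2 * np.trace(Sinv @ S))

print(np.isclose(direct, trace_form))
```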

Differentiating with respect to $\Sigma$, using the matrix derivative identities $\frac{\partial}{\partial\Sigma}\ln\vert\Sigma\vert=\Sigma^{-1}$ and $\frac{\partial}{\partial\Sigma}\text{trace}\left(\Sigma^{-1}S\right)=-\Sigma^{-1}S\Sigma^{-1}$,

$\displaystyle \frac{\partial \ln L(\mathcal{X}\vert\overline{\mathbf{x}},\Sigma)}{\partial \Sigma}=-\frac{n}{2}\Sigma^{-1}+\frac{n-1}{2}\Sigma^{-1}S\Sigma^{-1}$

Setting $\frac{\partial \ln L(\mathcal{X}\vert\overline{\mathbf{x}},\Sigma)}{\partial \Sigma}=0$ yields the maximum likelihood estimate of the covariance matrix $\Sigma$:

$\displaystyle \hat{\Sigma}=\frac{n-1}{n}S=\frac{1}{n}\sum_{k=1}^n(\mathbf{x}_k-\overline{\mathbf{x}})(\mathbf{x}_k-\overline{\mathbf{x}})^T$
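NumPy's `np.cov` uses the unbiased divisor $n-1$, i.e. it computes $S$, so the MLE is a simple rescaling of it. A sketch with made-up data:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 10, 3
X = rng.standard_normal((n, p))          # one sample per row

mu_hat = X.mean(axis=0)                  # MLE of the mean: the sample mean vector
xc = X - mu_hat                          # centered samples
Sigma_hat = xc.T @ xc / n                # MLE of the covariance (divisor n)

S = np.cov(X, rowvar=False)              # np.cov uses divisor n - 1, i.e. S
print(np.allclose(Sigma_hat, (n - 1) / n * S))
```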

Expanding the scatter matrix for computation,

$\displaystyle\begin{aligned} \sum_{k=1}^n(\mathbf{x}_k-\overline{\mathbf{x}})(\mathbf{x}_k-\overline{\mathbf{x}})^T &=\sum_{k=1}^n\left(\mathbf{x}_k\mathbf{x}_k^T-\mathbf{x}_k\overline{\mathbf{x}}^T-\overline{\mathbf{x}}\mathbf{x}_k^T+\overline{\mathbf{x}}\overline{\mathbf{x}}^T\right)\\ &=\sum_{k=1}^n\mathbf{x}_k\mathbf{x}_k^T-\left(\sum_{k=1}^n\mathbf{x}_k\right)\overline{\mathbf{x}}^T-\overline{\mathbf{x}}\left(\sum_{k=1}^n\mathbf{x}_k^T\right)+\sum_{k=1}^n\overline{\mathbf{x}}\overline{\mathbf{x}}^T\\ &=\sum_{k=1}^n\mathbf{x}_k\mathbf{x}_k^T-n\overline{\mathbf{x}}\overline{\mathbf{x}}^T-n\overline{\mathbf{x}}\overline{\mathbf{x}}^T+n\overline{\mathbf{x}}\overline{\mathbf{x}}^T\\ &=\sum_{k=1}^n\mathbf{x}_k\mathbf{x}_k^T-n\overline{\mathbf{x}}\overline{\mathbf{x}}^T. \end{aligned}$
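This computational identity is easy to verify numerically (a NumPy sketch with made-up data):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 7, 4
X = rng.standard_normal((n, p))
xbar = X.mean(axis=0)

# Scatter matrix via centered outer products
lhs = sum(np.outer(x - xbar, x - xbar) for x in X)
# Same matrix via raw second moments minus n * xbar xbar^T
rhs = sum(np.outer(x, x) for x in X) - n * np.outer(xbar, xbar)

print(np.allclose(lhs, rhs))
```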

Writing the centered samples as the columns of a matrix $A$, the scatter matrix is the product

$\displaystyle \sum_{k=1}^n(\mathbf{x}_k-\overline{\mathbf{x}})(\mathbf{x}_k-\overline{\mathbf{x}})^T=\begin{bmatrix} \mathbf{x}_1-\overline{\mathbf{x}}&\cdots&\mathbf{x}_n-\overline{\mathbf{x}} \end{bmatrix}\begin{bmatrix} (\mathbf{x}_1-\overline{\mathbf{x}})^T\\ \vdots\\ (\mathbf{x}_n-\overline{\mathbf{x}})^T \end{bmatrix}=AA^T$

The centered samples sum to the zero vector:

$\displaystyle (\mathbf{x}_1-\overline{\mathbf{x}})+\cdots+(\mathbf{x}_n-\overline{\mathbf{x}})=\sum_{k=1}^n\mathbf{x}_k-n\overline{\mathbf{x}}=\mathbf{0}$

so the columns of $A$ are linearly dependent, and therefore

$\displaystyle \text{rank}\hat{\Sigma}=\text{rank}(AA^T)=\text{rank}A\le\min\{p,n-1\}$
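The rank bound can be illustrated by taking fewer samples than dimensions, in which case $\hat{\Sigma}$ is necessarily singular (a NumPy sketch with made-up data):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 4, 6                              # fewer samples than dimensions
X = rng.standard_normal((n, p))
xbar = X.mean(axis=0)

A = (X - xbar).T                         # p x n matrix whose columns are the centered samples
Sigma_hat = A @ A.T / n

print(np.allclose(A.sum(axis=1), 0))     # the columns of A sum to the zero vector
print(np.linalg.matrix_rank(Sigma_hat))  # at most min(p, n - 1) = 3
```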

Next, consider the correlation coefficient of variables $i$ and $j$, defined by

$\displaystyle \rho_{ij}=\frac{\sigma_{ij}}{\sqrt{\sigma_{ii}}\sqrt{\sigma_{jj}}}$

The maximum likelihood estimate of $\sigma_{ij}$ is the $(i,j)$ entry of $\hat{\Sigma}$,

$\displaystyle \hat{\sigma}_{ij}=\frac{n-1}{n}s_{ij}=\frac{1}{n}\sum_{k=1}^n(x_{ki}-\overline{x}_i)(x_{kj}-\overline{x}_j)$

so the maximum likelihood estimate of the correlation coefficient is

$\displaystyle\begin{aligned} \hat{\rho}_{ij}&=\frac{\hat{\sigma}_{ij}}{\sqrt{\hat{\sigma}_{ii}}\sqrt{\hat{\sigma}_{jj}}}\\ &=\frac{\sum_{k=1}^n(x_{ki}-\overline{x}_i)(x_{kj}-\overline{x}_j)}{\sqrt{\sum_{k=1}^n(x_{ki}-\overline{x}_i)^2}\sqrt{\sum_{k=1}^n(x_{kj}-\overline{x}_j)^2}}\\ &=\frac{\sum_{k=1}^nx_{ki}x_{kj}-n\overline{x}_i\overline{x}_j}{\sqrt{\sum_{k=1}^nx_{ki}^2-n\overline{x}_i^2}\sqrt{\sum_{k=1}^nx_{kj}^2-n\overline{x}_j^2}}, \end{aligned}$

which is precisely the sample (Pearson) correlation coefficient: the common divisors cancel.
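Because the divisors cancel, the MLE correlation matrix coincides with NumPy's `np.corrcoef` (a sketch with made-up data):

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 12, 3
X = rng.standard_normal((n, p))

xc = X - X.mean(axis=0)
Sigma_hat = xc.T @ xc / n                # MLE covariance (any common divisor would do)
d = np.sqrt(np.diag(Sigma_hat))
rho_hat = Sigma_hat / np.outer(d, d)     # normalize to correlations

print(np.allclose(rho_hat, np.corrcoef(X, rowvar=False)))
```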

To find the distribution of the sample mean, recall that the moment generating function of $\mathbf{x}_k\sim\mathcal{N}(\boldsymbol{\mu},\Sigma)$ is

$\displaystyle m_{\mathbf{x}_k}(\mathbf{t})=\exp\left(\mathbf{t}^T\boldsymbol{\mu}+\frac{1}{2}\mathbf{t}^T\Sigma\mathbf{t}\right)$

Since the samples are independent,

$\displaystyle\begin{aligned} m_{\overline{\mathbf{x}}}(\mathbf{t})&=E\left[\exp\left(\mathbf{t}^T\overline{\mathbf{x}}\right)\right]\\ &=E\left[\exp\left(\mathbf{t}^T\left(\frac{1}{n}\sum_{k=1}^n\mathbf{x}_k\right)\right)\right]\\ &=E\left[\exp\left(\sum_{k=1}^n\frac{1}{n}\mathbf{t}^T\mathbf{x}_k\right)\right]\\ &=E\left[\prod_{k=1}^n\exp\left(\frac{1}{n}\mathbf{t}^T\mathbf{x}_k\right)\right]\\ &=\prod_{k=1}^nE\left[\exp\left(\frac{1}{n}\mathbf{t}^T\mathbf{x}_k\right)\right]\\ &=\prod_{k=1}^nm_{\mathbf{x}_k}\left(\frac{1}{n}\mathbf{t}\right)\\ &=\prod_{k=1}^n\exp\left(\frac{1}{n}\mathbf{t}^T\boldsymbol{\mu}+\frac{1}{2n^2}\mathbf{t}^T\Sigma\mathbf{t}\right)\\ &=\exp\left(\mathbf{t}^T\boldsymbol{\mu}+\frac{1}{2}\mathbf{t}^T\left(\frac{1}{n}\Sigma\right)\mathbf{t}\right), \end{aligned}$

which shows that $\overline{\mathbf{x}}\sim\mathcal{N}\left(\boldsymbol{\mu},\frac{1}{n}\Sigma\right)$.

Finally, compute the expectation of $\hat{\Sigma}$:

$\displaystyle\begin{aligned} E\left[\hat{\Sigma}\right]&=E\left[\frac{1}{n}\left(\sum_{k=1}^n\mathbf{x}_k\mathbf{x}_k^T-n\overline{\mathbf{x}}\overline{\mathbf{x}}^T\right)\right]\\ &=\frac{1}{n}\sum_{k=1}^nE\left[\mathbf{x}_k\mathbf{x}_k^T\right]-E\left[\overline{\mathbf{x}}\overline{\mathbf{x}}^T\right]\\ &=\frac{1}{n}\sum_{k=1}^n\left(\Sigma+\boldsymbol{\mu}\boldsymbol{\mu}^T\right)-\left(\frac{1}{n}\Sigma+\boldsymbol{\mu}\boldsymbol{\mu}^T\right)\\ &=\Sigma+\boldsymbol{\mu}\boldsymbol{\mu}^T-\frac{1}{n}\Sigma-\boldsymbol{\mu}\boldsymbol{\mu}^T\\ &=\frac{n-1}{n}\Sigma, \end{aligned}$
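A Monte Carlo illustration of this bias (a sketch with made-up parameters; averaging the MLE $\hat{\Sigma}$ over many simulated datasets should approach $\frac{n-1}{n}\Sigma$, not $\Sigma$):

```python
import numpy as np

rng = np.random.default_rng(6)
n, p, reps = 5, 2, 200_000
chol = np.array([[1.0, 0.0], [0.5, 1.0]])
Sigma = chol @ chol.T                    # true covariance, chosen arbitrarily

# Draw reps datasets of n samples each; each sample ~ N(0, Sigma)
Z = rng.standard_normal((reps, n, p))
X = Z @ chol.T
xbar = X.mean(axis=1, keepdims=True)
xc = X - xbar
Sigma_hats = np.einsum('rki,rkj->rij', xc, xc) / n   # per-dataset MLE covariances
avg = Sigma_hats.mean(axis=0)

# The empirical mean of Sigma_hat is close to (n-1)/n * Sigma
print(np.allclose(avg, (n - 1) / n * Sigma, atol=0.02))
```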

where we used the second-moment identities

$\displaystyle \hbox{cov}[\mathbf{x}_k]=\Sigma=E\left[\mathbf{x}_k\mathbf{x}_k^T\right]-\boldsymbol{\mu}\boldsymbol{\mu}^T,~~~k=1,\ldots,n$

and

$\displaystyle \hbox{cov}[\overline{\mathbf{x}}]=\frac{1}{n}\Sigma=E\left[\overline{\mathbf{x}}\overline{\mathbf{x}}^T\right]-\boldsymbol{\mu}\boldsymbol{\mu}^T$

Hence $E[\hat{\Sigma}]=\frac{n-1}{n}\Sigma\neq\Sigma$: the maximum likelihood estimate $\hat{\Sigma}$ is biased, while $S=\frac{n}{n-1}\hat{\Sigma}$ is unbiased.


### 3 Responses to Maximum Likelihood Estimation of the Multivariate Normal Distribution

1. 徐建平 says:

May I ask about one step of the derivation that I don't quite follow, namely "using the trace to simplify the quadratic form in the likelihood function": in the first line of that derivation, why can the trace operation be applied directly?

2. ccjou says:

A quadratic form $\mathbf{x}^TA\mathbf{x}$ is a scalar, so it can be viewed as a $1\times 1$ matrix and its trace can be taken. Multivariate statistics uses this trick frequently. For example, let $\mathbf{x}=\begin{bmatrix} a\\ b \end{bmatrix}$; then $a^2+b^2=\mathbf{x}^T\mathbf{x}=\hbox{trace}(\mathbf{x}^T\mathbf{x})=\hbox{trace}(\mathbf{x}\mathbf{x}^T)=\hbox{trace}\begin{bmatrix} a^2&ab\\ ab&b^2 \end{bmatrix}$

• 徐建平 says:

Thank you, Professor Chou. Much obliged.