Exponential Family
- Definition: A distribution over \(x\) with parameter \(\eta\) is said to be a member of the exponential family if it can be expressed in the form \[ p(x | \eta) = h(x) g(\eta) \exp \left\{ \eta^\top u(x) \right\}\] in which
- \(\eta\) is the natural parameter of the distribution
- \(u(x)\) is the sufficient statistic, a function of the data \(x\) that captures all the information in the data relevant to the parameter \(\eta\)
- \(h(x)\) is the base measure. It is usually a constant or a simple function of \(x\)
- \(g(\eta)\) is the normalization factor ensuring the distribution integrates to 1
For the Bernoulli distribution, we have \[ \begin{align*} p(x | \mu) &= \text{Bern}(x | \mu) = \mu^x (1 - \mu)^{1 - x} \\ &= \exp \{ x \ln \mu + (1 - x) \ln (1 - \mu) \} \\ &= (1 - \mu) \exp \left\{ \ln \left( \frac{\mu}{1 - \mu} \right) x \right\} \end{align*}\] where \[ \begin{align*} \eta &= \ln \left( \frac{\mu}{1 - \mu} \right) \Leftrightarrow \mu = \sigma(\eta) = \frac{1}{1 + \exp(-\eta)} \\ u(x) &= x \\ h(x) &= 1 \\ g(\eta) &= 1 - \mu = \frac{1}{1+\exp(\eta)} = \sigma(-\eta) \end{align*} \] Thus, the Bernoulli distribution can be expressed in exponential-family form as \[p(x|\eta)=\sigma(-\eta)\exp(\eta x)\]
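As a quick numerical sanity check (a minimal NumPy sketch; the value \(\mu = 0.3\) is arbitrary), the exponential-family form reproduces the standard Bernoulli mass function:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

mu = 0.3                              # arbitrary test value
eta = np.log(mu / (1 - mu))           # natural parameter (logit of mu)
for x in (0, 1):
    bern = mu**x * (1 - mu)**(1 - x)          # standard form
    expfam = sigmoid(-eta) * np.exp(eta * x)  # exponential-family form
    assert np.isclose(bern, expfam)
```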
For the multinomial distribution (over a one-of-\(M\) binary vector \(\mathbf{x}\)), we have \[ \begin{align*} p(\mathbf{x} | \boldsymbol{\mu}) &= \text{Mult}(\mathbf{x} | \boldsymbol{\mu}) = \prod_{k = 1}^M \mu_k^{x_k} \\ &= \exp \left\{ \sum_{k = 1}^M x_k \ln \mu_k \right\} \end{align*}\] The parameters are respectively \[\eta_k = \ln \mu_k, \quad u(\mathbf{x}) = \mathbf{x}, \quad h(\mathbf{x}) = 1, \quad g(\boldsymbol{\eta}) = 1\] Note that the \(\eta_k\) are not independent, since \(\sum_{k=1}^M \mu_k = 1\): there are only \(M-1\) independent parameters, and \(\mu_M\) can be expressed in terms of the other \(M-1\). We therefore redefine \[\begin{align*} \mu_M &= 1 - \sum\limits_{k = 1}^{M - 1} \mu_k \\ \eta_k &= \ln \frac{\mu_k}{1 - \sum_{j = 1}^{M - 1} \mu_j} \Leftrightarrow \mu_k = \frac{\exp(\eta_k)}{1 + \sum_{j = 1}^{M - 1} \exp(\eta_j)} \end{align*}\] Thus, the multinomial distribution can also be expressed in exponential-family form as \[p(\mathbf{x}|\boldsymbol{\eta})=\exp(\boldsymbol{\eta}^\top \mathbf{x})\frac{1}{1+\sum_{j=1}^{M-1}\exp(\eta_j)}\] The parameters are respectively \[ \begin{align*} \eta_k &= \ln \frac{\mu_k}{1-\sum_{j=1}^{M-1}\mu_j}, \quad \boldsymbol{\eta} = (\eta_1, \ldots, \eta_{M-1}, 0)^\top \\ u(\mathbf{x}) &= \mathbf{x} \\ h(\mathbf{x}) &= 1 \\ g(\boldsymbol{\eta}) &= \frac{1}{1+\sum_{j=1}^{M-1}\exp(\eta_j)} \end{align*} \]
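The map \(\boldsymbol{\mu} \mapsto \boldsymbol{\eta}\) and its softmax-style inverse are easy to check numerically (a minimal sketch; the probabilities below are arbitrary):

```python
import numpy as np

mu = np.array([0.1, 0.2, 0.3, 0.4])   # arbitrary probabilities, M = 4
# Natural parameters relative to mu_M (eta_M is fixed to 0)
eta = np.log(mu[:-1] / mu[-1])
# Softmax-style inverse recovers mu_1, ..., mu_{M-1}
mu_rec = np.exp(eta) / (1.0 + np.exp(eta).sum())
assert np.allclose(mu_rec, mu[:-1])
```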
For the Gaussian distribution, we have \[ \begin{align*} p(x | \mu, \sigma^2) &= \frac{1}{(2 \pi \sigma^2)^{1/2} } \exp \left\{ -\frac{1}{2 \sigma^2} (x - \mu)^2 \right\} \\ &= \frac{1}{(2 \pi \sigma^2)^{1/2} } \exp \left\{ -\frac{1}{2 \sigma^2} x^2 + \frac{\mu}{\sigma^2} x - \frac{1}{2 \sigma^2} \mu^2 \right\} \\ &= h(x) g(\eta) \exp \left\{ \eta^\top u(x) \right\} \end{align*} \] where the parameters are respectively \[ \begin{align*} \eta = \begin{bmatrix} \frac{\mu}{\sigma^2} \\ -\frac{1}{2 \sigma^2} \end{bmatrix}, \quad u(x) = \begin{bmatrix} x \\ x^2 \end{bmatrix}, \quad h(x) = (2 \pi)^{-1/2}, \quad g(\eta) = (-2 \eta_2)^{1/2} \exp \left( \frac{\eta_1^2}{4 \eta_2} \right) \end{align*} \]
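Again, a quick numerical check (a sketch using SciPy; \(\mu = 1.5\) and \(\sigma = 0.8\) are arbitrary test values) confirms that \(h(x)\,g(\eta)\exp\{\eta^\top u(x)\}\) reproduces the Gaussian density:

```python
import numpy as np
from scipy.stats import norm

mu, sigma = 1.5, 0.8                  # arbitrary test values
eta1, eta2 = mu / sigma**2, -1.0 / (2 * sigma**2)
x = np.linspace(-3.0, 5.0, 9)
h = (2 * np.pi) ** -0.5
g = np.sqrt(-2 * eta2) * np.exp(eta1**2 / (4 * eta2))
expfam = h * g * np.exp(eta1 * x + eta2 * x**2)
assert np.allclose(expfam, norm.pdf(x, mu, sigma))
```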
Maximum Likelihood for Exponential Family
First, we establish an important relationship that will be useful in later derivations. Since the distribution is normalized, \[ \int h(\mathbf{x})g(\boldsymbol{\eta})\exp \left\{ \boldsymbol{\eta}^\top \mathbf{u}(\mathbf{x}) \right\} d\mathbf{x} = \int p(\mathbf{x}|\boldsymbol{\eta})\, d\mathbf{x} = 1 \] Taking the gradient of both sides with respect to \(\boldsymbol{\eta}\) gives \[ \nabla g(\boldsymbol{\eta}) \int h(\mathbf{x}) \exp \left\{ \boldsymbol{\eta}^\top \mathbf{u}(\mathbf{x}) \right\} d\mathbf{x} + g(\boldsymbol{\eta}) \int h(\mathbf{x}) \exp \left\{ \boldsymbol{\eta}^\top \mathbf{u}(\mathbf{x}) \right\} \mathbf{u}(\mathbf{x})\, d\mathbf{x} = \mathbf{0} \] Notice that \[ \begin{align*} p(\mathbf{x}|\boldsymbol{\eta}) &= h(\mathbf{x})g(\boldsymbol{\eta})\exp \left\{ \boldsymbol{\eta}^\top \mathbf{u}(\mathbf{x}) \right\} \\ &\Rightarrow \int \frac{p(\mathbf{x}|\boldsymbol{\eta})}{g(\boldsymbol{\eta})}\, d\mathbf{x} = \int h(\mathbf{x})\exp \left\{ \boldsymbol{\eta}^\top \mathbf{u}(\mathbf{x}) \right\} d\mathbf{x} = \frac{1}{g(\boldsymbol{\eta})} \end{align*} \] and \[ \begin{align*} g(\boldsymbol{\eta}) \int h(\mathbf{x}) \exp \left\{ \boldsymbol{\eta}^\top \mathbf{u}(\mathbf{x}) \right\} \mathbf{u}(\mathbf{x})\, d\mathbf{x} &= \int p(\mathbf{x}|\boldsymbol{\eta})\, \mathbf{u}(\mathbf{x})\, d\mathbf{x} = \mathbb{E}\left[ \mathbf{u}(\mathbf{x}) \right] \end{align*} \] Substituting these two results into the gradient equation and using \(\nabla g(\boldsymbol{\eta})/g(\boldsymbol{\eta}) = \nabla \ln g(\boldsymbol{\eta})\), we obtain \[-\nabla\ln g(\boldsymbol{\eta})=\mathbb{E}[\mathbf{u}(\mathbf{x})]\]
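For instance, for the Bernoulli case above with \(g(\eta) = \sigma(-\eta)\), this identity yields the mean directly: \[-\frac{d}{d\eta}\ln \sigma(-\eta) = \frac{d}{d\eta}\ln\left(1 + e^{\eta}\right) = \frac{e^{\eta}}{1 + e^{\eta}} = \sigma(\eta) = \mu = \mathbb{E}[x]\]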
Next, consider maximum likelihood estimation for the exponential family, which involves the same quantity. Given the dataset \(\mathbf{X} = \{ \mathbf{x}_1, \dots, \mathbf{x}_N \}\), the likelihood function is \[ p(\mathbf{X}|\boldsymbol{\eta}) = \prod_{n=1}^N p(\mathbf{x}_n|\boldsymbol{\eta}) = \left( \prod_{n=1}^N h(\mathbf{x}_n) \right) g(\boldsymbol{\eta})^N \exp \left\{ \boldsymbol{\eta}^\top \sum_{n=1}^N \mathbf{u}(\mathbf{x}_n) \right\} \] Taking the logarithm and setting its gradient with respect to \(\boldsymbol{\eta}\) to zero, we have \[ \begin{align*} \ell(\boldsymbol{\eta}) &= \ln p(\mathbf{X}|\boldsymbol{\eta}) = \sum_{n=1}^N \ln h(\mathbf{x}_n) + N \ln g(\boldsymbol{\eta}) + \boldsymbol{\eta}^\top \sum_{n=1}^N \mathbf{u}(\mathbf{x}_n) \\ &\Rightarrow \nabla \ell(\boldsymbol{\eta}) = N \frac{\nabla g(\boldsymbol{\eta})}{g(\boldsymbol{\eta})} + \sum_{n=1}^N \mathbf{u}(\mathbf{x}_n) = \mathbf{0} \\ &\Rightarrow -\nabla \ln g(\boldsymbol{\eta}_{\text{ML} }) = \frac{1}{N} \sum_{n=1}^N \mathbf{u}(\mathbf{x}_n). \end{align*} \] Comparing with the identity \(-\nabla \ln g(\boldsymbol{\eta}) = \mathbb{E}[\mathbf{u}(\mathbf{x})]\), the maximum likelihood solution matches the expected sufficient statistic under the model to its sample average (moment matching).
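For example, in the Bernoulli case this condition reads \(\sigma(\eta_{\text{ML}}) = \frac{1}{N}\sum_{n=1}^N x_n\), i.e. \(\mu_{\text{ML}}\) is simply the sample mean, recovering the familiar result.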
Conjugate Prior for Exponential Family
For any member of the exponential family, there exists a conjugate prior that ensures the posterior distribution has the same functional form as the prior. This prior can be written as \[ p(\boldsymbol{\eta} | \boldsymbol{\chi}, \nu) = f(\boldsymbol{\chi}, \nu) g(\boldsymbol{\eta})^\nu \exp \left\{ \nu \boldsymbol{\eta}^\top \boldsymbol{\chi} \right\} \] in which
- \(\boldsymbol{\chi}\) represents the value of the sufficient statistic for a set of virtual or pseudo-observations
- \(\nu\) is a hyperparameter interpreted as an effective number of pseudo-observations. It is often called the "strength" or "pseudo-count" of the prior, indicating the weight of the prior information
- \(f(\boldsymbol{\chi}, \nu)\) is a normalization coefficient
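As a concrete instance (a quick check using the Bernoulli quantities derived earlier): with \(g(\eta) = \sigma(-\eta) = 1 - \mu\) and scalar \(\chi\), \[ p(\eta | \chi, \nu) \propto (1-\mu)^{\nu} \left( \frac{\mu}{1-\mu} \right)^{\nu\chi} = \mu^{\nu\chi} (1-\mu)^{\nu - \nu\chi} \] and changing variables to \(\mu\) with \(\frac{d\eta}{d\mu} = \frac{1}{\mu(1-\mu)}\) gives \(p(\mu) \propto \mu^{\nu\chi - 1}(1-\mu)^{\nu - \nu\chi - 1}\), i.e. a \(\text{Beta}(\nu\chi,\, \nu - \nu\chi)\) distribution, the familiar conjugate prior for the Bernoulli.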
In Bayesian inference, we combine the likelihood function with the prior distribution to obtain the posterior distribution. Given a dataset \(\mathbf{X}=\{\mathbf{x}_{1},\cdots,\mathbf{x}_{N}\}\), the likelihood function is \[ p(\mathbf{X}|\boldsymbol{\eta}) \propto g(\boldsymbol{\eta})^N \exp \left\{ \boldsymbol{\eta}^\top \sum_{n=1}^N \mathbf{u}(\mathbf{x}_n) \right\} \] Combining this likelihood with the prior yields the posterior distribution \[ p(\boldsymbol{\eta}|\mathbf{X}, \boldsymbol{\chi}, \nu) \propto g(\boldsymbol{\eta})^{\nu + N} \exp \left\{ \boldsymbol{\eta}^\top \left( \sum_{n=1}^N \mathbf{u}(\mathbf{x}_n) + \nu \boldsymbol{\chi} \right) \right\} \] which has the same functional form as the prior, confirming conjugacy: the update simply replaces \(\nu \to \nu + N\) and \(\nu\boldsymbol{\chi} \to \nu\boldsymbol{\chi} + \sum_{n=1}^N \mathbf{u}(\mathbf{x}_n)\).
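A minimal sketch of this update for the Bernoulli case, where \(u(x) = x\) (the hyperparameter values and data below are arbitrary):

```python
import numpy as np

nu, chi = 2.0, 0.5                 # prior: 2 pseudo-observations with mean statistic 0.5
X = np.array([1, 0, 1, 1, 0, 1])   # observed Bernoulli data; u(x) = x
N = len(X)

# Posterior hyperparameters in the same (chi, nu) parametrization
nu_post = nu + N                           # 8.0
chi_post = (nu * chi + X.sum()) / nu_post  # 0.625
print(nu_post, chi_post)
```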