Friday, May 2, 2008

Model selection - Information criteria, part II

Now for the hardcore information criteria part :)

The goal is still the same - pick a model to maximize the log-likelihood of the data. This is given by the marginal likelihood

$$p(D \mid M) = \int p(D \mid \theta, M)\, p(\theta \mid M)\, d\theta$$

We can approximate the integral with a Laplace approximation, which is similar in idea to the previous post - the probability mass will be concentrated around the mode of the distribution. We can fit a normal distribution with the mode as the mean, and the covariance approximated from a Taylor expansion at the mode. The next 2 paragraphs can be skipped if you believe this :)

For example, to approximate a function $f(\theta)$ that has a mode (and thus a local maximum) at $\hat{\theta}$, we use the 2nd order Taylor expansion of its logarithm:

$$\ln f(\theta) \approx \ln f(\hat{\theta}) - \frac{1}{2} (\theta - \hat{\theta})^T A (\theta - \hat{\theta})$$

(the first order term is 0 because of the local maximum)

Taking $A$ as the negative of the second derivative matrix of $\ln f$ at the mode, we get

$$f(\theta) \approx f(\hat{\theta}) \exp\!\left( -\frac{1}{2} (\theta - \hat{\theta})^T A (\theta - \hat{\theta}) \right)$$

If we are looking for a probability distribution that is proportional to $f(\theta)$, we have $\hat{\theta}$ as the mean, $A^{-1}$ as the covariance matrix, and $(2\pi)^{d/2} |A|^{-1/2}$ as the normalizing coefficient - voila!
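
To make this concrete, here's a minimal numeric sketch of the trick (not part of the derivation - the density and all names are my own, chosen for illustration): find the mode of an unnormalized Gamma-like density, estimate $A$ by finite differences, and compare the Laplace estimate of the integral against quadrature.

```python
# Laplace approximation sketch in 1-D: fit a Gaussian at the mode of an
# unnormalized density f and use it to estimate the normalizing integral.
import numpy as np
from scipy.integrate import quad
from scipy.optimize import minimize_scalar

k = 5.0  # shape of an unnormalized Gamma(k, 1) density; exact integral is (k-1)! = 24

def log_f(theta):
    return (k - 1.0) * np.log(theta) - theta

# 1. Find the mode theta_hat (maximum of log f); analytically it is k - 1 = 4.
res = minimize_scalar(lambda t: -log_f(t), bounds=(1e-6, 50.0), method="bounded")
theta_hat = res.x

# 2. A = negative second derivative of log f at the mode (finite differences).
h = 1e-4
A = -(log_f(theta_hat + h) - 2.0 * log_f(theta_hat) + log_f(theta_hat - h)) / h**2

# 3. Laplace estimate of the integral: f(theta_hat) * sqrt(2*pi / A).
laplace = np.exp(log_f(theta_hat)) * np.sqrt(2.0 * np.pi / A)
exact, _ = quad(lambda t: t ** (k - 1.0) * np.exp(-t), 0.0, np.inf)
print(f"mode: {theta_hat:.3f}, Laplace: {laplace:.3f}, quadrature: {exact:.3f}")
```

Even in this skewed, decidedly non-Gaussian case, the Laplace estimate (about 23.5) lands close to the true value of 24.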

So we can fit a Gaussian to a function - back to information criteria. We'll fit a Gaussian to $p(D \mid \theta, M)\, p(\theta \mid M)$ at the mode $\hat{\theta}$ (the most likely parameter setting):

$$p(D \mid M) = \int p(D \mid \theta, M)\, p(\theta \mid M)\, d\theta \approx p(D \mid \hat{\theta}, M)\, p(\hat{\theta} \mid M)\, (2\pi)^{d/2}\, |A|^{-1/2}$$

Taking logarithms, with $d$ the number of parameters and $A$ the negative second derivative matrix of $\ln p(D \mid \theta, M)\, p(\theta \mid M)$ at the mode:

$$\ln p(D \mid M) \approx \ln p(D \mid \hat{\theta}, M) + \ln p(\hat{\theta} \mid M) + \frac{d}{2} \ln 2\pi - \frac{1}{2} \ln |A|$$

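As a sanity check on the formula, here's a sketch comparing the Laplace approximation of $\ln p(D \mid M)$ against the exact marginal likelihood. The model (a Bernoulli likelihood with a Beta prior) is my choice, picked because the conjugate pair has a closed-form answer to compare against.

```python
# Compare Laplace-approximated log evidence with the exact Beta-Bernoulli value.
import numpy as np
from scipy.special import betaln

rng = np.random.default_rng(0)
a, b = 2.0, 2.0                      # Beta prior parameters
x = rng.random(50) < 0.7             # N = 50 coin flips with true bias 0.7
N, s = x.size, x.sum()

# Exact log evidence: ln B(a + s, b + N - s) - ln B(a, b).
exact = betaln(a + s, b + N - s) - betaln(a, b)

# Laplace: mode and curvature of the unnormalized posterior g(t) = p(D|t) p(t).
def log_g(t):
    return (s + a - 1) * np.log(t) + (N - s + b - 1) * np.log(1 - t) - betaln(a, b)

t_hat = (s + a - 1) / (N + a + b - 2)                          # mode, closed form
A = (s + a - 1) / t_hat**2 + (N - s + b - 1) / (1 - t_hat)**2  # -d2/dt2 of log g
laplace = log_g(t_hat) + 0.5 * np.log(2 * np.pi) - 0.5 * np.log(A)
print(f"exact ln p(D|M): {exact:.4f}, Laplace: {laplace:.4f}")
```
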
As before, the first term is the fit of the model to the data. The rest of the terms are the complexity penalty. With a wide prior probability for the parameters, the second term is small, the third term is constant, and the last term scales with the number of data points $N$ - the main penalty comes from $-\frac{1}{2} \ln |A|$.

To evaluate the determinant $|A|$, we assume that $A$ has full rank, and that it is due to $N$ iid data points. This means that $A$ is the sum of contributions (negative second derivatives of the log-likelihood) due to the individual data points, and since the data is iid, $A = \sum_{i=1}^{N} A_i \approx N \bar{A}$ for an average per-point contribution $\bar{A}$. So $\ln |A| \approx \ln |N \bar{A}| = d \ln N + \ln |\bar{A}|$. Again, the last term is constant in $N$, so all in all (dropping the terms that don't grow with $N$) we have

$$\ln p(D \mid M) \approx \ln p(D \mid \hat{\theta}, M) - \frac{d}{2} \ln N$$

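Here's a sketch of this final formula doing model selection - fitting polynomials of increasing degree to data generated from a cubic (the data and Gaussian noise model are invented for illustration). The score should typically peak near the true degree of 3.

```python
# Score polynomial models by ln p(D | theta_hat) - (d/2) ln N.
import numpy as np

rng = np.random.default_rng(1)
N = 100
x = np.linspace(-1, 1, N)
y = 1.0 - 2.0 * x + 1.5 * x**3 + rng.normal(0.0, 0.2, N)  # cubic + noise

for degree in range(1, 8):
    coeffs = np.polyfit(x, y, degree)
    resid = y - np.polyval(coeffs, x)
    sigma2 = resid @ resid / N                 # MLE of the noise variance
    d = degree + 2                             # coefficients + noise variance
    log_lik = -0.5 * N * (np.log(2 * np.pi * sigma2) + 1.0)  # Gaussian max log-lik
    score = log_lik - 0.5 * d * np.log(N)      # the approximation derived above
    print(f"degree {degree}: score {score:8.2f}")
```

Degrees above 3 keep improving the log-likelihood slightly, but the $\frac{d}{2} \ln N$ penalty eats up the improvement.
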
To recap, we estimated the probability of the data under the model, using the Laplace approximation to fit a Gaussian to the likelihood times the prior, and used some simplifying assumptions to arrive at the final form.

The end result is pretty much the Bayesian Information Criterion, and it penalizes model complexity more than AIC for any reasonable amount of data. Note that the constants in front are not arbitrary, since we never made any simplifications for them - there's a genuine 2:1 ratio between the log-likelihood term and the $\frac{d}{2} \ln N$ penalty (the textbook BIC just multiplies everything by $-2$). That should show Matti :)
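
For the record, a couple of lines comparing the penalties under the usual $-2$ scaling: AIC charges $2d$, while BIC charges $d \ln N$, so BIC is harsher as soon as $N > e^2 \approx 7.4$.

```python
# AIC vs BIC complexity penalties as the sample size grows.
import numpy as np

d = 5  # number of parameters, arbitrary
for N in (10, 100, 1000, 10000):
    print(f"N={N:6d}  AIC penalty: {2 * d:5.1f}  BIC penalty: {d * np.log(N):6.1f}")
```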
