Sunday, April 27, 2008

Model selection - Information criteria, part I

We covered a paper by Schadt and others last week that dealt with selecting the best model for the data. We want to select the simplest model that explains the most data, and there is a tradeoff between model fit and complexity.

In practice, people use various information criteria to formalize the tradeoff, usually of the form

$$\mathrm{IC} = -2 \ln P(D \mid M) + \text{penalty},$$

where $\mathrm{IC}$ is the information criterion, $P(D \mid M)$ is the probability of the data given the model, and the penalty term grows with model complexity.

Matti asked me last week what the rationale behind choosing the penalty term is. I'll write about the 2 widely used forms (Akaike Information Criterion (AIC) this time, and Bayesian Information Criterion (BIC) the next), but I don't really know the answer for how the fit term should be weighted against the penalty in either :)
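
Just to make the generic form concrete, here's a minimal sketch in Python. The log-likelihoods and parameter counts below are made-up numbers, and I'm plugging in the standard textbook penalties ($2k$ for AIC and $k \ln N$ for BIC, with $k$ parameters and $N$ data points), which I'll get to in these posts:

import numpy as np

def information_criterion(log_likelihood, penalty):
    # Generic form: -2 * ln P(D | M) + penalty; lower is better.
    return -2.0 * log_likelihood + penalty

n = 100  # number of data points (hypothetical)

# Made-up maximized log-likelihoods and parameter counts for two models
models = {"simple": (-158.3, 2), "complex": (-155.1, 6)}

for name, (logL, k) in models.items():
    aic = information_criterion(logL, 2 * k)          # AIC penalty: 2k
    bic = information_criterion(logL, k * np.log(n))  # BIC penalty: k ln N
    print(f"{name:8s} AIC = {aic:.1f}  BIC = {bic:.1f}")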

Here's the intuition for AIC (the idea is presented in both MacKay and Bishop books): we have a set of models $\{M_i\}$, and want to select the one with the highest probability after seeing the data $D$. This is given by

$$P(M_i \mid D) \propto P(D \mid M_i)\, P(M_i).$$

If we take the prior over models to be uniform, we just need to evaluate the evidence $P(D \mid M_i)$ for each model.
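
As a small numerical sketch of that last step (the log-evidences below are made-up numbers): with a uniform prior, the posterior over models is just the normalized evidence.

import numpy as np

# Hypothetical log-evidences ln P(D | M_i) for three candidate models
log_evidence = np.array([-162.4, -159.8, -161.1])

# With a uniform prior P(M_i) = const, the posterior over models is the
# normalized evidence: P(M_i | D) = P(D | M_i) / sum_j P(D | M_j).
# Work in log space to avoid underflow.
log_posterior = log_evidence - np.logaddexp.reduce(log_evidence)
posterior = np.exp(log_posterior)

for i, p in enumerate(posterior):
    print(f"M_{i}: P(M | D) = {p:.3f}")
print("selected model:", int(np.argmax(posterior)))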

Pick one model $M$, and say it has $K$ tunable parameters. Select one of them, $w$, and let's assume its prior distribution is flat with width $\Delta w_{\text{prior}}$.

We have

$$P(D \mid M) = \int P(D \mid w, M)\, P(w \mid M)\, dw.$$

Suppose $P(D \mid w, M)$ is sharply peaked around its most probable value $w_{\text{MAP}}$, and the width of the peak is $\Delta w_{\text{posterior}}$. Then the probability of $D$ will come almost entirely from that region, and the integral over the likelihood is approximately $P(D \mid w_{\text{MAP}}, M)\, \Delta w_{\text{posterior}}$, since the integrand drops to 0 outside the peak. Combining that with the prior $P(w \mid M) = 1/\Delta w_{\text{prior}}$, and taking the log, we get

$$\ln P(D \mid M) \approx \ln P(D \mid w_{\text{MAP}}, M) + \ln \frac{\Delta w_{\text{posterior}}}{\Delta w_{\text{prior}}}.$$
Now, repeating the same argument over all $K$ parameters (taking $K$ integrals), and assuming for simplicity that each parameter has the same prior and posterior widths, we get

$$\ln P(D \mid M) \approx \ln P(D \mid \mathbf{w}_{\text{MAP}}, M) + K \ln \frac{\Delta w_{\text{posterior}}}{\Delta w_{\text{prior}}}.$$
The first part of the expression is the fit of the model to the data, and the second part is a penalty linear in the number of parameters, scaled by the log of the fold-difference between the sizes of the prior and posterior parameter spaces.
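
To convince myself the single-parameter approximation above is reasonable, here's a sketch that compares the exact evidence (by numerical integration) against the peak-height-times-width-ratio approximation, for a hypothetical flat prior and a Gaussian-shaped likelihood (all numbers and names are made up):

import numpy as np
from scipy.integrate import quad

# Hypothetical 1-D setup: flat prior of width dw_prior, likelihood sharply
# peaked around w_map with a Gaussian shape of scale sigma.
dw_prior = 10.0            # prior: w uniform on [0, dw_prior]
w_map, sigma = 3.0, 0.05   # peak location and scale of the likelihood
peak = 0.02                # P(D | w_map, M), the likelihood at its peak

def likelihood(w):
    return peak * np.exp(-0.5 * ((w - w_map) / sigma) ** 2)

# Exact evidence: P(D|M) = integral of P(D|w,M) P(w|M) dw, with P(w|M) = 1/dw_prior.
# 'points' tells quad where the narrow peak is so it isn't missed.
evidence, _ = quad(lambda w: likelihood(w) / dw_prior, 0.0, dw_prior,
                   points=[w_map])

# Occam-factor approximation: ln P(D|M) ~ ln peak + ln(dw_post / dw_prior),
# taking the effective peak width as sigma * sqrt(2*pi) for a Gaussian peak.
dw_post = sigma * np.sqrt(2.0 * np.pi)
approx = np.log(peak) + np.log(dw_post / dw_prior)

print(f"ln evidence (exact) : {np.log(evidence):.4f}")
print(f"ln evidence (approx): {approx:.4f}")

The two numbers agree closely as long as the peak sits well inside the prior range, which is exactly the "sharply peaked" assumption made above.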
