Sunday, April 27, 2008

Model selection - Information criteria, part I

We covered a paper by Schadt and others last week that dealt with selecting the best model for the data. We want to select the simplest model that explains the most data, and there is a tradeoff between model fit and complexity.

In practice, people use various information criteria to formalize the tradeoff, usually of the form

$$\mathrm{IC} = -2 \ln P(D \mid M) + \text{penalty},$$

where $\mathrm{IC}$ is the information criterion, $P(D \mid M)$ is the probability of the data given the model, and the penalty term grows with model complexity.

Matti asked me last week what the rationale behind choosing the penalty term is. I'll write about the 2 widely used forms (Akaike Information Criterion (AIC) this time, and Bayesian Information Criterion (BIC) the next), but I don't really know the answer for how the fit term should be weighted against the penalty in either :)
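
Just to make the generic form concrete, here's a minimal sketch in Python. The log-likelihoods and parameter counts below are made-up numbers, and I'm plugging in the standard textbook penalties ($2k$ for AIC and $k \ln N$ for BIC, with $k$ parameters and $N$ data points), which I'll get to in these posts:

import numpy as np

def information_criterion(log_likelihood, penalty):
    # Generic form: -2 * ln P(D | M) + penalty; lower is better.
    return -2.0 * log_likelihood + penalty

n = 100  # number of data points (hypothetical)

# Made-up maximized log-likelihoods and parameter counts for two models
models = {"simple": (-158.3, 2), "complex": (-155.1, 6)}

for name, (logL, k) in models.items():
    aic = information_criterion(logL, 2 * k)          # AIC penalty: 2k
    bic = information_criterion(logL, k * np.log(n))  # BIC penalty: k ln N
    print(f"{name:8s} AIC = {aic:.1f}  BIC = {bic:.1f}")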

Here's the intuition for AIC (the idea is presented in both MacKay and Bishop books): we have a set of models $\{M_i\}$, and want to select the one with the highest probability after seeing the data $D$. This is given by

$$P(M_i \mid D) \propto P(D \mid M_i)\, P(M_i).$$

If we take the prior over models to be uniform, we just need to evaluate the evidence $P(D \mid M_i)$ for each model.
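
As a small numerical sketch of that last step (the log-evidences below are made-up numbers): with a uniform prior, the posterior over models is just the normalized evidence.

import numpy as np

# Hypothetical log-evidences ln P(D | M_i) for three candidate models
log_evidence = np.array([-162.4, -159.8, -161.1])

# With a uniform prior P(M_i) = const, the posterior over models is the
# normalized evidence: P(M_i | D) = P(D | M_i) / sum_j P(D | M_j).
# Work in log space to avoid underflow.
log_posterior = log_evidence - np.logaddexp.reduce(log_evidence)
posterior = np.exp(log_posterior)

for i, p in enumerate(posterior):
    print(f"M_{i}: P(M | D) = {p:.3f}")
print("selected model:", int(np.argmax(posterior)))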

Pick one model $M$, and say it has $K$ tunable parameters. Select one of them, $w$, and let's assume its prior distribution is flat with width $\Delta w_{\text{prior}}$.

We have

$$P(D \mid M) = \int P(D \mid w, M)\, P(w \mid M)\, dw.$$

Suppose $P(D \mid w, M)$ is sharply peaked around its most probable value $w_{\text{MAP}}$, and the width of the peak is $\Delta w_{\text{posterior}}$. Then the probability of $D$ will come almost entirely from that region, and the integral over the likelihood is approximately $P(D \mid w_{\text{MAP}}, M)\, \Delta w_{\text{posterior}}$, since the integrand drops to 0 outside the peak. Combining that with the prior $P(w \mid M) = 1/\Delta w_{\text{prior}}$, and taking the log, we get

$$\ln P(D \mid M) \approx \ln P(D \mid w_{\text{MAP}}, M) + \ln \frac{\Delta w_{\text{posterior}}}{\Delta w_{\text{prior}}}.$$
Now, repeating the same argument over all $K$ parameters (taking $K$ integrals), and assuming for simplicity that each parameter has the same prior and posterior widths, we get

$$\ln P(D \mid M) \approx \ln P(D \mid \mathbf{w}_{\text{MAP}}, M) + K \ln \frac{\Delta w_{\text{posterior}}}{\Delta w_{\text{prior}}}.$$
The first part of the expression is the fit of the model to the data, and the second part is a penalty linear in the number of parameters, scaled by the log of the fold-difference between the sizes of the prior and posterior parameter spaces.
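
To convince myself the single-parameter approximation above is reasonable, here's a sketch that compares the exact evidence (by numerical integration) against the peak-height-times-width-ratio approximation, for a hypothetical flat prior and a Gaussian-shaped likelihood (all numbers and names are made up):

import numpy as np
from scipy.integrate import quad

# Hypothetical 1-D setup: flat prior of width dw_prior, likelihood sharply
# peaked around w_map with a Gaussian shape of scale sigma.
dw_prior = 10.0            # prior: w uniform on [0, dw_prior]
w_map, sigma = 3.0, 0.05   # peak location and scale of the likelihood
peak = 0.02                # P(D | w_map, M), the likelihood at its peak

def likelihood(w):
    return peak * np.exp(-0.5 * ((w - w_map) / sigma) ** 2)

# Exact evidence: P(D|M) = integral of P(D|w,M) P(w|M) dw, with P(w|M) = 1/dw_prior.
# 'points' tells quad where the narrow peak is so it isn't missed.
evidence, _ = quad(lambda w: likelihood(w) / dw_prior, 0.0, dw_prior,
                   points=[w_map])

# Occam-factor approximation: ln P(D|M) ~ ln peak + ln(dw_post / dw_prior),
# taking the effective peak width as sigma * sqrt(2*pi) for a Gaussian peak.
dw_post = sigma * np.sqrt(2.0 * np.pi)
approx = np.log(peak) + np.log(dw_post / dw_prior)

print(f"ln evidence (exact) : {np.log(evidence):.4f}")
print(f"ln evidence (approx): {approx:.4f}")

The two numbers agree closely as long as the peak sits well inside the prior range, which is exactly the "sharply peaked" assumption made above.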
