There are nice sentences in high school textbooks along the lines of 'misfolded proteins are recognized and degraded'. But in reality, it seems like a tough job to sort these proteins out. There are unfolded proteins, not completely folded proteins, completely folded proteins, and misfolded proteins - how does the cell distinguish which ones deserve to go on?
Anyway, it turns out there is a way of dealing with this. Firstly, everything in biology is shape. Shape, shape, shape - John Archer, an old professor, used to stress this a lot. You can recognize when a protein is not done folding because it displays portions that it shouldn't - for example, hydrophobic areas that would normally be buried in a beta-sheet.
Now how to distinguish between not completely folded and misfolded? This is where sugar tagging comes in. Proteins in the ER are N-glycosylated - the oligosaccharide is attached to asparagine side chains. A glucosyltransferase recognizes the exposed hydrophobic portions of the protein and adds a glucose back onto this N-linked oligosaccharide. As long as the sugar tag carries at least one glucose, the protein is recognized by calnexin and it cannot exit the ER. To escape calnexin binding, the bound glucose needs to be cleaved off by a glucosidase.
Proteins in the ER thus cycle between being bound by calnexin and having the glucose cleaved, and being recognized by the glucosyltransferase and having a glucose re-added, until they are completely folded.
This still leaves recognizing misfolded proteins - and apparently the mechanism is similar. Once a protein has spent enough time in the ER without getting completely folded, its sugar tag is modified into a mark recognized by a chaperone, which directs the protein out of the ER into the cytosol for degradation by the proteasome.
Monday, April 28, 2008
Iterative parameter finding
I should be getting ready for my viva, but instead, I reread this cool bit on how to find the best parameters for a model.
You have a model with parameters \underline \theta, and you want to fit the parameters to the data D. For the maximum log-likelihood, you would find the zero of the derivative f(\underline \theta) = \frac{\partial \ln L(\underline \theta)}{\partial \underline \theta}. But suppose that is nontrivial. Then you can use the Newton-Raphson method for iterative parameter finding.
Using the multivariate Taylor series, we can update the initial guess \underline{\hat \theta} for the best parameters:
f(\underline \theta) \approx f(\underline{\hat \theta}) + H(\underline{\hat \theta}) (\underline \theta - \underline{\hat \theta}) + small terms,
where H is the matrix of second derivatives of the log-likelihood. Setting f(\underline \theta) = 0 and dropping the small terms gives
\underline \theta = \underline{\hat \theta} - H(\underline{\hat \theta})^{-1} f(\underline {\hat \theta})
This iterative procedure is nice because the functions involved often have simple forms (e.g., in logistic regression, the gradient is a sum of the inputs weighted by the prediction errors), and it gives a solution in cases that are not analytically tractable.
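Here is a minimal sketch (not from the original post) of what these updates look like in code, using logistic regression as the example; the toy data and function names are made up for illustration.

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit_logistic_newton(Phi, t, n_iter=20):
    # Newton-Raphson for logistic regression (a.k.a. IRLS).
    # Phi: (N, M) design matrix, t: (N,) vector of 0/1 targets.
    N, M = Phi.shape
    w = np.zeros(M)
    for _ in range(n_iter):
        y = sigmoid(Phi @ w)              # current predictions
        grad = Phi.T @ (y - t)            # f(w): gradient of the negative log-likelihood
        R = np.diag(y * (1.0 - y))        # per-point weights
        H = Phi.T @ R @ Phi               # Hessian of the negative log-likelihood
        w = w - np.linalg.solve(H, grad)  # the Newton-Raphson step: w - H^{-1} f(w)
    return w

# Toy usage: one noisy feature plus a bias column.
rng = np.random.default_rng(0)
x = rng.normal(size=100)
t = (x + 0.3 * rng.normal(size=100) > 0).astype(float)
Phi = np.column_stack([np.ones_like(x), x])
print(fit_logistic_newton(Phi, t))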
Sunday, April 27, 2008
Model selection - Information criteria, part I
We covered a paper by Schadt and others last week that dealt with selecting the best model for the data. We want to select the simplest model that explains the most data, and there is a tradeoff between model fit and complexity.
In practice, people use various information criteria to formalize the tradeoff, usually of the form
I = \ln p(D | M) - penalty
where I is the information, p(D | M) is the probability of the data given the model, and the penalty term accounts for model complexity.
Matti asked me last week what the rationale behind choosing the penalty term is. I'll write about the 2 widely used forms (Akaike Information Criterion (AIC) this time, and Bayesian Information Criterion (BIC) the next), but I don't really know the answer to how the fit term should be weighted against the penalty either :)
Here's the intuition for AIC (the idea is presented in both MacKay's and Bishop's books): we have a set of models M_i, and want to select the one with the highest probability P(M_i | D) after seeing the data D. This is given by
P(M_i | D) \propto P(D | M_i) P(M_i)
If we take the prior over models to be uniform, we just need to evaluate the evidence P(D | M_i) for each model.
Pick one model M, and say it has K tunable parameters. Select one of them, w, and let's assume its prior distribution is flat with width \delta_{prior}, i.e.
p(w | M) = \frac{1}{\delta_{prior}}
We have
P(D | M) = \int P(D | M, w) p(w | M) dw
Suppose P(D | M, w) is sharply peaked around w_{MAP}, and the width of the peak is \delta_{MAP}. Then the probability of the data will come almost entirely from that region, and is given by
\int P(D | M, w) dw \simeq P(D | M, w_{MAP}) \delta_{MAP}
since the integral drops to 0 outside the peak. Combining that with the prior 1 / \delta_{prior}, and taking the log, we get
\ln P(D | M) \simeq \ln p(D | M, w_{MAP}) - \ln \frac{\delta_{prior}}{\delta_{MAP}}
Now repeating a similar argument over all parameters (taking K integrals), we get
\ln P(D | M) \simeq \ln p(D | M, \underline w_{MAP}) - K \ln \frac{\delta_{prior}}{\delta_{MAP}}
The first part of the expression is the fit of the model to the data, and the second part is a penalty that is linear in the number of parameters, scaled by the log of the fold-difference between the size of the prior and posterior parameter space.
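As a concrete (made-up) illustration of using a criterion of this form, here is a small sketch that compares polynomial fits with AIC, written as 2K - 2 ln L and assuming Gaussian noise; nothing below comes from the Schadt paper.

import numpy as np

def gaussian_log_likelihood(residuals):
    # Maximum-likelihood Gaussian log-likelihood given the fit residuals.
    n = len(residuals)
    sigma2 = np.mean(residuals ** 2)
    return -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)

def aic_for_poly_fit(x, t, degree):
    coeffs = np.polyfit(x, t, degree)
    residuals = t - np.polyval(coeffs, x)
    k = degree + 2                      # polynomial coefficients plus the noise variance
    return 2 * k - 2 * gaussian_log_likelihood(residuals)

# Toy data: a quadratic signal with noise; lower AIC is better, so the
# quadratic should beat both the too-simple and the too-flexible fit.
rng = np.random.default_rng(1)
x = np.linspace(-1, 1, 50)
t = 1.0 - 2.0 * x + 3.0 * x ** 2 + 0.1 * rng.normal(size=50)
for degree in (1, 2, 8):
    print(degree, aic_for_poly_fit(x, t, degree))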
Sunday, April 20, 2008
The Gaussian
I did not go deep enough in my math studies as an undergrad to get a glimpse of calculus of variations. It looks fascinating, but I won't pick it up, at least not yet.
Calculus of variations can be used to show why the Gaussian is interesting. It is a limiting distribution of several families - a sum of IID variables is approximately Gaussian - but it is also a distribution that conveys our ignorance of the data. Given that a distribution p(x) has mean \mu and variance \sigma^2, the normal distribution is the one that conveys the least extra information, i.e., it has the maximum entropy among all possible p(x) with those moments.
The way to show it involves calculus of variations and Lagrange multipliers. We encode the 3 conditions on the distribution function (it integrates to 1, and has the given mean and variance), and combine them with the entropy in the Lagrangian:
L[p] = -\int p(x) \ln p(x) dx + \lambda_1 \left(\int p(x) dx - 1\right) + \lambda_2 \left(\int x p(x) dx - \mu \right) + \lambda_3 \left( \int p(x) (x - \mu)^2 dx - \sigma^2 \right)
Now differentiating with respect to p(x), and finding the maximum, we get
-\ln p(x) - 1 + \lambda_1 + \lambda_2 x + \lambda_3 (x - \mu)^2 = 0
from where (absorbing the constants into Z and the linear part of \lambda_3 (x - \mu)^2 into \lambda_2) we instantly get the form of the Gaussian:
p(x) = \frac{1}{Z} \exp(\lambda_2 x + \lambda_3 x^2)
Completing the square and solving for the Lagrange multipliers using the constraints for mean and variance, we arrive at the Gaussian distribution. This holds similarly for the multivariate case.
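Explicitly, solving the constraints for the multipliers gives \lambda_3 = -\frac{1}{2\sigma^2} and \lambda_2 = \frac{\mu}{\sigma^2}, so that
p(x) = \frac{1}{Z} \exp\left( \frac{\mu x}{\sigma^2} - \frac{x^2}{2\sigma^2} \right) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\left( -\frac{(x - \mu)^2}{2 \sigma^2} \right)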
So having a Gaussian as a prior distribution for observed data is equivalent to saying that we know nothing about the data except its mean and variance. Once again - cool :)
Monday, April 14, 2008
Ion channels are amazing
I just realised that so far it's been quite math-heavy. So now for something different, inspired by Molecular Biology of the Cell (Alberts et al.), 5th edition.
Cells rely on different gradients (voltage, concentration) to drive processes. The gradients are maintained by active and passive transport across the membranes; the K+ leak channel takes care of part of this. The cool bit is its selectivity - this channel conducts K+ 10,000 times better than Na+, even though both ions are almost uniform spheres of similar diameter.
Firstly, the channel is selective for cations by virtue of negatively charged amino acids at its entrance. Then there is a maximum size limit on how fat an ion can be and still fit through; the skinny ones make it past this first hurdle into the vestibule. But how do you make sure the smaller cations do not slip by?
All cations are associated with polar water molecules, and to pass on from the vestibule into the narrow selectivity filter, an ion has to shed these water molecules, which costs energy. For K+, this cost is exactly balanced by bonds formed with carbonyl oxygens lining the filter; Na+, however, is too small to form those bonds, so passing on from the vestibule is energetically unfavorable for it. Neat!
Sunday, April 13, 2008
Least squares intuition
The ideas for these will probably keep coming from Chris Bishop's book. Today, I liked the intuition behind the least squares solution for the (a bit generalized) linear regression problem.
The general problem is this: given a set of N data points (x_n, t_n), we want to find a predictor function y(x, \underline w) of the form
y(x, \underline w) = \sum_{j=1}^M w_j \phi_j(x)
that minimizes the mean-squared error
E(\underline w) = \frac{1}{2} \sum_{n=1}^N \left( t_n - y(x_n, \underline w) \right)^2
The \phi_j are basis functions that allow for richer models, and the w_j are weights of the basis. For ordinary linear regression, we can take \phi_j(x) = x_j, but in general, we can try to match the output with any basis functions - gaussians, sinusoids, sigmoids, etc.
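For instance, here is a small sketch (my own toy example, not from the book) of building a design matrix from Gaussian basis functions; the centres and width are arbitrary choices.

import numpy as np

def gaussian_design_matrix(x, centres, width):
    # Column j holds phi_j(x_n) = exp(-(x_n - centre_j)^2 / (2 * width^2)).
    return np.exp(-((x[:, None] - centres[None, :]) ** 2) / (2 * width ** 2))

x = np.linspace(0, 1, 20)
Phi = gaussian_design_matrix(x, centres=np.linspace(0, 1, 5), width=0.2)
print(Phi.shape)   # (20, 5): N data points by M basis functions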
Now, consider an N-dimensional space whose axes are given by the regression targets t_n. Then any basis function evaluated at the N data points is also a point in this space:
\underline \phi_j = \left( \phi_j(x_1), \ldots, \phi_j(x_N) \right)^T
If the number of basis functions M is less than the number of data points N, then the linear combinations of the basis function values define a linear subspace of dimension M within it.
In particular,
\underline y = \left( y(x_1, \underline w), \ldots, y(x_N, \underline w) \right)^T
is a point in this subspace for any choice of \underline w.
Now for the cherry on the cake - the choice of weights \underline w that minimizes the error E(\underline w) corresponds to the choice of \underline y that is the orthogonal projection of the data vector \underline t = (t_1, \ldots, t_N)^T onto the subspace spanned by the basis functions.
Perhaps obvious (and proof omitted), but I thought it was nice that the world is consistent :)
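A quick numerical sanity check of this picture (a sketch with made-up data and a polynomial basis of my choosing): the least-squares fitted values coincide with the orthogonal projection of the target vector onto the column space of the design matrix.

import numpy as np

rng = np.random.default_rng(2)
N, M = 30, 4
x = np.linspace(0, 1, N)
t = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=N)     # regression targets

# Design matrix: column j holds the basis function phi_j(x) = x^j at all N points.
Phi = np.column_stack([x ** j for j in range(M)])

# Least-squares weights and the corresponding vector of fitted values.
w, *_ = np.linalg.lstsq(Phi, t, rcond=None)
y = Phi @ w

# Orthogonal projection of t onto the subspace spanned by the basis vectors.
projection = Phi @ np.linalg.solve(Phi.T @ Phi, Phi.T @ t)

print(np.allclose(y, projection))   # True: the minimizer is the projection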