There are nice sentences in high school textbooks along the lines of 'misfolded proteins are recognized and degraded'. But in reality, it seems like a tough job to sort these proteins out. There are unfolded proteins, not completely folded proteins, completely folded proteins, and misfolded proteins - how does the cell distinguish which ones deserve to go on?
Anyway, it turns out there is a way of dealing with this. Firstly, everything in biology is shape. Shape, shape, shape - John Archer, an old professor, used to stress this a lot. You can recognize when a protein is not done folding because it displays portions that it shouldn't - for example, hydrophobic areas that would normally be buried in a beta-sheet.
Now how to distinguish between not completely folded and misfolded? This is where sugar tagging comes in. Proteins in the ER are N-glycosylated - the oligosaccharide is attached to asparagine side chains. A glucosyltransferase recognizes the exposed hydrophobic portions of the protein and adds a glucose back onto this N-linked oligosaccharide. As long as the sugar tag carries at least one glucose, the protein is recognized by calnexin and it cannot exit the ER. To escape calnexin binding, the bound glucose needs to be cleaved off by a glucosidase.
Proteins in the ER thus cycle between being bound by calnexin and having the glucose cleaved, and being recognized by the glucosyltransferase and having a glucose re-added, until they are completely folded.
This still leaves recognizing misfolded proteins - and apparently the mechanism is similar. Once a protein has spent enough time in the ER without getting completely folded, its sugar tag is modified into a mark recognized by a chaperone, which directs the protein out of the ER into the cytosol for degradation by the proteasome.
Monday, April 28, 2008
Iterative parameter finding
I should be getting ready for my viva, but instead, I reread this cool bit on how to find the best parameters for a model.
You have a model with parameters \underline \theta, and you want to fit the parameters to the data D. For the maximum log-likelihood, you would find the zero of the derivative f(\underline \theta) = \frac{\partial \ln L(\underline \theta)}{\partial \underline \theta}. But suppose that is nontrivial. Then you can use the Newton-Raphson method for iterative parameter finding.
Using the multivariate Taylor series, we can update the initial guess \underline{\hat \theta} for the best parameters:
f(\underline \theta) \approx f(\underline{\hat \theta}) + H(\underline{\hat \theta}) (\underline \theta - \underline{\hat \theta}) + small terms,
where H is the matrix of second derivatives of the log-likelihood. Setting f(\underline \theta) = 0 and dropping the small terms gives
\underline \theta = \underline{\hat \theta} - H(\underline{\hat \theta})^{-1} f(\underline {\hat \theta})
This iterative procedure is nice because the functions involved often have simple forms (e.g., in logistic regression, the gradient is a sum of the inputs weighted by the prediction errors), and it gives a solution in cases that are not analytically tractable.
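Here is a minimal sketch (not from the original post) of what these updates look like in code, using logistic regression as the example; the toy data and function names are made up for illustration.

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit_logistic_newton(Phi, t, n_iter=20):
    # Newton-Raphson for logistic regression (a.k.a. IRLS).
    # Phi: (N, M) design matrix, t: (N,) vector of 0/1 targets.
    N, M = Phi.shape
    w = np.zeros(M)
    for _ in range(n_iter):
        y = sigmoid(Phi @ w)              # current predictions
        grad = Phi.T @ (y - t)            # f(w): gradient of the negative log-likelihood
        R = np.diag(y * (1.0 - y))        # per-point weights
        H = Phi.T @ R @ Phi               # Hessian of the negative log-likelihood
        w = w - np.linalg.solve(H, grad)  # the Newton-Raphson step: w - H^{-1} f(w)
    return w

# Toy usage: one noisy feature plus a bias column.
rng = np.random.default_rng(0)
x = rng.normal(size=100)
t = (x + 0.3 * rng.normal(size=100) > 0).astype(float)
Phi = np.column_stack([np.ones_like(x), x])
print(fit_logistic_newton(Phi, t))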
Sunday, April 27, 2008
Model selection - Information criteria, part I
We covered a paper by Schadt and others last week that dealt with selecting the best model for the data. We want to select the simplest model that explains the most data, and there is a tradeoff between model fit and complexity.
In practice, people use various information criteria to formalize the tradeoff, usually of the form
I = \ln p(D | M) - penalty
where I is the information, p(D | M) is the probability of the data given the model, and the penalty term accounts for model complexity.
Matti asked me last week what the rationale behind choosing the penalty term is. I'll write about the 2 widely used forms (Akaike Information Criterion (AIC) this time, and Bayesian Information Criterion (BIC) the next), but I don't really know the answer to how the fit term should be weighted against the penalty either :)
Here's the intuition for AIC (the idea is presented in both MacKay's and Bishop's books): we have a set of models M_i, and want to select the one with the highest probability P(M_i | D) after seeing the data D. This is given by
P(M_i | D) \propto P(D | M_i) P(M_i)
If we take the prior over models to be uniform, we just need to evaluate the evidence P(D | M_i) for each model.
Pick one model M, and say it has K tunable parameters. Select one of them, w, and let's assume its prior distribution is flat with width \delta_{prior}, i.e.
p(w | M) = \frac{1}{\delta_{prior}}
We have
P(D | M) = \int P(D | M, w) p(w | M) dw
Suppose P(D | M, w) is sharply peaked around w_{MAP}, and the width of the peak is \delta_{MAP}. Then the probability of the data will come almost entirely from that region, and is given by
\int P(D | M, w) dw \simeq P(D | M, w_{MAP}) \delta_{MAP}
since the integral drops to 0 outside the peak. Combining that with the prior 1 / \delta_{prior}, and taking the log, we get
\ln P(D | M) \simeq \ln p(D | M, w_{MAP}) - \ln \frac{\delta_{prior}}{\delta_{MAP}}
Now repeating a similar argument over all parameters (taking K integrals), we get
\ln P(D | M) \simeq \ln p(D | M, \underline w_{MAP}) - K \ln \frac{\delta_{prior}}{\delta_{MAP}}
The first part of the expression is the fit of the model to the data, and the second part is a penalty that is linear in the number of parameters, scaled by the log of the fold-difference between the size of the prior and posterior parameter space.
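As a concrete (made-up) illustration of using a criterion of this form, here is a small sketch that compares polynomial fits with AIC, written as 2K - 2 ln L and assuming Gaussian noise; nothing below comes from the Schadt paper.

import numpy as np

def gaussian_log_likelihood(residuals):
    # Maximum-likelihood Gaussian log-likelihood given the fit residuals.
    n = len(residuals)
    sigma2 = np.mean(residuals ** 2)
    return -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)

def aic_for_poly_fit(x, t, degree):
    coeffs = np.polyfit(x, t, degree)
    residuals = t - np.polyval(coeffs, x)
    k = degree + 2                      # polynomial coefficients plus the noise variance
    return 2 * k - 2 * gaussian_log_likelihood(residuals)

# Toy data: a quadratic signal with noise; lower AIC is better, so the
# quadratic should beat both the too-simple and the too-flexible fit.
rng = np.random.default_rng(1)
x = np.linspace(-1, 1, 50)
t = 1.0 - 2.0 * x + 3.0 * x ** 2 + 0.1 * rng.normal(size=50)
for degree in (1, 2, 8):
    print(degree, aic_for_poly_fit(x, t, degree))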
Sunday, April 20, 2008
The Gaussian
I did not go deep enough in my math studies as an undergrad to get a glimpse of calculus of variations. It looks fascinating, but I won't pick it up, at least not yet.
Calculus of variations can be used to show why the Gaussian is interesting. It is a limiting distribution of several families - a sum of IID variables is approximately Gaussian - but it is also a distribution that conveys our ignorance of the data. Given that a distribution p(x) has mean \mu and variance \sigma^2, the normal distribution is the one that conveys the least extra information, i.e., it has the maximum entropy among all possible p(x) with those moments.
The way to show it involves calculus of variations and Lagrange multipliers. We encode the 3 conditions on the distribution function (it integrates to 1, and has the given mean and variance), and combine them with the entropy in the Lagrangian:
L[p] = -\int p(x) \ln p(x) dx + \lambda_1 \left(\int p(x) dx - 1\right) + \lambda_2 \left(\int x p(x) dx - \mu \right) + \lambda_3 \left( \int p(x) (x - \mu)^2 dx - \sigma^2 \right)
Now differentiating with respect to p(x), and finding the maximum, we get
-\ln p(x) - 1 + \lambda_1 + \lambda_2 x + \lambda_3 (x - \mu)^2 = 0
from where (absorbing the constants into Z and the linear part of \lambda_3 (x - \mu)^2 into \lambda_2) we instantly get the form of the Gaussian:
p(x) = \frac{1}{Z} \exp(\lambda_2 x + \lambda_3 x^2)
Completing the square and solving for the Lagrange multipliers using the constraints for mean and variance, we arrive at the Gaussian distribution. This holds similarly for the multivariate case.
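Explicitly, solving the constraints for the multipliers gives \lambda_3 = -\frac{1}{2\sigma^2} and \lambda_2 = \frac{\mu}{\sigma^2}, so that
p(x) = \frac{1}{Z} \exp\left( \frac{\mu x}{\sigma^2} - \frac{x^2}{2\sigma^2} \right) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\left( -\frac{(x - \mu)^2}{2 \sigma^2} \right)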
So having a Gaussian as a prior distribution for observed data is equivalent to saying that we know nothing about the data except its mean and variance. Once again - cool :)
Monday, April 14, 2008
Ion channels are amazing
I just realised that so far it's been quite math-heavy. So now for something different, inspired by Molecular Biology of the Cell (Alberts et al.), 5th edition.
Cells rely on different gradients (voltage, concentration) to drive processes. The gradients are maintained by active and passive transport across the membranes; the K+ leak channel takes care of part of this. The cool bit is its selectivity - this channel conducts K+ 10,000 times better than Na+, even though both ions are almost uniform spheres of similar diameter.
Firstly, the channel is selective for cations by virtue of negatively charged amino acids at its entrance. Then there is a maximum size limit on how fat an ion can be and still fit through; the skinny ones make it past this first hurdle into the vestibule. But how do you make sure the smaller cations do not slip by?
All cations are associated with polar water molecules, and to pass on from the vestibule into the narrow selectivity filter, an ion has to shed these water molecules, which costs energy. For K+, this cost is exactly balanced by bonds formed with carbonyl oxygens lining the filter; Na+, however, is too small to form those bonds, so passing on from the vestibule is energetically unfavorable for it. Neat!
Sunday, April 13, 2008
Least squares intuition
The ideas for these will probably keep coming from Chris Bishop's book. Today, I liked the intuition behind the least squares solution for the (a bit generalized) linear regression problem.
The general problem is this: given a set of N data points (x_n, t_n), we want to find a predictor function y(x, \underline w) of the form
y(x, \underline w) = \sum_{j=1}^M w_j \phi_j(x)
that minimizes the mean-squared error
E(\underline w) = \frac{1}{2} \sum_{n=1}^N \left( t_n - y(x_n, \underline w) \right)^2
The \phi_j are basis functions that allow for richer models, and the w_j are weights of the basis. For ordinary linear regression, we can take \phi_j(x) = x_j, but in general, we can try to match the output with any basis functions - gaussians, sinusoids, sigmoids, etc.
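For instance, here is a small sketch (my own toy example, not from the book) of building a design matrix from Gaussian basis functions; the centres and width are arbitrary choices.

import numpy as np

def gaussian_design_matrix(x, centres, width):
    # Column j holds phi_j(x_n) = exp(-(x_n - centre_j)^2 / (2 * width^2)).
    return np.exp(-((x[:, None] - centres[None, :]) ** 2) / (2 * width ** 2))

x = np.linspace(0, 1, 20)
Phi = gaussian_design_matrix(x, centres=np.linspace(0, 1, 5), width=0.2)
print(Phi.shape)   # (20, 5): N data points by M basis functions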
Now, consider an N-dimensional space whose axes are given by the regression targets t_n. Then any basis function evaluated at the N data points is also a point in this space:
\underline \phi_j = \left( \phi_j(x_1), \ldots, \phi_j(x_N) \right)^T
If the number of basis functions M is less than the number of data points N, then the linear combinations of the basis function values define a linear subspace of dimension M within it.
In particular,
\underline y = \left( y(x_1, \underline w), \ldots, y(x_N, \underline w) \right)^T
is a point in this subspace for any choice of \underline w.
Now for the cherry on the cake - the choice of weights \underline w that minimizes the error E(\underline w) corresponds to the choice of \underline y that is the orthogonal projection of the data vector \underline t = (t_1, \ldots, t_N)^T onto the subspace spanned by the basis functions.
Perhaps obvious (and proof omitted), but I thought it was nice that the world is consistent :)
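A quick numerical sanity check of this picture (a sketch with made-up data and a polynomial basis of my choosing): the least-squares fitted values coincide with the orthogonal projection of the target vector onto the column space of the design matrix.

import numpy as np

rng = np.random.default_rng(2)
N, M = 30, 4
x = np.linspace(0, 1, N)
t = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=N)     # regression targets

# Design matrix: column j holds the basis function phi_j(x) = x^j at all N points.
Phi = np.column_stack([x ** j for j in range(M)])

# Least-squares weights and the corresponding vector of fitted values.
w, *_ = np.linalg.lstsq(Phi, t, rcond=None)
y = Phi @ w

# Orthogonal projection of t onto the subspace spanned by the basis vectors.
projection = Phi @ np.linalg.solve(Phi.T @ Phi, Phi.T @ t)

print(np.allclose(y, projection))   # True: the minimizer is the projection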