Pretty nuggets of math, CS, bio

Friday, March 26, 2010

Rabbit proofing a wild vegetable patch

Nothing to do with Science. More related to common sense. I went for a walk today with a kind soul who thought it wise to plant vegetable seeds in the random acres of land that our campus is endowed with. It never occurred to him that rabbits would be chewing away at the seedlings in a few weeks' time.

Instead of sending him a book on Gardening for Dummies, it's far easier to disseminate information on the blog.

Well known Bunny repellents:

Pee on the patch. As repugnant as it sounds, this is probably the easiest deterrent for rabbits, since it's perceived as a territorial marker. Works with dog pee too.

Erect a chicken wire fence around the patch. Probably the best way to prevent rabbits from munching the garden down to bare stems. Rabbits are diggers, so the chicken wire must go quite some way down into the ground.

Sprinkle chili powder

Use blood meal around plants. It does not have an unpleasant smell to humans, but animals will steer clear of the scent of blood, unless there are vampires in your vicinity.

Mix a rabbit repellent tea. Put a dash of cayenne pepper and an equal measure of garlic powder into a coffee filter. Turn it into a teabag and pour warm water over it. Let it sit overnight, outdoors, as it stinks to high heaven. It will drive away more than just rabbits. Pour the revolting brew into a spray bottle and add a squirt of dish soap, which helps the spray stick. No living thing will go near your plants.

Wednesday, June 4, 2008

Not mine... but glycosylation anyway!

Instead of writing myself, I'm forwarding a link to another nice blog entry about glycosylation.

Friday, May 9, 2008

Buckyballs and footballs

Before the English invented football, there was the buckyball.

Named after the American architect R. Buckminster Fuller, who designed the geodesic dome with the same fundamental symmetry, C60 is the third major form of pure carbon after diamond (my other favourite carbon) and graphite.

It also happens to be the roundest and most symmetrical molecule known.

In C60, hexagons and pentagons of carbon link together in a coordinated fashion (just like in a football) to form a hollow geodesic dome with bonding strains equally distributed among the 60 carbon atoms. The recognition for its discovery by Kroto, Curl and Smalley came in the form of a Nobel Prize in Chemistry back in 1996.
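As a quick sanity check on the football picture, here is a small sketch that counts atoms and bonds from the faces. The face counts (12 pentagons and 20 hexagons, as on a classic football) are not stated in the post above and are brought in here as a known fact about the truncated icosahedron; the variable names are just illustrative.

```python
# Face counts of a truncated icosahedron (assumed here, not given in the post).
pentagons, hexagons = 12, 20

faces = pentagons + hexagons
corners = 5 * pentagons + 6 * hexagons  # face corners, counted once per face

edges = corners // 2     # every bond (edge) is shared by two faces
vertices = corners // 3  # every carbon atom sits where three faces meet

# Euler's formula for a closed convex polyhedron: V - E + F = 2
euler = vertices - edges + faces

print(vertices, edges, faces, euler)
```

The vertex count is the number of carbon atoms, and the edge count is the number of carbon-carbon bonds in the cage.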

As it turns out, C60 and its other fullerene cousins (C70, C84, C28 et al) are endowed with extraordinary chemical and physical properties. They can react with all sorts of elements across the periodic table and with free radicals, involving a polymerisation process widely used to make high temperature superconductors. American scientists spent the next decade moulding fullerenes into pipes (nanotubes). Meanwhile, their continental counterparts at the IBM laboratory in Zurich built buckyballs into a micro-sized abacus by lining them up along a multigrooved copper plate, like beads on a string, and then moved the beads with a scanning tunnelling microscope to perform calculations. It is a technology that could pave the way for a better computer chip in the future. Move over Microsoft.

Wednesday, May 7, 2008

Vesicle coats

On to the next chapter in Alberts et al.

Adidas clearly stole their 'Fevernova' logo

from the clathrin triskelion

And football players (or Plato?) the truncated icosahedron ball shape

from the clathrin coat

Go Nature in beating humans in making pretty things :)

Friday, May 2, 2008

Model selection - Information criteria, part II

Now for the hardcore information criteria part :)

The goal is still the same - pick a model $M$ to maximize the log-likelihood $ln p(D|M)$ of the data, with the parameters integrated out. This is given by $ln \int p(D | {\bf \theta} , M) p({\bf \theta} |M) d {\bf \theta}$. We can approximate the integral with a Laplace approximation, which is similar in idea to the previous post - the probability mass will be centered around the mode of the distribution. We can fit a normal distribution with the mode as its mean, and its variance approximated from a Taylor expansion at the mode. The next 2 paragraphs can be skipped if you believe this :)

For example, to approximate a function $ln f({\bf z})$ that has a mode (and thus a local maximum) at ${\bf z_0}$ , we use the 2nd order Taylor:

$ln f({\bf z}) \simeq ln f({\bf z_0}) + \frac{1}{2} ({\bf z} - {\bf z_0})^T (\nabla \nabla ln f({\bf z_0})) ({\bf z} - {\bf z_0})$ (the first order term is 0 because of the local maximum)

Taking $A$ as the negative of the second derivative matrix, $A = - \nabla \nabla ln f({\bf z_0})$, and exponentiating, we get $f({\bf z}) \simeq f({\bf z_0}) \exp (- \frac{1}{2} ({\bf z} -{ \bf z_0})^T A ({\bf z} - {\bf z_0}))$. If we are looking for a probability distribution that is proportional to $f$, this is a Gaussian with ${\bf z_0}$ as the mean and $A^{-1}$ as the covariance matrix. The exponential part integrates to $\frac{ (2 \pi) ^ {\frac{M}{2}}}{| A | ^ {\frac{1}{2}}}$, where $M$ is here the number of dimensions of ${\bf z}$, so dividing by that normalizes it - voila!
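To see the approximation at work in one dimension, here is a minimal sketch. It assumes the test function $f(z) = z^a e^{-z}$, which is my choice and not from the derivation above; its integral over positive $z$ is the Gamma function $\Gamma(a+1)$, so there is an exact answer to compare against. The helper name `laplace_integral` is made up, $A$ is estimated with a finite difference instead of being derived by hand, and the fitted Gaussian is integrated over the whole real line even though $f$ lives on positive $z$ only.

```python
import math

def laplace_integral(log_f, z0, h=1e-3):
    """Approximate the integral of f by fitting a Gaussian at its mode z0 (1-D case)."""
    # A = -(second derivative of ln f at z0), via a central finite difference
    A = -(log_f(z0 + h) - 2.0 * log_f(z0) + log_f(z0 - h)) / (h * h)
    # f(z0) times the integral of exp(-A (z - z0)^2 / 2), which is sqrt(2 pi / A)
    return math.exp(log_f(z0)) * math.sqrt(2.0 * math.pi / A)

a = 10.0

def log_f(z):
    # ln f for f(z) = z^a * exp(-z); setting the derivative a/z - 1 to zero gives the mode z = a
    return a * math.log(z) - z

approx = laplace_integral(log_f, z0=a)
exact = math.gamma(a + 1.0)
print(approx, exact, approx / exact)
```

For this particular $f$ the Laplace approximation is just Stirling's formula, so the ratio should come out a bit under 1 and creep closer as $a$ grows.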

So we can fit a Gaussian to a function - back to information criteria. We'll fit a Gaussian to the integrand of $ln \int p(D | {\bf \theta} , M) p({\bf \theta} |M) d {\bf \theta}$ at the mode (the most likely parameter setting) ${\bf \theta_{MAP}}$ :

$ln p(D | M) \simeq ln \int p(D | {\bf \theta_{MAP}}, M) p({\bf \theta_{MAP}} | M) \exp(-\frac{1}{2}({\bf \theta}-{\bf \theta_{MAP}})^T A ({\bf \theta}-{\bf \theta_{MAP}})) d {\bf \theta}$

$= ln \left( p(D | {\bf \theta_{MAP}}, M)p({\bf \theta_{MAP}} | M) \frac{(2 \pi)^{\frac{M}{2}}}{|A|^{\frac{1}{2}}}\right)$

$= ln p(D | {\bf \theta_{MAP}}, M) +ln p({\bf \theta_{MAP}} | M)- 0.5 ln |A| + Mc$

Here $c = 0.5 ln (2 \pi)$, and the $M$ in the exponent and in $Mc$ is the number of parameters in ${\bf \theta}$.

As before, the first term is the fit of the model to the data. The rest of the terms are the complexity penalty. With a wide prior probability for the parameters, the second term is small, and the last term scales with $M$ - the main penalty comes from $ln |A|$.

To evaluate the determinant of $A$ (the inverse of the covariance matrix), we assume that it has full rank, and is due to $N$ iid data points. This means that $A$ is the sum of contributions $A_n$ from the individual data points, and since the data is iid, we take $A_i = A_j \forall i,j$ . So $ln|A| = ln |NA_1| = ln (N^M |A_1|) = M ln N + ln |A_1|$ . Again, the last term is constant, so all in all we have
$ln p (D | M) \simeq ln p(D | {\bf \theta}_{MAP}, M) - 0.5 M ln N$
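A toy use of this final form, as a sketch only: comparing a fair-coin model (no free parameters, $M = 0$) against a coin with an adjustable bias ($M = 1$) on $N$ flips. The coin example and all function names are mine, not from the derivation. I also assume a flat prior on the bias, so that ${\bf \theta}_{MAP}$ is simply the observed fraction of heads, and that the data contain at least one head and one tail so the logs stay finite.

```python
import math

def bernoulli_log_lik(heads, n, p):
    """Log-likelihood of seeing `heads` heads in `n` iid flips with heads-probability p."""
    return heads * math.log(p) + (n - heads) * math.log(1.0 - p)

def bic_score(log_lik, num_params, n):
    """Approximate ln p(D|M) as (log-likelihood at the fitted parameters) - 0.5 * M * ln N."""
    return log_lik - 0.5 * num_params * math.log(n)

def compare(heads, n):
    """Return the scores (fair coin, biased coin); the higher score is the preferred model."""
    fair = bic_score(bernoulli_log_lik(heads, n, 0.5), 0, n)      # nothing to fit, M = 0
    p_hat = heads / n                                             # fitted bias (flat prior assumed)
    biased = bic_score(bernoulli_log_lik(heads, n, p_hat), 1, n)  # one fitted parameter, M = 1
    return fair, biased

print(compare(55, 100))  # mildly uneven data
print(compare(70, 100))  # strongly uneven data
```

With 55 heads out of 100 the extra parameter does not buy enough likelihood to pay its $0.5 ln N$ penalty, so the fair coin scores higher; with 70 heads the biased model wins.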

To recap, we estimated the probability of the data under the model, using the Laplace approximation to fit a Gaussian for the log-likelihood, and used some simplifying assumptions to arrive at the final form.

The end result is pretty much the Bayesian Information Criterion, and it penalizes model complexity more than AIC. Note that the constants in front are not arbitrary, since we never made any simplifications for them, and there's a 2:1 ratio. That should show Matti :)

Thursday, May 1, 2008

The Geometry of Nature and Chaos

Long before Benoit Mandelbrot defined fractals, the geometrical tessellations of Dutch artist M.C. Escher inspired connections between mathematicians, physicists, artists and crystallographers. To put it simply, fractals are structures that appear self-similar on multiple spatial scales - that is, any piece of one looks like the whole after a change of scale.

Fractals in Nature tend to be three-dimensional, requiring three coordinates to specify the location of any point. In specifying an object, we often use two definitions of dimension. First is the Euclidean dimension (D_e): the number of coordinates required to specify an object. Second is the Topological dimension (D_t): something like a measure of the intrinsic dimension of the object. Consider a thin string: it has a topological dimension of one, but when it is spread out in space, as in a ball, it has a Euclidean dimension of three.

Topology is also referred to as ‘rubber’ geometry since it only deals with the qualitative shape of an object. Take for instance a rubber ball - stretching it deforms it into another topologically equivalent object. Likewise, a curve of any shape is topologically equivalent to a straight line, with a topological dimension of one.

Euclidean and topological dimensions are always integers. But very often mathematicians use the term similarity dimension, which is often fractional. Take a unit Euclidean line, square and cube, each divided into N equal self-similar parts of linear size s (the scale factor). For the line, Ns = 1, so each smaller part has length s = 1/N.

For the square, Ns² = 1. Therefore, s = 1/N^(1/2).

As for the cube, Ns³ = 1. That means s = 1/N^(1/3).

So say, if an object of unit size contains N self-similar copies of itself of size s, then its similarity dimension D_s is determined by the equation:

N s^(D_s) = 1

For the Euclidean figures above, D_s = 1 for the line, D_s = 2 for the square and D_s = 3 for the cube. Taking logs and solving the equation for D_s gives:

D_s = log (N) / log (1/s)

Now we can find the similarity dimension of the Koch curve (or the snowflake). At each observation scale, the curve contains 4 self-similar copies of itself of size 1/3, so

D_s = log 4 / log 3 = 1.2618…

That means the similarity dimension of a Koch curve is larger than its topological dimension of 1, but smaller than its Euclidean dimension of 2. Since D_s for a Koch curve is larger than that for a line but smaller than that for area, we can conclude that the Koch curve is more than a line but not quite a plane. Wonderfully surreal.
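The formula D_s = log (N) / log (1/s) is easy to play with. Here is a minimal sketch; the function name is my own, and the (N, s) pairs for the line, square and cube assume each one is cut in half along every axis, which is just one convenient choice of subdivision.

```python
import math

def similarity_dimension(n_copies, scale):
    """D_s = log(N) / log(1/s) for N self-similar copies, each of linear size s."""
    return math.log(n_copies) / math.log(1.0 / scale)

shapes = {
    "line":       (2, 1 / 2),  # 2 halves
    "square":     (4, 1 / 2),  # 4 quarter-squares with half the side length
    "cube":       (8, 1 / 2),  # 8 sub-cubes with half the side length
    "Koch curve": (4, 1 / 3),  # 4 copies, each a third of the size
}

for name, (n, s) in shapes.items():
    print(name, similarity_dimension(n, s))
```

The three Euclidean shapes come back with their usual dimensions of 1, 2 and 3 (up to floating point rounding), while the Koch curve lands in between 1 and 2.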

Monday, April 28, 2008

Misfolded proteins

There are nice sentences in high school text books along the lines of 'misfolded proteins are recognized and degraded'. But in reality, it seems like a tough job to sort these proteins out. There are unfolded proteins, not completely folded proteins, completely folded proteins, and misfolded proteins - how does the cell distinguish which ones deserve to go on?

Anyway, turns out there is a way of dealing with this. Firstly, everything in biology is shape. Shape, shape, shape - John Archer, an old professor, used to stress this a lot. You can recognize when a protein is not done folding because it will display portions that it shouldn't - for example, hydrophobic areas that would normally be buried inside the folded protein, such as in a beta-sheet.

Now how to distinguish between not completely folded and misfolded? This is where sugar tagging comes in. Proteins in the ER are N-glycosylated, that is, an oligosaccharide is attached to the nitrogen of an asparagine side chain. Glucosyl transferase proteins recognize the exposed hydrophobic portions of the protein, and add a glucose to the N-linked oligosaccharide. As long as the sugar tag carries that terminal glucose, the protein is recognized by calnexin and it cannot exit the ER. To escape calnexin binding, the bound glucose needs to be cleaved by glucosidase.

The proteins in ER cycle between being bound by calnexin and having a sugar cleaved, and being recognized by the glucosyl transferase and having a sugar added until they are completely folded.

This still leaves recognizing misfolded proteins - and apparently the mechanism is similar. Once the protein has spent enough time in the ER and not gotten completely folded, its sugar tag is modified again (a mannose is trimmed off), and the trimmed tag is recognized by a chaperone which directs the protein out of the ER into the cytosol, where it is degraded by the proteasome.