Last week, we got reviews back from eLife on the "Integrating influenza antigenic dynamics..." paper. Fairly positive, but as always, there are a number of issues that need correcting. Mainly to help organize the process of responding to reviewer criticisms, I've tried to treat criticisms as "issues" that need fixing like bugs in a piece of code. All the reviewer criticisms are now up as issues on GitHub. As I work, I'm closing issues by linking to a commit that fixes the issue in the text, making it clear what, exactly, has been revised in the text in response to each criticism.
I'm convinced that issue tracking is a good model for the process of correcting particular scientific issues with a manuscript. I could even imagine a manuscript publishing pipeline that gives reviewers the ability to create and close issues. This would thread manuscript conversations by topic rather than exchanging long point-by-point response documents.
Mirrored as a guest post on Haldane's Sieve.
The influenza virus shows a remarkable capacity to evolve to escape human immunity. Many other viruses, like measles, do not have this capacity. After infection with measles, a person gains life-long immunity to the virus, and hence measles has become constrained to be a childhood infection. Continual antigenic evolution in influenza necessitates frequent vaccine updates to provide sufficient protection against circulating strains.
Antigenic differences between strains are commonly quantified using the hemagglutination inhibition (HI) assay, which measures the ability of antibodies created against one strain to interfere with virus from another strain. The resulting HI data is represented as a sparse matrix of comparisons between viruses from strains A, B, C... and sera from strains X, Y, Z... Taken by itself, this matrix is difficult to work with. Experienced virologists can pick up the loss of reactivity between groups of viruses in the noisy HI data, but these patterns are not fully quantified.
In our new paper, available on the arXiv and hosted here, we extend techniques of multidimensional scaling (MDS) pioneered by Derek Smith and colleagues for the analysis of influenza antigenic data. Here, we attempted to bring the MDS antigenic model into a fully Bayesian framework and refer to the revised technique as Bayesian MDS (BMDS). In this model, viruses and sera are represented as 2D coordinates on an antigenic map in which their pairwise distances yield expectations for the HI titers, with antigenically similar viruses lying close to one another and antigenically distant viruses lying far apart.
By placing antigenic cartography in a Bayesian context, we are able to integrate other data sources, most notably sequence data. In this case, genetic sequences provide an evolutionary tree relating virus strains and we assume that antigenic location evolves along this tree in a 2D diffusion process. This process imposes a prior on antigenic locations in which evolutionarily similar viruses have a prior expectation of lying close to one another on the map. In the paper, we use this BMDS / diffusion model to investigate patterns of antigenic evolution in 4 circulating lineages of influenza and show that antigenic drift largely determines incidence patterns across time and across lineages.
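As a rough illustration of what a 2D diffusion prior along a tree means, here is a minimal generative sketch: each child's location is its parent's location plus Gaussian noise whose variance grows with branch length. This is only a toy, not the BEAST implementation; the tree encoding, the function name `simulate_locations` and the parameter `sigma` are all assumptions made for the example.

```python
import random

def simulate_locations(tree, root, sigma=1.0, rng=random):
    """Simulate 2D Brownian diffusion of location along a tree.

    tree maps a parent name to a list of (child name, branch length) pairs.
    Returns a dict mapping every node name to an (x, y) location.
    """
    locations = {root: (0.0, 0.0)}  # place the root at the origin (arbitrary choice)
    stack = [root]
    while stack:
        parent = stack.pop()
        px, py = locations[parent]
        for child, length in tree.get(parent, []):
            # Brownian motion: variance grows linearly with branch length
            sd = sigma * length ** 0.5
            locations[child] = (px + rng.gauss(0.0, sd), py + rng.gauss(0.0, sd))
            stack.append(child)
    return locations

# Hypothetical toy tree: two internal branches and two tips under A
toy_tree = {"root": [("A", 1.0), ("B", 2.0)], "A": [("a1", 0.5), ("a2", 1.5)]}
toy_locations = simulate_locations(toy_tree, "root", sigma=1.0)
for name, (x, y) in sorted(toy_locations.items()):
    print(f"{name}: ({x:.2f}, {y:.2f})")
```

Under this process, viruses separated by short branches tend to land near each other, which is the sense in which the tree acts as a prior on map positions.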
The paper is also up on GitHub, which I'll keep updating as it goes through the review process. The BMDS model is implemented in the software package BEAST and is available in the latest source code. I hope to provide tutorials on running the BMDS model in the not-too-distant future.
On Wednesday, I will be doing a practical on inferring spatiotemporal dynamics of RNA viruses from sequence data at the GenEpi workshop at the London School of Hygiene and Tropical Medicine. I put together a step-by-step guide to go along with the practical, which should be entirely self-contained. This guide is available over at GitHub. I used the 2009 H1N1 influenza pandemic as a test case, where the growth of the pandemic and its spread throughout the world can be inferred from sequence data alone. It mostly details the very basics of getting up and running with the program BEAST and its small associated ecosystem.
I found GitHub pretty ideal for these purposes. I got to write in Markdown, which makes the writing itself quite painless while still producing decent results, and it's also easy to host all the associated files on GitHub, making for a nicely self-contained site.
Erik Volz, Katia Koelle and I just had a paper published in PLoS Computational Biology reviewing methods and applications in viral phylodynamics. This was part of PLoS Comp Biol's push for Topics Pages. These are reviews that get a full PLoS Comp Biol citation, but conform to Wikipedia format and standards with ample links to other Wiki pages, and at publication, they seed a new Wikipedia page. The idea here is to give a CV incentive for researchers to contribute to Wikipedia, which would normally be seen as a time sink. Plus this strategy gives the added benefit of increased exposure, as Wikipedia is often the first place people look when encountering a new subject.
The PLoS Comp Biol version is set in stone, while the Wikipedia version is free to keep evolving as the field progresses, which feels quite comforting actually.
More population genetics... Here, I wanted to look at the process of fixation, that is, the process by which a mutant allele comes to take over the entire population. There is a very classic result by Kimura that the chance of fixation depends on population frequency p and the product of the effective population size N and the selective advantage of the mutant allele s, such that

$$P_{\text{fix}} = \frac{1 - e^{-4Nsp}}{1 - e^{-4Ns}}$$

for diploid populations and

$$P_{\text{fix}} = \frac{1 - e^{-2Nsp}}{1 - e^{-2Ns}}$$

for haploid populations. This result is most commonly used to gauge the likelihood that a new mutant entering the population at 1/2N or 1/N frequency will fix.
Here, I'm showing the chance that a polymorphism fixes in a haploid population as a function of its frequency in the population p and its scaled selection advantage (or disadvantage) Ns. You can see that when Ns = 0, the chance of fixation is just equal to the mutant's frequency. If a neutral allele is at 50% frequency in the population, it has a 50% chance of fixing. If a mutation is selectively advantageous, it has a greater chance of fixing than expected from its frequency alone, and likewise, if a mutation is deleterious, it will fix less often than expected from its frequency. A decently advantageous mutant (Ns = 10) is subject to random loss when it's at low frequency, but will almost certainly fix if it gets to moderate frequency in the population. If it gets to just 3.5% frequency in the population, then it will have a 50% chance of fixing. Conversely, a decently deleterious mutant (Ns = –10) needs to get to high frequency before it has any chance at all of fixing. If it gets to 96.5% frequency, then it will have a 50% chance of fixing.
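The numbers quoted above can be checked with a few lines of code. This is a minimal sketch assuming the standard form of Kimura's diffusion result, with an exponent of 2Ns for haploids and 4Ns for diploids; the function name and the `ploidy` argument are my own choices for illustration.

```python
import math

def fixation_probability(p, Ns, ploidy=1):
    """Chance that an allele at frequency p eventually fixes.

    Ns is the scaled selection coefficient (N times s).
    ploidy=1 uses the haploid form (2Ns), ploidy=2 the diploid form (4Ns).
    """
    k = 2 * ploidy * Ns
    if abs(k) < 1e-12:
        return p  # neutral limit: fixation chance equals current frequency
    # (1 - exp(-k p)) / (1 - exp(-k)), written with expm1 for accuracy near zero
    return math.expm1(-k * p) / math.expm1(-k)

for Ns, p in [(0, 0.5), (10, 0.035), (-10, 0.965)]:
    print(f"Ns={Ns:>3}, p={p:.3f}: P(fix)={fixation_probability(p, Ns):.3f}")
```

All three cases come out at, or very close to, a 50% chance of fixation, matching the frequencies given in the text.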
What's really striking to me is that this makes it clear that there's a snowball effect for advantageous mutants, where getting to some low frequency helps them to get to higher frequency, which in turn helps them to get to even higher frequency and eventually fix. This can be seen as a decreasing slope of the chance of fixation as frequency increases. Every step toward fixation is progressively easier for advantageous alleles. On the other hand, deleterious mutants face an uphill battle. Getting to low frequency is relatively easy, but every step past this gets more and more difficult. This can be seen as an increasing slope as allele frequency increases. Every step toward fixation is progressively harder for deleterious alleles.
It's clear that there exists substantial global genetic diversity in influenza at any given moment. This level of diversity could be reflected in diverse local epidemics, or local epidemics could be homogeneous with specific locations encountering only a single strain each season, even while maintaining substantial diversity at the global level. This is an important distinction if we care about how local populations react to influenza epidemics and in assessing vaccine effectiveness. Here, I've sought to quantify this distinction.
I began by collecting all the sequences I could for influenza A/H3N2 from GenBank from 1995 to 2011, keeping only sequences with complete dates. Because we're interested in functional diversity, I reduced this to only amino acid sequences from the HA1 protein and further reduced these sequences to 129 epitope sites following Munoz and Deem (2005). These epitope sites represent residues that when mutated often result in proteins that elicit a novel immune response, and hence mutations to these sites often provide a transmission advantage to the virus. I removed sequences which possessed gaps at any of these 129 sites, leaving a total of 8482 sequences to work from.
I next calculated the number of unique variants present for each Northern Hemisphere influenza season (October to March), giving the relationship between sample count and number of unique sequence types shown here. I calculated this relationship for the entire global sample, for the USA, Europe, Japan and Korea (Northern), for just the USA and within each state in the US. This relationship can be quantified with the Ewens sampling formula, giving an expectation for the number of unique variants k observed in a sample of n sequences

$$E[k] = \sum_{i=0}^{n-1} \frac{\theta}{\theta + i}$$

where θ represents the level of mutational input into the population. I estimate θ from these relationships, getting a global estimate of θ of 58.5 and a within-state estimate of θ of 7.3. Using the Ewens sampling formula, we predict that, in the average influenza season, 100 random infections sampled from the global population will yield 59 unique variants and 100 random infections sampled at the state-level will yield 23 unique variants.
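For readers who want to play with this relationship, here is a minimal sketch of the expected number of unique variants under the Ewens sampling formula. It assumes the standard textbook expectation, a sum of θ/(θ + i) over i from 0 to n - 1; the fitting and averaging procedure behind the estimates above is not reproduced here, so this function need not match every quoted figure exactly.

```python
def expected_unique_variants(theta, n):
    """Expected number of distinct variants in a sample of n sequences,
    using the standard Ewens expectation: sum of theta / (theta + i)."""
    return sum(theta / (theta + i) for i in range(n))

# Global estimate of theta from the text, with a sample of 100 infections
print(round(expected_unique_variants(58.5, 100)))  # → 59
```

The sum rises steeply at first and then flattens, which is why doubling the sample size yields far fewer than double the unique variants.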
It is common to measure diversity in terms of pairwise identity. In population genetics this is often referred to as heterozygosity, named because it is the probability that two randomly chosen haploid individuals (or chromosomes within a diploid individual) are distinct. This is the same measure as 1 minus the Simpson index (1 - λ) in ecology. Here, I find that 89% of pairs sampled from the global population are distinct and 72% of pairs sampled from within-state populations are distinct. Along these lines, it's common to measure sequence "diversity" or π as the average number of mutations that separate random pairs of sequences. In this case, I find that pairs of sequences sampled from the global population are separated by 4.1 epitope mutations, and pairs of sequences sampled from within-state populations are separated by 2.6 epitope mutations. You may have noticed that θ estimated by the Ewens sampling formula doesn't match up with θ estimated from pairwise diversity (58.5 vs. 4.1 and 7.3 vs. 2.6 for global and local samples, respectively). This is essentially a Ewens-Watterson test showing an overabundance of rare variants, consistent with the actions of natural selection or otherwise non-neutral demography.
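Both pairwise measures are simple to compute from a set of aligned sequences. The sketch below is an illustration on made-up toy sequences rather than the epitope data; the function name `pairwise_stats` is hypothetical, and it assumes all sequences have equal length.

```python
from itertools import combinations

def pairwise_stats(seqs):
    """Return (fraction of pairs that are distinct, mean differences per pair).

    The first value corresponds to heterozygosity, the second to pi.
    Assumes aligned sequences of equal length.
    """
    pairs = list(combinations(seqs, 2))
    # Number of differing sites for every unordered pair of sequences
    diffs = [sum(a != b for a, b in zip(x, y)) for x, y in pairs]
    distinct = sum(d > 0 for d in diffs) / len(pairs)
    pi = sum(diffs) / len(pairs)
    return distinct, pi

# Toy sample of four 4-site sequences, two of them identical
toy = ["KNSD", "KNSD", "KTSD", "RTSE"]
distinct, pi = pairwise_stats(toy)
print(f"distinct pairs: {distinct:.2f}, mean differences: {pi:.2f}")
```

In the toy sample, 5 of the 6 pairs are distinct and the pairs differ at 10/6 sites on average.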
So, although there is significantly more genetic variation at the global level, local variation is still substantial. We expect two people who get the flu within a state (and most likely a community) in the same season to have been infected by distinct viruses. It's not clear whether viruses differing at one of these 129 epitope sites are always immunologically distinct, but many of these mutations to epitope sites should have an immunological impact.
I've prepared a follow-up visualization to the one I did previously on coalescent genealogies. I've been working a lot lately with haplotype networks, in which we have a large sample of sequence variants from a population along with their frequencies and the mutational paths that connect them. I've been analyzing the spread of HIV variants in chronically infected individuals in this fashion. Here, I thought it would be useful to construct a simple visual display for what these networks should look like if evolution follows the standard null model with no selective effects and simple demography. I'm fairly pleased with the results. You can see them here.
We just had a new paper published in PLoS Pathogens led by graduate student Daniel Zinder that addresses antigenic evolution in the influenza virus. Generally, what sets the pace of antigenic evolution in influenza has remained something of a mystery. There is massive pressure for the virus to evolve to escape human immunity. However, dramatically new variants (ones that necessitate a vaccine update) only emerge every year or three. In this study, we wanted to find out whether the emergence of new influenza variants is limited by waiting for new, advantageous mutations to appear or whether emergence is limited by waiting for immunological circumstances to change in the human population.
In a very simple model with only a few epitopes determining the virus's antigenic phenotype, we find that a low mutation rate is critical to reproducing influenza's observed level of genetic diversity; immunological pressures alone will not result in restricted diversity. This is seen in the accompanying figure, in which low diversity outcomes (shown in white) only occur at very low mutation rates. However, we do find that a rather amazingly simple model with just a few epitopes does a pretty decent job at reproducing observed influenza dynamics.
For me, this research suggests that more work is needed to understand the effects of changing the dimensionality and size of the antigenic space in which a virus evolves. Does limiting flu to just a few epitopes improve model fits/predictions? Does it matter how many variants exist at each epitope site? Does it matter how mutation moves from variant to variant, i.e. are all variants of an epitope site reachable by a single mutational event?
When I presented my work on influenza evolution a couple weeks ago, Josephine Pemberton made the comment that the traditional presentation of the influenza phylogeny makes the process of evolution look overly deterministic; the successful trunk lineage is always highest up the page. Taking this to heart, I've made a revised phylogenetic plot that gives an interpretable y-axis showing the number of sequence differences between each node in the tree and the root sequence. This makes the rate of evolution through time obvious and more explicitly connects persistence with sequence evolution. Details of my procedure and the resulting trees for H3N2 influenza can be found here.
I've just started on a project for Google Summer of Code, mentoring Michael Landis at Berkeley. Michael has proposed to build a browser-based tool to visualize phylogeographic output from BEAST and similar programs. Here, we want to track the geographic locations of lineages or species through time across a phylogeny. An animation would start at the root of the tree and work its way forward to the present, essentially slicing the phylogeny at each point in time and showing the distribution of lineage-specific locations at this slice.
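The slicing step can be sketched in a few lines. This is only an illustration of the idea, not the planned tool: the `Node` class, its field names and the toy tree are hypothetical, and as a simplification each lineage crossing the slice is assigned the location of the node it leads to.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    name: str
    time: float                      # time of the node, e.g. decimal year
    location: str                    # sampled or inferred location
    parent: Optional["Node"] = None  # None for the root

def slice_locations(nodes, t):
    """Count the locations of all lineages alive at time t.

    A branch crosses the slice when parent.time < t <= child.time.
    Simplifying assumption: the lineage takes the child's location.
    """
    counts = Counter()
    for node in nodes:
        if node.parent is not None and node.parent.time < t <= node.time:
            counts[node.location] += 1
    return counts

# Hypothetical toy tree with two internal clades and three tips
root = Node("root", 2000.0, "Asia")
a = Node("A", 2001.0, "Asia", root)
b = Node("B", 2001.5, "Europe", root)
tips = [Node("a1", 2002.0, "Asia", a), Node("a2", 2002.5, "USA", a),
        Node("b1", 2002.2, "Europe", b)]
nodes = [root, a, b] + tips

for t in (2000.5, 2001.2):
    print(t, dict(slice_locations(nodes, t)))
```

An animation would call something like `slice_locations` for a sequence of increasing times and redraw the map at each step.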
We're still in the planning stages and one of the big questions is which JavaScript library to base this on. The top two contenders are Processing.js and D3.js. Processing.js will take code written in Processing, essentially stripped-down Java, and draw to an HTML canvas object, essentially specifying pixels on a grid. D3, on the other hand, is written as pure JavaScript and all of the manipulation is in terms of SVG objects, specifying lines and circles and so on. Although I agree in part with Knuth that "premature optimization is the root of all evil," I wanted to see if performance would have a definite tick in one column or the other.
Here, I coded up a simple Brownian motion style visualization using both programs. The Processing.js visualization is here and the D3.js visualization is here. There are 500 particles, the velocity of which is constantly being bumped up and down by random noise. The XY window is adjusted every frame to match up with the extent of the XY locations of the particles. In addition to random noise, there is some friction slowing down the particle velocities and there is an attraction of each particle to {0,0}, making this an OU process. Particle sizes are proportional to their velocities.
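The update rule described above can be written down compactly. The following is a minimal sketch in Python rather than either of the two browser libraries, and it leaves out all drawing; the parameter values (`noise`, `friction`, `attraction`) are arbitrary assumptions, not the ones used in the actual visualizations.

```python
import math
import random

def step(particles, noise=0.5, friction=0.1, attraction=0.01, rng=random):
    """Advance every particle [x, y, vx, vy] by one frame, in place."""
    for p in particles:
        for axis in (0, 1):
            v = p[2 + axis]
            v += rng.gauss(0.0, noise)   # random bump to the velocity
            v -= friction * v            # friction slows the particle down
            v -= attraction * p[axis]    # pull back toward {0,0}
            p[2 + axis] = v
            p[axis] += v                 # move by the updated velocity

def extent(particles):
    """Bounding box (xmin, xmax, ymin, ymax) used to rescale the window."""
    xs = [p[0] for p in particles]
    ys = [p[1] for p in particles]
    return min(xs), max(xs), min(ys), max(ys)

def size(p):
    """Particle size, taken as proportional to speed."""
    return math.hypot(p[2], p[3])

particles = [[0.0, 0.0, 0.0, 0.0] for _ in range(500)]
for _ in range(100):
    step(particles)
print(extent(particles))
```

The attraction term is what keeps the cloud of particles from spreading without bound, so the window extent settles down after an initial expansion.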
| Browser | Processing.js (FPS) | D3.js (FPS) |
| --- | --- | --- |
| Safari | 58 | 29 |
| Chrome | 40 | 34 |
| Firefox | 40 | 4 |
Here, I've recorded the frame rates I was getting for both Processing.js and D3.js. I'm doing all of this on my MacBook Pro. Your results may vary. Processing.js running in Safari comes out on top, nearly hitting 60 FPS, while D3.js under Safari gave roughly half this. Chrome fares substantially worse with Processing.js, but slightly better with D3.js, while Firefox does terribly with D3.js. I would imagine that almost all of the differences here will lie in the handling of SVG vs canvas rather than in the D3 and Processing libraries. Still, although I'm sure SVG performance will continue to improve, for the moment it seems that Processing.js is the clear winner.
Disclaimer: this is one particular visualization. Incorporating other aspects (transparency, polygons, etc...) could give different results entirely.