How many strains of flu are circulating at any given moment? And how much sampling is necessary to capture this diversity? This came up in a conversation with Andrew Rambaut and Erik Volz last week. Fortunately, we can get a back-of-the-envelope estimate using standard population genetic theory. Here, I've downloaded all the amino acid sequences for the HA1 region of the H3N2 hemagglutinin protein that exist in GenBank between January 2002 and June 2009. This figure looks at 10-week windows, with each colored region representing the frequency of a particular sequence in that window's sample. You can see that there are a few common sequences and many rare sequences, and that sequence diversity changes rapidly over time. HA1 is the region of the influenza genome most responsible for antigenic variation. Evolution of HA1 is what allows the virus to infect people who have built up immunity to previous strains of flu. Looking at amino acid diversity of HA1 will give an underestimate of the total genomic diversity of flu, but should be a decent proxy for functional diversity.
We can use the Ewens sampling formula to calculate the probability that we observe k distinct sequences (or alleles) in a sample of n sequences. In this case, the expected number of alleles in a sample of n sequences is

$$E[k] = \sum_{i=0}^{n-1} \frac{\theta}{\theta + i},$$

where θ represents the level of mutational input into the population. This formula assumes neutral demography, no geographic subdivision and an infinite alleles mutation model, where every mutation creates a new allele. I fit this formula to the windows from GenBank, comparing the number of sequences sampled each month to the number of distinct sequences observed. Doing so, I get an estimate for θ of 28.8, shown in red.
With this number in hand, it's possible to estimate the number of distinct alleles that one would find in a very large sample. We expect to find 104 alleles in a sample of 1,000 sequences and 169 alleles in a sample of 10,000 sequences. Estimated global prevalence of influenza is around 70 million (more during the northern hemisphere winter, but this should be good enough for our purposes). A sample of 70 million sequences is expected to have 358 distinct sequences. However, most of these are at very low frequency. We would only expect to see around 30 alleles present at >1% frequency, 86 alleles present at >0.1% frequency and 164 alleles present at >0.01% frequency in the population. I'm not sure exactly where to draw the line in terms of "important" variation, but I would think that 1 in 1,000 is a good ballpark. Thus, it seems to me that a sample of around 500 sequences (with an expected 84 unique alleles) would be sufficient to capture all the possibly important diversity in the HA1 protein.
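For anyone who wants to play with these numbers, here is a minimal Python sketch of the expected allele count under the infinite alleles model, E[k] = Σ θ / (θ + i) for i from 0 to n − 1. The function name `expected_alleles` is my own label for illustration; only θ = 28.8 and the sample sizes come from the post. It uses direct summation, so it is only practical for modest sample sizes.

```python
def expected_alleles(n: int, theta: float) -> float:
    """Expected number of distinct alleles in a sample of n sequences
    under the infinite alleles model: sum over i = 0..n-1 of theta / (theta + i)."""
    return sum(theta / (theta + i) for i in range(n))


THETA = 28.8  # estimate of theta quoted in the post

for n in (500, 10_000):
    # Direct summation; fine for samples of this size.
    print(n, round(expected_alleles(n, THETA)))
```

This prints 84 alleles for a sample of 500 and 169 for a sample of 10,000.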
I've spent a bit more time cleaning up my LaTeX template to make for fully automated web display. You can find it over on GitHub. This is currently set up for the canalization paper, but it should make a good basis for any sort of scientific manuscript. I've provided style sheets and a Ruby script to clean up output from TeX4ht into a presentable web version. This web version is hosted automatically through GitHub Pages. Thus, by running a single script, the LaTeX source is compiled to HTML, and with a GitHub push you can update the public web version of the manuscript. I hope this sort of approach will prove useful for collaborative writing.
It was easy to run this on my previous manuscripts written in LaTeX. I now have web versions of the tree topology and the global migration dynamics papers up.
In the Serengeti food web paper, we present a network diagram of predator-prey relationships, illustrating network structure (Figure 3). In getting with the times, we've also made an interactive version of this figure, presenting the network in a force-directed layout. Ed coded this up in d3.js based on a version I did in Processing. Green nodes represent plants, blue nodes represent herbivores and red nodes represent carnivores. Edges connecting nodes pull them toward each other following Hooke's Law, while nodes are repelled from each other according to Coulomb's Law. We add an additional force pulling nodes belonging to the same group toward each other.
My favorite part of the visualization is the concept of focus. If you click on a node, the spring forces applied to the edges of this node are magnified, pulling its connections closer. This makes it easier to explore relationships in the network. A double-click removes focus.
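To make the physics concrete, here is a rough Python sketch of the force calculation: Hooke's Law springs along edges, Coulomb's Law repulsion between all pairs of nodes, and a stiffer spring on edges touching the focused node. This is not the d3.js code behind the figure. The function name `net_forces`, every constant (`k_spring`, `rest_len`, `k_repel`, `focus_gain`) and the example species are assumptions for illustration, and the extra within-group attraction is left out.

```python
import math


def net_forces(pos, edges, focus=None, k_spring=0.1, rest_len=1.0,
               k_repel=1.0, focus_gain=3.0):
    """Return the net (fx, fy) force on each node.

    pos:   dict mapping node -> (x, y)
    edges: list of (a, b) node pairs
    focus: optional node whose edges get stiffer springs
    All constants are illustrative assumptions.
    """
    forces = {v: [0.0, 0.0] for v in pos}
    nodes = list(pos)

    # Coulomb's Law: every pair of nodes repels with magnitude k_repel / d^2.
    for i, a in enumerate(nodes):
        for b in nodes[i + 1:]:
            dx = pos[b][0] - pos[a][0]
            dy = pos[b][1] - pos[a][1]
            d = math.hypot(dx, dy) or 1e-9  # avoid division by zero
            f = k_repel / d ** 2
            forces[a][0] -= f * dx / d
            forces[a][1] -= f * dy / d
            forces[b][0] += f * dx / d
            forces[b][1] += f * dy / d

    # Hooke's Law: each edge is a spring with magnitude k * (d - rest_len).
    for a, b in edges:
        dx = pos[b][0] - pos[a][0]
        dy = pos[b][1] - pos[a][1]
        d = math.hypot(dx, dy) or 1e-9
        # Springs attached to the focused node are magnified.
        k = k_spring * (focus_gain if focus in (a, b) else 1.0)
        f = k * (d - rest_len)  # positive when stretched: pulls ends together
        forces[a][0] += f * dx / d
        forces[a][1] += f * dy / d
        forces[b][0] -= f * dx / d
        forces[b][1] -= f * dy / d

    return {v: (fx, fy) for v, (fx, fy) in forces.items()}


# Hypothetical three-node web: grass -> zebra -> lion, with focus on zebra.
positions = {"grass": (0.0, 0.0), "zebra": (2.0, 0.0), "lion": (2.0, 2.0)}
links = [("grass", "zebra"), ("zebra", "lion")]
print(net_forces(positions, links, focus="zebra"))
```

A layout then comes from repeatedly nudging each node a small step along its net force until things settle down.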
Our paper on modeling food webs was just published in PLoS Computational Biology. Here, I was happy to bring the statistics I've learned from phylogenetic analysis to an entirely different field. I advised Ed Baskerville in implementing MCMC and marginal likelihood estimation for network data. In this case, the data is a matrix of predator-prey relationships, which can be thought of as a network of directed edges specifying who-eats-whom. We investigated structure in the Serengeti food web through a model in which groups of species behave similarly to one another in terms of what species they eat and what species they are eaten by. The inferred model shows a high degree of trophic and spatial clustering in which a number of spatially distinct plant groups are fed upon by a few wider-ranging herbivore groups, which are in turn fed upon by just a couple of predator groups.
Also of possible interest, the supporting appendix provides a nice overview of the use of Bayesian methods for inference on network data. The model we present here really should be useful in a variety of biological contexts; genetic regulatory networks and protein interaction networks immediately come to mind. Photo by Andy Dobson.
| Journal A | Accept | Reject |
| Accept | 71 (67) | 35 (43) |
| Reject | 42 (43) | 31 (27) |

| Journal B | Accept | Reject |
| Accept | 57 (47) | 18 (27) |
| Reject | 16 (27) | 25 (15) |
I discovered this paper by Peter Rothwell and Christopher Martyn through an excellent blog post by Bradley Voytek. In the paper, the authors show that reviews of the same paper by two independent reviewers show a level of agreement little better than expected by chance alone. The authors repeat their experiment across two neuroscience journals. For the first journal, they have 179 pairs of reviews, with 219 of the 358 votes (61%) recommending acceptance or acceptance with revision. If votes between reviewers were distributed entirely by chance, we would expect 67 accept-accept pairs, 43 accept-reject pairs, 43 reject-accept pairs and 27 reject-reject pairs. However, if the reviewers are coming to some sort of scientific consensus we would see an overabundance of accept-accept and reject-reject pairs.
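The chance expectation is easy to reproduce. Here is a small Python sketch, assuming both reviewers vote independently with the same acceptance probability, estimated by pooling all votes for a journal. The function name `chance_expected` is my own; the observed counts are the ones in the table above.

```python
def chance_expected(aa, ar, ra, rr):
    """Expected pair counts (accept-accept, accept-reject, reject-accept,
    reject-reject) if both reviewers vote independently with a shared
    acceptance probability estimated from the pooled votes."""
    pairs = aa + ar + ra + rr
    # Each pair contributes two votes; accept-accept pairs contribute two accepts.
    p = (2 * aa + ar + ra) / (2 * pairs)
    return (pairs * p * p,
            pairs * p * (1 - p),
            pairs * (1 - p) * p,
            pairs * (1 - p) ** 2)


observed = {
    "Journal A": (71, 35, 42, 31),
    "Journal B": (57, 18, 16, 25),
}

for name, counts in observed.items():
    print(name, [round(x) for x in chance_expected(*counts)])
```

This gives 67, 43, 43 and 27 for journal A, and 47, 27, 27 and 15 for journal B, matching the expected counts in parentheses.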
Here, I've shown their findings, with observed and (expected) counts for each scenario. In journal A, there appears to be little or no difference from the chance expectation, while journal B shows a very modest improvement over the chance expectation. A simple Fisher's exact test gives a P value of 0.285 on the results of journal A and a P value of less than 0.0001 on the results of journal B. Additionally, Rothwell and Martyn find little correspondence in reviewers' assessments of priority of publication.
Interestingly, the authors also studied reproducibility of abstract acceptance at two different scientific conferences. Here, each abstract was reviewed and rated on a 1 to 6 scale by a panel of 14 or 16 reviewers. In this case, variance across abstracts can be assessed, but also variance across reviewers (we expect some reviewers to be tougher than others in their assessments). Rothwell and Martyn find a very modest R² across abstracts of 0.11–0.15, indicating very little reviewer agreement. However, R² across reviewers was a more respectable 0.27–0.32, suggesting more variation in reviewer "toughness".
Thus, it appears that in small samples of two or three reviewers, noise from positive/negative reviewer bias may swamp the signal of a particular manuscript. This fits with my own anecdotal experiences. Usually (but by no means always) reviewers seem to agree on what's lacking in a manuscript, but will often disagree on how terrible a particular failing is to the manuscript's prospects. Perhaps if each reviewer's overall positive/negative rating bias were taken into account, we could arrive at a measure of manuscript quality that is more repeatable between independent reviewers. In turn, this could make authors less beholden to the roll of the reviewer die.
In an ongoing effort to be more open in my scientific dealings, I've posted a preprint of my latest paper to the arXiv and here on my website, as both PDF and HTML. This represents my first attempt at a straight-up modeling study. There's a lot going on with the epidemiology and evolution of influenza; I've made a model that attempts to capture all the salient details. This includes things like the yearly attack rates, rate of antigenic evolution, genetic diversity, and geographic spread. At its core, the model assumes that the antigenic phenotype of the virus can be adequately explained as a point in a Euclidean space. Mutation serves to jostle the location of the virus in this space, and infection by one virus confers immunity to subsequent infection by nearby viruses in this antigenic space. The geometric basis of the model stems from empirical studies of influenza's antigenic phenotype (see Smith et al. 2004). In this study, I find that evolution in such a space results in a "canalized" trajectory. The best move for a virus is to move as far away from its past as possible, resulting in linear antigenic movement and a distinctive single-trunked phylogenetic tree.
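As a toy illustration of the geometric idea (not the simulation code from the paper), here is a Python sketch in which mutation jostles a virus's position and a host's risk of infection grows with the distance to the nearest virus it has already seen. The Gaussian mutation kernel, the linear risk function, the names `mutate` and `infection_risk`, and all parameter values are assumptions made for this example.

```python
import math
import random


def mutate(position, step_sd, rng):
    """Jostle an antigenic position with Gaussian noise in each dimension
    (an assumed mutation kernel, chosen for simplicity)."""
    return tuple(x + rng.gauss(0.0, step_sd) for x in position)


def infection_risk(virus, immune_history, slope):
    """Risk of infection for a host who has previously been infected by the
    viruses in immune_history. Assumed form: risk rises linearly with the
    Euclidean distance to the nearest past infection, capped at 1."""
    if not immune_history:
        return 1.0  # a naive host is fully susceptible
    nearest = min(math.dist(virus, past) for past in immune_history)
    return min(1.0, slope * nearest)


rng = random.Random(1)
ancestor = (0.0, 0.0)
history = [ancestor]  # the host was infected by the ancestor
mutant = mutate(ancestor, step_sd=1.0, rng=rng)

print("risk from ancestor:", infection_risk(ancestor, history, slope=0.1))
print("risk from mutant:  ", infection_risk(mutant, history, slope=0.1))
```

In this toy version, a mutant that lands farther from everything in a host's immune history has a higher chance of infecting that host, which is the pressure that pushes the virus away from its past.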
I'm especially proud of my HTML version of the manuscript, which, through the magic of LaTeX, has all sorts of hyperlinking between figures and references. In addition, I've done my best to make something that's highly readable on the screen. Almost everything is taken care of by TeX4ht conversion from my LaTeX source and with a CSS stylesheet, so with only a little more work I should be able to fully automate the process.
I'm working now to put the source code for the simulations behind this online. In the meantime, I would very much welcome any feedback you might have on the manuscript. It's good to get feedback before publication, when there's still an opportunity to incorporate it.
I came across a simple visualization of England and Wales mortality data in the Guardian. And because I couldn't deal with the network-y display of hierarchical count data, I decided to redesign the graphic as a treemap. In googling for "treemap", I found d3.js, which makes extremely attractive JavaScript graphics, with a number of rather fancy built-in figure types. It seems a little harder to get into than Processing, as it exposes more of the raw JavaScript, but the results are beautiful and it provides full SVG support. Here's the mortality data laid out with d3's treemap algorithm.
In my paper on selection in viral phylogenies, I compared the effective population size of measles virus to the effective population size of human influenza virus. The concept of effective population size Ne is central to population genetics. It measures the timescale of population turnover, or, looking backwards in time, it measures how long it takes for individuals in the population to find a common ancestor. Genetic diversity is a combination of this timescale and mutation rate.
This is just a small addendum to that paper. I had wanted to include swine influenza in with the comparison of measles virus and human influenza virus, but decided that this would detract from the paper's focus. Here, the sequences of swine influenza come from de Jong et al. (2007).
The scaled effective population size Neτ of measles is estimated at 124.6 years, Neτ of global H3N2 human influenza is estimated at 7.2 years, and Neτ of European H3N2 swine influenza is estimated at 24.1 years.
This fits nicely with the observed patterns of antigenic evolution. Infection with measles confers life-long immunity; evolution of the measles genome does not change its antigenic phenotype. This results in neutral population dynamics. However, human influenza evolves in antigenic phenotype very rapidly, causing strong selective pressures that reduce effective population size. Swine influenza presents a nice example between these two extremes. In comparing rates of antigenic evolution, de Jong et al. find that "while human H3N2 viruses have evolved at a rate of about 2.0 antigenic units per year since 1982, swine H3N2 viruses have evolved more than six times more slowly, about 0.3 antigenic units per year." In this case, selective pressures still reduce effective population size, but not to the degree seen in human influenza.
In my work on flu I've been trying to build joint evolutionary and epidemiological models, where natural selection emerges dynamically from influenza strains competing for susceptible hosts. In speaking on this, I found it useful to broaden the context a bit. Here, you can think very generally of genetic / ecological variants competing with one another in some sort of ecological space. Variants that are close together in this space strongly compete, while more distant variants exist more-or-less independently.
This is exactly the model that Darwin used to illustrate the Origin of Species. Here, I've described this idea in a bit more detail and built a visualization of the model.
I put the source code to the simulations in last month's tree topology paper online.