Scotch M, Tahsin T, Weissenbacher D, O’Connor K, Magge A, Vaiente M, Suchard MA, Gonzalez-Hernandez G. Incorporating sampling uncertainty in the geospatial assignment of taxa for virus phylogeography. Virus evolution. 2019 Jan;5(1):vey043. doi:10.1093/ve/vey043
Discrete phylogeography using software such as BEAST considers the sampling location of each taxon as fixed; often to a single location without uncertainty. When studying viruses, this implies that there is no possibility that the location of the infected host for that taxa is somewhere else. Here, we relaxed this strong assumption and allowed for analytic integration of uncertainty for discrete virus phylogeography. We used automatic language processing methods to find and assign uncertainty to alternative potential locations. We considered two influenza case studies: H5N1 in Egypt; H1N1 pdm09 in North America. For each, we implemented scenarios in which 25 per cent of the taxa had different amounts of sampling uncertainty including 10, 30, and 50 per cent uncertainty and varied how it was distributed for each taxon. This includes scenarios that: (i) placed a specific amount of uncertainty on one location while uniformly distributing the remaining amount across all other candidate locations (correspondingly labeled 10, 30, and 50); (ii) assigned the remaining uncertainty to just one other location; thus ‘splitting’ the uncertainty among two locations (i.e. 10/90, 30/70, and 50/50); and (iii) eliminated uncertainty via two predefined heuristic approaches: assignment to a centroid location (CNTR) or the largest population in the country (POP). We compared all scenarios to a reference standard (RS) in which all taxa had known (absolutely certain) locations. From this, we implemented five random selections of 25 per cent of the taxa and used these for specifying uncertainty. We performed posterior analyses for each scenario, including: (a) virus persistence, (b) migration rates, (c) trunk rewards, and (d) the posterior probability of the root state. The scenarios with sampling uncertainty were closer to the RS than CNTR and POP. For H5N1, the absolute error of virus persistence had a median range of 0.005–0.047 for scenarios with sampling uncertainty—(i) and (ii) above—versus a range of 0.063–0.075 for CNTR and POP. Persistence for the pdm09 case study followed a similar trend as did our analyses of migration rates across scenarios (i) and (ii). When considering the posterior probability of the root state, we found all but one of the H5N1 scenarios with sampling uncertainty had agreement with the RS on the origin of the outbreak whereas both CNTR and POP disagreed. Our results suggest that assigning geospatial uncertainty to taxa benefits estimation of virus phylogeography as compared to ad-hoc heuristics. We also found that, in general, there was limited difference in results regardless of how the sampling uncertainty was assigned; uniform distribution or split between two locations did not greatly impact posterior results. This framework is available in BEAST v.1.10. In future work, we will explore viruses beyond influenza. We will also develop a web interface for researchers to use our language processing methods to find and assign uncertainty to alternative potential locations for virus phylogeography.