Let’s Gamify Virus Hunting
I never understood the allure of Vegas. Until the pandemic forced me to hunt viruses.
This is my second post in a series expanding on our group’s recent virus discovery preprint. The first post in the series is here.
When I was in grad school, the collective sum of all known DNA sequences used to arrive via snail mail on CD-ROMs. Needless to say, there’s gray in my beard. Nowadays, it would require around 50 million CDs to represent all the data in public sequence repositories. The stack of CDs would be 50 miles high! The training I got in grad school obviously wouldn’t be up to the task of searching for viruses in such a mess. Luckily, I work with some seriously brilliant young “trainees” who agreed to patiently teach an old dog the new bioinformatic tricks I needed to fish for viruses in the miles-deep trenches of the Sequence Read Archive (SRA).
I have dangerous levels of love for video games - to the extent that a condition of buying a PlayStation for our house was that my husband had to also get a fireproof box where we lock away the controllers when we’re done playing. With ad libitum game controllers in the house, I might not show up for work. The surprise of my pandemic remote-work experience is that trawling for viruses in other people’s datasets turns out to be pretty much the most engrossing computer game I’ve ever encountered. It’s like it’s giving me a tiny window into what gambling addiction might be like. If somebody could figure out how to gamify this stuff, they’d give Sudoku a run for its money. I’m looking at you,
.Gameplay, as I’ve been experiencing it, is sort of akin to the children’s card game Concentration. As a specific example, a gene I call “HelPol” first turned up in a parvovirus in an Oregon tidepool pillbug[1]. At first, the linear amino acid sequence of the new gene was hard to recognize - but protein fold predictions showed that it’s a type of DNA polymerase. I named the gene after Hel, the Norse goddess of the underworld, because the second place I found HelPol was in sediment samples collected near deep sea hydrothermal vents. Instead of being embedded in parvoviruses, the sediment HelPol genes are embedded in a much stranger group of viruses that I suspect infect archaea. Since it’s hard to imagine an animal parvovirus stealing genes from an archaeal virus, it seems like there must be a line missing on my subway map. My current bet is that a missing link HelPol will eventually turn up in a diverse group of viruses I call “meldviruses” (midsize eukaryotic linear DNA viruses).
The problem with this story is that I’m just one guy, I’m only really searching for my personal favorite virus families, and so far I’ve looked at <1% of the SRA. There are a lot more Concentration cards to turn over. It’s like the African proverb that starts, “if you want to go fast, go alone.” The virus-hunting game will go farther if there’s a massively multiplayer option.
The Human Element
Finding weird genes in pillbugs is fun, but human datasets are obviously where the real action’s at. My datamining efforts barely sampled any human datasets, and yet I turned up quite a few examples of an emerging family called adintoviruses in creepy places like atherosclerotic plaque datasets. If more human eyeballs could be recruited to trawl human sequence datasets, I bet we’d uncover all sorts of new associations between viruses and other major diseases such as cancer and Alzheimer’s. There’s gold in them thar datasets!
Here’s the real problem: the vast majority of human sequence datasets are hidden behind walls of privacy bureaucracy. There are many thousands of bioprojects in this category and each bioproject requires its own boutique access negotiations. The access paperwork sometimes contains terms my employer forbids me to sign. New privacy laws in Europe have basically put an iron curtain across the entire continent.
Byzantine access restrictions are maddening for two reasons. First, nobody has ever reported any substantive harm from the release of research sequences. It’s a risk we’ve been imagining for decades but it has simply never materialized. More importantly, the identifiable portions of these types of datasets are the human sequences - which are exactly the thing we’d ideally want to computationally remove if all we’re doing is looking for viral sequences.
The pushback I’ve heard on the idea of releasing permanently de-identified datasets is that the data release wasn’t explicitly specified in the original consent forms. This argument is faulty. The purpose of consent forms is to ensure that volunteers understand the possible harms associated with enrollment in a research study. Releasing de-identified data obviously doesn’t pose any meaningful risk of harm. If anything, our failure to properly analyze valuable research data could be considered a minor ethical harm.
It might help to consider some thought experiments. Let’s picture a scenario in which there’s a disease outbreak in a fundamentalist Christian community and I want to find out whether the disease is caused by a virus. I’d be aware of the fact that some study volunteers might not feel comfortable with the idea of gay marriage, but the consent form can’t have a checkbox saying “I agree that my samples can be analyzed by a gay-married scientist.” Gay sample handling isn’t a harm. In a more absurd thought experiment, a consent form for a vaccine trial can’t include the sentence, “I understand that receiving vaccines might lead to alien abduction.” Warnings about unrealistic risks are effectively a form of disinformation. It’s inappropriate to focus on the fact that study volunteers didn’t explicitly consent to the harmless release of de-identified datasets because non-harms don’t belong on consent forms.
Game Development
Although I’m a proud co-author of a paper entitled “Cenote-Taker 2 Democratizes Virus Discovery and Sequence Annotation,” I humbly confess the project was the brainchild of an aforementioned brilliant trainee, Mike Tisza. Mike has since moved on to the greener pastures of the academic tenure track, where I doubt game development research would be fully appreciated. Furthermore, Cenote-Taker only democratizes the final stages of the virus-hunting game. Truly democratized gamification still has a fairly long and winding road ahead.
A fundamental problem a gamifier would need to confront is that most of the dozens of petabytes of available SRA data are in the form of what’re called “short reads.” It’s as though we’re detectives who have been given several tons of shredded documents. The computational tools for assembling the shreds into readable pages are well worked out, but downloading the raw data from SRA is slow and disk-intensive, and the assembly work is CPU- and memory-intensive. The assembly process is nowhere near as much of an environmental disaster as Bitcoin mining, but I’d estimate that the hundred thousand SRA datasets I assembled put something like ten tons of CO2 into the atmosphere. It’s pretty funny that when the pandemic trapped me in my house my carbon footprint suddenly went up by an order of magnitude.
An initial stage of game development might be for public sequence repositories and academic groups (like mine) to do the computationally intensive work needed to create a searchable repository of compressed assembly snapshots. Megahit is a fast and accurate tool for the assembly process and the ultrafast Diamond format gets my vote for creating the searchable database. The pipeline could also be set up to send assembled contigs of interest to Mike’s new and improved Cenote-Taker 3 - such that nonspecialists could more easily understand the gene content of candidate viruses and could easily share annotated genome maps via GenBank. The hunting pipeline I’m picturing would require an AWS account and some basic knowledge of virus biology and command line syntax - so I guess it won’t be giving Sudoku a run for its money anytime soon. But those of skill in the art could presumably build a friendlier interface on top of the basic infrastructure I’m proposing. I’m looking at you, Serratus.
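For the command-line-inclined, here is a rough Python sketch of what one turn of the game might look like: fetch a run's short reads, assemble them with Megahit, then search the contigs against a Diamond protein database. This is an illustration under assumptions, not my actual pipeline. The accession `SRR0000000`, the database name `viral_proteins.dmnd`, the hit name `HelPol_ref`, the e-value cutoff, and the bitscore threshold are all made up, and the sketch assumes paired-end reads. The Cenote-Taker annotation step is left out. The script only builds the commands and parses Diamond's 12-column tabular output - it doesn't run anything on its own.

```python
from pathlib import Path
import shlex


def build_commands(accession, workdir, protein_db, threads=8):
    """Return the three pipeline commands as argument lists (nothing is executed)."""
    work = Path(workdir)
    reads = work / "reads"
    assembly = work / "assembly"  # Megahit wants to create this directory itself
    contigs = assembly / "final.contigs.fa"  # Megahit's default contig file name
    hits = work / f"{accession}.hits.tsv"
    return [
        # 1. Download the short reads (SRA Toolkit); assumes a paired-end run.
        ["fasterq-dump", accession, "--split-files",
         "--threads", str(threads), "--outdir", str(reads)],
        # 2. Assemble the "shredded documents" into contigs.
        ["megahit", "-1", str(reads / f"{accession}_1.fastq"),
         "-2", str(reads / f"{accession}_2.fastq"),
         "-t", str(threads), "-o", str(assembly)],
        # 3. Translated search of contigs against a protein database.
        #    The 1e-5 e-value cutoff is an arbitrary illustrative choice.
        ["diamond", "blastx", "--db", str(protein_db), "--query", str(contigs),
         "--outfmt", "6", "--evalue", "1e-5",
         "--threads", str(threads), "--out", str(hits)],
    ]


def best_hits(tsv_lines, min_bitscore=50.0):
    """Keep the highest-bitscore hit per contig from Diamond tabular (format 6) lines."""
    best = {}
    for line in tsv_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip blank or malformed lines
        contig, subject, bitscore = fields[0], fields[1], float(fields[11])
        if bitscore >= min_bitscore and bitscore > best.get(contig, ("", 0.0))[1]:
            best[contig] = (subject, bitscore)
    return best


if __name__ == "__main__":
    # Print the plan so it can be inspected before running anything.
    for cmd in build_commands("SRR0000000", "work", "viral_proteins.dmnd"):
        print(shlex.join(cmd))
```

Each printed line could be handed to `subprocess.run(cmd, check=True)` on a machine with the three tools installed. Keeping the planning separate from the running makes it easier to inspect (or farm out) the expensive steps before burning any CPU.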
All that and I didn’t even mention AI! Which is obviously perfect for this type of big data project. AI geeks, start your engines! I guess I might be the only man in America explicitly begging AI geeks to please put me out of a job.
Next post: Riddle of the primarnivore - what do anelloviruses and giraffe necks have in common?
[1] My earliest memories are from a family trip down the Pacific coast when I was three years old. I have flashbulb picture-memories anchoring my lifelong interest in biology to tidepools on the Oregon coast. Man oh man was it cool to discover HelPol in the ancestral cradle of one of my passions. Eat your heart out, God of War.