Biology now involves more and more data. With the amount of data involved… we need to use computers to deal with this data, to dig useful information out of the biological data.
- Zhipeng Cai, Ph.D. Computing Science, University of Alberta
Computing Science research associate, Zhipeng Cai.
Ever since HIV (human immunodeficiency virus) was found swimming through human bloodstreams in the early 1980s, it has proved adept at staying a few strokes ahead of the medical community.
An arsenal of drugs has been produced to combat the virus, but often they work on a patient only for a while. Then the virus grows resistant, the drugs become useless, and doctors are back at square one.
HIV is hard to beat because it’s not a killer with dependable traits; it’s more like a rapidly changing community of killers. A drug that effectively fights one member of the community may not work so well on another. And the HIV community is a large one. Within group M, the group responsible for most HIV infections, there are at least nine subtypes.
To make matters worse, the community travels together. “For a particular patient, there could be several subtypes of HIV in the body, not just one,” says Dr. Guohui Lin, a professor of computing science at the University of Alberta (U of A).
As the community colonizes its human hosts, subtypes mate with other subtypes, creating brand new strains of HIV. Lin gives an example.
“When subtypes A and C come together, they cross over with each other and… form a new strain. We call (this new strain) a recombinant… Most of the time, recombinants don’t survive… But sometimes, when they do, they become a dangerous new strain.”
In 2005, New York City officials announced the case of a man whose HIV led to AIDS in less than two years. Normally, this progression takes about 10 years. The man’s HIV was also resistant to most HIV drugs.
And in February 2007, officials in King County, Washington, announced that four men were infected with an HIV strain also resistant to most HIV drugs.
Inside the bloodstream of an HIV-positive person, whole generations of HIV live and die within a single day. The virus is a target that moves and evolves with astonishing speed, and if scientists are to conquer HIV, they must have a sure-fire way to keep the target in their sights.
Keeping the target in sight
Scientists would do better at keeping up with HIV if they had a way to identify and analyze strains quickly and accurately. After all, you can’t catch a killer if you don’t have a good profile of the killer.
This is where the U of A bioinformatics research group gets involved. The group uses computing science to tackle problems related to biology and medical science. Lin and other researchers in the group have created an efficient application for analyzing HIV strains.
Says Lin, “It tells you immediately which subtype a virus belongs to. If there is no subtype that it belongs to (i.e. it’s a recombinant), our application can tell you which segments of the virus come from which subtype.”
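To make the idea of segment-by-segment subtyping concrete, here is a toy sketch in Python. It is not the team's application. It assumes the query strain and one reference sequence per subtype are already aligned to the same length, and it labels each window of the query with the reference that has the fewest mismatches. The names `hamming`, `assign_segments`, `references` and `window` are all made up for illustration.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def assign_segments(query, references, window):
    """Label consecutive windows of `query` with the closest reference subtype.

    Assumption: `references` maps a subtype name to a sequence aligned to
    `query` (same length, same coordinates). Adjacent windows that receive
    the same label are merged into one (start, end, subtype) segment.
    Ties go to the alphabetically first subtype.
    """
    segments = []
    for start in range(0, len(query), window):
        end = min(start + window, len(query))
        chunk = query[start:end]
        best = min(
            sorted(references),
            key=lambda s: hamming(chunk, references[s][start:end]),
        )
        if segments and segments[-1][2] == best:
            # Extend the previous segment instead of starting a new one.
            segments[-1] = (segments[-1][0], end, best)
        else:
            segments.append((start, end, best))
    return segments


# Hypothetical toy data: two "subtypes" and a strain built from both.
references = {"A": "AAAAAAAA", "C": "CCCCCCCC"}
print(assign_segments("AAAACCCC", references, window=4))
```

A strain that matches one reference throughout comes back as a single segment, while a mixed strain comes back as several, which is the pattern a recombinant would show in this simplified picture.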
The team’s HIV work is an example of phylogenetic analysis, which reveals how closely different species are related. In the case of the team’s HIV work, it reveals how closely specimens of a single species are related.
Phylogenetic analysis depends on deriving meaningful data from reams and reams of genetic information. To give you an idea of how much information, consider the genome, the complete genetic map of a species. In a 2007 paper, Lin and his team state, “Most genomes contain millions to billions of nucleotides (the building blocks of DNA).”
Researchers began to map whole genomes back in the 1970s, and now more than 200 have been mapped, including the human genome. By virtue of their completeness, whole genomes offer the richest genetic information about an organism that you could ask for.
But some researchers have hesitated to analyze the whole genome because of the extremely large volume of information. “Such huge volumes of data create computational challenges in both memory consumption and CPU usage,” the Lin team states in their paper.
Lin’s team, however, has come up with a unique method for analyzing the whole genome of HIV without getting overwhelmed by information. Their application can accurately analyze thousands of HIV strains in only seconds.
The application combs through a genome and selects the strings of nucleotides that “contain the richest evolutionary information,” the team states in their paper.
The strings they select are actually excerpts from much longer strings. Zhipeng Cai, a bioinformatics PhD student, says complete nucleotide strings are thousands of nucleotides long. “It’s very hard to compare these long sequences,” he says.
So, rather than compare the whole sequences, the application pinpoints the subsequences that contain the most important evolutionary information. To analyze a virus, the application uses only these top-ranked subsequences. This is how it avoids an onslaught of data.
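The general idea of counting subsequences and keeping only the most informative ones can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the method from the team's paper: the fixed subsequence length `k`, the scoring rule (how unevenly a subsequence appears across subtypes) and the function names are all stand-ins chosen for simplicity.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every length-k substring (k-mer) of `seq`."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def top_ranked_kmers(labelled, k, n):
    """Return the n k-mers whose presence differs most between subtypes.

    `labelled` is a list of (subtype, sequence) pairs. The score is an
    assumed stand-in: the largest minus the smallest fraction of strains
    per subtype that contain the k-mer. Ties are broken alphabetically.
    """
    subtypes = sorted({s for s, _ in labelled})
    totals = Counter(s for s, _ in labelled)
    present = {s: Counter() for s in subtypes}
    for s, seq in labelled:
        # Count each k-mer once per strain (presence, not multiplicity).
        present[s].update(set(kmer_counts(seq, k)))
    all_kmers = set().union(*present.values())

    def score(kmer):
        fracs = [present[s][kmer] / totals[s] for s in subtypes]
        return max(fracs) - min(fracs)

    return sorted(all_kmers, key=lambda km: (-score(km), km))[:n]


def profile(seq, kmers, k):
    """Describe a strain using only the selected k-mers."""
    counts = kmer_counts(seq, k)
    return [counts[km] for km in kmers]


# Hypothetical toy strains, far shorter than any real genome.
strains = [("A", "AAAACCCC"), ("A", "AAAACCGC"),
           ("C", "GGGGTTTT"), ("C", "GGGGTTAT")]
selected = top_ranked_kmers(strains, k=4, n=3)
print(selected, profile("AAAACCCC", selected, k=4))
```

Once each strain is reduced to a short profile over the selected subsequences, comparing two strains means comparing two short lists of counts rather than two sequences thousands of nucleotides long.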
“Our contribution to whole genome phylogenetic analysis of HIV is efficient counting,” says Lin.
Beyond HIV research
Cai and Lin say the team’s method for analyzing HIV also works on other viruses. The team has been doing similar work with avian flu and foot-and-mouth disease.
The genetic information they cull sometimes reveals surprising things. For example, the team, together with their American collaborators, has developed a theory that Alberta is one of the regions where avian flu originates.
Lin says he’s pleased that he both discovers new knowledge in biology and makes it easier for his colleagues to do their work. He recalls a biologist, now with the Centers for Disease Control and Prevention (CDC), who was grappling with a recombinant strain of avian flu.
“I said, ‘If you can give me the DNA sequence, I can definitely help you figure out the detailed events of the recombination.’ And he said, ‘You can do that?’ And I said, ‘Of course.’”
Article and photos by Erin Ottosen, 2007.