By Joe Stanganelli
October 25, 2016 | A seasoned medical researcher talking about the state of next-generation sequencing (NGS) for differential diagnostics five years ago might spin you a tale akin to the stereotyped centenarian talking about what it was like to walk to and from school in eight feet of snow uphill both ways.
Ask Andrew Dauber.
"When I started doing this—my first paper using next-generation sequencing was in 2011, which was pretty early in the exome area—there were still people who were using SAM tools or… homegrown variant callers," related Dauber, a University of Cincinnati pediatrics professor who is an attending physician at Cincinnati Children's Hospital Medical Center (CCHMC) and serves as the Program Director and Director of Translational Research for the Cincinnati Center for Growth Disorders, in an interview with Diagnostics World News. "[P]eople would call their datasets in both GATK [a.k.a. the Broad Institute's Genome Analysis Toolkit] and SAM tools, and then look at the overlap and look at what was excluded, and use that to try to figure out which variants were real or not real—because there wasn't as much confidence in the variant callers."
Back in those days—the dark ages of NGS—Dauber developed a familiarity with, and eventual preference for, GATK in what began as a matter of geography. Dauber went to Harvard Medical School, which is closely affiliated with the Broad Institute (still sometimes referred to as the Broad Institute of MIT and Harvard) and located nearby (the medical school is in Boston; the Broad is in Cambridge, Massachusetts). Dauber later worked practically across the Charles River from the Broad Institute at Boston Children's Hospital (a Harvard teaching hospital) as a pediatric endocrinologist, inter alia, from 2008 until his move to Cincinnati in 2014.
Having worked with a Broad member during his training, Dauber eventually became a member of the Broad himself—going to weekly meetings and conducting all of his exome-sequencing analysis via the Broad's platform. Consequently, Dauber saw what the Broad's GATK could do—and has been something of a devotee ever since.
"We use GATK on a daily basis," said Dauber of his work today in Cincinnati. "GATK is used in our core facility to do all of the sequencing analysis. [A]ll my analysis has been based on variants that were identified using GATK. We've been very successful in identifying new genetic ideologies, for a host of different conditions, using GATK—ranging from hypercalcemia in infancy, to precocious puberty, to other causes of severe growth disorders—and GATK has called the variants, and allowed, and facilitated all of those discoveries."
The Evolution of Confidence
Moreover, and perhaps more importantly, Dauber has seen the evolution of GATK since those NGS dark ages—and the impact of that evolution—first hand. He related this story from "a few months ago."
"[A colleague] had processed this patient's data through an older version of GATK—this was a patient with severe growth problems, severe developmental issues, and some dysmorphic features—and the older version did not reveal anything," said Dauber. "Our research team here that processed next-generation sequencing data [subsequently] followed the [GATK] Best Practices and ran the most up-to-date version of [GATK] with joint calling, and immediately I redid the analysis, and we were able to find the pathogenic variant that this kid had: a rare form of something called Rubinstein-Taybi Syndrome."
Why didn't the older version of GATK find this? Dauber explained that the patient had the less-common Type 2 of the disease, with a mutation "that just simply was not called using the older version."
Indeed, according to Dauber, Broad Institute efforts like the ExAC browser and similar open-sharing efforts of sequencing data have done wonders for GATK-using physicians like himself in their rare-disease diagnostic efforts. The aspects of GATK that make it at once community-driven and open-source(ish) have allowed it to thrive as more than mere software; GATK is a platform—where the end users, as contributors and participants, are just as important as (or arguably more important than) GATK itself. Without the GATK community building upon prior work, there would be no GATK.
"I remember my first… infantile hypercalcemia case, which was due to an indel. I remember the indels weren't included in the VCF; it was a separate file, and the Broad was not… confident in those calls at all," said Dauber. "And all of this has changed now; not only are indels integrated into the main VCF, [with] more confiden[ce], but now I feel like you never really see other variant callers mentioned in methods."
"With 37,000 users worldwide, the GATK user community is a powerful crowdsourcing mechanism that allows issues to get caught, reported, and fixed fast," added Lee McGuire, Chief Communications Officer of the Broad Institute, in an email interview with Diagnostics World News.
Pretty Powerful Platform
For rare-disease pediatricians like Dauber, this crowdsourcing is what makes GATK better than closed-source, strictly proprietary sequencing solutions. On this point, Dauber related to Diagnostics World News his most "powerful" example of GATK coming through where at least one commercial sequencing vendor, South Korea-based Axeq Technologies, failed.
"There was this one example where we actually had done commercial sequencing from a separate vendor when I was with Boston Children'," said Dauber, referring to Axeq. "[The case involved] this family with… two young children—one of whom was an infant—who had central hypothyroidism. Their [respective] pituitary glands seemed not to be working correctly, not stimulating the thyroid gland to make thyroid hormone, and this is quite an unusual thing in small children—and to have two siblings [with this problem], that suggested that there was a recessive disorder."
According to Dauber, Axeq's non-GATK sequencing technology came up short—providing sequencing data from which no ideology could be derived. Dauber, however, knew better.
"I was just— I felt like this [diagnosis] had to be there, and I was just skeptical," stammered Dauber in sense-memory frustration. "[So] we actually took the same raw data—the BAM files—that the company had produced, and then I asked the team at the Broad to just rerun the same BAM files through GATK and see what [would happen], and literally in like two minutes I found the answer, which was this homozygous mutation in… Prop 1, which is a factor that affects pituitary differentiation and development."
Dauber went on to explain that, for one of the two siblings, Axeq's variant-calling software had miscalled this variant as heterozygous instead of as homozygous.
"That simple correct call in GATK made a huge difference in… finding the ideology for these patients," said Dauber. "[A]lso, this ideology had immediate diagnostic and therapeutic ramifications because Prop 1 leads to multiple pituitary hormone deficiencies—so it alerted us of the need to check for some other pituitary hormones, and then be able to counsel the family appropriately about the necessary treatments. That was … pretty powerful."
Standards and Speed
This reputation for accuracy has helped GATK achieve huge gains in user base and ubiquity as a platform; this year, the Broad Institute has even been successful in forming several partnerships with its competitors to make GATK available on their competing solutions. (See coverage in Bio-IT World.) This would seem to be a win-win for all involved, allowing commercial vendors of proprietary NGS solutions to maintain market share and relevance while serving existing GATK devotees and welcoming new ones.
One such partner, however, has been mildly dismissive of GATK's relative usefulness. In an email interview for the preceding Bio-IT World report to which this one plays counterpoint, Edico Genome's Vice President of Marketing, Gavin Stone, said that the company is converting customers to its DRAGEN solution—despite offering GATK licenses on its systems—because the former allegedly offers similar accuracy at far greater speeds. (See coverage here.)
And, to be sure, a lot has changed in the past few years in NGS technologies; some commercial NGS solutions are catching up, enabling GATK-like accuracy at faster speeds on powerful systems—with or without the actual GATK software. The Broad Institute's McGuire declined to comment on Stone's claims directly, but did say that the new version of GATK expected for release later this year will be "substantial[ly]" faster than its predecessor.
All of this said, agility can come in many forms, and the reported speed advantages of other tools have apparently done little to sway Dauber, who sees distinct flexibility and adaptability benefits in GATK.
"Now, all of the [GATK] Best Practices are online and available," said Dauber. "Anytime there's an update[,] you just need somebody locally who knows how to implement that stuff and knows how to interact [and] ask questions."
Still, for Dauber, it all comes down to accuracy and trust in the scientific community—and GATK's reputation therefor.
"[Sequencing has become] almost a commodity that's available at so many places, but still you need to do it well; how you process the data makes a huge difference—and I think that the team at the Broad has done a very good job of constantly improving their calling and making sure you're getting the most out of the data that's being generated from the sequencing machines," said Dauber. "I think it's created a standard for the variant calling aspect. I think it's created a standard that's accepted as high quality exome data. Nobody questions that anymore."
On this latter note of standardization, Dauber has a bright outlook on GATK's status.
"My impression is that GATK now is the standard for exome sequencing analysis pipelines," said Dauber. "You used to have to [detail] your methods … and now papers just say 'We use GATK following their best practices"—period, and it's like, 'Okay, fine.' It's accepted. That's what you're supposed to do."