Genomes versus exomes versus genotypes
The pluses and minuses of different sequencing strategies
When someone has their “DNA sequenced,” it almost always means one of four things.
- Their entire genome of three billion plus base pairs may be sequenced in a process appropriately called whole genome sequencing, or WGS.
- Alternatively, they may have whole exome sequencing, or WES, in which the ~1.5% of their genome that codes for proteins is sequenced and the non-coding regions are not.
- Genotyping, offered by several direct-to-consumer companies, identifies single bases at specific locations in the genome where variation often occurs. For example, 23andMe does this, providing data for nearly one million bases.
- Finally, patients with certain diseases may have a specific gene or genes sequenced in a very defined protocol to help with diagnostics and treatment. For example, cancer gene panels that include more than a hundred genes are currently being evaluated for efficacy in determining targeted therapies. (I’m going to set this kind of targeted diagnostic sequencing aside, because it is usually done to address an individual patient’s specific clinical need.)
There’s still significant disagreement among experts about how to strategize sequencing and genotyping programs. A lot of the controversy comes down to cost versus data gained. Given that we need huge numbers of sequences to start identifying and understanding patterns in the genome, it’s argued that getting genotype or exome data from hundreds of thousands or millions of people is more valuable at this time than a lower number of whole genome sequences. But WGS prices are coming down, and we’re learning more and more about the role of non-coding regions in some diseases, which the other methods largely miss.
As precision medicine ramps up in the United States and the tests become more clinically relevant, it’s important to know what each can—and cannot—tell you about your genetic makeup. So what’s the best method and where are we going from here?
Genotyping
When 23andMe offered genetic tests for $99 (now $199), it made a big splash in the news. It’s still a very popular way to take a look at one’s ancestry and some genetic traits, and it has provided researchers with massive amounts of genetic data from customers as well. Nonetheless, the ability of 23andMe and similar genotyping companies to report medical insights via the test was curtailed by the FDA, although some limited clinical reporting has resumed. So what do they actually do with the “spit tubes”?
Unlike sequencing, which provides ACTG sequences of the entire genome or large portions thereof, genotyping focuses on specific bases in the genome known to vary from person to person. It’s cheap and fast because the company can simply choose the single nucleotides of DNA it wants to include and run the test on a chip. It yields interesting data, and 23andMe now has more than a million consenting customers, providing some intriguing research opportunities. In addition to customers learning about their individual ancestry and traits, researchers have mined the data for clinically relevant variants associated with depression, resilience to Mendelian disease, and more.
But the data has limitations too. Think of it this way: Even if you are an expert and decide the most relevant 1,000,000 DNA bases to include in your genotyping test, you are still eliminating more than 3,100,000,000 bases in the genome. True, not all bases are created equal for clinical relevance or insight, but there’s a lot of missing information in a genotype.
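To put numbers on that gap, here is a minimal Python sketch of the arithmetic. The genome size of 3.2 billion bases is an assumption standing in for "three billion plus," and the function name is purely illustrative.

```python
# Back-of-the-envelope arithmetic for how much of the genome a genotyping chip reads.
GENOME_BP = 3_200_000_000   # assumption: ~3.2 billion bases ("three billion plus")
GENOTYPED_BP = 1_000_000    # ~1 million selected bases, as in the text

def fraction_read(assayed_bp, genome_bp=GENOME_BP):
    """Return (fraction of the genome assayed, number of bases not assayed)."""
    return assayed_bp / genome_bp, genome_bp - assayed_bp

fraction, missed = fraction_read(GENOTYPED_BP)
print(f"Genotyping reads {fraction:.4%} of the genome; {missed:,} bases go unread.")
```

Under these assumptions a million-base chip reads roughly three hundredths of one percent of the genome, which is why the choice of which bases to include matters so much.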
Whole exome sequencing (WES)
Many of the first institutions to offer clinical sequencing chose to focus on the coding regions only: the exome. Why? For much of the genomic sequencing era, the non-coding regions of the genome (upwards of 98% of the DNA we possess) were considered superfluous, though that view has now changed dramatically. They were referred to in ways that implied their unimportance: “junk DNA,” “genomic dark matter,” and so on. In this context, isolating and sequencing only protein-coding regions and ignoring the “junk” was a cheaper, quicker, and less data-intensive alternative.
WES focuses exclusively on the known coding regions of the genome. There are roughly 180,000 exons (the sequences that are retained in mature messenger RNA after transcription, most of which are then translated into proteins), constituting about 30,000,000 base pairs. There are several specific methodologies for sequencing exomes, but they all work on the common concept of targeting the desired sequences, “capturing” them (usually by binding them with complementary DNA sequences), and enriching them for sequencing. Refinements in the techniques have made WES highly effective in capturing and sequencing just the coding regions with high accuracy.
Groups using WES have demonstrated its clinical usefulness by successfully diagnosing a relatively high percentage of patients with previously undiagnosed, and usually rare, diseases. They report a consistent diagnostic rate of about 30%, with one of the earliest and largest providers, Baylor College of Medicine, arriving at a figure of 28% using very stringent standards. In addition, data from research studies using exome sequencing in various groups has been gathered by a group at the Broad Institute as part of the Exome Aggregation Consortium (ExAC). As explained in a blog post by senior author Daniel MacArthur, the massive data collection and analysis effort now has exomes from more than 60,000 people (soon to be many more), allowing for significant insights into human genetic variation, albeit limited to the variations found in the coding regions of their genomes.
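The consequence of targeting can be shown with a toy Python sketch. It does not model the laboratory capture chemistry. It only shows that a variant is visible to exome sequencing when it falls inside a targeted interval. The coordinates, variant positions, and function names are all hypothetical.

```python
from bisect import bisect_right

# Hypothetical exon targets on one toy chromosome: (start, end), 1-based inclusive,
# sorted and non-overlapping. A real exome has roughly 180,000 such targets.
EXON_TARGETS = [(100, 250), (400, 520), (900, 1100)]

# Hypothetical positions at which a sample differs from the reference.
VARIANT_POSITIONS = [50, 120, 399, 400, 520, 700, 1000, 1500]

def in_targets(position, targets):
    """Return True if position lies inside any target interval."""
    starts = [start for start, _ in targets]
    idx = bisect_right(starts, position) - 1  # last target starting at or before position
    return idx >= 0 and position <= targets[idx][1]

def split_by_capture(positions, targets):
    """Partition positions into (captured, missed) lists."""
    captured = [p for p in positions if in_targets(p, targets)]
    missed = [p for p in positions if not in_targets(p, targets)]
    return captured, missed

captured, missed = split_by_capture(VARIANT_POSITIONS, EXON_TARGETS)
print("seen by exome sequencing:", captured)
print("outside the targets:", missed)
```

In this toy example half of the variants sit outside the targets and would go unreported, no matter how accurately the targeted regions themselves are sequenced.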
Whole genome sequencing (WGS)
The move to WGS for both research and clinical sequencing is gaining momentum. Indeed, a researcher here at The Jackson Laboratory (JAX) told me with complete confidence that s/he believes all sequencing programs will move to WGS within five years. Obviously, lower prices and greater ability to handle the data are key factors. Also, the notion of “junk DNA” has been largely debunked over the past few years. There is significant debate about many aspects of the research, including what percentage of non-coding DNA has an actual function and even how “function” should be defined in this context, but there is now consensus that some non-coding regions indeed have vital functions, particularly in the regulation of gene expression. The relative success of clinical exome sequencing still leaves ~70% of patients without diagnoses, and some recent insights into non-coding DNA variants indicate the answers may be found there in many cases.
The challenges involved with making clinical WGS efficient and feasible remain daunting. While more and more people are obtaining WGS, there is no whole-genome equivalent of ExAC, or anything close to it. And ExAC has assembled almost a petabyte of raw exome data, so any repository of the size needed for clinical insight beyond rare disease and some cancer applications (100,000 genomes? 1,000,000? 10,000,000?) will require almost unimaginable data management and computing horsepower. And before we tackle that challenge, formidable data sharing and patient privacy hurdles must be addressed and cleared.
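A rough Python sketch gives a sense of the scale. It uses only the ExAC figures above, rounded: about one petabyte for about 60,000 exomes. It then assumes, very conservatively, that a whole genome needs no more raw data than an exome. Because WGS reads far more bases per person, the results are best read as a floor and not as a forecast. The names and the decimal petabyte are illustrative assumptions.

```python
PETABYTE = 10**15  # bytes, decimal definition (assumption)

EXAC_RAW_BYTES = 1 * PETABYTE  # "almost a petabyte", rounded up for simplicity
EXAC_EXOMES = 60_000           # "more than 60,000 people", rounded down

def floor_storage_pb(n_samples, bytes_per_sample):
    """Raw storage in petabytes if every sample needs bytes_per_sample."""
    return n_samples * bytes_per_sample / PETABYTE

# Average raw data per exome implied by the rounded ExAC figures.
bytes_per_exome = EXAC_RAW_BYTES / EXAC_EXOMES

# Conservative floor: pretend each genome is no larger than one exome.
estimates = {
    n: floor_storage_pb(n, bytes_per_exome)
    for n in (100_000, 1_000_000, 10_000_000)
}
for n, petabytes in estimates.items():
    print(f"{n:>10,} samples: at least ~{petabytes:,.1f} PB of raw data")
```

Even on this deliberately low estimate, ten million samples would require well over a hundred petabytes of raw data.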
But progress is being made on the research front, where tools are being developed to analyze the potential impact of non-coding variants throughout the human genome. Last week, a group led by newly arrived JAX Professor Peter Robinson published a paper detailing a new tool called “Genomiser,” which assesses the pathogenicity of known non-coding variants and finds a high rate of causal mutations in human disease. Increasing knowledge of non-coding variant effects and developing analytical tools to quickly find probable disease variants will greatly expand WGS usefulness in both clinical and research settings.
Recent initiatives such as the U.K.’s 100,000 Genomes Project and the U.S.’s Precision Medicine Initiative are driving progress in clinical genomics at a rapid pace. What seemed exotic a few years ago may quickly become standard procedure. Questions, challenges, and even pessimism regarding precision medicine persist in the healthcare system, but becoming familiar with the testing and sequencing options and their pros and cons is important and relatively easy these days, and many of us will likely be mulling over our variants of uncertain significance in the near future.
Mark Wanner followed graduate work in microbiology with more than 25 years of experience in book publishing and scientific writing. His work at The Jackson Laboratory focuses on making complex genetic, genomic and technical information accessible to a variety of audiences. Follow Mark on Twitter at @markgenome.