Bridging the gaps in DNA sequencing

I recently went to a website that provides access to one man’s full genome sequence, self-sequenced from his white blood cells. It may be the first such sequence prepared, soup to nuts, by the same individual who provided the sample.

Published just last month, the sequence was prepared on a new technology that may soon bypass the genome center setup that has become familiar: large sequencer, lab for sample preparation, huge computing infrastructure for analysis, etc. Indeed, the man who prepared the genome, Clive Brown*, makes the following prediction in his “Read Me” file:

“In the near future you will be able to take spittle, say, at home and sequence yourself on a regular basis. Data can be stored, analyzed and tracked online under your control.”

How is this sequence — what has been whimsically called the “Clive-ome” — possible? It’s through a new long-read sequencing technology, nanopore sequencing, which is starting to change what’s possible in the field. Another technology, known as single-molecule, real-time (SMRT), is also being used to uncover biology obscured by other sequencing methods. So what might these new capabilities mean for future research?

A success story, with caveats

The rapid improvement of DNA sequencing technology over the past 15 years has been a true innovation success story. Astonishing progress in sequencing speed, cost and accuracy has drastically changed biomedical research.

The technology that has driven most of the sequencing progress is short read, massively parallel sequencing, with a high percentage of it conducted on equipment from Illumina**. DNA is first fragmented into short segments, usually around 150-300 base pairs long, and huge numbers of these small fragments are sequenced simultaneously. After sequencing, however, each of the short sequences has to be fit onto a reference template in a process known as sequence assembly. That’s one reason obtaining the first genome sequence for a species is a relatively long, hard process, and also a very important one: it provides the reference sequence onto which future genome sequences can be assembled in the correct order.

There’s a significant problem though. Even the best short read whole genome sequences aren’t whole. At least five percent of the DNA bases (more than 150,000,000 of them in a human genome) aren’t included in whole genome data. Short read data can also leave out other information that can be vital for disease diagnosis. Why is that?

One issue is that large areas of the genome are highly repetitive, including at the ends (telomeres) and junction points (centromeres) of chromosomes. These regions include so-called AT and GC rich regions, with long stretches of DNA that yield nearly identical short fragments. It’s impossible to accurately reassemble these onto a template. Think of a puzzle that’s all featureless blue sky—reassembling highly repetitive regions is a similar challenge, only many orders of magnitude larger and more difficult.
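The blue-sky puzzle problem can be seen in a toy example. The sketch below is only an illustration: the reference string, the reads and the `find_positions` helper are all made up for this purpose, and real aligners tolerate mismatches and work at a vastly larger scale. It places a short read and a long read onto a tiny reference that contains a repetitive stretch.

```python
def find_positions(reference: str, read: str) -> list[int]:
    """Return every 0-based offset where `read` matches `reference` exactly."""
    positions = []
    start = reference.find(read)
    while start != -1:
        positions.append(start)
        # Step forward by one so overlapping matches are also found.
        start = reference.find(read, start + 1)
    return positions


# Hypothetical toy reference: unique flank + AT-rich repeat + unique flank.
reference = "GATTACA" + "AT" * 10 + "CCGTTGA"

# A short read that falls entirely inside the repeat.
short_read = "ATATAT"
# A long read that spans the whole repeat plus a little unique sequence on each side.
long_read = "ACA" + "AT" * 10 + "CCG"

short_hits = find_positions(reference, short_read)
long_hits = find_positions(reference, long_read)

print("short read fits at offsets:", short_hits)
print("long read fits at offsets:", long_hits)
```

The short read fits equally well at several offsets, so there is no way to say where it came from. The long read reaches unique sequence on both sides of the repeat and fits in only one place.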

Another problem is that many genomic features and defects don’t involve sequence changes. Called structural variations (SVs), they include sections of DNA that are reversed (inversions), missing DNA (deletions), duplicated DNA (duplications) and so on. Sometimes copies of a whole gene are added or deleted in what are known as copy number variants, meaning that there are fewer or more than the usual two copies of that gene. SVs are common in even healthy human genomes, as JAX’s Charles Lee discovered, and ongoing research by Lee and others has shown that SVs also underlie many diseases and disorders. The ability to detect them is therefore very important.

The thing is, these variations rearrange or recopy stretches of DNA rather than change individual bases, and it can be very difficult to find them when the DNA is fragmented into small pieces. How do you know a sequence is reversed in the original genome? Or that there are three copies of a segment instead of two? So even while short read DNA sequencing has ramped up and provided unprecedented amounts of data and started to affect clinical care, there has been a concerted effort to implement technologies that sequence longer DNA fragments.
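To make the inversion question concrete, here is a minimal sketch using invented sequences and a hypothetical `orientation` helper. It assumes exact matching only; real structural variant callers use alignment, paired reads and statistics. A short read taken from inside an inverted segment looks like an ordinary read from the opposite DNA strand, while a longer read that crosses the edge of the inversion does not fit the reference as a whole, and its two halves fit in opposite orientations.

```python
def revcomp(seq: str) -> str:
    """Reverse complement of a DNA string made of A, C, G and T."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def orientation(reference: str, read: str):
    """Return '+' for a forward match, '-' for a reverse-strand match, else None."""
    if read in reference:
        return "+"
    if revcomp(read) in reference:
        return "-"
    return None


# Hypothetical toy sequences: the sample carries an inversion of `middle`.
left, middle, right = "GATTACAGG", "CTTGAACG", "TCCGATAGC"
reference = left + middle + right
sample = left + revcomp(middle) + right

short_read = sample[10:16]  # lies entirely inside the inverted segment
long_read = sample[4:14]    # crosses the left edge of the inversion

print("short read:", orientation(reference, short_read))
print("long read as a whole:", orientation(reference, long_read))
print("long read halves:",
      orientation(reference, long_read[:5]),
      orientation(reference, long_read[5:]))
```

The short read simply reports a reverse-strand match, which is what roughly half of all reads do in a normal genome. Only the read long enough to cross the breakpoint shows the switch in orientation.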

Long reads, big hurdles

Nanopore and SMRT sequencing provide the ability to sequence DNA segments of 5,000-10,000 base pairs and even more in one read, but in past years they have been hampered by high cost and accuracy problems relative to short read technology. They are currently being used effectively for certain applications, however, and recent advances have them poised to make a significant impact. As a recent review of next generation sequencing in Nature notes: “… other approaches now aim to sequence longer contiguous pieces of DNA, which are essential for resolving structurally complex regions.”

SMRT is the more mature long read sequencing technology, commercialized in 2011 by Pacific Biosciences (PacBio). It uses the same basic components as short read sequencing, but reconfigures them to allow for the sequencing of much longer single molecules. The technology was hampered early on by relatively high error rates and much higher costs than short read sequencers. The original sequencing machine alone had a high six-figure price tag, took up a lot of floor space, and was notoriously finicky to run, but the long read data has always been sought after for certain tasks.

Researchers at JAX have used the system to sequence entire microbial genomes at once, a hugely valuable attribute when sorting through samples with dozens or hundreds of different microbial species present. JAX researchers are also using it to sequence entire messenger RNAs (after reverse transcribing them to DNA) to see which different forms, called isoforms, are present in the cell; different isoforms can differ significantly in function. At other institutions, researchers have also turned to SMRT sequencing to create better, more complete human whole genome sequences. SMRT remains somewhat prone to sequencing errors in each sequencing run, but the errors are random, meaning high accuracy is obtainable by sequencing the same molecule multiple times.
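Why do random errors wash out when the same molecule is read repeatedly? A minimal sketch, assuming for simplicity that every pass has the same length and that errors are single-base substitutions (real long read errors also include insertions and deletions, so real consensus methods align the passes first): take a majority vote at each position. The sequences and the `consensus` function are invented for illustration.

```python
from collections import Counter


def consensus(passes: list[str]) -> str:
    """Majority vote at each position across equal-length passes of one molecule."""
    if len({len(p) for p in passes}) != 1:
        raise ValueError("all passes must have the same length")
    # zip(*passes) yields one column of bases per position.
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*passes))


# Hypothetical true sequence and five noisy passes over it.
truth = "ACGTACGTAC"
passes = [
    "ACGTACGTAC",  # no error
    "ACCTACGTAC",  # error at position 2
    "ACGTACGAAC",  # error at position 7
    "TCGTACGTAC",  # error at position 0
    "ACGTAGGTAC",  # error at position 5
]

result = consensus(passes)
print(result, result == truth)
```

Because each error lands at a different position, the wrong base is always outvoted. Errors that recurred at the same position in every pass would survive the vote, which is why the randomness matters.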

Nanopore sequencing, in which a DNA molecule is fed through a tiny pore and the bases sequenced by the slight differences in current disruption each combination of bases produces, has long been an intriguing possibility. It has been the subject of intense research for more than 20 years, but problems with controlling the speed with which the DNA molecule is fed through the pore and with the fine tuning of the pore and sensor structures themselves have kept nanopores in the realm of speculative research rather than real-world use until very recently. In early 2012, Oxford Nanopore made some rather sensational announcements at a genomics technology conference, and the community eagerly anticipated a working nanopore sequencer. And waited. And waited, with growing skepticism. Actual products started emerging a couple of years ago, however, and it seems that now the wait is paying off.

The Clive-ome is significant in that it shows that nanopores can now handle large sequencing tasks. But the first commercial nanopore product, the MinION, is a small sequencer not much bigger than a flash drive, and while it doesn’t have huge capacity or high accuracy, it can go just about anywhere and has been particularly valuable for sequencing pathogens. It has been used to quickly identify the sources of hospital infection outbreaks, has helped track the Ebola and Zika epidemics in Africa and South America, and has even gone into space, where it was used to sequence microbes on the International Space Station (ISS). Rapid development cycles have further enhanced speed and accuracy, and a larger instrument (which basically employs MinION-like devices running in parallel) is in the works, promising new options for using nanopore sequencing for large genomes, such as the C. elegans genome described in a just-released preprint. Or even mammals, like Clive. There are many nanopore fans out there (indeed, one researcher recently questioned whether long-read sequencing will go to all nanopore soon), but at this time each technique still has considerable advantages and drawbacks, depending on the application.

Impact on biomedical research

The contributions of long read sequencing to isoform research, microbial sequencing and pathogen tracking are already profound. The potential is there for human genome sequencing with better coverage and little assembly needed as well. If the methods continue to improve, that would provide better data more quickly for individual diagnoses of rare diseases, cancer and more. The technologies are still new and changeable, so predictions are difficult, but the payoffs could be enormous in the years ahead.

*Clive Brown is uniquely qualified to self-prepare his genome using nanopore technology, as he spearheads product development at Oxford Nanopore. His project is interesting nonetheless, and given how quickly outside researchers have found uses for nanopore sequencers, the prospect of widespread self-sequencing in the future is not entirely unrealistic.

**Illumina just announced a new sequencer that further builds on its short-read technology. Sadly, the headlines trumpeting a $100 genome sequence are not accurate — it will NOT sequence a genome for $100, or at least not anytime soon. It’s significant, however, in that it shows that Illumina remains strongly committed to short read sequencing despite the progress being made by its long read competitors. 

Mark Wanner followed graduate work in microbiology with more than 25 years of experience in book publishing and scientific writing. His work at The Jackson Laboratory focuses on making complex genetic, genomic and technical information accessible to a variety of audiences. Follow Mark on Twitter at @markgenome.