Detecting human genome structural variation with long read sequencing

Chia Lin Wei, Director, Genome Technologies in lab

Rapid advances in DNA sequencing have dramatically accelerated biological and biomedical discovery over the past decade. Unfortunately, some genomic regions and features have remained very difficult, if not impossible, to accurately identify and analyze. Work led by Chia Lin Wei of The Jackson Laboratory, using recent improvements in long read sequencing technology, has now made an important type of difficult-to-find genomic alteration, known as structural variants (SVs), far easier to accurately detect and classify.

Historically, the most common, economical and accurate sequencing methods have been based on short reads. That is, the genome is cut into very short segments prior to sequencing, then reassembled using a reference sequence. The reference acts as a scaffold so that each separate short read can be added in the correct place, like a linear jigsaw composed of millions of pieces. Short read methods work very well for finding single nucleotide variation within the sequence throughout most of the genome.

Short read methods fall far short for some analyses, however, including finding structural variants, which account for a large amount of the genetic variation found between individuals. SVs include insertions, deletions, inversions and duplications of DNA segments within the genome. Because SVs usually change the number of sequences present or their order, not the actual sequences themselves, they are difficult to detect using short read assembly protocols. Yet SVs are common, and they are increasingly implicated in disease, including cancer.

Now, improvements in long read sequencing methods are also advancing SV detection. One technology of note is nanopore sequencing. DNA is threaded through a nanopore and bases are detected by minute changes in current within the pore, a protocol that provides long reads and has seen a sharp increase in accuracy over the past two years. In a paper published in Nature Methods, Wei and her team present a computational analysis pipeline, named Picky, that uses nanopore sequence data to detect a full spectrum of SVs and identify where the DNA breaks occur with high accuracy.
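To make the idea concrete, here is a minimal Python sketch of how a single long read that aligns to the reference in two pieces can hint at an SV type and its breakpoint. It is an illustrative toy under simplifying assumptions, not Picky's algorithm: the `Segment` structure, the `classify_split_read` function, the 50-base `min_size` threshold and the coordinate conventions are all hypothetical choices made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One aligned piece of a long read (hypothetical structure for this sketch)."""
    chrom: str       # reference chromosome name
    ref_start: int   # 0-based start on the reference
    ref_end: int     # end on the reference (exclusive)
    read_start: int  # 0-based start on the read, in original read orientation
    read_end: int    # end on the read (exclusive)
    strand: str      # "+" or "-"


def classify_split_read(a: Segment, b: Segment, min_size: int = 50) -> str:
    """Guess the SV type implied by two consecutive segments of one read.

    `a` must come before `b` on the read. These are toy rules for
    illustration only; real callers weigh many reads and alignment scores.
    """
    if a.chrom != b.chrom:
        return "translocation"
    if a.strand != b.strand:
        return "inversion"

    # Distance between the two segments along the reference, measured in the
    # direction the read travels (reversed for minus-strand alignments).
    if a.strand == "+":
        ref_gap = b.ref_start - a.ref_end
    else:
        ref_gap = a.ref_start - b.ref_end
    # Read bases left unaligned between the two segments.
    read_gap = b.read_start - a.read_end

    if ref_gap <= -min_size:
        # The read jumps backwards on the reference: a region is repeated.
        return "tandem duplication"
    if ref_gap - read_gap >= min_size:
        # Reference bases are missing from the read.
        return "deletion"
    if read_gap - ref_gap >= min_size:
        # The read carries bases that the reference lacks.
        return "insertion"
    return "none"


# A read whose two halves map 500 bases apart on the reference.
first = Segment("chr1", 1000, 2000, 0, 1000, "+")
second = Segment("chr1", 2500, 3500, 1000, 2000, "+")
print(classify_split_read(first, second))  # deletion
```

In this sketch the breakpoints are simply the segment edges (`first.ref_end` and `second.ref_start`). Because one long read spans both sides of the junction, the rearrangement is visible directly, whereas a short read would rarely cover enough of either side to place both ends confidently.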

Working with a breast cancer cell line that has been the focus of many studies, the team found 34,100 unique SVs (mostly insertions and deletions but also including inversions, translocations and tandem duplications) with 66,660 breakpoints. Accurate SV detection allows for better insight into cancer genome dynamics, such as the finding that SVs are enhanced in regulatory regions that affect gene expression in the cancer lines studied. Such changes can increase the expression of cancer-promoting genes or inhibit cancer-suppressing genes, indicating their importance in cancer initiation and progression.

In the end, Picky significantly outperformed short read protocols, particularly in detecting and classifying tandem duplications, inversions and some insertion events. It even outperformed other long read analysis protocols. “Our analysis showcases the effectiveness of long-read sequencing analysis,” says Wei, “and uncovers new features of SVs and how they affect chromatin territories and transcriptional regulation.” Combined with increasingly accurate long read data, Picky could help change sequencing approaches for genome-wide structural analysis.