Machine learning, and deep learning in particular, has a growing influence on everything from how we shop to how we drive, or are driven by self-driving cars. Anshul Kundaje, Ph.D., assistant professor of genetics and computer science at Stanford University, is using similar tools to study large-scale computational regulatory genomics. Using the vast amount of biological data available, Kundaje is teaching neural networks to predict biomolecular function across tissue and cell types. The goal is to develop important new insights into our own biology, as well as accelerate our understanding of disease.
Kundaje was awarded the Chen Award of Excellence at the 2019 Human Genome Meeting in Seoul, Korea. HUGO sat down with him to learn more about his work and what neural network “black boxes” actually do.
Anshul Kundaje: I investigate the control circuitry of the genome and how its sequence is translated to biochemical activity.
AK: Of course, a lot goes into it. Along with DNA sequencing, there have been breakthroughs in profiling different kinds of biochemical markers, such as protein-DNA binding, gene expression, epigenetic modifications and so forth. So we’re dealing with not just genome sequencing data but also data about these molecular profiles. The question is, how is gene expression controlled by the sum of all these parts?
An example I like is to think of translating a sentence from one language to another, say from English to French. The input is an English sentence and the output is the same sentence in French. For my research, the input is a DNA sequence and the output is a predicted biochemical profile that we map across the genome.
AK: This is where we can use machine learning approaches like neural networks, which are very good at mapping different kinds of inputs to different kinds of outputs. They take inputs, they learn complex patterns in them, and they can learn complex functions mapping the inputs to the outputs. So the neural network can take the DNA sequence and learn complex predictive sequence patterns. Using that, we can map how a sequence gets translated into various kinds of biochemical markers and activity.
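As a rough illustration of the sequence-to-profile mapping described above, the sketch below one-hot encodes a DNA string and slides a single filter along it, which is the basic operation a convolutional neural network layer performs. It is not the lab's model: the motif "TGAC", the filter weights, and the function names are made up for the example, and in a real network the weights would be learned from data rather than written by hand.

```python
# Toy illustration only: one hand-written filter scanned along a DNA sequence.
# In a trained network the filter weights are learned; here they are assumed.
BASES = "ACGT"


def one_hot(seq):
    """Encode a DNA string as a list of 4-element 0/1 vectors (order A, C, G, T).

    Any character outside ACGT (for example N) becomes an all-zero vector.
    """
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq.upper()]


# Hypothetical filter that responds to the made-up motif "TGAC".
MOTIF_FILTER = [
    [0.0, 0.0, 0.0, 1.0],  # position 1 prefers T
    [0.0, 0.0, 1.0, 0.0],  # position 2 prefers G
    [1.0, 0.0, 0.0, 0.0],  # position 3 prefers A
    [0.0, 1.0, 0.0, 0.0],  # position 4 prefers C
]


def scan(seq, filt=MOTIF_FILTER):
    """Slide the filter along the sequence and return one score per window."""
    encoded = one_hot(seq)
    width = len(filt)
    scores = []
    for start in range(len(encoded) - width + 1):
        window = encoded[start:start + width]
        score = sum(
            weight * value
            for filt_row, seq_row in zip(filt, window)
            for weight, value in zip(filt_row, seq_row)
        )
        scores.append(score)
    return scores


# The output is a per-position profile: high where the motif occurs.
profile = scan("AATGACGT")
```

Here `profile` holds one score per window, peaking at the window that contains the motif. A real model stacks many such learned filters and further layers to predict measured biochemical signals.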
Neural networks are not new. They’ve been extremely successful in many domains. But they’re typically considered “black box models,” which provide very good predictions even though you never understand what they’re doing. In our scenario that would be pretty useless even if the predictions are accurate, because we need to understand how the model makes its predictions to give us insights into biology. For that reason, my lab really focuses on the interpretation of black box models.
AK: We’ve built some interesting tools over the past four or five years that allow us to open up the black boxes, really query them to reveal the patterns they learn. When we do that, we are able to see that the neural networks really are learning incredible biology. Then we can use them as hypothesis generation engines.
For example, we can look at a gene that is activated in neurons but not in immune cells. Why is that, when the genome is the same in both cell types? Neural networks are trained to map the DNA sequence, which is static, to histones and protein-DNA binding and epigenetic markers and all the extra biochemistry going on that varies across cell types and tissues. With enough data, you can start to identify patterns and predict biological functions.
AK: It is. It’s terabytes of data from thousands of cell types and tissues and hundreds of biochemical markers across three billion positions of the genome. But that’s actually perfect for training these neural networks, because they’re data hungry. They perform extremely well and map very accurately when they’re fed large amounts of data. It’s always been a question of how to interpret the models, which is why we’re tackling that issue.
AK: I’ll give you an example of what we can do with genome editing experiments. If we’re given a regulatory switch in the genome, we can make a prediction regarding what causes the switch to be active in a particular cell type. Maybe it’s the binding site of a protein—we can design CRISPR experiments to target that site in particular to validate the prediction in the lab. As we acquire more data, the predictions improve.
Bringing that to the clinic, we know that rare diseases are usually caused by variation or mutation in protein-coding genes, but common diseases involve large numbers of non-coding variants. We can take any variant in the genome and use the neural network like an oracle, allowing it to predict the molecular activity of a sequence and how it’s affected by the variant. Over time the neural network becomes a very powerful in-silico engine, and you can run billions of experiments on a computer and filter through massive sets of biological hypotheses on protein binding sites and their effects throughout the genome.
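The "oracle" idea above can be sketched as in-silico mutagenesis: score the reference sequence, score the sequence carrying the variant, and take the difference. In the minimal sketch below, `predict_activity` is a hypothetical stand-in for a trained neural network (it only counts matches to a made-up motif), and the function names and test sequence are assumptions for illustration.

```python
# Toy stand-in for a trained model. A real predictor would be a neural network
# trained on experimental data; this scoring rule is assumed for illustration.
MOTIF = "TGAC"  # made-up binding motif


def predict_activity(seq):
    """Return the best motif match count over all windows (hypothetical score)."""
    best = 0
    for i in range(len(seq) - len(MOTIF) + 1):
        window = seq[i:i + len(MOTIF)]
        matches = sum(1 for a, b in zip(window, MOTIF) if a == b)
        best = max(best, matches)
    return best


def variant_effect(ref_seq, pos, alt_base):
    """Predicted change in activity when the base at `pos` becomes `alt_base`."""
    alt_seq = ref_seq[:pos] + alt_base + ref_seq[pos + 1:]
    return predict_activity(alt_seq) - predict_activity(ref_seq)


def saturation_mutagenesis(ref_seq):
    """Score every possible single-base substitution in the sequence."""
    effects = {}
    for pos, ref_base in enumerate(ref_seq):
        for alt_base in "ACGT":
            if alt_base != ref_base:
                effects[(pos, alt_base)] = variant_effect(ref_seq, pos, alt_base)
    return effects


effects = saturation_mutagenesis("AATGACGT")
# Substitutions with a nonzero effect point to positions the model relies on.
disruptive = sorted(key for key, delta in effects.items() if delta < 0)
```

In this toy case the disruptive substitutions all fall inside the motif, which is the kind of candidate site a follow-up CRISPR experiment could target.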
The models can also be mixed and matched, and you can train them end to end and build more and more complex systems trained on very specific biological entities. In the end, you could eventually build a systems-level model of the entire cell and predict if and how activity or changes in one area will affect all the other areas.
AK: That’s something we’re working on, because it’s one of the big issues in biology in general. The biology in cell lines and mouse models is similar to what goes on in actual patient systems, but it’s not the same. To address it, there has been interesting work in what’s called domain adaptation. A real-life example of this is training self-driving cars, which involves certain scenarios but not every possible one. How does a car trained in summer manage to maintain function when it hits a heavy snow squall? There are new methods that allow models to generalize to scenarios that they have not encountered before. In biology we’re working on adapting these kinds of models to show what really translates and what doesn’t. Once you separate out the parts that don’t, it can act as a useful filter.
AK: There’s been an explosion of genomic data with billions of data points out there, many of which are just sitting in relatively primitive databases. There’s no intelligent way to search the data or generate recommendations for queries. Our goal is to develop a machine learning back end for the data and match it with a very interactive front end. Think about Amazon and Google: they use recommendation engines based on your buying and browsing history that allow you to discover things you might never have seen.
And if you build a good interface, you can provide ways to access humongous amounts of data through the lens of models without actually getting the data. The back end accesses gigantic databases and pulls out information or provides suggestions, like Amazon does, to help investigators test hypotheses or lead them to what they’re most interested in. You can hide the data itself and still provide users with a powerful discovery tool that incorporates knowledge from data sets they would otherwise never have known about.
AK: Yes, that’s what we want to build. It will require powerful back end machinery, and it’s going to take a long time to build—maybe a decade—but I think it will have a significant impact on genomics and medicine. It’s the next big leap in discovery science.