Tech Corner February 01, 2021

Machine learning: How do you teach machines?

Machine learning platforms can be fed data to "teach" them to learn patterns and predict outcomes, and they are becoming more important all the time in biomedical research. But what are they? How do they work? And can we peer into the black box of their inner functions to make them more powerful for prediction and discovery?


I wrote about two papers around the new year that presented new computational research tools using convolutional neural networks, or CNNs. These CNNs were trained to recognize useful patterns in reference data sets and validated in real-world applications. While both were applied to cancer, the data inputs (cancer cell imagery and RNA-sequencing data) were vastly different. And, in a way, they represent a kind of shot across the bow. They have made it very obvious that machine learning (ML) is already important in my workplace, JAX, and it is sure to become more so in the months and years ahead.

But what are these neural networks and other ML tools? How, exactly, do they help researchers now? And where are they headed? These are not easy questions to answer, but I will attempt to do so. And I will follow the field and provide updates about significant developments.

First, a clarification. ML is often lumped with artificial intelligence (AI), so much so that AI/ML has become a standard abbreviation. In ML, machines are trained on data of some form or other, and they “learn” to predict results from it without being explicitly programmed with the rules. ML is a subset of AI, the broader effort to create intelligent systems that simulate human intelligence. AI is being implemented, sometimes dramatically, as in AlphaGo’s success in defeating human masters of the game Go, but biomedical researchers, at least for now, are mostly applying ML methods to extract knowledge and insight from the vast amounts of data being generated in laboratories around the world.
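To make "learning without being explicitly programmed" a little more concrete, here is a minimal sketch in Python. It is a toy, not anything from the papers mentioned above: the data points, the function names and the choice of a perceptron (one of the oldest learning rules) are all assumptions made for illustration. The labels follow a hidden rule, roughly "1 when the two numbers add up to more than 1," but that rule never appears in the code. The program recovers something like it from the labeled examples alone.

```python
def train_perceptron(examples, epochs=200, learning_rate=0.1):
    """Learn weights and a bias from (features, label) pairs."""
    weights = [0.0] * len(examples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for features, label in examples:
            activation = bias + sum(w * x for w, x in zip(weights, features))
            prediction = 1 if activation > 0 else 0
            error = label - prediction  # 0 when the guess was right
            # Parameters only move when the prediction was wrong.
            bias += learning_rate * error
            weights = [w + learning_rate * error * x
                       for w, x in zip(weights, features)]
    return weights, bias


def predict(weights, bias, features):
    """Apply the learned parameters to a new data point."""
    activation = bias + sum(w * x for w, x in zip(weights, features))
    return 1 if activation > 0 else 0


# Hypothetical training data: pairs of measurements with a 0/1 label.
training_data = [
    ((0.0, 0.0), 0), ((0.2, 0.3), 0), ((0.5, 0.1), 0), ((0.1, 0.6), 0),
    ((1.0, 1.0), 1), ((0.9, 0.7), 1), ((0.6, 0.9), 1), ((1.0, 0.4), 1),
]

weights, bias = train_perceptron(training_data)

# Points the model never saw during training.
print(predict(weights, bias, (0.95, 0.95)))
print(predict(weights, bias, (0.1, 0.1)))
```

Real ML systems use far more data and far more parameters, but the basic loop is the same: guess, compare with the known answer, adjust.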

The neural network black box

Artificial neural networks are loosely modeled on biology, in which a single neuron in the brain is typically connected to many others. Together, the neurons and their connections can form extensive networks, and a given input yields a certain output, such as an action. Applying the concept to machines, a network of simple processing elements (artificial neurons, in a way) receives data input, processes it based on prior training with reference data, and provides output of a certain kind. Applications include pattern recognition (such as face identification), sequence recognition (such as speech recognition), control systems for self-driving vehicles, and game playing and decision making, as in AlphaGo.
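The idea of simple elements wired together can be sketched in a few lines of Python. Everything here is illustrative: the function names are made up, and the weights are hand-picked stand-ins for values that a real network would arrive at through training. Each artificial neuron just adds up its weighted inputs and "fires" (outputs 1) if the total is above zero. Two layers of these are enough to compute "one or the other, but not both," something no single neuron of this kind can do on its own.

```python
def step(value):
    """Fire (1) if the summed input is positive, otherwise stay silent (0)."""
    return 1 if value > 0 else 0


def neuron(inputs, weights, bias):
    """One artificial neuron: weighted sum of inputs, then a threshold."""
    return step(bias + sum(w * x for w, x in zip(weights, inputs)))


def forward(inputs, layers):
    """Pass a signal through the network, one layer at a time.

    `layers` is a list of layers; each layer is a list of
    (weights, bias) pairs, one pair per neuron.
    """
    signal = list(inputs)
    for layer in layers:
        signal = [neuron(signal, weights, bias) for weights, bias in layer]
    return signal


# Hand-picked weights (an assumption, standing in for trained values).
TOY_NETWORK = [
    # Hidden layer: first neuron fires if either input is on,
    # second neuron fires only if both are on.
    [([1.0, 1.0], -0.5), ([1.0, 1.0], -1.5)],
    # Output neuron: fires if "either" fired and "both" did not.
    [([1.0, -1.0], -0.5)],
]

for pair in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(pair, forward(pair, TOY_NETWORK))
```

Networks used in research have thousands to billions of such weights, set by training rather than by hand.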

Industries that deal with large amounts of complex data, such as finance, medicine and computer gaming, are applying an ever-increasing amount of AI technology in their work. It may therefore be a surprise to learn that no one knows exactly how most artificial neural networks arrive at their results. The path between data input and results output is called a “black box”: the output may be accurate and highly valuable, but it’s unknown how it was produced. In that context, it’s understandable why AI’s increasing presence in society makes many people wary, with much thought going into how to apply it without magnifying potential harms.

Shining a light on genomic function

One of my first significant encounters with the importance of ML and neural networks in biomedical research was in April 2019, when I was fortunate enough to interview Anshul Kundaje at the Human Genome Organisation (HUGO) meeting in Seoul, Korea. Kundaje, an assistant professor of genetics and computer science at Stanford University, is quite young and still a junior faculty member, but he has already made impressive contributions to genomics research. In fact, he was in Korea to receive HUGO’s Chen Award of Excellence in recognition of his work. And over the course of 45 minutes, he explained to me how he’s using computers to translate biological data into predictions of biological function.

Genomics research is awash in complex data. The human genome sequence itself contains about three billion base pairs. But there are a lot more data points that influence how the sequence relates to function. A single cell can yield RNA sequences, protein signatures, metabolic measurements, and much more. Associated with the DNA itself are epigenetic marks (chemical compounds on DNA bases that help regulate gene expression) and proteins bound to the DNA at specific locations. It all adds up to terabytes of data from thousands of cell types and hundreds of biochemical markers across three billion positions of the genome. It’s impossible for humans to synthesize and analyze all the data, but according to Kundaje, it’s perfect for training neural networks to help predict biological function from the data patterns.

In addition, Kundaje is seeking to interpret the black box neural network models, to shine a light into the black box, as it were. For his purposes, a model that predicts well but cannot be explained is not enough. “In our scenario that would be pretty useless even if the predictions are accurate,” he noted, “because we need to understand how the model makes its predictions to give us insights into biology.” With that knowledge in hand, his audacious goal is to eventually build a systems-level model of the entire cell, so that it’s possible to predict how changes in one area affect everything else. It will take a lot of computing power and time, but it will represent a big step forward for genomics and biology in general.

The future of machine learning research

Traditionally, researchers have distinguished between in vitro (outside of the body, as in a test tube or in tissue culture) and in vivo (inside the body) experimental systems. A relatively new term is in silico, which means running experiments on a computer. With the application of machine learning and predictive model development, in silico capabilities and importance will grow rapidly over the next decade. But can we learn how to model biology to the point that we can change one thing (a base pair, an epigenetic mark) and accurately predict the system-wide result? The implications are enormous for everything from increasing our basic knowledge of biology to profoundly improving the practice of medicine, and as the findings emerge, it’s important to understand what they might mean, or not, for progress.
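The "change one thing and see what the model predicts" idea is often called in silico mutagenesis, and its skeleton is simple enough to sketch in Python. To be clear about what is assumed: the scoring function below is a deliberately crude stand-in for a trained model (it only looks for the best match to the four-letter motif TATA), and the sequence and all names are hypothetical. The point is the procedure, which swaps each base for every alternative, asks the model again, and records how much the prediction changes.

```python
BASES = "ACGT"


def toy_model(sequence, motif="TATA"):
    """Stand-in for a trained model: best motif match anywhere in the sequence."""
    window_scores = [
        sum(a == b for a, b in zip(sequence[i:i + len(motif)], motif))
        for i in range(len(sequence) - len(motif) + 1)
    ]
    return max(window_scores)


def in_silico_mutagenesis(sequence, model):
    """Score every single-base change relative to the unmodified sequence."""
    reference_score = model(sequence)
    effects = {}
    for position, original_base in enumerate(sequence):
        for new_base in BASES:
            if new_base == original_base:
                continue
            mutated = sequence[:position] + new_base + sequence[position + 1:]
            effects[(position, original_base, new_base)] = (
                model(mutated) - reference_score
            )
    return effects


sequence = "GGTATAGG"  # hypothetical sequence with the motif in the middle
effects = in_silico_mutagenesis(sequence, toy_model)

# Positions where at least one change alters the model's prediction.
sensitive_positions = sorted(
    {position for (position, _, _), change in effects.items() if change != 0}
)
print(sensitive_positions)
```

With a real trained network in place of `toy_model`, the same loop points to the bases the model considers important, which is one way of peering into the black box.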

Mark Wanner followed graduate work in microbiology with more than 25 years of experience in book publishing and scientific writing. His work at The Jackson Laboratory focuses on making complex genetic, genomic and technical information accessible to a variety of audiences. Follow Mark on Twitter at @markgenome.