Using machine learning to find metastatic cancer's origin
By Mark Wanner
To help trace metastatic cancers whose site of origin cannot be determined, a JAX team developed CUP-AI-Dx, a machine learning tool that analyzes RNA sequencing data. In a paper recognized by EBioMedicine as one of the best of 2020, the researchers show that CUP-AI-Dx achieves high accuracy on real-world data sets and provides an important clinical tool to help guide therapies for patients with cancers of unknown primary (CUP).
Based on their molecular attributes, most metastatic cancers can be traced back to their site of origin, e.g., breast, colorectal, skin, etc. It’s an important piece of information that helps guide therapeutic strategy. Unfortunately, up to five percent of the time, the site of origin cannot be determined, making these “cancers of unknown primary” (CUP) even more difficult to treat. Sadly, patients with CUP have a one-year survival rate of only 25 percent, making improved diagnostic methods essential for better patient prognoses.
To help address the problem, a team co-led by Jackson Laboratory researchers developed a machine learning framework to predict the primary site and molecular subtype of cancer samples. The tool, called CUP-AI-Dx, analyzes RNA sequencing data, taking the expression of 817 genes as input. CUP-AI-Dx uses a 1D Inception convolutional neural network model to infer a metastatic cancer’s primary tissue of origin. It also allows for robust identification of a tumor’s molecular subtype, further enhancing clinical insight.
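To give a feel for what "1D Inception" means, here is a minimal sketch in plain Python. It is not the published CUP-AI-Dx architecture. The kernel widths (1, 3 and 5), the random weights, and the helper names `conv1d_same` and `inception_block_1d` are assumptions made purely for illustration. Only the input length of 817 genes comes from the article. The idea shown is that several convolutions of different widths scan the same expression profile in parallel, and their outputs are stacked as separate channels.

```python
import random


def conv1d_same(signal, kernel):
    """Slide a kernel along a 1D signal, zero-padding so the length is kept."""
    k = len(kernel)
    left = (k - 1) // 2
    right = k - 1 - left
    padded = [0.0] * left + list(signal) + [0.0] * right
    return [
        sum(padded[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal))
    ]


def relu(values):
    """Replace negative activations with zero."""
    return [max(0.0, v) for v in values]


def inception_block_1d(signal, kernels):
    """Apply parallel convolutions of different widths to the same input.

    Returns one output channel per kernel; stacking the channels in a list
    stands in for the channel-wise concatenation of an Inception module.
    """
    return [relu(conv1d_same(signal, kernel)) for kernel in kernels]


N_GENES = 817  # number of input genes reported in the article

rng = random.Random(0)
# Placeholder expression profile; real input would be measured RNA-seq values.
expression = [rng.random() for _ in range(N_GENES)]

# Hypothetical kernel widths with random weights (a trained model learns these).
kernels = [[rng.uniform(-1, 1) for _ in range(width)] for width in (1, 3, 5)]

features = inception_block_1d(expression, kernels)
print(len(features), len(features[0]))  # 3 channels, each 817 values long
```

In a full network, blocks like this would be followed by further layers and a final classification layer with one output per cancer type; those parts are omitted here.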
As presented in “CUP-AI-Dx: A tool for inferring cancer tissue of origin and molecular subtype using RNA gene-expression data and artificial intelligence,” a paper published in EBioMedicine, the research team trained the model on the transcriptional profiles of more than 18,000 primary tumors representing 32 cancer types from The Cancer Genome Atlas (TCGA). Once optimized, CUP-AI-Dx was tested on a dataset of nearly 400 metastatic samples and correctly identified the tissue of origin 96.7 percent of the time. When applied to clinical-grade RNA-seq datasets generated at two institutes, one in the U.S. and one in Australia, the model predicted the primary site as its top option with accuracies of 87 percent and 72.5 percent, respectively.
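The phrase "predicted the primary site as the top option" refers to what is usually called top-1 accuracy: a prediction counts as correct only when the highest-scoring tissue matches the true one. The sketch below shows how that figure, and the looser top-k variant, could be computed. The function name `top_k_accuracy` and the three toy samples are hypothetical and are not taken from the paper or its data.

```python
def top_k_accuracy(scores, true_labels, k=1):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    hits = 0
    for sample_scores, truth in zip(scores, true_labels):
        # Rank class names from highest to lowest predicted score.
        ranked = sorted(sample_scores, key=sample_scores.get, reverse=True)
        if truth in ranked[:k]:
            hits += 1
    return hits / len(true_labels)


# Hypothetical per-class scores for three metastatic samples.
scores = [
    {"breast": 0.7, "colorectal": 0.2, "skin": 0.1},
    {"breast": 0.4, "colorectal": 0.5, "skin": 0.1},
    {"breast": 0.5, "colorectal": 0.3, "skin": 0.2},
]
true_labels = ["breast", "colorectal", "colorectal"]

top1 = top_k_accuracy(scores, true_labels, k=1)  # third sample is a miss
top2 = top_k_accuracy(scores, true_labels, k=2)  # third sample now counts as a hit
print(f"top-1: {top1:.3f}, top-2: {top2:.3f}")
```

Reporting the top option is the strictest version of this measure, which is why it is the natural headline number for a tool meant to point clinicians to a single primary site.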
CUP-AI-Dx provides an important clinical tool to help guide therapies for patients who might otherwise be limited to generalized treatments. In fact, EBioMedicine named the paper one of its top non-COVID-19-related papers of 2020, and the only one from the field of oncology. The model and results are available for non-commercial use at https://github.com/TheJacksonLaboratory/CUP-AI-Dx.