CellVoyager: AI CompBio Agent Generates New Insights by Autonomously Analyzing Biological Data
Abstract
Modern biology increasingly relies on complex, high-dimensional datasets such as those produced by single-cell RNA sequencing (scRNA-seq). However, the richness of such data means that conventional analyses may only scratch the surface. Extracting meaningful insights from these datasets often requires advanced computational methods and domain expertise. Current AI agents for biology focus primarily on executing user-specified commands and are therefore limited by the user’s creativity and familiarity with which kinds of analyses are useful. Furthermore, these agents do not account for prior analyses already attempted by researchers, reducing their ability to build upon existing work. To address these limitations, we introduce CellVoyager, an AI agent that autonomously explores scRNA-seq datasets in novel directions conditioned on prior user-run analyses. Built on large language models, CellVoyager ingests both the dataset and a record of prior analyses to generate and test new hypotheses within a Jupyter notebook environment. We evaluate CellVoyager on CellBench, a new benchmark based on 50 published scRNA-seq studies encompassing 483 analyses. Given only the background sections of these papers, CellVoyager outperformed GPT-4o and o3-mini by up to 20% in predicting which analyses the authors eventually conducted. We then carried out three in-depth case studies in which CellVoyager was given previously published papers with their scRNA-seq datasets and conducted analyses to generate new findings. The original authors of each study evaluated these findings and consistently rated them as creative and sound; 80% of the agent’s hypotheses were deemed scientifically interesting. For example, in one case study, the agent found that CD8+ T cells in COVID-19 infection are more primed for pyroptosis, a possibility the original researchers had not explored.
CellVoyager also reanalyzed a brain aging dataset and discovered a previously unreported association between increased transcriptional noise and aging in the subventricular zone of the brain. These results demonstrate that CellVoyager can act both autonomously and collaboratively to accelerate hypothesis generation and computational biology. They also highlight the potential of agents like CellVoyager to unlock new biological insights by reanalyzing the vast body of existing biological data at scale.