Imagine accelerating the discovery of new therapeutics through the development of AI models for mining drug-cell interactions at unprecedented resolution. The new release from Tahoe Therapeutics (formerly Vevo) may have redefined the race to map the human cellular landscape in cancer.
In an unusual move, Tahoe Therapeutics has released “Tahoe 100M”, a massive open-source dataset encompassing 100 million single-cell data points and 60,000 experiments, mapping 1,100 drug treatments across 50 cancer types. Tahoe 100M brings a 50-fold increase in publicly available perturbational single-cell data, positioning itself within the world’s largest single-cell repository.
Tahoe 100M includes what researchers call “single-cell transcriptomics profiles”, i.e., a comprehensive record of gene expression for each individual cell. These “profiles” provide a snapshot of each cell and how it responds to drug perturbations, portraying a more accurate mosaic of tumor cell interactions. Researchers can use this mosaic to understand the behavior of individual cells and to define how cancer heterogeneity affects the development of effective treatments.
Dr. Johnny Yu, co-founder and technology platform developer at Tahoe, describes the company’s unique “Mosaic Platform”, used to generate the dataset, as “a technology that creates a ‘mosaic tumor’ that allows testing drugs across multiple cancer types simultaneously and at high throughput”. The “Mosaic Platform”, combined with single-cell resolution, yields “approximately 20,000 measurements across all protein-coding genes per assay”, he continues, “offering a unique level of cellular granularity”. This approach gives the dataset immediate practical value, making it a precious resource for AI modeling.
Tahoe Therapeutics and the Arc Institute have recently partnered in the launch of the Arc Virtual Cell Atlas: the most comprehensive and diverse public database of single-cell transcriptomic data across a wide range of perturbations. These data can be obtained for free and used for further analysis and AI modeling. In the last month alone, the dataset has been downloaded almost 11,000 times on Hugging Face, a data-sharing platform. Dr. Hani Goodarzi, Tahoe’s scientific co-founder, Core Investigator at the Arc Institute and UCSF Professor, puts the dataset into context: “Tahoe’s ‘Mosaic Platform’ helped minimize ‘batch effects’, which can make single cell data difficult to compare, offering a more consistent and reliable resource for modeling”.
While recent advances in AI, such as the AlphaFold 3 model, have fundamentally unlocked the ability to predict protein structures and drug interactions, understanding the complexity of patient biology remains a critical challenge. At this intersection, the potential impact of single-cell perturbation datasets on drug discovery can be profound. “Tahoe 100M enables the building of comprehensive models that can predict drug interactions across diverse patient populations,” states Dr. Nima Alidoust, co-founder and CEO at Tahoe.
To develop effective cancer treatments, we need to understand biological interactions beyond simple protein binding. Datasets such as Tahoe 100M account for patient complexity from the earliest stages of drug discovery and thus have the potential to unlock novel “AI-first” approaches to drug discovery.
Dr. Bo Wang, chief AI scientist for the University Health Network in Canada and among the leading experts in AI for biology and healthcare, believes that the release of this dataset is “a big deal for the field”. His lab developed the single-cell GPT model (scGPT), one of the first attempts to apply large language modeling to single-cell data. The model was trained on 33 million human cells from tissues such as heart, brain and blood, and allows accurate cell type classification in single-cell studies. He believes that “the Tahoe 100M dataset significantly extends our ability to train AI models to learn more nuanced, dosage-dependent cellular responses in perturbation studies across different cancer types, which help portray more generalizable AI models for drug development”. He is confident that such models will provide more accurate means for early patient stratification and for in silico screening of patient response for precise treatment selection.
The generous release of Tahoe 100M is a potential turning point for deciphering cancer vulnerabilities at scale and could build momentum for open-source data sharing in cancer research.
By providing unprecedented access to high-quality, large-scale single-cell data, Tahoe is promoting a more open, collaborative approach to scientific discovery. This is important as recent reports warn about thousands of 3D protein structures and other disease-relevant big datasets held within the vaults of private companies. The release of Tahoe 100M may represent a first step towards creating the “internet of biology”, laying the foundation for the development of truly transformative AI models to integrate and understand cellular biology and drug development at high speed.