Enhancing Chemical Toxicity Predictions with Synthetic SMILES from a Fine-Tuned LLM-Based Chemical Synthesis Generative Model

Yong Oh Lee
Do Yeon Kim

Published on Apr 15, 2025

This article on Sciety

Abstract

The adoption of transformer-based models has significantly advanced toxicity prediction, yet these models continue to struggle with the class imbalance inherent in benchmark datasets such as Tox21, ClinTox, HIV, and BBBP. This persistent challenge undermines their effectiveness, particularly for minority-class predictions, where data are scarce. Recent large language models (LLMs) have demonstrated remarkable capabilities in generating synthetic Simplified Molecular Input Line Entry System (SMILES) strings, providing a novel way to address this imbalance. In this study, we explore the potential of LLM-generated synthetic SMILES to enhance training datasets, focusing on the augmentation of minority classes. Our comprehensive experiments on multiple benchmark datasets show that this strategy not only mitigates class imbalance but also substantially improves minority-class prediction accuracy without compromising overall model performance. For instance, on the Tox21 dataset, we observed an increase in minority-class prediction accuracy from 0.707 to 0.965. Similar improvements on the other datasets further validate the efficacy of synthetic SMILES augmentation in enhancing both toxicity prediction and broader chemical property assessments.
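
The minority-class augmentation described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function name `augment_minority`, the 0/1 label convention, the balancing target (matching the majority-class count), and the example molecules and labels are all assumptions made for illustration. Duplicate filtering here is plain string comparison; a real workflow would likely also validate and canonicalize each generated SMILES with a cheminformatics toolkit before adding it.

```python
from collections import Counter


def augment_minority(records, synthetic_smiles, minority_label=1):
    """Append synthetic SMILES as minority-class samples until classes balance.

    records: list of (smiles, label) pairs from the original training set.
    synthetic_smiles: candidate strings, e.g. produced by a generative model.
    Hypothetical helper for illustration only.
    """
    counts = Counter(label for _, label in records)
    if not counts:
        return list(records)

    # Number of minority samples needed to match the largest class.
    deficit = max(counts.values()) - counts.get(minority_label, 0)

    seen = {smiles for smiles, _ in records}
    augmented = list(records)
    for smiles in synthetic_smiles:
        if deficit <= 0:
            break
        # Skip empty strings and exact duplicates (string match only).
        if not smiles or smiles in seen:
            continue
        seen.add(smiles)
        augmented.append((smiles, minority_label))
        deficit -= 1
    return augmented


# Toy data: labels are illustrative, not real toxicity annotations.
train = [("CCO", 0), ("CCC", 0), ("CCN", 0), ("c1ccccc1", 0), ("C(=O)Cl", 1)]
candidates = ["C(=O)Cl", "ClCCl", "", "BrCCBr", "ClCCl", "O=C(Cl)Cl", "ClC=C"]

balanced = augment_minority(train, candidates)
print(Counter(label for _, label in balanced))
```

Capping the additions at the majority-class count is one simple design choice; the appropriate amount of synthetic data is an empirical question that this sketch does not address.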
