PLM-OMG: Protein Language Model-Based Ortholog Detection for Cross-Species Cell Type Mapping
Abstract
Understanding conserved and divergent cell types across plant species is essential for advancing comparative genomics and improving crop traits. Accurate and scalable ortholog detection is central to this goal, particularly in cross-species single-cell analysis. However, conventional methods are time-consuming and perform poorly on distantly related species, which limits their effectiveness. To address these limitations, we introduce PLM-OMG, a protein language model-based framework for orthogroup classification and cross-species cell type mapping. We benchmark five deep learning models (ESM2, ProGen2, ProteinBERT, ProtGPT2, and an LSTM) using a curated 15-species dataset and large-scale monocot and dicot datasets from PLAZA. Transformer-based models, particularly ProtGPT2 and ESM2, achieve superior accuracy and generalization across evolutionary distances. Our results show that PLM-OMG enables scalable and reusable orthogroup detection without recomputing existing groups, significantly reducing computational overhead and highlighting its potential to transform cross-species transcriptomic analysis in plant genomics.