Machine learning can identify orphans that have diverged into the “twilight zone” of sequence similarity

Emilios Tassios
Jori de Leuw
Christoforos Nikolaou
Anne Kupczok
Nikolaos Vakirlis

Published on Jun 8, 2025

Abstract

Species-specific orphan genes (orphans) lack homologues outside of a given taxon and frequently underlie unique species traits. It is thus important to elucidate the evolutionary origins of orphans. Orphan genes can result from sequence divergence beyond recognition, when homologous proteins diverge to an extent at which tools that rely on sequence similarity to establish homology can no longer identify them as homologues. Orphans can also result from other processes, including de novo gene emergence from previously noncoding sequences, in which a homologous protein-coding gene truly does not exist.

Here we propose that orphans resulting from divergence might be recognizable from their patterns of statistically non-significant similarity hits, which are almost always discarded. To test this, we simulated diverged orphan protein sequences based on conserved proteins from the Unified Human Gastrointestinal Protein catalogue (UHGP) and used reversed protein sequences as negative data sets. We trained four machine learning classifiers on features extracted from the output of the similarity search tool DIAMOND, such as total query coverage or maximum bit score. We tested the influence of evolutionary parameters such as simulation tree branch length, indel rate and among-site rate heterogeneity.
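As an illustration of the kind of per-query features described above, the following minimal sketch parses DIAMOND tabular output and summarizes the hits of each query. It is not the authors' pipeline. It assumes the default 12-column BLAST-style tabular layout, a search run with a permissive E-value cutoff so that non-significant hits are retained, and query lengths supplied separately (the default columns do not include them). The function names and feature keys are hypothetical.

```python
from collections import defaultdict

# Default 12-column BLAST-style tabular layout (assumed for the DIAMOND output).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def reverse_sequence(seq):
    """Negative control: return the reversed protein sequence."""
    return seq[::-1]


def covered_positions(intervals):
    """Count query positions covered by at least one hit (1-based, inclusive)."""
    total, current_end = 0, 0
    for start, end in sorted(intervals):
        start = max(start, current_end + 1)  # skip positions already counted
        if end >= start:
            total += end - start + 1
            current_end = end
    return total


def extract_features(tabular_lines, query_lengths):
    """Summarize all hits of each query into a small feature dictionary.

    query_lengths maps query id -> protein length (hypothetical input).
    Queries without any hit do not appear in the result.
    """
    hits = defaultdict(list)
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        row = dict(zip(COLUMNS, line.split("\t")))
        hits[row["qseqid"]].append(row)

    features = {}
    for query, rows in hits.items():
        intervals = [tuple(sorted((int(r["qstart"]), int(r["qend"]))))
                     for r in rows]
        features[query] = {
            "n_hits": len(rows),
            "min_evalue": min(float(r["evalue"]) for r in rows),
            "max_bitscore": max(float(r["bitscore"]) for r in rows),
            "query_coverage": covered_positions(intervals) / query_lengths[query],
        }
    return features
```

The resulting dictionaries could be turned into a feature matrix for any standard classifier, with simulated diverged sequences as positives and reversed sequences (see `reverse_sequence`) as negatives. Total query coverage is computed here as the fraction of query positions covered by at least one hit, which is one plausible definition and may differ from the one used in the study.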

We found that the performance of the models depends on the simulation parameters: when the underlying simulated divergence was moderate, accuracy reached ∼90%, but when extremely diverged scenarios were simulated, accuracy dropped to ∼70%. The most important features for the classification were the number of alignments (hits) and the minimum hit E-value. When applying our classifier to a set of ∼170,000 eligible simulated orphans from the UHGP dataset, we found that ∼30% of them are predicted to be divergent, and these are shorter and more disordered than the rest. Our classifiers and pre-processing Python scripts are available at https://github.com/emiliostassios/Classification-of-divergent-genes-using-ML and can be readily used as a computationally fast means to obtain a candidate set of diverged orphans from any similarity search output. Our work therefore makes it possible to study such orphans across the tree of life and, in doing so, to recover cases of remote homology, obtain better estimates of the evolutionary age of protein families, detect cases of rapid divergence and, more generally, better understand how genetic novelty arises.
