Benchmarking Phenotypic Clustering Algorithms via Empirically Calibrated Simulations: A Diagnostic Framework to Improve Biodiversity Assessment in Neglected Crops
Abstract
Clustering algorithms are widely used for phenotypic characterization and germplasm management, particularly in neglected and underutilized species (NUS) that lack genomic resources. However, their performance under biologically realistic conditions remains poorly understood. Standard clustering methods commonly applied in crop research often assume distinct, isotropic, and homogeneous clusters—assumptions rarely satisfied in real-world NUS datasets.
We developed a biologically informed simulation framework, empirically calibrated with phenotypic data from West African fonio (Digitaria exilis), to benchmark the performance of eleven clustering algorithms under both idealized and realistic scenarios. Our simulations integrated heterogeneous trait distributions (normal, gamma), strong inter-trait correlations (as strong as r = –0.84), heteroscedasticity, and moderate population structure (Pst ≈ 0.15), as observed in fonio landraces. Each scenario was replicated 100 times, with clustering accuracy evaluated using the Adjusted Rand Index (ARI), Normalized Mutual Information (NMI), the Silhouette coefficient, and the Davies–Bouldin index.
The results revealed consistently poor algorithm performance under realistic conditions (e.g., ARI < 0.07), including for widely used methods in NUS research such as K-means, GMM, and PAM. Performance markedly improved under idealized conditions, validating our simulation framework.
These findings highlight the risk of overinterpreting clustering outputs from weakly structured phenotypic datasets and expose key limitations in current biodiversity analysis practices, particularly those guiding plant genetic resource conservation programs. We provide an open-source R-based diagnostic tool, available on Zenodo (https://doi.org/10.5281/zenodo.15877863), to assist practitioners in selecting robust clustering approaches for germplasm management and pre-breeding in data-scarce crops.
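For readers unfamiliar with the headline metric, the following is a minimal sketch of how the Adjusted Rand Index can be computed from two label vectors, using the standard Hubert–Arabie pair-counting formula. It is written in Python for illustration only and is not taken from the R tool described above; the function name `adjusted_rand_index` and the handling of degenerate inputs are assumptions of this sketch.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index between two partitions of the same items.

    Illustrative sketch only (hypothetical helper, not the authors' code).
    """
    if len(labels_true) != len(labels_pred):
        raise ValueError("label sequences must have equal length")
    n = len(labels_true)
    if n < 2:
        # Assumption of this sketch: with fewer than two items there are
        # no pairs to compare, so the partitions are treated as identical.
        return 1.0

    # Contingency counts: number of items sharing a (true, predicted) pair.
    cell_counts = Counter(zip(labels_true, labels_pred))
    sum_cells = sum(comb(c, 2) for c in cell_counts.values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())

    # Expected index under random labelling with fixed cluster sizes.
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        # Degenerate case, e.g. both partitions place all items in one cluster.
        return 1.0
    return (sum_cells - expected) / (max_index - expected)
```

The index equals 1 when the two partitions agree up to a relabelling of clusters and is close to 0 when agreement is no better than chance, which is why values such as ARI < 0.07 indicate that recovered clusters carry almost no information about the simulated groups.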