2NPLGBM: A genomic model that merges the strengths of classical and machine learning methods in genomic prediction
Abstract
Background
Genomic prediction (GP) is a central component of modern plant breeding, enabling the early selection of superior genotypes based on genomic marker data. Classical GP models, such as genomic best linear unbiased prediction (GBLUP), operate within the data modeling culture and typically assume additive genetic effects. This assumption limits their performance in hybrid breeding, where dominance and epistatic effects play a role. In contrast, machine learning (ML) models from the algorithmic modeling culture can capture non-additive genetic effects but often lack biological grounding and interpretability. To bridge these paradigms, we propose 2NPLGBM, a hybrid genomic prediction approach that integrates quantitative genetics with ML. The method introduces a two-matrix (2NP) genotype representation that concatenates the additive (Z) and dominance (W) matrix representations and serves as input to a Light Gradient Boosting Machine (LGBM), enabling the simultaneous modeling of additive, dominance, and higher-order genetic interactions (AA, AD, DD).
Results
The 2NPLGBM model was evaluated using six years of hybrid maize trial data across four agronomic traits (grain yield, plant height, days to silking, and days to anthesis) under five cross-validation schemes: two simulating temporal generalization, Leave-One-Year-Out (LOYO) and Rolling Window (RW), and three simulating genetic generalization, Five-Fold and the tester-based schemes Tester CV0 and Tester CV00. Compared with GBLUP, 2NPLGBM improved predictive accuracy by 5% on average under the temporal schemes and by more than 15% under the tester-based schemes, particularly for the flowering traits (days to silking and days to anthesis). Moreover, it consistently improved selection efficiency, indicating that the model captures complex genetic signals relevant for ranking and hybrid selection. Feature interpretation using SHapley Additive exPlanations (SHAP) confirmed that non-additive interactions contributed substantially to prediction accuracy for highly heritable traits. It also revealed trait-specific architectures: additive effects dominated the flowering traits, whereas dominance effects contributed substantially to plant height and yield. Classical variance component analysis supported these findings, indicating high dominance contributions of 17.3% for yield and 8.2% for plant height.
Conclusion
2NPLGBM represents a biologically informed ML framework that bridges the classical quantitative genetics and algorithmic modeling cultures. By jointly modeling additive and non-additive effects, it enhances predictive accuracy, interpretability, and selection efficiency in hybrid breeding programs. Future work should explore multi-trait and multi-environment extensions, the integration of environmental covariates, and the inclusion of multi-omics data to further strengthen predictive power and biological interpretability.
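The two-matrix genotype representation described in the abstract can be illustrated with a minimal sketch. The abstract does not state the exact coding, so the choices here are assumptions: genotype calls are taken to be alternate-allele counts (0, 1, 2), the additive code Z is the count itself, and the dominance code W is a heterozygote indicator (1 for a call of 1, otherwise 0). The function name `build_2np` is hypothetical, and the authors may center or scale the matrices differently.

```python
def build_2np(genotypes):
    """Concatenate additive (Z) and dominance (W) codes for each individual.

    Assumed coding (not confirmed by the abstract):
      Z = alternate-allele count (0, 1, 2)
      W = 1 for heterozygotes, 0 for homozygotes
    Each output row has 2 * p columns for p markers.
    """
    features = []
    for row in genotypes:
        if any(g not in (0, 1, 2) for g in row):
            raise ValueError("genotype calls must be 0, 1, or 2")
        additive = list(row)                             # Z block
        dominance = [1 if g == 1 else 0 for g in row]    # W block
        features.append(additive + dominance)            # [Z | W]
    return features


# Two individuals genotyped at three markers (toy data for illustration only).
genotypes = [[0, 1, 2], [2, 1, 1]]
features = build_2np(genotypes)
print(features)
```

Under these assumptions, each row of `features` could serve as the input for a gradient-boosting regressor such as LightGBM. Because tree-based learners can split successively on additive and dominance columns, interactions between the two blocks do not have to be specified in advance.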