Bagging partial least squares for accurate and stable wheat protein content detection using near-infrared spectroscopy
Abstract
Near-infrared (NIR) spectroscopy combined with machine learning algorithms has been widely adopted for rapid assessment of grain quality attributes. However, conventional calibration models often suffer from overfitting and instability when applied to high-dimensional spectral data with limited sample sizes. In this study, we developed a bagging partial least squares (BA-PLS) algorithm for accurate and stable prediction of wheat protein content. A total of 394 wheat samples were collected, and their NIR spectra were acquired over the 950 to 1650 nm range. The BA-PLS algorithm generates multiple bootstrap subsamples, trains a PLS model on each subsample, and aggregates the resulting predictions by averaging, which reduces prediction variance while preserving the low-bias property of PLS. The performance of BA-PLS was compared with that of standard PLS, support vector regression (SVR), and extreme gradient boosting (XGBoost). BA-PLS achieved the best predictive performance, with a coefficient of determination for prediction (R²P) of 0.9600 and a root mean square error of prediction (RMSEP) of 0.3058%. Notably, while SVR and XGBoost exhibited severe overfitting, with training-to-test R² gaps exceeding 0.4045, BA-PLS generalized well, with an R² gap of only 0.0261. Furthermore, BA-PLS provided prediction uncertainty estimates through the standard deviation of the ensemble predictions. The proposed BA-PLS algorithm offers a practical and stable solution for rapid wheat protein quantification, with potential applicability to other cereal quality assessment tasks.