Three-segment dynamic threshold joint optimization strategy-based mRMR-PCA-LGBM model for origin identification of Cornus officinalis via mid-infrared spectroscopy
Abstract
The geographical origin of Chinese medicinal materials directly determines their efficacy and safety. To address the need for rapid traceability of Cornus officinalis, this study proposes a three-segment dynamic threshold joint optimization framework. Based on 658 samples of Cornus officinalis from 11 different origins, the framework uses the minimum redundancy maximum relevance (mRMR) algorithm to rank the 3448 spectral bands of the mid-infrared spectra, which are then divided into three segments: retention, dimensionality reduction, and deletion. Bayesian optimization jointly determines the two segment thresholds (34 key spectral bands retained and 345 bands deleted) together with the hyperparameters of the LightGBM model. The dimensionality reduction segment is compressed to 38 dimensions using principal component analysis (PCA), giving a final input of 72 features (34 retained bands plus 38 principal components) for the mRMR-PCA-LightGBM model. On the independent test set, the model achieves an accuracy of 90.9%, an F1-score of 0.91, a Cohen's kappa of 0.90, and a Matthews correlation coefficient of 0.90. The area under the receiver operating characteristic curve for the 11 origins is greater than 0.95. These results are markedly better than those of five control models. By capturing origin-specific information while eliminating irrelevant noise, the framework demonstrates that accurate and robust origin identification is achievable with a small number of spectral features, providing a practical and efficient technical pathway for the authentication and market supervision of Chinese medicinal materials.
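The three-segment split described above can be illustrated with a minimal Python sketch. It only reproduces the bookkeeping of the segments, not the mRMR ranking, the PCA, or the Bayesian optimization. The function name `split_three_segments` and the placeholder ranking are hypothetical, and the sketch assumes that the highest-ranked bands are retained and the lowest-ranked bands are deleted, which the abstract implies but does not state explicitly.

```python
def split_three_segments(ranked_bands, n_retain, n_delete):
    """Partition a ranked list of band indices into three segments.

    Assumption: ranked_bands is ordered from most to least important,
    so the head is retained as-is, the tail is deleted, and the middle
    is passed on to dimensionality reduction.
    """
    if n_retain < 0 or n_delete < 0 or n_retain + n_delete > len(ranked_bands):
        raise ValueError("thresholds must be non-negative and fit within the ranking")
    cut = len(ranked_bands) - n_delete
    retained = ranked_bands[:n_retain]    # kept as raw features
    reduced = ranked_bands[n_retain:cut]  # would be compressed (e.g. by PCA)
    deleted = ranked_bands[cut:]          # discarded
    return retained, reduced, deleted


# Placeholder ranking: the real order would come from an mRMR ranking.
ranked = list(range(3448))

# Thresholds and component count as reported in the abstract.
retained, reduced, deleted = split_three_segments(ranked, n_retain=34, n_delete=345)
n_components = 38
final_dim = len(retained) + n_components

print(len(retained), len(reduced), len(deleted), final_dim)
# prints: 34 3069 345 72
```

With the reported thresholds, 3069 bands fall into the dimensionality reduction segment, and the final input size is 34 + 38 = 72, matching the feature count given in the abstract.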
