Investigation of sample partitioning in quantitative near-infrared analysis of soil organic carbon based on parametric LS-SVR modeling
Abstract
Soil organic carbon (SOC) can be quantitatively determined by near-infrared (NIR) measurements with enhanced stability. NIR analysis requires dividing real samples into modeling and validation sets. Modeling robustness should be examined during the modeling process through an investigation of the calibration–prediction sample partitioning. We propose a framework for sample partitioning in which the ratio of the number of calibration samples to the number of prediction samples is tunable. We addressed this issue in the multivariate calibration for NIR analysis of SOC using the least squares support vector machine regression (LS-SVR) method with an interactive grid search over its two modeling parameters, γ and σ, where γ is the regularization parameter that directly influences the Lagrange multipliers in the kernel transformation, and σ² is the kernel width used to tune the degree of generalization. We created seven volunteer groups, one for each calibration–prediction partition ratio. The calibration and prediction samples were regenerated for each volunteer group. LS-SVR models were established, and the parameters were selected for stability and robustness based on the mean value and relative standard deviation of the modeling results. Among all partition ratios compared, the optimal volunteer group was the one with 65 calibration samples and 35 prediction samples. The optimized calibration model for this volunteer group was then evaluated on the independent validation samples. The optimal LS-SVR parameters (γ, σ) were (110, 7), and validation yielded a root mean square error of 0.302 and a correlation coefficient of 0.907.
This validation performance was satisfactory for the random validation samples because the optimal volunteer group was chosen for the calibration–prediction partition, which ensured modeling stability and robustness during model optimization.
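The procedure summarized in the abstract (LS-SVR fitting, a grid search over (γ, σ), and parameter selection from the mean and relative standard deviation over repeated calibration–prediction partitions) can be illustrated with a minimal Python sketch. It is not the authors' implementation. Several details are assumptions not stated in the abstract: the RBF kernel is written as exp(−‖x − z‖² / (2σ²)), the partitions are drawn as random permutations, the number of repeats is arbitrary, and the hypothetical `select_parameters` rule (minimizing mean RMSE plus its standard deviation) merely stands in for the unspecified stability criterion. All function names are illustrative.

```python
import numpy as np


def rbf_kernel(A, B, sigma):
    # Assumed convention: K(x, z) = exp(-||x - z||^2 / (2 * sigma^2)).
    d2 = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma ** 2))


def lssvr_fit(X, y, gamma, sigma):
    # Solve the standard LS-SVR linear system
    #   [0   1^T        ] [b    ]   [0]
    #   [1   K + I/gamma] [alpha] = [y]
    n = len(y)
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = rbf_kernel(X, X, sigma) + np.eye(n) / gamma
    sol = np.linalg.solve(A, np.concatenate(([0.0], y)))
    return sol[0], sol[1:]  # bias b, Lagrange multipliers alpha


def lssvr_predict(X_cal, b, alpha, sigma, X_new):
    return rbf_kernel(X_new, X_cal, sigma) @ alpha + b


def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def evaluate_grid(X, y, n_cal, gammas, sigmas, n_repeats=10, seed=0):
    # For one partition ratio (n_cal calibration samples, the rest used for
    # prediction), redraw the partition n_repeats times and record the mean
    # and relative standard deviation (RSD) of the prediction RMSE for
    # every (gamma, sigma) pair on the grid.
    rng = np.random.default_rng(seed)
    splits = [rng.permutation(len(y)) for _ in range(n_repeats)]
    results = {}
    for g in gammas:
        for s in sigmas:
            errs = []
            for perm in splits:
                cal, pred = perm[:n_cal], perm[n_cal:]
                b, alpha = lssvr_fit(X[cal], y[cal], g, s)
                y_hat = lssvr_predict(X[cal], b, alpha, s, X[pred])
                errs.append(rmse(y[pred], y_hat))
            errs = np.asarray(errs)
            results[(g, s)] = (errs.mean(), errs.std(ddof=1) / errs.mean())
    return results


def select_parameters(results):
    # Hypothetical rule (an assumption): minimize mean * (1 + RSD),
    # i.e. mean RMSE plus its standard deviation over the repeats.
    return min(results, key=lambda k: results[k][0] * (1.0 + results[k][1]))
```

With a modeling set of 100 samples, calling `evaluate_grid(X, y, 65, gammas, sigmas)` would correspond to the 65/35 partition reported above; repeating the call for other values of `n_cal` would allow the partition ratios to be compared on the same two statistics.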