Jia-Yue Hu,^a Zhuo-Kang Wang,^a Yu-Yu Wang,^d Yu-Hao Wu,^a Hai-Cheng Wei,*^c Jing Zhao,*^b Liu Yang,^a Yu-Zhe Tan,^a Zi-Long Deng,^a Zhi-Jie Xiang,^a Zi-Yi Wang^a and Xin-Tong Zhao^a
^a School of Electrical and Information Engineering, North Minzu University, No. 204 North Wenchang Street, Yinchuan, Ningxia 750021, China. E-mail: wei_hc@nun.edu.cn
^b School of Information Engineering, Ningxia University, Yinchuan 750021, China
^c School of Medical Technology, North Minzu University, Yinchuan 750021, Ningxia, China
^d Department of Lab Construction & Administration, North Minzu University, Yinchuan 750021, Ningxia, China
First published on 22nd May 2025
A rapid and non-destructive maturity evaluation model based on near-infrared spectroscopy (NIRS) is proposed for monitoring quality parameter changes during the ripening of fresh grapes and determining the optimal harvest period. First, the variation of physicochemical parameters of Cabernet Sauvignon grapes across twelve growth stages was characterized to provide reference values for prediction. SPA-LASSO was then used to select feature wavelengths from five preprocessed full-spectrum datasets, and partial least squares regression (PLSR) was employed to establish models predicting soluble solid content (SSC) and total acid (TA). The best-performing model for maturity prediction was selected on the basis of the experimental results. The results indicate that SSC increases and TA decreases from the fruit-enlargement stage to ripening; in late maturity, SSC decreases slightly and TA increases slightly. The SG + SPA-LASSO + PLSR model performed best for both SSC and TA: the SSC prediction model achieved coefficients of determination RC2 = 0.982 and RP2 = 0.983 with root mean square errors RMSEC = 1.010 and RMSEP = 0.978, and the TA prediction model achieved RC2 = 0.954, RP2 = 0.944, RMSEC = 2.347 and RMSEP = 2.618. Overall, SPA-LASSO proved effective for feature selection and enhanced model generalization, supporting spectroscopic screening for non-destructive grape maturity assessment.
Common methods for feature spectrum selection can be broadly categorized into three types according to their principles: statistical,13,14 machine learning,15 and informatics-based.16,17 For example, Tian et al. used the CARS algorithm to optimize an improved iPLS for selecting a subset of near-infrared spectral features; 14 effective feature wavelengths between 1160 nm and 1338 nm were selected, and a prediction model for crude protein content in brown rice was established based on near-infrared spectroscopy, with a validation-set correlation coefficient of 0.8876.18 Dharmawan et al. used PCA to compress the near-infrared spectral information of Arabica coffee, selected principal components according to their contribution rates, and used a multi-layer perceptron ANN model to classify Arabica coffee from different origins; the accuracies achieved in internal cross-validation, training and testing ranged from 90% to 100%, and the classification error did not exceed 10%.19 With the rise of deep learning, numerous deep-learning-based methods for feature spectral selection have emerged. Zhou et al. proposed a spectral feature selector based on convolutional neural networks to select features from the near-infrared hyperspectral data of over 140 000 wheat grains; 60 of 200 channels were selected as the feature subset and, combined with a convolutional neural network classifier with an attention mechanism, automatic non-destructive classification of single wheat grains was achieved with an accuracy of 90.2%.20 Kuo et al. modified a one-dimensional spectral network and constructed a 1D ResGC Net with an embedded residual global context, which can automatically identify near-infrared spectral feature bands and extract spectral feature information. However, deep learning models are difficult to deploy on portable handheld devices because of their large size and high hardware resource requirements. In addition, most deep learning models are "black boxes" whose internal data processing is difficult to explain, which limits the interpretability of the extracted features.21
Therefore, to avoid the damage caused by traditional destructive measurements of SSC and TA in fresh grapes, to accurately select characteristic wavelengths and reduce redundant spectral information, and to establish the best prediction model for fresh-grape maturity and quality, this study collected fresh grape samples at twelve growth stages and used near-infrared spectroscopy to capture the quality index parameters at different maturity stages. The SPA-LASSO algorithm was applied for spectral feature selection, and partial least squares regression was used to establish SSC and TA prediction models on the various feature spectral sets. By comparing the prediction results, the best model was selected for evaluating the maturity of fresh grapes.
From each batch of 300 berries, a group of 10 berries with the most similar physiological traits was selected. The skins and seeds were removed, and the juice was extracted. The juice was then centrifuged at 1500 rpm for 10 minutes, filtered through a 0.45 μm microporous membrane, and used for subsequent experiments.
For near-infrared transmission spectroscopy of the Cabernet Sauvignon grape juice samples, 5 mL of juice was transferred into a cuvette. The scanning wavelength range was set from 900 nm to 2500 nm, with a wavelength resolution of 3.2 nm and an integration time of 10 ms. Each sample was measured three times, and the average was calculated to obtain the final spectral data used for establishing prediction models. After completing the near-infrared transmission spectroscopy, each sample was immediately retrieved for the measurement of SSC and total acids.
The SSC content of the samples was measured using an SN-DR3205 digital refractometer (Shanghai Shangpu Instrument Equipment Co., Ltd, Shanghai, China), while the TA content was determined by the acid–base titration method23 (expressed as tartaric acid equivalents). SSC and TA contents for each period were measured three times, and the average values were calculated as the final reference content.
Four preprocessing methods, namely Savitzky–Golay smoothing (SG), multiplicative scatter correction (MSC), standard normal variate (SNV) and MSC + SG, were employed to process the spectra; together with the raw spectra, this gave five spectral datasets. The best preprocessing method was selected on the basis of the prediction results.
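The preprocessing step can be illustrated with a short sketch. The original processing was done in MATLAB; the Python below is only a minimal illustration of the four operations, assuming the spectra are stored as a NumPy array of shape (n_samples, n_wavelengths), and the window length and polynomial order are assumptions rather than the authors' settings.

```python
import numpy as np
from scipy.signal import savgol_filter

def sg_smooth(spectra, window=11, polyorder=2):
    """Savitzky-Golay smoothing applied along the wavelength axis."""
    return savgol_filter(spectra, window_length=window, polyorder=polyorder, axis=1)

def msc(spectra, reference=None):
    """Multiplicative scatter correction against the mean spectrum."""
    ref = spectra.mean(axis=0) if reference is None else reference
    corrected = np.empty_like(spectra, dtype=float)
    for i, s in enumerate(spectra):
        slope, intercept = np.polyfit(ref, s, 1)  # fit each spectrum to the reference
        corrected[i] = (s - intercept) / slope
    return corrected

def snv(spectra):
    """Standard normal variate: centre and scale each spectrum individually."""
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, keepdims=True)
    return (spectra - mean) / std

def msc_sg(spectra):
    """MSC + SG: scatter correction followed by smoothing."""
    return sg_smooth(msc(spectra))
```

MSC and SNV both remove scatter-induced offsets from individual spectra, which is consistent with the later observation that these two corrections yield very similar feature wavelength sets.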
SPA is a forward iterative feature selection method commonly used for spectral data.26 It gradually reduces the dimensionality of the feature space by iteratively selecting the most relevant subset of features while retaining essential information. The main steps of the algorithm are as follows:
(1) Initialization: select an initial wavelength as the starting wavelength, add it to the set of selected features, and set the maximum number of variables.
(2) Stepwise selection: calculate the correlation between each unselected wavelength point and all wavelength points in the selected feature set sequentially. Retain the wavelength with the highest correlation and add it to the selected feature set.
(3) Termination and output: repeat step (2) until the number of wavelength points in the selected feature set reaches the predetermined maximum number of variables. Output all wavelength points in the feature set as the final set of selected feature wavelengths.
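In its classical formulation, the selection in step (2) is carried out by projection: each remaining wavelength column is projected onto the subspace orthogonal to the columns already selected, and the candidate with the largest remaining norm (i.e., the least collinear one) is kept. A minimal Python sketch under that formulation is given below; the starting index and subset size mirror the settings reported later in the text, but this is not the authors' MATLAB implementation.

```python
import numpy as np

def spa_select(X, start_index, n_select):
    """Successive projections algorithm: pick n_select wavelength columns,
    starting from start_index, that are minimally collinear with each other."""
    X = np.asarray(X, dtype=float)
    n_wav = X.shape[1]
    selected = [start_index]
    candidates = set(range(n_wav)) - {start_index}
    P = X.copy()  # working copy; selected directions are projected out in turn
    for _ in range(n_select - 1):
        ref = P[:, selected[-1]]
        denom = ref @ ref
        for j in candidates:
            # Remove the component along the most recently selected column
            P[:, j] = P[:, j] - (P[:, j] @ ref) / denom * ref
        # Keep the candidate with the largest remaining norm (least collinear)
        best = max(candidates, key=lambda j: np.linalg.norm(P[:, j]))
        selected.append(best)
        candidates.remove(best)
    return sorted(selected)

# Example: 70th wavelength point (0-based index 69) as the start, keep 40 wavelengths
# selected_idx = spa_select(X_calibration, start_index=69, n_select=40)
```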
The LASSO algorithm is a widely used linear regression method in machine learning. By introducing L1 regularization, it automatically retains features that significantly affect the target variable while shrinking the coefficients of the remaining features to zero, thereby achieving model sparsity.27 Its optimization objective is shown in eqn (1):
$$\frac{1}{2M}\sum_{i=1}^{M}\left(y_i-\theta_0-\sum_{j}\theta_j x_{ji}\right)^{2}+\lambda\sum_{j}\left|\theta_j\right| \qquad (1)$$

where M is the number of samples, y_i is the measured value of sample i, x_{ji} is the j-th spectral variable of sample i, θ_0 and θ_j are the regression coefficients, and λ is the regularization parameter.
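As a worked illustration, eqn (1) can be evaluated directly for a given coefficient vector; the helper below is hypothetical, and scikit-learn's Lasso minimizes an objective of the same form (with the intercept handled separately).

```python
import numpy as np

def lasso_objective(theta0, theta, X, y, lam):
    """Eqn (1): halved mean squared residual plus the L1 penalty on the coefficients."""
    M = len(y)
    residuals = y - theta0 - X @ theta
    return (1.0 / (2 * M)) * np.sum(residuals ** 2) + lam * np.sum(np.abs(theta))
```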
The SPA-LASSO method combines SPA with LASSO. It utilizes SPA to select a subset of spectral variables that are highly correlated with the target variable, and then applies the LASSO algorithm to further refine this variable set, producing the final set of spectral feature variables.
The significant advantage of the SPA-LASSO method lies in: (1) reducing computational complexity. The runtime of the LASSO algorithm is dependent on the number of features. By employing the SPA algorithm for preliminary screening, the SPA-LASSO method can handle a smaller feature set in the LASSO algorithm, saving computational resources and feature selection time. (2) Enhanced feature selection accuracy and interpretability. The SPA algorithm can eliminate features with poor correlation with the target variable, reducing interference from less useful features in the LASSO algorithm. This improves the accuracy and stability of feature selection. By passing a more relevant set of features to the LASSO algorithm, the model can be further refined to select features that provide better explanatory power for the target variable. (3) Avoiding local optima and reducing feature redundancy. The LASSO algorithm, with its L1 regularization, tends to produce sparse coefficient estimates. This allows it to globally consider the correlation of all features with the target variable, thereby avoiding local optima and reducing feature redundancy. The technical roadmap is shown in Fig. 1.
$$\mathrm{RMSE}=\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i-y_i\right)^{2}} \qquad (2)$$

$$\mathrm{MAE}=\frac{1}{n}\sum_{i=1}^{n}\left|\hat{y}_i-y_i\right| \qquad (3)$$

$$R^{2}=1-\frac{\sum_{i=1}^{n}\left(\hat{y}_i-y_i\right)^{2}}{\sum_{i=1}^{n}\left(y_i-\bar{y}\right)^{2}} \qquad (4)$$

where ŷ_i and y_i are the predicted and measured values of sample i, ȳ is the mean of the measured values, and n is the number of samples.
RMSE is used to measure the average error between predicted values and actual values. A smaller RMSE indicates that the predicted values are closer to the actual values. Mean Absolute Error (MAE) represents the average absolute difference between predicted values and actual values. It measures the average prediction bias of the model; a smaller MAE indicates higher prediction accuracy of the model. R2 is used to measure the model's explanatory power over the observed data. A value closer to 1 indicates stronger explanatory power of the model.
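A minimal sketch of computing the three metrics in eqn (2)–(4) with scikit-learn (illustrative only; the function name is not from the original work):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

def evaluate(y_true, y_pred):
    """Return RMSE, MAE and R2 for reference vs. predicted values."""
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    mae = mean_absolute_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    return rmse, mae, r2
```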
The spectral preprocessing, feature wavelength selection and PLSR modelling were all carried out in MATLAB R2022a, and the t-test was performed using SPSS 19.0.
From the 6th to the 9th sampling period, the grapes are in the veraison stage. During this period, the grapes begin to accumulate sugars, so SSC increases, while acidic substances start to degrade, so TA gradually decreases.30 This trend continues until the 11th sampling period, when the grapes reach maturity: SSC peaks at 26.2 °Bx, while TA reaches its minimum at 5.02 mg L−1. In the 12th sampling period, SSC decreases slightly, most likely because of post-maturity processes such as respiration, conversion of sugars into other substances such as alcohol, and moisture loss in the grapes. TA increases, possibly indicating drought stress on the fruit, which leads to the production of more organic acids.31
Fig. 3 shows the average spectra of grape samples collected over twelve periods. Although there are significant differences in absorption peaks, the overall trends are consistent. The spectral distribution between 900 and 970 nm is likely due to absorption by phenolic compounds, anthocyanins, cellulose, and sucrose.32 The wavelength range from 925 nm to 980 nm may relate to carbohydrates and O–H groups in water.33 The peak at 1200 nm corresponds to the second overtone of C–H stretching in sugars and organic acids, while the trough at 1798 nm is due to the C–O stretching in sugars and organic acids.34 Absorption peaks at 1100 nm and 1410 nm are attributed to the combination bands of O–H in water.35 Beyond 2400 nm, the spectral signals exhibit significant noise, necessitating preprocessing of the raw spectral signals to enhance the signal-to-noise ratio and improve spectral quality.
When applying the LASSO algorithm to select feature wavelengths from the original spectral data of the grape samples, the optimal regularization parameter λ was chosen by 10-fold cross-validation, taking the value that minimized the cross-validated root mean square error. The final selected feature wavelengths are depicted in Fig. 4(a). When using the SPA algorithm for spectral feature selection, the choice of starting wavelength is crucial: an appropriate starting wavelength not only helps preserve key information in the selected feature wavelengths but also reduces the risk of overfitting. Here, the 70th wavelength point was set as the starting wavelength, with a maximum of 255 iterations. The algorithm retained the top 40 wavelength points that correlated most strongly with the starting wavelength as the final set of feature wavelengths, as shown in Fig. 4(b).
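A hedged sketch of this λ selection is shown below; LassoCV picks the penalty that minimizes the cross-validated mean squared error (equivalent to minimizing RMSE), and the wavelengths with non-zero coefficients form the feature set. The solver settings are assumptions, not the authors' MATLAB configuration.

```python
import numpy as np
from sklearn.linear_model import LassoCV

def lasso_wavelengths(X, y, n_folds=10):
    """Fit LASSO with lambda chosen by 10-fold CV; return non-zero coefficient indices."""
    model = LassoCV(cv=n_folds, max_iter=10000).fit(X, y)
    return np.flatnonzero(model.coef_), model.alpha_

# feature_idx, best_lambda = lasso_wavelengths(X_preprocessed, ssc_values)
```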
The feature wavelengths selected by the SPA algorithm are concentrated between 880 nm and 1150 nm, while wavelengths between 1630 nm and 2400 nm were not chosen as feature wavelengths. This may be because the SPA algorithm places greater emphasis on the spatial distribution among the feature wavelengths, thereby preferring feature sets that exhibit spatial continuity or consistency. Across the five preprocessing methods used, a significant number of identical wavelength points are selected as feature wavelengths, predominantly between 900 nm and 1200 nm. This is because the SPA algorithm primarily considers the correlations among the feature wavelengths and their correlations with the target variable during feature selection, largely independently of the preprocessing applied to the data. As a result, the selected set of feature wavelengths demonstrates statistical significance and interpretability.
When improving the feature wavelength selection using the LASSO algorithm on top of the SPA algorithm, the process starts by using the SPA algorithm to select the 70th wavelength point as the starting wavelength. The SPA algorithm retains the top 100 wavelength points with the highest correlations as the intermediate set of feature wavelengths. Subsequently, the LASSO algorithm is applied to further refine this intermediate set of feature wavelengths. The selection of the optimal regularization parameter is determined based on 10-fold cross-validation results to minimize the RMSE. This process finalizes the selection of the feature wavelength set, as depicted in Fig. 4(c).
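Chaining the two stages might look like the following sketch, which reuses the spa_select function from the earlier SPA sketch; the 70th starting wavelength and the 100-variable intermediate set follow the text, while the remaining settings are assumptions.

```python
import numpy as np
from sklearn.linear_model import LassoCV

def spa_lasso(X, y, start_index=69, n_intermediate=100, n_folds=10):
    """SPA pre-screen to an intermediate subset, then LASSO refinement by 10-fold CV."""
    spa_idx = spa_select(X, start_index=start_index, n_select=n_intermediate)
    lasso = LassoCV(cv=n_folds, max_iter=10000).fit(X[:, spa_idx], y)
    kept = np.flatnonzero(lasso.coef_)
    return [spa_idx[i] for i in kept]  # map back to original wavelength indices
```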
Compared with the SPA algorithm, the LASSO algorithm, owing to its L1 regularization, selects feature wavelengths that are more spatially sparse and evenly distributed. After applying MSC and SNV preprocessing to the spectra, the number of selected feature wavelengths is minimal and highly consistent, likely because these preprocessing methods eliminate nonlinear and linear offsets in the spectral data, enhancing data reliability and signal-to-noise ratio. The total number of feature wavelengths selected by SPA-LASSO is significantly smaller than the numbers selected by LASSO or SPA alone. While the SPA algorithm emphasizes spatial sparsity, the LASSO algorithm gives greater weight to inter-variable correlations.
Predictive models were established using PLSR for the full spectra and each set of selected feature wavelengths from five preprocessing types. Prior to model construction, the dataset was randomly split into training (80%) and testing (20%) sets, with an additional 25% split from the training set for validation. The PLS model results based on the full spectra with different preprocessing methods are shown in Table 1.
| Substance | Selection spectra | RC2 | RMSEC | MAEC | RP2 | RMSEP | MAEP |
|---|---|---|---|---|---|---|---|
| SSC | Raw spectra | 0.806 | 3.287 | 2.639 | 0.715 | 4.090 | 3.370 |
|  | MSC | 0.877 | 2.598 | 2.318 | 0.880 | 2.584 | 2.294 |
|  | SG | 0.898 | 2.398 | 2.119 | 0.869 | 2.689 | 2.366 |
|  | SNV | 0.861 | 2.807 | 2.306 | 0.849 | 2.902 | 2.435 |
|  | MSC + SG | 0.913 | 2.334 | 1.936 | 0.915 | 2.402 | 2.035 |
| TA | Raw spectra | 0.799 | 5.016 | 4.102 | 0.788 | 5.091 | 4.379 |
|  | MSC | 0.855 | 4.126 | 3.421 | 0.829 | 4.475 | 3.765 |
|  | SG | 0.889 | 3.892 | 2.847 | 0.885 | 3.921 | 2.791 |
|  | SNV | 0.835 | 4.453 | 3.654 | 0.804 | 4.922 | 3.980 |
|  | MSC + SG | 0.898 | 3.661 | 2.592 | 0.892 | 3.761 | 2.889 |
From Table 1 it can be observed that the PLSR models established with different preprocessing methods on the full spectra yield varying prediction results. Overall, predictions of SSC content are better than those of TA content, likely because of the larger errors incurred when measuring TA by titration. For the SSC and TA models individually, those built on the raw spectra yield the poorest results, while those using MSC + SG preprocessing consistently achieve the highest prediction accuracies (RC2, RP2 > 0.910 for SSC; RC2, RP2 > 0.890 for TA). This can be attributed to MSC + SG effectively correcting spectral baseline drift and reducing noise interference, thereby improving data stability and signal-to-noise ratio. Therefore, among the four preprocessing methods evaluated for predicting SSC and TA from the full spectra, MSC + SG is the most suitable. Nevertheless, the prediction results are still not entirely satisfactory, highlighting the need for feature wavelength selection to eliminate redundant information and improve prediction accuracy.
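A minimal sketch of the data split and PLSR fit described above is given below; the number of latent variables and the random seeds are assumptions, since they are not reported in the text.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

def fit_plsr(X, y, feature_idx, n_components=10):
    """80/20 train/test split, with 25% of the training set held out for validation,
    then a PLSR model fitted on the selected feature wavelengths."""
    Xf = X[:, feature_idx]
    X_train, X_test, y_train, y_test = train_test_split(Xf, y, test_size=0.2, random_state=0)
    X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=0)
    # X_val/y_val are reserved for tuning n_components; latent variables cannot
    # exceed the number of selected wavelengths.
    pls = PLSRegression(n_components=min(n_components, Xf.shape[1])).fit(X_tr, y_tr)
    y_pred = pls.predict(X_test).ravel()
    rmsep = np.sqrt(mean_squared_error(y_test, y_pred))
    return pls, r2_score(y_test, y_pred), rmsep
```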
The PLS model results for the SSC content and TA content based on respective feature wavelength sets are shown in Tables 2 and 3, respectively.
| Substance | Selection spectra | Wavelengths | RC2 | RMSEC | MAEC | RP2 | RMSEP | MAEP |
|---|---|---|---|---|---|---|---|---|
| SSC | Raw spectra + LASSO | 35 | 0.972 | 1.286 | 0.972 | 0.969 | 1.335 | 1.007 |
|  | MSC + LASSO | 20 | 0.968 | 1.349 | 1.052 | 0.967 | 1.375 | 0.995 |
|  | SNV + LASSO | 16 | 0.960 | 1.504 | 1.061 | 0.945 | 1.754 | 1.208 |
|  | SG + LASSO | 37 | 0.969 | 1.328 | 0.924 | 0.964 | 1.443 | 1.132 |
|  | MSC + SG + LASSO | 23 | 0.970 | 1.274 | 0.946 | 0.966 | 1.384 | 0.996 |
|  | Raw spectra + SPA | 40 | 0.963 | 1.436 | 0.935 | 0.966 | 1.404 | 1.086 |
|  | MSC + SPA | 40 | 0.965 | 1.401 | 0.989 | 0.961 | 1.481 | 1.051 |
|  | SNV + SPA | 40 | 0.959 | 1.521 | 0.953 | 0.958 | 1.552 | 1.090 |
|  | SG + SPA | 40 | 0.963 | 1.460 | 1.077 | 0.946 | 1.760 | 1.421 |
|  | MSC + SG + SPA | 40 | 0.939 | 1.891 | 1.484 | 0.931 | 1.986 | 1.550 |
|  | Raw spectra + SPA-LASSO | 22 | 0.977 | 1.168 | 0.926 | 0.980 | 1.074 | 0.848 |
|  | MSC + SPA-LASSO | 8 | 0.979 | 1.101 | 0.861 | 0.971 | 1.305 | 0.888 |
|  | SNV + SPA-LASSO | 13 | 0.981 | 1.056 | 0.859 | 0.979 | 1.107 | 0.905 |
|  | **SG + SPA-LASSO** | **14** | **0.982** | **1.010** | **0.687** | **0.983** | **0.978** | **0.716** |
|  | MSC + SG + SPA-LASSO | 17 | 0.981 | 1.048 | 0.881 | 0.982 | 1.040 | 0.780 |

^a Bold values indicate the best-performing model for the current prediction results.
| Substance | Selection spectra | Wavelengths | RC2 | RMSEC | MAEC | RP2 | RMSEP | MAEP |
|---|---|---|---|---|---|---|---|---|
| TA | Raw spectra + LASSO | 35 | 0.930 | 2.930 | 2.305 | 0.927 | 3.009 | 2.417 |
|  | MSC + LASSO | 20 | 0.905 | 3.394 | 2.702 | 0.908 | 3.309 | 2.730 |
|  | SNV + LASSO | 16 | 0.911 | 3.256 | 2.694 | 0.900 | 3.466 | 2.879 |
|  | SG + LASSO | 37 | 0.928 | 2.903 | 2.248 | 0.917 | 3.049 | 2.302 |
|  | MSC + SG + LASSO | 23 | 0.934 | 2.793 | 2.181 | 0.928 | 2.875 | 2.156 |
|  | Raw spectra + SPA | 40 | 0.910 | 3.254 | 2.614 | 0.897 | 3.589 | 2.938 |
|  | MSC + SPA | 40 | 0.928 | 2.925 | 2.348 | 0.918 | 3.115 | 2.378 |
|  | SNV + SPA | 40 | 0.923 | 2.978 | 2.329 | 0.930 | 2.887 | 2.289 |
|  | SG + SPA | 40 | 0.919 | 3.050 | 2.301 | 0.904 | 3.334 | 2.721 |
|  | MSC + SG + SPA | 40 | 0.939 | 2.774 | 2.222 | 0.928 | 2.952 | 2.329 |
|  | Raw spectra + SPA-LASSO | 22 | 0.942 | 2.686 | 2.151 | 0.934 | 2.734 | 2.073 |
|  | MSC + SPA-LASSO | 8 | 0.918 | 3.182 | 2.534 | 0.904 | 3.426 | 2.852 |
|  | SNV + SPA-LASSO | 13 | 0.941 | 2.609 | 1.952 | 0.934 | 2.795 | 2.200 |
|  | **SG + SPA-LASSO** | **14** | **0.954** | **2.347** | **1.846** | **0.944** | **2.618** | **2.046** |
|  | MSC + SG + SPA-LASSO | 17 | 0.943 | 2.624 | 2.056 | 0.944 | 2.538 | 1.969 |

^a Bold values indicate the best-performing model for the current prediction results.
By comparing the prediction data in Tables 1–3, it is evident that the predictive models established using feature spectra significantly outperform those using full spectra, indicating that the selected feature wavelengths offer better interpretability. Moreover, the number of wavelengths after feature selection is reduced by 84% compared with the full spectra, indicating a substantial reduction in redundant spectral information. Among the feature selection methods, SPA-LASSO identifies the fewest feature wavelengths, further streamlining the predictive models; in particular, the MSC + SPA-LASSO method reduces the number of feature wavelengths by over 96% relative to the full spectra.
Comparing the SSC prediction models established with the various feature wavelength sets in Table 2, the model built using the MSC + SG + SPA method yielded the poorest results (RC2 = 0.939, RP2 = 0.931), although it was still markedly better than the full-spectrum model. The SG + SPA-LASSO method performed best on both the calibration and prediction sets (RC2 = 0.982, RP2 = 0.983), with the prediction set's coefficient of determination slightly higher than that of the calibration set, indicating strong generalization and no evident overfitting. Although the model established using the MSC + SPA-LASSO method (RC2 = 0.979, RP2 = 0.971) slightly underperformed SG + SPA-LASSO, its feature wavelength set comprises only 8 wavelengths, significantly reducing model complexity. This reduction is important for miniaturizing portable handheld devices and improving the operating speed of the deployed hardware.
Comparing the TA prediction models in Table 3, the model established using the MSC + LASSO method performed the poorest (RC2 = 0.905, RP2 = 0.908). The SG + SPA-LASSO method exhibited the best performance on the calibration set (RC2 = 0.954), while the MSC + SG + SPA-LASSO method performed best on the prediction set: both achieved RP2 = 0.944, but MSC + SG + SPA-LASSO had a smaller RMSEP, indicating a lower average prediction error on the prediction set. However, the MSC + SPA-LASSO model, which has the smallest feature wavelength set, showed comparatively poor performance (RC2 = 0.918, RP2 = 0.904). This may be because the model reduced the number of features excessively, discarding spectral information that, although less correlated, still held value for total acid prediction, thereby lowering model performance.
After a thorough analysis of the PLS model results, a comprehensive conclusion was drawn: prediction models based on feature spectra outperform those based on full spectra, especially when utilizing the SPA-LASSO method to select fewer but more accurate feature variables. Fig. 5 depicts scatter plots of measured and predicted values of SSC and TA for both calibration and prediction sets, with solid regression lines illustrating the correlation between measured and predicted values. Notably, Fig. 5(c) demonstrates excellent fitting performance for SSC. Conversely, Fig. 5(e) indicates slightly poorer predictions for TA, particularly at higher values. This variation may relate to the physiological maturity of grape samples and the stability of TA content.36 These results indicate that PLS models accurately predict SSC and TA levels during grape ripening. The SPA-LASSO feature wavelength selection method effectively identifies optimal subsets of wavelengths with the best predictive ability in spectral data, reducing redundant information and thereby enhancing modeling effectiveness and generalization capability.
Fig. 5 Individual prediction scatter plots ((a)–(c) for SSC predictions, (d)–(f) for TA predictions).