Guannan Chen*(a), Xueliang Lin(a), Duo Lin*(a,b), Xiaosong Ge(a), Shangyuan Feng(a), Jianji Pan(c), Juqiang Lin(a), Zufang Huang(a), Xi Huang(a) and Rong Chen(a)
(a) Key Laboratory of Optoelectronic Science and Technology for Medicine, Ministry of Education and Fujian Provincial Key Laboratory for Photonics Technology, Fujian Normal University, Fuzhou 350007, China. E-mail: edado@fjnu.edu.cn; linduo1986@163.com
(b) College of Integrated Traditional Chinese and Western Medicine, Fujian University of Traditional Chinese Medicine, Fuzhou, Fujian 350122, China
(c) Fujian Provincial Cancer Hospital, Fuzhou, Fujian 350001, China
First published on 14th January 2016
Identification of different states in cancer is of vital importance for cancer treatment and management. A diagnostic algorithm based on Lasso-partial least squares-discriminant analysis (Lasso-PLS-DA) was developed here to improve blood surface-enhanced Raman spectroscopy (SERS) analysis, with the aim of classifying different states in nasopharyngeal cancer (NPC). A total of 160 blood plasma samples were collected for this study: 60 from normal volunteers, 25 from T1 stage cancer patients and 75 from T2–T4 stage cancer patients. A diagnostic sensitivity of 68.0% and a specificity of 84.0% were achieved for separating T2–T4 stage from T1 stage cancer, an improvement of 20 percentage points in diagnostic specificity over the previous work. This exploratory study demonstrates that Lasso-PLS-DA can be integrated with blood SERS analysis as a promising clinical complement for detecting different T stages in NPC.
It should be noted that each raw Raman spectrum obtained from a biological sample usually spans a high-dimensional spectral space (for example, many intensity variables), which leads to computational complexity and makes it inefficient to extract the most diagnostically significant information. In addition, Raman spectra from similar subjects are commonly similar, making it a challenge to differentiate them sensitively with simplistic band feature analysis. These limitations hinder further clinical application of Raman spectroscopy (RS) in medical diagnosis. Within the past decade, numerous developments in multivariate analysis, including principal component analysis (PCA), linear discriminant analysis (LDA), partial least-squares regression (PLS), artificial neural networks (ANNs), support vector machines (SVM) and genetic algorithms (GA), have enabled significant progress of RS and other technologies in biomedical detection.1,15–19 For example, Huang et al. demonstrated the ability to distinguish dysplasia from normal gastric mucosa tissue using RS in conjunction with PCA-LDA.4 Similar diagnostic algorithms have also been applied to Raman spectra of cells and blood for cancer detection.20,21
Most previous studies focused on discriminating cancer from normal subjects using Raman methods with multivariate analysis; however, few studies have addressed the identification of different cancer stages, which is of great importance for cancer treatment and management. Very recently, we evaluated the feasibility of a label-free method based on blood plasma SERS with PCA-LDA for exploring the variability of different tumor (T) stages in nasopharyngeal cancer (NPC).22 This preliminary study showed that high diagnostic accuracies of 83.5% and 93.3% can be achieved for classifying T1 stage cancer versus normal and T2–T4 stage cancer versus normal blood groups, respectively. However, the diagnostic accuracy was only 63.0% for classifying T1 stage cancer versus T2–T4 stage cancer. Thus, a more powerful diagnostic algorithm that can identify Raman spectra belonging to different NPC stages would be of significant clinical value for blood SERS analysis.
In this work, a robust multivariate statistical method based on Lasso-partial least squares-discriminant analysis (Lasso-PLS-DA) was employed to develop an efficient diagnostic algorithm for classifying SERS spectra of blood samples from different NPC stages.
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n$$
Lasso regression is a regularization technique that reduces the number of predictors in a regression model. It uses the original data matrix X to constrain the coefficient values of the multiple linear regression, producing shrinkage estimates with potentially lower predictive error than ordinary least squares; the Lasso model parameter should be chosen adaptively to minimize an estimate of the expected prediction error. Under this constraint, the model weighs the importance of each channel for prediction, and the optimization drives the β values of unimportant channels to 0. The Lasso estimate is formulated as described previously.24
In that formulation, β̂ represents the estimated coefficients, arg min over β denotes the vector β with minimal mean squared error, N is the number of observations, xi is the data (a vector of p values) at observation i, yi is the response at observation i, and the parameters β0 and β are a scalar and a p-vector, respectively.
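For reference, one widely used way of writing the Lasso estimate with these symbols is the penalized least-squares form below. The penalty weight $\lambda \ge 0$ and the $1/(2N)$ scaling are assumptions taken from a common convention, not from this text; the equivalent constrained form instead bounds $\sum_j \lvert\beta_j\rvert \le t$.

$$\hat{\beta} = \arg\min_{\beta_0,\,\beta}\left\{\frac{1}{2N}\sum_{i=1}^{N}\left(y_i-\beta_0-\sum_{j=1}^{p}x_{ij}\beta_j\right)^{2}+\lambda\sum_{j=1}^{p}\lvert\beta_j\rvert\right\}$$

A larger $\lambda$ (or a smaller bound $t$) forces more coefficients $\beta_j$ to exactly zero, which is what removes the corresponding spectral channels from the model.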
In general, the advantage of the Lasso is that driving a parameter to zero deselects the corresponding feature from the regression. The Lasso thus automatically selects the more relevant features and discards the others in an iterative process. This also makes the Lasso robust against noise. Most significantly, the Lasso produces a sparser model with a smaller number of non-zero coefficients (β values).
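The way the Lasso drives coefficients to exactly zero can be illustrated with a minimal coordinate-descent sketch. It assumes the penalized objective (1/2N)·RSS + λ·Σ|βj|; the function names `soft_threshold` and `lasso_cd`, the value of `lam` and the toy data are hypothetical and are not the authors' implementation.

```python
def soft_threshold(z, gamma):
    """Shrink z toward zero by gamma; values within [-gamma, gamma] become 0."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0


def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate-descent Lasso for (1/2N)*RSS + lam*sum|beta_j| (assumed form).

    X is a list of N rows with p values each; y is a list of N responses.
    Columns are centred internally so the intercept is recovered at the end.
    """
    n, p = len(X), len(X[0])
    col_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - col_mean[j] for j in range(p)] for row in X]
    yc = [v - y_mean for v in y]
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0   # covariance of channel j with the partial residual
            norm = 0.0  # sum of squares of channel j
            for i in range(n):
                others = sum(Xc[i][k] * beta[k] for k in range(p) if k != j)
                rho += Xc[i][j] * (yc[i] - others)
                norm += Xc[i][j] ** 2
            beta[j] = soft_threshold(rho / n, lam) / (norm / n) if norm > 0 else 0.0
    beta0 = y_mean - sum(col_mean[j] * beta[j] for j in range(p))
    return beta0, beta


# Hypothetical toy data: the response depends only on the first channel.
X_toy = [[1.0, 0.5, 0.2], [2.0, 0.1, 0.9], [3.0, 0.7, 0.4],
         [4.0, 0.3, 0.8], [5.0, 0.9, 0.1], [6.0, 0.2, 0.6]]
y_toy = [2.0 * row[0] for row in X_toy]

beta0, beta = lasso_cd(X_toy, y_toy, lam=0.5)
selected = [j for j, b in enumerate(beta) if b != 0.0]
print(beta0, beta, selected)
```

In this toy case only the first channel keeps a non-zero coefficient, and that coefficient is shrunk below the true slope of 2, which is the shrinkage effect described in the text.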
After obtaining the useful spectral variables with the Lasso, we noticed that the useful variables were not distributed evenly across the SERS bands: some bands contained more useful variables than others. To avoid uncertainty in the prediction result caused by noise interfering with individual variables, we chose integral SERS spectral ranges, each containing one or more useful spectral variables, to establish the subsequent prediction model. PLS-DA was then employed to classify the cancer stages based on these selected spectral bands. PLS performs a two-block regression in which the predictor block (X) is used to predict the response block (Y); here each row of the X block is one spectrum and the Y block holds the class labels. PLS-DA builds on the basic principle of PCA but further rotates the components to maximize the covariance between group affinity and spectral variation, so that the diagnostically relevant variation is explained by the PLS components. However, the complexity of the PLS-DA model grows with the number of model components. The performance of PLS-DA was measured by comparing the root mean square error of prediction (RMSEP) of the proposed model with the RMSEP of the model containing all the variables.
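The covariance-maximizing idea behind PLS-DA can be sketched for a single component and a binary label coded as 0/1. This is only the first PLS1 component (a full PLS-DA model extracts further components by deflation); the function name `pls1_first_component`, the 0.5 decision threshold and the toy spectra are assumptions made for illustration.

```python
def pls1_first_component(X, y):
    """First PLS component for a single response (PLS1).

    X: list of N spectra (rows); y: list of N class labels coded as 0/1.
    Returns the unit weight vector w, the scores t and a predict() function.
    """
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(p)] for row in X]
    yc = [v - y_mean for v in y]
    # Weight vector: the direction in X-space with maximal covariance with y
    w = [sum(Xc[i][j] * yc[i] for i in range(n)) for j in range(p)]
    norm = sum(v * v for v in w) ** 0.5
    w = [v / norm for v in w]
    # Scores: projection of each centred spectrum onto w
    t = [sum(Xc[i][j] * w[j] for j in range(p)) for i in range(n)]
    # Regress the centred labels on the scores
    q = sum(ti * yi for ti, yi in zip(t, yc)) / sum(ti * ti for ti in t)

    def predict(row):
        score = sum((row[j] - x_mean[j]) * w[j] for j in range(p))
        return y_mean + q * score

    return w, t, predict


# Hypothetical toy spectra with three channels; channel 1 carries no class signal.
X_toy = [[1.0, 5.0, 2.0], [1.2, 5.1, 2.1], [0.9, 4.9, 1.9],
         [2.0, 5.0, 3.0], [2.1, 5.2, 3.1], [1.9, 4.8, 2.9]]
y_toy = [0, 0, 0, 1, 1, 1]

w, t, predict = pls1_first_component(X_toy, y_toy)
labels = [1 if predict(row) > 0.5 else 0 for row in X_toy]
print(w, labels)
```

The weight on the uninformative channel is essentially zero, so the component is built almost entirely from the channels whose intensity co-varies with the class label.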
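RMSEP is defined as
$$\mathrm{RMSEP} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\hat{y}_i - y_i\right)^{2}}$$
where N is the number of predicted samples, yi is the reference value (class label) of sample i and ŷi is its predicted value.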
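In its usual form, RMSEP is the square root of the mean squared difference between predicted and reference values over the predicted samples. A minimal sketch follows; the function name `rmsep` and the example values are hypothetical.

```python
import math


def rmsep(y_true, y_pred):
    """Root mean square error of prediction over paired reference/predicted values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    squared_errors = [(p - t) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(y_true))


# Hypothetical class labels (0/1) and cross-validated predictions.
example = rmsep([0, 0, 1, 1], [0.1, 0.3, 0.8, 0.6])
print(round(example, 4))
```

Evaluating this quantity on withheld samples for models with an increasing number of components gives the kind of curve used to choose the model complexity.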
The Lasso algorithm with leave-one-out cross-validation (LOOCV) was first employed to seek the significant SERS spectral features most closely associated with the pathologies of the different cancer stages. The latter convention was used as the Lasso model parameter in this paper because it relates to the useful features more intuitively and directly than the alternatives. In LOOCV, one cancer sample (i.e., one spectrum) was withheld from the 100 cancer samples, and the remaining blood spectra were used to rebuild the Lasso model and classify the withheld spectrum. This procedure was repeated until every cancer sample had been withheld and classified.4 The SERS spectral features obtained by the Lasso algorithm are shown in Fig. 1. LOOCV and 10-fold cross-validation yielded the same nine significant band regions of spectral variables (483–487, 562–566, 573–577, 704–708, 1011–1015, 1388–1392, 1445–1449, 1561–1565 and 1702–1706 cm−1). Two integral SERS spectral ranges (550–585 and 1435–1730 cm−1), which contain more useful spectral variables, are marked by the cyan shaded areas. According to the previous literature,22 the selected spectral ranges are possibly related to DNA/RNA bases and amide I. The spectral features (band positions, intensities and bandwidths) of the two regions are very similar between T1 and T2–T4 cancer plasma, yet the Lasso algorithm can still extract diagnostically significant variables from them. The reason may be that cancer development is widely accepted to be a multistep, continuous progression from normal to cancer, which implies subtle and vague molecular distinctions between stages and makes it a challenge to identify different cancer stages by simplistic spectral feature analysis.
This result supports a potential role for the proposed Lasso-based method in classifying different cancer stages. Similarly, Huang et al. applied a genetic algorithm to select significant spectral variables from Raman band regions to provide clinically relevant discrimination between normal and precancerous cervical tissues.15 Unlike their work, the spectral ranges selected in this work are wider, with the aim of avoiding interference from noise in individual variables.
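The relationship between the nine Lasso-selected band regions and the two integral ranges can be checked with a short sketch. The wavenumbers are those quoted in the text; the helper name `regions_inside` is hypothetical.

```python
# Band regions selected by the Lasso (cm^-1), as listed in the text.
selected_regions = [
    (483, 487), (562, 566), (573, 577), (704, 708), (1011, 1015),
    (1388, 1392), (1445, 1449), (1561, 1565), (1702, 1706),
]
# Integral spectral ranges chosen as PLS-DA input (cm^-1).
integral_ranges = [(550, 585), (1435, 1730)]


def regions_inside(regions, lo, hi):
    """Return the regions that lie entirely within [lo, hi]."""
    return [r for r in regions if lo <= r[0] and r[1] <= hi]


coverage = {rng: regions_inside(selected_regions, *rng) for rng in integral_ranges}
for rng, inside in coverage.items():
    print(rng, len(inside), inside)
```

The 550–585 cm−1 range contains two of the selected regions and the 1435–1730 cm−1 range contains three, consistent with the statement that these ranges hold more useful variables than the rest of the spectrum.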
The PLS-DA model was then applied to the selected SERS band regions. The optimum number of latent variables (components) was determined by leave-one-out cross-validation using the root mean square error of prediction (RMSEP). Fig. 2 shows RMSEP as a function of the number of latent variables: the optimum number corresponds to the minimum RMSEP, beyond which RMSEP rises as more variables are added because of overfitting. To assess the predictive accuracy of the Lasso-PLS-DA based diagnostic algorithm, receiver operating characteristic (ROC) curves (Fig. 3) were produced. The ROC curve for Lasso-PLS-DA, calculated from the two selected spectral ranges, gave an integration area under the ROC curve (AUC) of 0.812 (optimum number of components = 10). The AUCs for PLS-DA and PCA-LDA, calculated from the full spectra, were 0.683 (optimum number of components = 4) and 0.631 (optimum number of components = 4), respectively. The Lasso-PLS-DA algorithm thus achieved greater diagnostic efficiency than the conventional algorithms based on PLS-DA and PCA-LDA. This can be explained as follows. With full-spectrum variables, the diagnostic efficiency of PLS-DA and PCA-LDA may be degraded by non-significant variables and noise. In Lasso-PLS-DA, the spectral regions containing the most significant selected variables are used as an optimal input for the subsequent PLS-DA, which offers a reliable way to overcome these limitations. Posterior probability values were also used to predict the response (Fig. 4). With a threshold line, the posterior probability scatter plot yielded a diagnostic sensitivity of 68.0% (51/75) and a specificity of 84.0% (21/25) for separating T2–T4 stage from T1 stage cancer, a marked improvement over the previous work (a sensitivity of 62.7% (47/75) and a specificity of 64.0% (16/25)).22 In particular, the diagnostic specificity increased by 20 percentage points in this work.
These results show that the Lasso-PLS-DA algorithm provides a powerful way to identify different cancer stages by building a classification model from the significant Raman features.
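The reported sensitivity and specificity follow directly from the counts given in the text, as the short sketch below shows. The overall accuracy values are derived here from those counts rather than quoted from the results, and the variable names are hypothetical.

```python
def percent(correct, total):
    """Percentage of correctly classified samples in a group."""
    return 100.0 * correct / total


# Counts quoted in the text: T2-T4 stage is the positive class (75 samples),
# T1 stage is the negative class (25 samples).
sens_new, spec_new = percent(51, 75), percent(21, 25)    # Lasso-PLS-DA (this work)
sens_prev, spec_prev = percent(47, 75), percent(16, 25)  # PCA-LDA (previous work)

# Derived values (not stated explicitly for this work in the text).
acc_new = percent(51 + 21, 75 + 25)
acc_prev = percent(47 + 16, 75 + 25)

print(f"sensitivity: {sens_prev:.1f}% -> {sens_new:.1f}%")
print(f"specificity: {spec_prev:.1f}% -> {spec_new:.1f}%")
print(f"accuracy:    {acc_prev:.1f}% -> {acc_new:.1f}%")
```

The derived previous accuracy of 63.0% matches the figure quoted for T1 versus T2–T4 classification with PCA-LDA, and the specificity gain is 20 percentage points (64.0% to 84.0%).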
We also calculated the ROC curves for discrimination between the normal and cancer groups (Table 1). With leave-one-out cross-validation on the full spectrum, PLS-DA and PCA-LDA gave AUCs of 0.931 and 0.919, respectively, whereas Lasso-PLS-DA gave an AUC of 0.930. These results indicate that Lasso-PLS-DA, which efficiently distinguishes different NPC stages, can also be used to distinguish normal and cancer groups.
Table 1 Integration area under the ROC curve (AUC) for each diagnostic combination and algorithm
| Diagnostic combination | Lasso-PLS-DA | PLS-DA | PCA-LDA |
|---|---|---|---|
| T1 stage vs. T2–T4 stage cancer | 0.812 | 0.683 | 0.631 |
| Normal vs. cancer | 0.930 | 0.931 | 0.919 |
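
The integration area under the ROC curve equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. A minimal sketch of this rank-based computation follows; the function name `auc` and the posterior-probability scores are hypothetical toy values, not data from this study.

```python
def auc(scores_pos, scores_neg):
    """Area under the ROC curve via pairwise comparison of positive and negative scores."""
    wins = 0.0
    for sp in scores_pos:
        for sn in scores_neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Hypothetical posterior probabilities for the positive and negative groups.
toy_pos = [0.9, 0.8, 0.6, 0.4]
toy_neg = [0.7, 0.5, 0.3, 0.2]
toy_auc = auc(toy_pos, toy_neg)
print(toy_auc)
```

An AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation, which gives a scale for reading the values in the table.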

This journal is © The Royal Society of Chemistry 2016