 Open Access Article
      
        
          
Hong Zhou‡ ab, Liangge Xu‡ abc, Zhihao Ren ab, Jiaqi Zhu* c and Chengkuo Lee* abd
aDepartment of Electrical and Computer Engineering, National University of Singapore, Singapore 117583. E-mail: elelc@nus.edu.sg
      
bCenter for Intelligent Sensors and MEMS (CISM), National University of Singapore, Singapore 117608
      
cNational Key Laboratory of Special Environment Composite Technology, Harbin Institute of Technology, Harbin, 150001, China. E-mail: zhujq@hit.edu.cn
      
dNUS Suzhou Research Institute (NUSRI), Suzhou 215123, China
    
First published on 7th November 2022
The world today is witnessing the significant role and huge demand for molecular detection and screening in healthcare and medical diagnosis, especially during the outbreak of COVID-19. Surface-enhanced spectroscopy techniques, including Surface-Enhanced Raman Scattering (SERS) and Infrared Absorption (SEIRA), provide lattice and molecular vibrational fingerprint information which is directly linked to the molecular constituents, chemical bonds, and configuration. These properties make them an unambiguous, nondestructive, and label-free toolkit for molecular diagnostics and screening. However, new issues in molecular diagnostics, such as increasing molecular species, faster spread of viruses, and higher requirements for detection accuracy and sensitivity, have brought great challenges to detection technology. Advancements in artificial intelligence and machine learning (ML) techniques show promising potential in empowering SERS and SEIRA with rapid analysis and automatic data processing to jointly tackle the challenge. This review introduces the combination of ML and SERS/SEIRA by investigating how ML algorithms can be beneficial to SERS/SEIRA, discussing the general process of combining ML and SEIRA/SERS, highlighting the molecular diagnostics and screening applications based on ML-combined SEIRA/SERS, and providing perspectives on the future development of ML-integrated SEIRA/SERS. In general, this review offers comprehensive knowledge about the recent advances and the future outlook regarding ML-integrated SEIRA/SERS for molecular diagnostics and screening.
First, it is challenging and time-consuming to handle the huge volumes of spectral data generated in many applications, such as the detection of multiple biomarkers and viruses and the monitoring of biological responses in multiple processes. In particular, the volume of spectral data keeps growing as SERS/SEIRA-based applications become more sophisticated.1 Furthermore, the processing of each set of spectral data is itself complex and time-consuming, generally involving normalization, baseline calibration, and feature signal extraction. Second, the spectra of different molecules overlap, which greatly limits the application scope of SERS and SEIRA. For instance, almost all kinds of protein molecules suffer from IR spectral overlapping between 1600 and 1700 cm−1, where the amide I and amide II vibration bands are located (proteins are special types of amides).18 Third, anomalies and artifacts are critical challenges for SERS and SEIRA because they degrade accuracy, stability, and reliability. The factors that cause anomalies and artifacts are varied, ranging from instrumental effects and sample effects to contamination in sampling procedures.19 More specifically, the instrumental effects include shifts in the wavenumber scale, multi-passing errors, detector effects, noise effects, dark noise, etc. The sample effects include sample heating, fluorescence interference in Raman spectra, and matrix absorption. The contamination in sampling procedures includes ambient lighting, air, sample support surfaces, sample containers, and sample movement. 
Finally, the manual design of surface-enhanced spectroscopy substrates is time-consuming and inefficient. Different analytes require customized structure designs to ensure that the plasmonic resonances match the molecular vibrations, a requirement that is particularly prominent in SEIRA spectroscopy.20 Therefore, the automatic design of substrates is highly desirable and would facilitate the practical application of the technology. In response to the above-mentioned bottlenecks, researchers are looking for breakthroughs from other fields.
A potential solution to these bottlenecks is algorithmic data analysis, which was widely employed in early chemometrics.21 Specifically, it is used to analyze and mine chemical data and to design optimal experiments or choose measurement procedures. Well-known algorithms include principal component analysis (PCA), principal component regression (PCR), multiple linear regression (MLR), linear discriminant analysis (LDA), and more.22,23 However, the implementation of these algorithms requires the support of high-performance computing. Therefore, these issues cannot be addressed by algorithms alone; they also require the assistance of computers. Artificial intelligence (AI) is a branch of computer science that aims to enable machines to perform tasks that typically require human intelligence. It utilizes the computing power of machines and intelligent algorithms to free people from complicated data analysis.24 Therefore, AI could provide novel strategies for overcoming the challenges faced by surface-enhanced spectroscopy and turn conventional SEIRA and SERS into intelligent tools and analysis platforms. 
One of the basic requirements for AI is learning, and most researchers agree that there is no AI without learning.25 Machine learning (ML) is therefore one of the most rapidly developing and significant subfields of AI research.26–33 At the very beginning of ML development (1950s–1960s), there were three major branches: symbolic learning proposed by Hunt et al., statistical methods by Nilsson, and neural networks by Rosenblatt.34 Nowadays, these branches have developed into advanced methods that can be divided into four categories: classification, regression, clustering, and dimensionality reduction.35–44 Representative algorithms include support vector machine (SVM), k-nearest neighbor (kNN), decision tree (DT), convolutional neural network (CNN), k-means, PCA, etc.45–50 These algorithms have been widely employed in SEIRA and SERS. The development path of AI-augmented surface-enhanced spectroscopy is shown in Fig. 1.51–60 In the course of this exploration, researchers have demonstrated many advantages of ML-augmented SEIRA and SERS over conventional approaches. ML offers a unique opportunity to solve pressing challenges in the field of surface-enhanced spectroscopy and is receiving increasing attention.
Fig. 1 Development path of AI-augmented surface-enhanced Raman and infrared spectroscopy. The common SERS substrate is nanoparticles with dimensions ranging from tens to hundreds of nanometers61 and the common SEIRA substrate is nanoantennas with micro-level dimensions. SEIRA and SERS are widely used in biochemical and energy fields. SERS: surface-enhanced Raman scattering; SEIRA: surface-enhanced infrared absorption; SM SERS: single-molecule SERS; s-SNOM: scattering-type scanning near-field optical microscope; TERS: tip-enhanced Raman spectroscopy; PCA: principal component analysis; DFA: deterministic finite automaton; SVM: support vector machine; ANN: artificial neural network; CNN: convolutional neural network; CFNN: cascade forward neural network; HCA: hierarchical cluster analysis. Reproduced with permission.61 Copyright 2020 Royal Society of Chemistry.
As noted above, although ML and surface-enhanced spectroscopy are technically complementary, ML-augmented SEIRA and SERS are still in their infancy and combining them efficiently is challenging. The landmark work of many researchers has greatly advanced the field, but their technical approaches are diverse and their perspectives differ. Therefore, the purpose of this review is to summarize representative works in this technological evolution and to provide a timely discussion of, and perspectives on, ML-augmented SEIRA and SERS and their applications in molecular diagnostics, the most representative and active field according to our literature survey. First, we introduce the concept of ML-augmented surface-enhanced spectroscopy and discuss the benefits of ML for SEIRA and SERS. Then, we present the development and application of ML-augmented SEIRA and SERS from the perspective of substrate design, data processing, and decision-making. Finally, we conclude and look ahead to the future of these technologies.
Machine learning algorithms enable the automated design of SERS/SEIRA substrates and thereby avoid time-consuming and onerous design processes. Taking SEIRA antenna design as an example,126 the first step is to analyze the infrared spectrum of the analyte molecule and obtain the positions of the molecular vibrations. Then, an appropriate antenna structure is chosen to excite plasmonic resonances that match the molecular vibrational frequencies. A lot of repetitive work is involved here. First, the selection of the structural shape is a critical and continuous optimization process. It requires designers to use simulation software (such as FDTD Solutions) to compare the figures of merit of different shapes, such as sensitivity, enhancement factor, and bandwidth. While design experience can reduce the number of iterations in the process, it can also lead to design deviations between designers. Another iterative task at this stage is to match antenna resonances and molecular vibrations by tuning the antenna dimensions, since zero detuning allows for more efficient molecule-plasmon coupling. Additionally, a multiband design is necessary to enhance SEIRA's ability to identify molecules if the detection target is a specific molecule in a mixture. Furthermore, the limitations of nanofabrication must also be considered at the device design stage. A good design that considers all of these factors takes a lot of effort and time. Fortunately, these time-consuming and repetitive tasks are easy and efficient for ML-assisted design systems. 
For example, genetic algorithms were used to automatically generate highly sensitive antenna structures that match well with molecular peaks.126 Furthermore, the number of iterations can be reduced and the learning efficiency improved by employing the physical constraints of causality to directly learn the response functions of antennas.127 In common deep neural networks, the function that the hidden layers implement to produce predictions is a black box: it works, but it is unknown how or why it works. By incorporating physical knowledge into the hidden layers, the network is able to learn the physical relationships between the input physical parameters. Automatic learning of the underlying physics can improve the accuracy of the prediction results.
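The resonance-matching loop described above can be illustrated with a minimal genetic algorithm sketch. It is not the method of ref. 126: the function `resonance_cm1`, the constant `SCALE`, and the target of 1650 cm−1 are hypothetical stand-ins for a real electromagnetic solver and a real analyte, chosen only so that the example runs on its own.

```python
import random

TARGET_CM1 = 1650.0   # assumed molecular vibration to match (illustrative value)
SCALE = 3300.0        # hypothetical constant: resonance = SCALE / antenna length (um)

def resonance_cm1(length_um):
    """Toy stand-in for a solver: a longer antenna resonates at a lower wavenumber."""
    return SCALE / length_um

def fitness(length_um):
    """Higher is better: negative detuning between resonance and vibration."""
    return -abs(resonance_cm1(length_um) - TARGET_CM1)

def evolve(generations=60, pop_size=30, seed=0):
    rng = random.Random(seed)
    population = [rng.uniform(0.5, 5.0) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]      # selection; the best designs survive
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            child = 0.5 * (a + b)                  # crossover: blend two parents
            child += rng.gauss(0.0, 0.05)          # mutation: small random change
            children.append(min(5.0, max(0.5, child)))
        population = parents + children
    return max(population, key=fitness)

best_length = evolve()
detuning = abs(resonance_cm1(best_length) - TARGET_CM1)
```

In this toy model the optimum is a 2 µm antenna, and `detuning` shrinks toward zero as generations pass. In practice the fitness call would wrap a full-wave simulation and combine several figures of merit.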
Machine learning algorithms also help SERS/SEIRA reduce anomalies and artifacts. As mentioned earlier, anomalies and artifacts are common in SERS/SEIRA and are mainly caused by instrumental effects, sample effects, and contamination in sampling procedures. Their presence leads to high noise, low accuracy and resolution, and poor stability and reliability. Machine learning can address these issues in the following ways.128,129 First, a well-trained ML model can identify and reduce noise by exploiting the ways in which noise differs from the sample signal in rate of change and spectral location. More specifically, the sample signal typically changes on a timescale of seconds or even minutes, while some noise signals, such as background noise from the instrument, usually fluctuate at high frequency. The spectral position of the signal from some contaminants on the substrate can also be distinct from that of the sample. Based on these differences, a well-trained ML model can distinguish and attenuate noise. Second, ML can identify the “right” signals and correct anomalies to improve the accuracy, resolution, and stability of SERS/SEIRA. More specifically, during the training of the ML model, the correct signal is used and “remembered” by the model via feature extraction.130 When anomalous or artifact signals are fed into a trained model, their features fall outside the normal range. Models can ignore or even correct anomalies and artifacts depending on what they were trained on.131 These improvements are crucial for SERS/SEIRA in on-site applications.
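The idea that an anomalous spectrum falls outside the range learned from correct signals can be sketched with a very simple detector. This is an illustrative assumption rather than the approach of refs. 128–131: the "feature" is just the Euclidean distance to the mean training spectrum, the threshold is the training mean plus three standard deviations, and all spectra are synthetic.

```python
import math
import random
import statistics

def peak(x, center, width, height):
    return height * math.exp(-((x - center) / width) ** 2)

def make_spectrum(rng, spike=False):
    """Synthetic spectrum: one Gaussian band plus small noise; optional spike artifact."""
    y = [peak(x, 1400.0, 40.0, 1.0) + rng.gauss(0.0, 0.02) for x in range(1000, 1800, 10)]
    if spike:
        y[10] += 1.5   # sharp artifact far from the molecular band
    return y

def distance(a, b):
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

rng = random.Random(1)
train = [make_spectrum(rng) for _ in range(50)]            # "correct" signals only
mean_spec = [statistics.mean(col) for col in zip(*train)]  # what the model "remembers"
train_dist = [distance(s, mean_spec) for s in train]
threshold = statistics.mean(train_dist) + 3.0 * statistics.stdev(train_dist)

def is_anomalous(spectrum):
    """Flag spectra whose distance to the learned mean exceeds the normal range."""
    return distance(spectrum, mean_spec) > threshold

spiky = make_spectrum(rng, spike=True)
flagged = is_anomalous(spiky)
```

A trained network would replace the distance with learned features, but the decision logic (compare against the range seen during training) is the same.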
Machine learning algorithms are beneficial for solving the problem of spectral overlap in SERS/SEIRA. Spectral overlap of analytes is a common challenge for spectral detection methods, and for SERS and SEIRA, it impairs the identification capability and limits the scope of application. Fortunately, the spectra of different analytes overlap only partially, which presents an opportunity for ML to address this issue.117,118 By extracting the complete spectral signature of an analyte, ML can quickly and automatically identify and classify analytes. For example, nucleotides and sucrose overlap spectrally around 1000–1200 cm−1 due to their C–O stretching vibrations. However, there is no spectral overlap around 1400–1600 cm−1. Nucleotides are infrared active in this region due to the C–N stretching vibration, while sucrose is not. Therefore, well-trained ML algorithms that take advantage of this distinction can accurately identify and classify them in mixed analytes.58
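A minimal sketch of how the non-overlapping band helps classification is given below. It uses a nearest-centroid classifier as a simple stand-in for the models in refs. 58, 117 and 118. The two spectral shapes are hypothetical: both classes share a band near 1100 cm−1, and only the nucleotide-like class has a band near 1500 cm−1.

```python
import math
import random

AXIS = list(range(900, 1701, 20))   # wavenumber axis in cm^-1

def band(x, center, width):
    return math.exp(-((x - center) / width) ** 2)

def sucrose_like(rng):
    """Hypothetical class A: C-O band near 1100 cm^-1 only."""
    return [band(x, 1100.0, 60.0) + rng.gauss(0.0, 0.03) for x in AXIS]

def nucleotide_like(rng):
    """Hypothetical class B: the same C-O band plus a band near 1500 cm^-1."""
    return [band(x, 1100.0, 60.0) + 0.6 * band(x, 1500.0, 50.0) + rng.gauss(0.0, 0.03)
            for x in AXIS]

def centroid(spectra):
    """Mean spectrum of a class."""
    return [sum(col) / len(col) for col in zip(*spectra)]

def classify(spectrum, centroids):
    """Return the label whose centroid is closest in Euclidean distance."""
    def sq_dist(label):
        return sum((a - b) ** 2 for a, b in zip(spectrum, centroids[label]))
    return min(centroids, key=sq_dist)

rng = random.Random(0)
centroids = {
    "sucrose": centroid([sucrose_like(rng) for _ in range(20)]),
    "nucleotide": centroid([nucleotide_like(rng) for _ in range(20)]),
}
test_set = ([("sucrose", sucrose_like(rng)) for _ in range(10)]
            + [("nucleotide", nucleotide_like(rng)) for _ in range(10)])
accuracy = sum(classify(s, centroids) == label for label, s in test_set) / len(test_set)
```

Because the classifier sees the full spectrum, the 1400–1600 cm−1 region separates the two classes even though the 1000–1200 cm−1 region is identical.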
Finally, machine learning algorithms assist SERS and SEIRA to analyze and process data quickly, directly, and automatically, and relieve the pressure of massive spectral data through dimensionality reduction. In the real-time monitoring application of SERS and SEIRA, the output data is three-dimensional information, including spectral intensity, wavelength, and time. Additionally, if multiple analytes are targeted, the information becomes four-dimensional (category information added).58,128 All these factors lead to the generation of a huge amount of spectral data. Fast and accurate analysis and processing of these data is a major challenge for SERS and SEIRA. Some ML algorithms such as PCR can reduce the dimension of information on the premise of preserving information features, thereby reducing the amount of data, decreasing data complexity, simplifying data processing, and quickly outputting test results.132 It can be foreseen that the increase in the amount of data is inevitable due to the technological development of SERS and SEIRA and the emergence of new applications. Therefore, the dimensionality reduction of spectral data by using ML algorithms is of great help to SERS and SEIRA.
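To make the dimensionality-reduction step concrete, the sketch below computes the first principal component of a set of synthetic spectra by power iteration and replaces each 81-point spectrum with a single score. The data, band shape, and noise level are assumptions for illustration. PCA is used here because it is the reduction step underlying PCR.

```python
import math
import random

def first_principal_component(rows, iterations=50):
    """Leading principal axis of mean-centred rows, found by power iteration."""
    n_features = len(rows[0])
    mean = [sum(col) / len(rows) for col in zip(*rows)]
    centred = [[v - m for v, m in zip(row, mean)] for row in rows]
    axis = [1.0 / math.sqrt(n_features)] * n_features
    for _ in range(iterations):
        scores = [sum(a * v for a, v in zip(axis, row)) for row in centred]
        axis = [sum(s * row[j] for s, row in zip(scores, centred))
                for j in range(n_features)]
        norm = math.sqrt(sum(a * a for a in axis))
        axis = [a / norm for a in axis]
    scores = [sum(a * v for a, v in zip(axis, row)) for row in centred]
    return axis, scores

def pearson(u, v):
    """Pearson correlation coefficient between two equal-length sequences."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    return cov / math.sqrt(sum((a - mu) ** 2 for a in u) * sum((b - mv) ** 2 for b in v))

rng = random.Random(0)
band_shape = [math.exp(-((x - 1400.0) / 50.0) ** 2) for x in range(1000, 1801, 10)]
amounts = [rng.uniform(0.0, 1.0) for _ in range(40)]          # hidden analyte amounts
spectra = [[c * b + rng.gauss(0.0, 0.01) for b in band_shape] for c in amounts]

pc1, scores = first_principal_component(spectra)
correlation = pearson(amounts, scores)
```

Each spectrum is compressed from 81 numbers to one score, and the score still tracks the hidden analyte amount almost perfectly, which is the sense in which the information features are preserved.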
Fig. 3 Machine learning algorithm. (a) Schematic diagram of the three classic learning frameworks. (b) General process of ML-based SERS/SEIRA data training and testing. (c) Loss curve during the training process. The loss of the test set increases when overfitting, indicating that fitting should be stopped within the dashed line. (d) Representation of a confusion matrix showing the performance of the classification models for the test data.195 Copyright 2019 MDPI. (e) ROC curves and corresponding AUC values.196 Copyright 2020 American Chemical Society.
The process of combining ML algorithms with SEIRA/SERS differs between supervised and unsupervised learning. Supervised learning requires labeled data to train a model, whereas unsupervised learning finds patterns in unlabeled data, as shown in Fig. 3b, where the model training presented in the dashed box is performed under supervision. The first priority in the whole process is to select an appropriate ML algorithm based on the task objective. Then, the raw spectral data must be preprocessed for both supervised and unsupervised ML algorithms.189 Since baseline wander is a common problem in analysis with IR and Raman spectroscopy, baseline correction is the first step in the preprocessing of raw spectral data. For infrared spectroscopy, common methods include polynomial baseline correction, the adaptive smoothness parameter penalized least squares method,190 Lorentz fitting, and constrained baseline correction based on a peak shift.191 For Raman spectroscopy, the Savitzky–Golay smoothing method is used to correct the baseline.192 Then, by subtracting the baseline from the signal and normalizing the result, a valid signal containing the target molecular features is obtained, which is used as the input to the ML algorithm.55 Normalization is necessary because it makes the errors of different models comparable and reduces the impact of abnormal samples on the training process. In many cases, PCA is also used for noise reduction and dimensionality reduction during preprocessing.193 It is worth noting that some informative features in the original data may be accidentally removed during preprocessing. Therefore, it is necessary to track how the features change at each preprocessing step. The learning of the ML model consists of three stages, namely training, validation, and testing. The datasets corresponding to the three stages are the training set, validation set, and test set, respectively. 
Therefore, the raw data needs to be divided into a training set, a validation set, and a test set during preprocessing, for example in a ratio of 60%, 20%, and 20%.194 The training dataset is the example dataset used during the learning process to fit the parameters of the model, such as the weights of a neural network. The validation set is an example dataset used to tune the hyperparameters of the model throughout the model development process, adding a feedback loop to the training of the model. The test dataset is separate from the training and validation datasets and is used to evaluate the final model chosen during the validation process. Since the signals of the target are obtained from the same SERS/SEIRA platform, the test dataset follows the same probability distribution as the training dataset.
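The preprocessing chain described above (baseline correction, normalization, and a 60/20/20 split) can be sketched as follows. This is a minimal example under stated assumptions: the baseline is a first-order polynomial fitted to points that the user marks as signal-free, normalization scales to unit maximum, and the spectrum is synthetic. The function names are illustrative and do not come from the cited works.

```python
import random

def fit_line(x, y):
    """Ordinary least-squares slope and intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx

def correct_baseline(axis, spectrum, baseline_idx):
    """Subtract a first-order polynomial fitted to the signal-free points."""
    slope, intercept = fit_line([axis[i] for i in baseline_idx],
                                [spectrum[i] for i in baseline_idx])
    return [y - (slope * x + intercept) for x, y in zip(axis, spectrum)]

def normalize(spectrum):
    """Scale so that the largest absolute value equals 1."""
    peak = max(abs(v) for v in spectrum)
    return [v / peak for v in spectrum]

def split_60_20_20(samples, seed=0):
    """Shuffle, then split into training, validation, and test sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_train = (6 * len(shuffled)) // 10
    n_val = (2 * len(shuffled)) // 10
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

# Demo: a triangular peak of height 2 sitting on a sloped baseline.
axis = list(range(100))
raw = [0.01 * x + 0.5 + max(0.0, 2.0 - 0.2 * abs(x - 50)) for x in axis]
signal_free = list(range(0, 20)) + list(range(80, 100))
clean = normalize(correct_baseline(axis, raw, signal_free))
train_set, val_set, test_set = split_60_20_20(range(10))
```

Splitting after shuffling keeps the three sets drawn from the same distribution, which matches the assumption that all spectra come from the same platform.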
In neural network models, weight initialization is a crucial design choice.197 Its purpose is to prevent layer activation outputs from exploding or vanishing during the forward pass through the model. Common initialization methods include constant initialization and random initialization.198 For constant initialization, all weights in the neural network are initialized to a constant value C (typically 0). While constant initialization is simple, it cannot break the symmetry between neurons, which makes some models inefficient to learn. Random initialization can break the symmetry and let each neuron learn a different function of its input, but weights that are too large or too small could lead to exploding or vanishing gradients. Newer initialization techniques, such as He199 and Xavier200 initialization, have been proposed to provide a good starting point. After weight initialization, hyperparameter tuning becomes a critical task for the ML model.201 Common hyperparameters include (i) k in k-NN, (ii) the regularization constant, kernel type, and kernel constants in SVMs, and (iii) the number of layers, number of units per layer, and regularization in a neural network. To find the optimal values of the hyperparameters, tuning methods such as grid search, random search, and Bayesian optimization can be used. After hyperparameter tuning, cross-validation is often used to estimate the prediction performance of a model with the chosen hyperparameters. Appropriate hyperparameters help avoid under-fitting (high training and test errors) and overfitting (low training error but high test error). Under-fitting and overfitting can be observed in a loss curve, which reflects the model error and answers the question of how badly the model is doing.202 As shown in Fig. 3c, the loss decreases over time as the model learns, and the lower the loss, the better the model performance. 
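As a brief illustration of the initialization schemes mentioned above, the sketch below draws weights with the commonly used Xavier (Glorot) uniform rule, with limit sqrt(6 / (fan_in + fan_out)), and the He normal rule, with standard deviation sqrt(2 / fan_in). The layer sizes are arbitrary assumptions. Deep learning libraries provide these initializers ready-made, so this plain-Python version is for exposition only.

```python
import math
import random

def xavier_uniform(fan_in, fan_out, rng):
    """Xavier/Glorot uniform: weights in [-limit, limit], variance 2 / (fan_in + fan_out)."""
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-limit, limit) for _ in range(fan_out)] for _ in range(fan_in)]

def he_normal(fan_in, fan_out, rng):
    """He normal: zero-mean Gaussian weights with variance 2 / fan_in."""
    std = math.sqrt(2.0 / fan_in)
    return [[rng.gauss(0.0, std) for _ in range(fan_out)] for _ in range(fan_in)]

def variance(matrix):
    values = [w for row in matrix for w in row]
    mean = sum(values) / len(values)
    return sum((w - mean) ** 2 for w in values) / len(values)

rng = random.Random(0)
fan_in, fan_out = 256, 128          # assumed layer sizes
w_xavier = xavier_uniform(fan_in, fan_out, rng)
w_he = he_normal(fan_in, fan_out, rng)
var_xavier = variance(w_xavier)     # close to 2 / (256 + 128)
var_he = variance(w_he)             # close to 2 / 256
```

Both rules scale the spread of the random weights to the layer width, which is what keeps activations from growing or shrinking layer after layer.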
Notably, overfitting starts at the point where the loss on the test/validation set stops decreasing and begins to increase (Fig. 3c).203 The performance of the classification models for the test data can be displayed in a confusion matrix, as shown in Fig. 3d. It is a matrix with two dimensions, the actual and the predicted results, whose entries count the predictions falling into each combination.204 Because it shows the errors of the model, it is also called the error matrix. Another representative evaluation tool is the receiver operating characteristic (ROC) curve, which reveals the performance of the classification model across all classification thresholds, as shown in Fig. 3e. The area under the ROC curve is called AUC. AUC provides an aggregated measure of performance across all possible classification thresholds and represents the trade-off between sensitivity and specificity.196 In addition to the confusion matrix, ROC and AUC, there are many evaluation parameters, such as accuracy, precision, recall, specificity, F1 score, and precision-recall curve.205,206 The key metrics that are commonly used for evaluating and reporting the data processing performance include mean squared error (MSE), root mean squared error (RMSE), mean absolute error (MAE), and average standard deviation of the mean (SDM). MSE is computed as $\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2$, where $Y_i$ and $\hat{Y}_i$ are the measured and predicted values, respectively, and $n$ is the number of samples. RMSE is computed as $\mathrm{RMSE} = \sqrt{\mathrm{MSE}}$. MAE is calculated as the mean of the absolute errors, that is, $\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} e_i$, where $e_i = |Y_i - \hat{Y}_i|$ is the absolute error. SDM is expressed mathematically as […], where […]. MSE, RMSE, and MAE are often used to evaluate the difference between measured and predicted results. Having introduced the benefits, algorithm types, and general process of ML-based SERS/SEIRA, we explain and demonstrate them in the following sections by reviewing the detailed application of machine learning in SEIRA from the perspective of substrate design, data processing, and decision-making.
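The regression and classification metrics named above can be computed in a few lines. The sketch below uses the standard textbook definitions and small made-up arrays. SDM is omitted because its expression is not given in the text, and the classification helper assumes that every denominator is nonzero.

```python
import math

def mse(measured, predicted):
    return sum((y - p) ** 2 for y, p in zip(measured, predicted)) / len(measured)

def rmse(measured, predicted):
    return math.sqrt(mse(measured, predicted))

def mae(measured, predicted):
    return sum(abs(y - p) for y, p in zip(measured, predicted)) / len(measured)

def confusion_counts(actual, predicted, positive=1):
    """Entries of a 2x2 confusion matrix: TP, TN, FP, FN."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    tn = sum(a != positive and p != positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    return tp, tn, fp, fn

def classification_report(actual, predicted, positive=1):
    tp, tn, fp, fn = confusion_counts(actual, predicted, positive)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)          # also called sensitivity
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "specificity": tn / (tn + fp),
        "f1": 2 * precision * recall / (precision + recall),
    }

# Made-up values for illustration only.
measured, predicted = [1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 6.0]
actual_labels = [1, 1, 1, 0, 0, 0, 0, 1]
predicted_labels = [1, 1, 0, 0, 0, 0, 1, 1]
report = classification_report(actual_labels, predicted_labels)
```

For these inputs a single error of 2 among four samples gives MSE = 1.0, RMSE = 1.0 and MAE = 0.5, and the confusion matrix has three true positives, three true negatives, one false positive and one false negative.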
Fig. 4 Machine learning-enhanced substrate design. (a) Schematic view of SEIRA substrate designed by using an adaptive genetic algorithm.216 (i): 3D view of the substrate. (ii): Design process using GA. (iii): Comparison of the predicted and simulated spectra. Copyright 2018 Springer Nature Limited. (b) Illustration of the all-dielectric SEIRA substrate designed by using neural network architecture.217 (i): 3D view of the substrate. (ii): Design process using neural network algorithm. (iii): MSE error in cross-validation. Copyright 2019 Optical Society of America. (c) SEIRA substrate design using a genetic algorithm for the detection of COVID-19.126 (i): 3D view of the substrate. (ii): Design process. Copyright 2021 American Chemical Society.
Of course, binary-pattern antennas are not the solution that every designer wants. In most cases, the antenna pattern has already been selected, and researchers only need algorithms to optimize the antenna toward high-level performance goals. In 2019, Nadell and co-workers demonstrated the use of deep neural networks (DNNs) to model complex all-dielectric antennas and optimize their performance, as shown in Fig. 4b-i.217 A DNN is an ANN with multiple layers between the input layer and the output layer. It learns the mathematical mapping that turns the input into the output, whether that relationship is linear or non-linear. The geometric parameters x of the antenna, such as radius and height, along with their ratios, were used as the input layer of the DNN, and the corresponding spectrum S was set as the prediction of the output layer in the training process. While a DNN could build such relationships in its hidden layers by exploiting additional network parameters during training, this increased the difficulty of optimization and led to poorer generalization. Taking knowledge of the underlying physics as input and pre-learning these quantities during training could improve network performance and reduce the difficulty of optimization (Fig. 4b-ii). The agreement between the spectrum predicted by the model and the target spectrum demonstrated the power of the DNN in the optimization of SEIRA substrate design. The MSE for the cross-validation set after training can be used to evaluate the prediction accuracy of the model (Fig. 4b-iii).
When coupled molecules are considered in substrate design, machine learning is required to model the vibrational behavior of molecules and optimize the enhancement of the SEIRA signal. In 2021, Li and co-workers utilized GA-based ML to automatically design a sensitive SEIRA substrate for the detection of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), the virus that causes COVID-19, as shown in Fig. 4c-i.126 Currently, the commercialized COVID-19 diagnostic methods include the reverse transcription-polymerase chain reaction (RT-PCR) test, the serological test or immunoassay, and chest computed tomography (CT). These methods each have disadvantages, such as the low sensitivity of CT and the long turnaround time of RT-PCR. Plasmonic methods with the potential for point-of-care (POC) diagnostics are therefore highly desirable. The first step in modeling the vibrational behavior of molecules was the extraction of the molecular complex permittivity from the infrared spectrum. Then, the optimal solution, with high sensitivity, zero detuning between plasmonic resonance and molecular vibration, high enhancement factor, and so on, was rapidly found among the many design parameters by exploiting the excellent parallel capabilities of GA (Fig. 4c-ii). Furthermore, mutations of the virus, which pose a great challenge to common diagnostic methods, can be easily distinguished by SEIRA methods by comparing the intensity and frequency of the peaks of the virus. Overall, the use of machine learning in the design of SEIRA antennas is booming, while its use in SERS substrate design is rare. The reason is that antennas for SEIRA are generally on the micrometer scale, so antenna patterns with various design parameters can be realized by using top-down fabrication based on photolithography and direct writing techniques. That is, the multi-parameter complex design of antennas for SEIRA calls for the powerful technical resources of machine learning. In contrast, SERS signals are large and complex in many applications, so studies that employ ML to help SERS process data and make decisions are more common.
Systematic inverse design based on machine learning can accelerate the design of SEIRA/SERS substrates to achieve tailored optical responses. The underlying idea of ML-assisted inverse design is to train an ML model to learn the relationship between physical responses and structures, and then to provide structural patterns based on the desired physical responses. A well-trained model removes the need for computationally intensive numerical simulations in the design process. Fig. 5a shows the simultaneous inverse design of the material and structural parameters of a core–shell nanoparticle-based nanophotonic substrate, demonstrated by So and co-workers.218 Since the structural parameters are continuous quantities and the material parameters are discrete quantities, it is difficult to achieve the simultaneous inverse design of both with a single algorithm. By combining regression and classification in the same implementation (Fig. 5a-ii), core–shell nanoparticles are inversely designed to meet user-drawn spectra (Fig. 5a-i). The model training process requires a large number of parameters and corresponding spectra obtained through forward design. After training, the model is tested by using a hand-drawn Lorentzian function with peaks at specific locations (Fig. 5a-iv). The parameters of the core–shell nanoparticle are obtained, and the predicted spectra match well with the target spectra (Fig. 5a-v).
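One common way to combine regression and classification in a single model is to add a regression loss on the continuous outputs to a cross-entropy loss on the discrete outputs. The sketch below shows that idea only. The function `combined_loss`, its `weight` parameter, and the example numbers are hypothetical, and the exact loss used in ref. 218 may differ.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities (shifted for numerical stability)."""
    shift = max(logits)
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def combined_loss(pred_sizes, true_sizes, material_logits, true_material, weight=1.0):
    """MSE on continuous structural parameters plus cross-entropy on the material class."""
    mse = sum((p - t) ** 2 for p, t in zip(pred_sizes, true_sizes)) / len(true_sizes)
    cross_entropy = -math.log(softmax(material_logits)[true_material])
    return mse + weight * cross_entropy

# Hypothetical example: sizes are normalized core radius and shell thickness,
# and the material is one of three candidate classes (index 0 is correct).
good = combined_loss([0.50, 0.20], [0.50, 0.20], [10.0, 0.0, 0.0], 0)
bad = combined_loss([0.80, 0.20], [0.50, 0.20], [0.0, 10.0, 0.0], 0)
```

A prediction with the right sizes and a confident, correct material gives a loss near zero, while errors in either part raise it. Normalizing the continuous parameters keeps the two terms on comparable scales.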
Fig. 5 Inverse design of nanophotonic devices. (a) Simultaneous material and structural inverse design through a supervised deep learning algorithm.218 (i): Target spectrum. (ii): Schematic diagram of the machine learning model used in the reverse design. (iii): Schematic representation of core–shell nanoparticle-based substrates. (iv): Inverse design example of core–shell nanoparticle at resonant wavelengths of 400 nm. (v): At resonant wavelengths of 500 nm. Copyright 2019 American Chemical Society. (b) Inverse design of nanophotonic devices using a semi-supervised deep learning algorithm.219 (i): A schematic of inverse design. (ii): The required reflection spectra (upper panel), the results of inverse design (middle, bottom panel). Insets are the design pattern through algorithms. (iii): Visualization of the latent space. Copyright 2019 John Wiley and Sons. (c) Inverse design of nanophotonic devices using an unsupervised deep learning algorithm.220 (i): Schematic illustration of transitioning device design from a traditional trial-and-error approach to machine learning-mediated inverse design. (ii): Network architecture to inverse design structural images. (iii): Examples of the results of the inverse design. Copyright 2018 American Chemical Society.
Semi-supervised learning algorithms, in which both labeled and unlabeled data are used for training, can also be used for inverse design and reduce the amount of labeled data needed for training. Ma and collaborators proposed a network that differs from other inverse design methods.219 It takes the input geometry and encodes the structural design and optical responses as latent variables with predefined distributions (Fig. 5b-i). Randomly sampling these latent variables from the latent space and decoding them can reconstruct the original structural geometry, thereby enabling inverse design. Fig. 5b-ii shows the spectra simulated with the parameters obtained from the inverse design (middle and bottom panels), which match well with the required spectrum (upper panel). The sampling process provides a variety of outputs for the same target spectrum, that is, it generates many candidates for inverse design. Furthermore, they demonstrated that the proposed model can automatically learn to distinguish different shapes through encoding–decoding training iterations on labeled and unlabeled data (Fig. 5b-iii).
In addition, the inverse design of nanophotonic structures using unsupervised learning algorithms is also feasible. Liu and co-workers adopted a generative adversarial network (GAN) to inversely design substrates with arbitrary geometry (Fig. 5c).220 A GAN is an unsupervised learning architecture that consists of two networks, namely a generator and a critic. They compete against each other and learn simultaneously to create authentic patterns. The generator receives random noise and generates a structural pattern that should have the desired optical properties. The critic then judges the pattern and decides whether it comes from the structural geometry of interest. The objective of the generator network is to deceive the critic network by generating authentic patterns. After training, the generator model can create designs that resemble patterns in the actual geometric data (Fig. 5c-i). In addition to the generator and critic required by traditional GAN models, the authors also added a simulator network to approximate the optical properties of the generated design patterns (Fig. 5c-ii). After unsupervised training, the model is able to provide structural patterns for a given spectrum. For specific spectral requirements, the simulated spectra of the test pattern and of the pattern generated via inverse design are well matched (Fig. 5c-iii).
By obtaining the optimal value of θ from the training set, the molecular changes during the dynamic process are extracted, as shown in Fig. 6b-ii. The detailed mathematics behind MLR can be found in Note S3 of ESI.† By using the DNN model (Fig. 6b-iii–v) to process the data of the biosensor, new results could be obtained, as shown in Fig. 6b-vi. Overall, the trends in the results with and without machine learning were similar. Significant increases in the sucrose and nucleotide signals were observed after the first injection of melittin. The liposome signal stabilized, while the sucrose and nucleotide signals stabilized at t = 100 min. Notably, compared to the results without machine learning, the melittin signal had few false negatives and remained stable around zero before melittin was introduced. This shows that the DNN method helps to reduce the interference of the matrix effectively, and that the information extracted by the DNN yielded superior performance.
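As a rough illustration of fitting a coefficient vector θ by multiple linear regression, the sketch below unmixes a synthetic spectrum into contributions from reference spectra using ordinary least squares. It is one plausible reading of the MLR step, not the procedure of ref. 58: the Gaussian reference bands, the three hypothetical analytes, and the names `band`, `references`, and `true_theta` are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(1)

n_points = 200
x = np.linspace(0.0, 1.0, n_points)  # normalized spectral axis (assumed)


def band(center, width):
    """Gaussian absorption band used as a synthetic reference spectrum."""
    return np.exp(-0.5 * ((x - center) / width) ** 2)


# Columns are synthetic reference spectra of three hypothetical analytes.
references = np.column_stack([band(0.3, 0.05), band(0.5, 0.08), band(0.7, 0.04)])

true_theta = np.array([0.8, 0.1, 0.4])
measured = references @ true_theta + 0.01 * rng.standard_normal(n_points)

# Ordinary least squares: theta minimizes ||references @ theta - measured||^2.
theta, *_ = np.linalg.lstsq(references, measured, rcond=None)
print(np.round(theta, 2))
```

Repeating the fit for each time frame would give one coefficient trace per analyte, which is the kind of time-resolved signal discussed above.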
Fig. 6 Machine learning-enhanced data processing in SEIRA. (a) Data preprocessing using a frame-averaging model for noise reduction.122 (i): Schematic diagram of the conversion from raw noisy data to ML data set. (ii): Data before and after frame averaging showing the result of noise reduction. Copyright 2021 American Chemical Society. (b) Data processing using DNN algorithm for dynamic biomonitoring.58 (i): Schematic depiction of the dynamic process of biological reaction. (ii): Measurement results of the dynamic process without using ML for data processing. (iii): Schematic diagram of SEIRA-based biosensor. (iv): Raw spectral data obtained by biosensors in biological reactions. (v): Schematic diagram of the DNN algorithm for raw spectral data processing. (vi): Results of the dynamic process using DNN for data processing. Copyright 2021 John Wiley and Sons.
In terms of SERS, spectral statistics and data processing methods have grown significantly. The first step, before spectral comparison and further data processing, is to normalize the data. Commonly used routines include fluorescence removal, removal of filter-related signals, baseline subtraction, intensity normalization, and smoothing. These tools support several ways of analyzing and comparing spectra: (1) the construction of spectral libraries and automatic searches; (2) the determination of flowcharts and criteria for the unambiguous characterization of the studied materials; and (3) the use of chemometrics for data classification. Such statistical tools have been used, for example, to advance our knowledge of the aging of materials and to determine individual components in mixtures. Data processing therefore plays a significant role in the use of surface-enhanced Raman spectroscopy, but current progress is limited by cumbersome and intensive data collection steps. More importantly, a typical lab-scale SERS experiment requires the user to evaluate the quality and reliability of the result as the data are being collected. There is an urgent need to simplify and accelerate the data processing steps, and this challenge can be addressed by ML-enhanced data processing in SERS.221–224
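The routines named above (baseline subtraction, smoothing, intensity normalization) can be chained into a minimal preprocessing sketch. This is a deliberately crude illustration on synthetic data, not a method from the cited works: the polynomial baseline is fitted to the whole spectrum, the smoothing is a plain moving average, and all function names and parameter values are assumptions.

```python
import numpy as np


def subtract_baseline(wavenumbers, intensities, degree=3):
    """Subtract a low-order polynomial fitted to the whole spectrum."""
    span = wavenumbers.max() - wavenumbers.min()
    x = (wavenumbers - wavenumbers.mean()) / span  # rescale for a stable fit
    coefficients = np.polyfit(x, intensities, degree)
    return intensities - np.polyval(coefficients, x)


def smooth(intensities, window=5):
    """Moving-average smoothing with an odd window length."""
    kernel = np.ones(window) / window
    return np.convolve(intensities, kernel, mode="same")


def normalize(intensities):
    """Scale a spectrum to unit Euclidean norm (vector normalization)."""
    norm = np.linalg.norm(intensities)
    return intensities / norm if norm > 0 else intensities


def preprocess(wavenumbers, intensities):
    return normalize(smooth(subtract_baseline(wavenumbers, intensities)))


# Synthetic spectrum: one Raman-like peak on a sloping background plus noise.
rng = np.random.default_rng(0)
wavenumbers = np.linspace(400.0, 1800.0, 701)
peak = np.exp(-0.5 * ((wavenumbers - 1000.0) / 10.0) ** 2)
background = 0.5 + 0.0005 * wavenumbers
raw = peak + background + 0.01 * rng.standard_normal(wavenumbers.size)

processed = preprocess(wavenumbers, raw)
print(wavenumbers[np.argmax(processed)])  # position of the strongest band
```

In practice, baseline estimators that ignore peak regions and filters that preserve band shape are usually preferred; the simple choices here only keep the sketch short.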
Alstrom and co-workers225 proposed a Bayesian non-negative matrix factorization (NMF) approach to identify the locations of target molecules. This method can successfully analyze the spectra and extract the target spectrum, as shown in Fig. 7a. A visualization of the loadings of the basis vectors was created, and the results show a clear SNR enhancement. Compared to traditional data processing, the NMF approach enables a more reproducible and sensitive sensor. It was able to identify the estradiol glow base spectrum as one of the basis vectors at high concentrations. This allows for more accurate and robust data analysis when the SERS substrate is contaminated by unknown molecules, as shown in Fig. 7a-iv. A further advantage of using NMF for SERS data is that the method is interpretable, because the identified basis vectors can be related to the expected physical effects (Fig. 7a-v). At the same time, both the high demand for mass data and the low interpretability of black-box operation significantly limit the application of well-trained models to real systems. To address these two issues, Luo and co-workers226 constructed a machine learning algorithm-based framework (Vis-CAD) that integrates a visual random forest, a characteristic amplifier, and data augmentation (Fig. 7b). The introduction of data augmentation significantly reduced the requirement for mass data, and the visualization of the random forest clearly presented the captured features, which helps to determine the reliability of the algorithm. For instance, the trace analysis of individual polycyclic aromatic hydrocarbons in a mixture achieves a confidence accuracy of no less than 99% under optimized conditions. The visualization of the algorithm framework clearly shows that the captured features are closely related to the characteristic Raman peaks of each individual component, as shown in Fig. 7b-i.
Moreover, the sensitivity to trace components can be improved by at least one order of magnitude compared to the naked eye. The reduced need of the proposed algorithm for massive data and the visualization of the operational process provide new avenues for the indestructible application of machine learning algorithms, and help the SERS field to push the limits of sensitivity for both qualitative and quantitative trace analysis.
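To make the NMF idea concrete, the sketch below factorizes a synthetic Raman map into non-negative loadings and basis spectra. Note the swap: it uses plain NMF with Lee-Seung multiplicative updates, not the Bayesian NMF of ref. 225, and the two Gaussian basis spectra, the map size, and the function name `nmf` are assumptions for illustration.

```python
import numpy as np


def nmf(X, n_components, n_iter=500, eps=1e-9, seed=0):
    """Factor a non-negative matrix X (n_spectra x n_channels) as W @ H.

    W holds the loadings (one row per spectrum) and H the basis spectra.
    Uses multiplicative updates for the Frobenius-norm objective.
    """
    rng = np.random.default_rng(seed)
    n_spectra, n_channels = X.shape
    W = rng.random((n_spectra, n_components)) + eps
    H = rng.random((n_components, n_channels)) + eps
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H


# Synthetic map: 50 spectra mixing two non-negative basis spectra.
channels = np.linspace(0.0, 1.0, 100)
basis_true = np.vstack([
    np.exp(-0.5 * ((channels - 0.3) / 0.03) ** 2),
    np.exp(-0.5 * ((channels - 0.7) / 0.05) ** 2),
])
loadings_true = np.random.default_rng(1).random((50, 2))
X = loadings_true @ basis_true

W, H = nmf(X, n_components=2)
relative_error = np.linalg.norm(X - W @ H) / np.linalg.norm(X)
print(relative_error)
```

Because both factors stay non-negative, each row of `H` can be read as a spectrum and each column of `W` as a spatial map, which is the source of the interpretability mentioned above.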
Fig. 7 (a) Improving the robustness of SERS by NMF.225 (i): The Raman microscope used to collect data. (ii): A side view of a Raman substrate depicting the nanopillars. (iii): Illustration of the principle behind the SERS substrates. (iv): An example of a Raman map that has been locally contaminated (red and green spectra). (v): Noise filtering using NMF. The left-hand side shows the loadings drawn as Raman maps and the right-hand side shows the corresponding basis vector in S. Copyright 2014 IEEE. (b) Schematic flowchart of the Vis-CAD.226 (i): Data preparation and processing. (ii): Stochastic Forest chain model. (iii): Model testing. Copyright 2022 American Chemical Society.
=O vibrations in both. Kühner and co-workers demonstrated the discrimination and detection of physiological levels of glucose and fructose by developing a PCA-assisted SEIRA biosensor, as shown in Fig. 8a.56 It works by placing a SEIRA substrate in internal reflection mode on a flow cell (Fig. 8a-ii), which is flushed through an attached tube connector to deliver analytes into and out of the flow cell (Fig. 8a-i). The PCA algorithm is then used to decompose the measurement data and represent them as a set of orthogonal and uncorrelated eigenfunctions called principal components (PCs) and eigenvalues termed scores (SCs). The first PC accounts for the largest share of the variance, the second PC for the second largest, and so on. The more strongly the datasets are correlated, the fewer PCs are needed to describe the features of the entire dataset. The PCs and corresponding SCs are shown in Fig. 8a-iii and iv. Clearly, the PCA algorithm can separate solutions of different concentrations, which is a key element of an automated evaluation procedure. To achieve automatic evaluation, the PC data obtained by PCA require further supervised learning through classification algorithms. Meng and co-workers demonstrated chemical material identification by using the PCA algorithm and an SVM classifier to process the data of a plasmonic mid-infrared filter array-detector array.122 The sensor integrates a plasmonic filter array with an IR detector array, as shown in Fig. 8b-i. First, the detector data are reduced to PC1, PC2, and PC3 via PCA (Fig. 8b-ii). The PCA begins with the standardization of the continuous variables of the dataset. The next step is to construct the covariance matrix by cov(X, Y) = Σi (Xi − μx)(Yi − μy)/(n − 1), where Xi and Yi are the specific training data from variables X and Y, μx and μy are the means of the variables, and n is the number of samples. The eigenvectors of a matrix A are calculated from Av = λv, where λ is the eigenvalue and v is the eigenvector. The 1st PC v1 is the eigenvector of the sample covariance matrix A associated with the largest eigenvalue. The 2nd PC v2 is the eigenvector of the sample covariance matrix A associated with the second largest eigenvalue. Then, the SVM model is trained to identify which chemical is present and determine the concentration of a specific analyte. In an SVM, a hyperplane is used to separate the different classes. It can be expressed as ω·x − b = 0, where ω is a weight vector that determines the orientation of the hyperplane and b is the bias. The distance of any point x in the dataset to the hyperplane is calculated as d = |ω·x − b|/‖ω‖. The training task is to maximize the margin, that is, the distance d from the hyperplane to the closest points of the training dataset (see Note S3 of ESI† for details). The confusion matrix demonstrates good automatic identification of the different analytes, with a 10-fold cross-validated accuracy of 95.34%. The sensitivity and bandwidth of SEIRA are often limited by the small overlap between molecules and sensing hotspots, as well as by sharp plasmonic resonance peaks. Our group developed a wavelength-multiplexed hook-shaped nanoantenna array for continuous broadband detection to capture multiple absorption peaks in the fingerprint region, as shown in Fig. 8c-i.121 For different analytes with the same functional group, the SEIRA spectra overlap in many regions (Fig. 8c-ii). Therefore, it is difficult to distinguish them in mixtures using narrow-band SEIRA substrates. With the help of PCA and SVM algorithms, 100% recognition accuracy is achieved (Fig. 8c-iii–vi). This strategy can be extended to cover the entire infrared fingerprint range by designing the antenna length for specific molecular monitoring, such as proteins, sugars, lipids, nucleic acids, and volatile organic compounds.
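The PCA steps described above (standardization, sample covariance matrix, eigendecomposition) and the point-to-hyperplane distance used in SVMs can be sketched as follows. The synthetic data and the function names `pca` and `distance_to_hyperplane` are assumptions for illustration; the sketch does not train an SVM and is not the pipeline of refs 121 or 122.

```python
import numpy as np


def pca(X, n_components):
    """PCA by eigendecomposition of the sample covariance matrix.

    X has shape (n_samples, n_features). Returns (components, scores,
    eigenvalues), with components as columns sorted by decreasing eigenvalue.
    """
    X_std = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)  # standardization
    A = np.cov(X_std, rowvar=False)                       # sample covariance
    eigenvalues, eigenvectors = np.linalg.eigh(A)         # solves A v = lambda v
    order = np.argsort(eigenvalues)[::-1][:n_components]
    components = eigenvectors[:, order]
    return components, X_std @ components, eigenvalues[order]


def distance_to_hyperplane(x, w, b):
    """Distance d = |w.x - b| / ||w|| from point x to the plane w.x - b = 0."""
    return abs(np.dot(w, x) - b) / np.linalg.norm(w)


# Synthetic data: three strongly correlated features.
rng = np.random.default_rng(0)
latent = rng.standard_normal((100, 1))
X = np.hstack([latent + 0.1 * rng.standard_normal((100, 1)) for _ in range(3)])

components, scores, eigenvalues = pca(X, n_components=2)
print(eigenvalues)  # the first value dominates because the features are correlated

distance = distance_to_hyperplane(np.array([3.0, 4.0]), np.array([0.0, 1.0]), 1.0)
print(distance)  # 3.0
```

The scores returned by `pca` are what a downstream classifier such as an SVM would receive as input features.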
Fig. 8 ML-enhanced detection of small molecules. (a) Glucose detection using PCA-assisted SEIRA.56 (i): Schematic diagram of the biosensor. (ii): SEM image of the biosensor. (iii): First and second principal components of the measurement data using PCA. (iv): Visualization of classification results. Copyright 2019 American Chemical Society. (b) Multiple chemical detection using SVM-assisted SEIRA.122 (i): Schematic diagram of the SEIRA-based sensor. (ii): Visualization of classification results for different analytes. (iii): Confusion matrix for analyte classification results. Copyright 2021 American Chemical Society. (c) Volatile organic compounds (VOCs) detection using PCA/LDA-assisted SEIRA.121 (i): Schematic diagram of the SEIRA-based VOC sensor. (ii): The measured spectra of analytes showing spectral overlapping, which is difficult to distinguish clearly with the classic data processing methods. (iii): The measured raw spectral data. (iv): Data preprocessing using PCA. (v): The weight of scores of each spectrum in 3D space. (vi): The confusion map for analyte classification results. Copyright 2022 Springer Nature Limited.
Kim and co-authors demonstrated the detection of extracellular vesicles (EVs), an excellent source of diagnostic biomarkers, in serum samples from normal control individuals, chronic pancreatitis patients, and pancreatic cancer patients using a SERS-based immunoassay technique,227 as shown in Fig. 9a. Applying a machine learning algorithm (Fig. 9a-iv) to the analysis of the expression level of EV biomarkers in pancreatic cancer, chronic pancreatitis, and normal control individuals yielded a sensitivity and specificity of 0.95 and 0.96, respectively, which suggests the great potential of using this biomarker to differentiate pancreatic cancer from chronic pancreatitis. Dual-modal (FRGBRSERS, FRRSERS) fluorescence–SERS quantum dot (QD)-embedded silver bumpy nanoparticles (Fig. 9b-i) were developed by Cha and co-authors for high-throughput multiplex analysis.233 Each FRRSERS nanoprobe produces strong SERS and fluorescence signals for multiplex analysis (Fig. 9b-ii). Based on this dual modality, a barcode-based machine learning algorithm that transforms spectra into barcodes and identifies chemical information was created, as shown in Fig. 9b-iv. The multiplex detection platform comprising the FRRSERS nanoprobes and the high-throughput analysis algorithm is expected to be useful for analyzing and encoding biological targets.
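Sensitivity and specificity, as quoted above, follow directly from the counts of true and false positives and negatives. The sketch below computes both from label lists; the example labels are made up for illustration and are unrelated to the data of ref. 227.

```python
def sensitivity_specificity(y_true, y_pred, positive=1):
    """Return (sensitivity, specificity) for a binary classification."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    sensitivity = tp / (tp + fn)  # fraction of true positives that are detected
    specificity = tn / (tn + fp)  # fraction of true negatives that are rejected
    return sensitivity, specificity


# Hypothetical labels: 1 = cancer, 0 = non-cancer.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]

sensitivity, specificity = sensitivity_specificity(y_true, y_pred)
print(sensitivity)  # 0.75
```

Here three of four positives are found (sensitivity 0.75) and five of six negatives are rejected (specificity 5/6).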
Fig. 9 (a) A SERS-based immunoassay for pancreatic cancer (PC) tumor-derived extracellular vesicles (EVs) quantification.227 (i): Functionalizing gold substrate with thiol and CD81 antibody. (ii): Capturing normal and tumor-derived EVs present in the serum sample. (iii): Loading EphA2-NPs-reporter to enhance Raman signal and selectively label tumor-derived EVs. (iv): The classification tree trained with the whole dataset of peak-value Raman shifts with depth = 2. Copyright 2021 John Wiley and Sons. (b) Dual-modal fluorescence–SERS quantum dot (QD)-embedded silver bumpy nanoparticles developed for high-throughput multiplex analysis of biomarkers.233 (i): Nanoprobes are prepared from silica-coated silver bumpy nanoshells (AgNS@SiO2). (ii): Each dual-modal (FRGBRSERS, FRRSERS) nanoprobe produces strong SERS and fluorescence signals for multiplex analysis. (iii): A barcode-based machine learning algorithm that transforms spectra into barcodes and identifies chemical information is created. (iv): The result of applying density-based spatial clustering of applications with noise (DBSCAN) to the spectrum of 4-bromobenzenethiol (4-BBT), and the corresponding barcode. Copyright 2021 Elsevier.
The SERS spectra of blood cells and various tumor cells were measured (Fig. 10a-iii) on a silver film substrate by Fang and co-workers.247 Significant differences in nucleic acid-related characteristic peaks were found between most tumor cells and blood cells. The spectra were classified by three methods: the feature peak ratio method, principal component analysis combined with K-nearest neighbor (PCA-KNN), and a residual network (ResNet), which is a kind of deep learning algorithm. The results showed that the ratio method and PCA-KNN can only distinguish the SERS spectra of blood cells from those of some distinct kinds of tumor cells, while the ResNet model can classify the SERS spectra of blood cells and all kinds of tumor cells with an identification accuracy of 100%. This study demonstrates that the silver film SERS substrate can stably enhance the Raman scattering signal of cells, and that the deep learning algorithm can quickly and effectively identify the SERS spectra of tumor cells and blood cells, as shown in Fig. 10a-iv and v. Erzina and co-workers248 presented an approach for detecting compositional changes in culture medium arising from the metabolic activity of tumor or normal cells (Fig. 10b). The approach combines functionalized plasmonic substrates with machine learning. First, the functionalized nanoparticles, after interaction with the biosamples, were deposited on a gold grating surface to achieve plasmonic coupling and high SERS enhancement. The subsequently measured SERS spectra were used as the input for CNN training and validation, while the kind of nanoparticle surface functionalization served as an additional parameter, increasing the flexibility and reliability of the method. The CNN classification results were then translated into sensitivity and specificity for each cell type, and 100% accuracy in the discrimination of the cultivation media of tumor and normal cells was achieved, as shown in Fig. 10b-vii. Such an approach could support rapid and accurate clinical decisions.
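The defining feature of a residual network is the skip connection, which adds the input of a unit back to the output of its convolutional layers. The sketch below shows a single-channel, one-dimensional residual unit acting on a spectrum. It is a simplified illustration, not the architecture of ref. 247: the kernels are untrained, there is no batch normalization, and the names `conv1d_same`, `relu`, and `residual_unit` are assumptions.

```python
import numpy as np


def conv1d_same(x, kernel):
    """1D convolution that keeps the input length (zero padding at the edges)."""
    return np.convolve(x, kernel, mode="same")


def relu(x):
    return np.maximum(x, 0.0)


def residual_unit(x, kernel_1, kernel_2):
    """Compute relu(x + F(x)) with F = conv -> relu -> conv."""
    out = relu(conv1d_same(x, kernel_1))
    out = conv1d_same(out, kernel_2)
    return relu(x + out)  # skip connection: the input bypasses both convolutions


spectrum = np.array([0.0, 1.0, 3.0, 1.0, 0.0])
zero_kernel = np.zeros(3)

# With zero kernels F(x) vanishes, so only the identity path remains.
print(residual_unit(spectrum, zero_kernel, zero_kernel))  # [0. 1. 3. 1. 0.]
```

Because the identity path always passes the input through, stacking many such units does not prevent the signal from reaching deeper layers, which is the usual motivation for residual blocks.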
Fig. 10 (a) Fast discrimination of tumor cells by label-free SERS and deep learning.247 (i): Structure of the residual network unit. (ii): The basic structure of ResNet, which consists of a convolutional layer, pooling layer, two residual blocks, and two fully connected layers. (iii): The Raman spectra of tumor cells are enhanced by the silver film substrate and aluminum plate substrate. (iv): t-distributed stochastic neighbor embedding (t-SNE) diagram after extracting original data features by the ResNet model which contains two residual blocks. (v): t-SNE diagram after extracting original data features by the ResNet model which contains five residual blocks. Copyright 2021 AIP Publishing LLC. (b) Schematic representation of the proposed experimental concept.248 (i): Preparation of functional AuMs and their interaction with culture medium sample. (ii): Simultaneous preparation of gold grating. (iii): Deposition of AuMs on the surface of the gold grating. (iv): SERS measurements. (v): Implementation of CNN for SERS result interpretation. (vi): Accuracy and loss evolution during the CNN learning and validation. (vii): The visualization of predictive ability of the proposed SERS/CNN approach. Copyright 2020 Elsevier.
Combining SERS sampling and machine learning data analysis, Hong designed and developed a simple, fast, and inexpensive optical sensing platform.256 Sample preparation for the spectral measurement involved mixing a gold nanoparticle colloid with serum from patients with colorectal cancer (CRC) to predict the disease. A portable Raman spectrometer and PCA were used to determine the phenotypic variations of the fungal cells Candida albicans (C. albicans) under the influence of different antifungals with various mechanisms, and unknown antifungals were predicted using the established PCA model.257 In another study, Tang and co-workers analyzed SERS spectra with machine learning algorithms to discriminate bacterial pathogens quickly and accurately.258 Surface-enhanced Raman spectroscopy combined with machine learning techniques enables rapid discrimination between methicillin-resistant and methicillin-sensitive Gram-positive Staphylococcus aureus strains and Gram-negative Legionella pneumophila (controls).118 A total of 10 methicillin-resistant S. aureus (MRSA), 3 methicillin-sensitive S. aureus (MSSA), and 6 L. pneumophila isolates were used. The obtained spectra showed high reproducibility and repeatability with a high signal-to-noise ratio, as shown in Fig. 11a-i. PCA, HCA, and various supervised classification algorithms were used to discriminate both S. aureus strains and L. pneumophila. The results indicate that SERS combined with machine learning can be used for the detection of antibiotic-resistant and susceptible bacteria, and that this technique is a very promising tool for clinical applications. Tang and co-workers116 compared three unsupervised and ten supervised machine learning methods on 2752 SERS spectra from 117 Staphylococcus strains belonging to nine clinically important Staphylococcus species to test the capacity of different machine learning methods for rapid bacterial differentiation and accurate prediction.
According to the results, density-based spatial clustering of applications with noise (DBSCAN) showed the best clustering capacity (Rand index 0.9733), while CNN outperformed all other supervised machine learning methods as the best model for predicting Staphylococcus species from SERS spectra (ACC 98.21%, AUC 99.93%), as shown in Fig. 11b-ii. This study shows that machine learning methods are capable of distinguishing closely related Staphylococcus species and therefore have great application potential for bacterial pathogen diagnosis in clinical settings. Thrift and co-workers presented the first SERS odor compass (Fig. 11c-i).259 Using a grid array of SERS sensors, machine learning analysis enables reliable identification of multiple odor sources arising from the diffusion of analytes from one or two localized sources. Specifically, CNN and SVM classifier models achieve over 90% accuracy for a multiple odor source problem. This system was then used to identify the location of an Escherichia coli biofilm via its complex signature of volatile organic compounds. Thus, the fabricated SERS chemical sensors have the limits of detection and quantification needed for diffusion-based odor compasses. Solving the multiple odor source problem with a passive platform opens a path toward an Internet of Things approach to monitoring toxic gases and indoor pathogens.
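The Rand index quoted above measures how often two labelings agree on whether a pair of samples belongs to the same group. A minimal sketch of the unadjusted Rand index is given below; the toy labels are invented for illustration, and whether ref. 116 reports the plain or an adjusted variant is not stated in this section.

```python
from itertools import combinations


def rand_index(labels_true, labels_pred):
    """Fraction of sample pairs on which two labelings agree."""
    agreements = 0
    n_pairs = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        agreements += same_true == same_pred
        n_pairs += 1
    return agreements / n_pairs


# Identical partitions with swapped cluster names still agree on every pair.
print(rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0

# Splitting one true cluster breaks exactly one of the six pairs.
split_score = rand_index([0, 0, 1, 1], [0, 0, 1, 2])
```

The index depends only on pair membership, so it is insensitive to how the clusters are numbered.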
Fig. 11 (a) Identification of methicillin-resistant Staphylococcus aureus bacteria using SERS and ML.118 (i): Scheme of using surface-enhanced Raman spectroscopy combined with machine learning techniques for rapid identification of pathogens. (ii): Multiple class boundaries of the SVM classifier with linear kernel function in a 2-dimensional PCA plane. SVM hyperplanes found by the one-vs.-all method and linear kernel function. (iii): MRSA-vs.-all. (iv): MSSA-vs.-all. (v): L. pneumophila-vs.-all. (vi): 5-Fold cross-validated confusion matrix of the k-nearest neighbor classifier for the identification of Staphylococcus aureus strains and Legionella pneumophila. Copyright 2020 Royal Society of Chemistry. (b) Comparative analysis of machine learning algorithms on SERS of clinical Staphylococcus species.116 (i): Schematic illustration of CNN data flow during processing SERS spectra of Staphylococcus species. (ii): Clustering results of nine Staphylococcus species via (iii): k-means, (iv): DBSCAN, and (v): agglomerative nesting (AGNES). Copyright 2021 Frontiers Media SA. (c) SERS-based odor compass: locating multiple pathogens.259 (i): NMF components determined from analyte training datasets. (ii): Cross-validation accuracy of the models used. Copyright 2019 American Chemical Society.
Barucci and co-workers193 performed in situ, rapid, and highly sensitive analysis by label-free SERS. They then resorted to a hybrid analytical method in which PCA is applied to features obtained by band-fitting the SERS spectra of protein samples. Compared to the application of standard PCA to raw spectral data, the method succeeded in classifying proteins while preserving the biological differences contained in their SERS spectra. The scheme is shown in Fig. 12a-i. This method, combined with machine learning, is applicable to the detection and analysis of proteins and can also be extended to other biomolecular species. The expression of the S protein in the coronavirus makes it a marker for detecting the virus. Yang and co-workers developed a functionalized gold “virus-traps” nanoarray as a COVID-19 SERS sensor to capture and identify viruses with extremely high sensitivity and specificity.267 The S protein is specifically immobilized within a region of strong electromagnetic field enhancement 10 nm from the nanoneedle surface, where “lightning rod” and “hot spot” effects enhance the Raman signal unique to the S protein of the SARS-CoV-2 virus. The identification standard of SERS signals, established by machine learning and identification techniques, was used to identify simulated COVID-19 samples in urine with a viral load as low as 80 copies mL−1 in as little as 5 min. The capture of viruses by the functionalized surface, the generation of enhanced signals, and the supporting machine learning process are shown in Fig. 12b-ii; this approach is of great significance for achieving real-time monitoring and early warning of coronavirus.
Fig. 12 (a) Label-free SERS detection of proteins based on machine learning classification of chemostructural determinants.193 (i): Scheme of label-free SERS detection of proteins based on machine learning classification of chemo-structural determinants. (ii): Average area intensity values upon fitting the SERS spectra. (iii): Exemplary cross-validated PC1 vs. PC2 score plot of area intensity values. (iv): PCA loading plot. (v): PCA explained variance. Copyright 2020 Royal Society of Chemistry. (b) Schematic diagram of COVID-19 SERS sensor design and single-virus detection mechanism.267 (i): SARS-CoV-2 can be localized by “virus-traps” nanoforest composed of the oblique gold-nanoneedles array (GNAs). Through machine learning and identification techniques, the identification standard of virus signals is established and utilized for virus diagnoses. (ii): Schematic diagrams of single-virus detection by selectively capturing and trapping virus, and the multi-SERS enhancement mechanism. Copyright 2021 Springer Nature Limited.
Rapid antimicrobial susceptibility testing (AST) is an integral tool for mitigating the unnecessary use of powerful and broad-spectrum antibiotics that leads to the proliferation of multi-drug-resistant bacteria. Using a sensor platform composed of SERS sensors with control of nanogap chemistry (Fig. 13a-i) and machine learning algorithms for the analysis of complex spectral data (Fig. 13a-ii), bacterial metabolic profiles after antibiotic exposure are correlated with susceptibility. DNN models can discriminate the responses of Escherichia coli and Pseudomonas aeruginosa to antibiotics from untreated cells in SERS data within 10 min after antibiotic exposure with greater than 99% accuracy. Deep learning analysis is also able to differentiate responses from untreated cells at antibiotic dosages up to 10-fold lower than the minimum inhibitory concentration observed in conventional growth assays. Unsupervised Bayesian Gaussian mixture analysis achieves 99.3% accuracy in discriminating between antibiotic-susceptible and antibiotic-resistant cultures in SERS data using the extended latent space. Discriminative and generative models rapidly provide high classification accuracy with small sets of labeled data, which greatly reduces the amount of time needed to validate phenotypic AST with conventional growth assays. Thus, this work outlines a promising approach toward practical rapid AST; the schematic diagram is shown in Fig. 13a-iii.283 Rahman and co-workers reported the application of concanavalin A lectin-modified bacterial cellulose nanocrystals (BCNCs) for bacterial isolation and label-free SERS detection of bacterial species using Au nanoparticles (AuNPs).284 The SERS spectral dataset was analyzed using an SVM, and the classifier demonstrated a high overall accuracy of 87.7% in correctly distinguishing bacterial strains, as shown in Fig. 13b.
Fig. 13 (a) Deep learning analysis of vibrational spectra of bacterial lysate for rapid antimicrobial susceptibility testing, using a sensor platform consisting of a SERS substrate.283 (i): Sensor with nanogap chemistry control. (ii): Machine learning algorithms for analyzing complex spectral data. (iii): The rapid antimicrobial susceptibility test (AST) distinguishes between E. coli and P. aeruginosa responses to antibiotics and untreated cells with over 99% accuracy in less than 10 minutes. Copyright 2022 American Chemical Society. (b) Schematic illustration of284 (i): synthesis and (ii): functionalization of BCNCs, (iii): bacteria detection assay, (iv): SERS, and (v): machine-learning applications. Copyright 2020 American Chemical Society.
Nguyen and co-workers reported room-temperature inhomogeneous broadening as a function of increasing adenine concentration and employed this feature to develop one-dimensional and two-dimensional chemical composition classification models of 200 long single-stranded DNA sequences.298 They then developed a reservoir computing (RC) scheme for chemical composition classification of the same molecules and demonstrated enhanced performance that does not rely on manual feature identification, as shown in Fig. 14a-iii. Taking advantage of the merits of SERS for collecting and constructing a database (e.g., abundant intrinsic fingerprint information, noninvasive data acquisition, and strong anti-interference ability), Shi et al.299 set up a SERS-based database of deoxyribonucleic acid (DNA) suitable for artificial intelligence (AI)-based sensing applications (Fig. 14b-i). The data are collected on a silicon wafer decorated with silver nanoparticles (Ag NPs@Si) that serves as the SERS chip, and are then used to train a deep neural network (DNN). As a proof of concept, three representative tumor suppressor gene fragments, i.e., p16, p21, and p53, are readily discriminated in a label-free manner, as shown in Fig. 14b-iv. Prominent and reproducible SERS spectra of these DNA molecules are collected and used as input data for DNN training, which enables selective discrimination of DNA targets. The AI-based sensing method distinguishes single DNA targets and DNA mixtures with accuracies of 90.28% (Fig. 14b-iii) and 74.83% (Fig. 14b-vi), respectively.
| Fig. 14 (a) SERS-based ssDNA composition analysis.298 (i): Experimental results of SERS spectra presenting the adenine-related peak for different adenine concentrations in ssDNA molecules. All peaks are normalized and shifted so their Raman shifts are aligned to enhance the visibility of the peak broadening. (ii): Comparison between the RC-predicted concentration of adenine bases in ssDNA molecules and the actual value. (iii): Schematic description of the proposed RC architecture for SERS spectra chemical composition analysis/classification. Copyright 2022 AIP Publishing LLC. (b) DNA discrimination from SERS database in DNN mode.299 (i): Schematic illustration of three-layer feed-forward network for DNA discrimination from SERS database. (ii): Target output for the vector definition of nine DNA targets. (iii): Test results of verification topologies after training of the deep neural network and DNA mixture discrimination from the SERS database in DNN mode. (iv): Scheme of binary and ternary mixtures of p16, p21, and p53 (30 bp). (v): Target output for the vector definition of four DNA mixture targets. (vi): Corresponding test results of verification topologies after training of the deep neural network. Copyright 2018 American Chemical Society. | ||
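The caption above mentions a target output vector for four DNA mixture targets built from binary and ternary mixtures of p16, p21, and p53. The exact vector definition used in ref. 299 is not reproduced here. The following sketch shows one plausible choice, a multi-hot encoding with one flag per gene, together with a thresholded read-out of the network outputs. The encoding, the 0.5 threshold, and the example output values are all assumptions made for illustration.

```python
from itertools import combinations

GENES = ("p16", "p21", "p53")  # gene fragments named in the text


def multi_hot(components):
    """Encode a mixture as one 0/1 flag per gene (assumed encoding)."""
    return [1 if gene in components else 0 for gene in GENES]


def decode(outputs, threshold=0.5):
    """Read a network output vector back into the genes it flags as present."""
    return [gene for gene, value in zip(GENES, outputs) if value >= threshold]


def exact_match_accuracy(predicted, expected):
    """Fraction of samples whose decoded composition matches exactly."""
    hits = sum(sorted(p) == sorted(e) for p, e in zip(predicted, expected))
    return hits / len(expected)


# Three binary mixtures plus one ternary mixture give four mixture targets.
mixtures = [combo for size in (2, 3) for combo in combinations(GENES, size)]
targets = {"+".join(combo): multi_hot(combo) for combo in mixtures}
for name, vector in targets.items():
    print(name, vector)

# Hypothetical network outputs for three test spectra (not real data).
outputs = [[0.91, 0.84, 0.07], [0.12, 0.77, 0.66], [0.88, 0.35, 0.58]]
truth = [["p16", "p21"], ["p21", "p53"], ["p16", "p21", "p53"]]
predicted = [decode(o) for o in outputs]
mixture_accuracy = exact_match_accuracy(predicted, truth)
print(f"exact-match accuracy: {mixture_accuracy:.2f}")
```

In this toy case the third spectrum is decoded as p16 + p53 because its p21 output falls below the threshold, so one of three mixtures is misclassified. An exact-match criterion of this kind is stricter than per-gene accuracy, which is one reason mixture discrimination can score lower than single-target discrimination.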
Kazemzadeh and co-workers combined machine learning algorithms with acquired SERS spectra to develop a new monitoring platform and demonstrated, for the first time, its ability to fingerprint and efficiently classify three distinct subtypes of breast cancer EVs.308 This platform and characterization approach is expected to enhance the viability of EVs and nanoplasmonic sensors towards clinical utility for breast cancer and many other applications to improve human health. Grieve and co-workers developed a cell-free, label-free SERS approach using gold nanoparticles (nano SERS) to classify hematological malignancies referenced against two control cohorts: healthy and noncancer cardiovascular disease.309 A predictive model was built using machine-learning algorithms to incorporate disease burden scores for patients under standard treatment. Linear and quadratic discriminant analysis distinguished the three cohorts with accuracies of 69.8% and 71.4%, respectively. Koster and co-workers used label-free SERS to demonstrate that a dual-isolation method is necessary to isolate EVs from the major classes of lipoprotein.69 However, by combining SERS analysis with machine learning-assisted classification, they showed that the disease state is the main driver of distinction between EV samples and is largely unaffected by the choice of isolation method. The study describes a convenient SERS assay that retains accurate diagnostic information from clinical samples by overcoming differences in lipoprotein contamination between isolation methods. Shin and co-workers demonstrated accurate diagnosis of early-stage lung cancer using deep learning-based SERS of exosomes, as shown in Fig. 15a.196 Their approach was to learn the features of cell-derived exosomes through deep learning and then assess how similar human plasma exosomes are to them, rather than training on the limited human data available.
The deep learning model was trained with SERS signals of exosomes derived from normal and lung cancer cell lines and could classify them with an accuracy of 95%. In 43 patients, including stage I and II cancer patients, the deep learning model predicted that the plasma exosomes of 90.7% of patients had higher similarity to lung cancer cell exosomes than the average of the healthy controls. The similarity was proportional to the progression of cancer. Notably, the model predicted lung cancer with an area under the curve (AUC) of 0.912 for the whole cohort and an AUC of 0.910 for stage I patients. These results suggest the great potential of combining exosome analysis and deep learning as a method for early-stage liquid biopsy of lung cancer. Karunakaran and co-workers used the label-free, ultrasensitive SERS technique to generate differential spectral fingerprints for predicting normal (NRML), high-grade intraepithelial lesion (HSIL), and cervical squamous cell carcinoma (CSCC) grades from exfoliated cell samples of the cervix, as shown in Fig. 15b-ii.301 Three sample types, i.e., single cells, cell pellets, and extracted DNA, obtained from the oncology clinic and confirmed by Pap test and HPV PCR, were analyzed. Gold nanoparticles as the SERS substrate enhanced the Raman intensity and revealed signature bands of amide III/nucleobases and carotenoid/glycogen for establishing the empirical discrimination. Moreover, all the spectral data were subjected to chemometric analysis, including an SVM, which furnished average diagnostic accuracies of 94%, 74%, and 92% for the three grades. Combining SERS read-out and machine learning techniques in field trials promises to reduce the incidence of cervical cancer in low-resource countries.
| Fig. 15 Machine learning-enhanced intelligent disease diagnosis and therapy in SERS. (a) Schematic illustration of deep learning-based circulating exosome analysis for lung cancer diagnosis.196 (i): Circulation of lung cancer tumor exosomes in the bloodstream. (ii): Collection of spectroscopic data of exosomes by SERS. (iii): Overview of deep learning-based cell exosome classification and lung cancer diagnosis using exosomal SERS signal patterns. Copyright 2020 American Chemical Society. (b) Schematic illustration of experimental design for differentiating three grades viz. normal (NRML), high-grade intraepithelial lesion (HSIL), cervical squamous cell carcinoma (CSCC) using SERS.301 (i): Scraping cells from the cervix using cytobrush. (ii): Progression pattern of cervical cancer. (iii): Set 1: single cell, set 2: cell pellet, set 3: extracted DNA (mixed with AuNPs). (iv): Independent SERS analysis of (1) single cell, (2) cell pellet, (3) extracted DNA in the glass slide. (v): Empirical signal monitoring of the three grades. (vi): Chemometric analysis. Copyright 2020 Elsevier. | ||
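The AUC values quoted for the lung cancer model summarize how well a continuous score, such as the similarity to cancer-cell exosomes, separates patients from controls. The AUC equals the probability that a randomly chosen patient receives a higher score than a randomly chosen control, with ties counted as one half. The short Python sketch below computes it directly from that definition. The scores and labels are invented for illustration and are not data from ref. 196.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties contribute one half, which matches the area under the ROC curve.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present to compute an AUC")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Hypothetical similarity scores: label 1 = patient, label 0 = healthy control.
scores = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

auc = roc_auc(scores, labels)
print(f"AUC = {auc:.4f}")  # 15 of the 16 patient/control pairs are ordered correctly
```

Here one patient (score 0.4) ranks below one control (score 0.6), so 15 of 16 pairs are ordered correctly and the AUC is 0.9375. Because the AUC depends only on the ranking of scores, it is unaffected by the decision threshold, which makes it convenient for comparing models across cohorts.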
The workflow of machine learning-augmented SERS identification is shown in Fig. 16. First, a large amount of measurement data is collected through the SERS substrate to form a database (Fig. 16a). Then, with the assistance of appropriate ML algorithms, the SERS database supports fast, sensitive, label-free, and non-destructive molecular detection and identification. Fig. 16b summarizes the model categories and tools, and Fig. 16c shows several representative ML analysis results obtained with the best models in the corresponding application scenarios. However, choosing an appropriate algorithm for a specific database remains challenging because, given the large volumes of data generated during SERS analysis, not all algorithms achieve high accuracy. Many machine learning algorithms that may be suitable for analyzing Raman spectra have not yet been explored and are worthy of further investigation. Most disease diagnostics are based on biomarkers, which are biological molecules produced in the body or at the site of the disease. A biomarker is an indicator for evaluating normal biological processes, pathogenic processes, and responses to exposure or intervention. Biomarker tests help to characterize changes in disease. Despite the availability of approved algorithms for disease diagnosis and treatment, many clinicians remain wary of ML because of the “black box” nature of many ML models. While not all ML algorithms are uninterpretable, most of those that produce state-of-the-art results (including deep learning) have this limitation. The lack of interpretability of predictive models can undermine trust in them, especially in healthcare, where many decisions are a matter of life and death, so clinician oversight is essential.
“If used carefully, this technology could improve performance in health care and potentially reduce inequities”, Ghassemi says.310 “But if we're not actually careful, technology could worsen care”. Finally, important related work on machine learning-enabled SEIRA and SERS technologies is listed in Table S1 (ESI†) for reference.
| Fig. 16 Flow chart of machine learning-enhanced SERS. (a) The formation process of the SERS database.227 (i): SEM image of captured extrinsic Raman label on the gold-coated substrate. (ii): Collecting the SERS signal using Raman spectroscopy. (iii): The concentration was quantified from the intensity at 1336 cm−1. Copyright 2021 John Wiley and Sons. (b) Model categories and tools of machine learning for SERS. (c) Demo for the results of different algorithms. (i): Classification.248 (ii): Regression.311 (iii): Dimensionality reduction.193 (iv): Clustering.116 Copyright 2020 Optical Society of America. | ||
Although machine learning has improved SEIRA and SERS research, the introduction of algorithms still poses many challenges that require further investigation. The most critical issue is data preparation for training, validation, and testing. First, supervised learning algorithms are highly data-demanding, but it is difficult for operators to obtain large sets of independent SERS and SEIRA data. In most cases, large datasets are obtained by repeated measurements, which are not independent of each other. Second, the generalization ability of a well-trained model is questionable. A model that performs well on one dataset may give unsatisfactory results on another dataset due to overfitting. Finally, different algorithms yield different results and accuracies, so the selection and optimization of algorithms could increase the technical difficulty and complexity of the molecular screening process using SEIRA and SERS. Despite these challenges, the benefits of machine learning to SEIRA and SERS outweigh the disadvantages.
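One practical consequence of the non-independence noted above is that repeated spectra from the same physical sample should never be divided between training and test sets, since this would inflate the apparent accuracy. A common safeguard is to split by sample rather than by spectrum. The sketch below is a minimal, standard-library illustration of such a group-wise k-fold split. The function name `group_k_fold`, the sample identifiers, and the number of repeats are hypothetical.

```python
import random


def group_k_fold(groups, n_splits, seed=0):
    """Yield (train_idx, test_idx) so that each group falls entirely in one fold.

    `groups[i]` identifies the physical sample that spectrum i was measured on.
    """
    unique = sorted(set(groups))
    if n_splits > len(unique):
        raise ValueError("n_splits cannot exceed the number of distinct groups")
    random.Random(seed).shuffle(unique)
    # Deal the shuffled groups round-robin into folds.
    fold_of = {group: index % n_splits for index, group in enumerate(unique)}
    for fold in range(n_splits):
        test_idx = [i for i, g in enumerate(groups) if fold_of[g] == fold]
        train_idx = [i for i, g in enumerate(groups) if fold_of[g] != fold]
        yield train_idx, test_idx


# Six hypothetical samples, each measured five times (30 spectra in total).
groups = [f"sample_{s}" for s in range(6) for _ in range(5)]
folds = list(group_k_fold(groups, n_splits=3))

for train_idx, test_idx in folds:
    shared = {groups[i] for i in train_idx} & {groups[i] for i in test_idx}
    print(len(train_idx), len(test_idx), "shared samples:", len(shared))
```

Every fold holds out two complete samples, so no sample contributes spectra to both sides of a split. Libraries such as scikit-learn provide a comparable utility (`GroupKFold`), which may be preferable in production pipelines.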
Going forward, ML-assisted SEIRA and SERS are attractive for point-of-care testing (POCT). POCT is a medical diagnostic test performed at the time and place of patient care. It provides rapid test results and has the potential to improve patient care. The huge POCT market makes it a good choice for the practical application and commercialization of SEIRA and SERS technologies. To ensure the correctness of the diagnostic results, POCT places high requirements on the stability, accuracy, and speed of detection. The rapid, automated, and noise-reducing capabilities of ML can improve performance in these aspects, thereby representing an opportunity for the evolution of SEIRA and SERS into POCT. In particular, portable Raman/IR spectrometers integrated with ML algorithms can serve as powerful tools for SERS/SEIRA substrates regarding direct signal readout, fast data processing, and accurate decision-making. This could advance SERS/SEIRA-based POCT towards home testing or self-testing.
All in all, ML-based SEIRA and SERS technologies have achieved breakthroughs in the past 15 years, including automated design of complex substrates, signal processing, decision-making, and more. However, the demands on molecular diagnostics have changed dramatically in recent years owing to demographic shifts, new infectious agents, and labor shortages in the healthcare industry. All of these will drive advances in diagnostic technology: from the laboratory to the field or POCT, from manual processing to unattended operation, from single-function testing to all-in-one diagnostics,312 and more. It is promising that SEIRA and SERS technologies embrace ML algorithms to address these challenges and move toward next-generation molecular diagnostics.
| Footnotes | 
| † Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d2na00608a | 
| ‡ These authors contributed equally to this work. | 
| This journal is © The Royal Society of Chemistry 2023 |