 Open Access Article
      
        
          
Rose G. McHardy ab, Georgios Antoniou b, Justin J. A. Conn b, Matthew J. Baker bc and David S. Palmer *ab
      
aDepartment of Pure and Applied Chemistry, Thomas Graham Building, 295 Cathedral Street, University of Strathclyde, Glasgow, G1 1XL, UK. E-mail: david.palmer@dxcover.com
      
bDxcover Ltd, Royal College Building, 204 George Street, Glasgow, G1 1XW, UK
      
cSchool of Medicine, Faculty of Clinical and Biomedical Sciences, University of Central Lancashire, Preston, PR1 2HE, UK
    
First published on 6th July 2023
Over recent years, deep learning (DL) has become more widely used within the field of cancer diagnostics. However, DL often requires large training datasets to prevent overfitting, which can be difficult and expensive to acquire. Data augmentation is a method that can be used to generate new data points to train DL models. In this study, we use attenuated total reflectance Fourier-transform infrared (ATR-FTIR) spectra of patient dried serum samples and compare non-generative data augmentation methods to Wasserstein generative adversarial networks (WGANs) in their ability to improve the performance of a convolutional neural network (CNN) to differentiate between pancreatic cancer and non-cancer samples in a total cohort of 625 patients. The results show that WGAN-augmented spectra improve CNN performance more than non-generative augmented spectra. Compared with a model that used no augmented spectra, adding WGAN-augmented spectra to a CNN with the same architecture and parameters increased the area under the receiver operating characteristic curve (AUC) from 0.661 to 0.757, a 15% relative increase in diagnostic performance. In a separate test on a colorectal cancer dataset, data augmentation using a WGAN led to an increase in AUC from 0.905 to 0.955. This demonstrates the impact data augmentation can have on DL performance for cancer diagnosis when the amount of real data available for model training is limited.
Survival rates are substantially higher when cancer is diagnosed at an early stage. For cancers caught during the early stages (stage I–II), the average mortality rate for primary cancers is 27%, compared with 82% for cancers detected at stage IV. This demonstrates that early diagnosis is important to the treatment of cancer.1
Current screening routes for patients in at-risk populations include mammography for breast cancer,3 Pap smear for cervical cancer,4 low-dose computed tomography for lung cancer,5 endoscopic ultrasound for pancreatic cancer,6 and colonoscopy for colorectal cancer.7 Although many of these methods have been deemed effective, they are often expensive, and in some cases invasive.8 There is therefore an urgent need for a more convenient method for the earlier diagnosis of many cancers.
Liquid biopsies are a cost-effective method of utilising a wide variety of substances, both tumor and non-tumor derived, to detect various cancer types, particularly at the early stages.9 Most commonly, circulating-tumor DNA (ctDNA) is a key marker used to detect cancer within the blood stream. The analysis of ctDNA has shown promise in the detection of various cancer types and is the primary biomarker type used by the current liquid biopsy platforms.10 However, the success of detecting ctDNA is limited to the monitoring of advanced stage cancers; the levels of ctDNA during the early stages are often too low to be detected.8
One method that has developed over recent years is the use of vibrational spectroscopy as the basis of the liquid biopsy in order to capture multiple tumor and non-tumor derived biomarkers within one measurement.11–13 It has the benefits of being rapid and low-cost, and can be used to analyse multiple different biofluids.14 Vibrational spectroscopic methods such as Raman and infrared (IR) spectroscopy have previously demonstrated their potential uses within cancer detection.14,15 In particular, attenuated total reflection Fourier-transform infrared (ATR-FTIR) has shown great promise for early cancer detection.16–18 Brennan et al.19 reported a sensitivity and specificity of 0.81 and 0.80, respectively, for the diagnosis of brain tumours in a prospectively collected clinical dataset of dried serum samples. For pancreatic cancer in particular, Sala et al.20 were able to utilise ATR-FTIR analysis of dried serum samples and machine learning to achieve an area under the receiver operating characteristic (ROC) curve (AUC) of 0.95 when classifying between pancreatic cancer and healthy samples (n = 200), and an AUC of 0.83 when classifying between pancreatic cancer and samples from patients presenting as symptomatic of pancreatic cancer but subsequently diagnosed as non-cancerous (n = 70). This example used machine learning methods common in the field of chemometrics, namely partial least squares (PLS) and random forest (RF).
With continuous hardware developments, more interest is being directed at the use of deep learning, particularly within cancer diagnostics.21 However, despite the general success of deep learning over the recent years, the main obstacle researchers face in the healthcare field is data availability.22 Although the volume of data needed for deep learning is present within electronic health records, healthcare data is often limited in quality due to data sparsity, variability, and privacy policies.22 Deep learning models, such as convolutional neural networks (CNNs), require large volumes of data in order to achieve maximum performance as they have many parameters; small datasets can lead to non-generalisable models that overfit and perform poorly on unseen data.
A solution to reaching the dataset sizes required for deep learning is data augmentation.23 Data augmentation is a method to artificially increase the size of a dataset with the aim to improve the performance of a predictive model. It can be particularly useful when larger datasets are either not available or if it would be particularly laborious to generate more samples.
Data augmentation can be broadly split into two categories: non-generative and generative methods. Non-generative methods create new data from the original data using some well-defined transformations. For example, for image augmentation, this can be in the form of geometric or colour transformations. Generative methods use neural networks to generate data artificially without directly using the original dataset other than for model training.
Non-generative methods of data augmentation have been used successfully within image classification using relatively simple and computationally inexpensive methods, as demonstrated by Taylor et al.24 who were able to use geometric and photometric transformations to generate new images to train a CNN and increase their classification accuracy from 0.48 to 0.62. Similar data augmentation methods have also been used within cancer diagnostics. Hao et al.25 used various techniques such as rotation, flipping, and cropping of magnetic resonance images to diagnose prostate cancer, increasing the AUC of the CNN from 0.80 to 0.85. Non-generative methods have also been used for spectral data, in particular infrared (IR) data.26,27 Bjerrum et al.26 in particular were able to decrease the root mean squared error from 4.01 mg to 1.80 mg by changing the offset and slope of spectra to generate more synthetic samples.
Over recent years, more complex forms of data augmentation have begun to surface, such as generative adversarial networks (GANs).28 GANs comprise two neural networks: a discriminator and a generator. The generator is tasked with generating new data based on the training set of available real data, and the discriminator is tasked with becoming an expert in determining whether a sample is real or simulated. These components work adversarially to generate the most realistic augmented data possible. GANs have great potential for data augmentation applications, but are substantially more computationally expensive when compared with non-generative methods.
In particular, GANs have shown their use within cancer diagnostics. Al-Dhabyani et al.29 utilised GANs to generate ultrasound images for the diagnosis of breast cancer, increasing their diagnostic accuracy from 84% to 96%.
As well as image-based data, GANs have previously been used with infrared (IR) spectra. Wickramaratne et al.30 were able to use GANs with IR spectra to classify a subject's task as either a left finger tap, right finger tap, or a foot tap. They were able to increase the AUC from 0.79 to 0.98. Despite their benefits, GANs persistently suffer from problems with vanishing gradients, which can lead to a halt in generator learning, and mode collapse, in which the generator continuously generates similar data points that have been found to trick the discriminator.31 One solution to these issues is the Wasserstein GAN (WGAN).32 The more stable WGANs have already shown their use for deep learning models trained on spectral data. Nagasawa et al.33 utilised WGANs to augment near IR spectra to classify motor tasks. They were able to increase their classification accuracy from 0.4 to 0.7, demonstrating the potential use of WGANs with spectral data for other applications. Zhao et al.34 also utilised WGANs with IR spectra with multiple traditional and deep learning models. In all cases, adding WGAN-augmented spectra considerably increased the classification accuracy.
In this study, we aim to demonstrate the benefits of using data augmentation within spectral liquid biopsies to diagnose pancreatic cancer. Previous studies have been carried out that have used data augmentation and imaging data to diagnose pancreatic cancer, but none as of yet related to spectral data.35–37
Firstly, we will use non-generative data augmentation methods, including adding noise to spectra and averaging spectra, to create new data points, which can be found in section 3.6. Secondly, we will then optimise a WGAN network structure to simulate pancreatic cancer and non-cancer spectra, which can be found in section 4.1.
Thirdly, we will compare non-generative and WGAN augmentation methods during CNN model training, which can be found in section 4.2. We show that, compared with a CNN trained without augmented spectra, a CNN trained with WGAN-augmented spectra has a better overall performance for diagnosing pancreatic cancer. We also use a separate colorectal cancer dataset to demonstrate that the method is disease and dataset invariant.
| $\min_{G_w}\max_{D_\theta}\ \mathbb{E}_{x\sim\mathbb{P}_r}[\log D_\theta(x)] + \mathbb{E}_{\tilde{x}\sim\mathbb{P}_g}[\log(1 - D_\theta(\tilde{x}))]$ | (1) |

where $\mathbb{E}$ is the expected value, $\mathbb{P}_r$ is the real data probability distribution, $\mathbb{P}_g$ is the generated data probability distribution, $\tilde{x} = G_w(z)$, where z is the latent variable, θ are the discriminator weights, w are the generator weights, and x is the real training data.
        GAN training occurs by continuously updating the discriminator weights, θ, before updating the generator weights, w, to minimise the Jensen–Shannon divergence, which measures the similarity between two probability distributions.38 One of the main issues however with GANs is vanishing gradients, which is caused by an optimised discriminator that cannot provide enough information for generator training to progress. This is often caused by the Jensen–Shannon divergence not being continuous with respect to w when probability distribution domains do not overlap. GANs also are known to experience mode collapse, where the generator continuously outputs similar data points which successfully fool the discriminator.28
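The vanishing-gradient argument can be illustrated with a toy calculation. The sketch below is not part of the study: it uses hypothetical discrete distributions on four bins and shows that, once the real and generated distributions do not overlap, the Jensen–Shannon divergence takes the same value (log 2) however far apart they are, so it carries no information about which direction the generator should move.

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) for discrete distributions (natural log)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: mean KL divergence of p and q from their mixture m."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Hypothetical "real" distribution concentrated on bin 0.
real = [1.0, 0.0, 0.0, 0.0]

# Move the "generated" distribution further and further away from bin 0.
for k in (1, 2, 3):
    generated = [0.0] * 4
    generated[k] = 1.0
    # The divergence is identical for every k because the supports never overlap.
    print(k, js_divergence(real, generated))
```

The Wasserstein distance, by contrast, grows with the separation between the two distributions, which is the motivation for the WGAN objective introduced next.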
The minmax objective for a WGAN is instead defined as:
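| $\min_{G_w}\max_{D_\theta}\ \mathbb{E}_{x\sim\mathbb{P}_r}[D_\theta(x)] - \mathbb{E}_{\tilde{x}\sim\mathbb{P}_g}[D_\theta(\tilde{x})]$ | (2) |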
Originally, Arjovsky et al.31 used a Lipschitz constraint on the gradient functions to ensure a maximum gradient. This was enforced on the critic by clipping its weights to lie within an interval [−c, c], where c is the real number representing the weight clipping parameter, to allow faster training by constraining the critic gradient. However, it was further proposed by Gulrajani et al.32 that this was a problematic method of training the critic. Without careful tuning of the weight clipping parameter, the critic can experience exploding or vanishing gradients; if c is too large, then the critic will never train optimally, too small and it will cause vanishing gradients. Therefore, Gulrajani et al.32 changed the value function for WGANs to include a gradient penalty term (WGAN-GP) which thus leads to the minmax objective being defined as:
| $\min_{G_w}\max_{D_\theta}\ \mathbb{E}_{x\sim\mathbb{P}_r}[D_\theta(x)] - \mathbb{E}_{\tilde{x}\sim\mathbb{P}_g}[D_\theta(\tilde{x})] - \lambda\,\mathbb{E}_{\hat{x}\sim\mathbb{P}_{\hat{x}}}[(\lVert\nabla_{\hat{x}}D_\theta(\hat{x})\rVert_2 - 1)^2]$ | (3) |

where λ is the gradient penalty coefficient and $\hat{x}$ is defined as:32

| $\hat{x} = \varepsilon x + (1 - \varepsilon)\tilde{x}$, | (4) |
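To make the gradient penalty concrete, the following minimal sketch evaluates the interpolation in eqn (4) and a penalty of the form used in WGAN-GP for a toy linear critic D(x) = w · x. The linear critic, the toy vectors, and the coefficient value of 10 are assumptions chosen for illustration (for a linear critic the gradient with respect to the input is simply w); the critic in this study is a neural network and its gradients would be obtained by automatic differentiation.

```python
import math
import random

def interpolate(x_real, x_fake, eps):
    """Eqn (4): element-wise x_hat = eps * x + (1 - eps) * x_tilde."""
    return [eps * r + (1.0 - eps) * f for r, f in zip(x_real, x_fake)]

def gradient_penalty(weights, lam=10.0):
    """Penalty lam * (||grad D(x_hat)||_2 - 1)^2 for a toy linear critic D(x) = w . x.

    The gradient of a linear critic with respect to its input is `weights`,
    so the penalty here does not depend on the position of x_hat.
    """
    grad_norm = math.sqrt(sum(w * w for w in weights))
    return lam * (grad_norm - 1.0) ** 2

rng = random.Random(0)
x_real = [0.2, 0.5, 0.9]   # hypothetical real spectrum (3 points)
x_fake = [0.1, 0.7, 0.4]   # hypothetical generated spectrum
eps = rng.random()         # assumed to be drawn uniformly from [0, 1)
x_hat = interpolate(x_real, x_fake, eps)
print(x_hat)

print(gradient_penalty([0.6, 0.8, 0.0]))  # gradient norm close to 1: penalty close to 0
print(gradient_penalty([1.2, 1.6, 0.0]))  # gradient norm close to 2: penalty close to 10
```

The penalty is smallest when the critic's gradient norm equals 1, which is how the Lipschitz constraint is encouraged without clipping the weights.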
A WGAN-GP can be extended by imposing conditions based on some additional information, y, to obtain a conditional WGAN-GP (CWGAN-GP).39 In this study, y corresponds to the class label of the spectra. This results in the following minmax objective:
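| $\min_{G_w}\max_{D_\theta}\ \mathbb{E}_{x\sim\mathbb{P}_r}[D_\theta(x\mid y)] - \mathbb{E}_{\tilde{x}\sim\mathbb{P}_g}[D_\theta(\tilde{x}\mid y)] - \lambda\,\mathbb{E}_{\hat{x}\sim\mathbb{P}_{\hat{x}}}[(\lVert\nabla_{\hat{x}}D_\theta(\hat{x}\mid y)\rVert_2 - 1)^2]$ | (5) |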
In the present paper, we will be utilising a CWGAN-GP for generating synthetic FTIR spectra.
Blood samples were obtained by venipuncture using serum collection tubes, S-Monovette Z Gel (Sarstedt, Germany) and Vacutainer SST/SST II (BD, USA), and anonymized. Serum was extracted via centrifugation and stored in a −80 °C freezer. Non-identifiable clinical and demographic data were obtained in line with each biobank's data control procedures.
Ethical approval for this study was granted by Lothian REC(15/ES/0094), Preston Brain Tumour North-West (BTNW) Application #1108, Beatson West of Scotland Cancer Centre (MREC 10/S0704/18), and the Integrated Research Application System, IRAS, (ID #238735) from Health Research Authority (HRA) and University of Strathclyde Ethics Committee (UEC 17/81). All participants consented to inclusion in the study.
|  |  | C | NC | Total |
|---|---|---|---|---|
| Age, years | Mean | 64 | 56 | 60 |
|  | Min–max | 40–83 | 20–80 | 20–83 |
| Sex, n (%) | Female | 25 (50) | 30 (60) | 55 (55) |
|  | Male | 25 (50) | 20 (40) | 45 (45) |
| Cancer stage, n (%) | I | 2 (4) | — | 2 (2) |
|  | II | 20 (40) | — | 20 (20) |
|  | III | 22 (44) | — | 22 (22) |
|  | IV | 6 (12) | — | 6 (6) |
The remainder of the dataset is labelled the 525-patient dataset and comprises 116 pancreatic cancer and 409 non-cancer patients. The patient metadata of the full dataset and the 525-patient dataset can be found in Tables S1 and S2† respectively.
42 for R v4.1.2 43 with Tensorflow v2.8.0 44 as the computation back end. The final chosen model consisted of two consecutive units each of which comprised a one-dimensional convolutional layer (with 10 filters and a kernel size of 5) with a rectified linear activation, batch normalization, and a max-pooling layer with a pool size of 2. The output from the final convolutional unit was flattened and fed into dense layers, with 0.1 and 0.2 dropout respectively and a rectified linear activation. The output was two dense neurons with a softmax activation. The loss function was categorical cross entropy and training was done with the RMSprop optimizer.45 All CNN model training was carried out using the ARCHIE-WeST High Performance Computing Centre based at the University of Strathclyde, with each CNN model being trained using 10 Lenovo SD530 CPU cores with model training lasting on average 12 hours per model.
        The architecture for the CNN models used in this study can be seen in Fig. 1.
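As a rough check on the size of the network described above, the sketch below traces how the spectrum length changes through two convolution and max-pooling units and how many features reach the first dense layer. Several details are assumptions rather than reported values: an input length of 1802 points (taken from the critic input size in Table 2), stride-1 convolutions with no padding, and non-overlapping pooling.

```python
def conv1d_output_length(length, kernel_size):
    """Output length of a stride-1 one-dimensional convolution with no padding."""
    return length - kernel_size + 1

def max_pool_output_length(length, pool_size):
    """Output length of non-overlapping max pooling (stride equal to the pool size)."""
    return length // pool_size

def flattened_features(input_length, n_units=2, n_filters=10, kernel_size=5, pool_size=2):
    """Number of values passed to the dense layers after `n_units` conv/pool units."""
    length = input_length
    for _ in range(n_units):
        length = conv1d_output_length(length, kernel_size)
        length = max_pool_output_length(length, pool_size)
    return length * n_filters

# 1802 -> 1798 -> 899 -> 895 -> 447 points, each with 10 filter channels
print(flattened_features(1802))  # 4470
```

With zero padding ('same' convolutions) the lengths would instead be 1802, 901, 901 and 450, so the exact figure depends on settings not stated in the text.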
10 000 spectra were created: 5000 pancreatic cancer and 5000 non-cancer spectra. The volume of augmented spectra added during the model training is described further in Tables 3 and 4. The routines used for the non-generative augmentations are described in the next three subsections.
Fig. 2 shows the effect of randomly adding noise to spectra at varying levels of standard deviation. Initial investigations determined that adding one standard deviation of noise was sufficient.
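A minimal sketch of noise-based augmentation is given below. The exact noise model used in the study is not specified here, so the sketch makes an assumption: zero-mean Gaussian noise is added at each wavenumber, scaled by the standard deviation of the absorbance at that wavenumber across the spectra of one class. The function names and the toy spectra are hypothetical.

```python
import random
import statistics

def add_noise(spectrum, per_point_sd, n_sd=1.0, rng=random):
    """Return a copy of `spectrum` with zero-mean Gaussian noise added at each point."""
    return [a + rng.gauss(0.0, n_sd * sd) for a, sd in zip(spectrum, per_point_sd)]

def augment_with_noise(spectra, n_new, n_sd=1.0, seed=0):
    """Create `n_new` noisy spectra from randomly chosen members of `spectra`."""
    rng = random.Random(seed)
    # Standard deviation of the absorbance at each wavenumber across the class.
    per_point_sd = [statistics.stdev(column) for column in zip(*spectra)]
    return [add_noise(rng.choice(spectra), per_point_sd, n_sd, rng) for _ in range(n_new)]

# Three toy "spectra" with four absorbance values each.
toy_class = [
    [0.10, 0.20, 0.30, 0.25],
    [0.12, 0.22, 0.35, 0.20],
    [0.09, 0.18, 0.32, 0.27],
]
augmented = augment_with_noise(toy_class, n_new=5, n_sd=1.0)
print(len(augmented), len(augmented[0]))  # 5 4
```

Augmenting each class separately keeps the class label of every new spectrum unambiguous.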
While maintaining a fixed generator architecture (generator architecture was determined as part of initial investigations), the critic architecture was optimised. The tested architectures included varying the number of hidden layers from 1 to 3 and the corresponding hidden units from 256 to 2048. A general trend was observed that increasing the number of units within each dense layer improved the distribution of the generated spectra. However, with an increase in the number of dense layers, while the distribution of generated spectra improved, the level of noise within the generated spectra also increased.
As the distribution of generated spectra improved with the increase in the number of dense layers and the number of units within those layers, the choice was made to use three dense layers in the critic. All results from WGAN architecture optimisation can be found in Fig. S1–6.† Table 2 describes the final network structure for each of the layers in the critic and generator.
| Layer | Critic | Generator | 
|---|---|---|
| Input | 1802 × 2 | 100 × 2 | 
| Hidden | Units = 2048 layer norm. | Units = 256 batch norm. Dropout = 0.3 | 
| Hidden | Units = 1024 layer norm. | Units = 512 batch norm. Dropout = 0.3 | 
| Hidden | Units = 512 layer norm. | Units = 1024 batch norm. Dropout = 0.3 | 
| Output | 1 | 1802 × 2 | 
Input to the critic consisted of FTIR spectral data from patient serum analysis. The dimension of the latent space was 100, and a latent variable formed by randomly sampling from a normal distribution N(0, 1) with a dimension of 100 was used as the input to the generator. The generator network included batch normalisation after each hidden layer. Layer normalisation was used for the critic because batch normalisation is incompatible with the gradient penalty.32 Dropout (0.3) was also added after each hidden layer in the generator to further prevent mode collapse.
A leaky rectified linear unit (ReLU) was used as the activation function for the hidden layers in both the critic and the generator.
The weights of the critic and generator were updated using the Adam optimizer, with the parameter values for Adam being α = 0.0001, β1 = 0, and β2 = 0.9. The critic weights were updated five times for every update of the generator weights. These are the optimal values determined by Gulrajani et al.32
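The alternating update pattern can be sketched as a simple schedule. This is an illustration of the ordering only, with hypothetical names; the actual weight updates would be performed by the deep learning framework.

```python
# Adam settings quoted in the text (the dictionary keys are illustrative labels).
ADAM_SETTINGS = {"learning_rate": 1e-4, "beta_1": 0.0, "beta_2": 0.9}

def update_schedule(n_generator_steps, n_critic=5):
    """Yield the order of updates: `n_critic` critic updates, then one generator update."""
    for _ in range(n_generator_steps):
        for _ in range(n_critic):
            yield "critic"
        yield "generator"

schedule = list(update_schedule(n_generator_steps=2))
print(schedule.count("critic"), schedule.count("generator"))  # 10 2
```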
The WGAN training was set to run for a maximum of 6000 epochs, with early-stopping applied using parametric functions measuring absolute noise in the wavenumber region 3500–3000 cm−1. These parameters measure the relative height of peaks and troughs in the region to determine satisfactory spectral quality. The patience for early stopping was set to 600 epochs. All WGAN models were developed with Python v3.9.7 46 using Tensorflow v2.9.1.44 All WGAN training was carried out using the ARCHIE-WeST High Performance Computing Centre based at the University of Strathclyde, with WGAN training utilising 10 NVidia A100 GPU cores housed in Lenovo SR670 servers and optimisation lasting on average 4 hours. GPU computation was done using CUDA version 11.2.
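The patience-based stopping rule can be sketched as follows. The class and the monitored score are hypothetical: the sketch assumes a single noise score per epoch for which lower is better, and does not reproduce the parametric noise functions used in the study.

```python
class EarlyStopping:
    """Stop training when the monitored score has not improved for `patience` epochs."""

    def __init__(self, patience=600):
        self.patience = patience
        self.best = float("inf")
        self.best_epoch = 0

    def should_stop(self, epoch, score):
        """Record `score` (lower is better) and report whether patience is exhausted."""
        if score < self.best:
            self.best, self.best_epoch = score, epoch
        return epoch - self.best_epoch >= self.patience

def last_epoch_run(scores, patience, max_epochs=6000):
    """Return the index of the final epoch run for a sequence of per-epoch scores."""
    stopper = EarlyStopping(patience)
    epoch = 0
    for epoch, score in enumerate(scores[:max_epochs]):
        if stopper.should_stop(epoch, score):
            break
    return epoch

# Toy example with a short patience: the score stops improving after epoch 2.
print(last_epoch_run([5.0, 4.0, 3.0, 3.0, 3.0, 3.0, 3.0], patience=3))  # 5
```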
70 : 30 split, repeated 51 times. Model hyper-parameters were tuned to optimize the area under the ROC curve during the inner 5-fold CV on the training set (70%). The trained model was used to make predictions from the spectra in the test set (30%). Training and test sets were stratified by patient ID, and therefore spectra from individual patients were not allowed to be present in both the training and test sets for a given resample. To ensure that the CNN training was not tuning the model towards the augmented spectra, any augmented spectra were removed from the validation set used for model tuning and early stopping prior to training. The by-spectrum AUCs obtained for each of the 51 outer CV iterations were aggregated, and the mean and standard deviation of the resulting classification metrics were computed.
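Splitting by patient rather than by spectrum can be sketched as below. The function is a hypothetical illustration: it shuffles patient IDs, assigns 30% of patients to the test set, and then maps the assignment back to spectra. The figure of nine spectra per patient is an assumption inferred from the 630 training spectra reported for 70% of the 100-patient dataset; class balance is not handled in this sketch.

```python
import random

def split_by_patient(patient_ids, test_fraction=0.3, seed=0):
    """Return (train_idx, test_idx) of spectra so that no patient appears in both sets."""
    patients = sorted(set(patient_ids))
    rng = random.Random(seed)
    rng.shuffle(patients)
    n_test = round(test_fraction * len(patients))
    test_patients = set(patients[:n_test])
    train_idx = [i for i, p in enumerate(patient_ids) if p not in test_patients]
    test_idx = [i for i, p in enumerate(patient_ids) if p in test_patients]
    return train_idx, test_idx

# 100 hypothetical patients, each with 9 spectra (assumed).
patient_ids = [patient for patient in range(100) for _ in range(9)]
train_idx, test_idx = split_by_patient(patient_ids)
print(len(train_idx), len(test_idx))  # 630 270

# Repeating with a different seed gives a new resample, as in the 51 outer iterations.
resamples = [split_by_patient(patient_ids, seed=s) for s in range(51)]
```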
      
    
    
      
10 000 spectra to use as input for the subsequent WGAN-augmented CNN models: 5000 pancreatic cancer and 5000 non-cancer spectra.
Although the WGAN worked well in generating fake spectra that contained the correct general spectral features, the spectra produced were also excessively noisy. This could be rectified by continuing to optimise the WGAN architecture until this noise level was reduced. However, the computational resources required to do this would be vast, and the same reduction in noise could be achieved using smoothing. Therefore, a Savitzky–Golay filter with a window of 21 was used to smooth the 10 000 generated spectra to remove this noise. Fig. 3 shows a real spectrum used to train the WGAN, a WGAN-generated spectrum, and that generated spectrum after smoothing.
Fig. 3 (A) Real spectrum, (B) WGAN-generated spectrum, and (C) WGAN-generated spectrum after smoothing.
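The smoothing step can be sketched with a small, dependency-free Savitzky–Golay implementation. Only the window of 21 is stated in the text; the polynomial order is not, so the sketch assumes a quadratic fit, for which the smoothing weights have a well-known closed form. Edge points are left unchanged here for simplicity, whereas library implementations typically fit a polynomial at the edges.

```python
def savgol_coefficients(window):
    """Closed-form Savitzky-Golay smoothing weights for a quadratic (or cubic) fit."""
    if window % 2 == 0 or window < 3:
        raise ValueError("window must be an odd integer >= 3")
    m = window // 2
    denom = (2 * m + 3) * (2 * m + 1) * (2 * m - 1)
    return [3 * (3 * m * m + 3 * m - 1 - 5 * i * i) / denom for i in range(-m, m + 1)]

def savgol_smooth(y, window=21):
    """Smooth interior points; the first and last window // 2 points are left unchanged."""
    m = window // 2
    coeffs = savgol_coefficients(window)
    out = list(y)
    for j in range(m, len(y) - m):
        out[j] = sum(c * y[j + i] for c, i in zip(coeffs, range(-m, m + 1)))
    return out

# A quadratic "baseline" with alternating +/- 0.05 noise on top of it.
noisy = [0.001 * k * k + (0.05 if k % 2 else -0.05) for k in range(60)]
smoothed = savgol_smooth(noisy, window=21)
print(round(noisy[30], 3), round(smoothed[30], 3))
```

Because the filter fits a local polynomial, broad spectral features are largely preserved while point-to-point noise is averaged out.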
Although nested cross-validation is a data efficient and rigorous method to train ML models, it requires retraining the CNN model on 255 different training sets in every run (51 training sets in outer CV × 5 training sets in inner CV). Since nested CV had to be repeated 22 times (5 quantities of augmented data for each of 4 augmentation methods, plus two baseline models), that equates to 5610 training sets. This presents a practical problem since retraining the WGAN model on each training set would be computationally expensive and difficult to monitor. Therefore, in the first experiments, data augmentation was carried out using a WGAN pre-trained on the 525-patient dataset. To provide a like-for-like comparison, the non-generative augmented spectra were also obtained from the 525-patient dataset, and a second benchmark model was evaluated in which the 525-patient dataset was added to each training set during nested cross-validation (further experiments in which the CNN and WGAN were trained on the same training sets are described below). All models are compared via the by-spectrum AUCs in Table 3.
| Augmentation | No. of spectra added | No. of training spectra | AUC |
|---|---|---|---|
| No augmentation |  | 630 | 0.668 |
| Augmentation with real spectra | 4725 | 5355 | 0.748 |
| Random noise | 500 | 1130 | 0.721 |
|  | 1000 | 1630 | 0.746 |
|  | 2000 | 2630 | 0.748 |
|  | 5000 | 5630 | 0.743 |
|  | 10 000 | 10 630 | 0.753 |
| Mean bootstrap sample | 500 | 1130 | 0.745 |
|  | 1000 | 1630 | 0.753 |
|  | 2000 | 2630 | 0.759 |
|  | 5000 | 5630 | 0.730 |
|  | 10 000 | 10 630 | 0.757 |
| Splice spectra | 500 | 1130 | 0.708 |
|  | 1000 | 1630 | 0.751 |
|  | 2000 | 2630 | 0.749 |
|  | 5000 | 5630 | 0.737 |
|  | 10 000 | 10 630 | 0.749 |
| WGAN augmentation | 500 | 1130 | 0.771 |
|  | 1000 | 1630 | 0.768 |
|  | 2000 | 2630 | 0.770 |
|  | 5000 | 5630 | **0.781** |
|  | 10 000 | 10 630 | 0.757 |
Table 3 shows that data augmentation leads to an improvement in model performance in all cases. Furthermore, the CNNs that utilize the WGAN-augmented spectra during training perform better than those that use augmented spectra from non-generative methods. The model highlighted in bold, that is the CNN that adds 5000 WGAN-augmented spectra at each training step, shows a statistically significant improvement (using a Student's t-test) from the second benchmark model in which real spectra from the 525-patient dataset are used for data augmentation.
Although Table 3 shows evidence of model improvement from data augmentation, the imbalance between the augmented training set and the unaugmented validation set used for early stopping seemed to have a detrimental effect on model training. For example, adding 5000 augmented spectra led to 5405 spectra in the training set and only 99 spectra in the early-stopping set during 5-fold cross validation. A small validation set might cause inaccurate hyperparameter selection or underfitting.
Therefore, CNN models were re-run, this time with a fixed amount of data added to the early stopping sets. To provide a like-for-like comparison between different augmentation methods, each early stopping set was augmented with the same 4725 real spectra from the 525-patient dataset. This meant that for each resample, the training set would contain 5405 spectra (consisting of the 100-patient dataset split for training and generated spectra) and the validation set would contain 4824 spectra (consisting of the 100-patient dataset split for validation and the 525-patient dataset), balancing the ratio. The results for these models are shown in Table 4.
| Augmentation | No. of spectra added | No. of training samples | AUC |
|---|---|---|---|
| No augmentation |  | 630 | 0.668 |
| Augmentation with real spectra | 4725 | 5355 | 0.748 |
| Random noise | 500 | 1130 | 0.729 |
|  | 1000 | 1630 | 0.734 |
|  | 2000 | 2630 | 0.735 |
|  | 5000 | 5630 | 0.750 |
|  | 10 000 | 10 630 | 0.737 |
| Mean bootstrap sample | 500 | 1130 | 0.740 |
|  | 1000 | 1630 | 0.736 |
|  | 2000 | 2630 | 0.751 |
|  | 5000 | 5630 | 0.733 |
|  | 10 000 | 10 630 | 0.738 |
| Splice spectra | 500 | 1130 | 0.711 |
|  | 1000 | 1630 | 0.741 |
|  | 2000 | 2630 | 0.730 |
|  | 5000 | 5630 | 0.728 |
|  | 10 000 | 10 630 | 0.730 |
| WGAN augmentation | 500 | 1130 | 0.779 |
|  | 1000 | 1630 | 0.800 |
|  | 2000 | 2630 | 0.768 |
|  | 5000 | 5630 | 0.787 |
|  | 10 000 | 10 630 | 0.765 |
Table 4 shows that two more of the WGAN augmented models present a statistical improvement on the by-spectrum AUC produced by the benchmark model, with one model achieving an AUC of 0.800. This improvement due to data augmentation is evident in the comparison of the mean ROC curves in Fig. 4. Mean ROC curves were obtained by averaging sensitivity and specificity for a fixed threshold across 51 models. There is still the same trend as previously in which WGAN-augmented spectra seem to have more benefit than the non-generative methods.
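The threshold-wise averaging used for the mean ROC curves can be sketched as below. The function names and the toy labels and scores are hypothetical; the sketch assumes that each resample provides true labels (1 for cancer, 0 for non-cancer) and predicted scores, and that a score at or above the threshold is called cancer.

```python
def sens_spec(labels, scores, threshold):
    """Sensitivity and specificity when scores >= threshold are classified as cancer."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)

def mean_roc(resamples, thresholds):
    """Average sensitivity and specificity over resamples at each fixed threshold.

    `resamples` is a list of (labels, scores) pairs, one pair per test set.
    """
    curve = []
    for t in thresholds:
        pairs = [sens_spec(labels, scores, t) for labels, scores in resamples]
        mean_sens = sum(p[0] for p in pairs) / len(pairs)
        mean_spec = sum(p[1] for p in pairs) / len(pairs)
        curve.append((t, mean_sens, mean_spec))
    return curve

# Two toy resamples standing in for the 51 outer test sets.
toy_resamples = [
    ([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1]),
    ([1, 1, 0, 0], [0.8, 0.7, 0.3, 0.2]),
]
for row in mean_roc(toy_resamples, thresholds=[0.25, 0.5, 0.75]):
    print(row)
```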
An unexpected result apparent in Tables 3 and 4 is the lack of correlation between the number of generated spectra added during training and the resultant AUC. It was assumed that including more data during model training would result in improved model performance. However, this study shows that this is not the case. This might suggest that, even though adding more data is expanding the training set, the generated data is limited in the new information it can add to model learning. This may explain the non-linear relationship between data volume and model performance.
To simulate the model being used in a real-life scenario, when the CNN and WGAN would more likely be trained on the same training set, two CNN models were trained on the full 525 patient dataset, one augmented with 1000 WGAN generated spectra and one without. Both CNN models were then used to predict the 100 patient dataset. The improvement in AUC from 0.661 to 0.757, as reported in Table 5, is a clear demonstration of the benefit of data augmentation using the WGAN.
| Model | AUC | 
|---|---|
| Train: 525-patient dataset; test: 100-patient dataset | 0.661 | 
| Train: 525-patient dataset and augmented spectra; test: 100-patient dataset | 0.757 | 
The previous results demonstrate the successful use of WGANs to improve the AUC of the pancreatic dataset. To further demonstrate the benefit of WGAN augmentation on other datasets, a separate dataset comprising colorectal cancer patients (N = 200) and non-cancer patients (N = 459), for which FTIR spectra had been measured from dried serum samples, was used. These samples are covered by the Ethical Approval quoted in section 3.1. As with the pancreatic dataset, a subset of 100 samples was set aside as an external test set, leaving 559 samples to be used for WGAN training. The subsets of the data were age and sex matched to the full dataset. The patient metadata of the full colorectal dataset, the 559-patient dataset, and the 100-patient dataset can be found in Tables S3–5† respectively. The same architecture used previously was first trained on this 559-patient dataset and then used to predict the 100-patient dataset. This result was then compared with that of a CNN trained on the 559-patient dataset alongside 1000 WGAN-augmented spectra and used to predict the same 100-patient dataset. These by-spectrum AUCs are shown in Table 6.
| Model | AUC | 
|---|---|
| Train: 559-patient dataset; test: 100-patient colorectal dataset | 0.905 | 
| Train: 559-patient dataset and augmented spectra; test: 100-patient colorectal dataset | 0.955 | 
The increase in the test set AUC after the addition of WGAN-augmented spectra for a different dataset further demonstrates the benefit of data augmentation and its applicability across different cancer datasets.
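The AUC changes reported for the two external test sets can also be expressed as relative increases. The short calculation below is illustrative only; the helper name `relative_increase` is an assumption, and the AUC values are those reported for the pancreatic and colorectal test sets.

```python
def relative_increase(before, after):
    """Relative change from `before` to `after`, as a percentage."""
    return 100.0 * (after - before) / before

# By-spectrum AUCs without and with 1000 WGAN-generated spectra
pancreatic = relative_increase(0.661, 0.757)
colorectal = relative_increase(0.905, 0.955)

print(f"pancreatic: {pancreatic:.1f}%")  # prints "pancreatic: 14.5%"
print(f"colorectal: {colorectal:.1f}%")  # prints "colorectal: 5.5%"
```

The pancreatic value rounds to the 15% increase quoted in the abstract. The smaller relative gain for the colorectal dataset should be read against its higher starting AUC, which leaves less room for improvement.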
These results show for the first time the benefit of data augmentation for model training within spectroscopic liquid biopsy cancer diagnostics. The results follow the trend reported by Wickramaratne et al.30 who demonstrated that data augmentation using GANs can improve model performance, as well as Nagasawa et al.33 who showed similar results with WGAN generated spectra. However, our study, to the best of our knowledge, is the first to compare the performance of various data augmentation methods using multiple independent resamples to reduce the bias that can arise from particular train/test splits. This method has enabled a like-for-like comparison between non-generative augmentation and WGAN generated spectra.
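The idea of evaluating over multiple independent resamples can be sketched as below. This is a simplified illustration under stated assumptions, not the pipeline used in this work: splitting at the patient level, the 20% test fraction, three spectra per patient, the synthetic data, and the `centroid_scorer` stand-in for the CNN are all hypothetical. The default of 51 resamples mirrors the 51 models averaged for the mean ROC curves.

```python
import numpy as np

def auc(y_true, y_score):
    """AUC as the probability that a positive outscores a negative (ties count half)."""
    y_true = np.asarray(y_true, dtype=bool)
    y_score = np.asarray(y_score, dtype=float)
    pos, neg = y_score[y_true], y_score[~y_true]
    diff = pos[:, None] - neg[None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

def resampled_aucs(X, y, patient_ids, fit_score, n_resamples=51, test_frac=0.2, seed=0):
    """By-spectrum AUC on each of n_resamples random patient-level train/test splits."""
    rng = np.random.default_rng(seed)
    patients = np.unique(patient_ids)
    n_test = max(1, int(round(test_frac * len(patients))))
    out = []
    for _ in range(n_resamples):
        # Hold out whole patients so no patient contributes to both train and test
        test_patients = rng.choice(patients, size=n_test, replace=False)
        test = np.isin(patient_ids, test_patients)
        scores = fit_score(X[~test], y[~test], X[test])
        out.append(auc(y[test], scores))
    return np.array(out)

def centroid_scorer(X_train, y_train, X_test):
    """Hypothetical stand-in for the CNN: project onto the class-mean difference."""
    direction = X_train[y_train == 1].mean(axis=0) - X_train[y_train == 0].mean(axis=0)
    return X_test @ direction

# Synthetic data (NOT the study's spectra): 100 patients, 3 spectra each, 20 features
rng = np.random.default_rng(1)
n_patients, per_patient, n_features = 100, 3, 20
patient_labels = np.arange(n_patients) % 2
patient_ids = np.repeat(np.arange(n_patients), per_patient)
y = np.repeat(patient_labels, per_patient)
X = rng.normal(size=(len(y), n_features)) + 0.5 * y[:, None]

aucs = resampled_aucs(X, y, patient_ids, centroid_scorer)
print(f"mean AUC {aucs.mean():.3f}, std {aucs.std():.3f}")
```

Reporting the spread of `aucs` alongside the mean shows how much a single train/test split could mislead, which is the bias that resampling is intended to reduce.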
One hypothesis made before the study began was that CNN performance would increase as more augmented data was added. However, this was not the case. This could demonstrate a particular limitation of data augmentation: the quality of information it adds to a model. Although it was somewhat expected that WGAN augmentation would outperform non-generative augmentation, as it doesn't just change aspects of the data but actually learns from the real data, there is still a limit to the information it can learn and create. It seems that augmentation simply adds complexity to the dataset, helping to regularise the model, hence the non-linear relationship between the number of augmented data points added and model performance.
Further work would include obtaining an external test set of patients from a different location to the training set. This would have the added benefit of providing spectra that were also analysed at a different time, further validating the model's performance as a potential method for clinical use. The methods described could also be applied to other cancer types, particularly rarer types where there are naturally fewer samples available for analysis.
| Footnote | 
| † Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d3an00669g | 
| This journal is © The Royal Society of Chemistry 2023 |