Benjamin R.
Smith
ab,
Katherine M.
Ashton
c,
Andrew
Brodbelt
d,
Timothy
Dawson
c,
Michael D.
Jenkinson
d,
Neil T.
Hunt
e,
David S.
Palmer
*a and
Matthew J.
Baker
*b
aWestCHEM, Department of Pure and Applied Chemistry, University of Strathclyde, Thomas Graham Building, 295 Cathedral Street, Glasgow, Scotland G1 1XL, UK. E-mail: david.palmer@strath.ac.uk
bWestCHEM, Department of Pure and Applied Chemistry, University of Strathclyde, Technology and Innovation Centre, 99 George Street, Glasgow G1 1RD, UK. E-mail: matthew.baker@strath.ac.uk
cNeuropathology, Lancashire Teaching Hospitals NHS Trust, Royal Preston Hospital, Sharoe Green Lane, Fulwood, Preston, PR2 9HT, UK
dNeurosurgery, The Walton Centre NHS Foundation Trust, Lower Lane, Fazakerley, Liverpool, L9 7LJ, UK
eSUPA, Department of Physics, University of Strathclyde, 107 Rottenrow East, Glasgow, G4 0NG, UK
First published on 19th January 2016
Fourier transform infrared (FTIR) spectroscopy has long been established as an analytical technique for the measurement of vibrational modes of molecular systems. More recently, FTIR has been used for the analysis of biofluids with the aim of becoming a tool to aid diagnosis. For the clinician, this represents a convenient, fast, non-subjective option for the study of biofluids and the diagnosis of disease states. The patient also benefits from this method, as the procedure for the collection of serum is much less invasive and stressful than traditional biopsy. This is especially true of patients in whom brain cancer is suspected. A brain biopsy is very unpleasant for the patient, potentially dangerous and can occasionally be inconclusive. We therefore present a method for the diagnosis of brain cancer from serum samples using FTIR and machine learning techniques. The scope of the study involved 433 patients from whom were collected 9 spectra each in the range 600–4000 cm−1. To begin the development of the novel method, various pre-processing steps were investigated and ranked in terms of final accuracy of the diagnosis. Random forest machine learning was utilised as a classifier to separate patients into cancer or non-cancer categories based upon the intensities of wavenumbers present in their spectra. Generalised 2D correlational analysis was then employed to further augment the machine learning, and also to establish spectral features important for the distinction between cancer and non-cancer serum samples. Using these methods, sensitivities of up to 92.8% and specificities of up to 91.5% were possible. Furthermore, ratiometrics were also investigated in order to establish any correlations present in the dataset. We show a rapid, computationally light, accurate, statistically robust methodology for the identification of spectral features present in differing disease states. With current advances in IR technology, such as the development of rapid discrete frequency collection, this approach is of importance to enable future clinical translation and enables IR to achieve its potential.
Current diagnostic methods are time consuming, expensive and require highly skilled practitioners to interpret them. In the case of brain and CNS cancers, the test usually consists of an MRI or CT scan in the first instance.10 These types of complex results can be subjective in their conclusions. (The CT scan itself has disputed health risks.11,12) In some cases, these tests prove to be inconclusive, and warrant further investigation. Upon such an occurrence a biopsy of the suspected tumour is indicated. A biopsy of brain tissue represents an invasive and stressful procedure for patients, again needing highly skilled surgical and pathological expertise. Even when a biopsy is taken, there can be discrepancies in the interpretation of results between medics. Bruner et al. found that of 500 biopsy cases, 214 (42.8%) had a degree of disagreement between original and review diagnoses.13 In England in the period 2006–2010, 54% of cancer cases were diagnosed following a routine or urgent GP referral, either as part of the “two week wait” system or otherwise.14 The two-week wait system is an urgent referral (less than two weeks of waiting time for a consultation) of a patient to a specialist, should cancer be suspected by a GP. However, 23% of cancer cases in England during this time period were diagnosed after presenting as an emergency.14 Concentrating on brain and CNS cancers, the National Cancer Intelligence Network (NCIN) statistics show that more than half (58%) of brain and CNS cancers are diagnosed through presentation at emergency facilities, with the “two week wait” system only accounting for 1% of the total.15 In these later diagnoses, prognosis is much poorer for all types of cancer. As well as having on average a later secondary care diagnosis, brain tumours are very difficult to identify in primary care and a high index of suspicion is required. A survey carried out on behalf of The Brain Tumour Charity (UK) found that 38% of people living with a brain tumour had visited their GP more than 5 times before being diagnosed.16 Overall, the current system for the detection and diagnosis of brain tumours in general is not satisfactory. A reliable, fast and simple method to screen for these types of cancer would reduce time before diagnosis and therefore increase survival rates, as many therapies are more effective when started early.
IR spectroscopy has previously been investigated as a cancer diagnosis tool. Haka et al. showed the merit of human tissue spectroscopy in distinguishing breast tumours from normal tissue.17 Laboratory based proof of principle studies have shown the ability of serum spectroscopy to diagnose cancerous disease states, such as those reviewed by Kondepati et al.18 Pichardo et al. were also able to use spectroscopy together with machine learning to detect breast cancer.19 Most of these investigations in the literature focus on very specific types of cancerous diseases states, or require tissue samples from suspected tumours to aid in diagnosis. A broader approach based on IR spectroscopy of serum samples could be an ideal solution. Serum can be acquired from blood samples in a much less invasive procedure. Backhaus and Mueller et al. demonstrated a method to successfully detect breast cancer using serum samples and IR spectroscopy. The sensitivity and specificity achieved were 98% and 95% respectively.20 Gajjar et al. used FTIR of serum and plasma samples to distinguish ovarian and endometrial cancer patients from control patients.21 They used various feature extraction methods to obtain very promising results from a small pilot study. Ovarian cancer was detectable with 95% correct classification.
Our previous research has shown the ability of combined FTIR and machine learning to identify differing levels of cytokine and angiogenesis factors in patients with glioma.22 The Bioplex study provided sensitivities and specificities as high as 88% and 81% respectively. Furthermore, sensitivities and specificities of 87.5% and 100% were achieved when combining ATR-FTIR with Support Vector Machines (SVM). A similar approach was employed to distinguish differing grades of glioma (high-grade and low-grade) from non-cancer.6 Sensitivity and specificity were 93.75% and 96.53% respectively.
We now build on our previous diagnostic research6,22–25 with a larger dataset, a different approach to machine learning and the technique of generalised 2D correlational analysis. Some studies have used “black box” machine learning.22,26 While this can give good predictions, it does not give full insight into the actual features being used for classification, which in turn does not aid clinical translation. In order to translate novel technologies to the clinic, further information is required to identify spectral peaks that provide diagnostic information. In addition, identifying relevant features will enable future rapid collection protocols via techniques such as Quantum Cascade Laser IR spectroscopy.27,28 Focusing on the salient information of a spectral dataset also provides enhanced diagnostic accuracy due to the removal of extraneous information that is clouding the diagnosis based upon biological variance. Dorling and Baker29 describe the utility of serum spectroscopy in the clinic, in order to achieve this we have to provide rapid, accurate and information-rich analysis to correctly describe the difference in disease state molecular species.
IR spectroscopy of biological materials is not straightforward. Serum itself is a very complex mixture of various components. In addition to the chemical complexity of serum, optimum sample preparation techniques and their effect on the spectrum are not known. The group has therefore established guidelines by means of a dilution study coupled with a comparison of ATR-FTIR and High Throughput (HT)-FTIR.30 It was found that 3-fold dilution was optimal for HT-FTIR in terms of scores in a spectral quality test. ATR-FTIR although slower than HT-FTIR, proved to be the best when investigating discernible features. Also previously established by the group was the optimum drying time of 8 minutes.6
Generalised 2D correlation spectroscopy is a data analysis method developed by Noda,31 as distinct from ultrafast 2D-IR spectroscopy, which measures vibrational couplings in an analogous way to 2D-NMR.32 Noda's generalised method allows external perturbations (temperature, concentration, pH etc.) to be used to obtain information about the effect of external influences on spectra, and does so by offering two main types of correlation which are synchronous and asynchronous. Synchronous spectra represent coinciding changes in different spectral regions. The synchronous spectrum is symmetrical, and contains peaks (called autopeaks) along the diagonal. The strength of the highlighted areas along this line represent the band strength of the IR regions. Peaks off the diagonal (called cross-peaks) show correlation between underlying 1D spectral peaks. If the sign is positive (blue in this work), then both 1D peaks are changing in the same direction, either increasing or decreasing in intensity. Negative signs (red) show that the 1D spectral peaks are moving in opposite directions in terms of intensity. Asynchronous spectra represent sequential changes in the 1D spectra due to the perturbation. When a cross-peak is of a positive sign, then a peak from the first spectrum is changing before a band from the second spectrum, and vice versa. A feature of these asynchronous spectra is that they contain no autopeaks, and are anti-symmetric with respect to the diagonal.
The type of machine learning used in this work (Random Forest) is interpretable, and results in a better understanding of the relative importance of distinguishing features. Furthermore, we combine this method with 2D correlational analysis. The novel combination of these usually disparate methods allows for both the building of an accurate classifier, and the characterisation of spectral features important for diagnosis.
Disease state | Number of patients |
---|---|
Cancer | 311 |
Non-cancer | 122 |
Primary brain cancer | 134 |
Metastatic brain cancer | 177 |
High grade glioma | 64 |
Low grade glioma | 23 |
Fig. 2 All spectra with various pre-processing. Red – raw data, magenta – normalised spectra, green – normalised first derivative and blue – normalised second derivative. |
Ratio ID | Wavenumbers | Regions |
---|---|---|
a [∑(1380–1420) + ∑(1480–1580) + ∑(1600–1680)]:[∑(1430–1470) + ∑(1720–1760) + ∑(2830–3000)]. | ||
1030:1050 | 1030:1050 | Carb. |
1030:1080 | 1030:1080 | Carb.–phos. |
1050:1080 | 1050:1080 | Phos. |
3160:3170 | 3160:3170 | Alcohols |
3190:3200 | 3190:3200 | O–H–O |
Navarro | ∑(1650–1700):∑(1730–1800) | Protein:lipid |
Baker | manya | Protein:lipid |
BRS 1 | ∑(1600–1680):∑(1500–1580) | AmideI:AmideII |
BRS 2 | ∑(1220–1280):∑(1380–1420) | Phos.A:COO− |
BRS 3 | ∑(1000–1050):∑(1430–1470) | Carb.:CH2 |
BRS 4 | ∑(1000–1050):∑(2830–3000) | Carb.:CH2,CH3 |
The random forest machine learning method (MLM) was chosen for this work for several reasons. Firstly, RF is easily scalable when compared to other MLM. This means that the same (or very similar) parameters can be used in the future with larger datasets. Secondly, RF has easily interpreted results when used with the Gini impurity metric (see section 2.5 for an explanation of the Gini metric). This meant that important distinguishing wavenumbers were clearly defined, and their relative importance was readily established. Third, RF is known for being able to robustly handle outliers in the input space. This property potentially allows classification of spectra without heavy pre-processing, whereas other MLM may require it. Finally, RF deals well with missing values from input classes. This was especially important to our work, as the wavenumber range 1800–2400 cm−1 was removed.
For the interpretation of the RF outcome, two main groups of results were considered. Firstly, a selection of statistical metrics were generated to give an in-depth analysis of the accuracy and reliability of each classification. These were based upon true positive (TP), true negative (TN), false positive (FP) and false negative (FN) predictions as well as “real”(actual number of positives and negatives in the dataset) positives (P) and negatives (N). The abbreviation MCC stands for Matthews Correlation Coefficient.
Number of positives (P) = TP + FN | (1) |
Number of negatives (N) = TN + FP | (2) |
(3) |
(4) |
(5) |
(6) |
(7) |
(8) |
(9) |
In the statistical results tables below, TS and CV represent results for the test set and cross-validation respectively. The tolerances shown for each result are their standard errors, generated according to eqn (9), where SE is the standard error of the mean (of the 96 iterations), σ is the standard deviation and n is the number of samples.
(10) |
Every time a node is split on a predictor (wavenumber), the Gini impurity for the two child nodes is less than the parent node. This is because the dataset is gradually being sorted into predicted classes, and becoming more homogeneous with respect to the proportion of classes A or B. When node τ is split, resulting in two child nodes υ and ϕ, the change in Gini (Δg) is found by eqn (11) where nυ and nϕ are the number of spectra in nodes υ and ϕ respectively. The value of Δg is larger when a greater change in impurity occurs after the split, thus allowing for the decrease in Gini to be used as a measure of importance of a certain wavenumber.
(11) |
The overall Gini importance (G) of a particular spectral feature θ is found by the sum across all nodes of each tree ψ, and across all trees in forest ω (eqn (12)).
(12) |
These values are then averaged for the 96 independent classifications, to arrive at the Mean Decrease (Gini) used in the figures presented in the Results and discussion section.
Metric | 900–1800 cm−1 | 2400–4000 cm−1 | Both |
---|---|---|---|
Sensitivity TS | 0.901 ± 0.013 | 0.875 ± 0.014 | 0.900 ± 0.012 |
Sensitivity CV | 0.899 ± 0.006 | 0.872 ± 0.007 | 0.897 ± 0.006 |
Specificity TS | 0.785 ± 0.028 | 0.764 ± 0.030 | 0.812 ± 0.027 |
Specificity CV | 0.768 ± 0.014 | 0.733 ± 0.016 | 0.786 ± 0.014 |
Prediction accuracy TS | 0.869 ± 0.003 | 0.846 ± 0.003 | 0.877 ± 0.003 |
Prediction accuracy CV | 0.863 ± 0.001 | 0.836 ± 0.001 | 0.868 ± 0.001 |
Positive precision TS | 0.918 ± 0.012 | 0.916 ± 0.012 | 0.930 ± 0.011 |
Positive precision CV | 0.912 ± 0.006 | 0.905 ± 0.006 | 0.922 ± 0.006 |
Negative precision TS | 0.746 ± 0.029 | 0.677 ± 0.032 | 0.746 ± 0.029 |
Negative precision CV | 0.739 ± 0.015 | 0.661 ± 0.016 | 0.729 ± 0.015 |
Matthews correl. coeff. TS | 0.674 ± 0.007 | 0.615 ± 0.007 | 0.693 ± 0.007 |
Matthews correl. coeff. CV | 0.659 ± 0.003 | 0.585 ± 0.003 | 0.667 ± 0.003 |
Metric | 900–1800 cm−1 | 2400–4000 cm−1 | Both |
---|---|---|---|
Sensitivity TS | 0.918 ± 0.011 | 0.897 ± 0.013 | 0.912 ± 0.012 |
Sensitivity CV | 0.918 ± 0.006 | 0.887 ± 0.007 | 0.912 ± 0.006 |
Specificity TS | 0.883 ± 0.022 | 0.820 ± 0.027 | 0.875 ± 0.023 |
Specificity CV | 0.839 ± 0.013 | 0.793 ± 0.015 | 0.853 ± 0.013 |
Prediction accuracy TS | 0.909 ± 0.003 | 0.877 ± 0.003 | 0.902 ± 0.003 |
Prediction accuracy CV | 0.897 ± 0.001 | 0.863 ± 0.001 | 0.897 ± 0.001 |
Positive precision TS | 0.957 ± 0.008 | 0.937 ± 0.010 | 0.957 ± 0.008 |
Positive precision CV | 0.941 ± 0.005 | 0.928 ± 0.005 | 0.948 ± 0.005 |
Negative precision TS | 0.795 ± 0.027 | 0.730 ± 0.030 | 0.769 ± 0.028 |
Negative precision CV | 0.785 ± 0.014 | 0.698 ± 0.015 | 0.768 ± 0.014 |
Matthews correl. coeff. TS | 0.776 ± 0.007 | 0.691 ± 0.008 | 0.755 ± 0.008 |
Matthews correl. coeff. CV | 0.742 ± 0.002 | 0.653 ± 0.003 | 0.740 ± 0.003 |
Metric | 900–1800 cm−1 | 2400–4000 cm−1 | Both |
---|---|---|---|
Sensitivity TS | 0.928 ± 0.011 | 0.825 ± 0.015 | 0.923 ± 0.011 |
Sensitivity CV | 0.929 ± 0.005 | 0.823 ± 0.008 | 0.922 ± 0.006 |
Specificity TS | 0.915 ± 0.019 | 0.766 ± 0.035 | 0.914 ± 0.018 |
Specificity CV | 0.888 ± 0.011 | 0.730 ± 0.018 | 0.892 ± 0.011 |
Prediction accuracy TS | 0.924 ± 0.002 | 0.813 ± 0.03 | 0.921 ± 0.002 |
Prediction accuracy CV | 0.918 ± 0.001 | 0.805 ± 0.001 | 0.914 ± 0.001 |
Positive precision TS | 0.970 ± 0.007 | 0.937 ± 0.010 | 0.970 ± 0.007 |
Positive precision CV | 0.960 ± 0.004 | 0.929 ± 0.005 | 0.963 ± 0.004 |
Negative precision TS | 0.812 ± 0.026 | 0.511 ± 0.034 | 0.804 ± 0.026 |
Negative precision CV | 0.813 ± 0.013 | 0.491 ± 0.017 | 0.792 ± 0.014 |
Matthews correl. coeff. TS | 0.811 ± 0.006 | 0.513 ± 0.008 | 0.805 ± 0.006 |
Matthews correl. coeff. CV | 0.795 ± 0.002 | 0.481 ± 0.003 | 0.784 ± 0.002 |
The normalised second derivative gave the best overall accuracy according to the statistical metrics which are presented in the tables below. The first derivative results achieved a performance between normalised spectra and second derivative in terms of classification accuracy. The 2400–4000 cm−1 section of the spectrum did not perform as well as the 900–1800 cm−1 section, nor did it add any accuracy when these two sections were combined together.
Wavenumber range | ∑Gini | Tentative assignment |
---|---|---|
997–1003 cm−1 | 131.8 | Carbohydrate |
1290–1294 cm−1 | 93.1 | Phosphate |
1462–1464 cm−1 | 78.7 | Lipid CH2 |
1527–1533 cm−1 | 71.9 | Amide II |
1028–1034 cm−1 | 59.8 | Carbohydrate |
1387–1390 cm−1 | 46.7 | Protein COO− |
1194–1198 cm−1 | 41.9 | |
1373–1377 cm−1 | 39.9 | Protein COO− |
1080–1082 cm−1 | 31.4 | Phosphate |
1280–1282 cm−1 | 28.6 | |
1093–1095 cm−1 | 24.0 |
Fig. 3–6 show the mean decrease in Gini coefficient for all wavenumbers in a range, alongside generalised 2D correlation plots for the same data. In the 2D correlation plots, cancer spectra are on the horizontal axis, and non-cancer spectra are on the vertical. Data for the generalised 2D correlational analysis were obtained by averaging subsections of intensities in the dataset. In the figures showing the 2D correlation of the second derivative spectra, the averages were as follows: Fig. 3: synchronous spectrum of the average of all cancer spectra in the range 900–1800 cm−1vs. all non-cancer spectra in the same range. Fig. 4–6 use the same pattern of average intensities of cancer and non-cancer spectra in the same wavenumber range, for both synchronous and asynchronous 2D plots. In the ESI, Fig. S4–S11† show similar plots for normalised and first derivative data. Fig. S12–S14† show average cancer and non-cancer spectra plotted together with Gini importance for normalised, first and second derivative spectra.
Usually when these 2D plots are generated, an incremental variable is used for the perturbation between the two sides of the correlation (e.g. temperature, pressure etc.). In the case of this work, the perturbation is whether the patient has cancer or a normal diagnosis. It should also be borne in mind that the two in-putted comparison spectra were averages of the whole class; i.e. cancer and non-cancer. However, some information can still be gleaned from Noda's method in this scenario. The synchronous spectrum shows whether two wavenumber ranges increase or decrease in intensity in the same or opposite directions. This is as normal for a synchronous generalised 2D correlation plot. The asynchronous spectrum is more subtle in its interpretation with a correlation such as this. Noda's rules44 show whether an intensity change occurs before or after another, with an incremental increase of some outside perturbation. There is no such incremental increase of a perturbation in this work, only a binary cancer/non-cancer descriptor. Therefore, the asynchronous plots serve as extra clarification to further highlight differences in the averaged spectra, without the influence of auto peaks. This therefore can provide a better means to identify major regions of the spectra which are responsible for the cancer/non-cancer distinction.
In Fig. 3 & 4, it can be seen that the peaks at around 1640 and 1530 cm−1 show a strong correlation with each other. Only the peak centred around 1530 cm−1 appears as a strong spike in the RF classification. However, these two peaks are themselves correlated to other wavenumber ranges which are heavily featured in the RF study. In Fig. 5 & 6, a similar situation can be seen with the two cross-peaks around the 2850 and 2930 cm−1 areas. This time however, both of these peaks also show strong peaks in the mean decrease in Gini.
The coupling together of the RF importance charts and the 2D correlations is helpful for further clarification of where the differences in the spectra lie, and whether these are reproduced in both studies. Another helpful piece of information from this is whether the major features of the RF importance charts match up to areas of major difference in intensities at certain wavenumbers. This allows further characterisation of the RF results, and gives clues as to whether minor or major differences in spectra are responsible for the greatest discrimination. Overall, the RF and 2D correlations are showing the same features via very different analyses, providing us further confidence due to the use of orthogonal techniques.
It was found that the 900–1800 cm−1 range of the spectral data produced the greatest accuracy for RF. The higher end of the spectrum from 2400–4000 cm−1 performed adequately alone, but did not add to the accuracy of the classification when used alongside 900–1800 cm−1. This suggests a correlation between the upper and lower ends of the spectrum, as evidenced also in the ratio correlation graphic (Fig. 7). The lower end therefore represents a better option, as the calculation is less expensive than using the full range. In the work leading up to these results, the first derivative spectra were found to be intermediate between raw data and the second derivative in terms of statistical results and clarity of important wavenumbers. The results of the generalised 2D correlational analysis proved to be useful as a comparison to the RF results. It allowed for better visualisation of the differences in spectra and tallied well (as expected) with the importance charts. Our method works well for the binary classification of cancer/non-cancer, the next step would be to develop further the classification of disease states within the cancer subset.
This research has shown a rapid, computationally light, accurate, statistically robust methodology for the identification of spectral features that define a dataset. The identification of these features is in line with Occam's razor and supports accurate diagnostics by focusing upon salient information as opposed to including information from biological variance within the diagnosis. With current advances in IR technology, such as the development of rapid discrete frequency collection, this approach is of importance to enable future clinical translation and enables IR to achieve its potential.
Footnote |
† Electronic supplementary information (ESI) available. See DOI: 10.1039/c5an02452h |
This journal is © The Royal Society of Chemistry 2016 |