Mohammad Shahbazy,a
Ali Zahraei,b
Jamshid Vafaeimanesh*b and
Mohsen Kompany-Zareh*a
a Department of Chemistry, Institute for Advanced Studies in Basic Sciences (IASBS), Zanjan 45137-66731, Iran. E-mail: kompanym@iasbs.ac.ir; Tel: +98 24 3315 3123
b Clinical Research Development Center, Qom University of Medical Sciences, Qom, Iran. E-mail: j.vafaeemanesh@muq.ac.ir; Tel: +98 25 3612 2949
First published on 17th November 2015
Coronary artery disease (CAD), one of the most common fatal diseases in the world, was examined in the present study via the investigation of the 1H-NMR spectra of human blood plasma and clinical laboratory parameters, with the aim of early disease diagnosis. Partial least squares-discriminant analysis (PLS-DA), a common supervised pattern recognition method, assisted by a genetic algorithm (GA) based feature selection procedure, was used to classify CAD− and CAD+ individuals based on spectral patterns and clinical parameters. Meanwhile, unsupervised pattern recognition methods (i.e., hierarchical cluster analysis (HCA) and principal component analysis (PCA)) were implemented to visualize and examine the spectroscopic and clinical datasets. GA and ANOVA techniques were employed to select the most discriminant clinical parameters for recognizing CAD− and CAD+ samples. Finally, the calculated classification models successfully distinguished between CAD− and CAD+ individuals using 1H-NMR spectra and clinical laboratory parameters, offering a safe, economical, simple and non-invasive method for CAD diagnosis in comparison with coronary angiography.
Atherosclerosis is an inflammatory disease affecting all arteries that may lead to ischemia in the heart and brain, a potentially fatal event.5 Risk factors associated with CAD, which are strongly related to poor lifestyle and influenced by stress, include hypertension, smoking, diabetes mellitus, obesity, physical inactivity and dyslipidemia.1 These coronary risk factors lead to endothelial injury, plaque formation and the promotion of arterial thrombus deposition by different mechanisms.6
Among the mentioned risk factors for CAD, dyslipidemia is considered to be the most common. To initiate appropriate treatment, minimize morbidity and mortality, and optimize the cost-effectiveness of such treatments, it is important to identify the patients at risk for CAD and to treat atherosclerotic lesions early.1 Atherosclerosis is a complex process that is considered to have an inflammatory background; consequently, the associations between various inflammatory markers and the occurrence, severity and clinical phenomena of CAD have been studied.7 The interactions between genetic and environmental factors induce the arterial wall to respond to stimuli through actions in endothelial cells, smooth muscle, inflammatory cells and platelets that lead to plaque formation.8 There is much evidence that inflammation plays a key role in the pathogenesis of stable CAD and acute coronary syndromes.9 The most frequently studied parameters have been leukocyte count, C-reactive protein (CRP), fibrinogen, and uric acid.10,11 Furthermore, Danesh et al. found a significant association of fibrinogen, CRP, albumin and leukocyte count with CAD.12
Implementing chromatographic and spectroscopic based high-tech analytical methods (e.g., mass spectrometry (MS),13–15 fluorescence spectroscopy,16–19 gas/liquid chromatography coupled to MS (GC/LC-MS),20–27 comprehensive two-dimensional GC (GC × GC),28 combined GC × GC with time-of-flight MS (GC × GC-TOF-MS),29,30 proton nuclear magnetic resonance (1H-NMR),14,31,32 and LC-NMR33) can be fruitful for metabolomics and proteomics studies in clinical and biological systems.
Up to 1000 metabolites can be recognized and evaluated through metabolic profiling, and pathways can be assessed by following the variations in metabolite concentrations. Consequently, metabolomics/proteomics studies using instrumental analysis techniques, as non-invasive and rapid tools, might be advantageous for discriminating and recognizing variations of the metabolites/proteins in diverse biofluids such as blood plasma, urine and serum, and thereby detecting an external malignant factor affecting a particular disease. Such studies can help to identify discriminant metabolites for biomarker discovery and the early diagnosis of various diseases.26,31,32,34–36
Chemometrics, a well-established analytical approach, has been increasingly utilized to associate instrumental analysis techniques with metabolomics/proteomics research.31,37,38 This paper is concerned with the prediction of CAD clinical status and the diagnosis of this fatal disease through the analysis of selected clinical laboratory parameters and the 1H-NMR spectra acquired from human blood plasma samples, using pattern recognition based chemometric methods. Partial least squares-discriminant analysis (PLS-DA)39,40 is a supervised pattern recognition technique that correlates variation in the dataset with class membership, which in turn can provide an additional confidence measure for any resultant clustering. Principal component analysis (PCA) and hierarchical cluster analysis (HCA), two common unsupervised pattern recognition methods, were used to visualize the relationship between CAD− and CAD+ individuals through a cluster analysis of the clinical parameters and 1H-NMR spectra datasets.
In the present study, analysis of variance (ANOVA) and genetic algorithm (GA) based feature selection approaches were used to select the most important and effective clinical parameters for predicting the CAD+ or CAD− clinical status (patient and healthy classes, respectively) of the considered individuals.
During evaluation of the data it was clear that critical parameters were age, gender, information about the patient’s history including hypertension (indicated by a systolic blood pressure of ≥140 mmHg, a diastolic blood pressure of ≥90 mmHg and anti-hypertensive medication), smoking (patients who had stopped smoking for 10 years or less were classified as smokers) and biochemical parameters (i.e., hemoglobin, leucocytes, thrombocytes, C-reactive protein (CRP) and erythrocyte sedimentation rate (ESR)).
All the blood samples were taken after overnight fasting. All the measurements were performed between 8:00 and 11:00 AM in a temperature-controlled room with the subjects in a resting supine state. The subjects abstained from alcohol, caffeine, tobacco and food for 12 hours prior to the study. Long-acting vasoactive medications including calcium channel blockers, beta-adrenergic blocking agents, nitrates and converting enzyme inhibitors were discontinued for 12 hours prior to the study.
The erythrocyte sedimentation rate was measured over a period of 1 hour and the normal value was considered to be 10 mm in the first hour. The number of leukocytes was also determined in all the patients, with the normal range considered to be 4000–10 000 cells per µL. The CRP level was determined in all the patients with a normal reference value of 6.0 mg L−1.
This is mainly due to the high quality of water suppression with little calibration and consistency in the obtained spectra. In this study, since the broad protein peaks were not removed by filtration before the NMR measurements, the CPMG pulse sequence was used to suppress them.31 Fig. 1 shows a 1H-NMR spectrum that was obtained from one of the control samples.
Fig. 2 The mean of the NMR spectra for both classes (i.e., CAD− (healthy) and CAD+) after excluding non-informative/water signals and the binning procedure.
For classification modelling, 75% of all the samples (48 samples) were randomly selected as the training set and the rest of the samples (16 individuals) were considered to be the external test set for evaluating the model performance during diverse steps of the modelling procedure (i.e., training, venetian blind cross validation and prediction of the external test set samples).
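The partitioning described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the authors' implementation; the function names, the fixed seed and the number of cross-validation splits (six) are assumptions, since the text does not state them.

```python
import random


def split_train_test(n_samples, train_fraction=0.75, seed=0):
    """Randomly partition sample indices into a training and an external test set."""
    indices = list(range(n_samples))
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    rng.shuffle(indices)
    n_train = round(n_samples * train_fraction)
    return sorted(indices[:n_train]), sorted(indices[n_train:])


def venetian_blind_folds(n_samples, n_splits):
    """Venetian blind folds: fold k holds samples k, k + n_splits, k + 2*n_splits, ..."""
    return [list(range(k, n_samples, n_splits)) for k in range(n_splits)]


train_idx, test_idx = split_train_test(64)
folds = venetian_blind_folds(len(train_idx), n_splits=6)  # n_splits is an assumption
print(len(train_idx), len(test_idx), [len(f) for f in folds])
```

With 64 individuals the split yields 48 training and 16 test samples, and each of the six assumed folds leaves out 8 training samples in turn.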
In the present study, to improve the accuracy and prediction ability of the classification model, GA, an evolutionary optimization method, was used for feature selection among the PLS factors to pick out a subset of discriminant factors for classifying the samples as CAD− and CAD+ (GA was applied on the PLS scores dataset; 64 × 40). Among the 40 PLS factors, 28 latent variables were selected by the GA as the most informative and effective factors for distinguishing CAD− and CAD+ individuals. These 28 factors had a higher frequency of inclusion in the best of the constructed PLS-DA classification models, as judged by the model accuracy evaluation parameters (e.g., the non-error rate for cross-validated samples (NERcv), which measures how well the model predicts the class membership of validation samples that were not used to build it). The latent variables with a lower inclusion frequency in the PLS-DA models produced during the GA procedure were discarded and not used for the classification modelling.
To further reduce the data dimensionality (from 28 to only 3 factors), avoid over-fitting and enhance the classification model’s accuracy, the selected PLS factors were transformed into three rotated PLS factors via an oblique rotation. The rotation was optimized by GA using a simple fitness function that minimizes the ratio of the distance of an object from its class’s centre to the object’s distances to other class centres.43 This procedure, which produces the optimal PLS-OR factors (a dataset of 64 × 3), significantly improved the model performance and the discrimination between CAD− and CAD+ cases from the medium-resolution 400 MHz NMR spectra.
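The idea of ranking factors by their inclusion frequency in the best GA solutions can be illustrated with the toy sketch below. It is not the authors' procedure: a nearest-centroid rule evaluated on the training data stands in for the cross-validated PLS-DA model, and the population size, mutation rate, number of generations and synthetic scores are all assumptions chosen for brevity.

```python
import random


def nearest_centroid_ner(X, y, mask):
    """Non-error rate (mean class sensitivity) of a nearest-centroid rule
    restricted to the features switched on in `mask`."""
    cols = [j for j, on in enumerate(mask) if on]
    if not cols:
        return 0.0
    centroids = {}
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        centroids[label] = [sum(r[j] for r in rows) / len(rows) for j in cols]
    hits = {label: 0 for label in centroids}
    for x, lab in zip(X, y):
        dist = {c: sum((x[j] - m) ** 2 for j, m in zip(cols, centre))
                for c, centre in centroids.items()}
        if min(dist, key=dist.get) == lab:
            hits[lab] += 1
    return sum(hits[c] / y.count(c) for c in centroids) / len(centroids)


def ga_inclusion_frequency(X, y, n_generations=30, pop_size=20, p_mut=0.05, seed=0):
    """Evolve binary feature masks and report how often each feature
    appears in the better half of the final population."""
    rng = random.Random(seed)
    n_feat = len(X[0])

    def fitness(mask):
        return nearest_centroid_ner(X, y, mask)

    pop = [[rng.random() < 0.5 for _ in range(n_feat)] for _ in range(pop_size)]
    for _ in range(n_generations):
        parents = sorted(pop, key=fitness, reverse=True)[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_feat)  # single-point crossover
            child = a[:cut] + b[cut:]
            children.append([(not g) if rng.random() < p_mut else g for g in child])
        pop = parents + children
    best = sorted(pop, key=fitness, reverse=True)[: pop_size // 2]
    return [sum(m[j] for m in best) / len(best) for j in range(n_feat)]


# Synthetic scores (hypothetical): factor 0 separates the classes, the rest is noise.
noise = random.Random(1)
labels = [0] * 10 + [1] * 10
scores = [[4.0 * lab + 0.1 * (i % 5)] + [noise.gauss(0, 1) for _ in range(5)]
          for i, lab in enumerate(labels)]
frequency = ga_inclusion_frequency(scores, labels)
print([round(f, 2) for f in frequency])
```

Factors with a low inclusion frequency would be the candidates for removal; in the study this ranking was based on PLS-DA models and NERcv rather than on the simplified classifier used here.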
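The distance-ratio criterion can be made concrete with a short sketch. The text only states that the ratio of an object's distance to its own class centre over its distances to the other class centres is minimized; averaging that ratio over all objects, the Euclidean metric and the function names below are assumptions, and the oblique rotation itself is not implemented here.

```python
import math


def class_centres(scores, labels):
    """Mean score vector of each class."""
    centres = {}
    for label in set(labels):
        rows = [s for s, lab in zip(scores, labels) if lab == label]
        centres[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centres


def distance_ratio_fitness(scores, labels):
    """Mean over objects of d(own centre) / sum of d(other centres); lower is better."""
    centres = class_centres(scores, labels)
    total = 0.0
    for s, lab in zip(scores, labels):
        own = math.dist(s, centres[lab])
        others = sum(math.dist(s, c) for k, c in centres.items() if k != lab)
        total += own / others  # assumes no object sits exactly on another class centre
    return total / len(scores)


# Hypothetical two-factor scores: well-separated classes versus fully mixed classes.
separated = distance_ratio_fitness([[0, 0], [0, 1], [10, 0], [10, 1]], [0, 0, 1, 1])
mixed = distance_ratio_fitness([[0, 0], [10, 1], [10, 0], [0, 1]], [0, 0, 1, 1])
print(round(separated, 3), round(mixed, 3))
```

A candidate rotation that pulls each class towards its own centre lowers this value, so an optimizer such as GA can compare rotations by evaluating the function on the rotated scores.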
Furthermore, hierarchical cluster analysis (HCA), as a common unsupervised pattern recognition method, was used (on the 1H-NMR dataset; 64 × 458) to properly visualize the details of the data space and how the distributions of CAD− and CAD+ individuals are related (within and between the classes) via a k-nearest neighbour (kNN) algorithm. The obtained dendrogram from the HCA is shown in Fig. 4. Clearly, there are two clusters corresponding to CAD− (healthy) and CAD+ cases.
Fig. 4 A dendrogram of kNN based HCA for clustering of the objects into two main clusters: healthy and CAD.
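A minimal sketch of agglomerative clustering is given below for orientation. It assumes that the "kNN based" HCA corresponds to nearest-neighbour (single) linkage with Euclidean distances, which the text does not spell out; the six two-dimensional points are hypothetical and stand in for the 64 × 458 spectral dataset.

```python
import math


def single_linkage(points, n_clusters=2):
    """Agglomerative clustering: repeatedly merge the two clusters whose
    closest members are nearest to each other."""
    clusters = [[i] for i in range(len(points))]
    merges = []  # (members_a, members_b, distance): the merge history behind a dendrogram
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(math.dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return clusters, merges


points = [[0, 0], [0, 1], [1, 0], [8, 8], [9, 8], [8, 9]]  # hypothetical objects
clusters, merges = single_linkage(points, n_clusters=2)
print(sorted(sorted(c) for c in clusters))
```

The recorded merge distances are what a dendrogram displays on its height axis; cutting the tree at two clusters corresponds to stopping the loop when two groups remain.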
In the next step, PLS-DA was applied to build a classification model (using a training set with 48 samples from the optimal PLS-OR factors dataset; 64 × 3) to predict the presence of CAD in unknown samples (the external test set). The resulting model classified CAD− and CAD+ individuals well.
The mean values of the evaluation parameters of the PLS-DA model’s performance for the CAD and healthy cases by using the optimal PLS-OR factors are reported in Table 1.
Modelling step | Class label | Specificity | Sensitivity | Precision | NER |
---|---|---|---|---|---|
Training | Healthy | 1 | 1 | 1 | 1 |
Training | CAD | 1 | 1 | 1 | |
Cross validation | Healthy | 1 | 0.979 | 1 | 0.988 |
Cross validation | CAD | 0.979 | 1 | 0.979 | |
Prediction | Healthy | 1 | 0.937 | 1 | 0.963 |
Prediction | CAD | 0.937 | 1 | 0.937 | |
These results were collected during the modelling steps which include training, cross validation and prediction for the external test set. According to Table 1, for the PLS-DA model via the optimal PLS-OR factors, the NERcv and NERtst values are acceptable and equal to 0.988 and 0.963, respectively.
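For reference, the evaluation parameters can be computed from true and predicted class labels as sketched below. The sketch assumes the usual definitions, with NER taken as the mean of the class sensitivities; the 16-sample label vectors are hypothetical and are not the study's predictions.

```python
def class_metrics(true, pred, positive):
    """Sensitivity, specificity and precision for one class treated as 'positive'."""
    pairs = list(zip(true, pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
    }


def non_error_rate(true, pred):
    """NER: average of the class sensitivities."""
    classes = sorted(set(true))
    return sum(class_metrics(true, pred, c)["sensitivity"] for c in classes) / len(classes)


# Hypothetical external test set: 8 healthy, 8 CAD, one healthy case predicted as CAD.
true = ["healthy"] * 8 + ["CAD"] * 8
pred = ["healthy"] * 7 + ["CAD"] * 9
healthy = class_metrics(true, pred, "healthy")
cad = class_metrics(true, pred, "CAD")
ner = non_error_rate(true, pred)
print(healthy, cad, ner)
```

In a two-class problem the sensitivity of one class equals the specificity of the other, which is why those columns mirror each other in the table.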
The scores visualization of the PLS-DA model on the first two latent variables confirms excellent discrimination for the healthy and CAD cases in Fig. 5. The pink line denotes the linear discriminator boundary between the classes which was produced via the DA algorithm.
It should be noted that N. smoker is a unit for measuring the amount a person has smoked over a long period of time. It is calculated by multiplying the number of packs of cigarettes smoked per day by the number of years the person has smoked for.
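The calculation described above (commonly known as pack-years) is a single product, shown here as a short sketch; the function name and the example values are illustrative only.

```python
def n_smoker(packs_per_day, years_smoked):
    """Cumulative smoking exposure: packs smoked per day times years of smoking."""
    return packs_per_day * years_smoked


exposure = n_smoker(packs_per_day=0.5, years_smoked=10)
print(exposure)
```

For example, half a pack per day over ten years gives a value of 5.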
The patients were divided into two groups according to positive and negative coronary artery disease (35 and 29 patients, respectively). The characteristics of the two groups are shown in Tables 2 and 3. The differences in basic characteristics and risk factors are presented in Table 2.
Parameter | All, N = 64 (mean ± SD) | CAD−, N = 29 (mean ± SD) | CAD+, N = 35 (mean ± SD) | p-value |
---|---|---|---|---|
Age | 55.25 ± 12.09 | 51.34 ± 12.04 | 58.49 ± 11.31 | 0.018 |
Abdominal circumference | 100.03 ± 13.42 | 102.86 ± 12.04 | 97.69 ± 14.21 | 0.126 |
Systolic BP | 134.20 ± 22.37 | 128.90 ± 18.00 | 138.60 ± 24.83 | 0.084 |
Diastolic BP | 79.34 ± 10.24 | 79.76 ± 10.88 | 79.00 ± 9.84 | 0.771 |
WBC | 8061.09 ± 2266.98 | 8068.97 ± 2623.84 | 8054.57 ± 1962.88 | 0.980 |
Hemoglobin | 13.87 ± 1.75 | 13.91 ± 1.54 | 13.83 ± 1.92 | 0.859 |
Hematocrit | 41.78 ± 4.87 | 41.48 ± 5.20 | 42.02 ± 4.65 | 0.663 |
CPK | 20.84 ± 13.72 | 16.41 ± 5.27 | 24.51 ± 17.185 | 0.017 |
LDH | 356.14 ± 149.91 | 331.66 ± 101.04 | 376.43 ± 179.72 | 0.237 |
ESR | 16.61 ± 17.43 | 10.66 ± 9.65 | 21.54 ± 20.76 | 0.012 |
IgG titer | 70.28 ± 31.03 | 56.59 ± 31.12 | 81.63 ± 26.34 | 0.001 |
CRP | 6.64 ± 10.64 | 4.94 ± 3.81 | 8.05 ± 13.91 | 0.248 |
Platelets | 239… | 249… | 232… | 0.274 |
N. smoker | 3.30 ± 8.46 | 1.24 ± 5.64 | 5.00 ± 10.00 | 0.077 |
Troponin T | 0.08 ± 0.27 | 0.01 ± 0.00 | 0.13 ± 0.36 | 0.078 |
Parameter | Status | All, N = 64, n (%) | CAD−, N = 29, n (%) | CAD+, N = 35, n (%) | p-value |
---|---|---|---|---|---|
Gender | Male | 34 (53.1) | 12 (41.4) | 22 (62.9) | 0.087 |
Gender | Female | 30 (46.9) | 17 (58.6) | 13 (37.1) | |
Smoker | Yes | 12 (18.8) | 2 (6.9) | 10 (28.6) | 0.027 |
Smoker | No | 52 (81.2) | 27 (93.1) | 25 (71.4) | |
Hypertension | Yes | 42 (65.6) | 16 (55.2) | 26 (74.3) | 0.109 |
Hypertension | No | 22 (34.4) | 13 (44.8) | 9 (25.7) | |
Cardiac disease | Yes | 22 (34.4) | 4 (13.8) | 18 (51.4) | 0.002 |
Cardiac disease | No | 42 (65.6) | 25 (86.2) | 17 (48.6) | |
Cardiac failure | Yes | 33 (51.6) | 14 (48.3) | 19 (54.3) | 0.632 |
Cardiac failure | No | 31 (48.4) | 15 (51.7) | 16 (45.7) | |
CCU admission | Yes | 33 (51.6) | 7 (24.1) | 26 (74.3) | <0.001 |
CCU admission | No | 31 (48.4) | 22 (75.9) | 9 (25.7) | |
Troponin | Positive | 11 (17.2) | 1 (3.4) | 10 (28.6) | 0.008 |
Troponin | Negative | 53 (82.8) | 28 (96.6) | 25 (71.4) | |
Bloating | Yes | 21 (32.8) | 13 (44.8) | 8 (22.9) | 0.062 |
Bloating | No | 43 (67.2) | 16 (55.2) | 27 (77.1) | |
Gastroesophageal reflux | Yes | 30 (46.9) | 17 (58.6) | 13 (37.1) | 0.087 |
Gastroesophageal reflux | No | 34 (53.1) | 12 (41.4) | 22 (62.9) | |
Bitterness of the mouth | Yes | 30 (46.9) | 16 (55.2) | 14 (40.0) | 0.226 |
Bitterness of the mouth | No | 34 (53.1) | 13 (44.8) | 21 (60.0) | |
Dyspepsia | Yes | 32 (50.0) | 18 (62.1) | 14 (40.0) | 0.079 |
Dyspepsia | No | 32 (50.0) | 11 (37.9) | 21 (60.0) | |
H. pylori | Positive | 27 (42.2) | 16 (55.2) | 11 (31.4) | 0.056 |
H. pylori | Negative | 37 (57.8) | 13 (44.8) | 24 (68.6) | |
Significant differences were found between the CAD+ and CAD− patients in terms of age, CPK, ESR, IgG titer and history of CCU admission. The mean age was 58.49 ± 11.31 years in CAD+ patients and 51.34 ± 12.04 years in CAD− patients, a statistically significant difference (p-value = 0.018). Furthermore, the CPK level was higher in CAD+ patients (24.51 ± 17.185 vs. 16.41 ± 5.27, p-value = 0.017).
Moreover, the ESR level was higher in CAD+ patients (21.54 ± 20.76 vs. 10.66 ± 9.65, p-value = 0.012). It is confirmed that ESR can be used as a predictor of coronary artery disease.44 In fact, some researchers such as Prakash and colleagues believe that this parameter is one of the few laboratory markers that differs between CAD patients and healthy individuals.45
Another factor that differed between the CAD+ patients and healthy people was the IgG titer. However, another study found that H. pylori infection and consequently H. pylori antibody titer was higher among CAD+ patients.46
In this study, no significant difference was observed between the two groups in terms of CRP but the opposite finding has been mentioned in a study by Leite et al.47
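As an illustration of how a group comparison for a continuous parameter can be checked from the tabulated summary statistics, the sketch below computes the one-way ANOVA F statistic from group sizes, means and standard deviations. It assumes that a standard one-way ANOVA applies to such comparisons; the text does not state exactly how each p-value in Table 2 was obtained, and converting F to a p-value requires the F distribution, which is not implemented here.

```python
def one_way_anova_f(groups):
    """F statistic from per-group (n, mean, sd) summaries.

    Returns (F, df_between, df_within)."""
    n_total = sum(n for n, _, _ in groups)
    grand_mean = sum(n * m for n, m, _ in groups) / n_total
    ss_between = sum(n * (m - grand_mean) ** 2 for n, m, _ in groups)
    ss_within = sum((n - 1) * sd ** 2 for n, _, sd in groups)
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Age in Table 2: CAD− (n = 29, 51.34 ± 12.04) and CAD+ (n = 35, 58.49 ± 11.31)
f_age, df1, df2 = one_way_anova_f([(29, 51.34, 12.04), (35, 58.49, 11.31)])
print(round(f_age, 2), df1, df2)
```

For age this gives F ≈ 5.98 with 1 and 62 degrees of freedom.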
The basic differences in the laboratory parameters are shown in Table 3. The smoker status is yes if the individual has a history of smoking. If the individual has a history of heart disease, the cardiac disease status is noted as yes. Heart failure, often referred to as congestive heart failure, occurs when the heart is unable to pump sufficiently to maintain blood flow to meet the body’s needs. When a person has been hospitalized in the intensive care unit for the heart, the status is yes for CCU admission. Troponin was determined using bioMerieux kits. Results of more than 0.01 are considered to be positive. The H. pylori parameter was detected based on serum titers of higher than 30 AU mL−1.
Among patients with coronary artery disease, positive troponin results were significantly more frequent (p-value = 0.008). Also, the smoking rate was higher among CAD+ patients (p-value = 0.027). Additionally, a history of heart disease and previous hospitalization in a CCU ward were significantly more likely in CAD+ patients. High blood pressure was also more frequent in CAD+ patients, although the difference did not reach statistical significance in Table 3 (p-value = 0.109). These are known risk factors for CAD.
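The p-values for the categorical parameters in Table 3 can be examined with a Pearson chi-square test on the corresponding 2 × 2 counts, as sketched below. The choice of this test (without continuity correction) is an assumption, since the text does not name the test used; the p-value uses the exact relation between the chi-square distribution with one degree of freedom and the complementary error function.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test for the 2 x 2 table [[a, b], [c, d]].

    Returns (chi2, p_value) for one degree of freedom, without continuity correction."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p_value = math.erfc(math.sqrt(chi2 / 2))  # P(chi-square with 1 d.f. > chi2)
    return chi2, p_value


# Rows: CAD−, CAD+; columns: yes/positive, no/negative (counts from Table 3)
chi2_smoker, p_smoker = chi_square_2x2(2, 27, 10, 25)
chi2_troponin, p_troponin = chi_square_2x2(1, 28, 10, 25)
print(round(p_smoker, 3), round(p_troponin, 3))
```

Under this assumption the smoker and troponin counts give p-values of about 0.027 and 0.008, in line with the tabulated values.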
Although our study showed that hematological parameters such as the number of white blood cells and platelets were not associated with CAD, Jia et al.’s study did find this association.48
Modelling step | Class label | Specificity | Sensitivity | Precision | NER |
---|---|---|---|---|---|
Training | Healthy | 1 | 1 | 1 | 1 |
Training | CAD | 1 | 1 | 1 | |
Cross validation | Healthy | 0.931 | 0.952 | 0.923 | 0.947 |
Cross validation | CAD | 0.952 | 0.931 | 0.976 | |
Prediction | Healthy | 1 | 0.875 | 0.953 | 0.937 |
Prediction | CAD | 0.875 | 1 | 0.917 | |
The scores obtained for the first two identified latent variables by the PLS-DA model are shown in Fig. 6. It is clear that there is good discrimination between CAD− and CAD+ individuals. The blue and red lines around objects show the individual space of each class in the score space for healthy and CAD cases, respectively. It can be concluded that the mentioned clinical laboratory parameters would be advantageous to distinguish CAD patients from healthy cases with high accuracy and precision.
Furthermore, through clinical parameters measured in the laboratory, CAD− and CAD+ individuals were identified by classification modelling. Among thirty parameters, thirteen parameters (i.e., age, gender, abdominal circumference, systolic BP, diastolic BP, cardiac disease, WBC, ESR, troponin, CRP, LDH, IgG titer and dyspepsia) were selected as the discriminant and most important to distinguish CAD− from CAD+ using a genetic algorithm based feature selection approach.
Finally, it was demonstrated that the above-mentioned workflow, using the acquired 1H-NMR spectra and the measured clinical laboratory parameters, was able to accurately predict CAD in the suspected cases with lower risk than other clinical methods, while being fast, simple and non-invasive.
This journal is © The Royal Society of Chemistry 2015