Jian-Hua Huangᵃ,
Liang Fuᵇ,
Bin Liᵃ,
Hua-Lin Xie*ᵇ,
Xiaojuan Zhangᵃ,
Yanjiao Chenᵃ,
Yuhui Qinᵃ,
Yuhong Wangᵃ,
Shuihan Zhangᵃ,
Huiyong Huangᵃ,
Duanfang Liaoᵃ and
Wei Wang*ᵃ
ᵃ TCM and Ethnomedicine Innovation & Development Laboratory, Sino-Luxemburg TCM Research Center, School of Pharmacy, Hunan University of Chinese Medicine, Changsha, 410208, P. R. China. E-mail: wangwei402@hotmail.com; Fax: +86731-8845-8227; Tel: +86731-8845-8240
ᵇ College of Chemistry and Chemical Engineering, Yangtze Normal University, Chongqing, 408100, China. E-mail: hualinxie@163.com
First published on 29th June 2015
In this study, we propose a metabolomics strategy that combines GC-MS with the random forest (RF) method to distinguish the metabolic characteristics of healthy controls, patients with benign breast disease (BE), and patients with malignant breast disease (BC). Serum samples from the three groups were profiled by GC-MS. RF models were then established to visualize the differences among the metabolite profiles of the three groups and to investigate the progression from benign to malignant disease on the basis of these GC-MS profiles. The differences between the healthy controls and the breast cancer patients were successfully identified, and the metabolic changes from benign to malignant disease could also be visualized. The results suggest that combining GC-MS profiling with the random forest method is a useful approach for analyzing serum metabolites and for screening potential biomarkers of breast cancer.
Metabolomics is an important platform for the quantitative analysis of metabolites in living systems and of their dynamic responses to endogenous and exogenous factors. It draws on a range of analytical approaches, including gas chromatography-mass spectrometry (GC-MS),7–10 high-resolution nuclear magnetic resonance (NMR),11–14 and ultra-performance liquid chromatography-mass spectrometry (UPLC-MS).15 Recently, metabolomics methods have been widely used to monitor disease progression and have shown their advantages in various areas, such as the diagnosis of human diseases,16 physiological evaluation,17 the elucidation of biomarkers,18,19 and drug toxicity.20 The transformation of normal cells into malignant cells is associated with metabolic disturbances, so metabolomics is well suited to breast cancer research. Previous studies have demonstrated that some volatile organic metabolites can indicate differences between breast cancer patients and healthy controls,6,21 and other researchers have reported that the serum concentrations of free fatty acids (FFAs) in patients with BC were significantly decreased compared with those in healthy controls.22,23 These studies indicate that GC-MS metabolite profiles can assist breast cancer diagnosis. In addition, GC-MS analysis offers favorable stability, reproducibility, and sensitivity, as well as rapid analysis.
Owing to the complexity of these metabolic profiles, multivariate statistical methods are extensively used to handle such 'omics' data. Principal component analysis (PCA) and partial least squares discriminant analysis (PLS-DA) are the most commonly used methods for visualizing the information in the data, and other machine learning methods are being applied more and more frequently.24,25 The random forest (RF) model, one of these machine learning methods, has particular advantages in dealing with complex metabolomics data: it can not only distinguish different groups (e.g., patients and healthy controls) but also help to find metabolites with significant changes as potential biomarkers, as shown in our previous studies.26–28
For these reasons, we established a metabolomics strategy to distinguish the metabolic characteristics of healthy controls, patients with benign breast disease (BE), and patients with malignant breast disease (BC) by using GC-MS and random forest (RF). The workflow comprises several steps. First, serum samples from the healthy controls, BE patients, and BC patients were profiled by GC-MS. Second, after pretreatment, the metabolite data were processed with the RF method. Finally, the sample proximity matrix calculated by the RF model was used to visualize not only the differences between the healthy controls and the breast cancer patients but also the differences between the BE and BC patients. In addition, some informative metabolites, or potential biomarkers, were discovered by means of the variable importance ranking in the random forest program.
Here, we introduce two useful tools in RF, the variable importance measure and the proximity matrix, which have shown their advantages in data interpretation and visualization. The variable importance measure estimates the importance of each metabolite to the model classification, and this information can help us to find potential biomarkers. In the current study, the 'mean decrease in classification accuracy' measure was adopted. For each tree, the classification accuracy on the out-of-bag (OOB) samples is determined both before and after random permutation of the values of each variable, one variable at a time. The accuracy after permutation is subtracted from the accuracy before permutation, and the difference is averaged over all trees in the forest (eqn (1)).
$$\mathrm{Importance}_{j} = \mathrm{Accuracy}_{j}^{\mathrm{normal}} - \mathrm{Accuracy}_{j}^{\mathrm{permuted}} \tag{1}$$
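The permutation procedure behind eqn (1) can be sketched as follows. This is a minimal illustration rather than the random forest program used in this work: each tree is assumed to be represented by a hypothetical `predict` callable together with its own OOB samples, and the helper names `accuracy`, `tree_importance`, and `forest_importance` are introduced here only for illustration.

```python
import random
from typing import Callable, List, Sequence, Tuple

Row = Sequence[float]
# Hypothetical representation of one tree: (predict function, OOB rows, OOB labels).
Tree = Tuple[Callable[[Row], int], List[Row], List[int]]


def accuracy(predict: Callable[[Row], int], X: List[Row], y: List[int]) -> float:
    """Fraction of samples whose predicted class equals the true class."""
    correct = sum(1 for row, label in zip(X, y) if predict(row) == label)
    return correct / len(y)


def tree_importance(predict: Callable[[Row], int], X_oob: List[Row],
                    y_oob: List[int], j: int, rng: random.Random) -> float:
    """Eqn (1) for one tree: OOB accuracy before minus after permuting variable j."""
    normal = accuracy(predict, X_oob, y_oob)
    column = [row[j] for row in X_oob]
    rng.shuffle(column)  # break the link between variable j and the class label
    X_perm = [list(row[:j]) + [value] + list(row[j + 1:])
              for row, value in zip(X_oob, column)]
    permuted = accuracy(predict, X_perm, y_oob)
    return normal - permuted


def forest_importance(trees: List[Tree], j: int, rng: random.Random) -> float:
    """Average the per-tree importance of variable j over all trees in the forest."""
    scores = [tree_importance(p, X, y, j, rng) for p, X, y in trees]
    return sum(scores) / len(scores)
```

A variable that no tree uses for splitting receives an importance of exactly zero under this scheme, because permuting it cannot change any prediction.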
The other attractive feature of the RF algorithm is the calculation of the proximity matrix. Proximity values indicate the similarities among all the samples. Normally, samples from the same group tend to fall into the same or nearby tree nodes (this is the principle of tree methods). In tree methods, a distance matrix is used to calculate the similarities of samples. In the RF method, the proximity between two samples is calculated as the number of times the two samples fall into the same terminal node of a tree, divided by the number of trees in the forest.31 After the proximity values are calculated, a multidimensional scaling (MDS) plot is commonly used to visualize the results. MDS is a set of related statistical techniques often used to visually explore similarities or dissimilarities in data.32 The first two or three scaling coordinates can be plotted to obtain a low-dimensional clustering plot of all the samples.
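The proximity calculation and the subsequent scaling step can be sketched with NumPy as below. This is an illustrative sketch under stated assumptions, not the implementation used in this study: the input `leaves` is assumed to hold the terminal-node index of every sample in every tree (for example, as returned by the `apply` method of a scikit-learn forest), the dissimilarity is assumed to be 1 − proximity, and classical (metric) MDS is used as one common variant. The function names are hypothetical.

```python
import numpy as np


def proximity_matrix(leaves: np.ndarray) -> np.ndarray:
    """Proximity of each sample pair.

    leaves[i, t] is the terminal-node index of sample i in tree t.
    Proximity = (number of trees in which two samples share a terminal node)
                / (number of trees in the forest).
    """
    same_node = leaves[:, None, :] == leaves[None, :, :]
    return same_node.mean(axis=2)


def classical_mds(dissimilarity: np.ndarray, k: int = 2) -> np.ndarray:
    """Classical MDS: return the first k scaling coordinates of each sample."""
    n = dissimilarity.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    # Double-centre the squared dissimilarities to obtain a Gram-like matrix.
    gram = -0.5 * centering @ (dissimilarity ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1][:k]          # largest eigenvalues first
    top = np.clip(eigvals[order], 0.0, None)       # guard against tiny negatives
    return eigvecs[:, order] * np.sqrt(top)


# Toy example: 3 samples, 3 trees (leaf indices are made up for illustration).
toy_leaves = np.array([[1, 2, 3],
                       [1, 2, 4],
                       [5, 6, 7]])
toy_proximity = proximity_matrix(toy_leaves)
toy_coords = classical_mds(1.0 - toy_proximity, k=2)
```

In the toy example, samples 0 and 1 share a terminal node in two of the three trees, so their proximity is 2/3, whereas sample 2 never shares a node with either and has a proximity of 0 to both.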
Fig. 1 The typical total ion chromatograms (TICs) of healthy control (blue line), BE (black line), and BC (red line).
After these metabolic profiles were collected, qualitative and quantitative analyses were carried out. The metabolites found in the chromatograms were mainly amino acids, organic acids, fatty acids, and carbohydrates (detailed results are listed in Table 1). These metabolite data were then passed to pattern recognition algorithms for further analysis.
id | tRᵇ (min) | Endogenous metabolites | BC group | BE group | Healthy |
---|---|---|---|---|---|
1 | 5.922 | Ethylbis(trimethylsilyl)amine | 0.2456 ± 0.0705 | 0.1905 ± 0.0567 | 0.1958 ± 0.0551 |
2 | 6.593 | Ethylene glycol | 0.0182 ± 0.0020 | 0.0530 ± 0.0428 | 0.0746 ± 0.0626 |
3 | 6.608 | N,N-Diethylacetamide | 0.0120 ± 0.0060 | 0.0220 ± 0.0325 | 0.0227 ± 0.0302 |
4 | 6.84 | N,N-Diethyl-acetamide | 0.0657 ± 0.0087 | 0.0476 ± 0.0202 | 0.0557 ± 0.0107 |
5 | 7.716 | Lactic acidᵃ | 0.0872 ± 0.0374 | 0.0952 ± 0.0592 | 0.1482 ± 0.2155 |
6 | 7.934 | Acetic acid | 0.0629 ± 0.0140 | 0.0856 ± 0.0333 | 0.0412 ± 0.0403 |
7 | 10.01 | Phosphate | 3.1278 ± 1.0173 | 1.4730 ± 0.7381 | 1.3767 ± 1.0361 |
8 | 10.2 | L-Threonine | 0.0173 ± 0.0098 | 0.0108 ± 0.0068 | 0.0096 ± 0.0065 |
9 | 10.297 | Acetic acid, phenyl- | 0.0047 ± 0.0023 | 0.0159 ± 0.0133 | 0.0147 ± 0.0117 |
10 | 10.382 | Succinic acidᵃ | 0.0811 ± 0.0429 | 0.0098 ± 0.0131 | 0.0119 ± 0.0086 |
11 | 10.447 | [1,2-Phenylenebis(oxy)]bis[trimethyl- | 0.0120 ± 0.0072 | 0.0078 ± 0.0047 | 0.0067 ± 0.0039 |
12 | 10.503 | Glyceric acid | 0.0961 ± 0.0266 | 0.0400 ± 0.0282 | 0.0183 ± 0.0167 |
13 | 10.723 | (R*,R*)-2,3-Dihydroxybutanoic acid | 0.0267 ± 0.0053 | 0.0137 ± 0.0014 | 0.0053 ± 0.0029 |
14 | 11.357 | 2,4-Bis[(trimethylsilyl)oxy]-butanoic acid | 0.0147 ± 0.0051 | 0.0055 ± 0.0030 | 0.0066 ± 0.0047 |
15 | 11.583 | (R*,S*)-3,4-Dihydroxybutanoic acid | 0.0304 ± 0.0098 | 0.0132 ± 0.0064 | 0.0178 ± 0.0107 |
16 | 11.797 | N-(1-oxobutyl)-glycine | 0.0653 ± 0.0244 | 0.0319 ± 0.0186 | 0.0274 ± 0.0151 |
17 | 12.341 | Isovaleroglycine | 0.0356 ± 0.0134 | 0.0160 ± 0.0079 | 0.0107 ± 0.0073 |
18 | 12.483 | D-Threitol | 0.0214 ± 0.0073 | 0.0290 ± 0.0130 | 0.0251 ± 0.0151 |
19 | 12.645 | N-Crotonylglycine | 0.0640 ± 0.0146 | 0.0207 ± 0.0129 | 0.0148 ± 0.0099 |
20 | 14.53 | N-(1-oxohexyl)-glycine | 0.0160 ± 0.0072 | 0.0121 ± 0.0073 | 0.0132 ± 0.0081 |
21 | 14.713 | D-Xylose | 0.0208 ± 0.0075 | 0.0082 ± 0.0044 | 0.0093 ± 0.0063 |
22 | 14.823, 15.057 | D-Ribose | 0.0126 ± 0.0070 | 0.0152 ± 0.0042 | 0.0250 ± 0.0110 |
23 | 15.509, 15.733 | Arabitol | 0.0487 ± 0.0364 | 0.0283 ± 0.0179 | 0.0278 ± 0.0215 |
24 | 16.023 | D-Galactose, 6-deoxy-2,3,4,5-tetrakis-O-(trimethylsilyl)- | 0.0336 ± 0.0083 | 0.0177 ± 0.0100 | 0.0149 ± 0.0104 |
25 | 16.087 | Mannonic acid | 0.0505 ± 0.0177 | 0.0211 ± 0.0143 | 0.0168 ± 0.0138 |
26 | 16.2 | cis-Aconitic acidᵃ | 0.0435 ± 0.0388 | 0.0105 ± 0.0079 | 0.0168 ± 0.0147 |
27 | 16.357 | Phosphoric acid | 0.0414 ± 0.0252 | 0.0230 ± 0.0141 | 0.0212 ± 0.0168 |
28 | 17.177 | Isocitric acidᵃ | 0.0464 ± 0.0121 | 0.0340 ± 0.0093 | 0.0448 ± 0.0838 |
29 | 17.563 | Hippuric acid | 0.0270 ± 0.0126 | 0.0180 ± 0.0104 | 0.0156 ± 0.0116 |
30 | 17.85, 17.96 | D-Fructoseᵃ | 0.0712 ± 0.0586 | 0.0471 ± 0.0145 | 0.0580 ± 0.1031 |
31 | 18.087 | D-Galactoseᵃ | 0.0796 ± 0.0214 | 0.0455 ± 0.0272 | 0.0389 ± 0.0287 |
32 | 18.197, 18.147 | D-Glucoseᵃ | 0.2785 ± 0.0918 | 0.1741 ± 0.7354 | 0.1859 ± 0.4136 |
33 | 18.507 | Altronic acid | 0.0202 ± 0.0069 | 0.0185 ± 0.0100 | 0.0102 ± 0.0074 |
34 | 18.577, 18.65 | D-Sorbitolᵃ | 0.0259 ± 0.0169 | 0.0254 ± 0.0187 | 0.0300 ± 0.0275 |
35 | 18.983, 19.533 | Galactonic acid | 0.1213 ± 0.0482 | 0.0817 ± 0.0328 | 0.0441 ± 0.0351 |
36 | 19.99 | Palmitic acid | 0.0127 ± 0.0017 | 0.0148 ± 0.0029 | 0.0071 ± 0.0025 |
37 | 20.403 | Myo-inositol | 0.0247 ± 0.0128 | 0.0197 ± 0.0037 | 0.0334 ± 0.0129 |
38 | 25.465 | D-Turanose | 0.0216 ± 0.0138 | 0.0197 ± 0.0190 | 0.0510 ± 0.1099 |
39 | 28.125 | D-(+)-Lactose monohydrateᵃ | 0.8475 ± 0.1366 | 1.0400 ± 0.3349 | 0.6559 ± 0.2286 |
40 | 29.927 | Lactose | 0.0142 ± 0.0043 | 0.0143 ± 0.0075 | 0.0190 ± 0.0163 |
41 | 35.223 | Cholesterolᵃ | 0.0107 ± 0.0038 | 0.0101 ± 0.0021 | 0.0107 ± 0.0034 |

ᵃ Identified by standard substances. ᵇ Retention time.
First, we used principal component analysis (PCA) to present the clustering trends of the samples from the three groups. PCA projects the metabolite profiles into a lower-dimensional space so that clustering trends can be evaluated visually. The first three principal components (PC1, PC2, and PC3) were used to draw the scores plot (Fig. 2), which presents the distribution of the samples from the three groups. Together, these three PCs accounted for 94.59% of the total variance of the raw data. As can be seen, the healthy controls are clearly separated from the BE and BC groups. However, the BE and BC groups cannot be discriminated, because some samples from the two groups overlap.
Fig. 2 The first three principal components from PCA Scores plot of serum profiles for healthy, BE and BC samples.
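The projection onto the first three PCs and the cumulative explained variance quoted above can be sketched as follows. This is a minimal NumPy illustration, not the software used in this study; it assumes that the data matrix has samples in rows and metabolites in columns and that only mean-centring (no further scaling) is applied, which the text does not specify. The function name `pca_scores` is hypothetical.

```python
import numpy as np


def pca_scores(X: np.ndarray, n_components: int = 3):
    """Return PCA scores and the fraction of total variance explained by each PC.

    X has one row per sample and one column per metabolite.
    Assumption: columns are mean-centred only (no autoscaling).
    """
    X_centred = X - X.mean(axis=0)
    # Singular value decomposition of the centred data matrix.
    U, s, Vt = np.linalg.svd(X_centred, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]   # sample coordinates on the PCs
    explained = s ** 2 / np.sum(s ** 2)               # variance fraction per PC
    return scores, explained[:n_components]
```

Summing the returned variance fractions for PC1 to PC3 gives the cumulative contribution of the scores plot, which is the quantity reported as 94.59% for the present data.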
Therefore, to further classify the BE and BC patients, the random forest (RF) method was adopted to analyze these metabolites, with all the metabolites used as variables for discrimination. RF models were established according to the pre-set parameters. During model training, the sample proximities are calculated for each pair of cases. Because similar samples tend to fall into the same terminal node or derive from the same parent node, samples in the same group usually have larger proximity values with each other than with samples from other groups.
To observe the patterns in the proximity matrix more directly and conveniently, multidimensional scaling (MDS) was employed to map the proximities into a lower-dimensional space. In Fig. 3, a good separation between the healthy controls and the breast cancer patients can be observed. Furthermore, the differences between the BC and BE patients also emerge. These results indicate that the metabolic characteristics of BC patients, BE patients, and healthy controls are distinct. The BE patients are located between the BC patients and the healthy controls, and their disease may develop and progress to malignancy. A more detailed analysis of each pair of groups is given in the following sections.
Some metabolites, such as acetic acid, (R*,R*)-2,3-dihydroxybutanoic acid, palmitic acid, and D-(+)-lactose monohydrate, contribute greatly to the classification accuracy. (R*,R*)-2,3-Dihydroxybutanoic acid is a normal organic acid in human biofluids. Palmitic acid is a saturated fatty acid that may inhibit the metabolic actions of insulin and attenuate insulin signal transduction.33 Moreover, there is a significant direct association between palmitic acid in erythrocytes and the risk of breast cancer.34 These metabolites could be considered as potential biomarkers for diagnosing BE patients.
As can be seen from Fig. 5, several important metabolites were the same as those found for the BE patients and healthy controls, such as (R*,R*)-2,3-dihydroxybutanoic acid and D-(+)-lactose monohydrate. Other metabolites, such as D-xylose and galactonic acid, were also found to contribute strongly to the classification. A property of many malignancies, including breast cancer, is the constitutive upregulation of glycolysis, which persists despite the presence of oxygen.35 These metabolites reflect changes in the activity of several metabolic pathways associated with breast cancer, including amino acid metabolism and glycolysis. Galactonic acid is a sugar acid35 and one of the oxidized forms of D-galactose. D-Xylose is a five-carbon aldose that can be catabolized or metabolized into useful products by many organisms.36,37 This suggests that these metabolites could be considered as potential biomarkers for diagnosing BC patients.
As can be seen from Fig. 6, three metabolites were found as potential biomarkers: D-glucose, D-(+)-lactose monohydrate, and D-xylose. Furthermore, D-xylose is a characteristic metabolite for BC patients that differs from both BE patients and healthy controls. It might be a useful potential biomarker for monitoring the transformation and the metabolic disturbances from benign to malignant disease. Such molecular biomarkers can generally provide prognostic indicators, and their detection is becoming increasingly important in the early diagnosis of breast cancer.
RF | Random forests |
GC-MS | Gas chromatography-mass spectrometry |
BC | Breast cancer |
PCA | Principal component analysis |
PLS-DA | Partial least squares discriminant analysis |
This journal is © The Royal Society of Chemistry 2015