Pure ion chromatogram extraction via optimal k -means clustering

Hongchao Ji; Hongmei Lu; Zhimin Zhang

doi:10.1039/C6RA08409E

View PDF VersionPrevious ArticleNext Article

DOI: 10.1039/C6RA08409E (Paper) RSC Adv., 2016, 6, 56977-56985

Pure ion chromatogram extraction via optimal k-means clustering

Hongchao Ji, Hongmei Lu* and Zhimin Zhang*
College of Chemistry and Chemical Engineering, Central South University, Changsha 410083, PR China. E-mail: hongmeilu@csu.edu.cn; zhangzhimin@csu.edu.cn

Received 1st April 2016 , Accepted 31st May 2016

First published on 2nd June 2016

Abstract

Untargeted analysis of complex samples with liquid chromatography coupled with mass spectrometry (LC-MS) has shown a great prospect. However, it is still difficult to extract useful information from complicated LC-MS data. Recently, pure ion chromatograms (PIC) were introduced. They are effective for reducing ions not related to meaningful compounds. In this study, a novel method to extract PIC based on optimal k-means clustering (KPIC) is proposed. KPIC uses the clustering tendency of centroids of pure ions to extract PIC from raw LC-MS datasets adaptively. KPIC was tested with 3 datasets: simulated, MM48 and Arabidopsis thaliana datasets. Compared with PITracer and XCMS methods, the results show that KPIC has better accuracy of feature extraction. It is able to provide higher quality chromatographic peaks, particularly for low concentration compounds. KPIC reduces the number of split signals, due to avoiding estimation of ion mass difference tolerances subjectively. KPIC is implemented in R programming language, which is available as an open source package at https://github.com/hcji/KPIC.

1. Introduction

Liquid chromatography coupled with mass spectrometry (LC-MS) is currently a primary method for complex sample analysis, for example, metabolome analysis.^1,2 In order to detect all the possible components, samples are often analyzed by a full scan MS and as much retention time as possible is covered. The datasets collected by this method become very complicated and difficult to handle. The difficulty in extracting useful information mainly originates from the overlapping chromatographic signals, fluctuations in the detected m/z of ions between each scan and the chemical or instrumental noise. It is essential to develop a reliable and effective method for extracting representations of measured ions from the raw data.³

Currently, most methods extract features from LC-MS data by slicing data into bins covering a narrow m/z range, which is called an extracted ion chromatogram (EIC). This strategy avoids the problem of resolving peaks in the m/z direction.^4,5 Each EIC can be treated as a single chromatogram. The algorithms use for processing chromatograms are able to be directly used to treat EIC. For example, XCMS filters each EIC with a second-order Gaussian function to reduce noise first, then selects peaks based on the signal to noise ratio (SNR).^6,7 Another widely used software, MZmine,^8,9 works similarly. It adopts alternative filter functions, including Gaussian and Lorentzian, and considers peak width during peak selection. XCMS and MZmine are able to deal with both centroid and continuum mode data. There are numerous other methods such as MetAlign,^10,11 MathDAMP,¹² MetaboAnalyst¹³ and MSResolver,¹⁴ involving a binning procedure in the m/z direction. Although widely used, this strategy is proven to have drawbacks. Binning does not make full use of the information given by high resolution mass spectrometers, even though the bin size can be adjusted depending on the resolution. Moreover, it is difficult to define optimal bins for all conditions,¹⁵ and some signals from the same analyte are liable to be split into two adjacent bins.⁴

To circumvent drawbacks of binning, another method called centWave¹⁶ is integrated into XCMS. It utilizes a regions-of-interest (ROI) function, which aims to detect features from raw data produced by high resolution mass spectrometry such as TOF and orbitrap. The ROI function locates the region that is supposed to contain potentially interesting masses and finds ROIs based on the mass accuracy. Therefore, there is no need to perform binning. Multiple scale continuous wavelet transform (CWT) is conducted to detect peaks of ROI. CentWave avoids disadvantages of binning to some degree, and multiple scale peak detection is more sensitive than the matched filter method. However, the relative mass differences between the adjacent ions are correlated with intensity.¹⁷ Furthermore, relative mass difference tolerances of different MS instruments are also different. Hence, a fixed relative mass difference tolerance is still not satisfactory due to the inconsistent relative mass differences.¹⁸

Recently, a new strategy of extracting pure ion chromatograms (PIC) has been proposed. This strategy allows for the full analytical power of a high-resolution mass spectrometer to be utilized. “Pure ion” is defined as an ion originating from the same analyte, and PIC is a chromatogram of the “pure ion”. Many packages based on PIC have been developed, including TracMass,¹⁹ Massifquant,²⁰ TracMass2,²¹ PITracer.¹⁵ TracMass treats values of m/z and intensity of pure ions, changing with retention time as the object moves in the radar images. With the change being continuous and smooth, it can be predicted by Kalman tracking. TracMass extracts PIC by connecting the data points in the predicting region one by one. This method simultaneously considers nearly all the information produced by LC-MS, including retention time, m/z value and intensity. Unfortunately, TracMass is not an open source. Therefore, Massifquant, which is similar to TracMass, was proposed after years. Massifquant is written in C and R language, and its code is open source. However, the Kalman filter, used by TracMass and Massifquant, is complicated and time-consuming. Apart from this though, the change in the m/z value is extremely small, whereas in the intensity direction, the value changes sharply and drastically. As a result, the Kalman filter does not seem very suitable in this situation.

TracMass2 gives up the Kalman filter and tracks the pure ion depending only on m/z values of data points by a nearest neighbor strategy. It works as follow: the sample will be scanned several times. In the first scan, each point is assigned a PIC ID. In the next scan, if the m/z value of a data point is sufficiently close to that of a point in the previous scan, the data point is assigned the same PIC ID; if the value of m/z difference is over the tolerance, a new PIC ID will be assigned. After iteratively computing, all PICs are found. Compared with TracMass, TracMass2 is proven to be faster. PITracer goes a step forward. It calibrates the mass values of ions according to a reference ion and adjusts the mass tolerance of each scan. Since PITracer uses relative mass differences to estimate the mass difference tolerance, high resolution MS is more suitable. Unfortunately, tracking the pure ion from scan to scan based on m/z tolerance, such as TracMass2 and PITracer, still has a disadvantage. The missing points and incorrect distribution in one scan will cause an obvious influence in tracking in the next scan. As a result, length of PICs will be shortened, and the signal will be split. Therefore, the tracking method still needs to be tested to determine whether it is good enough.²²

Binning, finding the ROI and tracking the pure ion are three popular methods for extracting useful information from LC-MS datasets. The principle of searching ROIs is that the m/z value difference between scans is small enough. The key problem is “how small is enough?”. CentWave uses a constant value as the threshold; however, the optimized value cannot be constant. PIC is created by connecting similar points from scan to scan. The key problem becomes “how to define ‘similarity’?”. TracMass2 needs a mass tolerance given by the user. PITracer estimates the tolerance by evaluating the distribution of relative mass differences between the adjacent m/z values. Neither of them gives a perfect answer.

In this study, we propose optimal k-means clustering for extracting pure ion chromatograms (KPIC) to avoid the problems mentioned above. In a LC-MS dataset, data points of each scan, which originate from the same ion always, have nearly the same m/z values, and their m/z values distribute near a constant value. It can be concluded that the m/z values of these points should have a tendency of clustering, whereas m/z values of noise are supposed to distribute randomly. On this basis, the optimal k-means clustering is applied to extract pure ion peaks adaptively from the raw LC-MS dataset. In this way, all data points in a certain region are taken into consideration instead of tracking from scan to scan. Concentrating on their clustering tendency, ions originating from the same compound can be picked out easily. After extraction, the evaluation of signal quality is performed. Good quality PICs are considered as real features. To increase the computing speed, multi-core techniques are used.²³

2. Method and theory

2.1 Description of the KPIC

In this section, KPIC will be elucidated as clearly as possible. This method comprises PIC extraction and PIC evaluation. PIC extraction is an iterative process. The work flow diagram and schematic are illustrated in Fig. 1 and 2, respectively.


	Fig. 1 The work flow of KPIC.


	Fig. 2 The clustering procedure for KPIC.

Fig. 1 illustrates the architecture and overall processes of KPIC. First, all the data points are sorted according to intensity. Then, the iterative process starts. In each iteration, (1) the point with the highest intensity is chosen as the landmark; (2) the ROI is delimited based on the landmark, and the data points in the ROI are clustered; (3) data points belonging to the same cluster as the landmark are selected, collected into a PIC list, and removed from the original data. After the highest intensity of the remaining points is below the set value, the iterative process stops.

Fig. 2 illustrates the clustering procedure. First, the centroid with the highest intensity is chosen as the landmark. Basing on the landmark, the ROI can be delimited. One can observe that most of pure ions have similar m/z values (A). If the square of the difference of the m/z for the data point and the landmark is calculated, the points have an obvious tendency of clustering (B). With the optional k-means clustering method, the points cluster into three categories (C). After clustering, only those belonging to the same cluster as the landmark are regarded as the pure ions. These points are converted into a PIC (D). More details are elucidated in the following section.

2.1.1 Landmark ion selection. A raw LC-MS dataset was converted into a three-column data matrix. These three columns collect the values of retention time (rt), m/z and intensity. The matrix was sorted according to intensity. Usually, signals with stronger intensity have a higher probability of being the real signal other than noise. Therefore, data points of all scans were sorted by intensity in a descending order. The m/z value of the centroid with the strongest intensity was chosen as the landmark. Furthermore, the data points, which are supposed to be derived from the same ion as the landmark, were selected. The selected points, together with the landmark point, constituted a PIC trace. This trace was saved into a list, and the points were removed from the original data matrix. In the next iteration, the centroid with the highest intensity in the remained data will be chosen as the landmark.

2.1.2 Region of interest of landmark ion. In order to ensure clustering effectiveness, the range for clustering has to be determined. If the range is too small, the signal may be split. If the range is too big, the algorithm may become more time-consuming. Therefore, before the first iteration, all the m/z values were sorted in ascending order and the difference in adjacent values were calculated. Furthermore, a Gaussian kernel was used to estimate the density of each relative mass difference. The value of the average density plus the deviation was set to be the minimum density. The difference value with the minimum density was recorded as the “gap”. This process is similar to PITracer. Herein, we used the absolute difference instead of the relative difference, because we only need to know the maximum tolerance. As for retention time direction, a parameter called “Range” was set. This value should be a slightly higher than the largest width of the chromatographic peak to ensure the signal not be split. Finally, rt and m/z values of data points to be clustered in each iterative should satisfy eqn (1) and (2):


abs(rt − rt₀) < t_Range	(1)


abs(m/z − m/z₀) < α × g × t_Range/t₀	(2)

where rt₀ and m/z₀ are the retention time and mass-to-charge ratio of the reference centroid, respectively. g is the “gap” value, α is an adjustment factor and t₀ is the interval of each scan.

2.1.3 Optimal k-means clustering on ROI. The k-means algorithm^24,25 is widely used for solving clustering problems, which iteratively calculates the sum of the distances within the cluster and modifies the group memberships until the sum of the distances within cluster becomes minimal. If it is assumed that the m/z values of all the scans distributed across the one-dimension m/z value axis, the m/z values of the centroids originating from the same ion will gather near a certain point. Because there are usually thousands types of ions in a full scan LC-MS dataset and many of them may have very similar m/z values, it is certain that a number of clusters will overlap. Fortunately, the retention time provides separation information, so if points within a retention time range are clustered, the clustering is very clearly.

In the k-means algorithm, one parameter, the number of clusters, should be predefined. This parameter has a significant influence on performance of clustering. Referring to this condition, it represents how many types of ions are in the choosing region, which was unknown previously. Due to the landmark ion being chosen, centroids in the ROI can be divided into three parts: (1) centroids originating from the same ion as the reference; (2) centroids originating from the ions with very similar m/z; (3) centroids originating from other ions or noise. There is no vagueness in part one and part three. However, part two is a different situation. The number of ions can be zero, one or more. In order to achieve the best performance of clustering, the parameter cannot be fixed. Optimal k-means clustering can solve this type of problem effectively.

Optimal k-means clustering in one dimension by dynamic programming (referred as ckmeans.1d.dp)²⁶ is an algorithm that guarantees optimality of clustering in 1-D. One can input a range of numbers of clusters such as 1, 2, …, k, and it works as follow:

(1) Define a sub-problem such as finding the minimum withiness (the within-cluster sum of squares for each cluster) of x₁, …, x_i into m clusters.

(2) Record the responding minima in entry D[i, m] of an n + 1 by k + 1 matrix D.

(3) Because D[j − 1, m − 1] must be the optimal withiness for the first j − 1 points in m − 1 clusters, a cyclic equation establishes the optimal substructure:


	(3)

where d(x_j, …, x_i) is the sum of squared distances between x_j, …, x_i and their means. After calculating the recurrence, D[n, m], the minimum withiness if all n numbers are clustered into m groups, can be acquired; moreover, the optimal clustering results are obtained.

Compared with the standard k-means algorithm, ckmeans.1d.dp tends to be more robust and can find the global optimum. In order to find data points with the same source ion as the reference centroid, the distance between these points and the reference is supposed to be as small as possible. Moreover, other points are far from the reference. Therefore, the square of the difference is used for running the ckmeans.1d.dp algorithm. Results show that it works better than the absolute difference does (Fig. 3). Since it is impossible that there are so many types of ions having similar m/z to the reference centroid, the range clusters of ckmeans.1d.dp is set only from one to four. Only those which belong to the same cluster as the reference are included in the PIC. If the number of points is less than 5, they will be removed from the dataset directly.


	Fig. 3 The performance of the cluster using (A) the difference between the candidate centroid and reference centroid and (B) the square of difference between them.

2.1.4 Pure ion extraction by KPIC. After clustering, data points belonging to the same cluster as the reference centroid should satisfy the following two requirements: (1) their retention time is continuous; (2) Only one point is supposed to appear in each scan. If the extracted trace is satisfactory, it will be saved into a PIC list. The process then turns to next step. Sometimes, unsatisfactory cases may exist. They are removed through the following steps. First, if the retention time is not continuous and the number of missing points is more than five, they will be divided into different sequence. Only the longest continuous sequence will be collected into the PIC list. Second, if more than one point is chosen in a scan, only the one with the nearest m/z value of the reference and the highest intensity will remain.

2.1.5 Remove unsatisfactory PICs. Not all exacted PICs are significant. Therefore, a standard of refining should be defined to evaluate the quality of the PICs. In this study, three indices are used, including Gaussian similarity, sharpness and SNR (signal to noise ratio).

Gaussian similarity: theoretically, the ideal chromatographic signals have a similar shape to the Gaussian function curve. The classical and modified Gaussian functions can be used to estimate chromatographic signals.²⁷ Gaussian similarity is also used to evaluate the goodness of a chromatographic feature.¹² Herein, 1.5 times the R² of the Gaussian fit is used as score of Gaussian similarity. R² is used for reflecting the degree of PIC similar to Gaussian curve, and is defined as follows:


	(4)

Sharpness: sharpness indicates how quickly the intensity of the signal changes with time. A high sharpness value means that feature is likely to be generated by a single ion.²⁸ Sharpness is defined as follow:


	(5)

SNR: SNR stands for signal to noise ratio, which is widely used in peak detection algorithms.^29,30 Usually noise is estimated as 95 percentage quantile of absolute continuous wavelet transform (CWT) coefficients of the smallest scale within a local window.²⁵ However, the length of the PICs is usually small and varies, so the scale of the CWT should have self-adaption with the length. Therefore, the FFT algorithm³¹ is used to perform a CWT with accelerated convolution, the smallest scale set to 2, the spacing between scales set to 0.4875 and the number of scales defined as follow:


	(6)

where L is the length of the PIC and “fix” means round down to the nearest whole unit. The maximum of the CWT coefficients of scales is used as the signal estimate;²⁶ since our PICs are short, 80% percentage quantile is used as a noise estimation.

In order to define a score function considering Gaussian similarity, sharpness and SNR simultaneously, the scores of each aspect should have a similar scale. Therefore, the sharpness score of the i_th PIC is defined as


	(7)

and the SNR score of the i_th PIC as


	(8)

Finally, the total score of the i_th PIC is defined as


Score_total,i = score_gaussian,i + score_sharpness,i + score_SNR,i	(9)

and the threshold value as


Score_threshold = avg(score) − λ × sd(score)	(10)

where λ is a parameter. PICs with scores larger than the threshold remain.

2.2 Datasets

2.2.1 Simulated dataset. 1370 metabolites are selected from HMDB,³² with their mono masses ranging from 10.5 to 1809.9. Their molecular ion peaks are simulated and their intensities are randomly set at three levels: 500–2000, 2000–10 [thin space (1/6-em)]

000 and 10 [thin space (1/6-em)]

000–30

000. It is assumed that each ion submits to a Gaussian distribution in the scan index direction, with the mean of the random value from 80 to 1900 and variances of the random value from 2 to 5. Next, each peak has a deviation, δ, in the m/z direction, and a disturbance, ξ, in the intensity direction. They are calculated as follows:


δ = m/z × R × 10⁻⁶	(11)


ξ = abs(0.05 × I × R)	(12)

where m/z and I represent the m/z and intensity of each peak, respectively. R is a random value obeying the standard normal distribution. Finally, we set the random noise signal to 300 [thin space (1/6-em)]

000 with a scan index ranging from 1 to 2000, m/z values ranging from 10 to 1900 and the intensity to 200 × R, which means numerous real peaks are covered by noise.

2.2.2 MM48 dataset. The MM48 dataset was downloaded from http://www.ebi.ac.uk/metabolights/MTBLS188. It was obtained using these steps: (1) a mixture of 48 known compounds as the pure solution was used, covering a broad mass range between 161 and 822 Da with different physico-chemical properties, (2) the pure solution at a 20 μM concentration using HPLC-MS was detected, which obtained dataset one, (3) the pure solutions at different concentrations (20, 5 μM) were spiked with extracts of Arabidopsis thaliana leaves to simulate a realistic complex system and (4) samples prepared in step 3 were detected and dataset two was obtained. More details can be found in the ref. 33.

2.2.3 Arabidopsis thaliana dataset. The metabolome dataset from seeds and leaves of Arabidopsis thaliana generated by UPLC/ESI-QTOF-MS can be downloaded at http://msbi.ipb-halle.de/msbi/centwave/. The experiment is described in the ref. 13.

2.3 Evaluation criteria

Comparison analysis is conducted to evaluate the feature extracting performance of KPIC. PITracer and centWave are chosen as references. Herein, the criteria for evaluation will be briefly described.

For the simulated dataset, the truth is known. It is easy to distinguish whether an identified feature is a true feature. The TPR (true positive rate) and the FDR (false discover rate) are both used to describe the performance.


TPR = TP/(TP + FN)	(13)


FDR = FP/(FP + TP)	(14)

where TP is the number of true positive features, FN is the number of false negative features and FP is the number of false positive features. For feature extraction, a larger TPR means more true features are extracted by a method, whereas a smaller FDR means fewer mistakes are made by a method. Therefore, with the same FDR, the method with the larger TPR has the better performance. As the parameters of a method are adjusted, a series of TPR and FDR values can be calculated. The ROC curves of different methods can be obtained by plotting TPR versus FDR, which is an informative measure to describe the performance of feature extraction.

The MM48 dataset is used to test the performance of methods in real complex systems. First, sample one is processed by each method; the extracted features are supposed to belong to these 48 known compounds. The parameters of each method are adjusted to ensure that the number of extracted features by each method is similar and as many features as possible can be extracted by all methods. Then, not changing the parameters except for the intensity threshold, sample two and sample three are processed by each method. A good method is supposed to find the features obtained by the previous procedure, even with the interference peaks produced by the substances in the leaves. Therefore, the methods are evaluated with recall.


Recall = TP/NP	(15)

where TP is the number of features of the 48 known compounds and NP is the number of all the extracted features.

The Arabidopsis thaliana dataset is used to observe the performance of methods in unknown systems. Real features are liable to be extracted by different methods. If a method provides more features in common with other methods, it tends to be reliable. In other words, if a method provides more distinctive features, it tends to be less reliable.

3. Results and discussion

3.1 PICs extraction results and comparisons

3.1.1 Simulated dataset. When processing a simulated dataset, at least one parameter of each method should be adjusted to obtain the ROC curve. The set parameters of the three methods are listed in Table 1. The feature extracting performances of the abovementioned methods were evaluated by the ROC curve (Fig. 4).

Table 1 The parameter of the three methods

Algorithm	Parameters
KPIC	Alpha = 0.1/0.5, t_Range = 10, lambda = 3, level = 500–2000
PITracer	maxGap = 5, calibration tolerance = 100, length > 25, intensity > 500–2000
centWave	Peak width = (10, 70), ppm = 150, snthresh = 3–5, prefilter = (1500–2000)


	Fig. 4 ROC curves of the three methods.

The FDR of the centWave is limited in a small range due to the robustness. The lower ceiling of the TPR than the other two methods means lower intensity features covered by noise cannot be extracted. As mentioned above, centWave is based on peak detection in ROI. The peak detection with wavelet transform is influenced by the SNR evidently. When high level noise gets into the ROI, peak detection can be affected severely. PITracer shows better performance than centWave, which might be because PITracer adjusts the mass tolerance of each scan.

The shape of the ROC curve from PITracer and KPIC is similar, whereas the ROC from KPIC shows better performance both in TPR and FDR. When extracting high intensity signals, the FDR of KPIC is zero. This is because KPIC searches centroids with intensities from high to low, not considering the noise. When it comes to low intensity features, the FDR increases gradually, but maintains a small range. The influence of signal intensity is limited in KPIC, because KPIC considers the gathering tendency of centroids, i.e., random noise does not appear continuously in groups.

3.1.2 MM48 dataset. The extraction result of sample one is summarized in a Venn diagram in Fig. 5. Usually the possibility of features being extracted by multiple methods reflects the reliability of a method. The real features are supposed to be easily extracted by different methods. More than 80% of the features can be extracted by both KPIC and centWave. This reflects the reliability of KPIC to some degree. The recall of sample 2 and sample 3 is summarized in Table 2.


	Fig. 5 Venn diagrams summarizing the results of feature detection of the three methods.

Table 2 The recall of the three methods

Algorithm	20 μM		5 μM
Algorithm	Recall	TP	Recall	TP
KPIC	0.827	2018	0.788	1924
PITracer	0.668	1234	0.435	1093
centWave	0.822	1915	0.758	1766

The results show that centWave has a similar recall to KPIC at higher concentration levels. CentWave adopts multiscale peak detection algorithm. This makes it have an excellent feature identifying performance from interference signals.

PITracer shows a worse performance than it does in simulated datasets. In simulated datasets, the noise is added randomly. Herein, the disturbing signals are not just random noise. Since PITracer traces pure ions from scan to scan, it can be affected when two ion traces tangle. KPIC considers from an overall view, and aims at the distribution tendency of centroids rather than on a single centroid, so the effect of disturbance in KPIC is limited. Compared with the centWave method, KPIC tends to identify more features with a similar recall, which is also favorable.

3.1.3 Arabidopsis thaliana dataset. A one seed dataset and a one leaf dataset are both processed for comparison. The results are shown in Venn charts (Fig. 6). The centWave method is run with the published parameters. The parameters of PITracer and KPIC are optimized through the same procedure as ref. 13.


	Fig. 6 The detection results of metabolomics, leaf and seed.

It can be seen that most of the features are simultaneously extracted by two of the three methods. This shows that these methods are reliable and the features tend to be real. The ratio of unique features of each method is similar since each method chooses features in different ways, so some small signal differences are understandable.

3.2 Quality assessment of PICs

Except for recognizing features, a good method is supposed to extract high quality chromatographic signals. The signals with better shape are not only more likely derived from meaningful analytes rather than noise, but also high quality peaks that can be used for further calculations such as integration of peak areas. Properties, such as Gaussian similarity, sharpness, and SNR, can be used for identification of the feature and evaluation of the goodness of the PICs.³⁴

We evaluated the quality of extracted PICs from PITracer and KPIC. Due to the fact that centWave is not based on a pure ion but a peak detection method on ROI, it is not considered here. The calculations for sharpness and SNR are referred to in Section 2. Here, R-square stands for the Gaussian similarity. For those not having an upper bound and not fit with Gaussian function, the values are set as zero.

Furthermore, another value, peak significance, is introduced. It is defined as the mean intensity of data points near the peak apex, divided by the mean intensity of data points near the two boundaries.¹⁸ PIC is different from EIC. It is usually very short and has very fuzzy peak boundaries. Therefore, the mean intensity of the boundaries is replaced by the first quartile of the whole PIC while the meaning remains the same.

We evaluated the peak quality of the PICs extracted from sample one of the MM48 dataset and a mixed standard solution with the criteria referred above. As was shown before, most of the features are identified by both KPIC and PITracer. The results are shown in Table 3, where the R-square, sharpness and SNR values are the mean values of the PICs extracted by each method, and the peak significance values are median values. This is because the peak significance value is very sensitive. Features extracted by KPIC have higher values in all the four indices. This indicates that KPIC extracts higher quality PICs.

Table 3 PIC quality index values of KPIC and PITracer

	R-Square	Sharpness	SNR	Significance
KPIC	0.8161	10.9632	7.5201	15.1047
PITracer	0.8033	7.5008	3.6112	6.2698

To make the result more visual, four peaks are chosen and shown in Table 4. Their R-square, sharpness, SNR and peak significance values are given individually and the corresponding PICs are plotted in Fig. 7. It is very clear that PICs with higher values are of better quality.

Table 4 The R-square, sharpness, SNR and peak significance of the chosen PICs

	m/z	rt	R-Square	Sharpness	SNR	Significance
KPIC	312.133	309.095	0.846	21.242	8.501	9.865
PITracer	312.129	309.094	0.768	9.157	2.761	2.761
KPIC	341.095	351.543	0.723	14.357	4.439	5.468
PITracer	341.086	351.542	0.540	6.880	1.258	1.503
KPIC	483.187	583.996	0.758	6.921	9.000	1.629
PITracer	483.184	584.000	0.657	1.862	2.179	4.884
KPIC	226.953	1133.80	0.572	12.906	10.932	117.739
PITracer	226.953	1133.80	0.649	3.719	1.828	2.114


	Fig. 7 Comparison of PICs with high values of R-square, sharpness, SNR, peak significance extracted by KPIC (left) and low values extracted by PITracer (right).

3.3 Performance on low concentration compounds

Detecting low concentration compounds with weak signals is as important as high concentration compounds because low concentration compounds also play an important role in samples. Therefore, when generating the simulated dataset, half of the features are set to be very weak. The performance of KPIC (ROC curve) is much better than those of the other two methods. Herein, we lay emphasis on weak signal extraction. There are 10 features with m/z values ranging from 150 to 160, and only 4 out of them can be clearly seen. After processing the KPIC, all the PICs can be extracted (Fig. 8). The results of the other two methods are summarized in Table 5. One can see from Table 5 and Fig. 8 that KPIC has significantly better performance in extracting low concentration compounds.


	Fig. 8 (A) Raw signal and (B) PICs extracted by KPIC of the m/z between 150 and 160.

Table 5 The performance of detecting weak signals

	True positive	False positive	Split signals
KPIC	10	0	0
PITracer	9	1	1
centWave	6	0	0

4. Conclusions

In this study, the KPIC method was proposed and implemented to extract PICs from raw LC-MS data for untargeted analysis of complex samples. The optimal k-means clustering was applied to discriminate the signal of the compound from noise. Unlike binning combined peak detection and scan to scan tracking, KPIC distills PICs at the peak scale. It is more reasonable than previous methods. Feature extraction performance tests on simulated MM48 and Arabidopsis thaliana datasets proved that KPIC can extract real features from complex systems, even with high level noise. By comparing the feature extraction performance and peak quality with two popular methods, the results showed that KPIC is more effective and reliable. Even when the signals are weak, KPIC can distill them effectively. Through testing on a MM48 dataset, KPIC has better reproducibility than both XCMS and PITracer. Furthermore, the quality of the PICs was also evaluated with criteria such as Gaussian similarity, sharpness and SNR. The results showed that KPIC can provide higher quality PICs than PITracer. Moreover, only a few intuitional parameters need to be optimized in KPIC.

Acknowledgements

This study is financially supported by the National Natural Science Foundation of China (Grants Nos. 21375151 and 21305163), the Hunan Provincial Natural Science Foundation of China (Grants No. 14JJ3031), the China Postdoctoral Science Foundation (No. 2014M552146), the National Instrumentation Program of China (No. 2011YQ03012407), and the Fundamental Research Funds for the Central Universities of Central South University (No. 2016zzts247)

References

B. Nathan, Nature, 2008, 455, 697–700 CrossRef PubMed.
G. A. Theodoridis, H. G. Gika, E. J. Want and I. D. Wilson, Anal. Chim. Acta, 2012, 711, 7–16 CrossRef CAS PubMed.
M. Katajamaa and M. Oresic, J. Chromatogr. A, 2007, 1158, 318–328 CrossRef CAS PubMed.
C. A. Smith, E. J. Want, G. O'Maille, R. Abagyan and G. Siuzdak, Anal. Chem., 2006, 78, 779–787 CrossRef CAS PubMed.
R. Tautenhahn, G. J. Patti, D. Rinehart and G. Siuzdak, Anal. Chem., 2012, 84, 5035–5039 CrossRef CAS PubMed.
H. Gowda, J. Ivanisevic, C. H. Johnson, M. E. Kurczy, H. P. Benton, D. Rinehart, T. Nguyen, J. Ray, J. Kuehl, B. Arevalo, P. D. Westenskow, J. Wang, A. P. Arkin, A. M. Deutschbauer, G. J. Patti and G. Siuzdak, Anal. Chem., 2014, 86, 6931–6939 CrossRef CAS PubMed.
H. P. Benton, D. M. Wong, S. A. Trauger and G. Siuzdak, Anal. Chem., 2008, 80, 6382–6389 CrossRef CAS PubMed.
M. Katajamaa, J. Miettinen and M. Oresic, Bioinformatics, 2006, 22, 634–636 CrossRef CAS PubMed.
T. Pluskal, S. Castillo, A. Villar-Briones and M. Oresic, BMC Bioinf., 2010, 11, 395 CrossRef PubMed.
A. Lommen, Anal. Chem., 2009, 81, 3079–3086 CrossRef CAS PubMed.
A. Lommen and H. J. Kools, Metabolomics, 2012, 8, 719–726 CrossRef CAS PubMed.
R. Baran, H. Kochi, N. Saito, M. Suematsu, T. Soga, T. Nishioka, M. Robert and M. Tomita, BMC Bioinf., 2006, 7, 530 CrossRef PubMed.
J. Xia, N. Psychogios, N. Young and D. S. Wishart, Nucleic Acids Res., 2009, 37, W652–W660 CrossRef CAS PubMed.
I. B. R. Helena, E. Per-Olof, O. M. Kvalheim, S. K. Ina and S. P. Jacobsson, Anal. Chem., 2003, 75, 4784–4792 CrossRef.
S. Ragnar, R. J. O. Torgrip, L. Johan, C. Leonard, K. Johan, S. K. Ina and S. P. Jacobsson, Anal. Chem., 2006, 78, 975–983 CrossRef PubMed.
R. Tautenhahn, C. Bottcher and S. Neumann, BMC Bioinf., 2008, 9, 504 CrossRef PubMed.
V. V. Mihaleva, O. Vorst, C. Maliepaard, H. A. Verhoeven, R. C. H. de Vos, R. D. Hall and R. C. H. J. van Ham, Metabolomics, 2008, 4, 171–182 CrossRef CAS.
S. Y. Wang, C. H. Kuo and Y. J. Tseng, Anal. Chem., 2015, 87, 3048–3055 CrossRef CAS PubMed.
K. M. Aberg, R. J. Torgrip, J. Kolmert, I. Schuppe-Koistinen and J. Lindberg, J. Chromatogr. A, 2008, 1192, 139–146 CrossRef PubMed.
C. J. Conley, R. Smith, R. J. Torgrip, R. M. Taylor, R. Tautenhahn and J. T. Prince, Bioinformatics, 2014, 30, 2636–2643 CrossRef CAS PubMed.
E. Tengstrand, J. Lindberg and K. M. Aberg, Anal. Chem., 2014, 86, 3435–3442 CrossRef CAS PubMed.
B. G. M. Vandeginste, Chemom. Intell. Lab. Syst., 2015, 149, 118–126 CrossRef CAS.
Z. M. Zhang, Y. Z. Liang and Q. S. Xu, Chemom. Intell. Lab. Syst., 2009, 96, 94–97 CrossRef CAS.
J. A. Hartigan and M. A. Wong, J. Appl. Stat., 1979, 28, 100–108 CrossRef.
S. P. Lloyd, IEEE Trans. Inf. Theory, 1982, 28, 129–137 CrossRef.
H. Z. Wang and M. Z. Song, The R Journal, 2011, 3, 29–33 Search PubMed.
Y. Kalambet, Y. Kozmin, K. Mikhailova, I. Nagaev and P. Tikhonov, J. Chemom., 2011, 25, 352–356 CrossRef CAS.
Y. Ni, Y. Qiu, W. Jiang, K. Suttlemyre, M. Su, W. Zhang, W. Jia and X. Du, Anal. Chem., 2012, 84, 6619–6629 CrossRef CAS PubMed.
P. Du, W. A. Kibbe and S. M. Lin, Bioinformatics, 2006, 22, 2059–2065 CrossRef CAS PubMed.
Z. M. Zhang, X. Tong, Y. Peng, P. Ma, M. J. Zhang, H. M. Lu, X. Q. Chen and Y. Z. Liang, Analyst, 2015, 140, 7955–7964 RSC.
C. Torrence and G. P. Compo, Bull. Am. Meteorol. Soc., 1998, 79, 61–78 CrossRef.
D. S. Wishart, T. Jewison, A. C. Guo, M. Wilson, C. Knox, Y. F. Liu, Y. Djoumbou, R. Mandal, F. Aziat, E. Dong, S. Bouatra, I. Sinelnikov, D. Arndt, J. G. Xia, P. Liu, F. Yallou, T. Bjorndahl, R. Perez-Pineiro, R. Eisner, F. Allen, V. Neveu, R. Greiner and A. Scalbert, Nucleic Acids Res., 2013, 41, D801–D807 CrossRef CAS PubMed.
C. Kuhl, R. Tautenhahn, C. Bottcher, T. R. Larson and S. Neumann, Anal. Chem., 2012, 84, 283–289 CrossRef CAS PubMed.
W. Zhang and P. X. Zhao, BMC Bioinf., 2014, 15(11), S5 CrossRef PubMed.

Click here to see how this site uses Cookies. View our privacy policy here.