Hongchao Ji,
Hongmei Lu* and
Zhimin Zhang*
College of Chemistry and Chemical Engineering, Central South University, Changsha 410083, PR China. E-mail: hongmeilu@csu.edu.cn; zhangzhimin@csu.edu.cn
First published on 2nd June 2016
Untargeted analysis of complex samples with liquid chromatography coupled with mass spectrometry (LC-MS) has shown a great prospect. However, it is still difficult to extract useful information from complicated LC-MS data. Recently, pure ion chromatograms (PIC) were introduced. They are effective for reducing ions not related to meaningful compounds. In this study, a novel method to extract PIC based on optimal k-means clustering (KPIC) is proposed. KPIC uses the clustering tendency of centroids of pure ions to extract PIC from raw LC-MS datasets adaptively. KPIC was tested with 3 datasets: simulated, MM48 and Arabidopsis thaliana datasets. Compared with PITracer and XCMS methods, the results show that KPIC has better accuracy of feature extraction. It is able to provide higher quality chromatographic peaks, particularly for low concentration compounds. KPIC reduces the number of split signals, due to avoiding estimation of ion mass difference tolerances subjectively. KPIC is implemented in R programming language, which is available as an open source package at https://github.com/hcji/KPIC.
Currently, most methods extract features from LC-MS data by slicing data into bins covering a narrow m/z range, which is called an extracted ion chromatogram (EIC). This strategy avoids the problem of resolving peaks in the m/z direction.4,5 Each EIC can be treated as a single chromatogram. The algorithms use for processing chromatograms are able to be directly used to treat EIC. For example, XCMS filters each EIC with a second-order Gaussian function to reduce noise first, then selects peaks based on the signal to noise ratio (SNR).6,7 Another widely used software, MZmine,8,9 works similarly. It adopts alternative filter functions, including Gaussian and Lorentzian, and considers peak width during peak selection. XCMS and MZmine are able to deal with both centroid and continuum mode data. There are numerous other methods such as MetAlign,10,11 MathDAMP,12 MetaboAnalyst13 and MSResolver,14 involving a binning procedure in the m/z direction. Although widely used, this strategy is proven to have drawbacks. Binning does not make full use of the information given by high resolution mass spectrometers, even though the bin size can be adjusted depending on the resolution. Moreover, it is difficult to define optimal bins for all conditions,15 and some signals from the same analyte are liable to be split into two adjacent bins.4
To circumvent drawbacks of binning, another method called centWave16 is integrated into XCMS. It utilizes a regions-of-interest (ROI) function, which aims to detect features from raw data produced by high resolution mass spectrometry such as TOF and orbitrap. The ROI function locates the region that is supposed to contain potentially interesting masses and finds ROIs based on the mass accuracy. Therefore, there is no need to perform binning. Multiple scale continuous wavelet transform (CWT) is conducted to detect peaks of ROI. CentWave avoids disadvantages of binning to some degree, and multiple scale peak detection is more sensitive than the matched filter method. However, the relative mass differences between the adjacent ions are correlated with intensity.17 Furthermore, relative mass difference tolerances of different MS instruments are also different. Hence, a fixed relative mass difference tolerance is still not satisfactory due to the inconsistent relative mass differences.18
Recently, a new strategy of extracting pure ion chromatograms (PIC) has been proposed. This strategy allows for the full analytical power of a high-resolution mass spectrometer to be utilized. “Pure ion” is defined as an ion originating from the same analyte, and PIC is a chromatogram of the “pure ion”. Many packages based on PIC have been developed, including TracMass,19 Massifquant,20 TracMass2,21 PITracer.15 TracMass treats values of m/z and intensity of pure ions, changing with retention time as the object moves in the radar images. With the change being continuous and smooth, it can be predicted by Kalman tracking. TracMass extracts PIC by connecting the data points in the predicting region one by one. This method simultaneously considers nearly all the information produced by LC-MS, including retention time, m/z value and intensity. Unfortunately, TracMass is not an open source. Therefore, Massifquant, which is similar to TracMass, was proposed after years. Massifquant is written in C and R language, and its code is open source. However, the Kalman filter, used by TracMass and Massifquant, is complicated and time-consuming. Apart from this though, the change in the m/z value is extremely small, whereas in the intensity direction, the value changes sharply and drastically. As a result, the Kalman filter does not seem very suitable in this situation.
TracMass2 gives up the Kalman filter and tracks the pure ion depending only on m/z values of data points by a nearest neighbor strategy. It works as follow: the sample will be scanned several times. In the first scan, each point is assigned a PIC ID. In the next scan, if the m/z value of a data point is sufficiently close to that of a point in the previous scan, the data point is assigned the same PIC ID; if the value of m/z difference is over the tolerance, a new PIC ID will be assigned. After iteratively computing, all PICs are found. Compared with TracMass, TracMass2 is proven to be faster. PITracer goes a step forward. It calibrates the mass values of ions according to a reference ion and adjusts the mass tolerance of each scan. Since PITracer uses relative mass differences to estimate the mass difference tolerance, high resolution MS is more suitable. Unfortunately, tracking the pure ion from scan to scan based on m/z tolerance, such as TracMass2 and PITracer, still has a disadvantage. The missing points and incorrect distribution in one scan will cause an obvious influence in tracking in the next scan. As a result, length of PICs will be shortened, and the signal will be split. Therefore, the tracking method still needs to be tested to determine whether it is good enough.22
Binning, finding the ROI and tracking the pure ion are three popular methods for extracting useful information from LC-MS datasets. The principle of searching ROIs is that the m/z value difference between scans is small enough. The key problem is “how small is enough?”. CentWave uses a constant value as the threshold; however, the optimized value cannot be constant. PIC is created by connecting similar points from scan to scan. The key problem becomes “how to define ‘similarity’?”. TracMass2 needs a mass tolerance given by the user. PITracer estimates the tolerance by evaluating the distribution of relative mass differences between the adjacent m/z values. Neither of them gives a perfect answer.
In this study, we propose optimal k-means clustering for extracting pure ion chromatograms (KPIC) to avoid the problems mentioned above. In a LC-MS dataset, data points of each scan, which originate from the same ion always, have nearly the same m/z values, and their m/z values distribute near a constant value. It can be concluded that the m/z values of these points should have a tendency of clustering, whereas m/z values of noise are supposed to distribute randomly. On this basis, the optimal k-means clustering is applied to extract pure ion peaks adaptively from the raw LC-MS dataset. In this way, all data points in a certain region are taken into consideration instead of tracking from scan to scan. Concentrating on their clustering tendency, ions originating from the same compound can be picked out easily. After extraction, the evaluation of signal quality is performed. Good quality PICs are considered as real features. To increase the computing speed, multi-core techniques are used.23
Fig. 1 illustrates the architecture and overall processes of KPIC. First, all the data points are sorted according to intensity. Then, the iterative process starts. In each iteration, (1) the point with the highest intensity is chosen as the landmark; (2) the ROI is delimited based on the landmark, and the data points in the ROI are clustered; (3) data points belonging to the same cluster as the landmark are selected, collected into a PIC list, and removed from the original data. After the highest intensity of the remaining points is below the set value, the iterative process stops.
Fig. 2 illustrates the clustering procedure. First, the centroid with the highest intensity is chosen as the landmark. Basing on the landmark, the ROI can be delimited. One can observe that most of pure ions have similar m/z values (A). If the square of the difference of the m/z for the data point and the landmark is calculated, the points have an obvious tendency of clustering (B). With the optional k-means clustering method, the points cluster into three categories (C). After clustering, only those belonging to the same cluster as the landmark are regarded as the pure ions. These points are converted into a PIC (D). More details are elucidated in the following section.
abs(rt − rt0) < tRange | (1) |
abs(m/z − m/z0) < α × g × tRange/t0 | (2) |
In the k-means algorithm, one parameter, the number of clusters, should be predefined. This parameter has a significant influence on performance of clustering. Referring to this condition, it represents how many types of ions are in the choosing region, which was unknown previously. Due to the landmark ion being chosen, centroids in the ROI can be divided into three parts: (1) centroids originating from the same ion as the reference; (2) centroids originating from the ions with very similar m/z; (3) centroids originating from other ions or noise. There is no vagueness in part one and part three. However, part two is a different situation. The number of ions can be zero, one or more. In order to achieve the best performance of clustering, the parameter cannot be fixed. Optimal k-means clustering can solve this type of problem effectively.
Optimal k-means clustering in one dimension by dynamic programming (referred as ckmeans.1d.dp)26 is an algorithm that guarantees optimality of clustering in 1-D. One can input a range of numbers of clusters such as 1, 2, …, k, and it works as follow:
(1) Define a sub-problem such as finding the minimum withiness (the within-cluster sum of squares for each cluster) of x1, …, xi into m clusters.
(2) Record the responding minima in entry D[i, m] of an n + 1 by k + 1 matrix D.
(3) Because D[j − 1, m − 1] must be the optimal withiness for the first j − 1 points in m − 1 clusters, a cyclic equation establishes the optimal substructure:
![]() | (3) |
Compared with the standard k-means algorithm, ckmeans.1d.dp tends to be more robust and can find the global optimum. In order to find data points with the same source ion as the reference centroid, the distance between these points and the reference is supposed to be as small as possible. Moreover, other points are far from the reference. Therefore, the square of the difference is used for running the ckmeans.1d.dp algorithm. Results show that it works better than the absolute difference does (Fig. 3). Since it is impossible that there are so many types of ions having similar m/z to the reference centroid, the range clusters of ckmeans.1d.dp is set only from one to four. Only those which belong to the same cluster as the reference are included in the PIC. If the number of points is less than 5, they will be removed from the dataset directly.
Gaussian similarity: theoretically, the ideal chromatographic signals have a similar shape to the Gaussian function curve. The classical and modified Gaussian functions can be used to estimate chromatographic signals.27 Gaussian similarity is also used to evaluate the goodness of a chromatographic feature.12 Herein, 1.5 times the R2 of the Gaussian fit is used as score of Gaussian similarity. R2 is used for reflecting the degree of PIC similar to Gaussian curve, and is defined as follows:
![]() | (4) |
Sharpness: sharpness indicates how quickly the intensity of the signal changes with time. A high sharpness value means that feature is likely to be generated by a single ion.28 Sharpness is defined as follow:
![]() | (5) |
SNR: SNR stands for signal to noise ratio, which is widely used in peak detection algorithms.29,30 Usually noise is estimated as 95 percentage quantile of absolute continuous wavelet transform (CWT) coefficients of the smallest scale within a local window.25 However, the length of the PICs is usually small and varies, so the scale of the CWT should have self-adaption with the length. Therefore, the FFT algorithm31 is used to perform a CWT with accelerated convolution, the smallest scale set to 2, the spacing between scales set to 0.4875 and the number of scales defined as follow:
![]() | (6) |
In order to define a score function considering Gaussian similarity, sharpness and SNR simultaneously, the scores of each aspect should have a similar scale. Therefore, the sharpness score of the ith PIC is defined as
![]() | (7) |
![]() | (8) |
Finally, the total score of the ith PIC is defined as
Scoretotal,i = scoregaussian,i + scoresharpness,i + scoreSNR,i | (9) |
Scorethreshold = avg(score) − λ × sd(score) | (10) |
δ = m/z × R × 10−6 | (11) |
ξ = abs(0.05 × I × R) | (12) |
For the simulated dataset, the truth is known. It is easy to distinguish whether an identified feature is a true feature. The TPR (true positive rate) and the FDR (false discover rate) are both used to describe the performance.
TPR = TP/(TP + FN) | (13) |
FDR = FP/(FP + TP) | (14) |
The MM48 dataset is used to test the performance of methods in real complex systems. First, sample one is processed by each method; the extracted features are supposed to belong to these 48 known compounds. The parameters of each method are adjusted to ensure that the number of extracted features by each method is similar and as many features as possible can be extracted by all methods. Then, not changing the parameters except for the intensity threshold, sample two and sample three are processed by each method. A good method is supposed to find the features obtained by the previous procedure, even with the interference peaks produced by the substances in the leaves. Therefore, the methods are evaluated with recall.
Recall = TP/NP | (15) |
The Arabidopsis thaliana dataset is used to observe the performance of methods in unknown systems. Real features are liable to be extracted by different methods. If a method provides more features in common with other methods, it tends to be reliable. In other words, if a method provides more distinctive features, it tends to be less reliable.
Algorithm | Parameters |
---|---|
KPIC | Alpha = 0.1/0.5, tRange = 10, lambda = 3, level = 500–2000 |
PITracer | maxGap = 5, calibration tolerance = 100, length > 25, intensity > 500–2000 |
centWave | Peak width = (10, 70), ppm = 150, snthresh = 3–5, prefilter = (1500–2000) |
The FDR of the centWave is limited in a small range due to the robustness. The lower ceiling of the TPR than the other two methods means lower intensity features covered by noise cannot be extracted. As mentioned above, centWave is based on peak detection in ROI. The peak detection with wavelet transform is influenced by the SNR evidently. When high level noise gets into the ROI, peak detection can be affected severely. PITracer shows better performance than centWave, which might be because PITracer adjusts the mass tolerance of each scan.
The shape of the ROC curve from PITracer and KPIC is similar, whereas the ROC from KPIC shows better performance both in TPR and FDR. When extracting high intensity signals, the FDR of KPIC is zero. This is because KPIC searches centroids with intensities from high to low, not considering the noise. When it comes to low intensity features, the FDR increases gradually, but maintains a small range. The influence of signal intensity is limited in KPIC, because KPIC considers the gathering tendency of centroids, i.e., random noise does not appear continuously in groups.
Algorithm | 20 μM | 5 μM | ||
---|---|---|---|---|
Recall | TP | Recall | TP | |
KPIC | 0.827 | 2018 | 0.788 | 1924 |
PITracer | 0.668 | 1234 | 0.435 | 1093 |
centWave | 0.822 | 1915 | 0.758 | 1766 |
The results show that centWave has a similar recall to KPIC at higher concentration levels. CentWave adopts multiscale peak detection algorithm. This makes it have an excellent feature identifying performance from interference signals.
PITracer shows a worse performance than it does in simulated datasets. In simulated datasets, the noise is added randomly. Herein, the disturbing signals are not just random noise. Since PITracer traces pure ions from scan to scan, it can be affected when two ion traces tangle. KPIC considers from an overall view, and aims at the distribution tendency of centroids rather than on a single centroid, so the effect of disturbance in KPIC is limited. Compared with the centWave method, KPIC tends to identify more features with a similar recall, which is also favorable.
It can be seen that most of the features are simultaneously extracted by two of the three methods. This shows that these methods are reliable and the features tend to be real. The ratio of unique features of each method is similar since each method chooses features in different ways, so some small signal differences are understandable.
We evaluated the quality of extracted PICs from PITracer and KPIC. Due to the fact that centWave is not based on a pure ion but a peak detection method on ROI, it is not considered here. The calculations for sharpness and SNR are referred to in Section 2. Here, R-square stands for the Gaussian similarity. For those not having an upper bound and not fit with Gaussian function, the values are set as zero.
Furthermore, another value, peak significance, is introduced. It is defined as the mean intensity of data points near the peak apex, divided by the mean intensity of data points near the two boundaries.18 PIC is different from EIC. It is usually very short and has very fuzzy peak boundaries. Therefore, the mean intensity of the boundaries is replaced by the first quartile of the whole PIC while the meaning remains the same.
We evaluated the peak quality of the PICs extracted from sample one of the MM48 dataset and a mixed standard solution with the criteria referred above. As was shown before, most of the features are identified by both KPIC and PITracer. The results are shown in Table 3, where the R-square, sharpness and SNR values are the mean values of the PICs extracted by each method, and the peak significance values are median values. This is because the peak significance value is very sensitive. Features extracted by KPIC have higher values in all the four indices. This indicates that KPIC extracts higher quality PICs.
R-Square | Sharpness | SNR | Significance | |
---|---|---|---|---|
KPIC | 0.8161 | 10.9632 | 7.5201 | 15.1047 |
PITracer | 0.8033 | 7.5008 | 3.6112 | 6.2698 |
To make the result more visual, four peaks are chosen and shown in Table 4. Their R-square, sharpness, SNR and peak significance values are given individually and the corresponding PICs are plotted in Fig. 7. It is very clear that PICs with higher values are of better quality.
m/z | rt | R-Square | Sharpness | SNR | Significance | |
---|---|---|---|---|---|---|
KPIC | 312.133 | 309.095 | 0.846 | 21.242 | 8.501 | 9.865 |
PITracer | 312.129 | 309.094 | 0.768 | 9.157 | 2.761 | 2.761 |
KPIC | 341.095 | 351.543 | 0.723 | 14.357 | 4.439 | 5.468 |
PITracer | 341.086 | 351.542 | 0.540 | 6.880 | 1.258 | 1.503 |
KPIC | 483.187 | 583.996 | 0.758 | 6.921 | 9.000 | 1.629 |
PITracer | 483.184 | 584.000 | 0.657 | 1.862 | 2.179 | 4.884 |
KPIC | 226.953 | 1133.80 | 0.572 | 12.906 | 10.932 | 117.739 |
PITracer | 226.953 | 1133.80 | 0.649 | 3.719 | 1.828 | 2.114 |
![]() | ||
Fig. 7 Comparison of PICs with high values of R-square, sharpness, SNR, peak significance extracted by KPIC (left) and low values extracted by PITracer (right). |
True positive | False positive | Split signals | |
---|---|---|---|
KPIC | 10 | 0 | 0 |
PITracer | 9 | 1 | 1 |
centWave | 6 | 0 | 0 |
This journal is © The Royal Society of Chemistry 2016 |