Jian Guo, Huaxu Yu, Shipei Xing and Tao Huan*
Department of Chemistry, University of British Columbia, 2036 Main Mall, Vancouver, BC V6T 1Z1, Canada. E-mail: thuan@chem.ubc.ca
First published on 16th August 2022
Advancements in computer science and software engineering have greatly facilitated mass spectrometry (MS)-based untargeted metabolomics. Nowadays, gigabytes of metabolomics data are routinely generated from MS platforms, containing condensed structural and quantitative information on thousands of metabolites. Manual data processing is almost impossible due to the large data size. Therefore, in the “omics” era, we face new big data challenges: how to accurately and efficiently process the raw data, extract the biological information, and visualize the results from the gigantic amount of collected data. Although important, proposing solutions to these big data challenges requires broad interdisciplinary knowledge, which can be challenging for many metabolomics practitioners. Our laboratory in the Department of Chemistry at the University of British Columbia is committed to combining analytical chemistry, computer science, and statistics to develop bioinformatics tools that address these big data challenges. In this Feature Article, we elaborate on the major big data challenges in metabolomics, including data acquisition, feature extraction, quantitative measurements, statistical analysis, and metabolite annotation. We also introduce our recently developed bioinformatics solutions to these challenges. Notably, all of the bioinformatics tools and source code are freely available on GitHub (https://www.github.com/HuanLab) and are regularly revised and updated.
| Category | Software name | Purpose |
|---|---|---|
| Data acquisition | DaDIA13 | Combining DDA and DIA modes for metabolomics data acquisition |
| Feature extraction | Paramounter14 | Directly measuring the optimal feature extraction parameters |
| | Integrated Feature Extraction,15 JPA16 | Extracting metabolic features of both high and low confidence |
| | EVA17 | Evaluating feature fidelity using chromatographic peak shapes |
| | ISFrag18 | De novo annotation of false positive metabolic features generated from in-source fragmentation |
| Quantitative comparison and statistical analysis | MRC19 | Correcting fold change compression and inflation in MS-based metabolomics |
| | MAFFIN20 | Post-acquisition sample normalization |
| | PHPA_precision21 | Correcting computational variation caused by peak height or peak area-based quantification |
| | PowerU22 | Improving the statistical power of MS-based metabolomics |
| | ABC Transformation23 | Improving data normality with feature-specific data transformation |
| Metabolite annotation | HNL, CSS, and McSearch24 | Concept, algorithm, and web platform to perform spectral similarity analysis and molecular networking |
| | MS2Purifier25 | Recognizing and removing contamination fragment ions in experimental MS/MS spectra |
| | SteroidXtract26 | Extracting steroid-like metabolic features based on their unique MS/MS patterns |
To address the knowledge gap, we systematically compared the three abovementioned data acquisition modes, focusing on their performance in metabolomics profiling36 and their ability to identify statistically significant features.37 In the comparison of metabolomics profiling, we assessed the number of features, MS/MS spectral coverage and quality, quantitative precision, and data processing convenience (Fig. 2). Our results show that full-scan data yield the most metabolic features, 53.7% and 64.8% more than DIA and DDA, respectively. In terms of MS/MS spectra, DDA generates higher quality MS/MS spectra that match MS/MS libraries better, whereas DIA has higher MS/MS spectral coverage. Regarding quantitative precision, no significant difference was observed among the three acquisition modes. In the comparison of significant features discovered across these acquisition modes, we concluded that the consistently discovered ones are mostly true positive features (i.e., real metabolic features).37 They have a strong correlation in abundance among all three modes and present similar statistical performance. Conversely, many uniquely discovered significant features are false positives arising from background noise and system contamination.37 Although DDA slightly underperforms full-scan and DIA in significant metabolic feature discovery, it is the most convenient method for obtaining high-quality MS/MS spectra for metabolite annotation.
Following the comparison of these data acquisition modes, we believe that a better data acquisition strategy that integrates the advantages of the existing methods is essential for advancing LC-MS-based metabolomics. We thus developed data-dependent assisted data-independent acquisition (DaDIA).13 The DaDIA workflow performs DIA analyses of biological samples and DDA analyses of the pooled quality control (QC) samples analysed at regular intervals between biological samples throughout the analytical sequence (Fig. 2). The DIA analyses provide high coverage of metabolic features and MS/MS spectra, and the DDA analyses generate high-quality MS/MS spectra to improve the overall confidence of metabolite annotation. We further developed an R package, DaDIA.R, to automate the data processing and metabolite annotation of DaDIA data. Since DaDIA takes full advantage of DDA and DIA, it achieves a much higher coverage of metabolic features with better spectral quality. DaDIA was applied to a study comparing the metabolic alteration in the plasma of leukemia patients before and after receiving chemotherapy. Our results demonstrated that the DaDIA workflow can efficiently detect and annotate approximately four times more significantly altered metabolites than the conventional DDA workflow.
Another notable challenge in feature extraction is that conventional peak picking algorithms are unable to completely extract features with low abundance or poor chromatographic peak shapes. In particular, many real metabolic features have valid MS/MS spectra but cannot be extracted by conventional peak picking algorithms. In this regard, we designed a peak picking algorithm that can directly extract metabolic features based on their available MS/MS spectra in the raw DDA-based LC-MS data. We also combined this MS/MS spectra-based peak picking algorithm with conventional peak shape-based peak picking to build an integrated workflow for a more comprehensive extraction of metabolic features.15 The proposed integrated feature extraction algorithm extracted 25% more metabolic features from a human urine sample than the conventional centWave-based feature extraction algorithm with the same parameter settings. Furthermore, we created a targeted feature extraction algorithm for use with a targeted list of metabolites with known m/z and retention time. Combining the peak shape-, MS/MS spectra- and targeted list-based peak picking strategies, we constructed JPA (short for joint metabolic feature extraction and automated metabolite annotation), an R package that not only extracts the metabolic features using the integrated strategy but also performs the automated metabolite annotation. When the three algorithms were applied together on a mixture of 134 endogenous metabolite standards, JPA demonstrated superior feature detection sensitivity by reaching a limit of detection (LOD) thousands of times lower than the conventional centWave peak picking algorithm. Moreover, JPA also surpassed the conventional centWave algorithm by detecting 2.3-fold more exposure chemicals from a standard mixture containing 505 drugs and pesticides.16
On the other hand, enhancing feature extraction sensitivity usually comes with an increase in false positive features. False positive metabolic features can reduce the confidence of downstream statistical and biological interpretations.38 A common practice to find false positive features is to manually check the peak shapes of the extracted ion chromatograms (EICs) of metabolic features. Typically, real metabolites are more likely to have Gaussian-shaped chromatographic peaks. Manual checking of EICs is very effective in filtering out false positive noise peaks, as an experienced analytical chemist can easily differentiate between true and false positive features by simply looking at their peak shapes. However, metabolomics data contain thousands of metabolic features, and manual inspection of their EICs is extremely labour-intensive and time-consuming. Previous work developed a strategy to send the data to a smartphone application so that users can manually check the peak shapes, but the time spent on manual checking was not substantially reduced.55 To replace this tedious process with a labour-free task, we developed an artificial intelligence-based program using a convolutional neural network (CNN) model, well known for its efficient performance in image classification.56 CNN is a type of deep learning algorithm, along with artificial neural networks (ANN), recurrent neural networks (RNN), and many others.57 Compared to traditional machine learning (e.g., support vector machine and random forest), the biggest advantage of these deep learning algorithms is that they can learn high-level features from data without manual feature engineering and are very efficient with large-scale data sets.58 Notably, the metabolomics community has recognized the potential of deep learning in metabolomics.58,59 Previous research efforts have incorporated deep learning in metabolic feature extraction, metabolite annotation, and the prediction of retention time and collision cross section.60–69 Regarding the classification of good and bad chromatographic peak shapes, a previous work applied CNN to the classification of GC-MS chromatographic peaks.70 In another study, CNN was used to determine the chromatographic peak shapes in LC-MS-based metabolomics data.71 However, that workflow involves multiple R and Python scripts that users have to run individually. Due to its small training data size, users would also need to retrain the model for different LC-MS conditions, such as different spectral acquisition rates.
In our work, we aimed to develop a robust and easy-to-use CNN model for chromatographic peak recognition. To minimize data overfitting and ensure the robustness of the model, we trained our CNN model with over 25,000 manually inspected plots of true and false chromatographic peaks generated from 22 different LC-MS-based metabolomics studies. Furthermore, we created a Windows application named EVA (short for evaluation of chromatographic peak shapes) for the convenience of metabolomics researchers with limited programming experience.17 Evaluated using metabolomics data from different MS instruments and acquisition rates, EVA was shown to achieve over 90% classification accuracy when referenced against manual checking results. Notably, another work following a similar CNN strategy was published later;72 future work is needed to compare their performance.
Removing false positive features with poor peak shapes is not the last stop, as metabolic features with good EIC peak shapes still might not be real metabolites. Another common type of false positive feature originates from in-source fragmentation (ISF). In LC-MS analysis, ions generated during electrospray ionization (ESI) are always accompanied by ion fragmentation, which leads to ISF ions. ISF is a naturally occurring and inevitable phenomenon that is independent of ionization voltage.73 Annotating ISF features as real metabolites by mistake is detrimental to the downstream biological interpretation. Previous works have developed strategies to recognize ISF via manual efforts, stable isotopes, or reference standards.74–78 However, many metabolic features have diverse ISF patterns and might not have a standard MS/MS spectrum available for such manual checking. To provide an automated workflow for de novo recognition of ISF features, we developed the MS/MS library-free R package ISFrag to seek ISF features based on three patterns: (1) ISF ions coelute with their precursor ion, (2) the m/z of ISF ions appear in the MS/MS spectrum of their precursor ion, and (3) ISF ions and their precursor ion are similar in fragmentation patterns and thus have highly correlated MS/MS spectra.18 Notably, ISFrag can be used on LC-MS data generated from full-scan, DIA, and DDA modes as long as at least one DDA analysis is performed to provide the MS/MS spectra required by ISFrag. Our results show that ISFrag achieves 100% accuracy in recognizing the ISF features that fit all three abovementioned patterns from the data of a standard mixture containing 125 endogenous metabolites. ISFrag allowed us to successfully recognize falsely annotated metabolites in a human urine dataset, determining them to be in-source fragments.
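The three ISF patterns above can be checked programmatically. The following is a minimal sketch, not the ISFrag implementation: the function names, thresholds, and greedy spectral matching are our own simplifications for illustration.

```python
import numpy as np

def spectral_cosine(spec_a, spec_b, mz_tol=0.01):
    """Cosine similarity between two (m/z, intensity) spectra after greedy
    m/z matching within mz_tol."""
    a, b = np.asarray(spec_a, float), np.asarray(spec_b, float)
    ia, ib, used = [], [], set()
    for mz, inten in a:
        d = np.abs(b[:, 0] - mz)
        j = int(np.argmin(d))
        if d[j] <= mz_tol and j not in used:
            used.add(j)
            ia.append(inten)
            ib.append(b[j, 1])
    if not ia:
        return 0.0
    va, vb = np.array(ia), np.array(ib)
    return float(va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb)))

def is_isf_candidate(frag_mz, frag_eic, prec_eic, prec_ms2,
                     frag_ms2=None, mz_tol=0.01, corr_cut=0.9, sim_cut=0.7):
    """Test one feature against one putative precursor using the three ISF
    patterns. EICs are intensity traces on a shared retention-time grid;
    prec_ms2 / frag_ms2 are (m/z, intensity) lists."""
    prec_ms2 = np.asarray(prec_ms2, float)
    # Pattern 1: the candidate must co-elute with its putative precursor.
    coelutes = np.corrcoef(frag_eic, prec_eic)[0, 1] >= corr_cut
    # Pattern 2: the candidate's m/z appears in the precursor's MS/MS spectrum.
    in_ms2 = bool(np.any(np.abs(prec_ms2[:, 0] - frag_mz) <= mz_tol))
    # Pattern 3 (when both MS/MS spectra exist): highly similar fragmentation.
    similar = True
    if frag_ms2 is not None:
        similar = spectral_cosine(frag_ms2, prec_ms2, mz_tol) >= sim_cut
    return bool(coelutes and in_ms2 and similar)
```

A feature is retained as an ISF candidate only when every available pattern holds, mirroring the conjunctive logic described above.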
Comparing samples fairly, on the basis of equal amounts or concentrations, is also critical to quantitative accuracy and is achieved by sample normalization. Sample normalization is especially important for biological samples with significant biological dilution effects, such as urine, saliva, and feces.80–83 In general, sample normalization can be applied either before or after data acquisition. Pre-acquisition sample normalization measures a certain quantity that reflects the total sample amount or metabolite concentration. The samples are then reconstituted to appropriate final volumes based on the measured quantities to make the total concentration consistent between samples. For example, creatinine level generally reflects urine concentration and is commonly used for normalizing urine samples.80,84,85 However, many biological sample types lack reliable quantities that represent the total sample amount for normalization.84
Post-acquisition sample normalization is an alternative strategy that is data-driven and does not require reliable quantities for different sample types. Certain assumptions are made about the metabolomics data structure, and normalization factors are then calculated for adjusting the measured signal intensities. In essence, the accuracy of the assumption determines the performance of post-acquisition sample normalization. For instance, the mass spectrum total useful signal (MSTUS) algorithm assumes equal total signal intensities among samples. However, the MS signal intensities of different metabolic features can vary by several orders of magnitude owing to different concentrations and ionization efficiencies. Therefore, the MSTUS algorithm can be dominated by high-intensity metabolic features and fail to reflect the change in the total metabolome. Probabilistic quotient normalization (PQN) addresses the issue of drastically different signal intensities and thus can be more useful. In the PQN-based workflow, quotients of all metabolic features are calculated against a reference sample, and the median of the calculated quotients is used as the normalization factor.86 After quotient calculation, all metabolic features are considered equally for normalization regardless of their original MS intensities. However, the median quotient correctly represents the normalization factor only when the numbers of up- and down-regulated metabolic features are equal,20 and it is quite common to see unequal numbers of up- and down-regulated metabolic features in metabolomics. In addition, given the abovementioned issue of signal ratio bias, calculating quotients directly from MS signals might not represent metabolic concentration changes accurately. Moreover, conventional post-acquisition normalization algorithms do not pre-process the detected metabolic features before normalization, which introduces further bias.
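The PQN calculation described above fits in a few lines. This is a simplified sketch; the choice of feature-wise median as the reference spectrum and the omission of zero handling are our own assumptions.

```python
import numpy as np

def pqn_normalize(X, reference=None):
    """Probabilistic quotient normalization.

    X         : samples x features intensity matrix (assumed free of zeros).
    reference : reference spectrum; defaults to the feature-wise median sample.
    Returns the normalized matrix and the per-sample dilution factors.
    """
    X = np.asarray(X, float)
    if reference is None:
        reference = np.median(X, axis=0)     # median sample as reference
    quotients = X / reference                # per-feature quotients
    factors = np.median(quotients, axis=1)   # median quotient per sample
    return X / factors[:, None], factors
```

Dividing each sample by its median quotient equalizes the apparent dilution, which is exactly where the balanced up/down-regulation assumption criticized above enters.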
In this regard, we developed MAFFIN (short for maximal density fold change normalization with high-quality metabolic features and corrected signal intensities), an accurate and robust post-acquisition sample normalization workflow for MS-generated metabolomics data that is independent of sample type (Fig. 4b).20 MAFFIN first selects high-quality metabolic features by evaluating multiple orthogonal quantification criteria and then corrects their MS signal intensities for normalization. Then, we created an efficient method to calculate normalization factors, which is based on the maximal density fold change (MDFC) computed by a kernel density approach.87 Unlike the PQN algorithm, which relies on balanced up- and down-regulated metabolic features, MDFC normalization assumes that the unchanged metabolic features dominate the fold change frequency. Hence, it is not influenced by the balance of up- and down-regulated metabolic features. Using simulated data, we show that as long as the percentage of unchanged metabolic features is larger than 25%, MDFC is a good representation of the true normalization factor.20 Using twenty publicly available and two in-house metabolomics data sets, we confirmed that MAFFIN outperforms four commonly used post-acquisition normalization methods, including total intensity, median intensity, PQN, and quantile normalizations, in terms of reducing intragroup variations. The biological application of MAFFIN on a human saliva metabolomics study reduces the unwanted variation introduced by the biological dilution effect, leading to better data separation in principal component analysis (PCA) and more significantly altered metabolic features.
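The core MDFC idea, taking the mode of the fold-change distribution rather than its median, can be sketched as below. This is a minimal illustration using a log-space Gaussian kernel density, not the MAFFIN implementation, which additionally selects high-quality features and corrects MS signal intensities first.

```python
import numpy as np
from scipy.stats import gaussian_kde

def mdfc_factor(sample, reference):
    """Maximal-density fold change: the mode of the per-feature log2
    fold-change distribution, located with a Gaussian kernel density
    estimate, is taken as the sample's normalization factor."""
    log_fc = np.log2(np.asarray(sample, float) / np.asarray(reference, float))
    kde = gaussian_kde(log_fc)
    grid = np.linspace(log_fc.min(), log_fc.max(), 512)
    return float(2.0 ** grid[np.argmax(kde(grid))])
```

Because the mode tracks the densest cluster of fold changes, the estimate stays anchored on the unchanged features even when up- and down-regulated features are unbalanced.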
Quantitative precision is another key factor for a successful metabolomics study. Our recent work shows that besides the well-recognized analytical and biological variations, untargeted metabolomics encounters an additional source of quantitative variation, termed computational variation.21 Computational variation arises from automated data processing steps in which the software cannot accurately determine chromatographic peak heights or areas for metabolic features with poor chromatographic peak shapes (Fig. 4c).16 Using various biological sample types, we systematically investigated how sample concentration, LC separation conditions, and data processing software contribute to computational variation. Our results suggest that computational variation is largely determined by the data processing software. In addition, the magnitude of computational variation is consistent across different samples when their metabolite concentrations are similar. We further developed PHPA_precision, a tool that minimizes computational variation by properly selecting between peak height and peak area as the quantification metric. This bioinformatics solution reduced the computational variation of 71% (652/915) of metabolic features, and over 31% (206/652) of the corrected features showed distinctly changed statistical significance.
Following the quantitative comparison, our lab also attempted to understand metabolomics data distributions in order to improve the performance of statistical analyses. Currently, parametric statistical models, such as Student's t-test, are widely used to extract the significantly changed metabolites. However, the requirements on data normality for these statistical analyses are often violated due to the nonlinear ESI responses in MS-based metabolomics. As a result, the statistical power can be reduced and some significantly changed metabolites are thus missed. Although nonlinear ESI response has been well-known for decades, its impact on data distribution and statistical analysis has not been systematically studied. To address this knowledge gap, we used both Monte Carlo simulations and real metabolomics data sets to quantitatively assess the diminished statistical power caused by nonlinear ESI responses (Fig. 4d). Our urine metabolomics data demonstrated that over 80% of metabolic features present nonlinear ESI response patterns, causing either left-skewed or right-skewed MS signal distributions.22 In addition, clear relationships between the degree of reduced statistical power and sample size/effect size were observed. To address this issue, we developed PowerU, a data processing tool to minimize the non-normality induced by nonlinear ESI response.22 Applying PowerU to a metabolomics study of mouse gut microbiome led to 105 extra metabolic features being discovered as significant, which largely reduces the chance of missing important biomarkers.
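The distortion of data normality by a nonlinear response can be reproduced with a toy Monte Carlo simulation in the spirit of the study above. The saturating power-law response y = c**b is an assumed illustrative stand-in, not a fitted ESI model.

```python
import numpy as np
from scipy.stats import skew

def simulate_esi_skew(n=5000, b=0.5, seed=0):
    """Monte Carlo sketch: near-normal concentrations passed through a
    saturating power-law response y = c**b acquire a left-skewed signal
    distribution, illustrating how nonlinear ESI response breaks the
    normality assumed by parametric tests."""
    rng = np.random.default_rng(seed)
    conc = rng.normal(100.0, 10.0, n)   # near-normal metabolite concentrations
    signal = conc ** b                  # nonlinear (saturating) detector response
    return float(skew(conc)), float(skew(signal))
```

Running the simulation shows the concentration skewness near zero while the signal skewness is pushed negative, the left-skewed case described above; a convex response (b > 1) would skew it right instead.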
Besides nonlinear signal response, many other factors contribute to the overall non-normal distribution of metabolomics data, including intrinsically non-normally distributed concentrations, sample collection, and sample preparation. As a result, metabolomics data distributions are often diverse and complicated. However, despite the thousands of metabolomics publications every year, studies of metabolomics data distribution remain limited. In routine metabolomics practice, data transformation is commonly used to shape the various non-normal data distributions for statistical analysis. However, the most popular transformation approaches, log and square root transformations,88 do not consider the data structure and treat all metabolic features equally. Therefore, there is no guarantee that data normality will be improved after applying those transformations. Recently, our work explored and modeled metabolic feature intensity distributions using three large, publicly available data sets, confirming that non-normal distributions are common and varied in untargeted metabolomics research. The metabolomics data were modeled into nine types of beta distributions, among which two low-normality types are particularly common. Given the diverse data distributions, we proposed adaptive Box-Cox (ABC) transformation, a feature-specific data transformation approach for improving data normality (Fig. 4e).23 A power parameter, lambda, is tuned based on the data structure of each metabolic feature to ensure improved data normality after transformation. Tested in a series of Monte Carlo simulations, ABC transformation outperforms the two abovementioned conventional data transformation methods for both positively and negatively skewed data distributions. However, it is important to recognize that any nonlinear data transformation will change feature-to-feature relationships.
For the correlation analysis of a metabolic feature pair, it is recommended to use the original quantitative data rather than the transformed data. Additionally, data transformation methods can alter the overall data distribution pattern. Especially in our feature-specific data transformation workflow, different features can be subjected to different transformation functions. Consequently, the visualization of the overall metabolic changes (e.g., principal component analysis) might be distorted.23
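A feature-specific transformation in the spirit of ABC can be sketched with per-feature maximum-likelihood Box-Cox fits. This is a simplified stand-in: the published ABC transformation tunes lambda with its own criteria, whereas here we simply let scipy maximize the log-likelihood per column.

```python
import numpy as np
from scipy.stats import boxcox

def abc_transform(X):
    """Fit a separate maximum-likelihood Box-Cox lambda for each metabolic
    feature (column), so every feature is reshaped toward normality with its
    own power parameter. Data must be strictly positive."""
    X = np.asarray(X, float)
    out = np.empty_like(X)
    lambdas = np.empty(X.shape[1])
    for j in range(X.shape[1]):
        out[:, j], lambdas[j] = boxcox(X[:, j])  # per-feature lambda
    return out, lambdas
```

Because each column receives its own lambda, the caveat above applies directly: correlations and multivariate visualizations computed on the transformed matrix mix different transformation functions and should be interpreted with care.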
First of all, annotation of unrecognized metabolites often relies on searching for known metabolites with similar chemical structures. Structural similarity can be reflected by MS/MS spectral similarity, which is the key to known-to-unknown metabolite annotation. Therefore, developing a proper algorithm to compute spectral similarity is of great importance. Previous algorithms, including Global Natural Products Social Molecular Networking (GNPS) and the NIST Hybrid Similarity Search (HSS), have been proposed for this purpose.95–97 These algorithms consider the matching of both the m/z of fragment ions and the m/z differences between fragment ions and their precursors (i.e., neutral losses). They can reflect a certain degree of spectral similarity between metabolites and their one-step biotransformation derivatives. However, these conventional algorithms show limited capability in capturing the common core structural component embedded in metabolites, as such core structural information is not encoded in individual fragment ions or neutral losses.
To create a spectral similarity algorithm that considers the core structural information, we proposed the concept of hypothetical neutral loss (HNL), which is defined as the mass difference between a pair of fragment ions in an MS/MS spectrum (Fig. 5a).24 These mass differences are hypothetical as (1) some HNL values of an experimental spectrum may not represent real metabolite substructures but are merely arbitrary values; and (2) some HNLs are not even generated during the fragmentation process. We demonstrated that HNL values contain core structural information that can improve access to shared structural units between two MS/MS spectra. We thus developed the Core Structure-based Search (CSS) algorithm, which considers conventional fragment ions, neutral losses, and more importantly, HNL values. Compared to existing spectral comparison algorithms, CSS shows a significantly improved correlation between spectral and structural similarities, paving the way for more accurate and informative molecular networking analysis. Furthermore, by combining the CSS algorithm, an HNL library, and a biotransformation database, we developed Metabolite core structure-based Search (McSearch), a web-based platform to facilitate the annotation of unknown metabolites by referencing the MS/MS spectra of their structural analogs.
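The HNL concept itself is straightforward to compute, as in the sketch below; the min_loss cutoff is an arbitrary illustrative threshold for discarding chemically implausible small differences, not a value from the paper.

```python
from itertools import combinations

def hypothetical_neutral_losses(frag_mzs, min_loss=14.0):
    """All pairwise m/z differences between fragment ions in one MS/MS
    spectrum; differences below min_loss are discarded as chemically
    implausible neutral losses."""
    losses = (abs(a - b) for a, b in combinations(frag_mzs, 2))
    return sorted(l for l in losses if l >= min_loss)
```

A spectrum with n fragment ions yields up to n(n-1)/2 HNL values, which is why HNLs expose shared substructure that individual fragment ions or precursor-referenced neutral losses miss.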
During spectral similarity analysis, as well as de novo spectral interpretation, the quality of experimental MS/MS spectra matters. However, MS/MS data collected from LC-MS analyses are often contaminated because the selection of precursor ions is based on a low-resolution quadrupole mass filter. A consequence of the wide m/z isolation window is that precursor ions of other chemicals with similar m/z values can also pass through the mass filter into the collision cell for fragmentation. The fragmentation of unwanted precursor ions generates contamination fragment ions (CFIs), which show up alongside true fragment ions (TFIs) from the targeted precursor ions, leading to “chimeric” MS/MS spectra. This issue has been recognized in metabolomics with the development of RAMSY.98 To recognize and remove CFIs in experimental MS/MS spectra, we proposed a peak correlation-based approach (Fig. 5b).25 The primary premise is that TFIs should coelute with their parent ions with highly correlated LC chromatographic patterns, whereas CFIs do not necessarily follow these patterns. On top of that, we developed MS2Purifier, a machine learning-assisted solution that removes CFIs from experimental MS/MS spectra and improves MS/MS spectral quality for more confident metabolite identification and MS/MS interpretation. Our work was published around the same time as another library-based MS/MS cleaning platform, DecoID.99 The two approaches use complementary algorithms to remove contamination fragment ions, and their combined usage may lead to better spectral purification.
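The peak-correlation premise can be illustrated in a few lines. This is a toy sketch of the premise only (the correlation threshold is an assumed illustrative value); MS2Purifier itself couples this idea with a machine learning model.

```python
import numpy as np

def flag_contamination_ions(frag_eics, prec_eic, corr_cut=0.8):
    """Flag fragment ions whose chromatographic traces do not track the
    precursor EIC; poorly correlated fragments are treated as candidate
    contamination fragment ions (CFIs)."""
    return {mz: bool(np.corrcoef(eic, prec_eic)[0, 1] < corr_cut)
            for mz, eic in frag_eics.items()}
```

Fragments flagged True would be removed from the chimeric spectrum, leaving only ions that co-elute with the targeted precursor.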
On the other hand, in silico fragmentation is a powerful solution that generates predicted MS/MS spectra for a broad range of chemicals without reference standards.94,100–106 In particular, combining in silico structural databases with machine learning approaches further enhances the confidence of unknown identification.93,107,108 To achieve in silico MS/MS prediction, fragmentation rules are usually implemented, an important one being the even-electron rule. It states that even-electron precursor ions should undergo heterolytic cleavages and predominantly generate even-electron fragment ions with very few radical fragment ions (RFIs).109 However, our study of over one million low-energy collision-induced dissociation (CID) MS/MS spectra for 27,613 unique chemical compounds in the NIST20 MS/MS spectral library shows that over 60% of MS/MS spectra of even-electron precursors contain at least 10% RFIs by ion count (total number of ions) in positive and negative ESI modes (Fig. 5c).110 This work indicates that the even-electron rule is widely disobeyed, and strictly following it may lead to non-comprehensive prediction of MS/MS spectra.
Last but not least, in many metabolomics studies, biological researchers are interested not in the entire metabolome but in specific classes of chemicals that are essential to the biological process. For instance, steroids are a class of molecules that play a critical role in many physiological systems and diseases, yet many steroids remain unrecognized and unreported in the literature. The ability to unbiasedly and accurately detect and quantify both known and unknown steroids is therefore of great significance. However, the recognition of unknown steroids is a major challenge. To address this challenge, our lab proposed a biology-driven solution.26 In that work, we developed a CNN-based bioinformatics tool, SteroidXtract, to recognize steroid molecules in MS-based untargeted metabolomics using their unique MS/MS spectral patterns (Fig. 5d). Our results demonstrate that SteroidXtract can confidently identify a broad range of both known and unknown steroids in biological samples, greatly accelerating a variety of steroid-focused life science research. Compared to conventional statistics-driven untargeted metabolomics data interpretation, our work offers a novel automated biology-driven approach that prioritizes biologically significant molecules with high throughput and sensitivity.
In general, the prediction of chemical classes directly from MS/MS data alone does not work well for all chemical classes, mainly because of the limited reference spectra available, which leads to the problem of compound class imbalance. Our SteroidXtract work addressed this issue by data augmentation, the creation of artificial training data from existing steroid MS/MS spectra.26 However, achieving system-level chemical classification using data augmentation has not been tested. Moreover, many chemical classes do not have clear or specific MS/MS spectral patterns, lowering prediction sensitivity and specificity. As such, achieving generic chemical class prediction requires other structural and spectral information. The recent publication of CANOPUS (class assignment and ontology prediction using mass spectrometry) makes it possible to perform system-level compound class predictions directly from molecular fingerprints.111 CANOPUS was trained using support vector machine and deep learning algorithms to build the connections between fragmentation patterns, molecular fingerprints, and chemical classes. A key advantage of this design is that it separates the prediction of fingerprints from MS/MS spectra and the prediction of chemical classes from fingerprints, so the two models can be trained on separate datasets. Consequently, chemical class prediction from fingerprints is not limited to compounds with available reference MS/MS spectra and can utilize entire chemical databases for training. In application, an input MS/MS spectrum is processed to generate a fragmentation tree and predicted molecular fingerprints, which are then used to predict the hierarchical compound class of the represented metabolite.
There are also other structural classification approaches that rely on MS/MS clustering or chemical database searching.112,113 Future research may go towards in-depth global metabolite annotation and structural analog discovery with the aid of compound class-enhanced molecular networking. Additionally, comparative metabolomics on the compound class level may also provide a more comprehensive and intuitive mechanistic insight behind biological questions.111,114
This journal is © The Royal Society of Chemistry 2022