Chengzhi Deng (a, b), Chengli Zhou (a), Lei Shi* (a) and Bingyi Wang* (a, c)
(a) Institute of Highland Forest Science, Chinese Academy of Forestry, Kunming 650233, Yunnan, P. R. China. E-mail: leishi@139.com
(b) Nanjing Forestry University, Nanjing 210037, Jiangsu, P. R. China
(c) Key Laboratory of Smart Drugs Control of Ministry of Education, Yunnan Police College, Kunming 650223, Yunnan, P. R. China. E-mail: wbykm@aliyun.com
First published on 22nd April 2025
In metabolomics, accurate compound identification is paramount, but the vast number of metabolites makes it a significant challenge. This study proposes a novel approach to compound identification: a molecular-fingerprint prediction method based on the graph attention network (GAT) model. Fragmentation trees are computed from tandem mass spectrometry (MS/MS) data, and the resulting fragmentation-tree graph data are processed with a technique inspired by natural language processing. The model, which consists of three GAT layers and two linear layers, is then trained on these graphs. The results show that molecular fingerprints can be predicted from MS/MS spectra with a high degree of accuracy. First, the model achieves excellent performance in receiver operating characteristic (ROC) and precision–recall curves, and training with different parameters identifies edge features as the factor with the most influence on performance. The model also achieves better accuracy and F1 score than MetFID. Second, the model performance was validated by querying molecular libraries with methods commonly used in related studies. With precursor mass querying, the proposed model achieves performance comparable with CFM-ID; with molecular formula querying, it achieves better performance than MetFID. This study demonstrates the potential of the GAT model for compound identification tasks and provides directions for further research.
Mass spectrometry (MS) has emerged as a pivotal instrument in metabolite identification, facilitating comprehensive metabolomics analyses. Its key advantages include high sensitivity and specificity and the ability to analyze minimal sample volumes, making it a highly efficient tool for metabolite profiling.3 Nevertheless, MS alone provides little information about the elemental compositions and chemical structures of fragments.2 Tandem mass spectrometry (MS/MS) is widely used to obtain additional information about the chemical structures of compounds.4 Identifying metabolites from MS or MS/MS spectra is a pivotal step for the subsequent chemical biology interpretation and modelling of metabolomics samples. In practice, it is regarded as the most challenging and time-consuming part of metabolomics experiments. Unlike the fragmentation of peptides and proteins, which is relatively straightforward because of their repeating structures, the fragmentation of metabolites is an intricate, probabilistic process that also depends on the fragmentation energy. Consequently, the interpretation of mass spectra demands specialized knowledge and expertise. To address this need, numerous computational techniques and software tools have been developed to facilitate metabolite identification in metabolomics experiments.
Computational techniques for metabolite identification can generally be divided into four categories:2 (1) Mass spectrum libraries: the MS/MS spectrum of the unidentified compound is compared with the reference compound spectra in a mass spectrometry database,4,5 and the candidates are scored and ranked according to their similarity to the queried spectrum. Commonly used databases include METLIN,6 HMDB,7 MassBank,8 GNPS,9 PubChem,10 and KEGG.11 Nevertheless, the metabolite coverage of these databases is very limited compared with the substantial number of metabolites present in nature, which constrains their capacity to identify unknown metabolites.12 (2) In silico fragmentation: software tools predict fragments and their abundances from the molecular structures of compounds to fill the gap between spectral and structural databases. This strategy has been successfully applied to protein research (e.g., MASCOT13 and SEQUESTEng14). In contrast, the fragmentation of metabolites into product ions in MS/MS is a much more complicated stochastic process that depends on the 3D structures of the metabolites, the energy required to break specific bonds to obtain the product ions, the probability of different dissociation reactions, etc.2 (3) Fragmentation trees: Böcker and Rasche15 proposed the use of fragmentation trees for interpreting MS/MS spectra. Fragmentation trees can be used to identify the molecular formula of a molecule and to interpret the fragmentation process of a precursor ion from its MS/MS spectra.16 In addition, the fragmentation trees of two unknown compounds can be aligned and compared, which can provide useful information, for example through clustering, about compounds that cannot be identified.17,18 (4) Machine learning: in recent years, several machine-learning frameworks have been used to tackle metabolite identification tasks.
For example, Brouard et al.,19 Dührkop et al.,20 and Heinonen et al.21 have proposed several methods for predicting substructures or general chemical properties. Mrzic et al.22 and van der Hooft et al.23 proposed methods for automatically discovering substructures from MS/MS spectra and then identifying candidate compounds from databases based on these substructures.
Molecular fingerprinting is a method of encoding the structure of a molecule that can be converted into a bit string. Each bit in this bit string represents the presence or absence of a substructure in the molecule. This method has applications in the comparison of molecular similarity and the identification of molecules with matching substructures. A multitude of molecular fingerprinting algorithms have been developed, including Avalon, MACCS, Morgan and Klekota–Roth, among others. A variety of instruments are available for the purpose of calculating molecular fingerprints, including: Open Babel,24 RDKit,25 CDK26 and so on. Meanwhile, MetFID,27 FingerID,21 CSI:FingerID20 and many other tools have been developed for predicting molecular fingerprints from MS/MS spectra.
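The bit-string view of a fingerprint can be illustrated with a minimal sketch. It assumes a toy fingerprint stored as a Python integer; the bit positions and helper names (`make_fingerprint`, `tanimoto`, `has_substructures`) are hypothetical and do not correspond to any of the fingerprint algorithms named above.

```python
def make_fingerprint(present_bits):
    """Build a bit string (stored as an int) from the indices of present substructures."""
    fp = 0
    for bit in present_bits:
        fp |= 1 << bit
    return fp


def popcount(fp):
    """Number of bits set to 1."""
    return bin(fp).count("1")


def tanimoto(fp_a, fp_b):
    """Shared set bits divided by the set bits in either fingerprint."""
    union = popcount(fp_a | fp_b)
    return popcount(fp_a & fp_b) / union if union else 0.0


def has_substructures(fp_molecule, fp_query):
    """True if every bit set in the query is also set in the molecule."""
    return (fp_molecule & fp_query) == fp_query


# Hypothetical molecules: bits 0, 1 and 3 present versus bits 0 and 1 present.
mol_a = make_fingerprint([0, 1, 3])
mol_b = make_fingerprint([0, 1])
similarity = tanimoto(mol_a, mol_b)
```

Here `has_substructures(mol_a, mol_b)` is true because every substructure bit of `mol_b` is also set in `mol_a`, which is the kind of screening the paragraph above refers to.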
In this study, a methodology is proposed for predicting the molecular fingerprints of compounds from fragmentation-tree data, which are calculated from MS/MS spectra. The method can improve the accuracy of molecular-fingerprint prediction and thereby support better metabolite identification. The workflow is shown in Fig. 1.
For a given graph G = (V, E), V represents the set of vertices (or nodes) in the graph and E represents the set of edges (or connections) between the vertices. Each node vi ∈ V has a feature vector of dimension d. The GAT model can be defined as follows:
![]() | (1) |
![]() | (2) |
![]() | (3) |
![]() | (4) |
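For orientation, the widely used formulation of a graph attention layer introduced by Veličković et al. is sketched here. It is an assumption that eqn (1)–(4) follow this standard form; the symbols (node features h_i, weight matrix W, attention vector a, neighbourhood N_i, number of heads M) are chosen for this sketch, and any use of edge features in the attention score is not shown.

```latex
\begin{align}
e_{ij} &= \mathrm{LeakyReLU}\!\left(\mathbf{a}^{\top}\left[\mathbf{W}\mathbf{h}_i \,\Vert\, \mathbf{W}\mathbf{h}_j\right]\right)
  && \text{unnormalised attention score} \\
\alpha_{ij} &= \frac{\exp(e_{ij})}{\sum_{k \in \mathcal{N}_i} \exp(e_{ik})}
  && \text{softmax over the neighbours of } i \\
\mathbf{h}_i' &= \sigma\!\left(\sum_{j \in \mathcal{N}_i} \alpha_{ij}\,\mathbf{W}\mathbf{h}_j\right)
  && \text{single-head update} \\
\mathbf{h}_i' &= \Big\Vert_{m=1}^{M} \sigma\!\left(\sum_{j \in \mathcal{N}_i} \alpha_{ij}^{(m)}\,\mathbf{W}^{(m)}\mathbf{h}_j\right)
  && \text{concatenation of } M \text{ heads}
\end{align}
```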
The fragmentation-tree data were transformed into a graph data structure. Each node in the graph corresponds to a specific fragment, with the molecular formula (encoded using one-hot encoding) and relative abundance of the fragment being represented within the feature vector. Each edge in the graph represents the relationship between two fragments, with the feature vector of each edge being calculated based on the approach of Yao et al.,30 which is usually used in natural language processing.
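The conversion described above can be sketched as follows. This is a minimal illustration under stated assumptions: the one-hot vocabulary is built from the formulas in a single tree (in practice it would presumably cover the whole dataset), and the names `Fragment` and `build_graph` are hypothetical rather than taken from the authors' code.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    formula: str          # molecular formula of the fragment
    rel_abundance: float  # relative abundance, assumed scaled to [0, 1]


def build_graph(fragments, losses):
    """Return (node_features, edge_index) for one fragmentation tree.

    fragments: list of Fragment objects (the nodes)
    losses:    list of (parent_index, child_index) pairs (the edges)
    """
    vocab = sorted({f.formula for f in fragments})
    index = {formula: k for k, formula in enumerate(vocab)}
    node_features = []
    for f in fragments:
        one_hot = [0.0] * len(vocab)
        one_hot[index[f.formula]] = 1.0
        # Feature vector = one-hot formula followed by the relative abundance.
        node_features.append(one_hot + [f.rel_abundance])
    edge_index = [(parent, child) for parent, child in losses]
    return node_features, edge_index


# Hypothetical three-fragment tree: precursor -> fragment -> fragment.
tree = [Fragment("C6H12O6", 1.0), Fragment("C6H10O5", 0.6), Fragment("C5H8O4", 0.25)]
x, edge_index = build_graph(tree, [(0, 1), (1, 2)])
```

Edge feature vectors are deliberately left out of this sketch, since they are computed separately from co-occurrence statistics.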
For two nodes i and j in the graph, the feature vector of the edge between them can be calculated as follows:
![]() | (5) |
![]() | (6a) |
![]() | (6b) |
![]() | (6c) |
TF-IDFij = TFij × IDFj, | (7a) |
![]() | (7b) |
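As an illustration of the kind of statistics involved, the sketch below computes pointwise mutual information (PMI) and TF-IDF in their common natural-language-processing form, treating each fragmentation tree as a "document" and each fragment formula as a "word". Whether this matches eqn (5)–(7b) term by term is an assumption; in particular, counting co-occurrence once per tree and keeping only positive PMI values are choices made for this sketch.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_edges(trees):
    """PMI for each fragment pair, counting co-occurrence once per tree."""
    n = len(trees)
    single, pair = Counter(), Counter()
    for tree in trees:
        items = sorted(set(tree))
        single.update(items)
        pair.update(combinations(items, 2))
    pmi = {}
    for (i, j), c_ij in pair.items():
        value = math.log((c_ij / n) / ((single[i] / n) * (single[j] / n)))
        if value > 0:  # keep only positively associated pairs
            pmi[(i, j)] = value
    return pmi


def tf_idf(trees):
    """TF-IDF_ij = TF_ij * IDF_j, with IDF_j = log(N / number of trees containing j)."""
    n = len(trees)
    df = Counter()
    for tree in trees:
        df.update(set(tree))
    scores = []
    for tree in trees:
        tf = Counter(tree)
        scores.append({w: (tf[w] / len(tree)) * math.log(n / df[w]) for w in tf})
    return scores


# Hypothetical fragment lists for four fragmentation trees.
trees = [
    ["C6H12O6", "C6H10O5", "H2O"],
    ["C6H12O6", "C6H10O5"],
    ["C5H10O5", "H2O"],
    ["C5H10O5", "C5H8O4"],
]
pmi = pmi_edges(trees)
scores = tf_idf(trees)
```

Fragments that co-occur more often than chance receive a positive PMI, while a fragment that appears in few trees receives a larger IDF weight.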
A total of 16659 bits of molecular-fingerprint data were generated based on the molecular structure information (from SMILES or InChI) of the compounds. Among them, 1024 bits of FP2 fingerprints were generated using Open Babel; 2048 bits of Atom Pair fingerprints, 512 bits of Avalon fingerprints, 166 bits of MACCS fingerprints, 2048 bits of Morgan fingerprints, and 2048 bits of RDKit fingerprints were generated using RDKit; and 1024 bits of CDK fingerprints, 881 bits of PubChem fingerprints, and 4860 bits of Klekota–Roth fingerprints were generated using CDK.
The dataset was constructed by taking the graph data (including node features, edges, and edge features) as input values and the molecular fingerprints as output values.
To obtain a robust estimate of model performance, ten-fold cross-validation was employed. The dataset was divided into ten folds; in each iteration, nine folds were used for training and the remaining fold was used for testing. Ten training and testing rounds were conducted, and the average of the ten test results was taken as the final result.
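The splitting scheme can be sketched in a few lines of plain Python. The shuffling seed and the helper names (`k_fold_indices`, `cross_validate`) are assumptions for illustration; `evaluate` stands for any user-supplied function that trains on the training indices and returns a test score.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal-sized folds
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def cross_validate(n_samples, evaluate, k=10):
    """Average the score returned by evaluate(train, test) over all folds."""
    scores = [evaluate(train, test) for train, test in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)


splits = list(k_fold_indices(103))
```

Every sample appears in exactly one test fold, so each spectrum contributes once to the averaged result.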
Subsequent to the GAT layer is a pooling layer that utilizes a max-pooling strategy to convert the feature vector of each node into a scalar. Subsequently, the data undergoes two linear fully connected layers, followed by a sigmoid function that transforms the output value into a range between 0 and 1. This value is then employed as the predicted value of the molecular fingerprint.
The model was trained with mini-batch gradient descent using a batch size of 64. Binary cross-entropy loss (BCELoss) was used as the loss function, and the Adam with decoupled weight decay (AdamW)33 optimizer was employed. Following a preliminary evaluation, a learning rate of 0.0001 and a weight decay of 0.0001 were selected.
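For readers unfamiliar with the loss, a pure-Python sketch of binary cross-entropy averaged over fingerprint bits is given here. It is an illustration rather than the training code: the clipping constant `eps` is an assumption used to avoid taking the logarithm of zero, and deep-learning frameworks handle this detail in their own way.

```python
import math


def bce_loss(predicted, target, eps=1e-12):
    """Mean binary cross-entropy between predicted probabilities and 0/1 targets."""
    total = 0.0
    for p, t in zip(predicted, target):
        p = min(max(p, eps), 1.0 - eps)  # clip to keep log() finite
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(predicted)


# Confident, correct predictions give a lower loss than hesitant ones.
confident = bce_loss([0.9, 0.1], [1, 0])
hesitant = bce_loss([0.6, 0.4], [1, 0])
```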
To identify the factors that have the most influence on the model, the model was trained using different strategies. These include training for more epochs, using datasets of different sizes, deleting node features or edge features from the dataset, and reducing the number of GAT layers or linear layers.
The receiver operating characteristic (ROC) curve for each model is shown in Fig. 3a and the precision–recall curve is shown in Fig. 3b. A ROC curve is a graphical representation of the true positive rate (sensitivity) against the false positive rate (1 − specificity) for different cut-off points of a diagnostic test. A precision–recall curve is a graphical representation of the precision against the recall for different cut-off points of a classification model. The area under curve (AUC) of the ROC curve is a measure of the model’s ability to distinguish between the positive and negative classes and ranges from 0 to 1, with 1 indicating perfect discrimination and 0.5 indicating no discrimination (equivalent to random guessing). The AUC of the precision–recall curve is a measure of the model’s ability to identify positive samples and ranges from 0 to 1, with 1 indicating perfect identification and 0 indicating no identification.
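Both areas can be computed directly from labels and predicted scores, as the following sketch shows. It assumes binary labels with at least one positive and one negative sample; the ROC AUC is computed through its rank interpretation (the probability that a random positive scores higher than a random negative), and the precision–recall area is approximated by average precision, which is one common estimator among several.

```python
def roc_auc(labels, scores):
    """Probability that a positive sample outscores a negative one (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of the precision values at the rank of each positive sample."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits


# Hypothetical predictions for eight fingerprint bits, two of which are set.
labels = [1, 0, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.2, 0.2, 0.1, 0.05]
auc_roc = roc_auc(labels, scores)
auc_pr = average_precision(labels, scores)
```

In this toy case a single highly scored negative lowers the precision–recall area more than the ROC area, which illustrates why the two curves can diverge when positives are rare.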
The FF, FL and FFv models demonstrate optimal performance, exhibiting an AUC of 1.0 in the ROC curve and 0.97 in the precision–recall curve. These results indicate the models’ capacity to accurately differentiate between positive and negative samples and their high precision across a range of recall levels. This suggests that the models excel not only in identifying positive instances but also in doing so with a high degree of accuracy. The LF model demonstrates notable efficacy with an AUC of 0.98 in the ROC curve and 0.88 in the precision–recall curve, indicating robust performance. The FFe and FFn models also demonstrate adequate performance with AUCs of 0.92 and 0.86 in the ROC curve, respectively. However, these models exhibit suboptimal performance with AUCs of 0.54 and 0.37 in the precision–recall curve, indicating that their discrimination capabilities are satisfactory but not exceptional. Consequently, the trade-offs between precision and recall are less effective, resulting in a lower overall performance compared to the other models.
The findings indicate that the elemental information and the mass-to-charge ratio information have a negligible influence on the model’s performance. Among the node features, relative abundance contributes most to the model’s efficacy. Overall, the edge weight information is the most significant factor in the model’s performance. The number of GAT layers is also important: with too few layers, the model may fail to complete training. In contrast, the impact of the linear layers is relatively small. Notably, the model attains satisfactory performance even with a reduced volume of training data. Conversely, increasing the number of training epochs has a substantial adverse effect on the model’s performance, which points to an overfitting problem.
To demonstrate our models’ superiority, we compared the performance of our models with MetFID.27 The MetFID model is a machine learning model that can predict molecular fingerprints from MS/MS data. The results for comparison were obtained from a MetFID model trained using spectra with all collision energies, and tested on separated datasets with collision energies less than 30 eV and greater than 30 eV; and from another one trained using combined spectra generated by ion-trap (IT) and higher-energy collisional dissociation (HCD) instruments, and tested on separated datasets of IT and HCD instruments. The accuracy and F1 score of the MetFID model and our models are shown in Table 1.
Model | Data set | Accuracy | F1 score |
---|---|---|---|
MetFID | (⩽30 eV) | 92% | 58% |
MetFID | (>30 eV) | 94% | 69% |
MetFID | (IT) | 94% | 74% |
MetFID | (HCD) | 94% | 68% |
Proposed models | FF | 97.7% | 85.6% |
Proposed models | FFn | 98.8% | 93.3% |
Proposed models | LF | 95.8% | 71.7% |
Compared to MetFID models with different data sets of the same size, the LF model achieves an accuracy advantage of 1.8 to 3.8 percentage points (pp) and an F1 score difference of −2.3 to 13.7 pp. This demonstrates the advantages of this model over MetFID. For the FF model using a larger data size, there is a further improvement in accuracy of 1.9 pp and an improvement in F1 score of 13.9 pp compared to the LF model, indicating that this model is able to achieve a significant improvement on larger data sets. The FFn model with an increased number of training epochs shows a further improvement in accuracy of 1.1 pp and an improvement in F1 score of 7.7 pp compared to the FF model, indicating that the number of training epochs also has a significant effect on model effectiveness.
The FFn model has a lower AUC on the PR curve but still achieves better accuracy and F1 scores. The ROC curve is relatively robust to the imbalance between positive and negative samples, and the PR curve reflects the model’s ability to predict positive samples, whereas the accuracy and F1 scores reflect the model’s ability to predict all samples (both positive and negative). In the molecular-fingerprint prediction task, positive samples indicate the presence of a substructure and negative samples indicate its absence. For a given molecule, only a limited number of substructures are present and most are absent, so the number of negative samples is much higher than the number of positive samples.
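The effect of this imbalance on threshold-based metrics can be seen in a small worked example. The numbers are hypothetical (5 present and 95 absent substructures) and are not taken from the study; the sketch only shows how accuracy and F1 are computed from the same thresholded predictions.

```python
def accuracy_and_f1(y_true, y_pred):
    """Accuracy and F1 score for binary labels and thresholded predictions."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, f1


# 5 positive bits and 95 negative bits; the predictor finds 2 of the 5 positives.
y_true = [1] * 5 + [0] * 95
y_pred = [1] * 2 + [0] * 3 + [0] * 95
accuracy, f1 = accuracy_and_f1(y_true, y_pred)
```

With these inputs the accuracy is 0.97 while the F1 score is about 0.57, because the 95 correctly predicted negatives dominate the accuracy but do not enter the precision or recall.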
The compound data for the construction of the molecular libraries were obtained from the PubChem database, a free database created and maintained by the National Library of Medicine (NLM) that contains information on more than 100 million compounds. The molecular formula, molecular weight, SMILES, InChI, and other pertinent information of the compounds are included in the PubChem database.
A multitude of similarity-score methods can be utilized to compare the molecular fingerprints of the predicted compounds and the candidate compounds. These include the Tanimoto coefficient, the Dice coefficient, the Cosine coefficient, and the Euclidean distance. In this study, the Cosine coefficient was selected due to its effectiveness in the preliminary experiment. The Cosine similarity score is calculated as follows:
![]() | (8) |
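A minimal sketch of cosine scoring and candidate ranking is given here, assuming the usual definition of cosine similarity (dot product divided by the product of the vector norms). The candidate names and fingerprint values are hypothetical, and `rank_candidates` is an illustrative helper rather than part of the published workflow.

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their Euclidean norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_candidates(predicted, candidates):
    """Return candidate names sorted by decreasing similarity to the prediction."""
    scored = [(cosine_similarity(predicted, fp), name) for name, fp in candidates.items()]
    return [name for _, name in sorted(scored, key=lambda item: -item[0])]


# Predicted bit probabilities versus the binary fingerprints of three candidates.
predicted = [0.9, 0.1, 0.8, 0.2]
candidates = {
    "cand_A": [1, 0, 1, 0],
    "cand_B": [0, 1, 0, 1],
    "cand_C": [1, 1, 1, 0],
}
ranking = rank_candidates(predicted, candidates)
```

The position of the correct compound in such a ranking is what the top 1, top 5 and top 10 figures summarise.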
The results based on precursor mass querying are shown in Table 2. The results of CFM-ID, MetFrag, and FingerID are taken from Allen et al.34 These models were trained using MassBank data and tested by querying the PubChem database. The FF model’s performance, particularly within the 5 ppm window, differs noticeably from that of CFM-ID. This disparity can be attributed to the molecular libraries used not being fully consistent. However, the results improve substantially with the further-trained FFn model. The LF model, trained on the limited training set, achieves comparable performance, indicating the model’s strong generalisation capability. All models proposed in this study are competitive with the other models.
Model | Mass accuracy | Top 1 | Top 5 | Top 10 |
---|---|---|---|---|
CFM-ID | 5 ppm | 7.3% | — | 46.9% |
MetFrag | 5 ppm | 4.7% | — | 20.8% |
FingerID | 5 ppm | 0.5% | — | 5.7% |
FF | 5 ppm | 5.1% | 6.1% | 6.3% |
FF | 20 ppm | 7.6% | 9.5% | 9.9% |
FF | 50 ppm | 7.4% | 9.7% | 10.2% |
FF | 100 ppm | 7.8% | 10.2% | 10.6% |
FFn | 5 ppm | 9.1% | 9.9% | 10.1% |
FFn | 20 ppm | 9.5% | 10.5% | 10.7% |
FFn | 50 ppm | 9.9% | 11.0% | 11.3% |
FFn | 100 ppm | 10.3% | 11.3% | 11.7% |
LF | 5 ppm | 4.0% | 5.7% | 6.3% |
LF | 20 ppm | 5.5% | 8.9% | 10.1% |
LF | 50 ppm | 5.0% | 8.3% | 9.5% |
LF | 100 ppm | 4.1% | 7.2% | 8.3% |
The observed discrepancy in the top 5 and top 10 results can be attributed to inaccuracies in the precursor mass values measured by mass spectrometry. For a significant proportion of the sample compounds, the candidates returned by the molecular library query did not contain the correct molecule. When the correct candidate molecules are included, however, the model proposed in this study can yield more accurate results. The small difference between the top 1 result and the top 5 and top 10 results indicates that most of the correct candidate molecules that are retrieved are ranked near the top.
The observed discrepancy in the outcomes across varying precision ranges can be attributed to the substantial size of the PubChem database. As the precision range is increased, the number of candidate compounds increases considerably. However, the correctly matched candidate compounds are already contained within the smaller precision range. Consequently, the results for the larger precision ranges do not exhibit significant improvement.
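The effect of the window size on the candidate list can be sketched with a simple ppm filter. The library entries and masses are hypothetical, and the sketch assumes that the measured value has already been converted to a neutral monoisotopic mass (adduct handling is omitted).

```python
def within_ppm(measured_mass, candidate_mass, ppm):
    """True if the relative mass error is within the given parts-per-million window."""
    return abs(measured_mass - candidate_mass) / candidate_mass * 1e6 <= ppm


def query_by_precursor_mass(measured_mass, library, ppm):
    """Return the names of all library entries inside the ppm window."""
    return [name for name, mass in library.items()
            if within_ppm(measured_mass, mass, ppm)]


# Hypothetical library of exact masses and one measured precursor mass.
library = {"mol_A": 180.06339, "mol_B": 180.06700, "mol_C": 180.07900}
measured = 180.06400

hits_5 = query_by_precursor_mass(measured, library, 5)
hits_20 = query_by_precursor_mass(measured, library, 20)
hits_100 = query_by_precursor_mass(measured, library, 100)
```

Widening the window from 5 to 100 ppm adds candidates, but if the correct compound is already inside the narrow window, the extra candidates can only compete with it in the ranking.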
The results based on molecular formula querying are shown in Table 3. The results of MetFrag, CSI:FingerID and MetFID are taken from Fan et al.27 For the top 1 results, the FF and FFn models clearly outperform the comparison models. Although the LF model performs poorly compared to the other models, the results show that the proposed model still performs when the amount of training data is limited, further illustrating the model’s ability to generalise.
Model | Top 1 | Top 5 | Top 10 |
---|---|---|---|
MetFrag | 12% | — | — |
CSI:FingerID (2016) | 28% | 55% | 70% |
CSI:FingerID (2019) | 39% | — | 75% |
MetFID | 38% | 72% | 72% |
FF | 43.6% | 55.0% | 58.0% |
FFn | 54.0% | 61.1% | 62.6% |
LF | 22.8% | 37.9% | 42.1% |
Compared with CFM-ID, which uses the CFM model, and MetFID, which employs an artificial neural network (ANN) model, the approach implemented in this study, a GAT model with a multi-head attention mechanism applied to fragmentation-tree data, demonstrates enhanced performance in predicting molecular fingerprints. The cleavage process in MS/MS exhibits a degree of regularity, with fragmentation occurring at specific chemical bonds. Consequently, the fragmentation tree reflects structural information more precisely than mass spectrometry data alone. Compared with the conventional GCN model, the GAT model is better able to prioritize significant nodes within the fragmentation tree, which allows it to learn the representations of these nodes more effectively. Consequently, the methodology proposed in this study demonstrates superior performance in molecular-fingerprint prediction.
Despite its strong performance on several indicators, the model proposed in this study has certain limitations. Its identification performance depends heavily on the molecular library query results: if the query results are unsatisfactory, the similarity scoring performed with the predicted molecular fingerprints is also affected. This is particularly evident in precursor mass-based screening, where the precursor mass measured by mass spectrometry differs from the exact mass in molecular libraries, which is calculated from theoretical values. This discrepancy leads to significant variations in the querying process and can exclude correct candidate molecules from the query results. The effect is visible in the screening results based on precursor mass, where other models show similarly low values. Furthermore, the molecular formulae are also calculated from the mass spectrometry data and are therefore subject to bias, although querying by molecular formula is more effective than querying by precursor mass. Additionally, the model relies on existing definitions of molecular fingerprints, and different fingerprints contain overlapping features, which limits the amount of effective information that can be learnt.
Future versions of the model could be improved in several directions. First, a more extensive array of mass spectrometry datasets could be used during training to refine the model’s performance. Second, the molecular library query could be optimized, with the aim of developing a more precise screening method for candidate compounds. Third, the limitations of the existing molecular-fingerprint framework could be overcome by enabling the GAT to extract features autonomously. This approach is expected to yield more effective information and improve compound identification.
This journal is © The Royal Society of Chemistry 2025