Open Access Article
Chayanis Sutcharitchan, Boyang Wang, Dingfan Zhang, Qingyuan Liu, Tingyu Zhang, Peng Zhang and Shao Li*
Institute for TCM-X, Department of Automation, Tsinghua University, 100084 Beijing, China. E-mail: shaoli@mail.tsinghua.edu.cn; Fax: +86 10 62786911; Tel: +86 10 62797035
First published on 30th September 2025
Natural products offer a vast reservoir of bioactive compounds and play a crucial role in drug discovery. In this big data era, the annotation of their pharmacological categories holds great potential for accelerating drug discovery and advancing mechanistic studies of herbal medicines. However, the pharmacological classification of the vast majority of natural products remains unannotated. Existing recommendation frameworks for pharmacological categories are predominantly tailored to conventional drugs and frequently require extensive experimental data, which are typically lacking for natural products. Traditional cheminformatic approaches based on structural similarity, while widely adopted, often struggle to achieve a satisfactory balance between prediction recall and precision, thereby limiting their overall effectiveness. In this study, AgreementPred, a simple and explainable category recommendation framework for drugs and natural products based on multi-representation structural similarity data fusion, was proposed. The framework utilized PubChem compound annotations, which comprised two compound classification systems, Anatomical Therapeutic Chemical (ATC) classification and Medical Subject Headings (MeSH), as category labels, extending the scope of application beyond conventional drugs. The similarity search results using 22 molecular representations were combined to improve prediction recall. The predicted annotations were subsequently filtered by agreement scores to enhance prediction precision. Compared to existing equivalent approaches, AgreementPred achieved a superior recall-precision balance in both ATC and category prediction tasks. With an agreement score threshold of 0.1, AgreementPred achieved a recall of 0.74 and a precision of 0.55 for the category prediction for 1000 compounds from a pool of 1520 categories. Finally, AgreementPred was applied to 321 605 unannotated drugs and natural products. The resulting predictions are expected to contribute to drug discovery and to mechanistic studies.
For mechanistic studies, an understanding of the chemical composition of each herb, as well as of the pharmacological effects of its components, is essential. However, relevant data on natural products remain limited.8 Unlike synthetic drugs, which benefit from standardized classification systems, the vast majority of natural products have no annotated pharmacological classification. Although several databases, such as ChEMBL and the Natural Product Activity and Species Source (NPASS), provide quantitative biological activity data of natural products on specific targets, inferring the classification of a compound solely from biological targets presents a great challenge, particularly given the inherent incompleteness of available datasets.
The Anatomical Therapeutic Chemical (ATC) classification system, established by the World Health Organization (WHO), provides a hierarchical framework for categorizing medical substances based on their anatomical, pharmacological, and chemical properties.9 As a well-curated and high-quality annotated dataset, it has significantly contributed to the advancement of computational methodologies for predicting new therapeutic applications of existing drugs, thereby facilitating drug repositioning.10–12
Inspired by ATC-predicting methods, this study aimed to develop a category recommendation framework that can be applied to both drugs and natural products, using PubChem compound annotations as category labels. PubChem compound annotations comprised two compound classification systems, Anatomical Therapeutic Chemical (ATC) classification and Medical Subject Headings (MeSH). The MeSH database, established by the United States National Library of Medicine, provides a controlled vocabulary for indexing, cataloging, and searching biomedical and health-related information.13 The database curates chemical compounds, including drugs and natural products, related to each MeSH term. Utilizing PubChem compound annotations enabled the predictive framework to extend its application beyond the conventional drug space and provide reasonable annotations for natural products in the database.
Within the domain of natural products, molecular structure remains the most consistently available and reliable source of information for method development. Unlike approved drugs, most natural products lack well-documented data such as chemical–chemical interaction, gene expression, drug target, or side effect profiles utilized in various ATC-predicting methods.11,12,14–16 Moreover, MeSH terms also lack the hierarchical relationships inherent in the ATC classification system, which several ATC-prediction frameworks have leveraged.16–18 Therefore, this study focused exclusively on predicting categories using only molecular structures.
In computational chemistry, a molecular structure can be represented in multiple ways, each capturing different aspects of a molecule.19 Molecular fingerprints are typically employed to represent predefined structural features, such as topological distance between atom pairs, atomic environment within a preset radius, or the presence of specific pharmacophores. Notable examples include atom pair fingerprint (AP), extended connectivity fingerprint (ECFP), and pharmacophore fingerprint (PHFP).19,20 On the other hand, for deep learning implementation, learned representation based on graph neural network has increasingly gained prominence owing to its flexibility, task-specificity, and oftentimes superior prediction performance compared to predefined molecular descriptors, especially on large datasets.21–23
However, to date, no single molecular representation has outperformed the others across all types of tasks and datasets. Previously published ATC prediction frameworks that relied solely on molecular structure as input employed distinct molecular representations.11,24,25 Yang et al.22 found that under certain conditions, such as small (fewer than 1000 molecules) and highly imbalanced datasets, models that integrated learned representations with fixed molecular descriptors outperformed those that employed only learned representations. Furthermore, Boldini et al.20 investigated the effectiveness of various molecular fingerprints for characterizing the chemical space of natural products, as well as their applicability to bioactivity prediction. The study revealed inherent variation in pairwise similarity and prediction performance among different molecular fingerprints, highlighting that each fingerprint captured a different aspect of the same molecule.
In this study, the performance of multiple molecular representations, including 28 molecular fingerprints and 1 unsupervised learned representation, in similarity-based category recommendation was further explored on drug and natural product datasets. Moreover, leveraging the integration of multi-representation structural similarity data, a novel category recommendation framework, AgreementPred, was proposed. After eliminating redundant representations, the framework combined the similarity search results of 22 molecular representations and subsequently filtered the predictions using agreement scores. AgreementPred achieved recall-precision balance superior to previous ATC-predicting frameworks in both ATC and category recommendation tasks and was applied to 321 605 unannotated compounds from drug and natural product databases. A total of 2 888 927 categories were recommended for 321 596 compounds with agreement score higher than 0.1. The resulting prediction is expected to be useful in furthering drug discovery, as well as mechanistic study of herbal medicine and natural products.
31 with collectable PubChem Compound ID (CID). The scope of each database and data from each database used in this study is explained in Table S1.
The PubChem record of each compound was obtained by searching concatenated CID lists on the PubChem database. The resulting tabular data were composed of names, synonyms, identifiers, chemical properties, and annotations of the compounds. A total of 331 326 PubChem records were collected, in which 9721 compounds contained classification annotations. The annotated records were extracted to construct the Annotated-Compound dataset (Table S2).
The drug side effect (SE) dataset was constructed in a similar manner, by mapping compounds in SIDER database to PubChem compounds. Finally, 1376 compounds with obtainable CIDs were incorporated into the Annotated-SE dataset (Table S3).
To reduce computation burden during method development and validation, a sample dataset, AnnoCom1000 was constructed by random sampling 1000 compounds from Annotated-Compound. Moreover, DrugBank1000 and NP1000 datasets were also constructed by random sampling from annotated compounds contained in DrugBank and natural products databases, respectively. The purpose of constructing these two datasets was to compare the prediction performance of each representation on drug and natural product space.
For each compound, the terms contained in the annotations were extracted, stripped of codes, and converted to lower-cased letters. Singular and plural versions of the same terms in the dataset were merged (singular versions were kept, if present), and duplicated terms were eliminated from each record. Finally, the resulting Annotated-Compound dataset contained 54 675 compound–annotation pairs with 1520 unique annotations (Table S4). The sample datasets, AnnoCom1000, DrugBank1000, and NP1000, contained 5612, 6978, and 3995 compound–annotation pairs, comprising 872, 971, and 544 unique annotations, respectively.
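The label clean-up described above can be sketched in a few lines. This is a minimal illustration rather than the authors' code: the code-prefix pattern (a leading identifier such as `C38149 – `), the naive trailing-`s` plural rule, and the function names are all assumptions made for the example.

```python
import re

# Assumed code format: one capital letter, digits, then a dash (e.g. "C38149 - ").
CODE_PREFIX = re.compile(r"^[A-Z]\d+\s*[-–]\s*")


def clean_term(raw: str) -> str:
    """Strip a leading code (if any) and lower-case the term."""
    return CODE_PREFIX.sub("", raw.strip()).lower()


def normalize_records(records: dict[str, list[str]]) -> dict[str, list[str]]:
    """Clean terms, merge plurals into an existing singular, drop duplicates."""
    cleaned = {cid: [clean_term(t) for t in terms] for cid, terms in records.items()}
    vocabulary = {t for terms in cleaned.values() for t in terms}

    def to_singular(term: str) -> str:
        # Naive rule (assumption): "xs" becomes "x" only if "x" occurs in the dataset.
        if term.endswith("s") and term[:-1] in vocabulary:
            return term[:-1]
        return term

    result = {}
    for cid, terms in cleaned.items():
        unique = []
        for term in map(to_singular, terms):
            if term not in unique:  # keep first occurrence, preserve order
                unique.append(term)
        result[cid] = unique
    return result
```

Under this rule a plural label is merged only when its singular form is present somewhere in the dataset, so a label that occurs solely in the plural is left unchanged.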
In this study, manual manipulation of category labels was deliberately minimized. The rationale was that all unique labels, even highly similar ones, occupy different positions in the chain of ontology (Table S5), and manual aggregation of the labels could compromise the traceability of the related annotations. For instance, antiparkinsonian agent and antiparkinson agents belong to separate chains of ontology, namely, C78272 – Agent Affecting Nervous System > C38149 – Antiparkinsonian Agent and D002491 – Central Nervous System Agents > D018726 – Anti-Dyskinesia Agents > D000978 – Antiparkinson Agents, respectively. Merging the two terms would obscure the distinction between the two ontology systems. In contrast, preserving them as separate terms allows potential connections to be drawn while acknowledging that differences may exist. Thus, several pairs of similar labels, such as antidepressant agent and antidepressants, and antiparkinsonian agent and antiparkinson agents, were kept as is in the developed framework.
516 drug-SE pairs with 4216 unique SEs (Table S6).
| Name | Abbreviation | Implementation | Category | Type | Specified parameters | Size | Used in AgreementPred | Reference |
|---|---|---|---|---|---|---|---|---|
| Circular fingerprint | CircFP | CDK | Circular | Binary | — | 1024 | Yes | 40 |
| Local path environment fingerprint | LSTAR | jCompoundMapper | Circular | Binary | — | 4096 | Yes | 20 |
| Topological Molprint-like fingerprint | RAD2D | jCompoundMapper | Circular | Binary | — | 4096 | Yes | 20 |
| Extended connectivity fingerprint (1024 bit) | EC1024 | RDKit | Circular | Binary | Radius = 2 | 1024 | Yes | 20 |
| Extended connectivity fingerprint (2048 bit) | EC2048 | RDKit | Circular | Binary | Radius = 2 | 2048 | No | 25 |
| Functional class extended connectivity fingerprint (1024 bit) | FC1024 | RDKit | Circular | Binary | Radius = 2, useFeatures = True | 1024 | Yes | 20 |
| Atom pair 2D fingerprint (implemented in PaDEL) | AP2DFP | CDK | Path | Binary | — | 780 | Yes | 40 |
| CDK fingerprint | CDKFP | CDK | Path | Binary | — | 1024 | No | 40 |
| Hybrid fingerprint (CDK fingerprint ignoring aromaticity) | HybridFP | CDK | Path | Binary | — | 1024 | Yes | 40 |
| Graph fingerprint (CDK fingerprint ignoring bond orders) | GraphFP | CDK | Path | Binary | — | 1024 | Yes | 40 |
| Daylight fingerprint | Daylight | CDK | Path | Binary | Depth = 7 | 1024 | No | 20 |
| Extended CDK fingerprint (includes 25 bits for ring features and isotopic masses) | ExtFP | CDK | Path | Binary | — | 1024 | Yes | 40 |
| Shortest path fingerprint | SPFP | CDK | Path | Binary | — | 1024 | Yes | 40 |
| All shortest path fingerprint | ASP | jCompoundMapper | Path | Binary | — | 4096 | No | 20 |
| Depth first search fingerprint | DFS | jCompoundMapper | Path | Binary | Depth = 7 | 4096 | Yes | 20 |
| Atom pair fingerprint | AP | RDKit | Path | Count | — | 2048 | Yes | 20 |
| Avalon fingerprint | Avalon | RDKit | Path | Count | — | 512 | Yes | 20 |
| RDKit fingerprint | RDKit | RDKit | Path | Binary | — | 2048 | Yes | 20 |
| Topological torsion fingerprint | TT | RDKit | Path | Count | — | 2048 | No | 20 |
| Pharmacophore pair fingerprint | PH2 | jCompoundMapper | Pharmacophore | Binary | — | 4096 | No | 20 |
| Pharmacophore triplet fingerprint | PH3 | jCompoundMapper | Pharmacophore | Binary | — | 4096 | Yes | 20 |
| LINGO fingerprint | LingoFP | CDK | String | Binary | — | 1024 | Yes | 40 |
| Minhashed atom pair fingerprint | MAP4 | Ref. | String | Categorical | — | 1024 | Yes | 20 |
| Minhashed fingerprint | MHFP | Ref. | String | Categorical | — | 1024 | No | 20 |
| Electrotopological state fingerprint | EstateFP | CDK | Substructure | Binary | — | 79 | Yes | 40 |
| Klekota–Roth fingerprint | KRFP | CDK | Substructure | Binary | — | 4860 | Yes | 40 |
| PubChem substructure fingerprint | PubChemFP | CDK | Substructure | Binary | — | 881 | Yes | 40 |
| Public MACCS fingerprint | MACCSFP | CDK | Substructure | Binary | — | 166 | Yes | 40 |
| InfoGraph graph feature | InfoGraph | Torchdrug | Unsupervised learned representation | Numerical | Learning rate (lr) = 1 × 10⁻³; batch_size = 1024 | 300 | Yes | 32 |
![]() | (1) |
| J(A,B) = 1 − Jaccard distance(A,B) | (2) |
![]() | (3) |
Similarity of categorical fingerprints was calculated in a similar manner to Boldini et al.'s study,20 considering two bits as a match only if they possessed the exact same integer.
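As an illustration of the two matching rules mentioned here, the sketch below computes a Jaccard similarity for binary fingerprints (one minus the Jaccard distance, as in eqn (2)) and an exact-integer match rate for categorical fingerprints. The input encodings (sets of on-bit indices, equal-length integer lists), the normalization by fingerprint length, and the handling of empty inputs are assumptions of this example, not details taken from the study.

```python
def jaccard_similarity(a: set[int], b: set[int]) -> float:
    """Jaccard similarity of two binary fingerprints given as sets of on-bit indices."""
    union = a | b
    if not union:
        return 0.0  # assumption: two empty fingerprints are treated as dissimilar
    return len(a & b) / len(union)


def categorical_similarity(a: list[int], b: list[int]) -> float:
    """Fraction of positions at which two categorical fingerprints hold the same integer."""
    if len(a) != len(b):
        raise ValueError("fingerprints must have the same length")
    if not a:
        return 0.0
    # A position counts as a match only if both integers are exactly equal.
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)
```

For example, `jaccard_similarity({1, 2, 3}, {2, 3, 4})` shares two of four distinct bits and gives 0.5.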
![]() | (4) |
For the AnnoCom1000, DrugBank1000, and NP1000 datasets, the search for MSCs was performed in batches of 50 compounds. In each batch, each query compound was compared against the remaining 9671 compounds in the Annotated-Compound dataset, from which the top N most similar compounds were determined. In contrast, the search for MSCs for the Annotated-SE dataset, which contained only 1376 compounds, was conducted in a leave-one-out manner.
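The batched search can be pictured as follows. This is a hedged sketch under simplifying assumptions: fingerprints are sets of on-bit indices, similarity is Jaccard, and the names `search_batch` and `jaccard` are hypothetical. The point it illustrates is that every compound in the query batch is removed from the reference pool before the top N neighbours are ranked.

```python
import heapq


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two fingerprints given as sets of on-bit indices."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def search_batch(batch_ids, fingerprints, similarity=jaccard, n=1):
    """For each query in the batch, return its n most similar compounds outside the batch."""
    held_out = set(batch_ids)
    references = {cid: fp for cid, fp in fingerprints.items() if cid not in held_out}
    results = {}
    for query in batch_ids:
        query_fp = fingerprints[query]
        # Rank reference compounds by similarity to the query, best first.
        results[query] = heapq.nlargest(
            n, references, key=lambda cid: similarity(query_fp, references[cid])
        )
    return results
```

Calling `search_batch([cid], fingerprints, n=N)` once per compound gives the leave-one-out variant.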
![]() | (5) |
![]() | (6) |
Prediction performance based on 1, 2, 3, 4, 5, 10, 15, 20, and 30 MSCs computed using 29 representations was compared across representations and against that based on the same number of random compounds, to observe the enrichment of correct annotations among top MSCs. Each comparing pair consisted of the prediction based on N MSCs and the prediction based on N random compounds, using the same molecular representation.
Mann–Whitney U tests were used to compare the performance within each comparing pair, whereas Kruskal–Wallis tests followed by Bonferroni-corrected pairwise Mann–Whitney U tests were used to detect statistically significant differences among the performance of the 29 representations.
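To give a sense of scale for the correction step: if every pair among 29 representations is tested, there are 406 pairwise comparisons per metric and MSC setting. The sketch below shows only the Bonferroni adjustment, applied to p-values that would come from a Mann–Whitney U routine such as `scipy.stats.mannwhitneyu`. The choice of tooling and the helper names are assumptions, since the text does not name the software used.

```python
from math import comb


def n_pairwise(n_groups: int) -> int:
    """Number of unordered pairs among n_groups groups."""
    return comb(n_groups, 2)


def bonferroni(p_values: dict, alpha: float = 0.05) -> dict:
    """Map each comparison to (adjusted p-value, significant?) under Bonferroni correction."""
    m = len(p_values)  # number of comparisons actually performed
    return {
        name: (min(1.0, p * m), p * m < alpha)
        for name, p in p_values.items()
    }
```

With three comparisons, a raw p-value of 0.02 is adjusted to 0.06 and no longer passes a 0.05 threshold.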
In AgreementPred (Fig. 1), predicted annotations resulting from multi-representation MSC-based prediction (MultiPred) of a query compound q were further filtered by agreement score (AgS), which was computed for each predicted annotation k, according to the equations below:
![]() | (7) |
![]() | (8) |
| $\mathrm{AgPred}_q = \{k \mid k \in \mathrm{MultiPred}_q,\ \mathrm{AgS}_k > t\}$ | (9) |
The prediction performance of AgreementPred was evaluated on AnnoCom1000, DrugBank1000, NP1000, and Annotated-SE datasets, comparing different t and N parameters.
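The pooling and filtering step of eqn (9) can be sketched as below. The agreement score used here, the fraction of representations whose MSCs carry a given label, is an assumption made for illustration. It is consistent with the scores in Table 4, which are multiples of 1/22, but it should not be read as the exact definition in eqn (7) and (8). Function and argument names are hypothetical.

```python
from collections import Counter


def agreement_pred(mscs_by_rep, annotations, threshold=0.1):
    """Pool MSC annotations across representations and filter by agreement score.

    mscs_by_rep: {representation name: [MSC compound IDs]} for one query compound.
    annotations: {compound ID: set of category labels}.
    Returns {label: score} for labels with score > threshold, highest score first.
    """
    votes = Counter()
    for mscs in mscs_by_rep.values():
        # Each representation votes once per label carried by any of its MSCs.
        labels = set().union(*(annotations.get(cid, set()) for cid in mscs))
        votes.update(labels)
    n_reps = len(mscs_by_rep)
    scores = {label: count / n_reps for label, count in votes.items()}
    kept = {label: s for label, s in scores.items() if s > threshold}
    # Sort by descending score so the score doubles as a confidence ranking.
    return dict(sorted(kept.items(), key=lambda item: (-item[1], item[0])))
```

Under this assumed definition, with 22 representations and t = 0.1, a label needs support from at least three representations to survive, since 2/22 ≈ 0.09 falls below the threshold and 3/22 ≈ 0.14 exceeds it.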
For reasons mentioned in the Introduction section, only methods which adopted molecular structure as the sole input were considered for comparison. SD-ATC employed KRFP as the molecular representation and utilized network-based inference approach to extract the relationship between molecular substructures and ATC classes, whereas iSEA utilized similarity ensemble approach using the average similarity of 3 molecular representations (CDKFP, PubChemFP, and MACCSFP) to quantify the relation of a given drug to each ATC class based on the level of molecular similarity between the drug and drug set belonging to each class.
SuperPred frameworks25,34,35 also adopted molecular structure as the sole input of the models. However, extensive preprocessing of training data was required for SuperPred approaches, especially SuperPred3.0 in which single-label training dataset was mandatory for logistic regression model. Therefore, SuperPred frameworks were not selected to be compared in this section.
The benchmark datasets for second- and fourth-level ATC used in this study were derived from the training set containing 1151 approved drugs provided in iSEA original publication. A subset containing 1107 compounds with obtainable PubChem CIDs and PubChem's canonical SMILES was used in this study. Second-level ATC labels were obtained from the original dataset, whereas fourth-level ATC labels were extracted from the DrugBank database. The ATC datasets were divided into 22 batches containing 50–51 compounds. AgreementPred and SD-ATC were implemented using the same batches of testing data for all datasets.
As iSEA required computing average similarity based on 3 molecular representations with 1000 permutations for every drug-ATC pair, the framework was presumed to be infeasible for a dataset with a large number of classes such as PubChem annotations and fourth-level ATC. Therefore, iSEA was compared with other methods only for the performance on second-level ATC prediction, and the results were directly derived from the original publication without implementation.
605 unannotated compounds from drug and natural product databases. After eliminating redundant representations, 22 molecular representations, namely CircFP, LSTAR, RAD2D, EC1024, FC1024, AP2DFP, HybridFP, GraphFP, ExtFP, SPFP, DFS, AP, Avalon, RDKit, PH3, LingoFP, MAP4, EstateFP, KRFP, PubChemFP, MACCSFP, and InfoGraph, were incorporated (summarized in Table 1), using MSC and agreement score threshold of 1 and 0.1, respectively. The predicted annotations were assigned to each query compound, and the results were further analyzed for plausibility.
The results suggested that similarity-based prediction was effective to some extent for category annotations, comprising ATC and MeSH classification, as the annotations were significantly enriched among compounds with high similarity to query compounds. This aligns with the well-established concept that chemical compounds with similar structures tend to possess similar properties.36 The results were also consistent with the findings of Boldini et al., which showed that while different molecular fingerprints performed best on different datasets, pharmacophore-based fingerprints tended to underperform other types.20
Comparing the performance of similarity-based prediction on drug and natural product datasets, it was discovered that the overall recall and precision were significantly higher (p-value < 0.05) for NP1000 than the DrugBank1000 dataset (Fig. 2C and E), except for the recall of PH2, and the precision of PH3, EStateFP, AP2DFP, GraphFP, and SPFP at various MSCs. The difference possibly stemmed from a higher number of annotations (971 vs. 544) and compound-specific annotations (339 vs. 234) in DrugBank1000 than in NP1000, indicating that the performance of similarity-based prediction could be compromised by the diversity of annotations. This problem could be mitigated by annotation screening and/or grouping; however, elimination or manipulation of labels might also lead to loss of relevant information.
For the Annotated-SE dataset, the difference between the comparing pairs of MSCs and random compounds was not as noticeable as in AnnoCom1000, DrugBank1000, and NP1000 (Fig. 2G and H). However, Mann–Whitney U tests resulted in a p-value lower than 0.05 for all comparing pairs, except for the precision of PH2, PH3, and EStateFP at various MSCs.
It was noteworthy that the average number of annotations per compound was 5.62 vs. 101.78, and the maximum number of annotations per compound was 47 (dexamethasone) vs. 742 (pregabalin), in the Annotated-Compound and Annotated-SE datasets, respectively. In particular, high occurrences of some SEs, such as headache, nausea, and vomiting, were likely to be responsible for the high apparent performance of prediction based on random compounds. Nevertheless, MSC-based predictions showed significant differences in recall and precision among the 29 representations (Kruskal–Wallis p-value < 0.05) at every MSC, while no difference was shown among random predictions. The pattern of performance of the 29 molecular representations also differed from that on the Annotated-Compound datasets, with the RDKit fingerprint obtaining prominent recall, especially at 1 MSC, and with only PH2, PH3, and EStateFP showing notably inferior performance to other representations.
The results suggested that molecular similarity might be insufficient to deliver a reliable SE prediction based on currently available data. Unlike pharmacological categories which are established based on experimental results, drug SEs are typically defined based on observation during randomized controlled clinical trials. Consequently, the SEs of each drug vary significantly in frequency, severity, and clinical relevance, adding considerable complexity to the prediction task that may necessitate more sophisticated approaches.
MSCs of diltiazem computed using 29 representations are shown in Fig. 4A with the corresponding structures and representations shown in Table 2. Of 22 compounds, 3 compounds possessed the annotations of diltiazem: antihypertensive agent, cardiovascular agents, and cardiovascular system were among benazepril's annotations, predicted by SPFP and Avalon; cardiovascular agents and vasodilator agents belonged to tadalafil, predicted by AP; while cardiovascular agents and membrane transport modulators were retrieved by EStateFP among the annotations of cocaine. It is also worth mentioning that annotations relating to cardiovascular system were common among the annotations of different compounds, predicted by different molecular representations.
Consequently, it was further hypothesized that integration of multiple representations in similarity-based prediction might lead to improved performance relative to single-representation prediction.
AgreementPred (Fig. 1), a category recommendation framework for drugs and natural products based on multi-representation data fusion, was developed to address these problems. The framework was devised based on the hypothesis that the degree of agreement among different molecular representations in identifying MSCs of a query compound could indicate the overall similarity of the pair of compounds, and the overall similarity could, in turn, indicate the degree of certainty the pair belongs to the same categories. Moreover, annotations that are common among different MSCs predicted using different molecular representations were also more likely to be related to the query compound.
This hypothesis was supported by diltiazem's annotation prediction (see previous section) and by the MSC profile of levomilnacipran computed using 29 molecular representations. Whereas the 29 representations identified 22 different compounds as the MSC of diltiazem, 28 out of 29 representations identified milnacipran, a stereoisomer of levomilnacipran, as the most similar compound of levomilnacipran. Milnacipran possessed 10 out of 12 of levomilnacipran's annotations, while benazepril possessed 4 out of 14 of diltiazem's annotations, reflecting that prediction performance increased with the degree of agreement.
Leveraging this finding, 22 representations, 1 from each group of representations that were within the same category and were highly correlated (Pearson's correlation > 0.75), as shown in Fig. 3, were incorporated into AgreementPred framework. Seven fingerprints, including TT, ASP, Daylight, CDKFP, EC2048, MHFP, and PH2, were excluded to prevent biased agreement. Annotations predicted by the 22 representations were subsequently filtered by a preset threshold of agreement score which was computed for each of the predicted annotations as the indicator of the degree of agreement (eqn (8)). In this way, prediction recall could be improved through the pooling of predicted annotations resulting from multiple representations, and prediction precision could be enhanced by agreement-based filtering, in which only the annotations of a compound with high overall similarity to the query compound or the annotations shared among multiple MSCs would be predicted for the query compound.
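One simple way to realize this kind of redundancy filter is a greedy pass that keeps a representation only if it is not highly correlated with an already-kept representation of the same category. The sketch below is an assumption-laden illustration: the greedy strategy, the input order acting as a preference order, and the pairwise-correlation lookup keyed by `frozenset` are choices of this example, and the actual grouping in Fig. 3 may have been derived differently.

```python
def select_representatives(reps, category, corr, cutoff=0.75):
    """Keep one representation per group of same-category, highly correlated ones.

    reps: representation names in order of preference (earlier names win).
    category: {name: category label}.
    corr: {frozenset({name_a, name_b}): Pearson correlation}; missing pairs count as 0.
    """
    kept = []
    for rep in reps:
        redundant = any(
            category[rep] == category[k]
            and corr.get(frozenset((rep, k)), 0.0) > cutoff
            for k in kept
        )
        if not redundant:
            kept.append(rep)
    return kept
```

With toy correlation values (not taken from Fig. 3) of 0.98 for EC1024 and EC2048 and 0.90 for AP and TT, the call `select_representatives(["EC1024", "EC2048", "FC1024", "AP", "TT"], ...)` keeps EC1024, FC1024, and AP. A high correlation between representations of different categories does not trigger exclusion.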
The performance of AgreementPred on 3 sample datasets adopting various N of MSCs and the threshold (t) of agreement score is shown in Fig. 5. At t = 0, the performance of AgreementPred was comparable to the performance of similarity-based prediction using equivalent number of compounds. However, unlike similarity-based prediction, recall and precision of AgreementPred demonstrated convergence with increasing t of agreement score up to a certain point, where precision began to outweigh recall. Thus, by adjusting N and t, the preferred balance of recall and precision could be achieved.
Moreover, as shown in Fig. 5D–F, the agreement score of correct prediction was significantly higher (Mann–Whitney U p-value < 10⁻³⁰) than that of incorrect prediction in all datasets, confirming the correlation between agreement score and prediction accuracy. Hence, in AgreementPred, predicted annotations could be sorted by their agreement scores as the indicators of prediction confidence.
| | AgreementPred (MSC = 1; AgS > 0.0) | AgreementPred (MSC = 1; AgS > 0.1) | AgreementPred (MSC = 2; AgS > 0.0) | AgreementPred (MSC = 2; AgS > 0.1) | EC1024 (MSC = 1) | EC1024 (MSC = 2) | iSEA (Lᵃ = 10) | SD-ATC (Lᵃ = 10) |
|---|---|---|---|---|---|---|---|---|
| Second-level ATC | | | | | | | | |
| Recallᵇ | 0.745 (1041/1397) | 0.633 (884/1397) | 0.819 (1144/1397) | 0.670 (937/1397) | 0.523 (731/1397) | 0.601 (840/1397) | 0.748 (1128/1509) | 0.671 (937/1397) |
| Precisionᵇ | 0.139 (1041/7481) | 0.363 (884/2436) | 0.090 (1144/12867) | 0.322 (937/2914) | 0.491 (731/1488) | 0.363 (840/2315) | 0.098 (1128/11510) | 0.085 (937/11070) |
| Fourth-level ATC | | | | | | | | |
| Recall | 0.579 | 0.480 | 0.635 | 0.529 | 0.369 | 0.459 | — | 0.478 |
| Precision | 0.158 | 0.348 | 0.091 | 0.311 | 0.365 | 0.307 | — | 0.060 |
| AnnoCom1000 | | | | | | | | |
| Recall | 0.833 | 0.739 | 0.875 | 0.772 | 0.607 | 0.685 | — | 0.374 |
| Precision | 0.236 | 0.547 | 0.148 | 0.487 | 0.574 | 0.510 | — | 0.173 |

ᵃ L: prediction length. ᵇ Recall and precision in the second-level ATC prediction task are computed in the same manner as iSEA,24 by dividing the total number of correct predictions by the total number of labeled classes (recall) and by the total number of predictions (precision), respectively, as specified in the brackets.
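The pooled recall and precision used for the second-level ATC rows (total correct predictions divided by total labeled classes, and by total predictions) can be written compactly. The sketch below is illustrative only; the function name and the dictionary-of-sets inputs are assumptions.

```python
def pooled_recall_precision(predicted, actual):
    """Pooled recall and precision over all query compounds.

    predicted: {compound ID: set of predicted labels}.
    actual: {compound ID: set of true labels}.
    """
    correct = sum(len(predicted.get(cid, set()) & labels) for cid, labels in actual.items())
    n_labels = sum(len(labels) for labels in actual.values())
    n_predictions = sum(len(labels) for labels in predicted.values())
    recall = correct / n_labels if n_labels else 0.0
    precision = correct / n_predictions if n_predictions else 0.0
    return recall, precision
```

As a check against the table, 1041 correct predictions over 1397 labeled classes gives a recall of 0.745, and the same 1041 over 7481 predictions gives a precision of 0.139.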
On the PubChem annotation prediction task (AnnoCom1000), the performance of SD-ATC was shown to be greatly inferior to that of EC1024 and AgreementPred (Fig. 6). This possibly stemmed from the task-specificity of SD-ATC, which was optimized for ATC prediction, and from the inherent difference between the two tasks. In this regard, the prediction performance of SD-ATC and AgreementPred on each PubChem annotation was further explored. Detailed comparison of prediction precision of each annotation by the two methods is shown in Table S7. It was demonstrated that SD-ATC, utilizing a network-based inference approach, suffered greatly from class imbalance in the PubChem annotation dataset and was clearly biased toward annotations with high occurrence. For example, SD-ATC predicted ‘enzyme inhibitor’, which was the annotation with the highest occurrence, for 997 out of 1000 compounds in the AnnoCom1000 dataset. As a result, SD-ATC was only able to correctly predict 86 out of 872 unique annotations with a prediction length of 10 (10 000 predictions in total). In contrast, AgreementPred was shown to be more tolerant of a highly diverse and imbalanced dataset. It correctly predicted 665 out of 872 unique annotations among 9403 predictions in total. Mean precision across all annotations for SD-ATC and AgreementPred was 0.06 and 0.41, respectively.
Extended connectivity fingerprint (ECFP) is widely accepted for its superior performance in bioactivity prediction to other molecular fingerprints.19 However, similarity-based prediction using ECFP as molecular representation implemented in this study revealed comparable performance to most other molecular representations (see Single-representation similarity-based annotation prediction section). Therefore, EC1024 was employed here as the representative single-representation prediction method.
As shown in Fig. 6, EC1024 and SD-ATC exhibited a pattern in which recall and precision continued to diverge as the number of MSCs or prediction length increased, until eventually reaching a plateau. For both methods, precision peaked at small values of MSCs or shorter prediction length, but this improvement was achieved at the expense of the reduced recall. Notably, EC1024 similarity-based prediction with 2–5 MSCs achieved a balance of recall and precision only slightly inferior to that of AgreementPred. However, owing to the use of agreement scores, AgreementPred demonstrated distinct advantages including greater adjustability, presence of prediction filtering and a confidence indicator. These features are critical, as they help to mitigate poor prediction performance that may arise in single-representation similarity-based methods when annotated compounds with sufficient similarity to the query are not present in the dataset. Moreover, by applying higher agreement score thresholds, AgreementPred could achieve substantially higher precision, further underscoring its superiority over single-representation approaches.
605 unannotated compounds from drug and natural product databases, using 22 selected molecular representations (Table 1). Before agreement-based filtering, 12 691 685 category labels were recommended for 321 605 compounds. Subsequently, 9 802 758 predictions were removed using an agreement score threshold of 0.1, as described in the Material and methods, giving a total of 2 888 927 predicted category labels for 321 596 compounds (Table S8). After the concatenation of the Annotated-Compound dataset with the final prediction result, 2 943 602 category labels were provided for 331 317 compounds (Table S9). The average number of category labels per compound in the final concatenated dataset was 8.9 ± 5.0, increasing from that in the original Annotated-Compound dataset (5.6 ± 4.2).
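The reported counts reconcile with one another by simple arithmetic, using the 54 675 compound–annotation pairs and 9721 annotated compounds of the Annotated-Compound dataset given in the dataset description:

```latex
\begin{aligned}
12\,691\,685 - 9\,802\,758 &= 2\,888\,927 && \text{(labels kept after filtering)} \\
2\,888\,927 + 54\,675 &= 2\,943\,602 && \text{(labels after concatenation)} \\
321\,596 + 9721 &= 331\,317 && \text{(compounds after concatenation)} \\
2\,943\,602 \,/\, 331\,317 &\approx 8.9 && \text{(mean labels per compound)}
\end{aligned}
```

The difference 321 605 − 321 596 = 9 suggests that nine query compounds retained no label above the threshold.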
Predictions were analyzed for a subset of relatively well-studied compounds that remained unannotated in PubChem database, namely apigenin, licochalcone C, and phillyrin. These compounds have been extensively investigated in previous pharmacological studies, providing a valuable reference for external validation. The predicted categories for these compounds were all derived from annotated compounds with high structural similarity. Mean similarity values across MSCs resulting from 22 molecular representations of the three compounds were 0.89, 0.83, and 0.77, respectively, indicating high plausibility of the prediction. Indeed, Table 4 showed that the key pharmacological effects predicted for each compound were consistent with findings reported in previously published literature.
| Compound name | CID | Prediction | Agreement score | Supporting literature |
|---|---|---|---|---|
| Apigenin | 5280443 | Protective agent | 0.41 | 41 and 42 |
| | | Hormone antagonist | 0.41 | 43–45 |
| | | Anticarcinogenic agents | 0.18 | 42, 46 and 47 |
| | | Tyrosine kinase inhibitor | 0.18 | 41 and 47 |
| | | Angiogenesis inhibitor | 0.18 | 48–50 |
| | | Prostaglandin antagonists | 0.14 | 51–53 |
| | | Anti-inflammatory agents | 0.14 | 41, 42 and 52 |
| Licochalcone C | 9840805 | Antineoplastic agent | 0.64 | 54 and 55 |
| | | Angiogenesis inhibitor | 0.36 | 54 |
| | | Growth inhibitors | 0.36 | 55 |
| Phillyrin | 101712 | Antihypertensive agent | 0.32 | 56 and 57 |
| | | Hypolipidemic agents | 0.32 | 58 and 59 |
| | | Anti-inflammatory agents | 0.23 | 60–63 |
| | | Cyclooxygenase inhibitor | 0.18 | 62 |
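The agreement scores in Table 4 are all close to multiples of 1/22 (for example, 9/22 ≈ 0.41 and 14/22 ≈ 0.64). One reading consistent with this is that a label's score is the fraction of the 22 molecular representations whose similarity search recommends it. This is an assumption made here for illustration, not the definition given in the methods. The sketch below implements that reading; `agreement_scores`, `filter_by_agreement`, and the toy votes are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Set

N_REPRESENTATIONS = 22  # number of molecular representations used in the study


def agreement_scores(
    labels_per_representation: Iterable[Set[str]],
    n_representations: int = N_REPRESENTATIONS,
) -> Dict[str, float]:
    """Fraction of representations recommending each label (ASSUMED definition)."""
    votes: Counter = Counter()
    for labels in labels_per_representation:
        votes.update(labels)  # a set, so each representation votes once per label
    return {label: count / n_representations for label, count in votes.items()}


def filter_by_agreement(scores: Dict[str, float], threshold: float = 0.1) -> Dict[str, float]:
    """Keep only labels whose agreement score reaches the threshold."""
    return {label: score for label, score in scores.items() if score >= threshold}


# Toy votes (hypothetical): 9 representations recommend "A", 3 recommend "B",
# 2 recommend "C", and the remaining 13 recommend nothing.
toy = [{"A", "B"}] * 3 + [{"A", "C"}] * 2 + [{"A"}] * 4 + [set()] * 13
scores = agreement_scores(toy)
kept = filter_by_agreement(scores)
print({label: round(score, 2) for label, score in sorted(kept.items())})
```

Under this assumed reading, a threshold of 0.1 keeps labels supported by at least 3 of the 22 representations, since 2/22 ≈ 0.09 falls just below it.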
Furthermore, to relate the pharmacological categories of chemical components to the pharmacological properties of medicinal herbs, in support of the mechanistic studies of herbal medicines mentioned in the Introduction, the resulting annotations of natural products contained in three prominent traditional Chinese medicine (TCM) herbs, Ephedrae Herba (Mahuang), Rhei Radix et Rhizoma (Dahuang), and Salviae Miltiorrhizae Radix et Rhizoma (Danshen), were investigated. For all three herbs, the pharmacological categories widely recognized as their main pharmacological properties were among the 20 most frequently occurring annotations.
In detail, Mahuang, a TCM herb well recognized for its effects on the respiratory and cardiovascular systems (ref. 37), comprised 42 and 39 compounds in the ‘cardiovascular agents’ and ‘respiratory system agents’ categories, which ranked 16th and 18th among the annotations with the highest occurrences, respectively. Among these, 35 and 34 compounds were predicted by AgreementPred.
In Dahuang, an herb renowned for its strong laxative effect (ref. 38), 74 compounds in total were predicted to possess the pharmacological categories ‘laxative’ and ‘cathartics’, which ranked 9th and 10th, respectively. Lastly, in Danshen, an herb well recognized for its uses in various cardiovascular diseases (ref. 39), ‘hematologic agents’ and ‘anticoagulants’ were predicted for 150 and 112 compounds, ranking 9th and 11th, respectively.
These results tentatively lent empirical support to AgreementPred's predictive capability and revealed an inherent relationship between the pharmacological properties of herbs and the pharmacological categories of their constituents, offering valuable insights into further mechanistic studies of herbal medicines.
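The occurrence ranking used for the three herbs can be sketched as follows: count how many of a herb's constituent compounds carry each category, then rank the categories by that count. The compound names and labels are invented placeholders, and `rank_categories` is a hypothetical helper. Tie handling in the original analysis is not specified, so ties here fall back to alphabetical order.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple


def rank_categories(
    compound_labels: Dict[str, Set[str]], top_n: int = 20
) -> List[Tuple[int, str, int]]:
    """Return (rank, category, n_compounds) for the top_n most frequent categories."""
    counts: Counter = Counter()
    for labels in compound_labels.values():
        counts.update(labels)  # a compound counts once per category
    # Sort by descending count, then alphabetically so that ties are deterministic.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, cat, n) for rank, (cat, n) in enumerate(ordered[:top_n], start=1)]


# Placeholder constituents of a single herb (hypothetical data).
herb = {
    "compound_1": {"laxative", "cathartics"},
    "compound_2": {"laxative", "anti-inflammatory agents"},
    "compound_3": {"laxative", "cathartics", "antioxidants"},
}
for rank, category, n in rank_categories(herb, top_n=3):
    print(rank, category, n)
```

Applied to the merged annotations of a herb's constituents, a ranking of this kind would give the positions quoted above, provided that each compound is counted once per category.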
Nevertheless, the proposed framework is far from perfect. Its main limitation lies in its inability to “think outside the box”. Unlike machine learning or network-based approaches, which can recognize latent, complex patterns across high-dimensional data spaces and uncover non-obvious associations, the framework is inherently constrained by its reliance on known and explicitly defined similarity. As a result, it cannot identify compound-specific properties or novel classes of bioactivity that are not shared by structurally similar compounds.
To mitigate this limitation, additional approaches could be integrated into the framework. For example, alternative molecular representations, such as physicochemical property profiles or knowledge graph embeddings, could be utilized to provide complementary aspects of a compound beyond its chemical structure. Moreover, natural language processing techniques and large language models could be employed to explore semantic relationships among annotation terms, thereby enabling the extraction of related annotations even when the compounds do not possess sufficient structural similarity. Collectively, these strategies have the potential to improve the framework's generalizability while alleviating the trade-off between interpretability and discovery power.
AgreementPred was applied to predict categories of 321 605 unannotated compounds from drug and natural product databases. A total of 2 888 927 categories were recommended for 321 596 compounds. The results provided preliminary support for the framework's predictive capability, reasonably annotated pharmacological categories for numerous natural products, and outlined a relationship between the pharmacological effects of herbs and their components, offering potential insights into drug discovery and future mechanistic study of herbal medicines.
| Abbreviation | Definition |
|---|---|
| ATC | Anatomical therapeutic chemical |
| MeSH | Medical subject headings |
| SE | Side effect |
| FP | Fingerprint |
| ECFP | Extended connectivity fingerprint |
| AP | Atom pair fingerprint |
| PHFP | Pharmacophore fingerprint |
| PH2 | Pharmacophore pair fingerprint |
| PH3 | Pharmacophore triplet fingerprint |
| EStateFP | Electrotopological state fingerprint |
| AP2DFP | Atom pair 2D fingerprint |
| GraphFP | Graph fingerprint |
| SPFP | Shortest path fingerprint |
| AgS | Agreement score |
| MSC | Most similar compound |
Supplementary information is available. See DOI: https://doi.org/10.1039/d5dd00329f.
This journal is © The Royal Society of Chemistry 2025