Open Access Article
Qisong Xu a, Alan K. X. Tan a, Liangfeng Guo a, Yee Hwee Lim ab, Dillon W. P. Tay *a and Shi Jun Ang *acd
aInstitute of Sustainability for Chemicals, Energy and Environment (ISCE2), Agency for Science, Technology and Research (A*STAR), 8 Biomedical Grove, #07-01 Neuros Building, Singapore 138665, Republic of Singapore. E-mail: dillon_tay@isce2.a-star.edu.sg
bSynthetic Biology Translational Research Program, Yong Loo Lin School of Medicine, National University of Singapore, Singapore 117597, Republic of Singapore
cInstitute of High Performance Computing (IHPC), Agency for Science, Technology and Research (A*STAR), 1 Fusionopolis Way, #16-16 Connexis, Singapore 138632, Republic of Singapore. E-mail: ang_shi_jun@ihpc.a-star.edu.sg
dDepartment of Chemistry, National University of Singapore, 3 Science Drive 3, Singapore 117543, Republic of Singapore
First published on 23rd September 2024
Taxonomical classification of natural products (NPs) can assist in genomic and phylogenetic analysis of source organisms and facilitate streamlining of bioprospecting efforts. Here, a composite machine learning strategy marrying graph convolutional neural networks (GCNNs) and eXtreme Gradient Boosting (XGB) is proposed and validated for taxonomical classification of NPs in five kingdoms (Animalia, Bacteria, Chromista, Fungi, and Plantae). Our composite model, trained on 133 092 NPs from the LOTUS database, achieved five-fold cross-validated classification accuracy of 97.4%. When employed to classify out-of-sample NPs from the NP Atlas database, accuracies of 82.8% for bacteria and 86.6% for fungi were obtained. Dimensionality-reduced representations of the molecular embeddings from our composite model revealed distinct clusters of NPs that suggest a basis for enhanced classification performance. The top critical substructures from the NPs of each kingdom were also identified and compared to provide insights on structure–taxonomy relationships. Overall, this study showcases the potential of composite machine learning models for robust taxonomical classification of NPs, which can streamline discovery of NPs.
Fig. 1 Previous studies12,13,21,25,26 have focused on substructure and atom-pair information to predict up to three natural product (NP) kingdoms. In this work, composite machine learning models are developed for taxonomical classification of NPs in up to five different kingdoms.
Graph convolutional neural networks (GCNNs)22,27,28 are first employed to effectively extract key structural features of NPs as molecular fingerprints that are then used as the input for traditional machine learning algorithms29 like Support Vector Machines (SVMs)30 or eXtreme Gradient Boosting (XGB).31 The combination of molecular fingerprints generated by GCNNs and XGB yielded the most robust classification models (97.4% balanced accuracy), providing improvements of ∼15% in balanced accuracy over incumbent model architectures.21 Our composite models could also be used to characterize complex molecular targets32 or molecules crafted through generative chemistry.33
518 NPs retrieved from the official LOTUS website (https://lotus.naturalproducts.net/download). The SMILES (Simplified Molecular Input Line Entry System)34 of NPs were first canonicalized, giving rise to 276 499 unique isomeric SMILES that illustrate the diversity of NPs with varying stereochemistry. Seven kingdoms were identified from the annotations of the original LOTUS dataset: Animalia, Archaea, Bacteria, Chromista, Fungi, Plantae, and Protozoa. 266 663 isomeric SMILES (96.44% of the original 276 499 isomeric SMILES) only held a single kingdom label (Fig. 2A). A detailed breakdown of the 266 663 single kingdom isomeric SMILES showed that the largest population of characterized NPs originated from Plantae, followed by Fungi, Bacteria, and Animalia kingdoms (Fig. 2B).
Removing isomeric information from the 266 663 single kingdom isomeric SMILES reduced them to 133 876 unique non-isomeric SMILES (Fig. 3A), of which 133 233 (99.52% of 133 876) hold single-kingdom labels. This suggests that despite the presence of stereochemistry, different stereoisomers of the same non-isomeric SMILES originate mostly from the same kingdom. Similarly, the kingdom distribution for the 133 233 non-isomeric SMILES is also dominated by NPs from the Plantae kingdom (Fig. 3B). The final curated dataset for multiclass classification and structural analysis comprised SMILES from the top five kingdoms (Animalia, Bacteria, Chromista, Fungi, and Plantae), totalling 133 092 unique non-isomeric entries, as the kingdoms of Protozoa and Archaea each contributed less than 1% to the dataset.
Subsequent machine learning models were trained on these 133 092 unique, single kingdom label, non-isomeric SMILES for multiclass classification to five kingdoms (Animalia, Bacteria, Chromista, Fungi, and Plantae). An important limitation is that models trained on this non-isomeric SMILES dataset do not consider chirality information when performing taxonomical classification. Incorporating chirality and multiple kingdom labels into future models represents a possible avenue for further research and could further enhance the accuracy of NP taxonomical classification.
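The label-filtering steps described above can be sketched in a few lines. This is a minimal illustration rather than the authors' pipeline: it assumes each record is already a (non-isomeric SMILES, kingdom) pair, with stereochemistry stripped beforehand by a cheminformatics toolkit, and the helper name `curate` and the toy records are hypothetical.

```python
from collections import defaultdict

# Kingdoms retained for the five-class problem, as listed in the text.
TOP_FIVE = {"Animalia", "Bacteria", "Chromista", "Fungi", "Plantae"}

def curate(records, keep=TOP_FIVE):
    """Return {smiles: kingdom} for SMILES with exactly one kingdom label in `keep`.

    `records` is an iterable of (non_isomeric_smiles, kingdom) pairs;
    repeated pairs are collapsed.
    """
    kingdoms_by_smiles = defaultdict(set)
    for smiles, kingdom in records:
        kingdoms_by_smiles[smiles].add(kingdom)

    curated = {}
    for smiles, kingdoms in kingdoms_by_smiles.items():
        if len(kingdoms) != 1:
            continue  # structures reported from several kingdoms are excluded
        (kingdom,) = kingdoms
        if kingdom in keep:
            curated[smiles] = kingdom
    return curated

# Toy records (illustrative only, not taken from LOTUS).
toy = [
    ("CCO", "Plantae"),
    ("CCO", "Fungi"),            # two kingdoms -> dropped
    ("c1ccccc1O", "Bacteria"),
    ("c1ccccc1O", "Bacteria"),   # duplicate -> collapsed
    ("CC(=O)O", "Archaea"),      # outside the top five -> dropped
]
result = curate(toy)
```

Only the structure carrying a single, retained kingdom label survives the filter, which mirrors how the multi-kingdom and minor-kingdom entries fall out of the curated set.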
092 non-isomeric SMILES). The dimension of the MAP4 fingerprints was 1,024, while both MPN and last_FFN fingerprints have dimensions of 1,100, which is the default in Chemprop. The seven ML classifier algorithms explored include Gaussian Naive Bayes (NB), K-Nearest Neighbors (KNN), Quadratic Discriminant Analysis (QDA), Random Forest (RF), Light Gradient Boosting Machine (LGBM), eXtreme Gradient Boosting (XGB), and Support Vector Machine (SVM) with linear kernel. The hyperparameters of the GCNN models were optimized and the molecular fingerprints calculated using the respective chemprop.hyperparameter_optimization and chemprop.fingerprint objects from the Chemprop package.28 During the optimization of the GCNN models, a total of 200 epochs with a batch size of 50 were found to be sufficient. The seven ML models, on the other hand, were trained using the scikit-learn package.29 For all learning algorithms, a five-fold cross validation via stratified sampling of the five kingdoms was implemented to evaluate the multiclass classification performance on the training and validation sets. The performance of all multiclass classification models was evaluated by balanced accuracy, Matthews correlation coefficient (MCC), and F1 score. The mean and standard errors of the metrics from five-fold cross validation were also used to compare the classification performance. The classification metrics were calculated based on the counts of true positive (TP), false positive (FP), true negative (TN), and false negative (FN).
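Since the metrics are defined from TP, FP, TN and FN counts, a small worked example may help. The sketch below uses the standard one-vs-rest definitions (recall, F1 and binary MCC per class, with balanced accuracy taken as the mean per-class recall). Whether the study averaged per-class values in exactly this way is an assumption, and the helper names and toy labels are hypothetical.

```python
from math import sqrt

def one_vs_rest_counts(y_true, y_pred, cls):
    """Count TP, FP, TN, FN treating `cls` as the positive class."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    tn = len(y_true) - tp - fp - fn
    return tp, fp, tn, fn

def recall(tp, fp, tn, fn):
    return tp / (tp + fn) if tp + fn else 0.0

def f1(tp, fp, tn, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def mcc(tp, fp, tn, fn):
    # Conventionally set to 0 when any marginal count is zero.
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    classes = sorted(set(y_true))
    recalls = [recall(*one_vs_rest_counts(y_true, y_pred, c)) for c in classes]
    return sum(recalls) / len(classes)

# Toy imbalanced labels (illustrative only).
y_true = ["Plantae"] * 6 + ["Fungi"] * 2 + ["Animalia", "Bacteria", "Chromista"]
y_majority = ["Plantae"] * len(y_true)  # always predict the largest class

ba_majority = balanced_accuracy(y_true, y_majority)
mcc_plantae = mcc(*one_vs_rest_counts(y_true, y_majority, "Plantae"))
f1_plantae = f1(*one_vs_rest_counts(y_true, y_majority, "Plantae"))
```

A classifier that always predicts the largest kingdom reaches a balanced accuracy of only 1/5 on five classes and an MCC of zero, however skewed the populations are, which is why these metrics suit an imbalanced dataset better than plain accuracy.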
Fig. 4 Using the molecular structure of 6-methoxyisatin as an example, the generated MAP4, MPN and last_FFN fingerprints are illustrated.22,25
092 non-isomeric SMILES. Through t-distributed stochastic neighbor embedding (t-SNE), the high-dimensional MAP4 fingerprints as well as the MPN and last_FFN fingerprints from the final GCNN model were dimensionally reduced and visualized. A series of four different perplexity parameters [10², 10³, 10⁴ and 10⁵] were explored to ensure impartiality when visualizing the two-dimensional (2D) projection clustering of NPs from different kingdoms. To evaluate the performance of t-SNE for different perplexity values, the Kullback–Leibler (KL) divergence between the original and fitted distribution of molecular fingerprints was assessed, while the separation among clusters was quantified by the Davies–Bouldin (DB) score. Molecular embedding and evaluation were performed using the t-SNE and DB_score objects in the scikit-learn package.29
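To make the cluster-separation measure concrete, the Davies–Bouldin index can be computed by hand on a 2D embedding. The sketch below is a plain-Python illustration on toy coordinates, not the scikit-learn routine used in the study, and the function and variable names are hypothetical. Lower values indicate tighter, better-separated clusters.

```python
from math import dist

def centroid(points):
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))

def davies_bouldin(points, labels):
    """Davies-Bouldin index: mean over clusters of the worst
    (scatter_i + scatter_j) / centroid_distance_ij ratio."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("need at least two clusters")

    centres = {k: centroid(v) for k, v in clusters.items()}
    # Scatter = mean Euclidean distance of members to their cluster centroid.
    scatter = {
        k: sum(dist(p, centres[k]) for p in v) / len(v)
        for k, v in clusters.items()
    }

    total = 0.0
    for i in clusters:
        total += max(
            (scatter[i] + scatter[j]) / dist(centres[i], centres[j])
            for j in clusters
            if j != i
        )
    return total / len(clusters)

# Toy 2D "embeddings" for two kingdoms: well separated versus overlapping.
labels = ["Plantae"] * 4 + ["Fungi"] * 4
tight = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
loose = [(0, 0), (0, 4), (4, 0), (4, 4), (3, 3), (3, 7), (7, 3), (7, 7)]

db_tight = davies_bouldin(tight, labels)
db_loose = davies_bouldin(loose, labels)
```

The well-separated toy layout scores far lower than the overlapping one, which is the sense in which a lower DB score signals cleaner kingdom clusters in a projection.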
Finally, we analyzed the substructures of the NP molecules to identify critical structural fragments and their combinations that are characteristic of their kingdom source. The Monte Carlo Tree Search was employed to determine critical substructures using the chemprop.interpret object in Chemprop package.28 By analyzing the Bemis-Murcko scaffolds of NPs,35 the top critical scaffolds from each kingdom were identified.36
092 non-isomeric SMILES dataset to perform taxonomical classification of NPs from their structures. Subsequently, different ML models and molecular fingerprints were investigated to evaluate their influence on multiclass classification performance. To assess the transferability of the developed classification models, the best models were applied to NP taxonomical classification beyond the training set. Finally, we structurally analyzed NPs through dimensionally reduced molecular embeddings to identify and compare critical substructures of NPs in each kingdom.
092 non-isomeric SMILES dataset gave a slightly poorer balanced accuracy of 82.2% and MCC of 87.0% for the validation set.21 Pursuing further improvement, we explored additional ML models to supplement the prediction capabilities of simple GCNN models for more accurate taxonomical classification.
| Algorithm | MAP4 BA (%) | MAP4 MCC (%) | GCNN-MPN BA (%) | GCNN-MPN MCC (%) | GCNN-last_FFN BA (%) | GCNN-last_FFN MCC (%) |
|---|---|---|---|---|---|---|
| GCNN | — | — | — | — | 85.6 ± 0.8 | 87.3 ± 0.5 |
| NB | 20.0 ± 0.0 | 0.0 ± 0.0 | 30.8 ± 0.4 | 39.0 ± 0.7 | 67.5 ± 0.8 | 83.9 ± 0.5 |
| QDA | 29.4 ± 0.1 | 12.8 ± 0.4 | 63.1 ± 0.3 | 79.6 ± 0.4 | 96.5 ± 0.4 | 96.5 ± 0.1 |
| KNN | 54.5 ± 4.5 | 59.8 ± 4.4 | 82.5 ± 0.9 | 87.0 ± 0.6 | 93.9 ± 0.6 | 96.8 ± 0.1 |
| RF | 58.1 ± 0.9 | 67.6 ± 0.2 | 84.8 ± 0.9 | 91.5 ± 0.3 | 96.8 ± 0.5 | 97.8 ± 0.1 |
| LGBM | 68.3 ± 0.8 | 72.5 ± 0.3 | 90.2 ± 0.8 | 93.5 ± 0.2 | 97.2 ± 0.3 | 97.9 ± 0.1 |
| XGB | 72.1 ± 0.9 | 76.7 ± 0.3 | 91.1 ± 0.6 | 93.8 ± 0.1 | 97.4 ± 0.2 | 97.9 ± 0.1 |
| SVM | 82.2 ± 0.5 | 87.0 ± 0.3 | 90.9 ± 0.5 | 92.4 ± 0.2 | 97.1 ± 0.3 | 97.9 ± 0.1 |

a BA = Balanced Accuracy. MCC = Matthews Correlation Coefficient. GCNN = Graph Convolutional Neural Network. NB = Gaussian Naïve Bayes. QDA = Quadratic Discriminant Analysis. KNN = K-Nearest Neighbors. RF = Random Forest. SVM = Support Vector Machine. LGBM = Light Gradient-Boosting Machine. XGB = eXtreme Gradient Boosting. Error of 1 standard deviation shown.
Seven ML algorithms (NB, QDA, KNN, RF, LGBM, XGB, and SVM) based on three different types of fingerprints (MAP4, MPN, and last_FFN) were further explored as a composite strategy to supplement GCNN classification capability (Tables 1 and S1†). Classification models developed from MPN and last_FFN fingerprints provided better classification performance than those constructed from MAP4 fingerprints. This is evident from the high balanced accuracy and MCC in training and validation for all algorithms using MPN and last_FFN fingerprints. Unlike MAP4 fingerprints, which comprise circular substructure and atom-pair information,25 both MPN and last_FFN fingerprints are based on more detailed, information-rich graph representations of NPs suitable for accurate classification (Fig. 4). Furthermore, ML models based on last_FFN fingerprints offer the best classification performance, which could potentially be attributed to the feed-forward neural network that facilitates additional learning drawn from MPN molecular embeddings.22
The classification performance of NPs also depended on the nature of the ML algorithm. For all three types of fingerprints, classification models based on NB and QDA generally performed poorly, as observed from their low accuracies and MCC scores (Tables 1 and S1†). This may be due to the probabilistic nature of NB and QDA, which makes them sensitive to the kingdom populations and skews higher probabilities toward the major class.37,38 For KNN, the balanced accuracy and MCC improved for all three fingerprints, together with comparable classification performance for both training and validation sets. RF and SVM are two high-performing models that provided significant improvements in both classification accuracy and precision. This is because RF combines results from multiple trees to describe complex decision boundaries,39 while SVM is resilient to outliers by identifying optimal hyperplanes that maximize class separation.30 Finally, ensemble learning strategies involving tree-based models such as LGBM40 and XGB31 demonstrated good performance, in line with their ability to handle imbalanced classes well and prevent overfitting with regularization. To this end, the GCNN-XGB composite model developed from last_FFN fingerprints significantly outperformed both the simple GCNN models and the MAP4-SVM model from previous studies (Fig. 5).21
Fig. 5 Comparison of overall performance of the SVM classification model using MAP4 fingerprints from Capecchi and Reymond21 and the current work (GCNN and GCNN-SVM developed using last_FFN fingerprints).
In addition, the influence of traditional ML algorithms and molecular fingerprints on individual classification performance for each kingdom is reported in detail (Fig. S5–S10†). High accuracies and F1 scores were observed for each kingdom when ML models were constructed with MPN and last_FFN fingerprints, demonstrating the advantages of MPN and last_FFN fingerprints over MAP4 fingerprints. In terms of ML algorithms, NB and QDA models performed poorly in classification (low accuracy and F1 score) for most kingdoms. The classification accuracies and F1 scores decrease from Plantae to Chromista, again mirroring their population sizes in the dataset. On the other hand, SVM30 classified accurately for each kingdom despite the differences in kingdom populations. This is because SVMs can separate multiple classes regardless of differences in how often each class occurs. RF39 demonstrated excellent training classification performance across different kingdoms due to its ability to handle complex, high-dimensional data. Finally, ensemble learning strategies involving tree-based models such as LGBM40 and XGB31 also performed well due to their leaf-wise growth strategy focusing on the most significant splits and their built-in regularization, respectively. Overall, the composite strategy of layering XGB on top of last_FFN fingerprints provided the best classification model for the accurate taxonomical classification of NPs.
372 NPs present, 13 136 NPs (7446 from Bacteria and 5690 from Fungi) are not found in the LOTUS database used for training our models. GCNN-SVM, GCNN-LGBM, and GCNN-XGB composite models with comparable performance were evaluated on the NP Atlas test set.
The composite GCNN-XGB model performed markedly better at classification compared to simple GCNN and composite GCNN-MPN models (Table 2). However, it trades bacterial NP classification accuracy for fungal NP accuracy when compared to the literature benchmark SVM model using MAP4 molecular fingerprints.
136 NP Atlas test set
| Model | Molecular fingerprint | Bacterial NP accuracy (%) | Fungal NP accuracy (%) |
|---|---|---|---|
| SVM | MAP4 | 89.9 | 81.6 |
| GCNN | last_FFN | 81.1 | 86.0 |
| GCNN-SVM | MPN | 80.2 | 82.9 |
| GCNN-SVM | last_FFN | 83.2 | 86.5 |
| GCNN-LGBM | MPN | 82.7 | 82.9 |
| GCNN-LGBM | last_FFN | 82.7 | 86.3 |
| GCNN-XGB | MPN | 84.0 | 82.6 |
| GCNN-XGB | last_FFN | 82.8 | 86.6 |
Fig. 6 Visualization of the 2D t-SNE projection (perplexity value of 10⁴) of (A) MAP4, (B) MPN and (C) last_FFN fingerprints from the final trained GCNN model.
Next, the critical substructures in NPs contributing to the classification of kingdom origins were determined. A Monte Carlo Tree Search (MCTS) was used to identify critical chemical fragments in the molecular structures of NPs. The top ten critical substructures deemed by the trained GCNN model as the most informative for NP taxonomical classification are listed for each kingdom (Fig. 7). Critical chemical substructures were identified as possessing a rationale score of more than 0.8, calculated from the chemprop.interpret object in Chemprop package.28
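The selection step can be pictured as a simple filter-and-rank. In this sketch the input is assumed to be a list of (kingdom, substructure SMILES, rationale score) tuples already produced by an interpretation run. Ranking by frequency is an assumption, since the text does not state how the top ten were ordered, and `top_substructures` is a hypothetical helper with toy values.

```python
from collections import Counter, defaultdict

def top_substructures(rationales, threshold=0.8, n=10):
    """Per kingdom, return the `n` most frequent substructures whose
    rationale score exceeds `threshold`."""
    counts = defaultdict(Counter)
    for kingdom, substructure, score in rationales:
        if score > threshold:
            counts[kingdom][substructure] += 1
    return {k: [s for s, _ in c.most_common(n)] for k, c in counts.items()}

# Toy rationales (illustrative values only).
toy = [
    ("Plantae", "c1ccccc1", 0.95),
    ("Plantae", "c1ccccc1", 0.91),
    ("Plantae", "c1ccoc1", 0.85),
    ("Plantae", "C1CCOCC1", 0.60),   # below threshold -> ignored
    ("Bacteria", "c1ccncc1", 0.88),
    ("Bacteria", "c1ccccc1", 0.83),
    ("Bacteria", "c1ccncc1", 0.90),
]
top = top_substructures(toy)

# Fragments appearing in more than one kingdom's list.
shared = set(top["Plantae"]) & set(top["Bacteria"])
```

Fragments that land in several kingdoms' lists, such as benzene in this toy data, are the ones that cannot discriminate between kingdoms on their own.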
Fig. 7 Combinations of critical substructures identified in the five kingdoms of (A) Plantae, (B) Fungi, (C) Chromista, (D) Bacteria, and (E) Animalia.
Interestingly, the identified NP scaffolds also share structural similarities with essential starting fragments for drug discovery.42 The critical substructures for NPs in Plantae, Fungi and Chromista mainly consist of oxygen-based heterocycles (Fig. 7A–C). For plant NPs, the critical substructures tend to be simpler in nature, including furan-like43 (A5, A9), pyran-like44 (A3, A4, A8), and lactones45 (A6). In fungal NPs, molecular systems of fused rings (B5, B8) and linked rings (B10) were found to be critical. For chromists, the critical substructures identified in their NPs tended toward more complex fused ring systems (C3, C7, C8) and macrocycles (C2, C9). On the other hand, bacterial NPs typically consist of nitrogen-based heterocycles,46 including pyridine (D6), thiazole (D5), phenazine (D2) and phenoxazine-like moieties (D10) (Fig. 7D), with a few critical lactone fragments also identified (D4, D7, D9). Nitrogen-fused heterocycles such as pyrrole (E5) and indole (E3) were found to be important critical substructures in animal NPs (Fig. 7E),47 in addition to three- (E9) and four-fused (E1, E7, E9) ring systems resembling steroids.48 The benzene ring is a highly common and critical substructure across all five kingdoms (A1, B2, C1, D1 and E2). Owing to the high structural stability conferred by resonance, the planar aromatic rings offer stable building blocks that are ubiquitously found in nature. As fragments such as benzene and furan (A9, E8) are shared between kingdoms, individual fragments alone cannot inform taxonomical classification. Instead, it is the unique combination and connectivity of these fragments that drives differentiation between kingdoms. This underscores the importance of analyzing the broader structural context of NP structures via the right molecular fingerprinting technique rather than relying solely on the presence of individual substructures.
All of the substructures described above are critical to the synthesis of stable NPs with differing levels of structural complexity.2,49
Structural analyses such as these provide valuable insights into the key fragments and potential fragment combinations characteristic of each kingdom, supporting in silico bioprospecting efforts to systematically identify the biochemical origins of novel NPs.17 Furthermore, the identified relationships between critical fragments and the corresponding kingdoms from which the NPs originate can prompt future genomic and phylogenetic analyses of different organisms to reveal the fundamental biosynthesis pathways of NPs occurring in nature.50 Overall, by leveraging GCNNs, the structural features of NPs are effectively captured through molecular graphs, facilitating the formation of well-separated clusters corresponding to the five kingdoms. Identifying these critical substructures also enhances the explainability and interpretability of our composite machine learning models, offering a clearer understanding of how they utilize structural information for taxonomical classification.
092 non-isomeric SMILES across these five kingdoms were found to classify with a slightly superior performance to those of previous studies. Notably, the classification performance within each kingdom was found to increase with NP populations (i.e. data quantity). Three types of molecular fingerprints (MAP4, MPN, and last_FFN) were explored using seven different ML algorithms (NB, KNN, QDA, RF, LGBM, XGB, and SVM). The composite GCNN-XGB model merging last_FFN fingerprints with XGB yielded the best classification performance of 97.4% balanced accuracy on the validation set. When extended to classifying NPs outside of the training set from the NP Atlas database, the composite GCNN-XGB model achieved accuracies of 82.8% for Bacteria and 86.6% for Fungi. t-SNE embeddings of the three different molecular fingerprints revealed that last_FFN fingerprints gave the most well-separated clusters of NPs, consistent with their strong classification performance. Finally, the top critical substructures characteristic of NPs in each kingdom were identified and compared to provide insights into structure–taxonomy relationships. Overall, this study demonstrates the potential of a composite machine learning strategy for taxonomically classifying NPs and providing structural insights. Adopting this approach not only accelerates the classification of NP origins to screen for novel bioactive candidates but can also highlight kingdom-unique structural features of NPs to guide future efforts in virtual screening for bioprospecting as well as genomic and phylogenetic analyses of different organisms. Future avenues to enhance taxonomy classification include adopting advanced strategies such as hybrid data-based learning,51 multi-level learning,52 or meta-learning53 to further extend the generalizability of trained models across various dimensions, such as molecular size, functional groups, and structural complexity.
Footnote
† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d4dd00155a
This journal is © The Royal Society of Chemistry 2024