Open Access Article
Jiaquan Huang‡ a, Qiandi Gao‡ a, Ying Tang b, Yaxin Wu a, Heqian Zhang* a and Zhiwei Qin* a
aCenter for Biological Science and Technology, Advanced Institute of Natural Sciences, Beijing Normal University, Zhuhai, Guangdong 519087, China. E-mail: z.qin@bnu.edu.cn; zhangheqian@bnu.edu.cn
bInternational Academic Center of Complex Systems, Advanced Institute of Natural Sciences, Beijing Normal University, Zhuhai, Guangdong 519087, China
First published on 30th August 2023
Natural products are important sources for drug development, and the accurate prediction of their structures assembled by modular proteins is an area of great interest. In this study, we introduce DeepT2, an end-to-end, cost-effective, and accurate machine learning platform to accelerate the identification of type II polyketides (T2PKs), which represent a significant portion of the natural product world. Our algorithm is based on advanced natural language processing models and utilizes the core biosynthetic enzyme, chain length factor (CLF or KSβ), as computing input. The process involves sequence embedding, data labeling, classifier development, and novelty detection, which enable precise classification and prediction directly from KSβ without sequence alignments. Combined with metagenomics and metabolomics, we evaluated DeepT2 and found that the model could readily detect and classify KSβ either from a single sequence or from a mixture of bacterial genomes, and subsequently identify the corresponding T2PKs as belonging to a labeled class or as novel. Our work highlights deep learning as a promising framework for genome mining and therefore provides a meaningful platform for discovering medically important natural products. DeepT2 is available at the GitHub repository: https://github.com/Qinlab502/deept2.
Fig. 1 Illustration of the representative type II polyketides and their biosynthetic intermediates (A) and the overall architecture of DeepT2 developed in this study (B).
Several artificial intelligence (AI)-based natural product discovery models have been proposed in recent years due to rapid data accumulation and digital transformation as well as the accelerated development of AI technology.12–17 Among these frameworks, deep learning has demonstrated exceptional performance in classification tasks, specifically in the area of distinguishing new and unseen data.18–20 For instance, certain deep learning-based tools have demonstrated high efficiency and scalability in predicting natural product classes.13,21 However, these tools have limitations when it comes to identifying enzyme sequences that may be involved in the biosynthesis of natural products with novel carbon skeletons. More recently, protein language models (PLMs) based on self-supervised learning have shown remarkable ability to convert individual protein sequences into embeddings that describe the homology between multiple protein sequences and potentially capture physicochemical information not encoded by the existing methods.22–24 The application of general PLMs to convert sequences into embeddings, which serve as inputs for deep learning models, effectively overcomes the few-shot learning challenge for specific biomolecular property predictions.25,26 In addition, leveraging the large amount of unlabeled data available through a semi-supervised framework can further improve model performance.27,28 Finally, conducting novelty detection on the distribution of sequence-to-chemical feature vectors is a viable approach.29 These advancements inspired us to move forward in understanding T2PKS with PLM, training a robust model with unexplored sequences stored in metagenome data, and eventually finding an effective linker to connect their biosynthetic enzymes and probable chemical structures.
To provide a better approach for the discovery of T2PKs, we developed an end-to-end model named DeepT2 in this study. The model employs multiple classifiers to expedite the translation from protein sequences to the T2PK product class and to identify potential new compounds beyond the established groups. Notably, the model is free of sequence alignment and comprises four main components: (i) sequence embedding: protein sequences are converted into vector embeddings using the pretrained PLM ESM-2;24 (ii) data labeling: the KSβ dataset with known corresponding chemical structures was initially split into five classes based on the total number of biosynthetic building blocks and later reclassified into nine classes through dimension reduction and clustering; (iii) classifier development: classifiers were trained for both KSβ sequence and T2PK classification; and (iv) novelty detection: Mahalanobis distance-based novelty detection30 was applied to identify potential new compounds beyond the nine established groups. Remarkably, we leveraged DeepT2 to detect KSβ from microbial genomes and successfully identified four T2PKs as categorized in our classifiers. This work paves a promising avenue to further explore the potential of the existing reservoirs of T2PK biosynthetic gene clusters (BGCs) and thereby expand the chemical space of this medically important natural product family.
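As a concrete illustration of step (i), the short Python sketch below extracts a fixed-length KSβ embedding with the publicly released fair-esm package. The smaller 650M-parameter checkpoint, the toy sequence, and mean-pooling over the final layer are assumptions made for illustration; the study itself uses the 3-billion-parameter ESM-2 model.

```python
# Sketch of step (i), sequence embedding with ESM-2 (a smaller public
# checkpoint is used here for illustration; the paper reports the 3B model).
# Per-sequence embeddings are obtained by mean-pooling the final-layer
# residue representations, which is one common convention and an assumption.
import torch
import esm  # fair-esm package

model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
batch_converter = alphabet.get_batch_converter()
model.eval()

sequences = [("ksb_example", "MTQRRVVVTGIGALTPLG")]  # truncated toy sequence
_, _, tokens = batch_converter(sequences)

with torch.no_grad():
    out = model(tokens, repr_layers=[33], return_contacts=False)
reps = out["representations"][33]          # (batch, seq_len, 1280)

# Mean-pool over real residues (exclude BOS/EOS and padding tokens).
embeddings = torch.stack([
    reps[i, 1:len(seq) + 1].mean(dim=0)
    for i, (_, seq) in enumerate(sequences)
])                                          # (batch, 1280) fixed-length vectors
```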
The 302 non-KSβ sequences, together with the KSβ sequences, were then split into training, validation, and test datasets, as described in the Materials and methods section. Prior to constructing the classifier, we employed five general PLMs (SeqVec, ESM-1b, ESM-2, ProtT5-XL-U50 and ProtBert-BFD) to vectorize each protein sequence into an embedding, and we observed that the learned representations of ESM-2 and ProtT5-XL-U50 performed best in distinguishing KSβ from non-KSβ sequences, as revealed by the dimension reduction results (Fig. S1†). In this study, we favored ESM-2 because it has the largest parameter size (over 3 billion) and performs better on the T2PK classifier (see the later section). Next, we trained four machine learning algorithms, namely random forest, XGBoost, support vector machine (SVM) and multilayer perceptron (MLP), on the KSβ embeddings obtained from the PLMs and found that MLP and SVM achieved the best results, each with an AUROC of 1 and an F1 score of 1 on the test dataset (Table S3†).
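The scikit-learn sketch below illustrates how such a comparison between the MLP and SVM classifiers can be run on pooled embeddings. The file names and hyperparameters are hypothetical placeholders, not the settings used in the study.

```python
# Minimal sketch of the KSβ vs. non-KSβ classifier comparison: X holds
# (n_samples, embedding_dim) ESM-2 embeddings, y holds binary labels.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score, f1_score

X = np.load("ksb_embeddings.npy")   # hypothetical file of pooled embeddings
y = np.load("ksb_labels.npy")       # 1 = KSβ, 0 = non-KSβ

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

for name, clf in [("MLP", MLPClassifier(hidden_layer_sizes=(256,), max_iter=500)),
                  ("SVM", SVC(kernel="rbf", probability=True))]:
    clf.fit(X_tr, y_tr)
    proba = clf.predict_proba(X_te)[:, 1]
    pred = clf.predict(X_te)
    print(name, "AUROC:", roc_auc_score(y_te, proba), "F1:", f1_score(y_te, pred))
```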
As illustrated in Fig. 2A, the restructuring of the KSβ embeddings initially reset the number of predefined labels to 3, followed by automatic adjustment of the n_neighbors and min_dist parameters to refine the reduced space and improve its alignment with these 3 class labels. HDBSCAN was then applied to cluster similar data points and assign them new class labels. Some previously assigned KSβ sequences were thereby split and merged into new labels. For example, in the class label A column, label I derives from parts of initial labels 8 and 9; label II derives from parts of initial labels 8 and 9 together with all of initial labels 12 and 13; and label III corresponds to the entire initial label 10 (Fig. 2A and Table S4†). Upon increasing the number of predefined labels, the approach decreased the n_neighbors and min_dist values, which allowed HDBSCAN to recognize and assign new labels to smaller, more localized features in the data space (Fig. 2C and S3†). This iterative process continued until the number of class labels assigned to the 163 KSβ embeddings reached 9, beyond which no further labels could be generated or some data points were identified as noise. To evaluate the distribution patterns of the 9 autogenerated labels, supervised UMAP was employed again, revealing that these labels more accurately represent the T2PK biosynthetic logic observed in practice (Fig. 2D).
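A minimal sketch of one round of this relabeling loop is given below, using the umap-learn and hdbscan packages. The parameter values and file names are placeholders, and the automatic tuning of n_neighbors/min_dist described above is not reproduced.

```python
# One iteration of label refinement: supervised UMAP reduces the labeled KSβ
# embeddings, then HDBSCAN clusters the reduced space to propose new labels.
import numpy as np
import umap
import hdbscan

X_labeled = np.load("labeled_ksb_embeddings.npy")   # hypothetical (163, dim)
y_coarse = np.load("building_block_labels.npy")     # coarse labels guiding UMAP

reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, n_components=2, random_state=0)
X_low = reducer.fit_transform(X_labeled, y=y_coarse)  # supervised UMAP

clusterer = hdbscan.HDBSCAN(min_cluster_size=5)
refined_labels = clusterer.fit_predict(X_low)          # -1 marks noise points
print("proposed clusters:", sorted(set(refined_labels) - {-1}))
```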
| Classifier | Class number | TPR% | FPR% | Accuracy | Precision | Recall | F1-score |
|---|---|---|---|---|---|---|---|
| **Initial classifier** | | | | | | | |
| Random forest | 5 classes | 55.00 | 8.88 | 0.76 | 0.49 | 0.55 | 0.51 |
| Random forest | 9 classes | 76.11 | 2.81 | 0.79 | 0.80 | 0.76 | 0.77 |
| XGBoost | 5 classes | 57.67 | 8.88 | 0.76 | 0.68 | 0.58 | 0.59 |
| XGBoost | 9 classes | 76.11 | 2.73 | 0.79 | 0.75 | 0.76 | 0.73 |
| SVM | 5 classes | 62.67 | 7.02 | 0.79 | 0.80 | 0.63 | 0.66 |
| SVM | 9 classes | 96.76 | 0.68 | 0.94 | 0.91 | 0.97 | 0.92 |
| MLP | 5 classes | 67.56 | 5.25 | 0.79 | 0.67 | 0.68 | 0.67 |
| MLP | 9 classes | 93.15 | 1.51 | 0.88 | 0.87 | 0.93 | 0.89 |
| **Enhanced classifier** | | | | | | | |
| MLP | 5 classes | 66.67 | 6.38 | 0.82 | 0.91 | 0.67 | 0.70 |
| MLP | 9 classes | 97.78 | 0.43 | 0.97 | 0.99 | 0.98 | 0.98 |
To leverage the 2566 unlabeled KSβ embeddings, a consistency-regularization-based semi-supervised learning framework was adopted to train an enhanced MLP classifier, building on the initial MLP classifier with the two distinct sets of class labels. The enhanced MLP classifier was trained on both labeled and unlabeled data: a cross-entropy loss was applied to labeled data perturbed with Gaussian noise, while a mean-square error loss was applied to both perturbed labeled and unlabeled data (Fig. 3A). This approach encouraged the model to produce consistent predictions over time and resulted in a smoother decision boundary, thereby enhancing the model's generalization to unseen data and alleviating overfitting. A clear perturbation of the initial classifier at the beginning of training is shown in Fig. 3B. Overall, the performance evaluation demonstrated that the enhanced MLP classifier trained with the nine refined class labels attained an F1 score of 0.98 on the test set, an increase of 0.09 over the initial classifier, indicating state-of-the-art performance.
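The PyTorch sketch below captures this objective in its simplest form: a supervised cross-entropy term on Gaussian-perturbed labeled embeddings plus an MSE consistency term between two noisy views of both labeled and unlabeled data. The network size, noise scale, and loss weighting are assumptions for illustration, not the study's exact configuration.

```python
# Minimal sketch of the consistency-regularization training step.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MLP(nn.Module):
    def __init__(self, dim_in, n_classes, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim_in, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, n_classes))
    def forward(self, x):
        return self.net(x)

def semi_supervised_step(model, opt, x_lab, y_lab, x_unlab, sigma=0.1, w_cons=1.0):
    opt.zero_grad()
    # Supervised term on noise-perturbed labeled embeddings.
    logits_lab = model(x_lab + sigma * torch.randn_like(x_lab))
    loss_sup = F.cross_entropy(logits_lab, y_lab)
    # Consistency term: two independent noisy views of all data should agree.
    x_all = torch.cat([x_lab, x_unlab], dim=0)
    p1 = F.softmax(model(x_all + sigma * torch.randn_like(x_all)), dim=1)
    p2 = F.softmax(model(x_all + sigma * torch.randn_like(x_all)), dim=1)
    loss_cons = F.mse_loss(p1, p2)
    loss = loss_sup + w_cons * loss_cons
    loss.backward()
    opt.step()
    return loss.item()

model = MLP(dim_in=1280, n_classes=9)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
```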
Fig. 4 (caption fragment): the Mahalanobis distance M(KSβ) is computed from the mean and covariance of the training samples; ⊙ represents the utilization of a one-class SVM to detect M(KSβ) for each feature layer.
Of note, we designated 163 KSα sequences as ODD data and 163 KSβ sequences as in-distribution (ID) data and extracted feature vectors from the input, hidden (n = 3), and output layers of the MLP classifier to compute the Mahalanobis distance scores (MDS) for both the ID and ODD data (Fig. 4A). We evaluated the classification performance of each layer using a one-class SVM on the ID and ODD datasets and found that hidden layer 1 exhibited superior performance in detecting ODD data (Fig. S4†). We further examined the MDS distribution of the 163 labeled KSβ sequences (ID data) and the 2566 unlabeled KSβ sequences within hidden layer 1. As shown in Fig. 4B, some data points among the unlabeled KSβ sequences conspicuously diverge from the ID data cluster.
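A minimal sketch of this scoring step is shown below: the mean and covariance of the ID hidden-layer features are estimated, each feature vector is scored by its Mahalanobis distance, and a one-class SVM trained on the ID scores flags out-of-distribution inputs. The covariance regularization, SVM settings, and file names are assumptions.

```python
# Mahalanobis-distance novelty score on hidden-layer features:
# M(x) = sqrt((x - mu)^T S^-1 (x - mu)), followed by a one-class SVM screen.
import numpy as np
from sklearn.svm import OneClassSVM

def mahalanobis_scores(feats, mu, cov_inv):
    diff = feats - mu
    return np.sqrt(np.einsum("ij,jk,ik->i", diff, cov_inv, diff))

id_feats = np.load("hidden1_id_features.npy")        # hypothetical (163, d) ID features
query_feats = np.load("hidden1_query_features.npy")  # features of sequences to screen

mu = id_feats.mean(axis=0)
cov = np.cov(id_feats, rowvar=False) + 1e-3 * np.eye(id_feats.shape[1])  # ridge for stability
cov_inv = np.linalg.inv(cov)

m_id = mahalanobis_scores(id_feats, mu, cov_inv).reshape(-1, 1)
m_query = mahalanobis_scores(query_feats, mu, cov_inv).reshape(-1, 1)

ocsvm = OneClassSVM(nu=0.05, kernel="rbf").fit(m_id)
flags = ocsvm.predict(m_query)   # -1 marks likely out-of-distribution KSβ
```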
Next, we utilized the MDS data generated from hidden layer 1 to train an isolation forest model, which was then employed to detect ODD data within the larger dataset of unlabeled sequences.
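The isolation-forest screen itself reduces to a few lines of scikit-learn, as sketched below; the contamination rate and file names are illustrative placeholders rather than the study's settings.

```python
# Isolation-forest screen on the hidden-layer-1 Mahalanobis distance scores.
import numpy as np
from sklearn.ensemble import IsolationForest

m_id = np.load("mds_hidden1_id.npy").reshape(-1, 1)          # hypothetical ID scores
m_query = np.load("mds_hidden1_unlabeled.npy").reshape(-1, 1)  # scores of unlabeled KSβ

iso = IsolationForest(contamination=0.06, random_state=0).fit(m_id)
flags = iso.predict(m_query)                                  # -1 marks anomalies
print(f"{np.count_nonzero(flags == -1)} unlabeled KSβ flagged as candidate novel classes")
```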
To facilitate a more comprehensive visualization of the distribution of abnormal datasets, we combined the labeled and abnormal datasets and conducted UMAP dimension reduction. Through this process, we were able to identify three clusters, labeled as ODD clusters, which encompassed a total of 164 abnormal data points. On one hand, certain data points from ODD clusters 2 and 3 were found to overlap with the labeled dataset, as indicated by the grey dotted box in Fig. 4C. This occurrence does not imply inaccuracies in the detection process; rather, it suggests that the KSβ from these two clusters share some similarities in their embeddings with the labeled sequences. This observation further suggests that the corresponding chemical structures of these data points may possess common characteristics with the structurally known T2PKs. On the other hand, we noticed that ODD cluster 1, comprising 13 sequences, was completely separated from the labeled dataset and situated at a considerable distance, as shown in Fig. 4C. Based on our hypothesis, the KSβ proteins within this ODD cluster may exhibit novel catalytic domains that differ from the previously labeled KSβ proteins. It is plausible to assume that these novel domains are potentially involved in the biosynthesis of previously undiscovered T2PKs.
To test this hypothesis, we employed ESMFold to predict protein structures in silico for all data points within ODD cluster 1. We then calculated the root mean square deviation (RMSD) between the predicted structures of the 13 unlabeled KSβ proteins and the 163 labeled KSβ proteins. The average RMSD between the protein structures in ODD cluster 1 and those in classes IV, V, VI, and VII was 0.58 Å (Fig. S5 and Table S5†). Table S6† lists the calculated RMSD values between ODD cluster 1 and classes IV–VII, from which a threshold of 0.4 Å can be set to distinguish intra-class from inter-class structures. This indicates that, in the search for novel T2PKs, particular attention should be given to KSβ proteins whose RMSD between the ODD cluster and the other classes exceeds 0.4 Å, as depicted in Fig. 4D.
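For readers who wish to reproduce such a comparison, the sketch below computes a pairwise Cα RMSD after optimal superposition, reading ESMFold-predicted PDB files with Biopython and aligning with SciPy. It assumes the two models contribute the same number of Cα atoms in matching order; a rigorous comparison would first align the sequences or use a dedicated structural-alignment tool.

```python
# Pairwise Cα RMSD after Kabsch-style superposition (simplifying assumptions noted above).
import numpy as np
from Bio.PDB import PDBParser
from scipy.spatial.transform import Rotation

def ca_coords(pdb_path):
    structure = PDBParser(QUIET=True).get_structure("m", pdb_path)
    return np.array([res["CA"].get_coord()
                     for res in structure.get_residues() if "CA" in res])

def aligned_rmsd(pdb_a, pdb_b):
    a, b = ca_coords(pdb_a), ca_coords(pdb_b)
    a, b = a - a.mean(axis=0), b - b.mean(axis=0)   # center both structures
    rot, _ = Rotation.align_vectors(a, b)           # rotation mapping b onto a
    return float(np.sqrt(np.mean(np.sum((rot.apply(b) - a) ** 2, axis=1))))

# Example (hypothetical file names):
# aligned_rmsd("odd_cluster1_member.pdb", "class_IV_reference.pdb")
```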
In this way, we selected the three closest ID samples as the predicted T2PKs for each unknown KSβ input. Overall, 10 KSβ protein sequences were detected from 5 Streptomyces genomes and fell into four classes (Table S7†); the corresponding T2PKs were closest to alnumycin, granaticin and frenolicin in class I,34–36 polyketomycin, dutomycin and LL-D49194 in class VI,37–39 fasamycin, formicamycin and Sch in class VIII,40–42 and lysolipin, BE-24566B and anthrabenzoxocinone in class IX,43–45 respectively (Fig. 5 and Table S7†). To confirm these predictions, the strains were inoculated and their metabolites were analyzed by liquid chromatography high-resolution mass spectrometry. As a result, alnumycin from WY86, polyketomycin from WY13 and lysolipin from PS14 were observed (Fig. 5 and S6†), whereas fasamycin from S. kanamyciticus was not detected under laboratory conditions, as described in a recent study.46
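This final assignment step amounts to a nearest-neighbor lookup in the embedding space, as sketched below; the file names, the use of raw embeddings rather than classifier features, and the product-name array are assumptions for illustration.

```python
# For a query KSβ embedding, report the three closest labeled KSβ embeddings
# (Euclidean distance) and the structurally characterized T2PKs they map to.
import numpy as np
from sklearn.neighbors import NearestNeighbors

id_emb = np.load("labeled_ksb_embeddings.npy")                 # hypothetical (163, d)
id_products = np.load("labeled_t2pk_names.npy", allow_pickle=True)
query = np.load("query_ksb_embedding.npy").reshape(1, -1)

nn = NearestNeighbors(n_neighbors=3, metric="euclidean").fit(id_emb)
dist, idx = nn.kneighbors(query)
for d, i in zip(dist[0], idx[0]):
    print(f"{id_products[i]}\tdistance={d:.3f}")
```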
Fig. 5 T2PK prediction from bacterial genomes. DeepT2 was applied, and the identified compounds alnumycin A, polyketomycin, and lysolipin were subsequently confirmed via high-resolution mass spectra (Fig. S6†). Euclidean distances between each predicted candidate and its top 3 most similar T2PKs are annotated beside the dashed lines (red, T2PKs to be detected; orange, the most similar T2PKs experimentally confirmed; green, other similar T2PKs that have close KSβ structures but different natural products). Further information can be found in Table S7.† The ground-truth T2PK of each bacterium is denoted by the orange point in each panel. Bold bonds in each chemical structure indicate the building block units incorporated into the polyketide backbone, while the black dot indicates a single carbon from a building block unit whose adjacent carbon from the same building block is lost during polyketide biosynthesis via decarboxylation.
The task of few-shot supervised learning requires an approach that transcends traditional supervised neural networks.47 In this context, our work adopts the concept of transfer learning, where ESM-2 is utilized to explore the connection between KSβ embeddings and T2PK structures. While the dimension reduction results indicate that the embedding vectors obtained by ESM-2 closely fit the compound class labels, certain embeddings still require further label refinement and correction. Consequently, we performed label reconstruction using supervised rather than unsupervised UMAP to ensure that the resulting reduced-dimensional space consistently captures the features of the compound skeleton. This approach differs from traditional unsupervised clustering,48 as it strives to strike a balance between the sequence embeddings and the compound class labels to improve the model's accuracy. For example, the T2PK AQ-256 consists of 8 building blocks, but its KSβ is confirmed as ancestral nonoxidative, which differs from the other KSβs involved in the biosynthesis of T2PKs with 8 building blocks.6 Clearly, the state-of-the-art performance of the model trained with the 9 refined class labels shows that classification is unsatisfactory when the five building-block counts are simply used as labels. This finding suggests that KSβ not only affects the number of building blocks but also determines a rough topology prior to cyclization or aromatization. To the best of our knowledge, this is the first algorithm for T2PK classification and prediction of this kind, which, as an alternative to sophisticated protein sequence alignment, may represent a paradigm shift in genome-mining approaches for natural product discovery.
As shown above, we improved the generalization ability of the softmax-based T2PK classifier by employing a consistency-regularization-based semi-supervised learning framework that utilized 2566 KSβ sequences whose corresponding natural product structures remain unknown. However, such models may be overconfident when discerning novel KSβ sequences in the real world.32 To address this concern, an ODD framework based on the Mahalanobis distance was implemented for multiclass novelty detection.49 Notably, certain samples (from the 2566 KSβ sequences) lie close to the labeled data (from the 163 KSβ sequences) because labeled T2PKs with entirely novel carbon skeletons have only been discovered in recent years, such as formicamycin40 and dendrubin.50 Therefore, to avoid false positives in novelty detection, we selected only the 13 potential new-class samples that are distant from the labeled samples for demonstration. Further details regarding the enzymatic information and chemical structures of these T2PKs will be studied in future work.
This study demonstrates the capacity of DeepT2 to predict T2PKs from single or mixed genomic datasets. However, some limitations must be acknowledged. While the training data included bacterial genomes from different phyla, certain biases may hinder the model's ability to detect novel T2PKs from poorly characterized bacterial sources within complex microbiomes. Although the model was validated using Streptomyces genomes as a showcase in this study, it is essential to expand the collection of bacterial genomes to enhance the overall performance of the model. Additionally, the current version of DeepT2 can predict T2PKs from single genes as input, but it requires complete sequences of at least 300 amino acids (the average length of KSβ is around 400 amino acids). For predicting other tailoring modifications, such as methylation or halogenation, supplementing DeepT2 with antiSMASH or DeepBGC is recommended. Despite these limitations, the DeepT2 model outperforms other methods and represents a valuable algorithm for KSβ identification and T2PK discovery. This study also motivates future research to identify which catalytic domains in KSβ contribute to chemical differences through PLMs, thereby providing more insight into KSβ evolution and T2PK biosynthetic mechanisms; this work is currently ongoing in our laboratory. Moreover, as the application of language models to prompt tuning for zero-shot prediction, as well as generative models such as autoregressive neural networks, gradually emerges,51–53 we are prompted to explore such models for KSβ studies. We therefore anticipate that this work will aid the application of genome-mining approaches to discover new KSβ and novel T2PKs, with important clinical implications for transforming microbiome data into therapeutic interventions.
Footnotes
† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d3dd00107e
‡ These authors contributed equally to this work.
This journal is © The Royal Society of Chemistry 2023