Cheng
Chen‡
a,
Ledu
Wang‡
a,
Yi
Feng‡
a,
Wencheng
Yao
b,
Jiahe
Liu
a,
Zifan
Jiang
a,
Luyuan
Zhao
a,
Letian
Zhang
a,
Jun
Jiang
a and
Shuo
Feng
*a
aState Key Laboratory of Precision and Intelligent Chemistry, University of Science and Technology of China, Hefei, Anhui 230026, China. E-mail: sfeng18@ustc.edu.cn
bMOE Key Laboratory of Resources and Environmental System Optimization, College of Environmental Science and Engineering, North China Electric Power University, Beijing 102206, China
First published on 13th March 2025
Machine learning models have emerged as powerful tools for drug discovery of lead compounds. Nevertheless, despite notable advances in model architectures, research on more reliable and physicochemical-based descriptors for molecules and proteins remains limited. To address this gap, we introduce the Fragment Integral Spectrum Descriptor (FISD), aimed at utilizing the spatial configuration and electronic structure information of molecules and proteins, as a novel physicochemical descriptor for virtual screening models. Validation demonstrates that the combination of FISD and a classical neural network model achieves performance comparable to that of complex models paired with conventional structural descriptors. Furthermore, we successfully predict and screen potential binding ligands for two given protein targets, showcasing the broad applicability and practicality of FISD. This research enriches the molecular and protein representation strategies of machine learning and accelerates the process of drug discovery.
The performance of VS models depends not only on the quantity and quality of the training dataset, but also on the model architecture and the molecular representation strategies adopted. Nevertheless, in recent research progress, while significant achievements are primarily attributed to the evolution of machine learning models, these models still rely heavily on traditional structural descriptors to represent molecules and proteins.18,19 To enhance the precision of predictions through the development of more informative descriptors, we endeavour to integrate additional useful information into the descriptors. The spectrum, as a typical carrier of chemical information, offers an indirect glimpse into crucial molecular properties such as spatial configuration and electronic structure information,20,21 potentially bolstering our predictive capabilities.22 Naturally suited for representing molecules, the spectrum is also inherently suitable as an input for machine learning (ML) models due to its controllable and uniform dimensionality. Consequently, the spectrum can be a valuable tool for enhancing the efficiency of molecular and protein representation.
We propose a spectra-descriptor-based approach, which can be used for protein–molecule interaction prediction and ligand screening. To begin with, our study defines one kind of spectra-descriptor, based on infrared (IR) spectra, as the input representation for virtual screening models, with sufficient physicochemical information for both molecules and proteins. We also develop a model named MLMS (ML Molecular Spectra model) capable of rapidly calculating and extracting molecular spectra-descriptors, which are then applied together with MLPS (ML Protein Spectra model),20,21 a model capable of outputting protein spectra-descriptors. MLMS and MLPS can generate input features for the virtual screening model named VS2Net (Vibrational Spectra & Virtual Screening Net). Furthermore, we design case studies to test the ability of our proposed descriptors and models in screening unknown ligands. Through case studies, we successfully showcase the potential of this approach in addressing specific drug design problems, which are identifying potential ligand molecules for new target proteins, thereby providing valuable clues and insights for subsequent lead optimization and synthesis procedure. The detailed illustration of the ligand-finding process can be seen in Fig. 1.
To intuitively present the ability of the model, we also randomly selected four representative data points for detailed comparison (as shown in Fig. 2b). In these instances, the blue lines represent the DFT calculated FISD, while the green lines showcase the MLMS predicted FISD. Notably, they exhibit excellent consistency in shape trends and specific values, providing strong evidence for the MLMS model's accuracy and robustness in mimicking FISD characteristics. Additionally, we utilize the MLPS model derived from work by Ye et al. to predict the FISD of proteins.20,21
Method | Algorithm | AUC | EF0.5% | EF1.0% | EF2.0% | EF5.0% |
---|---|---|---|---|---|---|
NN score25 | DNN | 0.58 | 4.16 | 2.98 | 2.46 | 1.89 |
RF score26 | RF | 0.62 | 5.62 | 4.27 | 3.49 | 2.67 |
3D-CNN27 | CNN | 0.86 | 42.55 | 26.65 | 19.36 | 10.71 |
PocketGCN28 | GCN | 0.88 | 44.40 | 29.74 | 19.40 | 10.73 |
GCN29 | GCN | 0.94 | — | — | — | — |
VS2Net (our model) | DNN | 0.94 | 86.55 | 57.22 | 33.56 | 15.71 |
AttentionSiteDTI30 | Transformer | 0.97 | 101.74 | 59.92 | 35.07 | 16.74 |
The graph representation method follows a similar logic to MLMS but skips the process of obtaining FISD. Instead, the molecular graph is encoded into an unrestricted 50-dimensional vector using a GCN. As for the representation method of Cheminfo-Descriptor, we used the MLDescriptors module in RDKit for calculation, resulting in 206-dimension descriptors. Thus, each molecule can be represented by 206 chemical information descriptors (Table S3†), which are then concatenated with the FISD of the protein and input into the DNN for classification just like the FISD case and the graph case. The classification results of the three models are presented in Table 2.
Method | AUC | Recall | EF0.5% | EF1.0% | EF2.0% | EF5.0% |
---|---|---|---|---|---|---|
FISD | 0.94 | 0.82 | 86.55 | 57.22 | 33.56 | 15.71 |
Graph | 0.65 | 0.71 | 85.83 | 55.47 | 29.15 | 12.39 |
Cheminfo | 0.68 | 0.67 | 84.73 | 44.49 | 23.08 | 9.53 |
When comparing model performance, in addition to utilizing the AUC value to reflect the overall model performance and the EF value to indicate the model's early enrichment capability, we have also incorporated the recall metric to assess the model's ability to identify all positive samples comprehensively. Given the highly imbalanced distribution of positive and negative samples in the DUD-E dataset, where one positive sample roughly corresponds to 200 negative samples, the capacity to correctly identify positive samples is of greater importance than identifying negative ones. Therefore, the recall value is employed to evaluate the model's performance in this regard.
The FISD method demonstrates robust performance, achieving the best results across all six metrics, as shown in Table 2. In contrast, the performance of models employing the Graph method and those using the Cheminfo-D method is worse than that of models employing the FISD, mainly reflected in the two metrics of the AUC value and Recall value. For the EF values, they all exhibit similar results.
The Graph model yields an AUC of 0.65, a recall value of 0.71, and a good EF value, indicating its strong early enrichment capability. Initially, the model exhibited almost no convergence during training. Subsequently, we pre-trained the model using a small subset of the training set molecules until it achieved a good fit, and then utilized the entire dataset for training to achieve convergence. We hypothesize that the potential reason for this is that FISD can encode molecules in a suitable latent space while serving as a continuous structural descriptor. FISD may enhance the model's performance by restricting the result-generating space and using vibrational spectroscopy as the fitting target during MLMS training, which enables it to learn the correlation between the structure and spectrum, effectively encode molecular electronic information and spatial structure, and thus obtain better classification results. Theoretically, Graph could achieve the same effectiveness as FISD. However, in practice, without specific target space constraints, it is challenging for GCN to quickly locate the ideal feature space. Pre-training with MLMS maps molecules into the fragment integral spectrum descriptor space, improving representation efficiency and predictive capability.
The Cheminfo-D model achieves an AUC of 0.68, a recall rate of 0.67, and an EF value slightly lower than those of the other two models. We speculate that the primary reason for the performance differences is the discreteness of the cheminfo-descriptors. The description based on chemical structures is typically discrete, as atoms are discontinuous. This discreteness can lead to issues, such as the phenomenon of activity cliffs, making it challenging to capture finer differences between structures.
Notably, our virtual screening model was trained solely on the dataset comprising 82 unique targets and their corresponding small-molecule ligands as well as decoys from the DUD-E database. Therefore, it is unrealistic to directly apply the model to other protein targets for ligand screening and expect excellent results. However, the FISD and model proposed still hold practical application value. Since spectra reflect the structural and interactional properties of substances, theoretically, proteins and molecules with similar FISD should also possess similar interactional characteristics. Therefore, we initially conducted a K-means clustering analysis on the 82 known targets from the DUD-E dataset, along with two additional protein targets, pre-setting the number of clusters to eight. This step aimed to categorize protein targets into clusters with similar FISDs, and the clustering results are detailed in Fig. S2.† Subsequently, we can attempt to screen ligands within clusters of similar proteins. The new targets clustered with existing protein targets from DUD-E enabled us to obtain the corresponding ligands of these DUD-E proteins as a ligand library for virtual screening. We also calculated ligand similarity to validate our proposal, as shown in Fig. S3.† The process is universal, and when given requests of new proteins, a ligand library can always be established through a similar process for further screening. After that, we applied our VS2Net model to predict protein–ligand interactions. We analysed the top-10 molecules for two target proteins and found that for tau protein and spike protein, there were 6 and 5 molecules, respectively, with model-predicted interaction probabilities higher than 0.99999. The screening results and the protein structures are illustrated in Fig. 3a and 4a.
To verify if the screened molecules serve as potential ligands for their respective protein targets, AutoDock Vina was employed for molecular docking.33 In this process, eight independent docking conformations were generated for each candidate molecule, and the one with the highest score was chosen as the representative docking result, ensuring both accuracy and representativeness.
For the tau protein, docking analysis of the six selected candidate ligand molecules was performed, with results presented in Fig. 3b. Encouragingly, all six molecules achieved a binding free energy threshold below −7 kcal mol−1, indicating effective docking to tau protein, thereby confirming the predictive accuracy and broad applicability of our screening model. Notably, tau-5 (in Fig. 3a) exhibited the lowest affinity value while maintaining effective docking, and its detailed binding conformation is shown in Fig. 3c, further elucidating the potential of our strategy in identifying ligands for given protein targets.
Regarding the five candidate ligand molecules for spike protein, their corresponding docking scores are presented in Fig. 4b. Based on the same empirical criterion for successful docking, three molecules met this condition, validating their potential to bind with spike protein. Notably, while all successfully docked molecules exhibited good affinity, spike-4 (in Fig. 4a) stood out with the lowest score, and its specific binding mode with spike protein is illustrated in Fig. 4c, visually demonstrating the interaction interface between the molecule and the protein target.
To better validate the ability of our model to identify ligands and further assess the stability and intermolecular interactions of candidate compounds binding to proteins, we conducted molecular dynamics (MD) simulations using GROMACS for the molecules with the best affinity for the two target proteins, spike-4 and tau-5. After the simulations, the trajectories were corrected, and representative structures from the equilibrium trajectories were extracted to display the binding conformations between the molecules and proteins. The binding conformations of these molecule-protein complexes are shown in Fig. 3d and 4d. It can be seen that although the contact positions of molecules are slightly different from those obtained via Autodock Vina (Fig. 3c and 4c), the molecules are also binding well to the proteins, where the residues that may interact with the ligand are annotated. We also calculated the RMSD of the ligand molecules over time to evaluate the simulation process, with the results shown in Fig. 3e, and 4e. In the later stages of the simulations, the RMSD of the ligands remained basically stable, indicating that the protein–ligand complexes were in a relatively stable state at this time.
Moreover, we extracted 10 ns trajectory files from the stable trajectories and calculated the binding free energy using the Molecular Mechanics/Poisson–Boltzmann (Generalized Born) Surface Area (MM/PBSA) method to assess the binding strength.34 The MM/PBSA values for tau-5 binding to tau protein and spike-5 binding to spike protein were found to be −56.25 kcal mol−1 and −20.85 kcal mol−1, respectively. These results indicate strong binding abilities and further prove that the method we proposed effectively identified good ligands for the given proteins, and has the potential to be widely applied.
In addition, we utilized ligands of two previously reported protein targets for model validation. Specifically, we retrieved experimentally validated ligands for tau protein from the binding DB, and acquired a set of spike protein ligands from the work conducted by Timoteo et al.35,36 Following the same selection criteria as previously mentioned, we selected the top-ranked molecules corresponding to tau/spike-like proteins and conducted MD experimental validation on them. Furthermore, we also chose the lowest-ranked molecules obtained during our case studies as representatives of inactive molecules for MD validation. The top-ranked and lowest-ranked molecules are listed in Fig. S8.† The results from VS2Net and MD align well with each other. Both in terms of quantitative validation through RMSD (Fig. S9 and S10†) and MM/PBSA (Table S4†), VS2Net's predictions are in good agreement with the MD simulations as well as the experimental data reported in the literature.
While our research presents promising advancements, it is not without limitations. First, we did not focus our efforts on optimizing the FISD dimension parameter, so 50 dimensions may not necessarily be the best choice. However, based on current results, selecting a 50-dimensional sequence can accomplish the intended tasks. If there is a desire to optimize this parameter, we believe that the selected dimension has to satisfy the requirements of retaining adequate spectral information for accurate predictions by subsequent classification models, while also striving to reduce dimensionality to decrease model complexity and lower the risk of overfitting. Besides, the training data encompass a limited number of protein targets amidst the vast number of molecules, leading to a constraint in model generalizability. Although the existing models and descriptors cannot directly support zero-shot learning, we have made the following attempts to demonstrate their few-shot learning capabilities: after training the model with data corresponding to 17 targets from DUD-E, we randomly selected one remaining target (HIVPR) for transfer learning. We progressively expanded the training data to validate the model's ability to predict protein–molecule interactions. As shown in Fig. S7,† we found that with only a small amount of data (660 samples for training, including 60 active and 600 inactive samples), the model was able to achieve a certain level of performance, yielding satisfactory results. Consequently, when confronted with novel targets, retraining the model with pertinent data becomes necessary and doable to ensure accurate predictions. Fortunately, the limited open-source datasets do not hinder the application of FISD in tackling diverse tasks. In practical drug design scenarios, where the target protein is often unique and predefined, it is advisable to train a single-target model leveraging the pre-accumulated data tailored to the FISD approach, thereby maximizing the utility of our framework.
For protein samples, we adopt the MLPS method proposed by Ye et al., which takes protein structural files as the input and outputs fragmental spectral information. Thus, we utilize MLPS to extract the spectral range from 1550 cm−1 to 1750 cm−1 in the infrared spectra of proteins, covering a total spectral range of 200 cm−1. Similarly, after being normalized, the spectral interval is divided into 50 equally spaced fragments, and each fragment undergoes integration, resulting in an FISD for the protein. The resulting FISD is also a 50-dimensional sequence, ensuring consistency and comparability in data formats.
As a dataset for quantum chemical properties, QM9 provides detailed quantum chemical information for 127468 small molecules containing up to 9 heavy atoms (C, N, O, F), stored in SMILES format. Aiming at obtaining vibrational spectra, molecules were optimized at the level of b3lyp/TZVP and frequencies were calculated using the Gaussian16 software program. After obtaining the vibrational frequencies and intensities of the QM9 molecules, we can pre-process them into FISD following the process mentioned above, obtaining FISD sequences for each molecule. Based on these data, we partitioned the QM9 dataset into training, validation, and test sets (100
000/20
000/7468) for the MLMS model.
The DUD-E dataset, a benchmark for evaluating molecular docking algorithms and model performance, comprises exhaustive data for multiple target proteins, each associated with approximately 200 active ligands and 50 decoy molecules per ligand. Despite their similarities in physicochemical properties and 2D topological structures, these molecules exhibit distinct biological activities.
For the DUD-E dataset, we selected a subset of 2200 molecules that met specific criteria. We used the Gaussian16 program to optimize their structures and calculate vibrational frequencies under identical parameters. This subset of data was used in preliminary experiments, and the relevant content can be found in the Related DUD-E Subsets part in the ESI.†
For the FISD of proteins, we employed the MLPS to attempt to obtain the FISD for the target proteins corresponding to the DUD-E dataset (encompassing 82 targets, excluding 20 targets that were incompatible with the MLPS model). For the molecules, we processed the ligands and decoy molecules corresponding to each target protein separately. Their SMILES strings were converted into a format readable by MLMS using RDKit (Graph), and the FISD of the molecules was obtained through MLMS.
For VS2Net model training, validation, and testing, we allocated 80% of the active molecules (14259 small molecules) corresponding to each available target protein, along with ten times that number of decoy molecules (142
590 small molecules), as the training set. We designated 10% of the active molecules (2822 small molecules), along with ten times that number of decoy molecules (28
220 small molecules), as the validation set, and the remaining molecules (2822 active molecules and 1
007
742 decoy molecules) were used as the test set.
While the training protein structures all originated from the DUD-E dataset, we designed two case studies to demonstrate model capabilities when facing the scenarios of introducing new target proteins. We selected two targets which are not included in the DUD-E dataset: the spike protein (PDB ID: 7lm9), and the tau protein (PDB ID: 8q96). Their FISDs are obtained using the MLPS model.
Specifically, the MLMS model processes SMILES strings into graph structures composed of atomic nodes and chemical bond edges via RDKit. Each atomic node is endowed with a 45-dimensional feature vector encompassing atom type, formal charge, hybridization state, etc., and the detailed information can be seen in Table S1.† The optimized MLMS model directly generates FISD representing input molecules.
The MLMS model makes rapid virtual screening possible, because the VS2Net's input requires two parts of features: FISD for molecules, either from DFT calculated vibrational frequency or from MLMS outputs, and corresponding FISD for protein targets, obtained by MLPS. These descriptors are concatenated into a 100-length joint feature vector, which serves as the input to VS2Net. Trained under supervised learning, the model outputs a value, which can be translated into a binary classification: values closer to 1 indicate higher likelihood of active molecule–protein interactions, while those closer to 0 suggest inactive (decoy) interactions, predicting the active or inactive status of protein–ligand interactions.
The first two components of the MLMS model employ a four-layer GCN architecture, each layer utilizing the ReLU activation function to exploit the topological structural information of molecular graphs. A global max pooling layer follows the GCNs to capture the overall graph representation and facilitate subsequent prediction tasks. This is further refined through two dense layers, outputting preliminary FISD. During training, we employ Mean Squared Error (MSE) and Cosine Embedding Loss (CosEmbeddingLoss) as loss functions, training the two components separately to capture distinct aspects of the descriptors.
The third component involves fusing and optimizing the results from the previous two components. Initially, a scaling factor is derived from the numerical values of these results, and the model output trained with CosEmbeddingLoss is multiplied by this factor to align its scale with that of the MSE-trained model. Subsequently, the outputs from both models are concatenated into a 100-dimensional sequence and fed into dense neural networks, yielding the final optimized 50-dimensional FISD. This step integrates the strengths of models trained with different loss functions, enhancing the accuracy and robustness of the descriptors. For the training of the MLMS model, we utilized the QM9 dataset, specifically the training set data as delineated in the pre-processing stage. In the first two components, we used molecular graphs as the input and the processed FISD as labels for supervised learning. In the third component, we took the output values from the first two models as the input and the processed FISD as labels for further supervised learning. Ultimately, we obtained the predictive model.
Thus, the MLMS model is capable of swiftly generating FISD for molecules by inputting SMILES strings. Although the MLMS model is trained solely on the QM9 dataset, it remains capable of achieving a certain level of consistency with DFT-calculated spectra when applied to predict molecules in the larger DUD-E dataset, which can be seen in Fig. S4.†
We designed and implemented a virtual screening model named VS2Net based on a DNN, consisting of five DNN layers with ReLU as the inter-layer activation function to enhance the network's nonlinear processing capabilities. The process of using VS2Net to conduct virtual screening is shown in Fig. 6c. The model's input integrates 100-dimensional FISD (50 for molecule-FISD and 50 for protein-FISD). At the output layer, a sigmoid activation function is applied to output a probability value between 0 and 1, and based on the results, we can classify the molecules as either active molecules or inactive molecules. During training, the Binary Cross-Entropy Loss (BCEloss) is employed as the loss function to minimize the discrepancy between predicted values and true labels.
VS2Net was primarily trained and tested on the DUD-E dataset. Additionally, we conducted pre-experiments and transfer-learning cases, where the models shared a similar structure with VS2Net but were trained on different data. The detailed results and a transfer learning case result can be seen in Fig. S5–S7.† Their training processes are similar. Both take the FISD of proteins and molecules as the input and the classification results of whether the molecule is an active or inactive ligand for the protein as the output for supervised learning. The model that performs best on the validation set is saved and used for testing.
In future endeavours, one can harness the predictive prowess of this methodology and seamlessly integrate it with Artificial Intelligence Generated Content (AIGC) technology, thereby enabling a more profound and efficient exploration of chemical space. The integration will facilitate the intricate design of novel ligands, advancing the frontier of drug design. Furthermore, the model's generalization capabilities and versatility can be augmented by enriching it with an even broader spectrum of high-quality data, ensuring its robustness across diverse chemical contexts. The study provides physicochemical insights that are valuable to researchers in the field of explainable AI, shedding light on novel perspectives regarding the intricate interplay between model performance and reasonable feature engineering. It not only deepens our understanding of these complexities but also serves as a foundational cornerstone for drug discovery research, inspiring the development of innovative spectroscopic descriptor methodologies that will revolutionize the drug design landscape.
Footnotes |
† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d5sc00451a |
‡ C. C. L. W. and Y. F. contributed equally to this work. |
This journal is © The Royal Society of Chemistry 2025 |