Open Access Article
Tieu-Long Phan
ab,
The-Chuong Trinh
c,
Van-Thinh To
d,
Thanh-An Pham
d,
Phuoc-Chung Van Nguyen
d,
Tuyet-Minh Phan
d and
Tuyen Ngoc Truong
*d
aBioinformatics Group, Department of Computer Science, and Interdisciplinary Center for Bioinformatics, Universität Leipzig, Härtelstraße 16-18, 04107, Leipzig, Germany
bDepartment of Mathematics and Computer Science, University of Southern Denmark, Odense M, DK-5230, Denmark
cFaculty of Pharmacy, Grenoble Alpes University, La Tronche, 38700, France
dFalcuty of Pharmacy, University of Medicine and Pharmacy at Ho Chi Minh City, Ho Chi Minh City, 700000, Vietnam. E-mail: truongtuyen@ump.edu.vn
First published on 2nd May 2024
HIV-1 (human immunodeficiency virus-1) has been causing severe pandemics by attacking the immune system of its host. Left untreated, it can lead to AIDS (acquired immunodeficiency syndrome), where death is inevitable due to opportunistic diseases. Therefore, discovering new antiviral drugs against HIV-1 is crucial. This study aimed to explore a novel machine learning approach to classify compounds that inhibit HIV-1 integrase and screen the dataset of repurposing compounds. The present study had two main stages: selecting the best type of fingerprint or molecular descriptor using the Wilcoxon signed-rank test and building a computational model based on machine learning. In the first stage, we calculated 16 different types of fingerprint or molecular descriptors from the dataset and used each of them as input features for 10 machine-learning models, which were evaluated through cross-validation. Then, a meta-analysis was performed with the Wilcoxon signed-rank test to select the optimal fingerprint or molecular descriptor types. In the second stage, we constructed a model based on the optimal fingerprint or molecular descriptor type. This data followed the machine learning procedure, including data preprocessing, outlier handling, normalization, feature selection, model selection, external validation, and model optimization. In the end, an XGBoost model and RDK7 fingerprint were identified as the most suitable. The model achieved promising results, with an average precision of 0.928 ± 0.027 and an F1-score of 0.848 ± 0.041 in cross-validation. The model achieved an average precision of 0.921 and an F1-score of 0.889 in external validation. Molecular docking was performed and validated by redocking for docking power and retrospective control for screening power, with the AUC metrics being 0.876 and the threshold being identified at −9.71 kcal mol−1. Finally, 44 compounds from DrugBank repurposing data were selected from the QSAR model, then three candidates were identified as potential compounds from molecular docking, and PSI-697 was detected as the most promising molecule, with in vitro experiment being not performed (docking score: −17.14 kcal mol−1, HIV integrase inhibitory probability: 69.81%)
Machine learning has revolutionized many fields, including drug discovery. In this field, AI (artificial intelligence) has been used to create predictive models for ADMET, drug response,10 toxicity,11 and anticancer activity.12 These models allow virtual screening and prediction of compound activity.13 Several studies have been conducted on building models based on machine learning to predict the activity of compounds against IN, including those by A. Kurczyk et al. (2015), Y. Li et al. (2017), and L. A. Machado et al. (2022).14–16 In contrast to traditional QSAR models,17 which solely focused on linear equations to correlate the molecular descriptor with biological activities, the machine learning approach enables the implementation of non-linear models for QSAR.
This study aimed to discover promising candidates for organic synthesis plans by building computational models and using those models to screen the dataset of repurposing compounds from the DrugBank database for potential IN inhibitors. In this study, a novel approach was taken to select fingerprints or descriptors using the Wilcoxon signed-rank test.18 Likewise, the model selection process for determining the optimal model was conducted differently from the approach used by Y. Li et al. (2017).14 The study of Y. Li et al. utilized ECFP_4 fingerprint as the model input along with Support Vector Machine, Decision Tree, Function Tree, and Random Forest for machine learning model.
235 structures collected from the DrugBank database were prepared for virtual screening, with repurposing and repositioning strategies.
:
20) with stratification principle resulted in 1995 compounds in the training set and 499 compounds in the external validation set with the imbalance ratio of 0.404 between the active and inactive classes.
Then, the data mining process was conducted on the training set and similar methods were applied the external validation set. First, 1995 compounds with molecular features underwent data removal to eliminate duplicate rows and columns, followed by missing values handling utilizing KNNImputer from the scikit-learn library (only for the 3D-Mordred dataset) before ending up with low variance removal using a threshold of 0.05. Next, the Local Outlier Factor (LOF) technique was employed with a parameter setting of “n_neighbors = 20” to remove outliers in the training set and novelty in the external validation set. The final step in the data mining process is data normalization using a rescaling method, in which MinMaxScaler was applied to map all data into the range [0,1].
Feature extraction was executed using algorithms from the Random Forest (RF) algorithm, followed by applying 10 different machine learning algorithms to the molecular features: Logistic Regression (Logistic), k-Nearest Neighbor (KNN), Support Vector Machine (SVM), Random Forest (RF), Extra Tree (ExT), Adaboost (Ada), Gradient Boosting (Grad), XGBoost (XGB), CatBoost (CatB), and Multilayer Perceptron (MLP). A meta-analysis was conducted to identify the optimal feature set, employing the Wilcoxon signed-rank test and utilizing a 3 × 10 Repeated Stratified K-Fold cross-validation approach with the F1-score serving as the primary evaluation metric. To account for multiple comparisons in the Wilcoxon test, the Holm24 method was applied, as indicated by the p_adjust = ‘holm’ parameter in the scikit-posthocs package,25 crucial for controlling the family-wise error rate amidst numerous pairwise comparisons. The Holm method systematically adjusts each p-value from the pairwise tests, reducing false positives and enhancing the validity of the significant results, illustrating a detailed statistical analysis approach and ensuring the reliability of the molecular feature set selection findings.
The reduced-dimensional data was performed for model selection to select an optimal algorithm for model development. Ten different algorithms were selected for this step, including Logistic Regression (Logistic), k-Nearest Neighbor (KNN), Support Vector Machine (SVM), Random Forest (RF), Extra Tree (ExT), Adaboost (Ada), Gradient Boosting (Grad), XGBoost (XGB), CatBoost (CatB), and Multilayer Perceptron (MLP). 3 × 10 RepeatedStratifiedKFold cross-validation with Wilcoxon signed-rank test26 was also performed for these two stages.
Moreover, the Tree-structured Parzen Estimator (TPE) algorithm from the Optuna library was utilized for Bayesian Optimization (BO) to optimize the hyperparameters. BO is a global optimization technique that builds a surrogate model of the objective function and uses an acquisition function to suggest the next sample point.27,28
The models' performance in this study was evaluated using statistical parameters such as F1-score, average precision, precision, and recall. Precision is calculated as the ratio of true positive predictions to the sum of true positive and false positive predictions.30
Recall is a statistical measure that quantifies the proportion of true positive instances that are correctly identified by a predictive model.30
Average Precision (AP) is calculated as the weighted mean of precision at each threshold, the weight is the increase in recall from the prior threshold.31
The F1-score is calculated as the harmonic mean of precision and recall, providing a measure of the trade-off between them.30
000 iterations, the ligands were saved in .pdb format. Further preparation with MGLTools 1.5.7 added hydrogens and Gasteiger charges, resulting in the final.pdbqt files for the ligands. Finally, the grid box was defined as a cube of 60 × 60 × 60 grid points, with coordinates x = 143.399 Å, y = 159.402 Å, and z = 177.382 Å. Molecular docking was performed using AutoDock-GPU, employing specific parameters to guide the process. The seed parameter was set to a random number seed of 42 to ensure reproducibility. We used 1000 runs of the Lamarckian genetic algorithm (nrun) to thoroughly explore the potential binding configurations. Additionally, a maximum of 2 × 109 score evaluations per LGA run (nev) was specified, allowing an extensive assessment of the docking interactions within each run.
The performance of the docking model was validated by redocking to validate docking power (RMSD ≤2 Å) and retrospective control (enrichment analysis) to validate screening power. DeepCoy was utilized to generate decoy (active: decoy = 1
:
50) for the latter step,33 and the performance was measured by the Receiver Operating Characteristic (ROC) curve and Area Under the Curve (AUC). Additionally, the Geometric Mean (G-Mean) was employed to determine the optimal cut-off point for the ROC curve.34 The G-Mean is a metric measuring the balance between classification performance for majority and minority classes.
From Fig. 2, the RDK7 fingerprint experienced the highest average F1-score in accordance with cross-validation (0.811), calculated from 10 models with 300 observations the whole. Meta-analysis was also conducted for pairwise comparison among fingerprints and descriptors using the Wilcoxon signed-rank test, illustrated in Table S2 (ESI).† As shown in Fig. 3, RDK7 showed a statistically higher significance of average F1-score compared to other datasets (p < 0.05). Therefore, RDK7 was selected as the optimal molecular feature to develop a machine learning model.
![]() | ||
| Fig. 3 The heatmap illustrated the Wilcoxon signed-rank test 16 types of fingerprints and descriptors for meta-analysis. | ||
According to the box and whisker plot in Fig. 4, feature selection methods were stable except for mutual information, Logistic Regression, and AdaBoost returning F1-score outliers after 30 times cross-validation. While using the Wilcoxon signed-rank test for F1-score comparison among 8 models, only the Chi-squared and Mutual information gave statistically significantly lower results than the baseline model (p < 0.05). Other models had no statistically significant difference compared to the baseline model, so the feature extraction methods did not meet the first criterion.
On the other hand, the XGBoost and the Logistic Regression achieved the highest average F1-score among all the models, except for the baseline model. However, according to pairwise assessment of these two models (Fig. S1 ESI†), there was no statistically significant difference (p > 0.05). The second criterion, aimed at reducing computational resources by minimizing the number of features, was taken into consideration. The XGBoost algorithm had 533 bits, a lower number of features than the Logistic Regression algorithm, which had 744 bits. Therefore, the XGBoost algorithm was selected to reduce the dimension of the RDK7 dataset. The results of the features selection comparison are illustrated in Table S3 (ESI†).
Based on the box and whisker plot in Fig. 5, XGBoost (0.842 ± 0.039) and CatBoost (0.842 ± 0.044) achieved the highest average F1-score. However, when the Wilcoxon signed-rank test was applied, these differences were not statistically significant (p > 0.05) compared to Logistic Regression, Random Forest, Gradient Boosting, and Multilayer Perceptron (Fig. 6).
We continued to use average precision (AP) to evaluate model performance (Fig. 7). Two models, including XGBoost (0.929 ± 0.029) and CatBoost (0.927 ± 0.030) remained the best performers. In this case, XGBoost showed significant differences when compared to most of the others (p < 0.05). In addition, XGBoost had a shorter training time than CatBoost. Thus, in terms of the second criterion for this step, XGBoost model was selected for the optimization.
The results derived from the cross-validation and external validation were consistent. This external validation result was highly generalizable and can be applied in virtual screening.
| Cross-validation | External validation | |||||
|---|---|---|---|---|---|---|
| AP | F1-score | Recall | AP | F1-score | Recall | |
| Baseline | 0.929 ± 0.030 | 0.842 ± 0.038 | 0.835 ± 0.045 | 0.921 | 0.856 | 0.878 |
| Optimize | 0.928 ± 0.027 | 0.848 ± 0.041 | 0.855 ± 0.052 | 0.921 | 0.889 | 0.921 |
According to Fig. 8, the CV-recall values are significantly higher after optimization than the default hyperparameter model, with a p-value ≤ 0.01. However, the average CV-AP and CV-F1 scores do not show a statistically significant improvement after optimization, with p-values of 0.33 and 0.06 respectively.
![]() | ||
| Fig. 8 The internal cross-validation results of the model, both before and after hyperparameter optimization, (A) average precision, (B) F1-score, (C) recall. | ||
Compared with the study of L. A. Machado et al.,16 a state-of-the-art machine learning model targeting HIV integrase utilizing Mordred descriptor, our model could not outperform in external validation, with F1-score being 0.89 lower than the 0.93 of their study. However, our model development procedure is more rigid, with several decision-making stages, supported by the Wilcoxon signed-rank test of cross-validation. Moreover, our study performed cross-validation in the development pipeline for selection and optimization. At the same time, external validation was conducted in the final stage to prove the generalization of the machine learning model. The applicability domain was also investigated to remove five substances in the external validation set to ensure the interpolation of our model (Fig. 9). From Fig. 9 the red point was detected as a novelty, or outside applicability domain, which was far from the training set (grey points). This could solve the problem of sparse space in the bounding box approach.
According to the AUC-ROC curve in Fig. 10B, the AUC value was 0.876 for the most negative conformation (ad_gpu_min) with the G-Mean value reaching 0.841. The docking threshold extrapolated from the G-Mean was −9.71 kcal mol−1.
235 substances from DrugBank were subjected to a medicinal chemistry filter, incorporating Lipinski's Rule36 of 5 (RO5 = 4), SAscore,37 and PAINS,38 yielding 8333 structures. Subsequent screening through a 2D-QSAR classification model identified 44 compounds as active. These compounds were further analyzed using molecular docking, leading to the identification of three notable compounds: two existing medicines and one hit compound (PSI-697). The outcomes of this virtual screening process are depicted in Fig. 11.
In general, the results obtained from the molecular docking process showed that all three compounds in Fig. 12 formed electrostatic interactions with both Mg2+ ions. This is the first and highly important pharmacophore characteristic shared by all currently available INSTIs on the market. Additionally, all three selected conformations formed hydrogen bonds with the sidechain carbonyl group of Asp64, while fitting within the binding pocket between the two subunits of HIV integrase (matching the binding site of Bictegravir in the initial protein-ligand complex with PDB ID 6PUW).
![]() | ||
| Fig. 12 The binding modes of three potential candidates from the QSAR model. Blue: Elvitegravir (−17.32, 98,82%). Red: Dolutegravir (−11.42, 98,63%). Green: PSI-697 (−17.14, 69,81%). | ||
Regarding PSI-697 (green) in Fig. 13, the binding mode was similar to Bictegravir, but the docking score was more negative, with the figures being −17,14 kcal mol−1 and −11 kcal mol−1, respectively. This could be explained by the hydrogen bonds with the sidechain of Glu152, which was observed in the complex of Bictegravir and protein. Moreover, the docking score of PSI-697 was also approximately equal to Elvitegravir (−17,32 kcal mol−1). The HIV integrase inhibitory probability of PSI-697 from the QSAR model was also good, with the figure being around 69%. As a result, PSI-697 was the most promising candidate targeting HIV integrase for both inhibition and binding ability.
The RDK7 fingerprint proved the most suitable and XGBoost was the best model. External validation yielded impressive results with an F1-score of 0.889, average precision of 0.921, and recall of 0.921. These findings are highly generalizable and valuable for the virtual screening of potential HIV-1 integrase inhibitors.
The repurposing structure library was screened, resulting in the identification of one potential compound. We recommend the synthesis and biological activity testing of this potential compound.
Footnote |
| † Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d4ra02231a |
| This journal is © The Royal Society of Chemistry 2024 |