Gabriela Valle-Núñez, Raziel Cedillo-González, Juan F. Avellaneda-Tamayo, Fernanda I. Saldívar-González, Diana L. Prado-Romero and José L. Medina-Franco*
DIFACQUIM Research Group, Department of Pharmacy, School of Chemistry, Universidad Nacional Autónoma de México, Avenida Universidad 3000, Mexico City 04510, Mexico. E-mail: medinajl@unam.mx; Tel: +52-55-5622-3899
First published on 4th April 2025
Viral infections represent a significant global health concern. Viral diseases range from mild illness to life-threatening conditions, and their impact has grown as globalization has increased contagion rates. A prime example is the SARS-CoV-2 pandemic, which emphasized the urgent need to design and develop new antiviral drugs. This study aimed to generate a curated data set of compounds relevant to respiratory infections, focusing on predicting their antiviral activity. Specifically, the study leverages machine learning (ML) classification models to evaluate focused and on-demand compound libraries targeting pathways associated with viral respiratory infections. The ML models were trained on antiviral activity data related to respiratory diseases deposited in a major public compound database annotated with biological activity. The models were validated and retrained to classify and design antiviral-focused libraries for seven respiratory targets.
Despite ongoing efforts, developing effective antivirals for most viruses remains a significant challenge due to several key obstacles in antiviral discovery. These include the identification of specific targets, narrow treatment windows, vector spread and control, and the emergence of mutations that contribute to antiviral resistance.7 In response, there has been a growing focus on developing structurally diverse antivirals with enhanced safety profiles, as well as those that retain efficacy against drug-resistant strains. This shift in focus has led to renewed interest in compounds with novel mechanisms of action.8
Acute respiratory disease (ARD) represents a significant portion of acute illnesses and fatalities worldwide. Acute viral respiratory tract infections alone are responsible for approximately 80% of ARD cases.9 Key viral pathogens in this category include influenza, respiratory syncytial virus (RSV), coronaviruses, adenovirus, and rhinovirus, all of which are related to some of the most highlighted diseases on the WHO's prioritized list (Table S1†). While viruses like adenovirus and rhinovirus typically result in lower mortality rates, they contribute substantially to morbidity and place a significant economic burden on healthcare systems.10
The emergence of highly pathogenic coronaviruses, such as the SARS-CoV-2 virus, responsible for the COVID-19 pandemic, has highlighted the severe threat posed by these pathogens. Other coronavirus strains, including those that caused the Severe Acute Respiratory Syndrome (SARS) and Middle East Respiratory Syndrome (MERS) outbreaks, persist as significant public health risks, and place substantial pressure on healthcare systems, especially in regions with high comorbidity rates and limited financial resources.11,12
The COVID-19 pandemic represented one of the most significant threats to global health and stability in recent history, triggering an unprecedented surge in antiviral drug and vaccine research, alongside broader innovations in healthcare and daily life. Antiviral development integrates a diverse array of strategies, spanning well-established therapeutic approaches and emerging targeted interventions.13 This field draws upon both synthetic and natural sources, yielding compounds that exhibit a wide range of chemical structures and mechanisms of action, including direct inhibition of viral replication, immune system modulation, and disruption of host–virus interactions.14–16
Guo et al. reviewed recent advances in natural products (NPs) for antiviral research, with a particular focus on addressing drug resistance.16 Various NPs target essential viral enzymes such as integrase, reverse transcriptase, and protease.17 Flavonoids and polyphenols constitute the largest group of antiviral NPs, followed by diterpenes and triterpenes, with fewer examples found among alkaloids.18 Examples of plant-derived compounds with demonstrated antiviral properties are quercetin, curcumin, and baicalein. Quercetin has shown effectiveness against RSV, MERS-CoV, influenza, and rhinoviruses through inhibition of viral entry and replication.19 Curcumin has been shown to inhibit the SARS-CoV-2 spike glycoprotein, ACE2 receptor, and proteases.20 Scutellaria baicalensis root extract is traditionally used in Asia as an antiviral, antioxidant, and anti-inflammatory agent. This extract contains baicalein, which has demonstrated inhibition of SARS-CoV-2 main protease (Mpro) activity and viral replication in vitro (Fig. 1).21,22
Repositioning of approved drugs and of molecules in advanced stages of development has also played a key role in antiviral discovery, as in the case of SARS-CoV-2.23,24
Computer-aided drug design (CADD) has significantly advanced antiviral discovery. Liao et al. identified five natural compounds – narcissoside, kaempferol-3-O-gentiobioside, rutin, vicin-2, and isoschaftoside – as potential SARS-CoV-2 Mpro inhibitors.25 Generative topographic mapping (GTM) has aided in identifying antiviral motifs and screening virtual chemical libraries, as demonstrated in the design of anti-herpes compounds (herpes simplex virus type 1).26,27 CADD methods have also identified several promising antiviral compounds. These include baricitinib, galidesivir, and molnupiravir. Baricitinib was predicted by artificial intelligence (AI)-driven analysis to inhibit viral entry and inflammation in SARS-CoV-2.28 Galidesivir, an antiviral for Ebola and Zika, was evaluated through structural modeling as a potential inhibitor of SARS-CoV-2 RdRp.29 Molnupiravir (EIDD-2801), a prodrug of β-D-N4-hydroxycytidine, was optimized through docking and molecular dynamics to interfere with SARS-CoV-2 replication (Fig. 1).30 Approved antivirals such as remdesivir, favipiravir, and ritonavir have been repurposed for respiratory viruses through virtual screening (VS), further confirming their potential to inhibit RNA polymerases (Fig. 1).31
Focused virtual libraries of compounds are valuable resources for bioactive compound discovery. These libraries compile data on molecules with potential biological activity, identified through ligand-based and structure-based drug discovery approaches. They play a crucial role in prioritizing candidates for synthesis, biological evaluation, and efficient allocation of resources. Notable recent examples of disease-specific focused virtual libraries include those targeting neglected infectious diseases,32 SARS-CoV-2,33,34 Sirtuin-1 dysregulation,35 and type 2 diabetes mellitus.36
Given the ongoing demand for respiratory-focused antivirals, extensive research has generated a wealth of structure–activity data available in public repositories such as ChEMBL.37,38 This data serves as a crucial input for machine learning (ML) models to design focused libraries for further experimental screening.
The main goal of this study was to design antiviral libraries focused on molecular targets related to respiratory diseases. To achieve this, we trained, retrained and validated ML classification models using bioactivity data from ChEMBL 33.37,38 The predictive models were used to filter compound libraries from diverse sources. As part of the data preparation to train the ML models, the chemical data sets were analyzed and characterized in terms of chemical diversity and coverage in chemical space using chemoinformatics methods. The resulting antiviral-focused chemical libraries, which are freely available in the public domain, offer valuable starting points for further computational and/or experimental screening, which is the next logical step of this study.
Family | Virus | Acronym | ChEMBL target ID |
---|---|---|---|
Coronaviridae | Feline coronavirus | FCoV | CHEMBL612744, CHEMBL4295624 |
Coronaviridae | Human coronavirus 229E | HCoV-229E | CHEMBL613837, CHEMBL4888440 |
Coronaviridae | Human coronavirus NL63 | HCoV-NL63 | CHEMBL3232683 |
Coronaviridae | Middle East respiratory syndrome-related coronavirus | MERS-CoV | CHEMBL4296578, CHEMBL4295557 |
Coronaviridae | Severe acute respiratory syndrome coronavirus | SARS-CoV | CHEMBL4802007 |
Coronaviridae | Severe acute respiratory syndrome coronavirus 2 | SARS-CoV-2 | CHEMBL4888460, CHEMBL5169223, CHEMBL4303835 |
Picornaviridae | Enterovirus A71 | HEV-71 | CHEMBL612436, CHEMBL4295606, CHEMBL4295525 |
Picornaviridae | Human rhinovirus | HRV | CHEMBL613760, CHEMBL2857, CHEMBL612470 |
Paramyxoviridae | Human parainfluenza virus 1 | HPIV-1 | CHEMBL1764934 |
Pneumoviridae | Human respiratory syncytial virus | HRSV | CHEMBL4635143, CHEMBL2364165, CHEMBL4630897 |
Orthomyxoviridae | Influenza A virus | IAV | CHEMBL613740, CHEMBL612610, CHEMBL2367089 |
Orthomyxoviridae | Influenza B virus | IBV | CHEMBL613129, CHEMBL4295840, CHEMBL2028641 |
Paramyxoviridae | Henipavirus nipahense | NiV | CHEMBL6047, CHEMBL615055 |

For compounds with multiple recorded biological activity values (“standard value”) against a target, we ranked these values from smallest to largest to ensure consistency in the data. The pIC50 was calculated for each compound, and compounds were classified according to the following criteria:
(a) Compounds with IC50 ≤ 10 μM were labeled as “Inhibitor”.
(b) Compounds with 10 μM < IC50 < 20 μM were labeled as “Unknown”.
(c) Compounds with IC50 ≥ 20 μM were labeled as “No_Activity”.
If a single category represented at least 80% of the recorded data for a compound, that category was used; otherwise, the label “Mixed” was assigned. Compounds with fewer than five data points retained their original classification, with “Mixed” assigned if they were labeled across multiple categories.
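As an illustration, the labeling rules above can be sketched in a few lines of Python. This is a minimal sketch, not the code from the study's notebooks: the function names are hypothetical, IC50 values are assumed to be given in μM, and the treatment of compounds with fewer than five data points follows our reading of the rule stated above.

```python
import math
from collections import Counter


def label_ic50(ic50_um):
    """Map a single IC50 value (in μM) to one of the three activity labels."""
    if ic50_um <= 10:
        return "Inhibitor"
    if ic50_um < 20:
        return "Unknown"
    return "No_Activity"


def pic50(ic50_um):
    """pIC50 = -log10(IC50 in mol/L); 1 μM corresponds to 1e-6 mol/L."""
    return -math.log10(ic50_um * 1e-6)


def compound_label(ic50_values_um, min_fraction=0.8, min_points=5):
    """Assign one label to a compound with one or more recorded IC50 values."""
    # Rank values from smallest to largest, then label each one.
    labels = [label_ic50(v) for v in sorted(ic50_values_um)]
    if len(labels) < min_points:
        # Assumption: with few data points, keep the label only if all values agree.
        return labels[0] if len(set(labels)) == 1 else "Mixed"
    top_label, top_count = Counter(labels).most_common(1)[0]
    # A single category must cover at least 80% of the recorded values.
    return top_label if top_count / len(labels) >= min_fraction else "Mixed"
```

For example, a compound with IC50 values of 1, 2, 3, 4, and 25 μM has four of five values (80%) in the “Inhibitor” range and keeps that label, whereas values of 1, 2, 3, 15, and 25 μM yield “Mixed”.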
Additionally, one supplementary compound database was assembled, containing approved antivirals from DrugBank 5.1.12.40 Compounds from ChEMBL associated with each viral target, along with those from DrugBank, were compiled into two collections.
A comprehensive data curation process was applied to the data sets to ensure data integrity. Compounds with null values, empty entries, or duplicates were removed, resulting in a final count of 4521 compounds from ChEMBL 33 and 92 approved antivirals from DrugBank. Molecular structures were standardized using RDKit version 2024.03.5,41 and MolVS,42 following a well-established standardization protocol.43 Data sets and code notebooks are publicly accessible through DIFACQUIM's GitHub repository at https://github.com/DIFACQUIM/antiviral_ML.
Target selection for predictive model development was guided by the criteria established by Sánchez-Cruz and Medina-Franco.47 According to these guidelines, a target was deemed suitable for predictive modeling if it included at least 30 active and 30 inactive compounds, and if it had a MODI score of 0.7 or higher for at least one molecular representation. Based on these criteria, we selected the seven targets listed in Table S2† for the construction of predictive models.
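To make the selection criteria concrete, the following is a minimal sketch of a MODI calculation and of the resulting modelability check. It assumes the usual definition of MODI (the class-averaged fraction of compounds whose nearest neighbour belongs to the same class), fingerprints represented as sets of on-bits, and Tanimoto distance; the function names are hypothetical and the study's own implementation may differ.

```python
def tanimoto_distance(a, b):
    """1 - Tanimoto similarity for two fingerprints given as sets of on-bits."""
    union = len(a | b)
    if union == 0:
        return 0.0
    return 1.0 - len(a & b) / union


def modi(fingerprints, labels):
    """Class-averaged fraction of compounds whose nearest neighbour shares their class."""
    same, total = {}, {}
    for i, (fp_i, y_i) in enumerate(zip(fingerprints, labels)):
        # Nearest neighbour of compound i among all other compounds.
        nearest = min(
            (j for j in range(len(fingerprints)) if j != i),
            key=lambda j: tanimoto_distance(fp_i, fingerprints[j]),
        )
        total[y_i] = total.get(y_i, 0) + 1
        same[y_i] = same.get(y_i, 0) + int(labels[nearest] == y_i)
    return sum(same[c] / total[c] for c in total) / len(total)


def is_modelable(n_active, n_inactive, modi_scores, min_per_class=30, min_modi=0.7):
    """Apply the criteria above: enough compounds per class and MODI >= 0.7
    for at least one molecular representation."""
    enough_data = n_active >= min_per_class and n_inactive >= min_per_class
    return enough_data and max(modi_scores) >= min_modi
```

Under these assumptions, a data set in which every compound's nearest neighbour shares its activity class gives a MODI of 1.0, and one in which no compound does gives 0.0.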
Only data for the seven selected targets (Table S2†) were retained for model building. To evaluate and reduce multicollinearity, the Pearson correlation between descriptors was computed. For any pair of descriptors with a correlation above 0.8, the second descriptor was discarded, prioritizing drug-likeness relevance.
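The correlation filter can be sketched as follows. This is a minimal illustration under stated assumptions: descriptors are supplied in priority order (most drug-likeness-relevant first), the absolute value of the Pearson coefficient is compared with the 0.8 threshold, and the descriptor names in the usage note are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equally long value lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # a constant descriptor carries no correlation information
    return cov / (sd_x * sd_y)


def drop_correlated(descriptors, threshold=0.8):
    """Keep descriptors in priority order, discarding the later member of any
    pair whose absolute correlation exceeds the threshold.

    descriptors: dict mapping descriptor name -> list of values per compound.
    """
    kept = []
    for name, values in descriptors.items():
        if all(abs(pearson(values, descriptors[k])) <= threshold for k in kept):
            kept.append(name)
    return kept
```

With hypothetical descriptors `{"MW": [100, 200, 300, 400], "HeavyAtoms": [7, 14, 22, 28], "LogP": [3.0, 1.0, 2.5, 0.5]}`, the heavy-atom count is nearly collinear with molecular weight and is dropped, while LogP is retained.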
For supervised binary classification modeling, we employed PyCaret version 3.3.2 for Python to develop models using 15 different ML algorithms (Table S3 in the ESI†).53 Each model was trained on ChEMBL data for the selected targets, associated with a binary activity label (active/inactive). Morgan Chiral of radius 2 (2048 bits) fingerprint and physicochemical descriptors were used as molecular representation, with PyCaret's default hyperparameter settings.
Normalization, fold generation, and imbalance correction were achieved using z-score normalization, stratified k-fold cross-validation, and the Adaptive Synthetic Sampling (ADASYN) algorithm, respectively.54 Additionally, we mitigated the risk of overfitting by enabling an early stopping mechanism to ensure that the models remain capable of generalizing well to new data points.
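For readers unfamiliar with these preprocessing steps, the sketch below illustrates z-score normalization and stratified fold assignment in plain Python. It is not the PyCaret implementation used in this work; the function names, the default of five folds, and the fixed random seed are assumptions made for illustration, and ADASYN oversampling is not shown.

```python
import random
import statistics


def stratified_folds(labels, k=5, seed=0):
    """Assign each sample to a fold (0..k-1) so that every fold keeps
    approximately the same active/inactive ratio as the full data set."""
    rng = random.Random(seed)
    folds = [None] * len(labels)
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        # Deal the shuffled members of this class round-robin across folds.
        for pos, i in enumerate(idx):
            folds[i] = pos % k
    return folds


def zscore_fit(column):
    """Return the mean and standard deviation estimated on training data."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)
    return mean, (std if std > 0 else 1.0)


def zscore_apply(column, mean, std):
    """Scale values with statistics obtained from the training data only."""
    return [(v - mean) / std for v in column]
```

Fitting the scaling parameters on the training folds and applying them unchanged to the held-out fold avoids leaking information from the validation data into the model.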
For both validations, we obtained the MCC and averaged it across all models in order to select the three best architectures for the data sets studied in this analysis (Table S4†). MCC is a robust metric for assessing the quality of binary classification models, ranging from −1 to 1. A value of 1 indicates perfect classification, 0 corresponds to random predictions, and −1 represents completely inverse predictions. The selected architectures were retrained on the complete data set for each target, with the aim of improving the MCC and model generalization. Retraining used the same hyperparameters as the initial model construction, and the retrained models were applied to predict the final antiviral activity class of the data set assembled for VS, as described in Section 2.7. Cross-validation was performed to assess the effectiveness of retraining by obtaining the retraining MCC value.
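For reference, MCC is computed from the confusion matrix as MCC = (TP·TN − FP·FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN)). A minimal Python sketch follows; the function name is hypothetical, labels are assumed to be encoded as 1 (active) and 0 (inactive), and returning 0 when the denominator vanishes is a common convention rather than something stated in this work.

```python
import math


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (1 = active, 0 = inactive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # convention when a row or column of the confusion matrix is empty
    return (tp * tn - fp * fn) / denom
```

A classifier that labels every compound as active, such as a dummy baseline, obtains an MCC of 0 regardless of class balance, which is why MCC is more informative than accuracy on imbalanced data sets.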
Database | Acronym | Description | Compounds after curation |
---|---|---|---|
ChemDiv coronavirus library58 | ChD_covL | Collection of small molecules with potential antiviral activity against coronavirus | 20… |
ChemDiv antiviral library59 | ChD_AvL | Collection of small molecules with potential antiviral activity, targeting over 50 key proteins in viruses | 64… |
OTAVA drug-like green collection60 | OT_DLGC | Drug-like green collection compound library, curated based on screening compounds for prompt delivery and pre-formatted according to Lipinski's rule of five | 169… |
Enamine antiviral library61 | Ena_AvL | Collection of molecules designed for discovery of new nucleoside-like antivirals | 3200 |
ChemSpace discovery diversity set62 | ChE_DDS | Collection of small molecules that are synthesized from in-house building blocks using carefully developed and optimized reactions | 10… |
LifeChemicals helicase targeted library63 | LC_HTL | Collection of structurally diverse molecules with potential activity against key helicase-related drug targets, selected by a chemoinformatics team through in silico molecular docking | 3291 |
LifeChemicals helicase focused library63 | LC_HFL | A curated collection of compounds targeting helicases, including viral and genetic disorder-related enzymes like hepatitis C NS3 and Werner syndrome helicases. Compounds were selected based on structural similarity (84% Tanimoto threshold) | 3665 |
LifeChemicals 2019-nCoV papain-like protease (PLP) targeted library64 | LC_plpL | A curated collection of drug-like compounds designed to target the PLP of SARS-CoV-2, using docking-based screening without constraints. Compounds were filtered for binding accuracy and removed if they were PAINs, toxic, or reactive | 1736 |
LifeChemicals DNA polymerase targeted library65 | LC_dptL | Library of structurally diverse compounds targeting DNA polymerase-related drug targets, developed using pharmacophore-driven screening | 628 |
LifeChemicals polymerase focused library 15 polymerase assays65 | LC_polL | A library of molecules identified for potential polymerase inhibition, created by screening 4567 active compounds from 15 polymerase assays targeting RNA and DNA polymerases. Compounds were selected using Tanimoto similarity from the life chemicals HTS compound collection and ranked by predicted activity | 15… |
LifeChemicals polymerase focused library similarity to ChEMBL database65 | LC_polsL | A library of drug-like screening compounds selected through a 2D fingerprint similarity search (Tanimoto index > 85%) against a reference set of 20… | 13… |
LifeChemicals SARS coronavirus focused library66 | LC_covL | A curated collection of small-molecule compounds selected through a 2D fingerprint similarity search, targeting key SARS-CoV proteins. The compounds were chosen based on activity criteria from a reference set of 300 known SARS inhibitors | 436 |
LifeChemicals 2019-nCoV main protease targeted library67 | LC_mproL | A curated collection of drug-like compounds designed to target the main protease of SARS-CoV-2, using docking-based screening without constraints. Toxicophore filters were applied, while peptide-like structures were retained to enhance binding potential | 2338 |
LifeChemicals antiviral targeted library68 | LC_AVL | A curated library of diverse compounds identified through structure-based screening, targeting antiviral proteins like hepatitis B core protein and influenza A PA endonuclease. Developed using phase modeling and life chemicals' HTS collection, with customization options available | 1350 |
LifeChemicals merged antiviral screening superset69 | LC_MASS | Data set of small-molecule compounds consolidated into individual screening subsets for various viral diseases, providing a comprehensive resource in one collection | 45… |
LifeChemicals antiviral library combined ligand-based and structure-based approaches70 | LC_ALCLBSBA | A curated library of potential antiviral agents, designed using protein crystal structures of key viral targets. Selected through glide docking and UNITY pharmacophore searches, with PAINs and reactive compounds excluded | 3514 |
LifeChemicals antiviral screening compound library 2D similarity70 | LC_ASCL2DS | The antiviral screening compound library was designed using a 2D fingerprint similarity search against a reference set of 46… | 15… |
LifeChemicals bioactive compound library71 | LC_BCL | A collection of structurally diverse screening compounds, each with confirmed biological activity against approximately 600 pharmaceutical targets | 9897 |
LifeChemicals EF1A targeted library GDP site72 | LC_gdp | Library that includes compounds selected through docking-based VS of GDP site on the eEF1A protein. The compounds have high predicted affinity, are Ro5-compliant, and exclude PAINS, toxic, or reactive groups. Subsets for each binding site are provided with docking scores | 1267 |
LifeChemicals EF1A targeted library EF1B site73 | LC_ef1b | Library that includes compounds selected through docking-based VS of EF1B site on the eEF1A protein. The compounds have high predicted affinity, are Ro5-compliant, and exclude PAINS, toxic, or reactive groups. Subsets for each binding site are provided with docking scores | 1544 |
LifeChemicals pre-plated coronavirus COVID-19 screening set-384 well73 | LC_cov19 | Screening set consists of drug-like compounds from the 2019-nCoV Mpro targeted library, designed to support anti-coronavirus drug discovery efforts | 2300 |
LifeChemicals preplated helicase screening set 6080 cmpds 384 well73 | LC_PHSS384 | Screening sets that include drug-like small-molecule compounds with potential helicase-related activity for drug discovery targeting infectious diseases and cancer. Alternatively, two smaller, non-overlapping subsets of 3520 and 2560 helicase-focused molecules are also available for separate purchase | 6080 |
General screening antiviral data set (a) | VS data set | Data set containing only unique structures from all chemical libraries | 339… |

(a) Number of compounds before curation: 396…

Target | Organism | Count | pIC50 median | Active | Inactive | MODI MACCS keys (166 bits) | MODI Morgan Chiral of radius 2 (2048 bits) |
---|---|---|---|---|---|---|---|
M2 proton channel | IAV | 92 | 5.45 | 68 | 24 | 0.82 | 0.83 |
Mpro | SARS-CoV | 197 | 4.52 | 77 | 120 | 0.66 | 0.77 |
Mpro | SARS-CoV-2 | 815 | 6.35 | 651 | 164 | 0.88 | 0.91 |
Neuraminidase | IAV | 1123 | 5.72 | 733 | 390 | 0.88 | 0.91 |
Neuraminidase | IBV | 202 | 5.47 | 132 | 70 | 0.72 | 0.71 |
Polymerase (PA) | IAV | 256 | 5.40 | 151 | 105 | 0.84 | 0.88 |
Protease | HRV | 389 | 5.96 | 298 | 91 | 0.83 | 0.85 |

Targets such as SARS-CoV-2 Mpro and IAV neuraminidase exhibited the highest MODI scores (0.88 for MACCS keys (166 bits) and 0.91 for Morgan Chiral of radius 2 (2048 bits)), indicating strong modelability. In contrast, targets like SARS-CoV Mpro, which was selected for its relatively higher MODI score with Morgan Chiral of radius 2 compared to MACCS keys, or IBV neuraminidase, which had lower MODI scores (0.72 for MACCS keys and 0.71 for Morgan Chiral of radius 2), suggest potentially more challenging modelability. Notably, when comparing the performance of the two fingerprints, most targets showed slightly better modelability with Morgan Chiral of radius 2, suggesting that these fingerprints capture relevant chemical features more effectively for these targets.
The ratio of active to inactive compounds also appears to influence the MODI score. For instance, IAV neuraminidase, with a high number of both active (733) and inactive (390) compounds, had a correspondingly high MODI score. Additionally, the relationship between the median pIC50 values and the MODI scores is worth highlighting. For example, SARS-CoV-2 Mpro, with the highest median pIC50 value (6.35), aligned with its strong modelability, whereas targets with lower pIC50 values may exhibit more variable performance. Similarly, the total number of compounds may serve as another important indicator of modelability. For example, SARS-CoV-2 Mpro, with the second largest data set, likely benefits from a richer data set for model training. Conversely, smaller data sets, such as IAV M2 proton channel, may result in reduced predictive performance due to limited data diversity, as discussed further in the next sections. It is also notable that targets, such as IBV neuraminidase, had lower MODI scores despite a reasonably balanced data set. This discrepancy could be attributed to factors such as structural complexity or compound heterogeneity, which may pose additional challenges for predictive modeling.
It is important to emphasize that the ML models were constructed using different sets of physicochemical properties tailored to each target. This variation could significantly influence the data modelability, as certain properties might be more relevant for specific viral targets. For future research, exploring the impact of these individual properties on the performance of the models could provide deeper insights and help refine predictive frameworks.
The most frequent scaffolds in the curated and standardized training data sets for each target are shown in Fig. S2†.
Antiviral drugs commonly exhibit diverse structural scaffolds, including adenine derivatives and privileged frameworks that facilitate interactions with multiple viral mechanisms. These compounds frequently feature key atoms such as nitrogen, oxygen, and carbon, which play critical roles in their biological activity and contribute to their structural diversity.78
As presented in the ESI, Fig. S3† shows the most frequent ring systems in the training data sets, while Fig. S4 and S5† depict the predominant scaffolds and ring systems in the VS data set, respectively. Not surprisingly, the benzene ring was the most frequent scaffold and ring system, reflecting its ubiquitous presence across the chemical data set. Notably, the top ring systems in the VS data set have a high prevalence of nitrogen-containing rings, a feature that holds significant promise for enhancing antiviral activity due to their well-established role in molecular recognition and binding to viral targets.79,80
Consistent with the discussion of data modelability in Section 3.2, the IAV M2 proton channel target had the lowest MCC values for all three top models in each validation phase. This result emphasizes the importance of a minimum number of compounds and reinforces the relevance of applying several criteria when selecting and analyzing molecular targets to develop reliable and useful ML models.
As illustrated in Fig. 6, some models were robust for a few targets at different validation phases but not in all of them. For instance, the AdaBoost classifier performed best for IBV neuraminidase during retraining while others like the Support Vector Machine (SVM – linear kernel) performed best for IAV polymerase (PA) during the internal validation of the training process. Moreover, this model showed consistent MCC values across training and retraining. This consistency suggests robust model performance and generalizability.
Fig. 6 Comparison of MCC per target, best-fit model (highlighted in yellow in Table S6†), and validation phase.
Target | Predicted actives by 3 models (a) | Predicted actives by 2 models (a) | Predicted actives by 1 model (a) | Top best model | Active | Second best model | Active | Third best model | Active |
---|---|---|---|---|---|---|---|---|---|
IAV_polymerase (PA) | 9234 (2.72) | 11… | 26… | SVM – linear kernel | 22… | Extra trees classifier | 26… | Gradient boosting classifier | 27… |
HRV_protease | 225… | 69… | 24… | Gradient boosting classifier | 302… | Extreme gradient boosting | 284… | AdaBoost classifier | 253… |
IAV_M2 proton channel | 13… | 174… | 146… | Quadratic discriminant analysis | 326… | K neighbors classifier | 26… | Dummy classifier | 181… |
SARS-CoV_Mpro | 16… | 46… | 103… | Linear discriminant analysis | 49… | K neighbors classifier | 112… | SVM – linear kernel | 85… |
SARS-CoV-2_Mpro | 169… | 79… | 52… | Light gradient boosting machine | 245… | Extreme gradient boosting | 244… | Decision tree classifier | 228… |
IAV_neuraminidase | 8101 (2.39) | 21… | 120… | Logistic regression | 133… | Random forest classifier | 33… | Extra trees classifier | 20… |
IBV_neuraminidase | 1054 (0.31) | 32… | 137… | AdaBoost classifier | 162… | K neighbors classifier | 3167 (0.93) | Extra trees classifier | 39… |

(a) Number in parentheses is the relative percentage frequency.

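The consensus counts reported in the table above (compounds predicted active by three, two, or one of the top models) can be tallied along the following lines. This is a minimal sketch with a hypothetical function name; it assumes that each column counts compounds predicted active by exactly that number of models and that percentages are relative to the full VS data set.

```python
from collections import Counter


def consensus_counts(predictions_by_model):
    """Count compounds predicted active by exactly 3, 2, or 1 models.

    predictions_by_model: dict mapping model name -> list of 0/1 predictions,
    with all lists aligned on the same compounds.
    Returns {n_models: (count, percentage of all compounds)}.
    """
    # Number of models voting "active" for each compound.
    votes = [sum(column) for column in zip(*predictions_by_model.values())]
    tally = Counter(votes)
    n_compounds = len(votes)
    return {
        n: (tally.get(n, 0), round(100 * tally.get(n, 0) / n_compounds, 2))
        for n in (3, 2, 1)
    }
```

Compounds predicted active by all three models form the consensus set, which is the most conservative selection for follow-up screening.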
Range | IAV_polymerase (PA): Total | IAV_polymerase (PA): Best model (actives) | IAV_polymerase (PA): Consensus (actives) | SARS-CoV-2_Mpro: Total | SARS-CoV-2_Mpro: Best model (actives) | SARS-CoV-2_Mpro: Consensus (actives) |
---|---|---|---|---|---|---|
Out | 4855 | 310 | 125 | 1162 | 881 | 593 |
Q1 | 11 | - | - | 3 | 2 | 2 |
Q2 | 660 | 44 | 16 | 3 | 2 | 1 |
Q3 | 14… | 926 | 410 | 19… | 13… | 9622 |
Q4 | 319… | 21… | 8683 | 318… | 231… | 159… |

Range | HRV_protease: Total | HRV_protease: Best model (actives) | HRV_protease: Consensus (actives) | IAV_neuraminidase: Total | IAV_neuraminidase: Best model (actives) | IAV_neuraminidase: Consensus (actives) |
---|---|---|---|---|---|---|
Out | 2883 | 2569 | 1936 | 1162 | 476 | 35 |
Q1 | 404 | 371 | 268 | 3 | 1 | - |
Q2 | 3552 | 3224 | 2469 | 20 | 8 | 1 |
Q3 | 28… | 25… | 19… | 28… | 10… | 611 |
Q4 | 303… | 270… | 201… | 309… | 121… | 7454 |

Range | IAV_M2 proton channel: Total | IAV_M2 proton channel: Best model (actives) | IAV_M2 proton channel: Consensus (actives) | IBV_neuraminidase: Total | IBV_neuraminidase: Best model (actives) | IBV_neuraminidase: Consensus (actives) |
---|---|---|---|---|---|---|
Out | 5277 | 5086 | 3537 | 320… | 152… | 997 |
Q1 | 2879 | 2776 | 1981 | - | - | - |
Q2 | 119… | 115… | 79… | - | - | - |
Q3 | 149… | 144… | 99… | 2 | 1 | - |
Q4 | 61… | 59… | 41… | 18… | 9095 | 57 |

Range | SARS-CoV_Mpro: Total | SARS-CoV_Mpro: Best model (actives) | SARS-CoV_Mpro: Consensus (actives) |
---|---|---|---|
Out | 1506 | 210 | 55 |
Q1 | 83 | 12 | 2 |
Q2 | 11… | 1676 | 597 |
Q3 | 195… | 28… | 9526 |
Q4 | 130… | 19… | 6480 |

Range | IAV_polymerase (PA) | SARS-CoV-2_Mpro | ||||
---|---|---|---|---|---|---|
Total | Best model (actives) | Consensus (actives) | Total | Best model (actives) | Consensus (actives) | |
Out | 1360 | 95 | 34 | 85 | 56 | 44 |
Q1 | 78![]() |
5356 | 2298 | 102![]() |
74![]() |
51![]() |
Q2 | 83![]() |
5609 | 2276 | 54![]() |
39![]() |
27![]() |
Q3 | 97![]() |
6262 | 2614 | 85![]() |
62![]() |
42![]() |
Q4 | 78![]() |
4973 | 2012 | 95![]() |
69![]() |
47![]() |
Range | HRV_protease: Total | HRV_protease: Best model (actives) | HRV_protease: Consensus (actives) | IAV_neuraminidase: Total | IAV_neuraminidase: Best model (actives) | IAV_neuraminidase: Consensus (actives) |
---|---|---|---|---|---|---|
Out | 909 | 817 | 628 | 846 | 332 | 17 |
Q1 | 74 … | 66 … | 49 … | 58 … | 22 … | 1386 |
Q2 | 87 … | 77 … | 57 … | 86 … | 33 … | 2054 |
Q3 | 90 … | 80 … | 60 … | 101 … | 40 … | 2428 |
Q4 | 86 … | 76 … | 57 … | 91 … | 36 … | 2216 |
Range | IAV_M2 proton channel: Total | IAV_M2 proton channel: Best model (actives) | IAV_M2 proton channel: Consensus (actives) | IBV_neuraminidase: Total | IBV_neuraminidase: Best model (actives) | IBV_neuraminidase: Consensus (actives) |
---|---|---|---|---|---|---|
Out | 7 | 7 | 4 | 13 … | 6500 | 37 |
Q1 | 46 … | 44 … | 30 … | 84 … | 40 … | 262 |
Q2 | 56 … | 54 … | 37 … | 55 … | 26 … | 176 |
Q3 | 74 … | 72 … | 49 … | 87 … | 42 … | 257 |
Q4 | 161 … | 155 … | 107 … | 97 … | 46 … | 322 |
Range | SARS-CoV_Mpro: Total | SARS-CoV_Mpro: Best model (actives) | SARS-CoV_Mpro: Consensus (actives) |
---|---|---|---|
Out | 1709 | 274 | 77 |
Q1 | 66 … | 9898 | 3346 |
Q2 | 85 … | 12 … | 4255 |
Q3 | 90 … | 13 … | 4407 |
Q4 | 95 … | 14 … | 4575 |
Quartile | HRV_protease | IAV_M2 proton channel | IAV_neuraminidase | IAV_polymerase (PA) | IBV_neuraminidase | SARS-CoV_Mpro | SARS-CoV-2_Mpro |
---|---|---|---|---|---|---|---|
Q1 | 268 | 126 | — | — | — | 2 | 2 |
Q2 | 2469 | 4484 | 1 | 16 | — | 597 | 1 |
Q3 | 19 … | 5948 | 611 | 410 | — | 9526 | 9622 |
Q4 | 201 … | 2518 | 7454 | 8683 | 57 | 6480 | 159 … |
Out | 1934 | 220 | 35 | 125 | 997 | 55 | 592 |
In general, fewer compounds are labeled “Out” when the distance is calculated with physicochemical properties. This could be due to the preservation of drug-like properties across the commercial and focused libraries and the ChEMBL compounds, as observed in Section 3.3.1. IBV neuraminidase had the greatest number of compounds labeled “Out” for both distances. This is consistent with the MODI results, since IBV neuraminidase was the target with the lowest MODI value for fingerprints.
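As a minimal sketch of how the Out and Q1 to Q4 ranges reported in the tables above could be assigned, the Python snippet below bins a distance-to-model value against quartile cut-offs. It assumes that the cut-offs come from the training-set distances and that “Out” means a distance larger than any seen in training. Both choices, and the function names `quartile_cutoffs` and `assign_range`, are illustrative assumptions and not necessarily the procedure described in Section 3.6.

```python
from statistics import quantiles


def quartile_cutoffs(train_distances):
    """Return (q1, q2, q3, max) cut-offs from training-set distances.

    Assumption: quartiles are computed on the training distances and the
    training maximum marks the edge of the applicability domain.
    """
    q1, q2, q3 = quantiles(train_distances, n=4, method="inclusive")
    return q1, q2, q3, max(train_distances)


def assign_range(distance, cutoffs):
    """Map one distance-to-model value to 'Q1'..'Q4' or 'Out'."""
    q1, q2, q3, d_max = cutoffs
    if distance > d_max:
        return "Out"  # farther from the model than any training compound
    if distance <= q1:
        return "Q1"   # closest to the training data, highest confidence
    if distance <= q2:
        return "Q2"
    if distance <= q3:
        return "Q3"
    return "Q4"


# Toy distances, not taken from the study.
cutoffs = quartile_cutoffs([1, 2, 3, 4, 5])
labels = [assign_range(d, cutoffs) for d in (1.5, 2.5, 3.5, 4.5, 9.0)]
print(labels)
```

The same binning can be run once with a structural (fingerprint-based) distance and once with a physicochemical distance, giving the two range labels reported per compound.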
For all 339040 compounds in the VS data set (see Table 2), forty-one ADMET properties were calculated using ADMET-AI (see Methods Section 2.9 for further details). This detailed profiling is included in the structure file of the VS data set to facilitate a comprehensive evaluation of candidate molecules. Users are free to estimate the ADMET profile of the newly assembled and designed libraries with other tools (see Section 3.9).
To provide a comprehensive overview of the compounds and bring all analyses and predictions together, eight distinct libraries focused on respiratory viruses were designed: the VS data set with its predictions and seven subsets, one for each target. Each library is annotated with the following fields: Canonical SMILES, Murcko SMILES, identifier (ID), database (DB), number of repetitions, prediction label (for each model), prediction score (for each model), Quartile Pairsim (structural), Quartile (structural), Quartile Distance (physicochemical properties), and Quartile (physicochemical properties). The VS data set library additionally includes the ADMET profile.
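To make the annotation scheme concrete, the following Python sketch models one library record and writes it as CSV. The class `LibraryEntry`, the column names, the model names, and the example compound are hypothetical; the layout of the files actually distributed with the study may differ, and only the two quartile labels are included here for brevity.

```python
import csv
import io
from dataclasses import dataclass, field


@dataclass
class LibraryEntry:
    """One annotated compound (field names are illustrative assumptions)."""
    canonical_smiles: str
    murcko_smiles: str
    identifier: str
    database: str
    n_repetitions: int
    prediction_label: dict = field(default_factory=dict)  # model -> label
    prediction_score: dict = field(default_factory=dict)  # model -> score
    quartile_structural: str = ""
    quartile_physchem: str = ""


def to_row(entry, models):
    """Flatten an entry into one CSV row, with one label/score pair per model."""
    row = {
        "canonical_smiles": entry.canonical_smiles,
        "murcko_smiles": entry.murcko_smiles,
        "id": entry.identifier,
        "db": entry.database,
        "n_repetitions": entry.n_repetitions,
        "quartile_structural": entry.quartile_structural,
        "quartile_physchem": entry.quartile_physchem,
    }
    for model in models:
        row[f"label_{model}"] = entry.prediction_label.get(model, "")
        row[f"score_{model}"] = entry.prediction_score.get(model, "")
    return row


def write_library(entries, models):
    """Return the library as CSV text (empty string if there are no entries)."""
    rows = [to_row(e, models) for e in entries]
    if not rows:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical record: ethylbenzene, whose Murcko scaffold is benzene.
example = LibraryEntry(
    canonical_smiles="CCc1ccccc1",
    murcko_smiles="c1ccccc1",
    identifier="EX-0001",
    database="example_db",
    n_repetitions=1,
    prediction_label={"best": "active", "consensus": "inactive"},
    prediction_score={"best": 0.81, "consensus": 0.42},
    quartile_structural="Q1",
    quartile_physchem="Q2",
)
print(write_library([example], ["best", "consensus"]))
```

Keeping one label and one score column per model makes it straightforward to filter a library by, for example, consensus-predicted actives that fall in Q1.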
Fig. 7 shows the chemical structures of representative compounds included in the newly generated antiviral-focused libraries. Specifically, the figure highlights predicted active compounds for four selected antiviral targets with high predictive confidence (e.g., predictions within the first quartile (Q1) of the distance to model, as described in Section 3.6). Notably, the presence of nitrogen atoms across all illustrated compounds aligns with the findings discussed in Section 3.3.2, emphasizing their relevance to antiviral activity.
Fig. 7 Chemical structures of Q1 compounds (top five for targets HRV_protease and IAV_M2 proton channel) obtained with the Morgan Chiral fingerprint of radius 2 (2048 bits). The label below each structure represents the acronym of each library as stated in Table 2.
Fig. S8 in the ESI† illustrates the chemical space distribution of top predicted active compounds, specifically those with the highest prediction confidence (Q1, as detailed in Table 7), across the newly designed target-focused libraries. These visualizations, generated using t-SNE based on the Morgan Chiral of radius 2 (2048 bits) fingerprint, highlight the structural diversity of Q1 compounds.
Notably, certain libraries lack Q1 compounds entirely. Libraries such as HRV protease and IAV M2 proton channel exhibit broader structural diversity, as reflected by the dispersed distribution of their Q1 compounds, whereas the SARS-CoV-2 Mpro library shows clustering, indicating structural similarity among its predicted actives. This analysis underscores the variability in structural diversity across libraries and suggests that improvements in library design and model training could enhance the identification of high-confidence active compounds.
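The contrast between dispersed and clustered Q1 compounds can also be expressed numerically. The sketch below is one possible way to do so and is not an analysis performed in the study: it computes the mean pairwise Tanimoto distance over fingerprints represented as sets of on-bit indices. The function names and the toy fingerprints are assumptions for illustration only.

```python
from itertools import combinations


def tanimoto(bits_a, bits_b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    union = len(bits_a | bits_b)
    if union == 0:
        return 1.0  # convention chosen here: two empty fingerprints are identical
    return len(bits_a & bits_b) / union


def mean_pairwise_distance(fingerprints):
    """Average (1 - Tanimoto) over all pairs; higher means more diverse."""
    pairs = list(combinations(fingerprints, 2))
    if not pairs:
        return 0.0
    return sum(1.0 - tanimoto(a, b) for a, b in pairs) / len(pairs)


# Toy on-bit sets, not derived from real compounds.
dispersed = [{1, 2}, {3, 4}, {5, 6}]
clustered = [{1, 2, 3}, {1, 2, 3}, {1, 2, 3, 4}]
print(mean_pairwise_distance(dispersed), mean_pairwise_distance(clustered))
```

A library whose Q1 compounds form a tight cluster in the t-SNE map would be expected to give a lower value than one whose Q1 compounds are spread out, although t-SNE distances themselves are not directly comparable to Tanimoto distances.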
For the seven antiviral target data sets with high modelability, we developed ML predictive models, which showed improved MCC values after retraining. Among these, the AdaBoost model for IBV neuraminidase demonstrated the best performance. Overall, consensus ML models outperformed individual models, particularly for targets with larger and more balanced data sets. Compounds within the top confidence predictions from the seven newly designed antiviral-focused libraries represent strong candidates for further screening, including biological testing, which is the next experimental step of this study. All seven antiviral-focused libraries developed in this study are freely available at https://github.com/DIFACQUIM/antiviral_ML for the scientific community to select, acquire, and biologically test the chemical libraries. To facilitate the use of these libraries, each compound is annotated with confidence predictions and ADMET property profiles.
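For readers less familiar with the two quantities mentioned above, the following Python sketch shows a consensus prediction and the MCC computed from a confusion matrix. It assumes, for illustration, that the consensus is a simple majority vote over binary labels (1 = active, 0 = inactive); the consensus rule used in the study may differ. The labels are toy values and do not reproduce any reported result.

```python
from math import sqrt


def majority_vote(model_predictions):
    """Combine per-model binary predictions by strict majority.

    model_predictions: list of equally long label lists, one per model.
    """
    n_models = len(model_predictions)
    return [1 if 2 * sum(votes) > n_models else 0
            for votes in zip(*model_predictions)]


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (0.0 if undefined)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Toy data: three hypothetical models, each wrong on a different compound.
y_true = [1, 1, 0, 0, 1, 0]
model_a = [1, 1, 0, 0, 0, 0]
model_b = [1, 0, 0, 1, 1, 0]
model_c = [1, 1, 1, 0, 1, 0]

consensus = majority_vote([model_a, model_b, model_c])
print(mcc(y_true, model_a), mcc(y_true, consensus))
```

In this toy case the individual errors do not overlap, so the vote recovers every label. With real models the gain depends on how correlated the errors of the individual models are.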
ADASYN | Adaptive synthetic sampling |
ADMET | Absorption, distribution, metabolism, excretion, and toxicity |
AI | Artificial intelligence |
API | Application programming interface |
ARD | Acute respiratory disease |
AUC | Area under the curve |
BA | Balanced accuracy |
CADD | Computer-aided drug design |
Cl | Clearance |
CSP3 | Fraction of sp3 carbon atoms |
ECFP | Extended connectivity fingerprint |
FCoV | Feline coronavirus |
GTM | Generative topographic mapping |
HBA | H-bond acceptors |
HBD | H-bond donors |
HEV-71 | Enterovirus A71 |
HCoV-229E | Human coronavirus 229E |
HCoV-NL63 | Human coronavirus NL63 |
HIV | Human immunodeficiency virus |
HPIV-1 | Human parainfluenza virus 1 |
HRV | Human rhinovirus |
HRSV | Human respiratory syncytial virus |
IAV | Influenza A virus |
IBV | Influenza B virus |
JAK | Janus kinase |
log P | Partition coefficient octanol/water |
MACCS | Molecular ACCess system |
MCC | Matthews correlation coefficient |
MERS | Middle East respiratory syndrome |
MERS-CoV | Middle East respiratory syndrome coronavirus |
ML | Machine learning |
MODI | Modelability index |
MOE | Molecular operating environment |
Mpro | Main protease |
MW | Molecular weight |
NiV | Henipavirus nipahense |
NPs | Natural products |
PD | Pharmacodynamics |
PK | Pharmacokinetics |
PLP | Papain-like protease |
PHEIC | Public health emergency of international concern |
R&D | Research and development |
RDKit | Rational discovery kit |
RdRp | RNA-dependent RNA polymerase |
RSV | Respiratory syncytial virus |
RVs | Rhinoviruses |
SARS | Severe acute respiratory syndrome |
SARS-CoV | Severe acute respiratory syndrome coronavirus |
SARS-CoV-2 | Severe acute respiratory syndrome coronavirus 2 |
SMILES | Simplified molecular input line entry system |
TPSA | Topological polar surface area |
t-SNE | t-Distributed stochastic neighbor embedding |
VD | Volume of distribution |
VS | Virtual screening |
WHO | World Health Organization |
Footnote |
† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d5dd00037h |
This journal is © The Royal Society of Chemistry 2025 |