Latin American databases of natural products: biodiversity and drug discovery against SARS-CoV-2

In this study, we evaluated 3444 Latin American natural products using cheminformatic tools. We also characterized, for the first time, 196 compounds from the flora of El Salvador, which were compared with the databases of secondary metabolites from Brazil, Mexico, and Panama, and with 42 969 compounds (natural, semi-synthetic, synthetic) from different regions of the world. The overall analysis was performed using drug-likeness properties, molecular fingerprints of different designs, two similarity parameters, molecular scaffolds, and molecular complexity metrics. In general, Salvadoran natural products were found to have a large fingerprint-based diversity, while those from Mexico and Panama present the greatest scaffold diversity compared with the other databases. This study provides evidence of the high structural complexity of Latin American natural products as a benchmark. The COVID-19 pandemic has had a negative effect at a global level. Thus, in the search for substances that may influence the coronavirus life cycle, the secondary metabolites from El Salvador and Panama were evaluated by docking against the endoribonuclease NSP-15, an enzyme involved in SARS-CoV-2 viral replication. We propose three natural products as potential inhibitors of NSP-15.


Introduction
Natural products (NPs) and their derivatives continue to be an important source of chemical entities for the design and development of drugs, with a total of 355 molecules approved for clinical use between 1981 and 2019, which places them at an intermediate point between biological products (262). 1 Publicly available NP databases, which contain metabolites of plants, fungi, marine organisms, and bacteria, have been used in discovery processes and in drug design. 2 These databases contain a variety of chemical types with various pharmacological activities. 3 Some natural products from these databases have been evaluated by chemo- and bioinformatic methods, with the conclusion that they have large structural and scaffold diversities. 4 The Latin American databases of natural products (LATAM_DBS_NPs) have been evaluated using chemoinformatic and bioinformatic tools. Such analysis allowed us to create and manipulate molecules by representing and visualizing the chemical structures of small molecules. Furthermore, we calculated physicochemical properties based on chemical descriptors, chemical space, fingerprints, similarity studies, the content of cyclic systems, and fragments of aliphatic and alicyclic skeletons applied to the natural product databases. [5][6][7][8][9][10][11][12][13][14] The increasing use of cheminformatics in NP research has led to the sub-discipline "natural product informatics." 15 In this study we analyzed and compared four databases of Latin American NPs (LAIPNUDELSAV, UPMA_2V, BIOFACQUIM_2V, and NUBBE_2V) and eleven databases of natural products freely available at http://zinc12.docking.org/browse/catalogs/.
These include the following datasets: AfroDb NPs; Herbal ingredients in vivo metabolism database (HIM); Herbal Ingredients' Targets (HIT); ICC Indofine NPs DBs (Indofine Chemical Company); Naturally occurring Plant based Anticancerous Compound-Activity-Target Database (NPACT Database); AnalytiCon Discovery NP (ACDISC_NP); Specs Natural Products; The Nuclei of Bioassays, Ecophysiology and Biosynthesis of Natural Products Database (NuBBE DB 2013 and 2017); Inter-BioScreen Ltd (IBScreen NP); Chemical Entities of Biological Interest (ChEBI); and Princeton BioMolecular Research. This is the first systematic cheminformatic and bioinformatic study of LATAM_DBS_NPs against the NSP-15 endoribonuclease of SARS-CoV-2. The acronym SARS refers to severe acute respiratory syndrome, and CoV-2 refers to coronavirus type 2.
NPs have been found to be highly potent in blocking enzyme function and membrane receptors of human coronavirus. 16 In addition, because the evolution of COVID-19 features uncontrolled inflammation, anti-inflammatory herbal compounds could be a potential tool to repress such fatal symptoms. 17 Nsp-15 is a hexameric endoribonuclease that cleaves RNA at uridine residues. This protein is homologous to the Nsp15 endoribonucleases of SARS-CoV and MERS-CoV but differs from them in that it can contribute to the increased virulence of SARS-CoV-2. 18

Databases of natural products
LATAM_DBS_NPs and REF_NPs were analyzed for their physicochemical properties, chemical space, scaffold content, and molecular complexity. The LATAM_DBS_NPs comprise two datasets prepared in house and two NP databases published in journals: LAIPNUDELSAV, UPMA_2V, BIOFACQUIM_2V, and NUBBE_2V. Table 2 shows the description and sources of the databases used in this work.

Molecular ngerprints
The molecular ngerprints have been used as indicators of structural diversity in different data sets. [33][34][35] Two ngerprint keys were calculated: Extended Connectivity Fingerprints of Radius six (ECFP-6) and Molecular Access System (MACCS) Key (166-bits) with the Python script. 36 The correlation of similarity pairs was calculated with the correlation-like-similarity index: Tanimoto coefficient and cosine-like similarity index were analyzed with a cumulative distribution function (CDF).

Molecular scaffolds diversity
The Bemis-Murcko scaffolds were calculated with Molecular Equivalence Indices (MEQI) [37][38][39] and DataWarrior. 31 MEQI was used to obtain the cyclic systems of the compounds in the different NP databases in order to assess their scaffold content and diversity. [40][41][42][43] The scaffold distribution of each NP dataset was explored, evaluated, and plotted in RStudio 44,45 as cyclic system retrieval (CSR) curves, which indicate the scaffold diversity of the different databases. [7][8][9][10][11][12][13][14] These curves were plotted with the fraction of scaffolds on the x-axis and the fraction of compounds on the y-axis. The CSR curves provide information on the scaffold diversity of all the data sets.
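As an illustration of how a CSR curve and its summary metrics can be derived, the following sketch ranks scaffolds by frequency and accumulates the fraction of compounds they cover. The chemotype labels are hypothetical, and F50 is taken here as the first curve point at or above 50% of the compounds, without interpolation, which is an assumption of this example rather than the procedure used in the study.

```python
from collections import Counter


def csr_curve(scaffold_ids):
    """Return CSR points (fraction of scaffolds, fraction of compounds).

    Scaffolds are ranked from most to least populated, as is usual for
    cyclic system retrieval curves.
    """
    counts = sorted(Counter(scaffold_ids).values(), reverse=True)
    n_scaffolds, n_compounds = len(counts), len(scaffold_ids)
    points = [(0.0, 0.0)]
    cumulative = 0
    for rank, count in enumerate(counts, start=1):
        cumulative += count
        points.append((rank / n_scaffolds, cumulative / n_compounds))
    return points


def auc(points):
    """Area under the CSR curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


def f50(points):
    """Smallest fraction of scaffolds that covers at least 50% of compounds."""
    return next(x for x, y in points if y >= 0.5)


def singleton_count(scaffold_ids):
    """Number of scaffolds represented by exactly one compound."""
    return sum(1 for c in Counter(scaffold_ids).values() if c == 1)


# Hypothetical data set: 10 compounds spread over 4 chemotypes.
toy_scaffolds = ["A"] * 5 + ["B"] * 3 + ["C", "D"]
curve = csr_curve(toy_scaffolds)
print(curve)
print(auc(curve), f50(curve), singleton_count(toy_scaffolds))
```

An AUC close to 0.5 (the diagonal) and a large F50 indicate compounds spread evenly over many scaffolds, whereas an AUC close to 1 indicates that a few scaffolds dominate the data set.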

Molecular complexity
The molecular complexity of the NP datasets was explored and calculated using the following descriptors: fraction of sp3-hybridized carbons (Fsp3), fraction of chiral carbons (FCC), fraction of aromatic atoms (Fa_Aro), fraction of aromatic bonds (Far_b), molecular flexibility, shape index, and molecular globularity. 11,14,46,47

Molecular docking simulation
Preparation of the structure of endoribonuclease NSP-15. The endoribonuclease NSP-15 (PDB ID: 6WXC) was obtained from the Protein Data Bank. 48 To prepare the protein, we first eliminated the water molecules and then checked and corrected possible errors in the primary amino acid sequence using the sequence editor window in MOE. Then, using the QuickPrep option, hydrogens were added, the protein-ligand complex was corrected, and protons were added to the ligand and the protein in the 3D model; we selected the MMFF94x force field, assigned the partial charges, and generated the minimum-energy conformation. The optimized protein was saved in dock_moe format. The binding site was checked with the Site Finder option, which lists the possible ligand binding sites of the protein by size; the first four rows contain the amino acid residues that interact with the co-crystallized ligand. This step is important because other sites, where no compound is bound, can also be selected for the docking simulation. Finally, we isolated the atoms and the region close to the selected site in the protein, reviewed the structure of the ligand in CPK format, and generated the surface of the binding site. 49

Preparation of natural product databases. The preparation of NPs was carried out with the module "preparing of small molecule dataset".
The database was created in Excel format and later converted into a CSV file. This was imported into the MOE window, where we checked whether the structures were correctly drawn in their 2D model. Then, we saved it in a format with the moe extension. This constituted the working database, in mdb format, for the chemoinformatic analysis.
To curate the databases, we removed duplicate compounds, washed the structures to remove protonated forms of acidic and basic groups, disconnected any metals present, reviewed the partial charges, and finished by energy-minimizing the structures and generating 3D conformations for the subsequent analyses. 49

Molecular docking protocol. The molecular docking simulation protocol was performed using the following procedure: first, we opened the protein in moe format and then the Dock panel in the MOE window. Here, we selected the MOE option with the parameter Receptor: Receptor Atoms. In the Site box, we chose Ligand Atoms and the NP databases in mdb format.
In the tab that sets the method and scoring function, we chose Triangle Matcher as the placement method, with the London dG score and thirty poses, while for refinement we selected a rigid receptor and the GBVI/WSA scoring function with five poses. The force field used was MMFF94x. 49,50

Results and discussion

Chemical space
The chemical space of 46 413 unique compounds was visualized using PCA of physicochemical property descriptors. The visualization shows that LATAM_DBS_NPs and REF_NPs occupy similar chemical spaces. Fig. 1 and 2 show a visual representation of the chemical space of 3440 compounds of LATAM_DBS_NPs. The variance contributed by each of the six physicochemical properties analyzed by PCA is presented in the ESI. † These data indicate that the partition coefficient (SlogP) and TPSA have the greatest effect on the variance, with values of 66.69% and 22.39%, respectively, while the other physicochemical descriptors have a much lower influence, with a value of 6.27%.

Fingerprint-based diversity
The molecular diversity of LATAM_DBS_NPs and REF_NPs was calculated using the MACCS and ECFP-6 fingerprints (FP) and the Tanimoto and Cosine similarity indices. In the diversity plots, the x-axis shows the pairwise similarity values based on ECFP-6/Tanimoto, ECFP-6/Cosine, MACCS/Tanimoto, or MACCS/Cosine, while the y-axis shows the CDF values for each database evaluated. The CDF obtained with Tanimoto/ECFP-6 indicated that the LAIPNUDELSAV and PRINCETON_NP datasets are the most diverse, while with Cosine/ECFP-6 the INDOFINE_NP, LAIPNUDELSAV, PRINCETON_NP and UPMA_2V databases showed the greatest fingerprint-based diversity (ESI †).
The analysis of MACCS ngerprint and Tanimoto similarity indicated that LAIPNUDELSAV, PRINCETON_NP and INDOFI-NE_NP are more diverse and HIT NP and CHEBI are the least diverse; meanwhile, the metric MACSS/Cosine similarity index, HIMNP indicated greater diversity as the database reference while LATAM_DBS_NPs and REF_NP showed median values greater than 0.50, except for NUBBE_2V, SPECSNP and UPMANP2V which were less diverse, according to MACCS keys/ Cosine similarity index. Tables 3 and 4 show summary statistics of the pairwise similarity computed with MACCS/Tanimoto and MACCS/Cosine.

Scaffold diversity
The scaffold diversity of the Latin American NPs was assessed using the Murcko scaffold and the Murcko skeleton, obtained with RStudio. 45,46 The Murcko scaffold contains all plain ring systems of a given molecule, plus all direct connections between them; substituents that do not contain ring systems are removed from the rings and from the ring-connecting chains. The Murcko skeleton is a generalized Murcko scaffold in which all heteroatoms are replaced with carbon atoms (Fig. 3). 31 Table 5 summarizes the results of the scaffold diversity of the fifteen databases. This table reports the number and fraction of scaffolds, the number and fraction of scaffolds containing only one compound (singletons), and the metrics obtained from the CSR curves (AUC and F50). According to these metrics, the databases NPACT_NP, AFRODB, BIOFACQUIM_2V, and UPMA_2V have the largest fractions of scaffolds (FN/M), with values equal to or greater than 0.46. Fig. 4(a) shows the Murcko scaffold (ring system) metrics of the fifteen databases as CSR curves; the corresponding F50 (the fraction of chemotypes that contains 50% of the data set) and area under the curve (AUC) are reported in Table 5. These curves indicate that the Latin American datasets UPMA_2V and NUBBE_1V are those with the greatest structural diversities.

Meanwhile, Table 6 summarizes the diversity of the Murcko skeleton (unskeleton cyclic system) of the fifteen databases. This table presents the number and fraction of unskeleton cyclic systems (Murcko skeletons), the number and fraction of skeletons containing only one compound (singletons), and the metrics obtained from the CSR curves (AUC: area under the curve; F50: fraction of chemotypes that contains 50% of the data set).

Scaffold content in Latin American databases of natural products
The scaffold content of LATAM_NPs shows great structural variety. The chemotypes DY5K9, 8A6GX, A5VEV and 0857T are found exclusively in LAIPNUDELSAV_DBs. They possess the pentacyclic triterpene skeleton, as shown in Fig. 5. In contrast, Fig. 6 shows some of the similar scaffolds present in LATAM_DBs: YSB4M and 1X4VP, which have a benzopyran-4-one core. These skeletons are also contained in other chemotypes such as RPJBH, 63RBH, and KHXQM, which have a low frequency in BIOFACQUIM and UPMA_2V, respectively. Other chemotypes present are the benzopyran-2-ones identified by the codes Q874P, 3P6AH and X02HF, which are shown in Fig. 6. These nuclei are present in CT1G5, QGHLF and PLMLM, identified in the databases BIOFACQUIM, NUBBE_2V and UPMA_2V. This group constitutes the most abundant compounds in LATAM_NPs, which are shown in Fig. 7. The Murcko skeletons included in the four Latin American natural product databases are similar to the nuclei shown in Fig. 5 and 6. However, the β-agarofuran skeletons X5FZP, 5G7H1, 1KZK3, LXFSH and ACLIM represent the second group of compounds in LAIPNUDELSAV DBs. Fig. 8 shows two β-agarofurans present at high frequency in the databases of the Natural Products Laboratory of the University of El Salvador; their high structural complexity is evident.

Molecular complexity
The molecular complexity of LATAM_DBS_NPs and the eleven reference databases was calculated by means of five metrics of molecular properties: fraction of sp3-hybridized carbons (Fsp3), fraction of chiral carbons (FCC), fraction of aromatic atoms (Fa_Aro), fraction of aromatic bonds (Far_b), molecular globularity (glob_m), molecular flexibility and shape index. Fig. 9 and 10 show box plots for the distributions of Fa_aro, F_sp3, Fb_aro, and FCC. The analysis of the molecular flexibility and the shape index of LATAM_DBS_NPs and REF_NP suggests that the compounds have high structural rigidity and a spherical (non-linear) shape, since they present values less than 0.50 for most of the analyzed databases (ESI †).
Molecular globularity is not a good metric of complexity because it does not differentiate the data sets analyzed in this work (ESI †).
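Two of the complexity descriptors above reduce to simple ratios, which the following sketch illustrates. It assumes that both Fsp3 and FCC are normalized by the total number of carbon atoms; the atom counts are entered by hand for two well-known molecules (menthol, C10H20O, and toluene, C7H8) and are not taken from the databases analyzed here.

```python
def fraction(part: int, whole: int) -> float:
    """Ratio part / whole, guarding against molecules without carbons."""
    if whole <= 0:
        raise ValueError("molecule must contain at least one carbon atom")
    return part / whole


def fsp3(n_sp3_carbons: int, n_carbons: int) -> float:
    """Fraction of sp3-hybridized carbons: sp3 carbons / total carbons."""
    return fraction(n_sp3_carbons, n_carbons)


def fcc(n_chiral_carbons: int, n_carbons: int) -> float:
    """Fraction of chiral carbons: stereogenic carbons / total carbons
    (normalization by total carbons is an assumption of this sketch)."""
    return fraction(n_chiral_carbons, n_carbons)


# Hand-counted atoms: menthol has 10 carbons, all sp3, 3 of them stereogenic;
# toluene has 7 carbons, of which only the methyl carbon is sp3.
menthol = {"carbons": 10, "sp3_carbons": 10, "chiral_carbons": 3}
toluene = {"carbons": 7, "sp3_carbons": 1, "chiral_carbons": 0}

for name, mol in (("menthol", menthol), ("toluene", toluene)):
    print(name,
          fsp3(mol["sp3_carbons"], mol["carbons"]),
          fcc(mol["chiral_carbons"], mol["carbons"]))
```

Higher values of both ratios are generally read as higher three-dimensional complexity, which is why terpenoid-rich data sets tend to score above aromatic-rich ones.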

Molecular docking
Protein selection. SARS-CoV-2 is a virus that has four structural proteins and sixteen nonstructural proteins, which are essential for its replication cycle. The structural proteins are spike, membrane, nucleocapsid and envelope. The nonstructural proteins (NSP) of the CoV run from NSP-1 to NSP-16 and have the functions of cutting, splitting and joining RNA. NSP-5 is the 3C-like protease (3CLpro, also known as Mpro), while NSP-9 is an RNA replicase. Meanwhile, NSP-12 is an RNA-dependent RNA polymerase, NSP-13 is a helicase, NSP-14 is an exonuclease, NSP-15 is an endoribonuclease and NSP-16 is a 2′-O-methyltransferase. 51 NSP-15 is a uridylate-specific endoribonuclease (EndoU), indispensable for the viral cycle of the coronavirus. It facilitates the breakdown of RNA at the ends of the uridylates. The loss of NSP-15 directly impacts RNA replication processes and pathogenesis. Therefore, NSP-15 constitutes a potential therapeutic target for the development of inhibitors of endoribonuclease-dependent viral replication. 52 NSP-15 (endoribonuclease) was selected for this work since there is little research on this target of the coronavirus (SARS-CoV-2).
The structure of NSP-15 (endoribonuclease) in complex with tipiracil was obtained from the Protein Data Bank. This viral protein of the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is deposited under PDB ID 6WXC and corresponds to the uridylate-specific endoribonuclease with chains A and B. 53

Validation of the molecular docking protocol with endoribonuclease NSP-15. The docking protocol was validated by redocking: we used the viral protein (PDB ID 6WXC) with the structure of the co-crystallized compound (tipiracil = CMU: 5-chloro-6-(1-(2-iminopyrrolidinyl)methyl)uracil). This complex was subjected to a docking simulation in induced-fit mode. In this study, we observed good reproducibility of the co-crystallized inhibitor conformation, with an RMSD value of 2.8021 and a relative docking score for the ligand (CMU) of −5.4824. The binding pocket was defined as the set of amino acids within 1.85 Å of the co-crystallized ligand CMU. This active site of the endoribonuclease is shown as a 2D model in Fig. 11 (PDB ID 6WXC); it comprises the residues Gln245, Ser294, and Lys345, which form four hydrogen bonds with CMU, and Tyr343, which forms one pi-pi interaction with it. The calculated binding energy is 27.2 kcal mol−1 (MOE Dock Panel).
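The RMSD used to judge a redocking run can be written down in a few lines. The sketch below is a generic illustration, not the MOE implementation: it assumes that the reference and docked poses list the same atoms in the same order, performs no superposition or symmetry correction, and uses hypothetical coordinates.

```python
from math import sqrt


def rmsd(reference, pose) -> float:
    """Root-mean-square deviation between two matched sets of 3D coordinates.

    Both arguments are sequences of (x, y, z) tuples in the same atom order.
    """
    if not reference or len(reference) != len(pose):
        raise ValueError("poses must be non-empty and have the same atom count")
    squared = sum((a - b) ** 2
                  for atom_ref, atom_pose in zip(reference, pose)
                  for a, b in zip(atom_ref, atom_pose))
    return sqrt(squared / len(reference))


# Hypothetical two-atom ligand: the docked pose is shifted by 1 unit along z.
crystal_pose = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
docked_pose = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
print(rmsd(crystal_pose, docked_pose))
```

The lower the RMSD between the redocked and the crystallographic ligand, the better the protocol reproduces the experimental binding mode.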
Simulation of docking. For the docking analysis, we selected the endoribonuclease NSP-15 (viral protein of SARS-CoV-2) and evaluated 130 compounds of the LAIPNUDELSAV DB and 352 compounds of the UPMA_2V DB. The LAIPNUDELSAV and UPMA_2V databases were selected on the basis of the cheminformatic analysis, which indicates that these two databases are the most diverse based on fingerprints and the Tanimoto and Cosine similarity indices. Further, this group of compounds was selected using the following restriction criteria: scaffold diversity, molecular complexity, and Lipinski rules.
These data indicate that LAIPNUDELSAV and UPMA_2V are the two most diverse databases, have the largest fractions of Murcko scaffolds and Murcko skeletons, and contain highly complex ring systems. These 482 compounds were selected using, as the inclusion criterion, a tolerance of one violation of the Lipinski rules. Fig. 12 shows the comparison of drug-like properties and the violations of Lipinski's rule for the fifteen databases of natural products evaluated, while Fig. 13 illustrates the distribution, by number of compounds, of those that meet and break the Lipinski rule in all databases of natural products (Fig. 14). Fig. 15 and 16 visualize the binding mode of the ligands LAIPNUDELSAV_029, LAIPNUDELSAV_031 and UPMA_2V_0266 in the active site of endoribonuclease NSP-15 in 2D and 3D models.
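The Lipinski-based inclusion criterion can be sketched as a simple filter. The thresholds are the standard rule-of-five limits; reading the criterion as "at most one violation" is an assumption of this example, and the descriptor values of the two entries are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Descriptors:
    """Precomputed drug-likeness descriptors of one compound."""
    mol_weight: float   # molecular weight in Da
    logp: float         # octanol-water partition coefficient
    h_donors: int       # hydrogen-bond donors
    h_acceptors: int    # hydrogen-bond acceptors


def lipinski_violations(d: Descriptors) -> int:
    """Count how many of the four rule-of-five limits are exceeded."""
    return sum([
        d.mol_weight > 500,
        d.logp > 5,
        d.h_donors > 5,
        d.h_acceptors > 10,
    ])


def passes_filter(d: Descriptors, max_violations: int = 1) -> bool:
    """Keep a compound if it breaks no more than max_violations limits."""
    return lipinski_violations(d) <= max_violations


# Hypothetical entries, not values taken from the databases in this study.
candidate_a = Descriptors(mol_weight=520.5, logp=2.1, h_donors=4, h_acceptors=9)
candidate_b = Descriptors(mol_weight=650.0, logp=6.2, h_donors=3, h_acceptors=8)

for name, cand in (("candidate_a", candidate_a), ("candidate_b", candidate_b)):
    print(name, lipinski_violations(cand), passes_filter(cand))
```

Allowing a single violation keeps glycosides and other moderately large natural products in the docking set while still excluding compounds that are far outside drug-like space.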
LAIPNUDELSAV_029 and LAIPNUDELSAV_031 correspond to 4-phenylcoumarins with O-glycosidic bonds between C-7 and the hydroxyl at C-1′ of the sugar residues, 54 while UPMA_2V_0266 is a derivative of caffeic acid. 55 These three compounds share with the co-crystallized ligand (CMU) the hydrogen-bond interactions with residues Ser294A and Lys345A, while we observed additional interactions of these natural products with residues Val292, Leu346 and His250.
Phenylcoumarins and rosmarinic acid interact in a similar way at the binding site of the NSP-15 protein but occupy a greater volume at this site. Therefore, we propose that these substances could act as inhibitors of this viral protein, although this proposal must be validated in subsequent studies. Table 11 shows the functional groups of the three ligands that interact with amino acid residues in the active site of NSP-15. In addition, it includes the docking score, RMSD, and binding energy values obtained from the molecular docking simulation with the MOE v2019.01 software. 49 These skeletons have reported biological activity against HIV and tuberculosis, as well as antioxidant activity.
The docking study has allowed the identification of two 4-phenylcoumarins and a catechol derivative as potential inhibitors of the endoribonuclease NSP-15 of SARS-CoV-2. A variety of NPs (alkaloids, catechol derivatives, benzofurans, benzopyrans, polyphenols) 56 and 4-methyl-coumarin derivatives 57 have been reported as inhibitors of structural and nonstructural proteins of SARS-CoV-2.

Conclusions
In this cheminformatics study, we analysed 46 413 compounds from four Latin American NP databases and eleven published NP databases used as references. These NPs occupy similar chemical spaces and therefore share properties of interest for NP bioprospecting aimed at the discovery and development of lead molecules with therapeutic potential.
The databases of the University of El Salvador and the University of Panama present the greatest structural diversities based on fingerprints. However, when comparing the chemotype relationship, the NPs of Africa, Mexico, and Panama have the highest scaffold diversity. The University of El Salvador database, herbal ingredients active targets, and ICC Indofine present the greatest diversity in Murcko skeleton content based on F50. In the scaffold content of three LATAM_NPs, compounds of the benzopyran-one type predominated, with modification at positions 2 and 4 of this cyclic system and with antiradical and antioxidant effects, while pentacyclic triterpenes and β-agarofurans prevailed in LAIPNUDELSAV. This last group of compounds has a reported anti-inflammatory effect through inhibition of NOS.
From the analysis of molecular complexity, it is concluded that the compounds of the University of Panama are the ones with the highest aromaticity and structural rigidity. The opposite occurs with NUBBE_1V_2013, which has a large fraction of sp3 carbons, while the databases of the University of El Salvador, Mexico, and Panama have fewer carbons of this type; additionally, LAIPNUDELSAV contains the highest number of chiral carbons. The comparison of the molecular flexibility and the shape index of the Latin American NPs and the reference NPs shows that all evaluated compounds have high structural rigidity.

Author contributions
M. J. N. participated in the conceptualization, design, and creation of the databases of the University of El Salvador. He was also actively involved in the analysis of results and in the writing, reviewing, and editing of the manuscript. B. I. D. E. T. was involved in data curation, computational analysis, the design of computer programs and supporting algorithms, and validation of the results.