Open chemoinformatic resources to explore the structure, properties and chemical space of molecules

New technologies are shaping the way drug discovery data is analyzed and shared. Open data initiatives and web servers are assisting the analysis of the large amounts of data that we are now able to produce. The final goal is to accelerate the process of moving from new data to useful information that could lead to treatments for human diseases. This review discusses open chemoinformatic resources to analyze the diversity and coverage of the chemical space of screening libraries and to explore structure-activity relationships of screening data sets. Free resources to implement workflows and representative web-based applications are emphasized. Future directions in this field are also discussed.


Introduction
During the past few years, there has been a marked increase in open data initiatives that promote the availability of free research-based tools and information. 1 While there is still some resistance to open data in some chemistry and drug discovery fields, the availability of information has been a necessity for other research fields such as genomics, proteomics and bioinformatics. The Human Genome Project was paramount to the open-source movement in proteomics and genomics, demonstrating that a global community can be more successful and efficient in analyzing data than a single individual. 2 Computer-aided drug discovery has a large impact on the pharmaceutical industry by helping to reduce the time and cost of the drug development process. However, researchers from the pharmaceutical and medicinal chemistry fields often lack training in informatics. The creation of free and easy-to-use chemoinformatic tools for drug development will spare investigators from having to acquire programming and development skills in the already complex and multidisciplinary field of drug discovery. At the same time, these resources will help research teams focus on solving problems that are specific to their fields of expertise. In this context, chemoinformatics plays an important role in mining the chemical space of the almost infinite number of organic drug-like molecules available for drug discovery. The outcome allows researchers to find connections between biological activities, ligands and proteins. 3 Herein we review representative chemoinformatic tools essential to explore the structure, chemical space and properties of molecules. The review is focused on recent and representative free web-based applications. We also discuss KNIME as an open resource broadly used in chemoinformatics for the automatization of data analysis. The review is organized in eight major sections.
Aer this introduction, open sources of chemical biology data are discussed. Section 3 discusses online servers for the generation of molecular properties, diversity analysis, and visualization of the chemical space. The next section focuses on web-based application to predict ADME and toxicity properties, which are essential in drug discovery programs. Section 5 presents online applications to analyze structure-activity relationships (SAR) and structure-multiple activity relationships (SmAR). The section aer that discusses web-servers aim to assist drug discovery and development efforts focused on a particular disease or target family. Section 7 covers open resources to implement workows for data analysis. In contrast to most web-based applications discussed in Sections 3-6, the workows presented in Section 7 can be highly customizable by the user. The last section presents Conclusions and future directions.

Open chemical biology data
Essential to medicinal chemistry and drug discovery is the ability to generate and retrieve relevant experimental data for screened compounds. Relevant experimental data implies curated information of sufficient quality for later SAR analysis. There is a large and still growing number of molecules with bioactivity data available in the public domain; representative resources are summarized in Table 1. PharmGKB, for example, has over 5000 genetic variant annotations, with over 900 genes related to drugs and over 600 drugs related to genes. PharmGKB captures pharmacogenomic relationships in a structured format so that they can be searched, interrelated, and displayed according to the researchers' interests. The knowledge base is valuable both to the researcher who is interested in a specific single nucleotide polymorphism and its influence on a particular drug treatment and to the researcher interested in a disease or drug and looking for candidate genes that may affect disease progression or drug response. Of note, although the availability of these data is important to build new models and make in silico predictions, the data and content in these databases are rather heterogeneous.
Perhaps the most common and widely used databases are ChEMBL, which contains 1.6 million distinct compounds and 14 million activity values, 4 PubChem, 5,6 with more than 93 million compounds and more than 233 million bioactivities, and Binding Database, with 490k small molecules and 1.1 million measured protein-small molecule affinities. 7 Other resources include CARLSBAD, a bioactivity database with 435 343 compounds and 932 852 bioactivities. The advantage of CARLSBAD is that only one activity value of a given type (Ki, EC50, etc.) is stored for a given structure-target pair. 8 ExCAPE-DB is a comprehensive chemogenomics dataset with 998 131 compounds and 70 850 163 biological activity data points. 9 BRENDA is an information system of enzyme and enzyme-ligand data obtained from different sources; functional and structural data for more than 190 000 enzyme ligands are stored within this system. 10 Knowledge of bioactivity can help to identify potential targets for a specific molecule.
DrugCentral is a database that integrates structure, bioactivity, regulatory, pharmacologic action and indication data for active pharmaceutical ingredients approved by the FDA and other regulatory agencies. 11 The Probes and Drugs portal is a public resource that brings together focused libraries of bioactive compounds (877 probes and 12 190 drugs) with commercially available screening libraries. The rationale behind it is to reflect the current state of bioactive compound space and to enable its exploration from different points of view. 12 Finding new uses for old drugs could be economically advantageous; therefore, the development of databases such as DrugCentral and Probes and Drugs will be beneficial for polypharmacology. 16

Online servers for exploring chemical space
The concept of chemical space can be understood, in a simplistic manner, as the number of possible molecules to be considered when searching for new drugs. Knowledge and understanding of this space is of great relevance in drug discovery, and several approaches to its analysis have been reported extensively by many authors. [17][18][19] The chemical space can be divided into two main groups: the known chemical space, which comprises the organic molecules reported so far and is mostly covered by the resources discussed in the previous section, and the unknown chemical space, which is larger than the first group by tens of orders of magnitude and refers to molecules that have never been synthesized. Several advances and applications in the enumeration of those virtual molecules are discussed in other works. 20,21 One of the central points of the concept of chemical space is molecular representation, i.e., the set of descriptors used to define the space of the chemicals that will be analyzed. A second major point is the visual representation and mining of that space, e.g., analysis of diversity and coverage. These aspects are important to consider when analyzing and interpreting data, because distinct approaches may lead to representations that in most cases are not comparable to each other, and the best one is usually determined by the nature of the data analyzed. Web servers to explore chemical space usually incorporate one or more of the following operations: calculation of descriptors, visualization, and diversity analysis. Table 2 summarizes recent online servers for generating and mining the chemical space of compound databases using different approaches. Representative servers are discussed further in this section.
ChemMine is an online portal with five main application domains: compound visualization, similarity quantification, a search toolbox to retrieve similar compounds from PubChem, clustering, data visualization and molecular property calculation. 22 ChemBioServer is a free web-based tool that can aid researchers in compound filtering and clustering. Compounds that survive the filtering process can be visualized using molecular properties and principal component analysis. 23 ChemDes is a free web-based platform for the calculation of molecular descriptors and fingerprints. It contains more than 3679 molecular descriptors that are divided into 61 logical blocks. In addition, ChemDes provides 59 types of molecular fingerprint systems. 26 BioTriangle can calculate a large number of molecular descriptors of individual molecules, structural and physicochemical features of proteins and peptides from their amino acid sequences, and composition and physicochemical features of DNAs/RNAs from their primary sequences. 25 FAF-Drugs3, now FAF-Drugs4, is a web server that applies an enhanced structure curation procedure that filters compounds based on physicochemical properties, ADMET rules and generally unwanted molecules, also known as pan assay interference compounds (PAINS). 24 This server can be used to generate and analyze ADMET-relevant chemical spaces. 19 The visualization of the chemical space of molecular databases has proved relevant for assessing molecular diversity and biological properties. webMolCS is a web-based interface to visualize sets of user-defined molecules in 3D chemical spaces, using different molecular fingerprints and selecting subsets. 27 Visualization of the chemical space can give a good idea of how diverse the datasets are. However, since the diversity criteria depend on the molecular representation employed, a tool to compute different diversity metrics would be useful to researchers with different backgrounds.
Platform for Unied Molecular Analysis (PUMA) is a web server developed to visualize the chemical space and measure the molecular properties and structural diversity of datasets.
PUMA addresses the issue of the dependence of chemical space on structure representation. In this server the user can analyze a user-supplied data set using molecular scaffolds, properties of pharmaceutical relevance and ngerprints of different design. Fig. 1 illustrates a screenshot of the server PUMA. The gure exemplies the analysis done with the chemical space tab available in the main top menu of the application. The website uses MySQL server to store the data and PHP and HTML codes to implement the main interface. The Python script is used to produce independent sub-processes to generate input to the prediction program and data processing 32 Molecular diversity of compound data sets can be evaluated employing molecular scaffolds, structural ngerprints and physicochemical properties. Consensus Diversity Plot (CDP) is a novel method to represent in low dimensions the diversity of chemical libraries considering simultaneously multiple molecular representations and to facilitate the classication of data sets into diverse or not diverse. 29 A recent application of CDPlots is the analysis and quantication of the global diversity of 354 natural products from Panama. The diversity of those compounds was compared against the diversity of natural products from Brazil, natural and semi-synthetic molecules used in high-throughput screening, and compounds used in Traditional Chinese Medicine. 39 The CDPlots rapidly led to the conclusion that natural products from Panama have a large scaffold diversity as compared to other databases.
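As an illustration of what a scaffold-based diversity measure can look like, the following minimal sketch computes two simple quantities from a list of scaffold labels: the fraction of unique scaffolds and a scaled Shannon entropy of the scaffold distribution. The function name and the toy scaffold labels are assumptions made for this example, and the metrics are not necessarily the ones implemented in PUMA or used to build CDPlots. Scaffold extraction itself (e.g., with a cheminformatics toolkit) is assumed to have been done beforehand.

```python
import math
from collections import Counter


def scaffold_diversity(scaffolds):
    """Return (fraction of unique scaffolds, scaled Shannon entropy).

    `scaffolds` holds one scaffold label (e.g., a scaffold SMILES string)
    per compound. The scaled entropy is 0 when all compounds share a
    single scaffold and 1 when compounds are spread evenly over scaffolds.
    """
    if not scaffolds:
        raise ValueError("scaffold list is empty")
    counts = Counter(scaffolds)
    n_compounds = len(scaffolds)
    n_scaffolds = len(counts)
    fraction_unique = n_scaffolds / n_compounds
    if n_scaffolds == 1:
        # A single scaffold carries no diversity; avoid dividing by log2(1) = 0.
        return fraction_unique, 0.0
    entropy = -sum(
        (c / n_compounds) * math.log2(c / n_compounds) for c in counts.values()
    )
    return fraction_unique, entropy / math.log2(n_scaffolds)


# Hypothetical toy library: four compounds over three scaffolds.
library = ["c1ccccc1", "c1ccccc1", "c1ccc2[nH]ccc2c1", "c1ccncc1"]
fraction, scaled_entropy = scaffold_diversity(library)
print(f"unique scaffolds: {fraction:.2f}, scaled entropy: {scaled_entropy:.3f}")
```

Because both values lie between 0 and 1, libraries of different sizes can be compared on the same scale; a full consensus analysis would combine such a scaffold measure with fingerprint-based and property-based measures.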

Servers to predict ADME and toxicity properties
Computational methods are being used to filter and select compounds based on different molecular characteristics that are considered relevant to predict the drug-likeness of molecules. Without the aid of computational methods, the drug development process would be more time-consuming and less efficient. However, the filtering rules employed by these methods are not absolute answers to the problem, and experimental confirmation is compulsory. A number of compounds fail during clinical phases due to poor pharmacokinetic and safety properties; therefore, the growing number of public and commercial in silico tools to predict ADMET (absorption, distribution, metabolism, excretion and toxicity) parameters is not surprising. SwissADME is a web tool that provides fast but robust predictive models for physicochemical properties, pharmacokinetics and drug-likeness, and identifies PAINS. 30 Other web servers used to predict toxicity are based on the prediction of metabolite formation. This is the case for MetaTox, which can also be used to predict toxicity endpoints, 31 and SOMP, a web service for the prediction of metabolism by human cytochrome P450. 32 Among the various toxicological endpoints, the carcinogenicity of potential drugs is of interest because of its serious effects on human health. In general, the carcinogenic potential of a compound is evaluated using animal models, which are time-consuming, expensive, and ethically concerning. The use of computational approaches such as CarcinoPred-El, which predicts carcinogenicity based on chemical structure properties, is an appealing alternative. CarcinoPred-El uses different molecular fingerprints and ensemble machine learning methods to predict the carcinogenicity of diverse organic compounds. 34 The use of animals for cosmetic experiments is forbidden in Europe; therefore, there is a strong need to develop alternative tests to evaluate skin sensitization. Pred-Skin is an app developed to predict the skin sensitization potential of chemicals based on binary QSAR models built from human (109 compounds) and murine local lymph node assay (LLNA, 515 compounds) data. 35
Chembench is a Java-based system. The front end of the website uses Java Server Pages (JSPs) with JavaScript, and the Struts 2 framework provides the interface between data on the JSPs and Java objects.

Online applications for exploring SAR and SmAR
The increasing availability of chemical biology data (discussed in Section 2) allows researchers to create models capable of predicting the potential chemical and biological behavior of compounds. There is a limited number of publicly available tools that can create such models and help in understanding the advantages and disadvantages behind the SAR concept. These models are highly dependent on the quality and quantity of the data available, so they should be selected based on the problem of interest. When available, approaches oriented to the problem at hand could be the best choice, but the results obtained by any methodology must be interpreted carefully.
Most of the websites developed to perform SAR analysis are focused on QSAR models (Table 2). This is the case for ChemSAR and Chembench. Both are web-based platforms to generate SAR and QSAR classification models employing machine learning methods. 37,38 Activity Landscape Plotter is an R-based web tool developed to analyze SAR using the concept of activity landscape modeling. The objective of activity landscape modeling is to explore the relationship between structure similarity and activity similarity (or potency difference) of screening data sets. 40,41 There are a number of numerical and visual methods useful for activity landscape modeling. In particular, Activity Landscape Plotter generates structure-activity similarity and dual-activity difference maps, and identifies activity cliffs in a data set with biological activity. 36 Dual-activity difference maps are particularly attractive for analyzing the SAR of data sets with activity data for two biological endpoints. Therefore, these maps are tools to explore SmARs. Fig. 2 shows a screenshot of the Activity Landscape Plotter, illustrating the Dual-Activity Difference (DAD) functionality available in the main menu of the server.
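To make the idea of a dual-activity difference map concrete, the sketch below computes, for every pair of compounds, the potency difference against each of two endpoints and flags pairs that are structurally similar yet differ strongly in potency for at least one endpoint. The function name, the input layout, the precomputed similarity values and both cutoffs are assumptions chosen for illustration; they are not taken from Activity Landscape Plotter.

```python
from itertools import combinations


def dad_pairs(activities, similarity, sim_cutoff=0.7, cliff_cutoff=1.0):
    """Compute dual-activity differences for all compound pairs.

    activities: {compound_id: (pActivity_target1, pActivity_target2)}
    similarity: {frozenset({id_a, id_b}): structural similarity in [0, 1]}
    The cutoffs are illustrative defaults, not recommended values.
    """
    rows = []
    for a, b in combinations(sorted(activities), 2):
        delta_t1 = activities[a][0] - activities[b][0]
        delta_t2 = activities[a][1] - activities[b][1]
        sim = similarity[frozenset((a, b))]
        # Flag as a cliff: high similarity and a large potency difference
        # for at least one of the two endpoints.
        is_cliff = (
            sim >= sim_cutoff
            and max(abs(delta_t1), abs(delta_t2)) >= cliff_cutoff
        )
        rows.append(
            {
                "pair": (a, b),
                "delta_t1": delta_t1,
                "delta_t2": delta_t2,
                "similarity": sim,
                "is_cliff": is_cliff,
            }
        )
    return rows


# Hypothetical toy data: pKi values against two targets.
toy_activities = {"A": (7.0, 5.0), "B": (5.0, 5.5), "C": (6.5, 6.0)}
toy_similarity = {
    frozenset(("A", "B")): 0.9,
    frozenset(("A", "C")): 0.3,
    frozenset(("B", "C")): 0.8,
}
for row in dad_pairs(toy_activities, toy_similarity):
    print(row)
```

Plotting `delta_t1` against `delta_t2` for all pairs gives the map itself: points near the origin correspond to pairs with similar potency against both endpoints, while points far along one axis indicate structural changes that affect one endpoint selectively.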

Disease and target oriented webservers
There is an increasing need to develop effective chemogenomic tools focused on integrating the large and growing amount of data available for specific health conditions. Multifactorial diseases that involve many genes, proteins and their interactions would be easier to study with the aid of web servers with databases that integrate and validate reported active compounds, molecular mechanisms and genetic association. This information could be easily reused to accelerate the discovery of novel compounds.
There are a number of servers focused on specific target families or diseases. These are summarized in Table 3. For complex diseases such as Alzheimer's disease and cancer, there are useful web servers containing information regarding important targets and their ligands. AlzPlatform 42 and AlzhCPI 43 are web tools implemented for target identification, polypharmacology and virtual screening of active compounds for the treatment of Alzheimer's disease. CDRUG, 44 CancerIN, 45 and CanSAR 46 are web servers developed to predict the anticancer activity of compounds. All these disease-oriented web tools contain valuable information such as genes, related proteins, drugs approved and in clinical trials, compounds associated with biological activity, as well as information on biological assays. Similar web servers have been implemented for specific targets. Kinase and GPCR SARfari are chemogenomic tools implemented on ChEMBL to incorporate and link GPCR and kinase sequences, structures, compounds and screening data. 47 Other web servers such as KIDFamMap were developed to design selective kinase inhibitors. 49 This has been a challenging task given the evolutionarily conserved ATP binding site where the majority of inhibitors are expected to bind. GLIDA is a public GPCR-related chemical genomics database that provides chemical information on the ligands as well as biological information regarding G-protein coupled receptors (GPCRs), which represent one of the most important families of drug targets in pharmaceutical development. 50
Epigenetics became of great importance for researchers when it was discovered that gene function could be altered by more than just changes in sequence. Today a number of diseases have been linked to amplification, mutation, and other alterations of epigenetic enzymes. Therefore, analyzing the most appropriate epigenetic enzymes involved in different diseases is a prerequisite for epigenetic therapeutics. HEMD is a web server that provides the utilities to display, search and analyze the structure, function and related annotation of human epigenetic enzymes and chemical modulators focused on epigenetic therapeutics. 51
Fig. 2 Screenshot of the Activity Landscape Plotter server. This server is focused on the analysis of structure-activity relationships of compound data sets. The screenshot illustrates the generation of the Dual Activity Difference (DAD) map for a data set of compounds tested with two biological endpoints. Full description of the server and access to the example data set are freely available at http://www.difacquim.com/d-tools/.

Data automatization with customizable workflows
In addition to web servers, which are increasingly used by experts and non-experts in chemoinformatics, there are open-source applications that enable the generation of workflows and greatly facilitate the automatization of data analysis. Among the advantages of these workflows are their customizability and adaptability to meet specific needs. KNIME is perhaps the most widely used open-access environment of this kind, and it is further described in this section. KNIME's modular workflow design, along with its ability to automatically parallelize many operations, its free distribution, and the simplicity with which analysis pipelines can be communicated, has made it widely successful in diverse areas of analytics. It is also quite flexible and allows the integration of different software and tools. 52 For a detailed explanation of the "workflow" concept, as well as other software following this approach, see the review by Tiwari and Sekhar. 53 In the following subsections, the issues that can be addressed through chemistry applications or plugins implemented in KNIME are presented.

Data curation
It has not escaped the attention of chemoinformaticians that there is a vital need to produce reliable libraries prior to computational modeling. [54][55][56] Therefore, several tools are emerging that are useful for processing and assessing chemical data (e.g., parsing molecules, removing mixtures and salts, optimizing pH and pKa, standardizing chemotypes, managing tautomers, standardizing synonyms, and visualizing chemical graphs). 54 KNIME includes plugins able to perform these operations. Some of these are open source (e.g., RDKit, Indigo, CDK), while others are commercial, though available at no additional cost to anyone holding a license for the standard software (e.g., Schrödinger, MOE, ICM, ChemAxon).
A prior step to data curation involves, of course, reading a chemical database. There are many kinds of files in which chemical information may be stored, including CSV, SDF, SQL and XML. KNIME provides extensions capable of reading most, if not all, of them. Regarding data curation pipelines, a recent publication by Gally et al. proposed a workflow for preliminary molecule preparation in KNIME. 57 Also, a useful and comprehensive tutorial on the application of KNIME to chemical data curation has been recently published elsewhere. 58

Chemical properties and calculations
A variety of chemical features can be assessed through the KNIME chemoinformatics extensions mentioned above, such as physicochemical descriptors (e.g., atomic molecular weight, SlogP, topological polar surface area, number of hydrogen bond acceptors and donors, rotatable bonds), complexity descriptors (e.g., fraction of sp3 atoms, number of chiral atoms), enumeration of heteroatoms, a wide variety of chemical fingerprints, similarity calculations, virtual screening, R-group decomposition and so forth. Also, tautomer lists and 3D functionalities such as 3D optimization, conformer generation and 3D similarity assessment are available in both free and commercial extensions. Docking is available mostly from commercial packages (GLIDE, ICM, MOE, Schrödinger, etc.), although using AutoDock within KNIME is also an option. 59 Of note, 3D-e-Chem-VM, a recently developed application, integrates KNIME with public domain resources for analyzing protein-ligand interaction data. Its tools aid in virtual screening, metabolism prediction and rational ligand design for kinases and G-protein coupled receptors. 60

Machine learning and SAR analysis
An interesting feature of KNIME is the incorporation of scalable machine learning. Some of these algorithms perform virtual screening by similarity searching or naïve Bayesian models, with some options configurable but most predetermined (see Fig. 3). Nonetheless, one way to enhance flexibility in KNIME workflows is to integrate scripts written in programming languages with libraries specialized in machine learning (such as R and Python). Murcko scaffolds can be computed as well, followed by enrichment factor calculations. 61 There are even specific nodes for studying activity cliffs. 59 Notably, deep learning nodes have been recently incorporated. 62
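The enrichment factor mentioned above can also be computed in a short script of the kind that may be embedded in a workflow. The sketch below uses the common definition (hit rate among the top-ranked fraction divided by the hit rate in the whole library); the function name, the default fraction and the toy data are assumptions for illustration and do not reproduce any particular KNIME node.

```python
def enrichment_factor(scores, labels, fraction=0.01):
    """Enrichment factor of a ranked screening list.

    scores: predicted scores, higher meaning more likely active.
    labels: 1 for known actives, 0 for inactives or decoys.
    fraction: top fraction of the ranked list to inspect (0 < fraction <= 1).
    """
    if not scores or len(scores) != len(labels):
        raise ValueError("scores and labels must be non-empty and equally long")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    n_total = len(scores)
    n_actives = sum(labels)
    if n_actives == 0:
        raise ValueError("at least one active is required")
    # Inspect at least one compound, even for very small fractions.
    n_top = max(1, int(round(fraction * n_total)))
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits_top = sum(label for _, label in ranked[:n_top])
    return (hits_top / n_top) / (n_actives / n_total)


# Hypothetical toy screen: ten compounds, three of them active.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0]
toy_labels = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]
print(enrichment_factor(toy_scores, toy_labels, fraction=0.2))
```

A value above 1 means that the top of the ranked list contains more actives than expected from random selection; a value of 1 corresponds to no enrichment.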

Examples of applications and a published KNIME workflow
In this section we describe two applications of KNIME to chemoinformatics. A more comprehensive review by Mazanetz et al. has been published, which also includes applications for data analysis applied to next generation sequencing and high throughput screening. 59
PAINS filter workflow. Identification of PAINS (pan assay interference compounds) is becoming increasingly relevant, as they are thought (not without controversy) 63 to have higher rates of false positives and unspecific promiscuity in screening studies. 64 Therefore, for many screening purposes it is widely preferred to filter them out, or at least identify them. Saubern et al. made available a KNIME workflow for identifying PAINS, after adequate molecule preprocessing. 65 They incorporated a previously published list of structural features intended to identify PAINS, 66 converted it to SMARTS format and used the resulting patterns to iteratively search through a chemical library of 10 000 compounds. The algorithm outputs a file with structures that do not match any of the features, as well as another file with structures that match, along with the labels of the matching PAINS features. They compared the results of using Indigo or RDKit KNIME nodes for substructure search against the hits from the original reference, 66 finding a higher overlap when Indigo nodes were used.
Rule of 0.5 of an approved drug's metabolite-likeness. Given prior insights that metabolites and approved drugs share chemical features, 67 O'Hagan et al. evaluated this hypothesis using KNIME nodes. 68 They pre-processed the DrugBank approved drugs database and a human metabolites chemical database, calculated 166-bit MACCS fingerprints, and then evaluated the similarity between the two datasets. They discovered that most (≥90%) of the approved drugs have a Tanimoto similarity of 0.5 or higher to their 'nearest' metabolite. Therefore, they suggested a '0.5 metabolite-likeness rule' that characterizes post-marketed drugs.
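The nearest-metabolite comparison can be sketched in a few lines once fingerprints are available. In the example below, each fingerprint is represented as the set of indices of its "on" bits; the function names, the compound labels and the toy bit sets are hypothetical and stand in for real MACCS fingerprints, which would be computed with a cheminformatics toolkit. This is a minimal sketch of the general approach, not the published workflow.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bits."""
    fp_a, fp_b = set(fp_a), set(fp_b)
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0


def fraction_metabolite_like(drugs, metabolites, cutoff=0.5):
    """Fraction of drugs whose nearest metabolite reaches the similarity cutoff.

    drugs, metabolites: {name: set of on-bit indices}
    Returns (fraction, {drug name: similarity to its nearest metabolite}).
    """
    if not drugs or not metabolites:
        raise ValueError("both datasets must be non-empty")
    nearest = {
        name: max(tanimoto(fp, met_fp) for met_fp in metabolites.values())
        for name, fp in drugs.items()
    }
    n_pass = sum(1 for sim in nearest.values() if sim >= cutoff)
    return n_pass / len(nearest), nearest


# Hypothetical toy fingerprints (sets of on-bit indices).
toy_drugs = {"drug_1": {1, 2, 3, 4}, "drug_2": {10, 11, 12}}
toy_metabolites = {"met_1": {1, 2, 3, 5}, "met_2": {20, 21}}
fraction, nearest = fraction_metabolite_like(toy_drugs, toy_metabolites)
print(fraction, nearest)
```

The exhaustive comparison scales with the product of the two dataset sizes, which is manageable for a few thousand drugs and metabolites but would call for indexed or vectorized similarity searching on much larger collections.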

Conclusions and future directions
The amount of information in drug discovery continues to increase rapidly. This is true for both the size of the screening libraries and the biological activity data. Therefore, the increasing amount of information, i.e., big data (particularly in the public domain), has boosted the development of tools for the comprehensive assessment of the coverage and diversity of the chemical space of compound libraries. Likewise, there is a need to develop automatized applications for the rapid exploration of SAR and SmAR, and to simplify the communication of results across research teams. There are numerous chemoinformatic resources available to implement protocols that analyze different aspects of chemical space and SAR/SmAR. These resources are being implemented in open web servers or workflows. These tools benefit not only chemoinformaticians but also members of the multidisciplinary teams working on drug discovery projects who are non-experts or lack the time to generate their own code or workflows from scratch. It is anticipated that these tools will continue to evolve and improve. Importantly, it is desirable that easy-to-use web server applications do not become black boxes. It is of great importance that users are fully aware of the calculations that are performed, in order to make the most of the interpretation of the results, and that they are aware of the approximations and eventual limitations of the application or workflow. Continuous development of web servers dedicated to exploring the SAR and chemical space of a disease or target family is also expected. The improvement and refinement of these servers will certainly benefit from the constant increase of chemical biology information available in the public domain.
Fig. 3 An example of a KNIME workflow for reading a chemical dataset and performing target prediction.

Conflicts of interest
There are no conicts to declare.