Mariana González-Medina (a),
J. Jesús Naveja (a, b),
Norberto Sánchez-Cruz (a) and
José L. Medina-Franco* (a)
(a) Department of Pharmacy, School of Chemistry, Universidad Nacional Autónoma de México, Avenida Universidad 3000, Mexico City 04510, Mexico. E-mail: medinajl@unam.mx; jose.medina.franco@gmail.com; Tel: +52-55-5622-3899 ext. 44458
(b) PECEM, School of Medicine, Universidad Nacional Autónoma de México, Avenida Universidad 3000, Mexico City 04510, Mexico
First published on 24th November 2017
New technologies are shaping the way drug discovery data is analyzed and shared. Open data initiatives and web servers are assisting the analysis of the large amounts of data that we are now able to produce. The final goal is to accelerate the process of moving from new data to useful information that could lead to treatments for human diseases. This review discusses open chemoinformatic resources to analyze the diversity and coverage of the chemical space of screening libraries and to explore structure–activity relationships of screening data sets. Free resources to implement workflows and representative web-based applications are emphasized. Future directions in this field are also discussed.
Computer-aided drug discovery has a large impact on the pharmaceutical industry because it helps reduce the time and cost of the drug development process. However, researchers in the pharmaceutical and medicinal chemistry fields often lack training in informatics. Free and easy-to-use chemoinformatic tools for drug development spare investigators the time needed to acquire programming and development skills in the already complex and multidisciplinary field of drug discovery. At the same time, these resources help research teams focus on solving problems that are specific to their fields of expertise. In this context, chemoinformatics plays an important role in mining the chemical space of the almost infinite number of organic drug-like molecules available for drug discovery. The outcome allows researchers to find connections between biological activities, ligands and proteins.3
Herein we review representative chemoinformatic tools essential to explore the structure, chemical space and properties of molecules. The review is focused on recent and representative free web-based applications. We also discuss KNIME as an open resource broadly used in chemoinformatics for the automation of data analysis. The review is organized in eight major sections. After this introduction, open sources of chemical biology data are discussed. Section 3 discusses online servers for the generation of molecular properties, diversity analysis, and visualization of the chemical space. The next section focuses on web-based applications to predict ADME and toxicity properties, which are essential in drug discovery programs. Section 5 presents online applications to analyze structure–activity relationships (SAR) and structure–multiple activity relationships (SmAR). The section after that discusses web servers that aim to assist drug discovery and development efforts focused on a particular disease or target family. Section 7 covers open resources to implement workflows for data analysis. In contrast to most web-based applications discussed in Sections 3–6, the workflows presented in Section 7 are highly customizable by the user. The last section presents conclusions and future directions.
Database | Data | General information | Ref. |
---|---|---|---|
ChEMBL | In total, there are >1.6 million distinct compound structures, with 14 million activity values from >1.2 million assays. These assays are mapped to ∼11000 targets, including 9052 proteins | ChEMBL is an open large-scale bioactivity database. It contains data from the medicinal chemistry literature, deposited data sets from neglected disease screening, crop protection data, drug metabolism and disposition data, bioactivity data from patents, the annotation of assays and targets using ontologies, the inclusion of targets and indications for clinical candidates, addition of metabolic pathways for drugs and calculation of structural alerts | 4 |
PubChem | It contains the information of 92058388 compounds; 1252809 bioassays; 2395818 tested compounds; 170 RNAi bioactivities; 233516687 bioactivities; 10341 protein targets; 22104 gene targets | PubChem is a public chemical information repository in the National Center for Biotechnology Information. It provides information on the biological activities of small molecules. PubChem is organized as three linked databases within the NCBI's Entrez information retrieval system. These are PubChem substance, PubChem compound, and PubChem BioAssay. PubChem also provides a fast chemical similarity search tool | 5,6 |
Binding Database | It holds about 1.1 million measured protein-small molecule affinities, involving about 490000 small molecules and several thousand proteins | Binding DB is a publicly accessible database of experimental protein-small molecule interaction data primarily from scientific articles and US patents | 7 |
CARLSBAD | The 2012 release of CARLSBAD contains 439985 unique chemical structures, mapped onto 1420889 unique bioactivities | The CARLSBAD database has been developed as an integrated resource, focused on high-quality subsets from several bioactivity databases, which are aggregated and presented in a uniform manner, suitable for the study of the relationships between small molecules and targets | 8 |
ExCAPE-DB | In total there are 998131 unique compounds and 70850163 structure–activity relationship (SAR) data points covering 1667 targets | ExCAPE-DB is a large public chemogenomics dataset based on the PubChem and ChEMBL databases. Large scale standardization (including tautomerization) of chemical structures was performed using open source chemoinformatics software | 9 |
BRENDA | Currently BRENDA contains manually curated data for 82568 enzymes and 7.2 million enzyme sequences from UniProt | BRENDA is the main collection of enzyme functional data available to the scientific community | 10 |
DrugCentral | Over 14000 numeric values are captured covering 2190 human and non-human targets for 1792 unique active pharmaceutical ingredients | DrugCentral is a comprehensive drug information resource for FDA-approved drugs and drugs approved outside the US. The resource can be searched using drug, target, disease, and pharmacologic action terms | 11 |
Probes & drugs portal | It contains 31182 compounds, 4727 targets, and 114825 bioactivities | The probes & drugs portal is a public resource joining together focused libraries of bioactive compounds (probes, drugs, specific inhibitor sets, etc.) with commercially available screening libraries | 12 |
DrugBank | It contains 9591 drug entries including 2037 FDA-approved small molecule drugs, 241 FDA-approved biotech (protein/peptide) drugs, 96 nutraceuticals and over 6000 experimental drugs. Additionally, 4661 non-redundant protein sequences are linked to these drug entries | The DrugBank database is a unique bioinformatics and chemoinformatics resource that combines detailed drug (i.e. chemical, pharmacological and pharmaceutical) data with comprehensive drug target (i.e. sequence, structure, and pathway) information | 13 |
repoDB | repoDB spans 1571 drugs and 2051 Unified Medical Language System (UMLS) indication disease concepts, accounting for 6677 approved and 4123 failed drug-indication pairs | repoDB contains a standard set of drug repositioning successes and failures that can be used to fairly and reproducibly benchmark computational repositioning methods. repoDB data was extracted from DrugCentral and ClinicalTrials.gov | 14 |
PharmGKB | It has over 5000 genetic variants annotations, with over 900 genes related to drugs and over 600 drugs related to genes | PharmGKB captures pharmacogenomic relationships in a structured format so that it can be searched, interrelated, and displayed according to the researchers' interests. The knowledge base is valuable both to the researcher who is interested in a specific single nucleotide polymorphism and its influence on a particular drug treatment and to the researcher interested in a disease or drug and looking for candidate genes which may affect disease progression or drug response | 15 |
Of note, although the availability of these data is important to build new models and make in silico predictions, the data and content in these databases are rather heterogeneous.
Perhaps the most common and widely used databases are ChEMBL, which contains 1.6 million distinct compounds and 14 million activity values,4 PubChem5,6 with more than 93 million compounds and more than 233 million bioactivities, and Binding Database with 490k small molecules and 1.1 million measured protein-small molecule affinities.7
Other resources include CARLSBAD, a bioactivity database with 435343 compounds and 932852 bioactivities. The advantage of CARLSBAD is that only one activity value of a given type (Ki, EC50, etc.) is stored for a given structure–target pair.8 ExCAPE-DB is a comprehensive chemogenomics dataset with 998131 compounds and 70850163 biological activity data points.9 BRENDA is an information system that gathers enzyme and enzyme–ligand data from different sources; functional and structural data of more than 190000 enzyme ligands are stored within this system.10 Knowledge of bioactivity could help to identify potential targets for a specific molecule.
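The idea of keeping a single activity value of each type per structure–target pair can be sketched in a few lines of Python. This is only an illustration: the record layout, the identifiers and the use of the median as the aggregation rule are assumptions made here, not the procedure documented for CARLSBAD.

```python
from collections import defaultdict
from statistics import median


def collapse_activities(records):
    """Keep one value per (compound, target, activity type) triple.

    `records` is an iterable of (compound_id, target_id, activity_type, value)
    tuples. The median is an assumed aggregation rule chosen for illustration.
    """
    grouped = defaultdict(list)
    for compound, target, activity_type, value in records:
        grouped[(compound, target, activity_type)].append(value)
    return {key: median(values) for key, values in grouped.items()}


# Hypothetical measurements (nM) with repeated Ki values for one pair.
records = [
    ("CPD1", "T1", "Ki", 10.0),
    ("CPD1", "T1", "Ki", 30.0),
    ("CPD1", "T1", "Ki", 20.0),
    ("CPD1", "T1", "EC50", 5.0),
    ("CPD2", "T1", "Ki", 100.0),
]
collapsed = collapse_activities(records)
```

Values of different types (Ki versus EC50) are kept apart, so heterogeneous measurements are never averaged together.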
DrugCentral is a database that integrates structure, bioactivity, regulatory, pharmacologic actions and indications for active pharmaceutical ingredients approved by FDA and other regulatory agencies.11 The probes and drugs portal is a public resource putting together focused libraries of bioactive compounds (877 probes and 12190 drugs) with commercially available screening libraries. The rationale behind it is to reflect the current state of bioactive compound space and to enable its exploration from different points of view.12 Finding new uses for old drugs could be economically advantageous; therefore, the development of databases like DrugCentral and the probes and drugs portal will be beneficial for polypharmacology.16
One of the central points of the concept of chemical space is molecular representation, i.e., the set of descriptors used to define the space of the chemicals that will be analyzed. A second major point is the visual representation and mining of that space, e.g., analysis of the diversity and coverage. These aspects are important to consider in the analysis and interpretation of data, because distinct approaches may lead to representations that in most cases are not comparable to each other, and the best one is usually determined by the nature of the data analyzed. Web servers to explore chemical space usually incorporate one or more of the following operations: calculation of descriptors, visualization, and diversity analysis. Table 2 summarizes recent online servers for generating and mining the chemical space of compound databases using different approaches. Representative servers are further commented on in this section.
Tool | Primary use | Functions | Implementation | Ref. |
---|---|---|---|---|
ChemMine | Set of chemoinformatics and data mining tools | Compounds visualization, similarity quantification, a search toolbox to retrieve similar compounds from PubChem, clustering and data visualization and molecular properties calculation | The server integrates over 30 chemoinformatics and data mining tools. One of the most important is ChemMineR, an R package that integrates Open Babel and JOELib functionalities. The web interface was written in Python using the Django web framework | 22 |
ChemBioServer | Mining and filtering chemical compound libraries | 2D and 3D molecule visualization, compound filtering: by toxicity, repeated compounds and steric clashes, similarity clustering using molecular fingerprints, data mining, graphical representation and visualization | The application back-end was developed in R programming language, while the front-end is implemented with PHP. 2D/3D display of compounds is accomplished with JChemPaint and Jmol respectively. Compound fingerprints are generated with Open Babel | 23 |
FAF-Drugs4 | Mining and filtering chemical compound libraries | Filters compounds based on physicochemical properties, ADMET rules and pan assay interference compounds (PAINS) | The application consists of a set of seven object-oriented Python modules embedded in the RPBS′ Mobyle framework. Each compound processed by FAF-Drugs3 is represented as a molecular object that imports methods from the Open Babel toolkit through its Python wrapper Pybel, which gives access to the Open Babel C++ library | 24 |
BioTriangle | Molecular properties and molecular fingerprints calculation | Computes descriptors that describe chemical features, protein features and DNA/RNA features | The application was implemented in an open source Python framework (Django) for the Graphical User Interface (GUI) and MySQL for data retrieval. The main calculation procedures and transaction processing procedures are written in Python language | 25 |
ChemDes | Molecular properties and molecular fingerprints calculation | Computes more than 3679 molecular descriptors and provides 59 types of molecular fingerprint | The application back-end was developed with Python. Django was chosen as a high-level Python web framework for web interface | 26 |
webMolCS | A web-based interface for visualizing sets of up to 5000 user-defined molecules in 3D chemical spaces and selecting subsets | Computes molecular fingerprints that are used to generate 3D chemical spaces using either principal component analysis (PCA) or similarity mapping (SIM) | This web server was developed using JavaScript and the JChem java chemistry library from ChemAxon | 27 |
Platform for Unified Molecular Analysis (PUMA) | Chemical space and analysis of chemical diversity | Chemical space, molecular properties diversity, scaffold diversity and structural diversity | The application back-end was developed in R programming language: plotly for the interactive plots, rcdk for the chemoinformatic analysis and Shiny for the user interface | 28 |
Consensus diversity plots | Global diversity visualization | Plots to visualize simultaneously several metrics of diversity and classify data sets | The application back-end was developed in R programming language. Shiny package was used for the user interface | 29 |
SwissADME | Molecular and physicochemical properties. Identifies PAINS | Web tool enables the computation of physicochemical, pharmacokinetic, drug-like and related parameters | The website was written in HTML, PHP5, and JavaScript, whereas the backend of computation was mainly coded in Python 2.7 | 30 |
MetaTox | Calculation of probability for generated metabolites. Prediction of LD50 values | Prediction of the metabolism of xenobiotics and calculation of the toxicity of metabolites based on the structural formula of chemicals | The website uses MySQL server to store the data and PHP and HTML codes to implement the main interface. A Python script is used to generate the prediction and for data processing | 31 |
SOMP | Prediction is based on PASS (Prediction of Activity Spectra for Substances) technology and labelled multilevel neighborhoods of atom descriptors | Prediction for drug-like compounds that are metabolized by the main CYP isoforms and UGT | The website uses MySQL server to store the data and PHP and HTML codes to implement the main interface. The Python script is used to produce independent sub-processes to generate input to the prediction program and data processing | 32 |
CarcinoPred-EL | Computes ensemble machine learning methods to predict carcinogenicity and identify structural features related to carcinogenic effects | This web server computes molecular fingerprints and uses ensemble machine learning methods to discover potential carcinogens | This website uses PaDEL-descriptors33 to compute the molecular fingerprints and the R package caret for the machine learning methods | 34 |
Pred-Skin | Binary QSAR models | Web-based and mobile application for the identification of potential skin sensitizers | The app is encoded using Flask, uWSGI, Nginx, Python, RDKit, scikit-learn and JavaScript | 35 |
Activity Landscape Plotter | Activity landscape modeling and structure–activity relationships | Structure Activity Similarity (SAS) maps, Structure Activity Landscape Index (SALI) and Dual Activity Difference (DAD) maps | The application back-end was developed in R programming language. Rcdk and Shiny packages are used for the chemoinformatic analysis and user interface, respectively | 36 |
ChemSAR | Structure preprocessing, molecular descriptor calculation, data preprocessing, feature selection, model building and prediction, model interpretation and statistical analysis | This web site provides standardization of chemical structure representations, calculation of 783 1D/2D molecular descriptors and ten types of fingerprints for small molecules, filtering methods for feature selection, and generation of predictive models | Python/Django and MySQL were used for server-side programming, and HTML, CSS and JavaScript were employed for the web interface | 37 |
Chembench | Chembench is a tool for data visualization, creation and validation of predictive quantitative structure–activity relationship models, and virtual screening | Chembench supports the following chemoinformatics data analysis tasks: dataset creation, dataset visualization, modeling, model validation and virtual screening | Chembench is a Java-based system. The front end of the website uses Java Server Pages (JSPs) with JavaScript. The Struts 2 framework provides the interface between data on the JSPs and Java objects | 38 |
ChemMine is an online portal with five main application domains: compound visualization, similarity quantification, a search toolbox to retrieve similar compounds from PubChem, clustering and data visualization, and molecular properties calculation.22
ChemBioServer is a free web-based tool that can aid researchers in compound filtering and clustering. Compounds that survive the filtering process can be visualized using molecular properties and principal component analysis.23
ChemDes is a free web-based platform for the calculation of molecular descriptors and fingerprints. It contains more than 3679 molecular descriptors that are divided into 61 logical blocks. In addition, ChemDes provides 59 types of molecular fingerprint systems.26
BioTriangle can calculate a large number of molecular descriptors of individual molecules, structural and physicochemical features of proteins and peptides from their amino acid sequences, and composition and physicochemical features of DNAs/RNAs from their primary sequences.25
FAF-Drugs3, now FAF-Drugs4, is a web server that applies an enhanced structure curation procedure that filters compounds based on physicochemical properties, ADMET rules and generally unwanted molecules also known as pan assay interference compounds (PAINS).24 This server can be used to generate and analyze ADMET-relevant chemical spaces.19
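Property-based filtering of the kind described above can be illustrated with a minimal sketch. The thresholds below are the widely used Lipinski rule-of-five limits, chosen here as an example; they are not claimed to be the filter set applied by FAF-Drugs4, and the compounds and property values are hypothetical (in practice they would come from a descriptor calculator).

```python
# Upper limits of the Lipinski rule of five (illustrative rule set).
RULE_OF_FIVE = {
    "mol_weight": 500.0,      # molecular weight, Da
    "logp": 5.0,              # octanol/water partition coefficient
    "h_bond_donors": 5,
    "h_bond_acceptors": 10,
}


def count_violations(props, limits=RULE_OF_FIVE):
    """Count how many properties exceed their upper limit."""
    return sum(1 for name, limit in limits.items() if props[name] > limit)


def filter_library(library, max_violations=1):
    """Keep compounds with at most `max_violations` rule violations."""
    return [c for c in library if count_violations(c["props"]) <= max_violations]


# Hypothetical compounds with precomputed properties.
library = [
    {"name": "A", "props": {"mol_weight": 350.0, "logp": 2.1,
                            "h_bond_donors": 2, "h_bond_acceptors": 5}},
    {"name": "B", "props": {"mol_weight": 720.0, "logp": 6.3,
                            "h_bond_donors": 6, "h_bond_acceptors": 12}},
    {"name": "C", "props": {"mol_weight": 510.0, "logp": 3.0,
                            "h_bond_donors": 1, "h_bond_acceptors": 4}},
]
kept = [c["name"] for c in filter_library(library)]
```

Keeping the rule set in a dictionary makes it straightforward to swap in a different set of limits without touching the filtering logic.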
The visualization of the chemical space of molecular databases has proven to be relevant for assessing molecular diversity and biological properties. webMolCS is a web-based interface to visualize sets of user-defined molecules in 3D chemical spaces, using different molecular fingerprints, and to select subsets.27
The visualization of the chemical space can give a good idea of how diverse the datasets are. However, since the diversity criteria depend on the molecular representation employed, a tool that computes different diversity metrics is useful to researchers with different backgrounds. Platform for Unified Molecular Analysis (PUMA) is a web server developed to visualize the chemical space and measure the molecular properties and structural diversity of datasets.
PUMA addresses the issue of the dependence of chemical space on structure representation. In this server the user can analyze a user-supplied data set using molecular scaffolds, properties of pharmaceutical relevance and fingerprints of different design. Fig. 1 shows a screenshot of the PUMA server. The figure exemplifies the analysis done with the chemical space tab available in the main top menu of the application.
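As one concrete instance of a fingerprint-based diversity measure, the sketch below computes the Tanimoto coefficient between binary fingerprints and the mean pairwise similarity of a data set. It is a minimal illustration, not PUMA's implementation: fingerprints are represented as Python sets of on-bit indices, and the bit patterns are hypothetical.

```python
from itertools import combinations


def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bits."""
    shared = len(fp_a & fp_b)
    union = len(fp_a) + len(fp_b) - shared
    # Convention assumed here: two empty fingerprints have similarity 0.
    return shared / union if union else 0.0


def mean_pairwise_similarity(fingerprints):
    """Average Tanimoto similarity over all unordered pairs."""
    pairs = list(combinations(fingerprints, 2))
    return sum(tanimoto(a, b) for a, b in pairs) / len(pairs)


# Hypothetical fingerprints for three compounds.
fingerprints = [{1, 2, 3, 4}, {3, 4, 5, 6}, {1, 2, 3, 4}]
mean_similarity = mean_pairwise_similarity(fingerprints)
```

A lower mean similarity indicates a structurally more diverse set under the chosen fingerprint; a different fingerprint design can rank the same data sets differently, which is the representation dependence discussed above.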
Molecular diversity of compound data sets can be evaluated employing molecular scaffolds, structural fingerprints and physicochemical properties. Consensus Diversity Plot (CDP) is a novel method to represent in low dimensions the diversity of chemical libraries considering simultaneously multiple molecular representations and to facilitate the classification of data sets into diverse or not diverse.29 A recent application of CDPlots is the analysis and quantification of the global diversity of 354 natural products from Panama. The diversity of those compounds was compared against the diversity of natural products from Brazil, natural and semi-synthetic molecules used in high-throughput screening, and compounds used in Traditional Chinese Medicine.39 The CDPlots rapidly led to the conclusion that natural products from Panama have a large scaffold diversity as compared to other databases.
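Scaffold-based diversity can be quantified in several ways. The sketch below uses three simple measures: the fraction of distinct scaffolds, the fraction of singleton scaffolds, and a Shannon entropy scaled by its maximum. These are assumptions chosen for illustration and are not necessarily the metrics plotted in a CDPlot. The scaffold labels are hypothetical and would normally be generated by a chemoinformatics toolkit.

```python
from collections import Counter
from math import log2


def scaffold_diversity(scaffolds):
    """Summarize how compounds are distributed over scaffolds.

    `scaffolds` holds one scaffold identifier per compound.
    """
    counts = Counter(scaffolds)
    n_compounds = len(scaffolds)
    n_scaffolds = len(counts)
    singletons = sum(1 for c in counts.values() if c == 1)
    entropy = -sum((c / n_compounds) * log2(c / n_compounds)
                   for c in counts.values())
    # Scale by the maximum entropy, reached when all scaffolds are
    # equally populated; defined as 0 for a single scaffold.
    scaled = entropy / log2(n_scaffolds) if n_scaffolds > 1 else 0.0
    return {
        "scaffold_fraction": n_scaffolds / n_compounds,
        "singleton_fraction": singletons / n_scaffolds,
        "scaled_entropy": scaled,
    }


# Hypothetical data set: eight compounds over four scaffolds.
scaffolds = ["benzene"] * 4 + ["indole"] * 2 + ["pyridine", "quinoline"]
summary = scaffold_diversity(scaffolds)
```

A scaled entropy close to 1 means compounds are spread evenly across scaffolds, whereas values close to 0 indicate that a few scaffolds dominate the set.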
SwissADME is a web tool that provides fast yet robust predictive models for physicochemical properties, pharmacokinetics and drug-likeness, and that identifies PAINS.30 Other web servers used to predict toxicity are based on the prediction of metabolite formation. This is the case for MetaTox, which can also be used to predict toxicity endpoints,31 and SOMP, a web service for the prediction of metabolism by human cytochrome P450.32 Among the various toxicological endpoints, the carcinogenicity of potential drugs is of interest because of its serious effects on human health. In general, the carcinogenic potential of a compound is evaluated using animal models, which are time-consuming, expensive, and ethically concerning. The use of computational approaches such as CarcinoPred-EL, which predicts carcinogenicity based on chemical structure properties, is an appealing alternative. CarcinoPred-EL uses different molecular fingerprints and ensemble machine learning methods to predict the carcinogenicity of diverse organic compounds.34
The use of animals for cosmetic experiments is forbidden in Europe; therefore, there is a strong need to develop alternative tests to evaluate skin sensitization. Pred-Skin is an app developed to predict the skin sensitization potential of chemicals based on binary QSAR models built from human (109 compounds) and murine local lymph node assay (LLNA, 515 compounds) data.35
Most of the websites developed to perform SAR analysis focus on QSAR models (Table 2). This is the case for ChemSAR and Chembench. Both are web-based platforms to generate SAR and QSAR classification models employing machine learning methods.37,38
Activity Landscape Plotter is an R-based web tool developed to analyze SAR using the concept of activity landscape modeling. The objective of activity landscape modeling is to explore the relationship between structure similarity and activity similarity (or potency difference) of screening data sets.40,41 There are a number of numerical and visual methods useful for activity landscape modeling. In particular, Activity Landscape Plotter generates structure–activity similarity and dual-activity difference maps, and identifies activity cliffs in a data set with biological activity.36 Dual-activity difference maps are particularly attractive to analyze the SAR of data sets with activity data for two biological endpoints. Therefore, these maps are tools to explore SmARs. Fig. 2 shows a screenshot of the Activity Landscape Plotter. It illustrates the Dual-Activity Difference (DAD) functionality available in the main menu of the server.
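The pairwise quantities behind these maps can be sketched as follows. For every compound pair the code computes the structural similarity, the signed activity differences for two endpoints (the coordinates of a DAD map), and the SALI value for the first endpoint, using the commonly cited definition SALI = |A_i − A_j| / (1 − sim_ij). This is a minimal sketch in Python rather than the R code of the server; the set-based fingerprints, compound names and pIC50 values are hypothetical.

```python
from itertools import combinations
from math import inf


def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bits."""
    shared = len(fp_a & fp_b)
    union = len(fp_a) + len(fp_b) - shared
    return shared / union if union else 0.0


def sali(activity_i, activity_j, similarity):
    """Structure-Activity Landscape Index for one compound pair."""
    if similarity >= 1.0:
        # Identical fingerprints: the index is undefined, reported as inf.
        return inf
    return abs(activity_i - activity_j) / (1.0 - similarity)


def landscape_pairs(compounds):
    """Pairwise similarity, activity differences and SALI values."""
    rows = []
    for a, b in combinations(compounds, 2):
        sim = tanimoto(a["fp"], b["fp"])
        rows.append({
            "pair": (a["name"], b["name"]),
            "similarity": sim,
            "delta_t1": a["p_t1"] - b["p_t1"],  # DAD map x coordinate
            "delta_t2": a["p_t2"] - b["p_t2"],  # DAD map y coordinate
            "sali_t1": sali(a["p_t1"], b["p_t1"], sim),
        })
    return rows


# Hypothetical compounds with pIC50 values against two targets.
compounds = [
    {"name": "A", "fp": {1, 2, 3, 4}, "p_t1": 8.0, "p_t2": 5.0},
    {"name": "B", "fp": {1, 2, 3, 5}, "p_t1": 5.0, "p_t2": 5.5},
    {"name": "C", "fp": {7, 8, 9}, "p_t1": 6.0, "p_t2": 6.0},
]
rows = landscape_pairs(compounds)
```

Pairs with high similarity and a large activity difference, such as A and B here, yield high SALI values and are candidate activity cliffs.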
There are a number of servers that are focused on specific target families or diseases. These are summarized in Table 3. For complex diseases such as Alzheimer disease and cancer, there are useful web servers containing information on important targets and their ligands. AlzPlatform42 and AlzhCPI43 are web tools implemented for target identification, polypharmacology and virtual screening of active compounds for the treatment of Alzheimer disease. CDRUG,44 CancerIN,45 and CanSAR46 are web servers developed to predict the anticancer activity of compounds. All these disease-oriented web tools contain valuable information such as genes, related proteins, drugs approved and in clinical trials, compounds associated with biological activity, as well as information on biological assays.
Tool | Primary use | General approach | Implementation | Ref. |
---|---|---|---|---|
AlzPlatform | Web tool implemented for target identification and polypharmacology analysis for Alzheimer disease research | Assembled with Alzheimer disease-related chemogenomics data records. Uses TargetHunter and/or HTDocking programs for identification of multitargets and polypharmacology analysis and also for screening and prediction of new Alzheimer disease active small molecules | AlzPlatform was constructed based on the molecular database prototype CBID, with a MySQL database and an Apache web server. Open Babel is the search engine for chemical structures. The web interface is written in PHP | 42 |
AlzhCPI | This server will facilitate target identification and virtual screening of active compounds for the treatment of Alzheimer disease | AlzhCPI predicts chemical–protein interactions based on multitarget quantitative structure–activity relationships (mt-QSAR) using naive Bayesian and recursive partitioning algorithms | The web server was designed using HTML and CSS technology | 43 |
Kinase SARfari | This is an integrated chemogenomics workbench focused on kinases. The system incorporates and links kinase sequence, structure, compounds and screening data | Kinase SARfari data is accessible via: compound-similarity and substructure searching, target keyword and sequence similarity searching. Provides target and screening data through compound initiated queries | The ChEMBL web services are written in Python programming language within Django software framework | 47 |
KIDFamMap | First tool to explore kinase-inhibitor families (KIFs) and kinase-inhibitor-disease (KID) relationships for kinase inhibitor selectivity and mechanisms | This tool includes 1208 KIFs, 962 KIDs, 55603 kinase-inhibitor interactions (KIIs), 35788 kinase inhibitors, 399 human protein kinases, 339 diseases and 638 disease allelic variants. KIDFamMap searches the kinase candidates (K′) with significant sequence similarity (E-values ≤ e−10) using BLASTP48 and also searches the compound candidates (I′) with significant topology similarity (≥0.6) using atom pairs and moiety composition from the annotated KII database (≤10 μM) | Not reported | 49 |
GLIDA | This web server provides interaction data between GPCRs and their ligands, along with chemical information on the ligands, as well as biological information regarding GPCRs | GLIDA includes a variety of similarity search functions for the GPCRs and for their ligands. Thus, GLIDA can provide correlation maps linking the searched homologous GPCRs (or ligands) with their ligands (or GPCRs) | GLIDA was constructed on the LAMP (Linux, Apache, MySQL and PHP) platform | 50 |
GPCR SARfari | GPCR SARfari is an integrated chemogenomics research and discovery workbench for class A G protein coupled receptors | GPCR data is accessible via compound-similarity and substructure searching, target keyword and sequence similarity searching. Provides target and screening data through compound initiated queries | The ChEMBL web services are written in Python programming language within Django software framework | 47 |
CancerIN | The web server uses machine learning and potency score based methods to classify compounds as anticancer and non-anticancer | This server provides various facilities that includes; virtual screening of anticancer molecules, analog based drug design, and similarity with known anticancer molecules | CancerIN was built using python scripts | 45 |
CDRUG | CDRUG is a web server for predicting anticancer efficacy of chemical compounds | CDRUG uses a novel molecular description method (relative frequency-weighted fingerprint) to implement the compound ‘fingerprints’. Then, a hybrid score is calculated to measure the similarity between the query and the active compounds. Finally, a confidence level (P-value) is calculated to predict whether or not the query compounds have anticancer activity | CDRUG employs both Python and Java to implement prediction of anticancer activity. Pybel is used to calculate the Daylight fingerprint and jCompoundMapper is used to calculate the kernel fingerprint | 44 |
CanSAR | Tool to identify biological annotation of a target, its structural characterization, expression levels and protein interaction data, as well as suitable cell lines for experiments, potential tool compounds and similarity to known drug targets | A large set of descriptors is calculated for each of the compounds to enable clustering of compounds into chemically related groups. Bemis and Murcko frameworks are calculated for all compounds. The interface allows users to rapidly obtain biological and chemical annotation together with druggability considerations, explore genomic variation and gene-expression data, and identify relevant cell lines for experiments and tool compounds for analysis | CanSAR is running on an Apache web server implemented in PHP, JavaScript, Perl and Java. Chemical compound search and handling is supported by the Accelrys direct cartridge. The data processing pipelines are written in Perl, Python and Java and utilize OpenBabel, CDK and Pipeline Pilot | 46 |
HEMD | HEMD provides a central resource for the display, search, and analysis of the structure, function, and related annotation for human epigenetic enzymes and chemical modulators focused on epigenetic therapeutics | User may paste a SMILES or sketch a potential epigenetic compound. Submitting the query launches a structure similarity search tool in HEMD. In addition to these structure similarity searches, the “Modulator search” utility also supports compound searches on the basis of physicochemical properties and chemical formulas | Not reported | 51 |
Similar web servers have been implemented for specific targets. Kinase SARfari and GPCR SARfari are chemogenomic tools implemented on ChEMBL to incorporate and link GPCR and kinase sequences, structures, compounds and screening data.47 Other web servers such as KIDFamMap were developed to design selective kinase inhibitors.49 This has been a challenging task given the evolutionarily conserved ATP binding site where the majority of inhibitors are expected to bind. GLIDA is a public GPCR-related chemical genomics database. It provides chemical information on the ligands as well as biological information regarding G-protein coupled receptors (GPCRs), which represent one of the most important families of drug targets in pharmaceutical development.50
Epigenetics became of great importance for researchers when it was discovered that gene function could be altered by more than just changes in sequence. Today a number of diseases have been linked to amplification, mutation, and other alterations of epigenetic enzymes. Therefore, analyzing the most appropriate epigenetic enzymes involved in different diseases is a prerequisite for epigenetic therapeutics. HEMD is a web server that provides the utilities to display, search and analyze the structure, function and related annotation of human epigenetic enzymes and chemical modulators focused on epigenetic therapeutics.51
KNIME's modular workflow design, along with its ability to automatically parallelize many operations, its free distribution, and the ease of communicating analysis pipelines, has made it widely successful in diverse areas of analytics. It is also quite flexible and allows integration of different software and tools.52 For a detailed explanation of the “workflow” concept, as well as other software following this approach, see the review by Tiwari and Sekhar.53 In the following subsections, the issues that can be addressed through chemistry applications or plugins implemented in KNIME are presented.
A step prior to data curation is, of course, reading a chemical database. There are many kinds of files in which chemical information may be stored, including CSV, SDF, SQL and XML. KNIME provides extensions capable of reading most, if not all, of them. Regarding data curation pipelines, a recent publication by Gally et al. proposed a workflow for preliminary molecule preparation in KNIME.57 Also, a useful and comprehensive tutorial on the application of KNIME to chemical data curation has recently been published elsewhere.58
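To make the reading step concrete outside of KNIME, the sketch below parses the overall layout of an SDF file with the Python standard library only: records end with a `$$$$` line, the first line of each record is its title, and data items follow `> <FIELD>` headers. It is a deliberately simplified reader written for illustration, not a KNIME node or a full implementation of the format; the two-record sample and its `pIC50` field are hypothetical, and the connection tables are left unparsed.

```python
def parse_record(lines):
    """Extract the title line and the data fields of one SDF record."""
    record = {"title": lines[0].strip(), "fields": {}}
    field = None
    for line in lines:
        if line.startswith(">"):
            # Data header such as "> <pIC50>": take the text in brackets.
            start, end = line.find("<"), line.rfind(">")
            field = line[start + 1:end] if start != -1 and end > start else None
            if field is not None:
                record["fields"][field] = []
        elif field is not None:
            if line.strip():
                record["fields"][field].append(line)
            else:
                field = None  # a blank line closes the data item
    record["fields"] = {k: "\n".join(v) for k, v in record["fields"].items()}
    return record


def read_sdf(text):
    """Split SDF text on '$$$$' delimiter lines and parse each record."""
    records, current = [], []
    for line in text.splitlines():
        if line.strip() == "$$$$":
            if current:
                records.append(parse_record(current))
            current = []
        else:
            current.append(line)
    if any(line.strip() for line in current):  # file without final delimiter
        records.append(parse_record(current))
    return records


# Hypothetical two-record file with abbreviated connection tables.
SAMPLE = "\n".join([
    "methane", "  sketch", "",
    "  1  0  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "M  END", "> <pIC50>", "6.2", "", "$$$$",
    "water", "  sketch", "",
    "  1  0  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0",
    "M  END", "> <pIC50>", "4.8", "", "$$$$",
])
records = read_sdf(SAMPLE)
```

For real data sets a dedicated chemistry toolkit or the corresponding KNIME reader node is preferable, since they also validate the connection table of each record.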
Fig. 3 An example of a KNIME workflow for reading a chemical dataset and performing target prediction.
This journal is © The Royal Society of Chemistry 2017