Jean-Louis Reymond*, Ruud van Deursen, Lorenz C. Blum and Lars Ruddigkeit
Department of Chemistry and Biochemistry, University of Berne, Freiestrasse 3, CH-3012, Berne, Switzerland. E-mail: jean-louis.reymond@ioc.unibe.ch; Fax: +41 31 631 80 57
First published on 28th April 2010
The chemical space is the ensemble of all possible molecules, which is believed to contain at least 10^60 organic molecules below 500 Da of possible interest for drug discovery. This review summarizes the development of the chemical space concept from enumerating acyclic hydrocarbons in the 1800's to the recent assembly of the chemical universe database GDB. Chemical space travel algorithms can be used to explore defined regions of chemical space by generating focused virtual libraries. Maps of the chemical space are produced from property spaces visualized by principal component analysis or by self-organizing maps, and from structural analyses such as the scaffold-tree or the MQN-system. Virtual screening of virtual chemical space followed by synthesis and testing of the best hits leads to the discovery of new drug molecules.
Jean-Louis Reymond | Jean-Louis Reymond is Professor of Chemistry and Chemical Biology at the University of Berne, Switzerland. He studied chemistry and biochemistry at the ETH Zürich and obtained his PhD in 1989 at the University of Lausanne in the area of natural products synthesis. He then joined the Scripps Research Institute in La Jolla, CA, and became an assistant Professor there in 1992. In 1997 he joined the Department of Chemistry and Biochemistry at the University of Berne as an Associate and in 1998 Full Professor. His research interests focus on exploring molecular diversity using combinatorial chemistry, computer-aided drug design and cheminformatics. |
Ruud van Deursen | Ruud van Deursen was born in 1979 in Helmond (Netherlands). He received his MSc in Chemical Engineering and Chemistry from the Eindhoven University of Technology in 2004. After master courses in Biochemistry and Molecular Biology at Ecole Normale Supérieure de Lyon (France), he wrote his master thesis on using alcohol dehydrogenases for biotransformations in the group of Professors Kurt Faber and Wolfgang Kroutil at Karl-Franzens-University in Graz (Austria). In December 2005 he joined the group of Prof. Jean-Louis Reymond at the University of Berne. Current research is focused on development of chemoinformatic tools for the understanding of chemical space and screening for bioactive molecules. |
Lorenz Blum | Lorenz Christian Blum was born in Berne (Switzerland) in 1983. He studied chemistry at the University of Berne and received his MSc degree in physical chemistry in 2006. Thereupon he started his PhD studies in chemoinformatics under the supervision of Prof. Jean-Louis Reymond. His current research interests are the assembly, analysis and applications of large virtual molecular databases. |
Lars Ruddigkeit | Lars Ruddigkeit, born in Hamm (Germany) in 1982, studied chemical biology at the Technical University of Dortmund, and wrote his MSc on molecular probes for EGFR with Prof. Dr Herbert Waldmann. In June 2009, he started his PhD under the supervision of professor Jean-Louis Reymond at the University of Berne (Switzerland). His current research interests include the exploration of chemical space and in silico generated molecule databases. |
One can also use scoring functions to rank compounds from virtual libraries prior to their synthesis, with the aim of exploring yet unknown chemical space and accessing new compound classes. This review focuses on this strategy and summarizes approaches to generate virtual libraries, to visualize the chemical space by producing maps, and to perform de novo drug discovery by virtual screening of virtual libraries followed by synthesis and testing of the best hits. Such exploration of yet unknown chemical space might help to solve the problem of the high attrition rates in drug development by giving more compounds to choose from at the hit prioritization level, which should increase the chances of success at later stages.18,19 Exploring a broader range of structures by virtual screening might also make it possible to address the problem of target promiscuity that is apparent in many drugs, and thereby to design safer drugs.20,21
While these early considerations focused on counting only, the idea of actually enumerating and representing molecular structures in a computer was addressed in the 1960's by Lederberg and Djerassi, who invented DENDRAL, a program designed to help structure elucidation by mass spectrometry.27–29 DENDRAL produced all possible organic molecules with a given elemental formula. It was possible to exclude undesirable functional groups from a “badlist” and enforce functional groups specified in a “goodlist” to restrict the output. Provided enough such constraints, the list of structures would automatically be reduced to a handful of possibilities. This project gave rise to the topic of computer-assisted structure elucidation (CASE), which addresses automatic structure assignment from analytical data such as MS and NMR spectra and uses various structure generators30,31 as a key component.32–36
Enumeration by synthesis replaced virtual enumeration with the advent of combinatorial chemistry in the early 1990's. The key triggers were the inventions of (1) solid-supported split-and-mix synthesis,37–39 and (2) surface synthesis of two-dimensional arrays on glass or paper support.40,41 These methods allowed the simultaneous synthesis of thousands to millions of compounds as physically segregated and identifiable products. Solid-supported combinatorial chemistry was pursued first for iterative syntheses of oligomers such as peptides,37–39 peptoids42 and oligonucleotides,43 and later extended to include a broad arsenal of synthetic reactions leading to compounds of ever increasing complexity, in particular in the elegant diversity-oriented syntheses of Schreiber and coworkers.44,45 Latest advances in combinatorial chemistry include improvements in library decoding46 and screening methods,47 and the preparation of libraries of billions of compounds using DNA-encoded chemistry.48 The concept of combinatorial chemistry also led to automated parallel synthesis, which is used to systematically enlarge compound collections in pharmaceutical companies and at commercial providers.49 Databases of many of these compounds are publicly available in which the structures are written as SMILES,50–52 or related formats such as InChI.53 Examples include catalogs from commercial providers and public databases such as ZINC,54a BindingDB,54b ChEMBL54c or PubChem.55
The availability of collections of millions of compounds for drug discovery has suggested the concept of chemical space for describing the ensemble of all the molecules.56–58 The chemical space metaphor offers a more inspiring imagery than the older “needle in a haystack” paradigm in the context of activity screening, and has been broadly embraced by the medicinal chemistry community to talk about drug discovery. All the known molecules form the “available chemical space”. There also exists a much larger space containing all the chemically possible molecules, which we call the chemical universe. Although chemical space is not uniquely defined, one generally considers that structurally related molecules form close groups, and that drug discovery can be guided geographically in chemical space. Areas of interest mark the biologically relevant chemical space, which includes natural products that have co-evolved with protein and nucleic acid binding sites in the course of the evolution of life, and all the drugs so far crafted by Homo sapiens sapiens in his own fight for survival.
Is chemical space finite? Yes, if boundaries are defined. For small molecule drug discovery the natural limit is the molecular weight, which must be capped at 300–500 Da to ensure reasonable bioavailability.59 This chemical space of drug-like molecules has been estimated to be in excess of 10^60 molecules.56,60 Our group has pushed the concept one step further and produced actual lists of all molecules that are possible up to a certain size following simple constraints of chemical stability and synthetic feasibility, forming the GDB database.61–63 The database is constructed from an exhaustive list of graphs produced by the program GENG,26 which are transformed into molecules by replacing graph nodes by atoms (C, N, O, F, Cl, S) and graph edges by single, double or triple bonds following simple valency rules, and retaining only chemically meaningful ring systems and functional groups (Fig. 1). It should be noted that exotic yet sometimes known molecules such as a molecule corresponding to a non-planar graph,64 or those containing strained fused ring systems such as cubane or prismane, are not considered in such enumerations.
Fig. 1 Process for generating the chemical universe database GDB-11. |
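The node-to-atom substitution step described above can be illustrated with a deliberately small sketch. It is not the GDB procedure: it assumes every graph edge is a single bond, uses only C, N, O and F with their standard valences, and applies none of the ring-strain or functional-group filters. The function name `enumerate_atom_assignments` and the example graph are hypothetical.

```python
from itertools import product

# Maximum number of bonds each element can form (standard valences).
MAX_VALENCE = {"C": 4, "N": 3, "O": 2, "F": 1}

def enumerate_atom_assignments(edges, n_nodes):
    """Return every element assignment whose valences cover the node degrees.

    Assumption of this sketch: all edges are single bonds, so the only
    rule checked is degree <= valence.
    """
    degree = [0] * n_nodes
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    # For each node, keep only elements that can carry that many bonds.
    choices = [
        [el for el, valence in MAX_VALENCE.items() if valence >= degree[i]]
        for i in range(n_nodes)
    ]
    return list(product(*choices))

# A three-node path graph: 0 - 1 - 2
path_edges = [(0, 1), (1, 2)]
assignments = enumerate_atom_assignments(path_edges, 3)
```

For the three-node path the two end nodes accept any of the four elements and the middle node accepts C, N or O, giving 4 × 3 × 4 = 48 labelled assignments. Symmetry-equivalent pairs such as C–C–N and N–C–C are counted separately here; a real enumeration would remove such duplicates and would also vary bond orders.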
GDB has been published for the enumeration up to 11 atoms (GDB-11, with C, N, O, F, 26.4 million cpds with 152.9 ± 7.3 Da)63 and 13 atoms (GDB-13, with C, N, O, Cl and S, 980 million cpds 179.9 ± 8.3 Da),62 and completed in-house for 15 atoms (GDB-15, 28.8 billion cpds 206.8 ± 5.4 Da). GDB consists in large part of relatively rigid molecules, with bicyclic and tricyclic topologies being the most abundant. Most GDB-molecules are generated at intermediate ratios of polar atoms to carbon at clogP values between −2 and 2. These molecules fulfill Lipinski's criteria for oral bioavailability59 as well as lead-likeness65 and fragment-likeness66 criteria, mostly because these criteria primarily restrain molecular size. The GDB approach is limited to relatively small molecules due to the combinatorial explosion. An analysis of chemical space for larger molecules has been recently proposed by focusing on scaffold topologies.67 This description does not explicitly enumerate molecules but allows understanding of structural types in broad terms and was used to show that only a small subset of the possible scaffold topologies occur in known molecules.68
One can also travel in chemical space with genetic algorithms that combine molecule generation with a fitness function in iterative cycles.72–74 One of the first examples was the SPROUT algorithm of Johnson and coworkers, which grows molecules into a targeted protein binding site by coupling building blocks following retrosynthesis rules.75–77 SPROUT selects synthetically feasible products that have a maximum fitness as estimated by docking to the target protein. The same strategy is followed in SYNOPSIS,78 which restricts itself to directly realizable reactions, and in EVOLUATOR,79 which allows interactive molecule selection as the molecule population evolves to its highest fitness. Other genetic algorithms include Skelgen,80 TOPAS,81 Flux,82,83 ADAPT,84 and the more recent multi-objective optimization algorithms GANDI85 and MEGA.86
Chemical space travel has also been realized using formal molecular evolution rules that are independent of synthetic schemes, resulting in a much deeper and structurally more innovative exploration of chemical space. In one case, Gasteiger and coworkers reported a molecular breeding algorithm based on the recombination of molecular fragments that was used to generate median molecules maximizing common features of two different starting molecules.87 The fitness function in this algorithm optimized the Pareto rank relative to the Tanimoto similarity coefficients of structural fingerprints to both starting molecules. Genetic algorithms breeding random fragments were similarly reported that assemble any target molecule by iterative cycles,88 evolve a molecular population to maximum fitness as defined by QSAR,89 and generate new inhibitors by cross-breeding known ones.90
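To make the idea of ranking candidates by Tanimoto similarity to two parents more concrete, the following is a minimal sketch under stated assumptions. Fingerprints are represented as plain Python sets of feature labels, and only the first Pareto front (the non-dominated candidates) is computed. The names `tanimoto`, `pareto_front` and the toy data are illustrative and are not taken from the cited breeding algorithm.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints given as sets of features."""
    union = fp_a | fp_b
    if not union:
        return 1.0  # two empty fingerprints are treated as identical
    return len(fp_a & fp_b) / len(union)

def pareto_front(candidates, parent_1, parent_2):
    """Return the candidates not dominated in similarity to both parents."""
    scores = {
        name: (tanimoto(fp, parent_1), tanimoto(fp, parent_2))
        for name, fp in candidates.items()
    }
    front = []
    for name, (s1, s2) in scores.items():
        # A candidate is dominated if another one is at least as good on
        # both similarities and strictly better on at least one.
        dominated = any(
            o1 >= s1 and o2 >= s2 and (o1 > s1 or o2 > s2)
            for other, (o1, o2) in scores.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return front

# Toy fingerprints (hypothetical feature labels).
parent_1 = {"a", "b", "c", "d"}
parent_2 = {"c", "d", "e", "f"}
candidates = {
    "near_parent_1": {"a", "b", "c"},
    "median": {"b", "c", "d", "e"},
    "poor": {"a", "g"},
}
front = pareto_front(candidates, parent_1, parent_2)
```

In this toy set the candidate labelled `median` scores 0.6 against both parents and stays on the front together with `near_parent_1`, while `poor` is dominated and dropped.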
The approach is exemplified by our own version of chemical space travel, which uses a SPACESHIP to travel between a starting molecule A and a target molecule B by iterative cycles of mutation and selection (Fig. 2).91 In the SPACESHIP, the mutation generator is the engine, which is driven by exhausting mutants containing elementary structural changes in bond and atom types. Motion is directed by a compass, which points towards the target B by selecting mutants with the highest Tanimoto similarity coefficient to the target for the next step.
Fig. 2 The SPACESHIP algorithm travels from A to B in the chemical space of molecules up to 50 heavy atoms not accessible to GDB. |
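The mutation and selection cycle can be sketched in a few lines if molecules are replaced by abstract feature sets. This is a simplification and not the published SPACESHIP implementation: a "mutation" here adds or removes one feature, only the single best mutant is kept at each step, and no chemical consistency filter is applied. The names `mutate`, `travel` and the toy alphabet are hypothetical.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two feature sets."""
    union = fp_a | fp_b
    return len(fp_a & fp_b) / len(union) if union else 1.0

def mutate(current, alphabet):
    """The 'engine': all point mutants that add or remove one feature."""
    mutants = []
    for feature in sorted(alphabet):
        if feature in current:
            mutants.append(current - {feature})
        else:
            mutants.append(current | {feature})
    return mutants

def travel(start, target, alphabet, max_steps=50):
    """Greedy walk from start to target; returns the visited states in order."""
    current = frozenset(start)
    target = frozenset(target)
    trajectory = [current]
    for _ in range(max_steps):
        if current == target:
            break
        # The 'compass': keep the mutant most similar to the target.
        current = max(mutate(current, alphabet),
                      key=lambda mutant: tanimoto(mutant, target))
        trajectory.append(current)
    return trajectory

# Toy example: three elementary changes separate start and target.
trajectory = travel({"a", "b"}, {"b", "c", "d"}, alphabet="abcde")
```

In this toy setting every step that moves one feature closer to the target raises the Tanimoto coefficient, so the walk reaches the target in as many steps as there are differing features. Collecting all mutants generated along the way, rather than only the selected ones, would give the analogue of a trajectory library.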
SPACESHIP explores chemical space for molecules up to 50 heavy atoms which is not accessible to exhaustive enumeration by GDB. The algorithm can join any pair of molecules in a few tens of mutations and selection cycles and generates “trajectory libraries”, which are filtered for chemical consistency by eliminating strained rings and impossible functional groups. Trajectory libraries contain up to several million intermediate molecules between A and B that may later be used for virtual screening. In a model study, a trajectory library of 500000 compounds linking AMPA, an agonist of the corresponding glutamate receptor, with CNQX,92 was ranked by high-throughput docking. A strong enrichment of high-scoring hits such as the β-amino acid 1 formed at intermediate distances between AMPA and CNQX was observed in this library compared to docking with non-selected libraries, suggesting that the trajectory libraries explore privileged regions of chemical space (Fig. 3).
Fig. 3 Chemical space travel trajectories between AMPA and CNQX represented in the 2-dimensional Tanimoto similarity space. The trajectory library is colored according to the distance from CNQX to AMPA in number of mutation steps. Binding energies as estimated by docking with Autodock 3.0.5 to the AMPA-receptor 1FTK.pdb are indicated for start and target and a strong-docking intermediate. |
While fitness values produce a different chemical space for every application, it is also possible to define generally valid dimensions using descriptors, which represent structural and physico-chemical properties of the molecules. Thousands of descriptors have been reported in the literature, allowing practically limitless possibilities to construct chemical spaces.58,96 Maps to represent these spaces can be produced by principal component analysis (PCA) and representation of the plane of the first two PCs or the space of the first three PCs. In such property space maps, compounds with related structural, physicochemical and sometimes biological activities are generally grouped together. Notable examples include the ChemGPS system97,98 and related approaches to classify drugs and natural products.99,100 The multidimensional property spaces defined by descriptors can also be visualized using self-organizing maps, which are grids of neurons to which similar compounds are assigned.101 SOM-maps have been used successfully to differentiate various bioactivity classes.102,103 A simple structure-based classification of the chemical universe database GDB-11 can be obtained using a SOM trained with autocorrelation vectors of atomic properties101 as descriptors. In this representation, molecules are organized by their structural types.63 SOM are limited to classifying, at most, a few million molecules due to the computational time needed to train the map.
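As an illustration of how a property-space map is obtained, the sketch below projects a small descriptor table onto its first two principal components using only the standard library. It is a teaching example under assumptions: the descriptor values are invented, no scaling is applied, and the eigenvectors are found by simple power iteration with deflation rather than by a numerical library. All function names are hypothetical.

```python
import math

def center(rows):
    """Subtract the column mean from every descriptor column."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]

def covariance(centered):
    """Sample covariance matrix of already-centered data."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]

def mat_vec(matrix, vec):
    """Matrix-vector product for lists of lists."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]

def leading_eigenvector(matrix, iterations=500):
    """Power iteration towards the eigenvector of the largest eigenvalue."""
    d = len(matrix)
    vec = [float(i + 1) for i in range(d)]  # arbitrary non-zero start
    norm = math.sqrt(sum(x * x for x in vec))
    vec = [x / norm for x in vec]
    for _ in range(iterations):
        nxt = mat_vec(matrix, vec)
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:
            break
        vec = [x / norm for x in nxt]
    return vec

def pca_map(rows):
    """Return 2D map coordinates plus the two component vectors."""
    centered = center(rows)
    cov = covariance(centered)
    d = len(cov)
    pc1 = leading_eigenvector(cov)
    lam1 = sum(a * b for a, b in zip(pc1, mat_vec(cov, pc1)))
    # Deflation: remove the variance explained by PC1 before finding PC2.
    deflated = [[cov[i][j] - lam1 * pc1[i] * pc1[j] for j in range(d)]
                for i in range(d)]
    pc2 = leading_eigenvector(deflated)
    coords = [(sum(a * b for a, b in zip(r, pc1)),
               sum(a * b for a, b in zip(r, pc2))) for r in centered]
    return coords, pc1, pc2

# Invented descriptor table: one row per molecule, one column per descriptor.
descriptors = [
    [2.0, 1.0, 0.5],
    [4.0, 2.1, 0.4],
    [6.0, 2.9, 0.7],
    [8.0, 4.2, 0.3],
    [10.0, 5.0, 0.6],
]
coords, pc1, pc2 = pca_map(descriptors)
```

Each entry of `coords` is the position of one molecule in the PC1/PC2 plane. For real descriptor sets with thousands of compounds one would normally rely on an established linear algebra routine instead of power iteration.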
The periodic system, which is arguably the oldest and best known map of a chemical space, came out of a historical breakthrough when classification of the elements was attempted based on the atomic weights and later the atomic number rather than on the properties of their compounds.104 Similarly, a unified and generally useful classification of organic molecules might arise by using a system based purely on structural features rather than on properties as in the examples above. Two recent approaches have proposed structure-based classification concepts for organic molecules that lead to a mapping of the chemical space.
In the first case, Schuffenhauer et al. reported a so-called scaffold-tree classification by gradually deconstructing molecules in successive steps of functional groups and cycle removals following a simple set of priority rules.105 The analysis defines linkages called brachiation between related molecules. Most remarkably, the scaffold-tree reveals natural families of bioactive scaffolds when annotated with known bioactivities, suggesting new activities for known scaffolds and new scaffolds for known activities.106 For example, analysis of the brachiating structure for inhibitors of the pyruvate kinase led to the identification of three activators (AC50 ≤ 10 μM) and six inhibitors (IC50 ≤ 10 μM) from databases of known compounds.107
In the second case, we have reported a classification of organic molecules based on molecular quantum numbers (MQNs).108 A set of 42 MQNs are defined as counts for elementary constituents of molecules such as atoms, bonds, polar groups, and topological features. MQNs reflect purely structural elements rather than calculated properties as described earlier. The analysis produces a very straightforward map of chemical space when the 42 MQN-dimensions are projected in the PC1/PC2 plane using a non-normalized PCA. For example the MQN-map of the GDB-11 database groups molecules in islands containing molecules with increasing numbers of rings and decreasing number of rotatable bonds. In each island, the north end contains polar molecules and the south end apolar molecules (Fig. 4a–c). Molecules are also well separated into different categories in such maps (Fig. 4d), as was previously observed in a SOM-classification of the database.63 Distances between molecules in MQN-space can be calculated by using a city-block distance, which is the sum of the absolute differences between MQN values of each molecule. MQN-space groups structurally related molecules, as illustrated for the closest MQN-neighbors of diazepam 2–4 found in ZINC, while compounds with high structural similarity as measured by structural fingerprints such as 5 and 6 are more distant (Fig. 5A). MQN-distance classification provides a simple and efficient enrichment scheme for virtual screening of ZINC (Fig. 5B).
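The city-block distance defined above is simple enough to write out directly. The sketch below uses shortened four-component vectors with invented values in place of the 42 MQNs, and the function names `city_block` and `nearest_neighbors` are hypothetical.

```python
def city_block(mqn_a, mqn_b):
    """Sum of absolute differences between two equal-length MQN vectors."""
    if len(mqn_a) != len(mqn_b):
        raise ValueError("MQN vectors must have the same length")
    return sum(abs(a - b) for a, b in zip(mqn_a, mqn_b))

def nearest_neighbors(query, library, k=3):
    """Return the k library entries closest to the query, nearest first."""
    ranked = sorted(library.items(),
                    key=lambda item: city_block(query, item[1]))
    return [(name, city_block(query, vec)) for name, vec in ranked[:k]]

# Invented four-component vectors standing in for the 42 MQNs.
query = [10, 2, 1, 3]
library = {
    "mol_a": [10, 2, 1, 2],
    "mol_b": [12, 1, 1, 3],
    "mol_c": [4, 0, 3, 0],
}
neighbors = nearest_neighbors(query, library, k=2)
```

Here `mol_a` lies at distance 1 and `mol_b` at distance 3 from the query, so they are returned in that order, while `mol_c` at distance 13 is left out. Sorting a whole database by this distance to a known active is the kind of ranking that an enrichment curve evaluates.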
Fig. 4 MQN-map of GDB-11 colored by (a) number of cycles, (b) number of rotatable bonds, (c) number of hydrogen-bond acceptor atoms, and (d) molecule categories. In (d) the category of molecules was assigned using the following priority rule: 1. Heteroaromatic (red) > 2. Aromatic (magenta) > 3. Fused heterocyclic (blue) > 4. Fused carbocyclic (cyan) > 5. Heterocyclic (green) > 6. Carbocyclic (bright green) > 7. Heteroacyclic (yellow) > 8. Carboacyclic (orange). Each point in the map is colored according to the majority category for the compounds grouped at that point, with grey shading (saturation in HSL scale) indicating category purity. |
Fig. 5 MQN-city block distances for virtual screening. A. Analogs of diazepam by MQN-distance (2–4) and by structural fingerprint measure (5–6). B. Enrichment curves of recovering known bioactive ligand analogs of diazepam from ZINC using MQN-distances or Tanimoto similarity coefficients of structural fingerprints. |
The chemical space travel algorithms discussed above have been applied successfully in a number of case studies.103 SYNOPSIS was validated by guiding the design of a focused library of 200 possible HIV inhibitors featuring mostly heteroaromatic amides, of which 18 were synthesized and led to 10 non-toxic inhibitors that show significant activity, such as compound 7 (IC50 = 80 μM) (Fig. 6).78 EVOLUATOR has been used to identify compound 8 as an inhibitor that is active on both the α1- and α1-adrenergic receptors and shows a displacement of >50% at a concentration of 10 μM in the radioligand binding assay.79 Skelgen has been used to discover estrogen inhibitors. Of the 17 synthesized structures, 5 show inhibition in the μM range, such as 9 (IC50 = 0.34 μM).80 Flux was applied to identify inhibitors that disrupt the interaction between the Tat-Peptide and TAR RNA, which is part of the human immunodeficiency virus (HIV-1), such as 10 (IC50 = 500 μM).109
Fig. 6 Examples of bioactive molecules identified from virtual libraries prior to synthesis. |
In the above examples, molecule generation is coupled to fitness selection, and the database of generated structures is never discussed or explicitly exposed. This strategy eludes the questions of completeness, i.e. have all the possibilities been examined? and of intellectual property protection, i.e. are the generated molecules lost to the public domain? In the case of the chemical universe databases GDB, completeness is addressed because the database is exhaustive, implying that the best possible molecules should be found in the database for any given target provided that a perfect virtual screen is available. Interestingly, the molecules exposed in GDB are not lost to the public domain. Indeed, although GDB-molecules are in principle possible because they contain chemically stable structural elements such as functional groups and ring systems, they are by no means trivial to synthesize. A claim to a structure from GDB will therefore only be possible and valid once the compound has actually been made in the laboratory. Note that this may not necessarily apply if extremely focused GDB-subsets containing molecules that are entirely trivial to make were exposed.
As proof of concept for the use of GDB in drug discovery, we have investigated the case of the glycine binding site of the NMDA-receptor, an important neurotransmitter receptor implicated in various neurological diseases.110 Docking GDB-molecules to the binding site defined in the crystal structure of its glycine complex showed that known ligands such as D-alanine, D-serine, or glycine itself, are indeed among the best (top 1.03%) docking compounds. In one implementation,110 we selected a GDB-subset of 15061 structures using a Bayesian classifier trained with known NMDA-receptor ligands, and carried out high-throughput docking of the corresponding 69367 stereoisomers generated using CORINA.111 Synthesis and testing of a selection of 23 compounds among the 712 compounds docking better than glycine led to the identification of simple dipeptides such as 11–12 as a new class of NMDA-glycine site inhibitors, as well as the D-alanine analog 13 (Fig. 7). Lead optimization was performed by attaching hydrophobic alkyl groups to the terminal amino group, providing the N-ethyl β-alanine dipeptide 14 as optimal ligand. The preference of the NMDA-glycine site for amino acids was confirmed when we docked a random selection of 8000 (31121 stereoisomers) molecules from GDB, which featured non-cyclic amino acids similar to the previously identified ligands in the best docking hits.112 This non-directed screening campaign pointed to the yet unknown diketopiperazines 15 and 16 as possible new types of ligands for the receptor. Indeed synthesis and testing showed that compound 15 was a weak inhibitor of the glycine site, while 16 was inactive. Further discovery programs ongoing in our laboratory have largely confirmed that high-throughput docking of GDB-derived molecules followed by synthesis and testing provides a reliable entry into new ligands.
Fig. 7 A. Structural formulae of virtual hits 11–16 identified from GDB-11. B. Binding modes within the NMDA-glycine site (1PB7.pdb) for glycine (green), virtual hit 11 (blue), virtual hit 12 (magenta) and virtual hit 13 (orange). |
This journal is © The Royal Society of Chemistry 2010 |