Dagmar Stumpfe and Jürgen Bajorath*
Department of Life Science Informatics, B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, Rheinische Friedrich-Wilhelms-Universität Bonn, Dahlmannstr. 2, D-53113, Bonn, Germany. E-mail: bajorath@bit.uni-bonn.de; Fax: +49-228-2699-341; Tel: +49-228-2699-306
First published on 23rd November 2011
There has been steadily increasing interest in the systematic analysis of structure–activity relationship (SAR) information contained in compound data sets of different size, composition, and origin. In this contribution, we provide an overview of SAR analysis in medicinal chemistry and review computational approaches of different design and sophistication that make it possible to analyze SARs on a large scale. Special emphasis is put on recently introduced SAR visualization methods that are expanding the scope of traditional SAR analysis in medicinal chemistry.
While SAR analysis is central to the lead optimization process, it also plays a critical role at earlier stages of medicinal chemistry projects and at the interface between medicinal chemistry and biological high-throughput compound screening (HTS).3 In order to obtain interesting active compounds for further chemical exploration, hits are usually selected from HTS campaigns, which is common practice in the pharmaceutical industry. From typically large numbers of structurally diverse and mostly weakly potent screening hits, also containing false-positives, small sets of compounds must be prioritized as alternative starting points for so-called hit-to-lead investigations, which precede the lead optimization stage when the most promising active compound(s) are further explored. During hit-to-lead projects, analogs of initially identified hits are generated and their activities are determined. A key task of chemistry efforts during hit-to-lead stages is the generation of limited numbers of analogs to demonstrate that one or more active compounds (representing different structural classes or chemotypes) exhibit sustainable SARs, which is a major criterion for the selection of candidates for subsequent lead optimization. To demonstrate preliminary SAR behavior, primary screening data represent the major source of information. While the potency and structural diversity of screening hits are typically considered as criteria for hit selection, hits might often belong to series of structurally similar or analogous compounds that are present in screening libraries (for example, originating from combinatorial chemistry efforts). Hence, analyzing structural neighbors of hits that are more or less potent (or inactive) might already reveal preliminary SAR characteristics and thus help to shortcut hit-to-lead investigations.
Taken together, we can conclude that SAR analysis plays a crucial role at different stages of medicinal chemistry projects, also including its interface with HTS. Furthermore, SAR exploration is carried out at different levels of complexity. As discussed in the following, alternative computational approaches have been introduced, in addition to those based on the classical QSAR paradigm, to support SAR analysis in different ways.
Fig. 1 R-group table. For a series of serotonin transporter inhibitor analogs (taken from BindingDB50) that are distinguished by substitutions at one or two sites (R1 and R2), an R-group table is shown. The shared molecular framework (scaffold) is drawn with thick edges.
As long as analog sets are relatively small and the number of alternative substitution sites is limited to a few, e.g., one to three, SAR trends can often be deduced from R-group tables and new active analogs predicted. Here, chemical experience and intuition usually play a key role. However, as analog series grow in size, the “classical” and essentially subjective way to analyze SARs on the basis of R-group table information quickly reaches its limits. How many analogs might we be able to compare as individuals with a trained chemist's eye to recognize SAR trends? Probably tens or more, but certainly not hundreds or thousands. It has been shown that judgment about important molecular characteristics such as drug-likeness is highly variable among experienced medicinal chemists when compound collections are inspected (as lists of molecular graphs).4 Moreover, our judgment in such situations is usually context-dependent, i.e., an individual often draws different conclusions about a given compound when the ordering in a list is changed.4 Thus, subjective evaluation quickly becomes unreliable or infeasible once data sets grow substantially in size. This “numbers game” explains, to a large extent, the interest in utilizing computational models for SAR analysis. During lead optimization, QSAR models are often utilized when a sufficiently large number of analogs is available as training data. However, as discussed in the following, the need to explore SAR information using computational concepts and tools goes well beyond individual compound series and the applicability domain of QSAR modeling.
But what are the origins of such data sets and what is one specifically interested in? In pharmaceutical environments, sets of active compounds directed against high-profile targets, e.g., different protein kinases or G protein coupled receptors, usually evolve over time. Active compounds become available from subsequent HTS campaigns at different points in time and are subjected to hit-to-lead efforts, lead candidates are explored, and series under investigation are often not further pursued when a project hits a roadblock. Then, alternative starting points are considered, new lead candidates are put forward, and so on. Often, multiple medicinal chemistry teams are involved in cycles of compound exploration and optimization over time, focusing on different candidates. Data sets originating from such projects typically consist of compounds at rather different stages of development, ranging from initially selected hits and compound sets for which preliminary SAR information has been collected to more or less extensively explored analog series. It is not uncommon that such data sets contain many different structural classes and hundreds or thousands of active compounds. Hence, one is often interested in retrospectively studying such heterogeneous data sets and exploring available SAR information in context, which might help to better understand which compounds to prioritize going forward. For example, potential lead candidates might not have been followed up on during earlier rounds, for one reason or another, and could be revisited. On the other hand, structural classes similar to current compounds of interest might have already been extensively explored, with only limited success. Such questions often govern retrospective SAR analysis efforts on evolving compound data sets. In these cases, one is not primarily interested in identifying SAR determinants for an individual compound series. Rather, one would like to understand which compound series carry the most, or the most interesting, SAR information.
In this context, searching for SAR discontinuity1,5 usually becomes a major focal point. Compound series are characterized by a high degree of SAR discontinuity if small chemical changes lead to significant potency alterations. Pairs or groups of structurally very similar compounds with dramatic potency differences form activity cliffs5,6 that represent the extreme form of SAR discontinuity. Within large and structurally diverse data sets, compound subsets forming such locally discontinuous SARs are generally associated with high SAR information content and are thus a prime target for SAR investigation. By contrast, structurally increasingly diverse compounds might retain comparably high potency, which corresponds to SAR continuity. Compound subsets might form “flat” SARs when many structural changes do not lead to notable increases in potency, which represents an extreme form of SAR continuity. This very SAR phenotype is usually most disliked in medicinal chemistry because it is often not evident during optimization efforts whether an SAR is essentially flat or evolvable. Consequently, far too many analogs might be generated before the decision is reached to abandon a particular series. In such situations, SAR analysis is challenged with helping to answer another key question in medicinal chemistry, i.e., when should one stop exploring a given series?
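The notion of an activity cliff can be made concrete with a minimal Python sketch. It assumes that each compound is described by a set of fingerprint features (toy integers here) and a logarithmic potency value, and it flags pairs that exceed a similarity threshold and a potency difference threshold. The compound names, feature sets, potency values, and both thresholds are hypothetical and chosen only for illustration; the text does not prescribe specific cut-offs.

```python
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto coefficient of two feature sets (shared / total features)."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def find_activity_cliffs(compounds, min_sim=0.7, min_delta=2.0):
    """Return pairs that are structurally similar but differ strongly in potency.

    compounds: dict mapping a name to (feature_set, potency on a log scale).
    min_sim and min_delta are illustrative thresholds, not published values.
    """
    cliffs = []
    for (n1, (f1, p1)), (n2, (f2, p2)) in combinations(compounds.items(), 2):
        sim = tanimoto(f1, f2)
        delta = abs(p1 - p2)
        if sim >= min_sim and delta >= min_delta:
            cliffs.append((n1, n2, round(sim, 2), round(delta, 2)))
    return cliffs


# Hypothetical toy data: A and B are close analogs with very different potency.
toy = {
    "A": ({1, 2, 3, 4, 5, 6, 7, 8}, 8.9),
    "B": ({1, 2, 3, 4, 5, 6, 7, 9}, 5.1),
    "C": ({1, 2, 3, 4, 5, 6, 7, 8, 10, 11}, 8.6),
    "D": ({20, 21, 22}, 6.0),
}
cliffs = find_activity_cliffs(toy)
print(cliffs)
```

In this toy set, only the pair A/B qualifies: A and C are similar but nearly equipotent (SAR continuity), and D has no structural neighbors.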
In addition to evolving compound collections, sets of hits from HTS efforts are also an attractive target for SAR analysis. In these instances, the focus changes yet again. Hit sets from large HTS campaigns might also consist of hundreds or even thousands of structurally diverse and predominantly weakly potent compounds. Here, the key task is to decide which hits to prioritize for further chemical exploration and hit-to-lead investigations. For this purpose, the presence of preliminary SAR information in sets of structurally related compounds usually is a prime indicator for the potential to further evolve and optimize SARs, even if molecules displaying such SAR trends are not the most potent hits identified in a screen.
Regardless of whether one considers evolving compound data sets or primary HTS hits, in these cases, SAR analysis has large-scale character and typically aims at the systematic extraction of available SAR information, which depends on the application of computational approaches and brings along specific methodological requirements.
Fig. 2 Analysis of SAR information. Representative computational approaches that are applied to access SAR information in different ways, as discussed in the text, are schematically illustrated. The figure has been adapted from ref. 51.
Clustering methods are routinely applied to HTS hit sets in order to group structurally diverse hits into subsets (clusters) of similar ones and inspect the structural neighborhood of the most potent hits. Cluster analysis is usually carried out in combination with statistical techniques11 to quantitatively characterize potency distributions, determine background activity levels, and help to eliminate false-positive detections. Such hit triaging efforts aim at the selection of hits for secondary assays and, ultimately, hit-to-lead investigations. The results of structural classification and statistical analysis are typically displayed in simple graphical representations such as dendrograms or scatter plots.
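As a rough illustration of how hits can be grouped into structural neighborhoods around the most potent compounds, the following sketch applies a greedy leader-style clustering to toy fingerprint sets. This is a simple stand-in chosen for brevity, not the specific clustering or statistical procedure of the cited work; the hit names, feature sets, potency values, and the similarity threshold are all hypothetical.

```python
from statistics import mean


def tanimoto(a, b):
    """Tanimoto coefficient of two feature sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def leader_cluster(hits, min_sim=0.6):
    """Greedy leader clustering of screening hits.

    Hits are processed in order of decreasing potency, so every cluster
    leader is the most potent member of its cluster. A hit joins the first
    cluster whose leader is similar enough; otherwise it founds a new one.
    """
    ordered = sorted(hits.items(), key=lambda kv: kv[1][1], reverse=True)
    clusters = []  # list of (leader_name, [member names])
    for name, (fp, _) in ordered:
        for leader, members in clusters:
            if tanimoto(fp, hits[leader][0]) >= min_sim:
                members.append(name)
                break
        else:
            clusters.append((name, [name]))
    return clusters


# Hypothetical hit set: name -> (feature set, potency on a log scale).
hits = {
    "h1": ({1, 2, 3, 4, 5}, 6.2),
    "h2": ({1, 2, 3, 4, 6}, 5.0),
    "h3": ({1, 2, 3, 4}, 5.5),
    "h4": ({20, 21, 22, 23}, 5.8),
    "h5": ({20, 21, 22, 23, 24}, 4.9),
    "h6": ({40, 41}, 4.5),
}
clusters = leader_cluster(hits)

# Per-cluster summary: leader, size, and mean potency of the members.
summary = [
    (leader, len(members), round(mean(hits[m][1] for m in members), 2))
    for leader, members in clusters
]
for row in summary:
    print(row)
```

Singleton clusters such as the one formed by h6 carry no neighborhood information, whereas clusters with several members of varying potency are natural candidates for closer inspection.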
Another standard approach to analyze sets of active compounds is dimensionality reduction of chemical reference spaces and mapping onto 2D graph representations. In such simplified space representations, compound distributions can be easily analyzed. The trade-off is the loss of chemical information that is associated with simplifying and mapping of reference spaces. For dimension reduction, statistical methods such as principal component analysis (PCA)12 or multi-dimensional scaling (MDS)13 of originally defined property spaces can be applied or, alternatively, self-organizing maps (SOMs)14 can be generated through neural network simulations. Using SOMs, compound sets are projected onto 2D neuron grids where similar compounds map to the same or adjacent neurons. Especially for the analysis of HTS data, software has been developed15,16 to generate different types of plots for clustering results or chemical space projections and annotate data points (compounds) with activity or other property information. For series of active compounds, it is then attempted to deduce SAR information by comparing data points in various types of diagrams that convey structural and/or activity information in different ways.
Regardless of whether HTS or compound optimization data sets are analyzed or whether compound classification or dimension reduction methods are applied, all standard approaches have in common that prioritized active compounds must be manually selected and compared to derive SAR information. This often makes SAR analysis an arduous task, especially when data sets are large.
Although primarily designed for whole molecules, rather than substructures or scaffolds, Reduced Graphs21 also provide a systematic grouping of compounds according to topological features and chemical functionalities. These representations simplify conventional molecular graphs by further abstracting from chemical features and hence induce a degree of fuzziness. Comparison of simplified graph representations for series of active compounds makes it frequently possible to identify structural patterns that are responsible for SAR trends.
In addition to R-group/core structure assessment, molecular scaffolds are also used as a basis for structural classifications to explore active compounds and identify structural motifs that are responsible for a specific biological activity. For example, scaffolds have been organized according to defined substructure or topological relationships22 and arranged in activity-oriented sequences or pathways.23 Furthermore, scaffolds from bioactive compounds have been iteratively decomposed into substructures following chemical rules.24 Decomposition pathways define a structural hierarchy termed Scaffold Tree.24 Substructures are annotated with the activity information of the compounds they originate from. Hierarchical structural decomposition of a compound activity class might yield substructures that are not yet found in bioactive compounds, which provides a basis for activity prediction.25 In addition, libraries of substructures that are often found in pharmaceutically relevant compounds are also used for activity annotations,26 which makes it possible to detect substructures that occur in compounds with defined potency or other molecular property ranges. Furthermore, a scaffold-oriented hierarchical organization of active compounds has been combined with R-group tables to study potency distributions of compounds and corresponding molecular scaffolds.27
Numerical functions capturing pair-wise compound similarity and potency comparisons have also been developed to quantify SAR features including the SAR Index (SARI)30 and the Structure–Activity Landscape Index (SALI).34 SARI combines two score components accounting for SAR discontinuity and continuity, respectively, yielding a normalized composite score between 0 (maximal discontinuity) and 1 (maximal continuity). Intermediate SARI scores characterize heterogeneous SARs that combine continuous and discontinuous SAR features.30 SARI scores can be calculated as an SAR diagnostic of an entire compound set (i.e., describing a global SAR phenotype) or as local scores for a subset of compounds (e.g., a cluster), quantifying a local SAR. For single compounds, a local discontinuity score can be calculated (using a slightly modified version of the SARI discontinuity score component) to assess individual compound contributions to SAR discontinuity. SALI scoring only emphasizes pairs of structurally very similar compounds with large potency differences and is thus designed to detect activity cliffs in a data set. Therefore, pairs of compounds can be ranked according to their SALI scores in order to identify prominent activity cliffs and represent the entire spectrum of cliffs of varying magnitude that are present in data sets.
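The pair-wise character of SALI scoring can be sketched as follows. The text above does not spell out the formula; the sketch assumes the commonly used form in which the absolute potency difference of a compound pair is divided by one minus its similarity, so that highly similar pairs with large potency differences receive the largest scores. Compound names, feature sets, and potency values are hypothetical.

```python
import math
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto coefficient of two feature sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def sali(p_i, p_j, sim):
    """Assumed SALI form: |potency difference| / (1 - similarity).

    Pairs with identical representations (sim == 1) are assigned infinity.
    """
    if sim >= 1.0:
        return math.inf
    return abs(p_i - p_j) / (1.0 - sim)


def rank_pairs_by_sali(compounds):
    """Score all compound pairs and sort them from largest to smallest SALI."""
    scored = []
    for (n1, (f1, p1)), (n2, (f2, p2)) in combinations(compounds.items(), 2):
        scored.append((sali(p1, p2, tanimoto(f1, f2)), n1, n2))
    return sorted(scored, reverse=True)


# Hypothetical toy data: name -> (feature set, potency on a log scale).
toy = {
    "A": ({1, 2, 3, 4, 5, 6, 7, 8}, 8.9),
    "B": ({1, 2, 3, 4, 5, 6, 7, 9}, 5.1),
    "C": ({1, 2, 3, 4, 5, 6, 7, 8, 10, 11}, 8.6),
    "D": ({20, 21, 22}, 6.0),
}
ranked = rank_pairs_by_sali(toy)
for score, n1, n2 in ranked:
    print(f"{n1}-{n2}: SALI = {score:.2f}")
```

The ranking places the close analogs A and B first, followed by B and C, which reflects the spectrum of cliffs of decreasing magnitude mentioned above.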
One network prototype is provided by the activity cliff-centric SALI graphs.34 A schematic representation is shown in Fig. 3. Here, nodes represent compounds and directed edges are drawn between them (from the less to the more potent compound) if a pre-defined SALI threshold is reached. Thus, SALI graphs can be generated as a ‘sliding window’ through a data set to monitor activity cliffs of increasing magnitude. At high threshold levels, the representation focuses on the most prominent cliffs. The SALI graph reveals individual activity cliffs, pair-wise cliff sequences, and more complex cliff patterns formed by subsets of compounds.
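The construction of such a graph at different thresholds can be illustrated with a short sketch. It assumes the same SALI form as is commonly used (absolute potency difference divided by one minus similarity) and represents the graph as a plain list of directed edges pointing from the less potent to the more potent compound. All data and threshold values are hypothetical.

```python
import math
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto coefficient of two feature sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def sali_graph(compounds, threshold):
    """Return directed edges (less potent -> more potent) for all pairs
    whose assumed SALI score reaches the given threshold."""
    edges = []
    for (n1, (f1, p1)), (n2, (f2, p2)) in combinations(compounds.items(), 2):
        if p1 == p2:
            continue  # no potency difference, hence no edge direction
        sim = tanimoto(f1, f2)
        score = math.inf if sim >= 1.0 else abs(p1 - p2) / (1.0 - sim)
        if score >= threshold:
            edges.append((n1, n2) if p1 < p2 else (n2, n1))
    return edges


# Hypothetical toy data: name -> (feature set, potency on a log scale).
toy = {
    "A": ({1, 2, 3, 4, 5, 6, 7, 8}, 8.9),
    "B": ({1, 2, 3, 4, 5, 6, 7, 9}, 5.1),
    "C": ({1, 2, 3, 4, 5, 6, 7, 8, 10, 11}, 8.6),
    "D": ({20, 21, 22}, 6.0),
}

# Raising the threshold step by step thins the graph out until only the
# most prominent cliff remains.
for threshold in (2.0, 9.0, 15.0):
    print(threshold, sali_graph(toy, threshold))
```

At the lowest threshold in this toy example, pairs involving the structurally unrelated compound D also pass, which illustrates why higher threshold levels are needed to focus on genuine cliffs.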
Fig. 3 SALI graph. A schematic representation is shown. Nodes are labeled with compound numbers. Edges indicate activity cliffs of pre-defined magnitude.
A conceptually different network prototype is displayed in Fig. 4. In this case, nodes also represent compounds but (undirected) edges represent pair-wise similarity relationships (calculated Tanimoto similarity), rather than activity cliffs. Furthermore, different from SAS maps, potency information is in this case not captured through pair-wise compound comparisons, but by annotation of nodes applying a color spectrum that represents the potency range in a data set. This type of annotated similarity-based compound network provides the basis of the SAR networks described in the following. In these networks, nodes are usually directly associated with compound structures, thus enabling interactive analysis.
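A minimal sketch of such an annotated network is given below. It assumes set-based toy fingerprints, a hypothetical similarity threshold for drawing edges, and a simple linear green-to-red color ramp for potency; the actual threshold and color spectrum used for the published networks are not specified in the text.

```python
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto coefficient of two feature sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def build_network(compounds, min_sim=0.6):
    """Undirected similarity network as an adjacency dict (name -> set of names)."""
    adjacency = {name: set() for name in compounds}
    for n1, n2 in combinations(compounds, 2):
        if tanimoto(compounds[n1][0], compounds[n2][0]) >= min_sim:
            adjacency[n1].add(n2)
            adjacency[n2].add(n1)
    return adjacency


def potency_colors(compounds):
    """Map potency linearly onto an RGB ramp: green (lowest) to red (highest)."""
    potencies = [p for _, p in compounds.values()]
    lo, hi = min(potencies), max(potencies)
    span = (hi - lo) or 1.0  # avoid division by zero for equipotent sets
    colors = {}
    for name, (_, p) in compounds.items():
        t = (p - lo) / span
        colors[name] = (round(255 * t), round(255 * (1 - t)), 0)
    return colors


# Hypothetical toy data: name -> (feature set, potency on a log scale).
toy = {
    "A": ({1, 2, 3, 4, 5, 6, 7, 8}, 8.9),
    "B": ({1, 2, 3, 4, 5, 6, 7, 9}, 5.1),
    "C": ({1, 2, 3, 4, 5, 6, 7, 8, 10, 11}, 8.6),
    "D": ({20, 21, 22}, 6.0),
}
adjacency = build_network(toy)
colors = potency_colors(toy)
for name in toy:
    print(name, sorted(adjacency[name]), colors[name])
```

Because potency is stored on the nodes rather than on the edges, the same graph topology can be re-annotated with other properties without recomputing similarity relationships.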
Fig. 4 Similarity-based compound network. A prototypic annotated compound network is shown that provides the basis for different types of SAR network representations.
In Fig. 5, a Network-like Similarity Graph (NSG)35 is shown that represents an SAR network designed to reveal both global and local SAR information. Here, nodes are color-coded according to compound potency, as described above, and scaled in size according to per-compound local discontinuity scores. Thus, a large node makes a significant contribution to local SAR discontinuity because the corresponding compound has a potency value very different from its immediate structural neighbors. It follows that combinations of large red and green nodes connected by an edge indicate the most prominent activity cliffs in a compound set. In addition, cluster information for a data set is provided in NSGs (complementing pair-wise similarity relationships) and cluster discontinuity scores are reported that characterize local SARs. Thus, the NSG represents a similarity-based compound network that is annotated with multiple layers of SAR-relevant information. For clarity, the graph layout algorithm separates densely connected regions/clusters from each other (such that inter-cluster distances have no chemical meaning). The NSG initially provides a global SAR view of a compound data set and all of its SAR microenvironments and then helps to quickly focus on the most interesting local SAR regions. For example, in Fig. 5, two densely connected clusters are apparent that have opposite SAR character. At the bottom, a cluster of compounds representing a highly discontinuous local SAR is shown (discontinuity score 0.996) that is rich in activity cliffs. By contrast, above this region, another densely connected cluster is seen (score 0.011) that exclusively consists of small yellow or brown nodes and hence represents a region of extremely high SAR continuity, corresponding to a “flat” local SAR. Thus, subsets of compounds representing different local SARs including activity cliffs can be readily identified in and selected from the graph. The NSG concept has been extended in different ways. For example, by utilizing potency ratios for two related targets instead of potency values for one, the activity landscape represented by an NSG is transformed into a selectivity landscape.36 For compound data sets with activity measurements against two targets, this makes it possible to graphically study selectivity patterns and molecular selectivity determinants and identify ‘selectivity cliffs’ that are formed by pairs of structurally similar compounds with very different potency values for the two targets. Going beyond two targets, the NSG design has also been adopted to outline multi-target activity landscapes.37 Furthermore, it has been utilized to study sets of compounds with alternative mechanisms of action.38
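The idea behind node scaling, and its transfer from potency to selectivity, can be sketched as follows. The score used here is a simplified, unnormalized stand-in for the SARI-derived per-compound discontinuity score: the similarity-weighted mean of absolute value differences between a compound and its structural neighbors. Selectivity is assumed to be expressed as the difference of logarithmic potency values for two targets, which corresponds to the logarithm of a potency ratio. All names, values, and the neighbor threshold are hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two feature sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def local_discontinuity(features, values, min_sim=0.6):
    """Simplified per-compound discontinuity score.

    For each compound, average sim * |value difference| over all neighbors
    whose similarity reaches min_sim. Compounds without neighbors score 0.
    """
    scores = {}
    for n in features:
        terms = []
        for m in features:
            if m == n:
                continue
            sim = tanimoto(features[n], features[m])
            if sim >= min_sim:
                terms.append(sim * abs(values[n] - values[m]))
        scores[n] = sum(terms) / len(terms) if terms else 0.0
    return scores


# Hypothetical toy data: feature sets and log potencies for two targets.
features = {
    "A": {1, 2, 3, 4, 5, 6, 7, 8},
    "B": {1, 2, 3, 4, 5, 6, 7, 9},
    "C": {1, 2, 3, 4, 5, 6, 7, 8, 10, 11},
    "D": {20, 21, 22},
}
potency_t1 = {"A": 8.9, "B": 5.1, "C": 8.6, "D": 6.0}
potency_t2 = {"A": 6.0, "B": 5.0, "C": 8.5, "D": 6.1}

# Activity landscape view: scores based on potency against target 1.
potency_scores = local_discontinuity(features, potency_t1)

# Selectivity landscape view: same function, but applied to the difference
# of log potencies (i.e. the log of the potency ratio) for the two targets.
selectivity = {n: potency_t1[n] - potency_t2[n] for n in features}
selectivity_scores = local_discontinuity(features, selectivity)

print(potency_scores)
print(selectivity_scores)
```

In this toy example, compound B contributes most to potency discontinuity, whereas compound A stands out in the selectivity view because it is the only selective member of an otherwise non-selective neighborhood.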
Fig. 5 Network-like Similarity Graph. An exemplary NSG representation is shown for a set of 71 squalene synthase inhibitors taken from the MDDR database.52 Nodes are color-coded and scaled in size (as explained in the text). In addition, compound clusters (resulting from hierarchical cluster analysis of the compound set) are shown on a grey background and labeled with SARI cluster discontinuity score information. Discontinuity scores range from 0 to 1. The larger the score of a cluster, the more discontinuous is the local SAR formed by the corresponding subset of compounds; conversely, the smaller the score, the more continuous is the local SAR. The figure has been adapted from ref. 35.
In Fig. 6, regions corresponding to the highly discontinuous and continuous compound clusters in Fig. 5 are compared in alternative activity landscape models of the same data set. On the left, a 2D projection of a fingerprint-based chemical reference space is shown that was generated using multi-dimensional scaling. Through application of an interpolation function, a contiguous potency surface has been computed from individual compound potency values and added as a third dimension to the 2D projection, yielding the 3D representation39 in the center that is reminiscent of a geographical landscape. As can be seen, strongly discontinuous and continuous local SARs in NSGs correspond to activity cliff and smooth regions, respectively, in the 3D representation, thus providing consistent SAR views. As long as one consistently utilizes the same molecular representation, such comparisons between alternative models can be carried out in a meaningful manner. If one changes molecular representations, landscape characteristics generally change as well.39
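How a contiguous potency surface can be derived from scattered compounds is sketched below. The text does not name the interpolation function that was used; inverse distance weighting is substituted here as a simple, well-known stand-in. The 2D coordinates are assumed to come from a prior projection step (such as multi-dimensional scaling), and all coordinates and potency values are hypothetical.

```python
def idw_potency(x, y, points, power=2.0):
    """Inverse-distance-weighted potency estimate at position (x, y).

    points: list of (px, py, potency) tuples in the 2D projection.
    At the exact position of a compound, its own potency is returned.
    """
    num = 0.0
    den = 0.0
    for px, py, potency in points:
        d2 = (x - px) ** 2 + (y - py) ** 2
        if d2 == 0.0:
            return potency
        weight = 1.0 / d2 ** (power / 2.0)
        num += weight * potency
        den += weight
    return num / den


def potency_surface(points, n=5):
    """Evaluate the interpolated potency on an n x n grid spanning the data."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    x0, x1 = min(xs), max(xs)
    y0, y1 = min(ys), max(ys)
    grid = []
    for j in range(n):
        y = y0 + (y1 - y0) * j / (n - 1)
        row = [idw_potency(x0 + (x1 - x0) * i / (n - 1), y, points)
               for i in range(n)]
        grid.append(row)
    return grid


# Hypothetical projected compounds: (x, y, potency on a log scale).
points = [(0.0, 0.0, 9.0), (1.0, 0.0, 5.0), (0.0, 1.0, 6.0), (1.0, 1.0, 6.2)]
surface = potency_surface(points, n=5)
for row in surface:
    print(" ".join(f"{value:4.1f}" for value in row))
```

The steep change along the first grid row, between the compounds with potency 9.0 and 5.0, corresponds to a cliff-like region, whereas the last row, between 6.0 and 6.2, is nearly flat.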
Fig. 6 2D and 3D activity landscape representations. Highlighted are corresponding regions in 2D and 3D landscape views of the set of squalene synthase inhibitors whose NSG is shown in Fig. 5. On the left, a 2D projection of a fingerprint space is shown from which the 3D representation in the center is derived. In addition, two compound clusters from the NSG are shown including one representing a highly discontinuous (top) and another representing a highly continuous (bottom) local SAR. Corresponding regions in these alternative representations are connected with dashed lines. Color-coding according to compound potency is uniformly applied to nodes and surface regions.
All SAR networks or activity landscape designs discussed so far initially provide global views of a data set. A conceptually different way to analyze SAR environments is shown in Fig. 7. With the introduction of Similarity-Potency Trees (SPTs),40 which are also based on the design principles illustrated in Fig. 4, the focus shifts to a strictly local view of SARs. In SPTs, the structural neighborhood of a chosen reference compound and its SAR information content is analyzed. The radius of the neighborhood is determined by a chosen similarity threshold value relative to the reference compound. This molecule is then used as a root node of a tree structure where edges connect nearest neighbors (i.e., an edge is drawn between a compound and the one it is most similar to within the data set) and the structural similarity of compounds to the root is decreasing along the tree. In SPTs, horizontal and vertical node patterns reflecting potency progression emerge if a compound subset contains interpretable SAR information.40 For interesting compound subsets, overlapping SAR environments can be inspected in series of SPTs centered on neighboring compounds. SPTs can also be generated for all compounds in a large data set and ranked by SAR information content using scoring functions that emphasize the presence of regular SAR patterns.40 Hence, the SPT represents a versatile data structure for local SAR analysis. It is also attractive to couple NSG and SPT analysis.41 Following the NSG-SPT approach, an NSG is initially calculated for an entire data set to obtain a global SAR view and focus on discontinuous local regions that are high in SAR information content. Compound subsets forming these regions are then selected from the NSG and subjected to systematic SPT analysis by generating overlapping local SAR environments (i.e., each compound in a given subset is used once as a root). This provides high-resolution SAR views of selected local environments. NSG-SPT analysis has been shown to successfully extract available SAR information from a large phenotypic screening data set containing more than 13000 anti-malarial hits.41
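The tree construction described above can be sketched in a few lines. This is one possible reading of the design, not the published algorithm: compounds within the similarity radius of the root are processed in order of decreasing similarity to the root, and each compound is attached to the already placed compound it is most similar to, so that similarity to the root decreases along the tree. Names, feature sets, potency values, and the radius are hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two feature sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def build_spt(compounds, root, min_sim=0.45):
    """Return a parent map (child -> parent) for a similarity-potency tree.

    compounds: dict mapping a name to (feature_set, potency).
    min_sim defines the neighborhood radius around the root compound.
    """
    root_fp = compounds[root][0]
    sim_to_root = {
        name: tanimoto(fp, root_fp)
        for name, (fp, _) in compounds.items()
        if name != root
    }
    # Keep only compounds inside the neighborhood, closest to the root first.
    neighborhood = sorted(
        (name for name, sim in sim_to_root.items() if sim >= min_sim),
        key=lambda name: sim_to_root[name],
        reverse=True,
    )
    parent = {root: None}
    for name in neighborhood:
        fp = compounds[name][0]
        # Attach to the most similar compound that is already in the tree.
        best = max(parent, key=lambda placed: tanimoto(fp, compounds[placed][0]))
        parent[name] = best
    return parent


# Hypothetical toy data: name -> (feature set, potency on a log scale).
toy = {
    "R": ({1, 2, 3, 4, 5, 6}, 7.0),
    "X": ({1, 2, 3, 4, 5, 7}, 7.8),
    "Y": ({1, 2, 3, 4, 7, 8}, 8.5),
    "Z": ({1, 2, 3, 4, 5, 6, 9}, 6.9),
    "W": ({30, 31}, 5.0),
}
tree = build_spt(toy, root="R")
for child, par in tree.items():
    print(child, "<-", par, "potency", toy[child][1])
```

In this toy tree, the path R, X, Y shows increasing potency with decreasing similarity to the root, which is the kind of regular pattern that SPT scoring is meant to emphasize. Compound W lies outside the neighborhood radius and is not part of the tree.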
Fig. 7 Similarity-Potency Tree. The design principles of the SPT data structure providing a compound-centric local SAR view are illustrated.
SAR networks are generally based on calculated Tanimoto similarity values that depend on chosen molecular representations. As long as a representation such as a molecular fingerprint is consistently used, calculated similarity values can be compared on a relative scale. However, structural relationships that are based on calculated similarity values are not always straightforward to interpret. It is sometimes difficult to rationalize why some molecules that yield higher similarity values in pair-wise comparisons than others should be structurally more similar to each other. The difficulty of consistently interpreting structural relationships based on calculated whole-molecule similarity has frequently been pointed out in medicinal chemistry as a general complication of SAR network representations. Therefore, a data structure has recently been introduced that replaces calculated similarity values with well-defined substructure relationships between active compounds. This SAR network design has been termed Bipartite Matching Molecular Series Graph (BMMSG).42 Substructure relationships between all compounds in a data set are systematically determined applying the Matched Molecular Pair (MMP) formalism.43,44 An MMP is defined as a pair of compounds that differ only at a single site, i.e., that are only distinguished by a single substructure.43 Thus, pairs of compounds forming an MMP are related to each other by a defined chemical transformation exchanging this substructure (e.g., an R-group or a ring system). All compounds in a data set that only differ by a single substructure at a specific site are then considered a Matching Molecular Series (MMS). Hence, in an MMS, different transformations involving the same site in otherwise identical molecules occur. In the BMMSG structure, nodes are connected by an edge if the compounds form an MMP (and thus display a well-defined substructure relationship). An exemplary BMMSG is shown in Fig. 8. The graph structure is bipartite because it contains two types of nodes: ‘set nodes’ that represent the structural fragment common to all compounds in an MMS and ‘molecule nodes’ representing individual compounds. A BMMSG is generally formed by a number of disjoint subgraphs that contain compounds with specific substructure relationships shared within each subgraph. From the graph structure, characteristic patterns emerge that convey SAR information, e.g., an ‘SAR hotspot’ consisting of an MMS with well-defined potency progression or a ‘parallel series’, as illustrated in Fig. 8. A parallel series consists of two molecular frameworks that are correspondingly substituted (i.e., where pairs of analogs with different frameworks but identical R-group patterns exist) and produces a characteristic spindle-type subgraph structure in a BMMSG, as depicted in Fig. 8. If there is corresponding potency progression in these parallel series, an SAR transfer event is identified.
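The bipartite organization can be illustrated with a small sketch that operates on pre-fragmented compounds. It assumes that single-site fragmentation has already been carried out upstream (for instance with a cheminformatics toolkit), so that each compound is given as a list of (common core, exchanged substructure) pairs. The core and substituent labels are opaque, hypothetical strings, and the helper for spotting candidate parallel series is an illustrative simplification rather than the published procedure.

```python
from collections import defaultdict
from itertools import combinations


def build_bmmsg(fragmentations):
    """Group compounds into matching molecular series and derive graph edges.

    fragmentations: dict mapping a compound name to a list of
    (core, substituent) pairs obtained from single-site fragmentation.
    Returns (mms, edges, mmps):
      mms   - core -> {compound: substituent} for cores shared by >= 2 compounds
      edges - (set node, molecule node) pairs of the bipartite graph
      mmps  - sorted list of compound pairs that form matched molecular pairs
    """
    series = defaultdict(dict)
    for name, frags in fragmentations.items():
        for core, substituent in frags:
            series[core][name] = substituent
    mms = {core: members for core, members in series.items() if len(members) >= 2}
    edges = [(core, name) for core, members in mms.items() for name in members]
    mmps = sorted(
        {tuple(sorted(pair))
         for members in mms.values()
         for pair in combinations(members, 2)}
    )
    return mms, edges, mmps


def parallel_series(mms, min_shared=2):
    """Report pairs of cores that carry at least min_shared identical substituents."""
    found = []
    for (core1, m1), (core2, m2) in combinations(mms.items(), 2):
        shared = set(m1.values()) & set(m2.values())
        if len(shared) >= min_shared:
            found.append((core1, core2, sorted(shared)))
    return found


# Hypothetical pre-fragmented compounds with placeholder core labels.
fragmentations = {
    "c1": [("scaffoldA-[*]", "H")],
    "c2": [("scaffoldA-[*]", "Cl")],
    "c3": [("scaffoldA-[*]", "OMe")],
    "c4": [("scaffoldB-[*]", "H")],
    "c5": [("scaffoldB-[*]", "Cl")],
    "c6": [("scaffoldC-[*]", "F")],
}
mms, edges, mmps = build_bmmsg(fragmentations)
candidates = parallel_series(mms)
print(mmps)
print(candidates)
```

Whether a candidate parallel series also represents an SAR transfer event would then depend on comparing the potency progression of the correspondingly substituted analogs, which is not covered by this sketch.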
Fig. 8 Bipartite Matching Molecular Series Graph. The complete BMMSG representation of a set of 881 factor Xa inhibitors taken from BindingDB50 is displayed. Set nodes are colored in white, light blue, or dark blue, dependent on the type of transformation from which they result, and molecule nodes are colored by potency. Two signature subgraph patterns are labeled and shown in enlarged form. The alternative way to represent an SAR hotspot via a series of squared molecule nodes in a set node is termed a ‘supernode’. The representation has been adapted from ref. 42.
In Fig. 9, a subgraph of a BMMSG is compared with an SPT that was calculated for the same set of compounds. The correspondence between compound subsets in these alternative SAR network structures is clearly visible. The SPT displays horizontal patterns with clear potency progression that are found in corresponding ‘supernodes’ in the BMMSG subgraph. Hence, both data structures reveal interpretable SAR information, but a strength of the BMMSG representation is that structural changes that relate compounds carrying SAR information to each other can be readily followed in the graph and interpreted. For interactive analysis, nodes are directly associated with compound structures and edges with exchanged fragments.
Fig. 9 Corresponding graph patterns. A subgraph of the BMMSG in Fig. 8 is highlighted on the right and the corresponding SPT is shown on the left.
This journal is © The Royal Society of Chemistry 2012