Open Access Article
Naravut Suvannang
a,
Likit Preeyanon
b,
Aijaz Ahmad Malik
a,
Nalini Schaduangrat
a,
Watshara Shoombuatong
a,
Apilak Worachartcheewan
b,
Tanawut Tantimongcolwat
c and
Chanin Nantasenamat
*a
aCenter of Data Mining and Biomedical Informatics, Faculty of Medical Technology, Mahidol University, Bangkok 10700, Thailand. E-mail: chanin.nan@mahidol.edu; Fax: +66 2 441 4371 ext. 2715; Tel: +66 2 441 4380
bDepartment of Community Medical Technology, Faculty of Medical Technology, Mahidol University, Bangkok 10700, Thailand
cCenter for Research and Innovation, Faculty of Medical Technology, Mahidol University, Bangkok 10700, Thailand
First published on 27th March 2018
Estrogen is an important component for the sustenance of normal physiological functions of the mammary glands, particularly for growth and differentiation. Approximately, two-thirds of breast cancers are positive for estrogen receptor (ERs), which is a predisposing factor for the growth of breast cancer cells. As such, ERα represents a lucrative therapeutic target for breast cancer that has attracted wide interest in the search for inhibitory agents. However, the conventional laboratory processes are cost- and time-consuming. Thus, it is highly desirable to develop alternative methods such as quantitative structure–activity relationship (QSAR) models for predicting ER-mediated endocrine agitation as to simplify their prioritization for future screening. In this study, we compiled and curated a large, non-redundant data set of 1231 compounds with ERα inhibitory activity (pIC50). Using comprehensive validation tests, it was clearly observed that the model utilizing the substructure count as descriptors, performed well considering two objectives: using less descriptors for model development and achieving high predictive performance (RTr2 = 0.94, QCV2 = 0.73, and QExt2 = 0.73). It is anticipated that our proposed QSAR model may become a useful high-throughput tool for identifying novel inhibitors against ERα.
ER belongs to the steroid nuclear receptor superfamily and consists of two major subtypes namely, ERα and ERβ. The former is comprised of 595 residues and found on chromosome 6q while the latter is comprised of 530 residues and found on chromosome 14q. ERs have two major functional domains, the DNA-binding domain (DBD), which is responsible for DNA binding and dimerization, and the ligand-binding domain (LBD) that plays an important role in binding to different ligands and interacting with co-regulatory proteins. In addition, the N-terminus of ERs are highly viable and contain a transactivation domain, which interacts directly with other transcription factors. Furthermore, the C-terminus of the ERs are thought to affect the transactivation capacity of the receptors Fig. 1.6 Most ligands can bind to both types of ERs but differ in their binding affinities7 due to the high similarity of the ERα and ERβ in their DBD and a 55% similarly in their LBD.8 In response to estrogen, ERα and ERβ function as ligand-activated transcription factors that bind the estrogen response elements (EREs) and interact with co-activator or co-repressor proteins to regulate gene transcription.9–12 Aside from causing cancer, abnormal ER signaling may also give rise to cardiovascular, metabolic, inflammatory and neurodegenerative diseases as well as osteoporosis.13
Apart from the genomic effects, ERs are also known to exert extra-nuclear actions by which they regulate important cellular processes such as, leading cell proliferation, cell differentiation, and cell signaling to contributing to a biological outcome of tumor angiogenesis.14,15 Both ERα and ERβ are crucial for regulating mammary growth and development.16,17 Under normal physiological conditions, ERα mediates the proliferative actions of E2 which can be opposed by ERβ and together these receptors maintain a subtle balance of estrogen signalling in the cells.18 ERα is normally expressed in only 10–20% of human mammary epithelial cells while 80–85% of cells express ERβ.19,20 In contrast, the expression of ERα is increased while that of ERβ is decreased in breast cancer cells.19,21 Therefore, the expression of ERα is used as a measure of steroid hormone receptor status and is currently an acceptable prognostic marker for predicting the response to hormonal therapy.22 Unfortunately, the role of ERβ in breast cancer is still not well understood.23,24 However, it has been postulated that increased protein levels of ERβ are linked to a better prognosis, increased survival and a better response to anti-estrogen therapy.24
As previously mentioned, the expression of ERα is greatly increased in breast cancer cells and as such represents a promising therapeutic target for combating breast cancer. Anti-estrogens are agents that can hinder the production or utilization of estrogen and are categorized into two general classes: (i) selective estrogen receptor modulators (SERMs) and (ii) the so-called “pure” antagonists. The first class or SERMs are drugs that competitively binds ERα and ERβ and function by direct agonistic or antagonistic interactions. The outcome of such ER-binding is tissue-dependent meaning that some SERMs may exert agonistic response in one tissue and antagonistic response in another tissue. Tamoxifen represents a drug in the class of SERMs that serves as the first line of treatment against breast cancer. It is currently being administered to patients in an effort to regress tumor growth of ER positive (ER+) breast cancers. The second class of anti-estrogens or “pure” antagonists (i.e. ICI 182780 also known as Fulvestrant/Faslodex) works by preventing the binding of helix-12 to the surface of the ligand-binding domain, which in turn prevents the transcriptional activation of ERα. In spite of current endocrine therapies against estrogen, which represents a significant advance in breast cancer therapy in which many women develop resistance to current drugs. The selection and outgrowth of breast cancers resistant to endocrine therapy is common and most deaths arising from breast cancer are found in patients with ERα+ tumors.25 Moreover, in ERα+ breast cancers, one-third of women treated with tamoxifen for a period of 5 years will develop a recurrent disease within 15 years. Thus, the development of tamoxifen and aromatase inhibitor resistance remains a key problem in breast cancer treatment.25
Computational approaches have often been used to complement experimental studies for several reasons, which among others include: (i) handle and manage large volumes of biological and chemical information, (ii) model biomolecular phenomenons that may be impossible by experimental means and (iii) making sense of data by uncovering hidden patterns and trends. In the context of drug discovery efforts, in silico approaches can be used to not only help identify and prioritize classes of compounds to screen but it can also help reduce the number of compounds to be tested. Quantitative structure–activity relationship (QSAR) is ligand-based approach in computational drug design for correlating the molecular features of a chemical library with their respective bioactivity.26–28 QSAR has been instrumental in shedding light on the molecular basis of bioactivity of interest by learning from past bioactivity data while also being amenable to extrapolating on the bioactivity of new compounds that are foreign to the trained data set. The utilization of QSAR for the investigation of ERs had started in 1986 where Singh29 examined the binding affinity of 2-phenylindoles towards estrogen receptor using Kiers first-order valence molecular connectivity index. Thusfar, there exist 56 research articles reporting the QSAR modeling of ER inhibitors according to a search on Scopus for articles containing (QSAR or QSPR or “quantitative structure–activity relationship”) and (“estrogen receptor” or ERα or ERβ) as search query. A brief analysis of the existing QSAR models of ER revealed that nearly all are based on small data sets that are typically less than 50 compounds (i.e. belonging to the same congeneric class) while focusing on the selectivity of inhibitors against the two ER isoforms via classical QSAR30–32 and 3D-QSAR.33,34 However, there were only a few studies reporting the use of large data set for the QSAR modeling of ER inhibitors. Among this are the work of Gao et al.35 whose data set consisted of 463 compounds, the work of Mekenyan et al.36 reporting a data set size of 151 compounds and the work by Fang et al.37 on a set of 230 compounds.
This study explores the origin of ERα inhibitory activity by discerning their underlying structure–activity relationship via QSAR modeling. To achieve this, interpretable and simple to compute descriptors in concomitant with interpretable learning method were employed. The effectiveness and usefulness of twelve classes of fingerprint descriptors for model construction was determined. Molecular features important for the investigated ERα inhibitory activity were revealed via the Gini index and their contribution were to discerned in light of previous evidences from the literature. A schematic illustration of the QSAR modeling workflow for predicting the inhibitory activity of ERα is provided in Fig. 2.
936 bioactivity data points targeting human ERα (CHEMBL206) was obtained from the ChEMBL database,38 version number 23. A subset of the data reporting IC50 as the bioactivity value was selected for further investigation and this consisted of a total of 3641 compounds. Next, entries with < or > signs were subjected to removal as external data set while entries having CONFIDENCE_SCORE equal to 9 (i.e. a direct single protein target is assigned) and those with ASSAY_TYPE equal to B (i.e. binding measurements of compounds to molecular targets as provided by Ki, IC50 and KD values) were selected for further use. Moreover, redundant compounds with (i) identical SMILES notation (ii) IC50 value greater than 2 SD (i.e. if less than 2 SD then the median value is used) and (iii) missing IC50 values were eliminated thereby further reducing the data set to 1780 compounds. After that, the SMILES notation for all entries from the data set was subjected to salt removal followed by its conversion to 2D structures using the Chem function of RDKit.39 Finally, after desalting the chemical structure, a set of 477 compounds having identical SMILES notation were removed. This resulted in the final data set consisting of 1231 compounds. It is also worthy to note that IC50 values were converted to pIC50 units (−log
IC50) so as to afford a more uniform distribution of the data.
| Fingerprint class | Descriptors | Description | Reference |
|---|---|---|---|
| AtomPairs 2D count | 780 | Count of atom pairs at various topological distances | 42 |
| AtomPairs 2D | 780 | Presence of atom pairs at various topological distances | 42 |
| CDK fingerprinter | 1024 | Fingerprint with length of 1024 and search depth of 8 | 43 |
| CDK extended | 1024 | Extends the fingerprinter with additional bits describing ring feature | 43 |
| CDK graph only | 1024 | Special version of fingerprinter not taking bond orders into account | 43 |
| E-State | 79 | E-State fragments | 44 |
| Klekota–Roth count | 4860 | Count of chemical substructures | 45 |
| Klekota–Roth | 4860 | Presence of chemical substructures | 45 |
| MACCS | 166 | Key-based fingerprint which uses 166 predefined keys | 46 |
| PubChem | 881 | PubChem fingerprints | |
| Substructure count | 307 | Count of SMARTS patterns for functional group classification | 47 |
| Substructure | 307 | Presence of SMARTS patterns for functional group classification | 47 |
This study employs 10-fold cross-validation (10-fold CV) to evaluate the model's performance in which a data set is partitioned into 10 data subsets after which 9 subsets are used to train a model and subsequently evaluated on the held out subset (i.e. used as the test set). This procedure was repeated iteratively until all data subset had a chance to be held out as the test set while the remaining subsets were used as the training set for model building.
After construction of the RF model, a reduced subset of top 20 features were selected for the construction of the second RF model so as to avoid over-fitting and to satisfy the philosophical Occam's razor principle in which a simple explanation is favorable to a more complicated one where analogously a model with fewer descriptors that still afford robust level of performance is preferable to a model with significantly higher descriptors.
Furthermore, external validation (QExt2) was performed on the held out 30% external set. The reliability of QSAR models was provided by the difference of R2 and Q2 as originally proposed by Eriksson et al.52 Further rigorous test for the possibility of chance correlation was performed via Y-scrambling experiments in which the X–Y pairs are shuffled such that the resulting X–Y pairs are false pairs. If the resulting shuffled models afforded similar level of prediction performance with that of the original X–Y pair then it could be concluded that the model's performance is unreliable and arose by chance correlation. However, if the Y-shuffled models provided poor performance in comparison to the high performance of the original X–Y pair then it is indicative of the model's robustness. A total of 100 Y-scrambled models were computed.
P; <5), the number of hydrogen bond acceptors (<10) and the number of hydrogen bond donors (>5). As useful as the Ro5 are, they have been shown to afford limited value in contributing to our understanding on the underlying principles of the target–ligand relationship (i.e. the affinity of the ligand toward the target) as they were strictly based on general molecular properties of the ligand. Oprea et al.56 showed that the Ro5 criteria do not serve to discriminate drugs from non-drugs in which more than 90% of the compilation of chemical reagents known as the Available Chemicals Directory were also Ro5 compliant. However, this does not negate the notion that the criteria exemplified by the Ro5 cannot be used to narrow properties that are useful for therapeutically relevant pharmacokinetic space. Moreover, Benet et al.57 has shown that QSAR model built using the Ro5 criteria could successfully predict drug disposition characteristics for drugs both meeting and not meeting the Ro5 criteria.
Fig. 3 revealed that of the 1231 compounds present in the curated data set, roughly two-third of compounds had zero violation while the other one-third of compounds are distributed between one and two violations. In this latter set, approximately three-quarter fell in the one Ro5 violation spectrum with the remaining one-quarter falling within the two Ro5 violation zone. It is interesting to note that as the number of Ro5 violations increased, the bioactivity also increased (Fig. 4).
![]() | ||
| Fig. 4 Histogram plots of the distribution of pIC50 values for compounds in violation of zero (a), one (b), two (c) and three (d) Ro5 descriptors. | ||
A closer look revealed that a minority proportion of compounds in violation of the Ro5 was due to the fact that it had molecular weight greater than 500 Da. On the other hand, a larger proportion of compounds in violation of the Ro5 was because they had log
P value greater than 5. In spite of this, it should be noted that Lipinski et al.58 pointed out that compounds in violation of the Ro5 should not necessarily be removed from further consideration. In fact, efforts have been directed to soften the Ro5
59 as it is well known that there are several instances where therapeutically useful drugs are in violation of several Ro5 parameters such as Atorvastatin, Lipitor, Losartan, Montelukast, Olmesartan, Telaprevir, Telmisartan, etc. It is worthy to note that the Ro5 should be used sparingly as general guidelines and not as strict rules so as to set loose criteria that would allow the discovery of potent drug candidates that may at first glance be removed if the Ro5 criteria was strictly followed.
The applicability domain of the QSAR model proposed herein was assessed via the PCA bounding box approach in which the chemical space spanned by the training set (i.e. the 70% subset) is compared to that of the external set (i.e. the 30% subset) as shown in Fig. 5. It was found that the chemical space spanned by the external set falls within the boundaries of the chemical space of the training set and thus is also deemed to be within the applicability domain of the constructed QSAR model. Moreover, the relative chemical space spanned by compounds from internal and external sets as visualized in Fig. 6 can be seen to share a high degree of similarity as also seen from the PCA scores plot.
![]() | ||
| Fig. 5 Applicability domain analysis as deduced from the PCA scores plot of compounds from internal (blue) and external (red) sets constituting 70% and 30% of the data set, respectively. | ||
![]() | ||
| Fig. 6 Plot of the molecular weight versus lipophilicity for the internal (blue) and external (red) sets that constituting 70% and 30% of the data set, respectively. | ||
Models were constructed via the RF algorithm and their results are presented in Table 2. Assessment of the predictive performance of the model was performed according to the suggested statistical thresholds of Golbraikh and Tropsha60,61 in which acceptable models should have R2 > 0.6 and Q2 > 0.5.60 Results indicated that the two best models as judged from both internal and external validation, which consisted of AtomPairs 2D count (RTr2 = 0.93, QCV2 = 0.73 and RMSETr) and substructure count (RTr2 = 0.94 and QCV2 = 0.73). Particularly, the substructure count was selected for further investigation owing to its interpretability and fewer number of descriptor (i.e. 307 descriptors as compared to 780 to 4860 descriptors from the other fingerprints), which also require less computation time.
| Fingerprint class | Training set | External set | RTr2 − QExt2 | ||
|---|---|---|---|---|---|
| RTr2 | RMSETr | QExt2 | RMSEExt | ||
| AtomPairs 2D count | 0.93 | 0.38 | 0.73 | 0.53 | 0.20 |
| AtomPairs 2D | 0.85 | 0.54 | 0.68 | 0.62 | 0.17 |
| CDK fingerprinter | 0.87 | 0.51 | 0.71 | 0.56 | 0.16 |
| CDK extended | 0.84 | 0.55 | 0.67 | 0.65 | 0.18 |
| CDK graph only | 0.81 | 0.60 | 0.70 | 0.58 | 0.11 |
| E-state | 0.80 | 0.63 | 0.64 | 0.71 | 0.16 |
| Klekota–Roth count | 0.91 | 0.41 | 0.72 | 0.54 | 0.19 |
| Klekota–Roth | 0.82 | 0.60 | 0.70 | 0.59 | 0.12 |
| MACCS | 0.86 | 0.52 | 0.71 | 0.58 | 0.15 |
| PubChem | 0.84 | 0.57 | 0.71 | 0.56 | 0.12 |
| Substructure count | 0.94 | 0.34 | 0.73 | 0.52 | 0.21 |
| Substructure | 0.87 | 0.51 | 0.68 | 0.63 | 0.19 |
The possibility for chance correlation can be assessed from the R2–Q2 margin as described by Eriksson52 where values <0.2–0.3 are indicative of predictive and reliable models while values >0.2–0.3 suggests possible chance correlation or the presence of outliers in the data set. Furthermore, from observation of the RTr2 − QExt2 margin, it is revealed that differences were negligible with values not greater than 0.2. Fig. 7a and b shows the scatter plots of experimental versus predicted pIC50 values. As for the threshold value for RMSE, which is rather difficult to establish, but generally models with higher RMSE values can be considered to afford sub-optimal prediction. Such high RMSE value may be due to the presence of a small number of outlying compounds that give rise to high error predictions.62 Furthermore, the inherent variability of experimental assays in concomitant with the diversity of chemotypes present in the data set are also expected to directly give rise to prediction error.63,64
Moreover, the proposed models were further subjected to stringent test to evaluate the possibility of chance correlation by carrying out Y-scrambling experiments. Briefly, this encompassed the shuffling of the X block of descriptors with that of its corresponding Y label such that the shuffled data set have false X–Y pairs whereas the original data set had true X–Y pairs. Fig. 7b and c shows the results from the Y-scrambling experiment and it can be seen that original models (i.e. denoted by blue circles) for all fingerprint class are found to be located at the upper right quadrant thereby suggesting robust models in accordance with the threshold of Golbraikh and Tropsha.60 On the other hand, Y-scrambled models (i.e. represented by red circles) were found to be lying within the boundaries of the lower left quadrant, which is indicative of their poor performance.
| Fingerprint class | Description |
|---|---|
| SubFPC169 | Phenol |
| SubFPC307 | Chiral center specified |
| SubFPC274 | Aromatic |
| SubFPC295 | C ONS bond |
| SubFPC23 | Amine |
| SubFPC300 | 1,3-Tautomerizable |
| SubFPC301 | 1,5-Tautomerizable |
| SubFPC26 | Tertiary aliphatic amine |
| SubFPC88 | Carboxylic acid derivative |
| SubFPC302 | Rotatable bond |
As can be seen in Fig. 8, the top ranking feature is phenol (SubFPC169), which contains a 1,2-benzenediol moiety and belongs to the class of organic compounds known as catechols. Catechols are secondary metabolites found in many plants that have been shown to confer numerous bioactivities.
The second top-ranked feature is the aromatic (SubFPC274) descriptor, which is a ubiquitous substructure that plays an important structural role as scaffolds of compounds as well as functional moieties that mediates π–π stacking interaction. Differences in the number and type of atoms in the aromatic rings of molecules can present various development concerns such as aqueous solubility, lipophilicity, serum albumin binding, cytochrome P450 inhibition and hERG inhibition.65
The third top-ranked feature is amine (SubFPC23) which present in amino acids are used to form bonds that are essential for their electron donation property.
The bioavailability of a drug-like molecule is related to its rotatable bond (SubFPC302), the fourth important feature, number where drug-like compounds have 10 or fewer rotatable bonds. Although, this is not absolute as some effective inhibitors carry more than 12 rotatable bonds.66 In recent years, many highly potent molecules carrying more than 10 rotatable bonds are still administered through the oral route with some modifications to their dosage forms.
The fifth and sixth important features are amine (SubFPC88) and secondary carbon (SubFPC2), respectively. Carboxylic acid is a common functional group found in the pharmacophore of diverse classes of therapeutic agents.67 Currently, a large number (>450) of carboxylic acid-containing drugs have been marketed worldwide. The secondary carbon, which is attached to two other carbons, is also a common component in the structure of some anti-cancer agents.68
The seventh and eighth important features are hetero N nonbasic (SubFPC181) and conjugated double bond (SubFPC287), respectively. Hetero N nonbasic can be defined as an aromatic nitrogen atom having two further total connections or an aromatic nitrogen atom affording a charge of +1 with three further total connections. Therefore both features are essential for anticancer activity in compound structure.69,70 In a conjugated double bond, the double bonds are separated by two or more methylene groups and can react with nucleophiles in a similar fashion as the aromatic ring (i.e. withdrawing electrons from electronegative atoms).
The ninth and tenth important substructures are vinylogous ester (SubFPC137) and alkyl aryl ether (SubFPC18), respectively. These two functional groups have been found in several breast cancer drugs.71,72 Alkyl aryl ether is also a key substructure of Tamoxifen, which is a selective estrogen receptor modulator as well being one of the oldest and most-prescribed FDA-approved drug for hormonal therapy.73
![]() | ||
| Fig. 9 Binding pocket of the ligand-binding domain of ERα in complex with CHEMBL304552 (PDB id: 1SJ0). Helices and sheets are depicted in gray color, the ligand is represented in orange colored sticks while its interacting residues are colored blue. Important interactions are indicated by colored dashed lines as follows: brown, hydrophobic interactions; yellow, salt bridge; green, π-stacking interaction (parallel). | ||
Footnote |
| † Electronic supplementary information (ESI) available: Figures of Ro5 descriptors and Table of full prediction performance. See DOI: 10.1039/c7ra10979b |
| This journal is © The Royal Society of Chemistry 2018 |