Discovery of new Cdc2-like kinase 4 (CLK4) inhibitors via pharmacophore exploration combined with flexible docking-based ligand/receptor contact fingerprints and machine learning

Cdc2-like kinase 4 (CLK4) inhibitors are of potential therapeutic value in many diseases particularly cancer. In this study, we combined extensive ligand-based pharmacophore exploration, ligand–receptor contact fingerprints generated by flexible docking, physicochemical descriptors and machine learning-quantitative structure–activity relationship (ML-QSAR) analysis to investigate the pharmacophoric/binding requirements for potent CLK4 antagonists. Several ML methods were attempted to tie these properties with anti-CLK4 bioactivities including multiple linear regression (MLR), random forests (RF), extreme gradient boosting (XGBoost), probabilistic neural network (PNN), and support vector regression (SVR). A genetic function algorithm (GFA) was combined with each method for feature selection. Eventually, GFA-SVR was found to produce the best self-consistent and predictive model. The model selected three pharmacophores, three ligand–receptor contacts and two physicochemical descriptors. The GFA-SVR model and associated pharmacophore models were used to screen the National Cancer Institute (NCI) structural database for novel CLK4 antagonists. Three potent hits were identified with the best one showing an anti-CLK4 IC50 value of 57 nM.


Introduction
Alternative splicing is a deviation from the normal preferred sequence where certain exons are skipped resulting in multiple forms of mature mRNA. This improper splicing contributes to the pathogenesis of many human diseases. 1 Cyclin dependentlike family kinases (Cdc2-like protein kinases) are dual specic kinases that have been shown to regulate mRNA splicing by phosphorylation of serine and arginine rich proteins. 2 The Cdc2-like kinases family (CLK) consists of four isoforms CLK1-4 exhibiting the typical protein kinase folds and possessing different lengths of amino acids and they are involved in alternative splicing and RNA processing in Duchenne muscular dystrophy, Alzheimer's disease, HIV-1, inuenza virus and cancer. 3 CLK4 inhibition has recently raised interest as a potential treatment for different CLK4-over expressing cancer types, i.e., renal cancer, breast cancer, melanoma and other cancers. Accordingly, several potent CLK4 inhibitors are currently investigated as potential clinical candidates ( Fig. 1), 4 these include (i) TG003 which is one of the rst CLK4 inhibitors published in 2004 with a K d value of 30 nM. (ii) ML106 which is a quanzoline-based inhibitor of CLK4 and was published in 2009 with a K d value of 50 nM. (iii) CX-4954 which was reported, marketed and used as highly selective and potent inhibitor of casein kinase 2 (CK2). However, CX-4954 also inhibits CLK4 with an IC 50  which is an isoquinoline based compound, but the exact structure has not been disclosed yet, it is the rst CLK inhibitor that entered clinical trials and has an activity against solid tumors with an IC 50 of 1 nM against CLK4.
Despite the signicance of CLK-4 in neurodegenerative disease and cancer-related drug discovery research, it appears that only limited number of earlier computer-aided molecular design and discovery efforts dealing with CLK4 inhibitors were reported. The most prominent was published in 2013 and it involved three-dimensional quantitative structure-activity relationship (3D-QSAR) and pharmacophore modeling to identify ligand structural features required for CLK4 binding. 5 Nevertheless, the authors failed to validate their models experimentally by in vitro bioassays of captured hits. Additionally, their structure-based 3D QSARs were based on homology model of CLK4 rather than the actual crystallographic structural. In another study, bisphenol A was found to act as a ligand for CLK4 among other targets through molecular docking studies. 6 Moreover, docking studies were performed to optimize several potent CLK4 inhibitors. [7][8][9] A pharmacophore is dened as abstract three-dimensional description of binding interactions envisaged to anchor certain ligand within corresponding binding pocket. [10][11][12][13][14][15][16][17][18][19][20]42,43 The concept of Ligand-Receptor Contacts Fingerprints (LRCFs) is well established. 25 It proceeds by either mapping out all binding site atoms that contact a list of docked potent ligands and evade inactive ones, [24][25][26]57 or identify binding site atoms that frequently contact certain bound ligand during molecular dynamics or related simulations. [47][48][49] Machine learning (ML) is the implementation of statistical approaches for learning and predicting properties. 27,28 Supervised ML attempts to build predictive model(s) based on data collected from input and output sources. 29 Numerous ML methods have been developed and implemented in the eld of drug design and discovery. [39][40][41] In the current project, ligand-based pharmacophore modeling was combined with molecular docking to identify novel CLK4 inhibitors. Ligand-based modeling [10][11][12][13][14][15][16][17][18][19][20]42 [24][25][26][47][48][49]57 Subsequently, LRCFs, ligand-based pharmacophores and numerous other calculatable physicochemical properties were allowed to compete within the context of genetic algorithm coupled to variety of machine learning [27][28][29][30] methods to search for optimal QSAR model(s) that can explain bioactivity variations within training compounds. Thereaer, QSAR-selected pharmacophores were validated by receiveroperating characteristic (ROC) curve analysis and were used as 3D search queries to screen the National Cancer Institute's (NCI) list of compounds for new CLK-4 inhibitors. Fig. 2 summarizes the overall workow implemented in this study.  procedure in IC 50 format were selected. Hence, 91 CLK4 inhibitors from published literature were included in the present modelling (1-91 Table S1 under ESI †). [7][8][9] The conformations of collected compounds were explored using the "CAESAR conformation generation" option in Discovery Studio (version 4.5, BIOVIA, USA) to represent their conformational exibilities. CAESAR (Conformer Algorithm based on Energy Screening and Recursive Buildup) has been found to be significantly faster than alternative methods for all data sets investigated. 58 Conformational ensembles were generated for each molecule with a maximum energy threshold of 20 kcal mol À1 from the local minimized structure and a maximum limit of 250 conformers per molecule based on the generalized CHARMm force eld implemented within Discovery Studio 4.5.
Successful models were clustered into 60 groups (i.e., at ratio of 10 to 1) using the hierarchical average linkage method implemented in CATALYST-HYPOGEN. Subsequently, the highest-ranking representatives, based on their F-values (calculated from the correlation of t values against the bioactivities for the total list of collected CLK4 inhibitors), were selected as representatives in subsequent QSAR modeling. [10][11][12][13][14][15][16][17][18][19][20]42,43 This was done to minimize collinearity among pharmacophoric descriptors so as to improve noise-tosignal ratio in subsequent ML modeling. 31,32 However, in case if a particular cluster included more than 10 pharmacophores, then it was represented by additional pharmacophores to maintain the same group-to-representative ratio (i.e., 10-to-1). Table S4 under ESI † shows information about representative pharmacophores including their features, success criteria, differences from corresponding null hypothesis and Cat.-Scramble condence levels (Y-scrambling method that challenges CATALYST-HYPOGEN to generate pharmacophore models of superior qualities from scrambled bioactivity data compared to the original unscrambled training sets). [10][11][12][13][14][15][16][17][18][19][20]42,43 2.1.3 Flexible docking and ligand-receptor contacts ngerprints Preparation of CLK4 crystal structure. Searching the protein databank yielded a single crystallographic structure for CLK4 (25-November-2020, PDB code: 6FYV, resolution ¼ 2.46 A). Hydrogen atoms were added to the protein utilizing Discovery Studio 4.5 templates for protein residues. The protein structure was utilized in subsequent docking experiments without energy minimization. Explicit water molecules were kept.
Ligand docking and scoring. Docking experiments were conducted employing exible docking as implemented within Discovery Studio 4.5. A binding site sphere of 7.9 A radius surrounding the center of the co-crystallized ligand (silmitasertib, PDB code: 3NG) 33 was used to dene the binding site. Docking was performed with ligands in their unionized states (only two training compounds are ionizable, namely, 88 and 91 in Table S1 under ESI, † however they were docked unionized). The implemented docking procedure combines features from LibDock 21,22 and CDOCKER 34 docking engines. Flexible docking simulates protein exibility and docks ligands with an induced t receptor optimization procedure. The following steps are performed: (i) calculate receptor side-chain conformations; initially, the protocol creates protein side-chain conformations using the ChiFlex algorithm. This algorithm generates ensemble of low energy protein conformations with varied sidechain rotamers 35 (ii) create ligand conformations (similar to the method in Section 2.1.1) (iii) perform initial placement of the ligand conformations into the active site of each receptor sidechain conformation using LibDock docking engine (iv) clustering to remove similar ligand poses (ligand poses are clustered regardless of the protein conformation since the protein conformations are rebuilt during the next step) (v) rene selected protein side-chains in the presence of the rigid ligand using ChiRotor algorithm. This algorithm proceeds by removing side-chain atoms, then perform side-chain conformation sampling, followed by assembling each side chain based on lowest energy conformation followed by nal minimization 35 (vi) perform a nal ligand renement using CDOCKER docking engine. In the current project, amino acids within 7 A from the bound crystallographic ligand (silmitasertib, PDB code: 3NG) were allowed to be exible during docking, namely, Leu167, Gly168, Phe172, Val175, Ala189, Lys191, Val225, Phe241, Leu244, Gly245, Ser247, Glu292, Asn293, Leu295, Val324, and Asp325. These should be directly involved in binding interactions with binding ligands, and therefore, susceptible to conformational exibility (i.e., induced t).
LibDock docking engine. The site-feature docking algorithm (LibDock) docks ligands, aer removing their hydrogen atoms, into a putative active site guided by binding hotspots. The ligands' conformations are aligned to polar and apolar receptor interactions sites (i.e., hotspots). 21,22 The following LibDock parameters were implemented in the current study: the number of binding site hotspots (polar and apolar) was set to 100. The ligand-to-hotspots matching RMSD tolerance value was set to 0.25 A. The maximum number of poses saved for each ligand during hotspots matching before nal pose minimization ¼ 100. Maximum number of poses to be saved for each ligand in the binding pocket ¼ 100. Minimum LibDock score (poses below this score are not reported) ¼ 100. Maximum number of rigid body minimization steps during the nal pose optimization (using BFGS method) ¼ 50. Maximum number of steric clashes allowed before the pose-hotspot alignment is terminated (specied as a fraction of the heavy atom count) ¼ 0.1. Maximum value for nonpolar solvent accessible surface area for a particular pose to be reported as successful ¼ 15.0 A 2 . Maximum value for polar solvent accessible solvent area for a particular pose to be reported as successful ¼ 5.0 A 2 . No nal ligand minimization was implemented.
CDOCKER docking engine. CDOCKER is a CHARMm-based simulated annealing/molecular dynamics method for docking ligands into potential receptors. 34 The following CDOCKER parameters were implemented in the presented project: starting ligands' conformers were energy-minimized then heated to 1000 K over 1000 molecular dynamics steps to generate 10 starting random conformations for each ligand. Each random conformer was rotated 10 times within the binding pocket for subsequent energy renement. The van der Waals energies of the resulting conformers/poses were evaluated and those of $300 kcal mol À1 were discarded. Surviving conformers/poses were exposed to a cycle of simulated annealing over 2000 heating steps to targeted temperature of 700 K followed by 5000 cooling steps to targeted temperature of 300 K. The docked poses were energy minimized to a nal minimization gradient tolerance zero kcal mol À1 A À1 . Top 10 poses were arbitrary selected and saved for subsequent scoring.
Docking-based ligand-receptor contacts ngerprints (LRCFs). High ranking docked conformers/poses generated by the exible docking procedure were scored using two scoring functions: LibDock score 21,22 and CDocker interactions energy. 34 Taking into account each scoring function in turn, the highest scoring docked ligand-protein complex was chosen for each inhibitor (1-91, Table S1 under ESI †) for subsequent elucidation of LRCFs. This resulted in two sets of docked ligand-CLK4 virtual complexes corresponding to each scoring function (LibDock score and CDocker interactions energy) for each docked ligand (unless the two scoring functions converged on the same docked pose in which case a single pose is used for generating LRCFs). Each virtual complex in each set was evaluated to identify close binding site atoms to docked ligand, i.e., of distance # 2.5 A. Atomic neighbors that lie closer than this predened distance threshold are allocated an intermolecular contact value of "one", otherwise they are given a contact value of "zero". Distance evaluations were automatically performed employing a tailored-made Fortran-based soware. [24][25][26][47][48][49] Accordingly, each docking-scoring conguration is used to build a 2D matrix, such that each matrix is composed of row labels corresponding to docked ligands and column labels corresponding to different binding site atoms. The matrix is lled with binary code, whereby "zeros" correspond to interatomic distances that exceed the predetermined threshold (of 2.5 A) and "ones" for distances below (or equal) the predened 2.5 A threshold cutoff. Each row represents the docking-based Ligand-Receptor Contacts Fingerprint (LRCF) of the corresponding docked compound.
2.1.4 ML-QSAR model building Data preparation. The list of collected CLK4 inhibitors was tted against representative cluster centers pharmacophores (94 models) and their t values were used as descriptors in QSAR-ML modeling. Fit values are determined by the following equation: where, tted pharmacophore features represent the number of pharmacophore features that superimpose (i.e., overlap or map with) respective chemical functions within the tted compound. W is the weight of the corresponding pharmacophore feature spheres. This value is xed to 1.0 in HYPOGENgenerated models. "disp" is the distance between the center of certain pharmacophoric sphere (feature centroid), and the center of the respective superimposed chemical moiety of the tted compound. Tol, known as tolerance, is the radius of the pharmacophoric feature sphere. S(disp/tol) 2 is the summation of (disp/tol) 2 values for all pharmacophoric features that successfully map corresponding chemical functionalities in the tted compound. [10][11][12][13][14][15][16][17][18][19][20]42,43 Additional 166 physicochemical descriptors that included numerous physicochemical, topological and ngerprint descriptors 55,56 were also calculated employing the "Calculate Molecular Properties" protocol implemented within Discovery Studio (version 4.5, BIOVIA, USA). Moreover, the LRCFs of the molecules were added as additional binary descriptors to the ML-QSAR list of compounds. Two sets of LRCFs were used for each compound, namely, those based on LibDock-score scoring function and CDocker interaction energy scoring function. Accordingly, the ML-QSAR matrix included 91 rows corresponding to 91 collected compounds and 361 columns corresponding to the t values against 94 pharmacophore models, 101 LRCFs and 166 physicochemical descriptors.
A subset of 72 compounds from the total list of modeled inhibitors (1-91, Table S1 under ESI †) was utilized as training list for ML-QSAR modeling. The remaining 19 compounds (ca. 20% of the dataset) were employed as external testing subset for validating ML-QSAR models. The test molecules were selected as follows: the collected compounds were ranked according to their IC 50 values, and then every h compound was selected for the test set starting from the high potency end. The selected test molecules are marked with asterisks in Table S1 under ESI. † ML-QSAR modelling. Genetic function algorithm (GFA) was used to select combinations of pharmacophore t values, LRCFs and physicochemical descriptors and feed them into the particular machine learner (ML) under evaluation to assess how successful the combination, i.e., descriptors and ML, is in explaining the observed variation in bioactivity (log(1/ IC 50 )). [10][11][12][13][14][15][16][17][18][19][20]42,43 In the current project we implemented the feature selection node within KNIME Analytics Platform (Version 4.1.0) for GFA with the following GFA settings: (i) population size was set to 200 (ii) max number of generations was set to 1000 (iii) fraction of survivors was set to 40% (iv) selection strategy method was "tournament", where the winner of each tournament is selected to perform crossover (v) elitism rate, which indicates the fraction of the best individuals within a generation that are transferred to the next generation without alternation, was set to 10% (vi) uniform crossover strategy in which each bit (gene) is chosen from either parent with equal probability (vii) crossover rate was set to 60%, accordingly, top 60% of survivors are allowed to mate (viii) mutation rate was set to 1%, accordingly, 1% of the surviving chromosomes in each generation is exposed to mutation, and this is done to escape being captured in a local minimum.
Several ML methods were evaluated to tie these properties with anti-CLK4 bioactivities, namely, multiple linear regression (MLR), 36 random forests (RF), 37 extreme gradient boosting (XGBoost), 37,38 probabilistic neural network (PNN), 39 and support vector regression (SVR). 40,41 The best ML method was found to be SVR. This method is a supervised machine learning method that uses the principle 'kernel trick' to nd number of boundary instances, also called "support vectors", to create discriminatory function that separates training observations into distinct classes with widest possible boundaries. This function can be used to classify new observations or to predict continuous functions (regression). 40,41 The following SVR settings were implemented in this project: an upper bound on the fraction of training errors and a lower bound of the fraction of support vectors (i.e., the nu parameter) ¼ 0.664. This value was optimized in this project to minimize the error on training data and reduce the computational complexity of models to avoid over tting. Other parameters were set to their default values in the Weka-Knime (version 4.1.3) LibSVM node, these include: kernel cache (cache size ¼ 40.0), kernel coefficients epsilon ¼ 0.001 and gamma ¼ 0.00, kernel type is radial basis function: exp(Àgamma Â ju À vj^2), and loss function is 0.1.

ML model evaluation
Statistical validation of ML-QSAR models. Each ML-QSAR generated model was validated internally, i.e., using the training compounds, and externally, i.e., using the testing list (testing inhibitors are marked with asterisks in Table S1 under ESI †). Internal validation was performed by employing leave one-out cross-validation (r LOO 2 ) and leave-20%-out crossvalidation (r L20 2 %). 42 Additionally, the models were also validated against the external testing set. Predicted bioactivities of testing compounds (log(1/IC 50 )) values (determined by each ML-QSAR model under evaluation) and their experimental counterparts were used to calculate the predictive r 2 (r PRESS 2 ) dened as: where SD is the sum of the squared deviations between the biological activities of the test set and the mean activity of the training set molecules, PRESS is the squared deviations between predicted and actual activity values for every molecule in the test set.
Validation of generated pharmacophore models using receiveroperating characteristic (ROC) curve analysis and goodness of hit score. ML-QSAR selected pharmacophores were validated using ROC curve analysis and goodness of hit list (GH scoring). [59][60][61][62][63] The testing set is composed of experimentally validated active and inactive CLK4 inhibitors extracted from ChEMBL database (https://www.ebi.ac.uk/chembl/) and not included in the modelling list (91 compounds in ESI Table S1 †). The ROC set includes 383 active compounds (anti-CLK4 IC 50 # 400 nM) and 270 inactive compounds (anti-CLK4 IC 50 > 3000 nM). Table S5 under ESI † shows the chemical structures of the testing set compounds in simplied molecular-input line-entry system (SMILES) format, together with their corresponding bioactivities.
ROC analysis evaluates the capability of a particular pharmacophore model(s) to correctly classify a group of compounds into actives and inactives. It affords several success criteria for evaluation: [10][11][12][13][14][15][16][17][18][19][20][24][25][26][42][43][44] (i) area under the ROC curve (AUC) of the corresponding ROC curve, (ii) accuracy, (iii) true negative rate, and (iv) true positive rate. True positive rate (or sensitivity, SEN) is calculated by dividing truly active captured hits (true positives) by the sum of true positives and false negatives. True negative rate (or specicity, SPC) is truly inactive compounds being discarded aer proper identication. True negative rate is calculated by dividing the true negatives by the sum of true negatives and false positives. Accuracy describes the percentage of correctly classied molecules (active and inactive) by the screening protocol.
GH scoring is calculated as in eqn (2). 59-63 where, Ya is the percentage yield of actives calculated as the ratio of actives found in the hit list to the total number of compounds in the hit list (Ya ¼ TP/(TP + FP) Â 100).

In silico screening of the NCI database for new CLK4 antagonists
Pharmacophore hypotheses selected by the best GFA-SVR model, were employed as 3D search queries to screen the National Cancer Institute (NCI) list of compounds. The t values of captured hits against capturing pharmacophore models and relevant physicochemical molecular descriptors (i.e., that emerged in the optimal GFA-SVR model) were calculated. Additionally, captured hits were docked into the CLK4 binding site (PDB code: 6FYV, resolution ¼ 2.46 A) using exible docking with same docking settings employed for modeled compounds to determine their LRCFs. All descriptors were then fed into the optimal GFA-SVR model to determine the predicted bioactivities for the captured hits. The highest-ranking hits were acquired for subsequent in vitro testing.

Biological evaluation of captured hits
The bioassay was conducted using LanthaScreen Eu kinase binding assay (Invitrogen-Life Technologies, USA). 45 This assay is based on the ability of the potential kinase inhibitor under evaluation to bind and displace a proprietary "tracer" molecule (known as Alexa Fluor 674 conjugate) from the catalytic site of the targeted kinase. In case the tested molecule is of low affinity to the targeted kinase, and thus fails to displace the "tracer" molecule, then the tracer remains within the kinase binding site and maintains effective orescent interaction with certain Eu-labeled anti-tag antibody attached at the kinase surface, thus emitting signicant uorescence (time-resolved uorescence energy transfer, TR-FRET). On the other hand, if the potential inhibitor binds tightly to the kinase, then it will displace the "tracer" molecule from the binding site causing loss of the TR-FRET interaction, i.e., resulting from tracer/Eu interaction, with concomitant loss of uorescence. Stock solutions of hit molecules were prepared in DMSO, and then serially diluted in assay buffer 50 mM HEPES pH 7.5, 0.01% BRIJ-35, 10 mM MgCl 2 , 1 mM EGTA to yield nal hit concentrations of 10 mM. The reaction of the test compound solution, kinase antibody mixture and tracer were mixed and incubated over 1 hour at room temperature, and then the uorescence was read at ls 665 and 615 nm. 46 DMSO did not exceed 1% in the nal kinase reaction. Hits that inhibited CLK4 more than 75% at 10 mM were further tested at 1.0, 0.10 and 0.01 mM to determine their IC 50 values. Staurosporine was used as standard (positive control) with IC 50 value ¼ 7.45 nM. IC 50 values were calculated using nonlinear regression of the log(concentration) vs. inhibition percentage values using GraphPad Prism 5.0.

Supervised ML-QSAR modeling of CLK4 inhibitors
The current project involves supervised ML, whereby elaborate search was performed to identify the best ML method that yields optimal self-consistent and predictive regression model a Hypo(5-R2-08) is the 8 th pharmacophore model generated using training subset 5 (Table S2 under ESI) with the 2 nd HYPOGEN run settings (Table  S3 under ESI), Hypo(6-R2-07): is the7 th pharmacophore model generated using training subset 6 (Table S2) with the 2 nd HYPOGEN run settings (Table S3), Hypo(8-R3-08) is the 8 th pharmacophore model generated using training subset 8 (Table S2) with the 3 rd HYPOGEN run settings (Table S3), Hypo(6-R2-03) is the 3 rd pharmacophore model generated using training subset 6 (Table S2) with the 2 nd HYPOGEN run settings (Table S3), Hypo(2-R5-05) is the 5 th pharmacophore model generated using training subset 2 (Table S2) with the 5 th HYPOGEN run settings (Table S3), Hypo(3-R6-08) is the 8 th pharmacophore model generated using training subset 3 (Table S2) with the 6 th HYPOGEN run settings (Table S3), Hypo(1-R6-02) is the 2 nd pharmacophore model generated using training subset 1 (Table S2) with the 6 th HYPOGEN run settings (Table S3), Hypo(5-R2-07) is the 7 th pharmacophore model generated using training subset 5 (Table S2) with the 2 nd HYPOGEN run settings (Table S3), Hypo(6-R2-08) is the 8 th pharmacophore model generated using training subset 6 (Table S2) with the 2 nd HYPOGEN run settings (Table S3), Hypo(8-R2-04) is the 4 th pharmacophore model generated using training subset 8 (Table S2) with the 2 nd HYPOGEN run settings (Table S3), Hypo(5-R6-08) is the 8 th pharmacophore model generated using training subset 5 (Table S2) with the 6 th HYPOGEN run settings (Table S3), Hypo(1-R2-08) is the 8 th pharmacophore model generated using training subset 1 (Table S2) with the 2 nd HYPOGEN run settings (Table S3). Table 2 shows the X, Y, Z coordinates of pharmacophores Hypo(5-R2-08), Hypo(6-R2-07), and Hypo(8-R3-08). b LEU244HNLD is the hydrogen atom attached to peptidic N of Leu244 selected by LibDock score scoring function, VAL 324 HB LD is the hydrogen atom attached to beta carbon of Val324 selected by LibDock score scoring function, ASP 325 HA LD is the hydrogen atom attached to alpha carbon of Asp325 selected by LibDock score scoring function. Fig. 3 shows the position of these three atoms within the binding pocket, LEU 210 HD12 CD is the hydrogen atom attached to delta carbon of Leu210 selected by CDocker interaction energy scoring function, LYS 191 HZ2 CD is one of the hydrogen atoms at the terminal amine on the side chain of Lys191 selected by CDocker interaction energy scoring function. c Num_Rings6: number of 6-membered rings. CHI_2: second order connectivity index, positively correlated with molecular size, Num_Rings5: number of 5-membered rings. Kappa_3: third order kappa shape index, related to molecular exibility, Dipole_Y: 3D the calculated magnitude and the X-vector component of the molecular dipole moment in debyes as estimated from the partial atomic charges (calculated by Gasteiger method) and atomic coordinates. PMI_x: principle moment of inertia in the X-dimension, Shadow_XYfrac area of the molecular shadow in the XY plane. 54,55 d Resubstitution correlation coefficient: the model is trained on the training list and used to predict the bioactivities of the same training set. e Leave-20%-out correlation coefficient. f Leave-one-out correlation coefficient. g Predictive correlation coefficient on the external testing set.
connecting combination of pharmacophore models, LRCFs and physicochemical descriptors with anti-CLK4 bioactivities. However, the fundamental problem in ML regressors is their failure to infer information about the exact descriptors that control variation in the response (bioactivity) of the training observations (training compounds). 50 Still, MLR and RF are noticeable exceptions as it is possible to identify signicant contributors (i.e., descriptors) by either numerical contribution of individual descriptors in the regression model of MLR or by repetitive emergence of certain descriptor(s) as decision criterion in decision trees within RF. 51 Nevertheless, to solve this issue with other ML methods, it was decided to combine evaluated ML methods with GFA for feature-selection 51 to identify the most successful combination of pharmacophore(s), physicochemical descriptors and LRCFs that control the bioactivity variation within the training compounds. Several ML modeling approaches were attempted, namely, MLR, SVR, XGBoost, RF, and PNN. All GFA-ML models were validated by leave-20%-out and leave-one-out cross-validation, and against external testing set. 52,53,57 Table 1 shows the results. Clearly, GFA-SVR 40,41 and GFA-RF 51 were most consistent in explaining training and testing bioactivity variations. However, the GFA-SVR model scored better leave-one-out crossvalidation r 2 prompting us to select this model for hit identication and bioactivity prediction. Interestingly, the top ML-QSAR models (GFA-SVR and GFA-RF) converged on several descriptors, namely, Hypo(5-R2-08), Hypo(8-R3-08), VAL 324 HB LD , ASP 325 HA LD and CHI_2 indicating their signicance in the prediction of anti-CLK4 bioactivity. Table 2 shows the X, Y and Z coordinates of the three pharmacophores that emerged in the GFA-SVR model, while Fig. 3 shows the three pharmacophores and how they map a crystallographic bound ligand within CLK4.
Clearly from Fig. 3, the GFA-SVR pharmacophores, i.e., Hypo(5-R2-08), Hypo(6-R2-07), and Hypo(8-R3-08), emphasize certain ligand-receptor binding interactions seen in the crystallographic complex: all three pharmacophores highlight a hydrophobic interaction tying the chlorobenzene of the bound ligand with the aromatic side chain of Phe172. However, Hypo(5-R2-08) and Hypo(6-R2-07) emphasize hydrophobic and p-stacking interactions, respectively, connecting the benzoic acid fragment of the bound ligand and the aromatic side chain of phe241 ( Fig. 3A and C). Similarly, Hypo(5-R2-08) and Hypo(8-R3-08) highlight hydrogen bonding interaction connecting the carboxylic acid of the bound ligand with the side chain ammonium of Lys191 ( Fig. 3A and B). The same pharmacophores, i.e., Hypo(5-R2-08) and Hypo(8-R3-08), highlight pstacking interactions involving the pyridine side-ring of the bound ligand against the peptidic amide connecting Leu244 and Gly245 (Fig. 3A and B). However, Hypo(8-R3-08) and Hypo(6-R2-07) underscore hydrogen bonding interaction involving the central aniline NH of the bound ligand with a network of water molecules connected to Asp250 and Gly245 ( Fig. 3B and C). It is noteworthy to mention that the apparent less-than-optimal mapping of the hydrogen bond donor feature of Hypo(8-R3-08) against the central aniline NH of the crystallographic bound ligand (Fig. 3B) is not unexpected since Hypo(8-R3-08) (and other pharmacophores in the optimal GFA-SVR model) was totally generated through ligand-based process without considering the binding site. Moreover, despite the reliably of crystallographic structures in drug design, they suffer from some serious problems such as inadequate resolution and crystallization-related artifacts of the ligand-protein complex. Furthermore, crystallographic structures generally ignore structural heterogeneity related to protein anisotropic motion  and discrete conformational substates. 64 These factors warrant the use of ligand-based methods as adjuvants to complement information derived from bound ligand poses derived from crystallographic complexes. Interestingly, Hypo(6-R2-07) uniquely highlights hydrogen bonding interaction involving the central pyridine nitrogen of the bound ligand with a network of hydrogen-bonded water molecules connected to the carboxylic acid side chain of Asp325 (Fig. 3C).
To evaluate the ability of pharmacophore models in the GFA-SVR QSAR to effectively and selectively capture active hits, the three pharmacophores were validated against an external ROC testing set extracted from ChEMBL database. The results are shown in Table 3.
Clearly from Table 3, the three pharmacophores exhibit mediocre performances in discriminating actives from inactive CLK4 inhibitors, as their receiver operating characteristic areas under the curves (ROC-AUCs) ranged from 53% to 57%. Moreover, they exhibited moderate goodness of hit (GH) values. [59][60][61][62][63] This explains the need for additional descriptors including LRCFs to achieve satisfactory machine learning models. Therefore, unsurprisingly, the GFA-SVR model in Table 1 highlights three signicant ligand receptor contacts that emerged from the exible docking via LibDock-score scoring function, namely, LEU 244 HN LD , VAL 324 HB LD , ASP 325 HA LD shown in Fig. 4. Interestingly, LEU 244 HN LD corresponds to hydrogen-bonding interaction connecting potent ligands to Leu244 (Fig. 4) that is not represented in any of the pharmacophores within the best GFA-SVR model (Fig. 3).

In silico screening of the NCI database for new CLK4 antagonists
The fundamental utility of pharmacophores and associated ML models is in the discovery of new chemical scaffolds of similar or  even better biological proles compared to starting training compounds, i.e., scaffold hopping. Therefore, pharmacophores Hypo(5-R2-08), Hypo(8-R3-08) and Hypo(6-R2-07), Fig. 3A Table 1 including t values against Hypo(5-R2-08), Hypo(8-R3-08) and Hypo(6-R2-07) together with their predicted bioactivities and experimental anti-CLK4 inhibition at 10 mM. Hits that illustrated inhibitory percentages exceeding 75% at 10 mM were further evaluated at 1.0, 0.1, 0.01 and 0.001 mM to determine their IC 50 values. Hits 96, 107 and 109 (Table S6 under ESI †) exceeded 75% CLK4 inhibition at 10 mM warranting further evaluation to determine their anti-CLK4 IC 50 values. Fig. 5 shows the dose-response curves of the three hits, while Fig. 6 shows how the molecules map pharmacophore models Hypo(5-R2-08), Hypo(8-R3-08) and Hypo(6-R2-07). Interestingly the three hits illustrated potent anti-CLK4 inhibitory proles with the best one, hit 96, illustrating IC 50 value of 57 nM. The Hill slopes of the three active hits are #1.0 suggesting they are authentic (non-promiscuous) inhibitors. 56 Fig. 6 The chemical structures of captured hits (left column) and their mappings against pharmacophores Hypo(5-R2-08), Hypo(8-R3-08) and Hypo(6-R2-07) (respectively, left to right). Fig. 6 and 7 show how active hits t their respective capturing pharmacophores and how they dock into CLK4, respectively. Mapping the active hits against three pharmacophores seems to emphasize similar binding interactions, namely, hydrogen bonding with a network of water molecules connected to Asp325, hydrophobic interactions with the aromatic side chain of Phe172 and Phe241, as well as p-stacking interactions against the peptidic amide connecting Leu244 and Gly245.
Interestingly, captured hits exhibit signicantly distinct chemical scaffolds compared to the modelled list of compounds (1-91, Table S1 under ESI †) and even to all known CLK4 inhibitors, as in Fig. 8 and 9. This conclusion is based on principal component analysis (PCA) (Fig. 8) showing the relative  Table S6 under ESI †) compared to the modeled compounds (1-91, ESI Table S1 †), as in Fig. 8, or even to all known CLK4 inhibitors within ChEMBL database, as in Fig. 9. In fact hits 96 and 109 represent the rst in class nanomolar CLK4 inhibitors, as this the rst time to report potent triazine-based CLK4 inhibitors.

Conclusions
In conclusion, the combination of pharmacophore modeling of CLK4-antagonists and LRCFs generated by exible docking followed by GFA-driven ML-based modeling yielded selfconsistent and predictive SVR-QSAR model. The resulting pharmacophores were validated by receiver operating characteristic (ROC) curve analysis and used as virtual search queries to screen the National Cancer Institute (NCI) database for promising CLK4 hits of novel chemo-types. Three hits (96, 107 and 109) showed nanomolar and low micromolar IC 50 values. The three hits represent novel CLK4 inhibitors scaffold.

Conflicts of interest
There are no conicts to declare.  Table S6; † red spheres ) compared to modeled compounds (1-91, Table S1; † blue spheres ). The top three principal components calculated for modeled compounds and captured hits are based on eight descriptors (i.e., log P, molecular weight, hydrogen bond donors and acceptors, rotatable bonds, number of rings, number of aromatic rings, fractional polar surface area surface area). The active hits 96, 107 and 109 are indicated in the figure with arrows. Fig. 9 Principal component analysis showing the relative distribution of captured hits (red spheres, ) including active hits (yellow spheres, ) among all reported 3643 CLK4 inhibitors extracted from ChEMBL database (blue spheres, ). The top three principal components were calculated based on eight descriptors (i.e., log P, molecular weight, hydrogen bond donors and acceptors, rotatable bonds, number of rings, number of aromatic rings, fractional polar surface area surface area).