Privileged substructures for anti-sickling activity via cheminformatic analysis

Sickle cell disease (SCD), an autosomal recessive genetic disorder, has been recognized by the World Health Organization (WHO) as a major public health problem as it affects 300 000 individuals worldwide. Complications arising from SCD include anemia, microvascular occlusion, severe pain, stokes, renal dysfunction and infections. A lucrative therapeutic strategy is to employ anti-sickling agents that can disrupt the formation of the HbS polymer. This study therefore employed cheminformatic approaches, encompassing classification structure–activity relationship (CSAR) modeling, to deduce the privileged substructures giving rise to the anti-sickling activity of an investigated set of 115 compounds, followed by substructure analysis. Briefly, the compiled compounds were described by fingerprint descriptors and used in the construction of CSAR models via several machine learning algorithms. The modelability of the data set, as exemplified by the MODI index, was determined to be in the range of 0.70–0.84. The predictive performance was deduced by the accuracy, sensitivity, specificity and Matthews correlation coefficient, which was found to be statistically robust, whereby the former three parameters afforded values in excess of 0.7 while the latter statistical parameter provided a value greater than 0.5. An analysis of the top 20 important substructure descriptors for anti-sickling activity revealed that 10 important features were significant in the differentiation of actives from inactives, as illustrated by aromaticity/conjugation (e.g. SubFPC287, SubFPC171 and SubFPC5), carbonyl groups (e.g. SubFPC137, SubFPC139, SubFPC49 and SubFPC135) and miscellaneous groups (e.g. SubFPC303, SubFPC302 and SubFPC275). Furthermore, an analysis of the structure–activity relationship revealed that the length of alkyl chains, choice of functional moiety and position of substitution on the benzene ring may affect the anti-sickling activity of these compounds. Thus, this knowledge is anticipated to be useful for guiding the design of robust compounds against the gelling activity of HbS, as preliminarily demonstrated in the data-driven compound design presented herein.


Introduction
Human hemoglobin (Hb) is an iron-containing protein that is found abundantly within red blood cells (RBCs). Hb is formed by a symmetric polypeptide chain dimer pairing in which the alike and b-like chains form a tetrameric structural and functional unit. Their main function is to transport O 2 from the lungs to all body tissues, as well as to transport CO 2 out of the tissues and back to the lungs. Furthermore, Hb is also capable of interacting with other gases, such as carbon monoxide (CO) and nitric oxide (NO), and these interactions govern important biological roles. 1 Adult hemoglobin (HbA) is the most common form of hemoglobin in adults and is composed of two a-chains and two b-chains, constituting 141 and 146 amino acids, respectively. 2 Mutations of the genes result in the structural alteration and perturbation of the globin chain that eventually culminates in Hb-associated diseases as seen in HbA, hemoglobin S (HbS), hemoglobin C (HbC) and hemoglobin E (HbE), as well as thalassemia (i.e. decreased globin chain production). 3 Sickle cell disease (SCD) is a global health problem in several parts of the world (e.g. sub-Saharan Africa, the Mediterranean basin, the Middle East, India and the United States) that has been estimated to annually affect approximately 300 000 infants (WHO), and this number has been forecasted to rise to 400 000 by 2050. 4 The hallmark of SCD involves the polymerization of deoxygenated HbS that consequently leads to the sickling process that alters the shape of RBCs. 5 Mechanistically, HbS arises from the A / T point mutation that leads to the substitution of hydrophilic glutamic acid with hydrophobic valine at the sixth position (Glu6Val) of the b-globin chain. 6 The resulting Val6 on the b 2 -globin chain then interacts hydrophobically with Phe85 and Leu88 from a neighboring Hb molecule. At low oxygen tension, HbS is polymerized inside RBCs, leading to gel or ber formation and thereby causing a drastic decrease in red cell deformability. Consequently, this leads to several complications such as anemia, microvascular occlusion, severe pain, strokes, renal dysfunction and infections.
Currently, the clinical management of SCD is blood transfusion, although long-term transfusion therapy may cause an iron overload in patients, leading to potential side effects such as organ damage and infections. Even though iron chelation therapy has greatly improved blood transfusion, it only offers a temporary solution to the problem. 7 Allogeneic hematopoietic stem cell transplantation (HSCT) is a gene transfer therapy aimed at the underlying molecular causes of SCD. However, most successful transplantations require the use of stem cells from matched sibling donors, thereby making this a challenging therapeutic approach for some patients. HSCT may therefore not be applicable for many current patients. 8 Gene therapy is one of the most promising approaches as it does not try to x the symptoms, but rather the problem of the disease. 9 However, this approach is available to only a very small percentage of patients due to its extremely high costs and requirements for highly specialized centers. Moreover, several anti-sickling agents have been investigated and conrmed to possess ameliorative properties. Hydroxyurea has been shown to decrease the number and severity of sickled cells by signicantly increasing fetal hemoglobin (HbF) production in patients with SCD. It was therefore approved for use by the FDA in 1998. Nevertheless, the side effects of this drug include neutropenia, bone marrow suppression, elevation of hepatic enzymes, anorexia, nausea, vomiting and infertility. 10,11 Recently, on July 7, 2017, the U.S. FDA approved L-glutamine oral powder (Endari) as the rst new drug in 20 years for SCD, which acts by reducing acute complications in adults and children of 5 years and older. Although its mechanism of action is not fully understood, the drug is found to be involved in the oxidative stress phenomena of SCD. It has been shown to improve the nicotinamide adenine dinucleotide (NAD) redox potential of RBCs by increasing the availability of reduced glutathione. However, several common adverse reactions were found in >10% of incidences, such as nausea, headaches, abdominal pain, coughs, pain in the extremities, back pain and chest pain. 12 In fact, it should be noted that several side effects with no specic therapy occur for SCD patients. Therefore, the pathophysiological hallmark of SCD presents an interesting subject. The idea for the treatment of SCD was to bind small molecules near the mutation site in such a fashion that it would prevent the insertion of the bglobin chain of Hb containing the Val mutation (the donor site) into the hydrophobic pocket of a second Hb molecule (the acceptor site), thereby inhibiting deoxy-HbS polymerization (Fig. 1). The rationale for our study was to follow the treatment of SCD based on pathophysiology, to inhibit deoxy-HbS polymerization using computational methods.
Classication structure-activity relationship (CSAR) modeling represents an important approach for elucidating the origin of biological activity for a set of compounds of interest as a function of their molecular descriptors. The obtained CSAR model can help to reveal the privileged substructures that are essential for the biological activity of potent compounds which can subsequently be used as therapeutic agents. Privileged substructures are a concept introduced by Evans et al. 13 in their analysis of cholecystokinin antagonists based on benzodiazepines, in which they discovered that there exist conserved substructures that were not found in compounds of different activity. Therefore, we applied CSAR, together with scaffold and substructure analysis, to rationalize the underlying physicochemical features dening anti-sickling activity in several series of compounds reported by Abraham and colleagues. [14][15][16][17][18][19][20] In this study, we examined the utility of several sets of substructure ngerprint descriptors in modeling anti-sickling activity. Important physicochemical features were then decoded from such predictive CSAR models to discern the privileged substructures inuencing the anti-sickling activity.

Materials and methods
A schematic summary of the CSAR modeling process performed in this study is provided in Fig. 2.

Data collection
The compounds with anti-sickling activity used in this study were compiled from the literature, [14][15][16][17][18][19][20] which afforded an initial set of 132 compounds. The removal of redundant compounds resulted in a nal set of 115 compounds. The compounds were treated with the CSAR curation workow as described by Fourches et al. 21 ChemAxon Standardizer was utilized using the same protocol from our previous study. 22 The anti-sickling activity is represented as a solubility ratio which is summarized below: Anti-sickling activity ¼ HbS ðdrugsÞ HbS ðcontrolÞ where HbS (drugs) and HbS (control) denotes the presence and absence of drugs in a solution of HbS. Solubility ratios greater than 1.06 were estimated as necessary for decreasing the clinical severity of sickle cell disease. Therefore, the compounds were classied into two types, consisting of 32 active compounds (solubility ratios of $1.06) and 83 inactive compounds (solubility ratios of <1.06). Moreover, a set of 1600 decoy molecules was generated from active compounds using DUD-E and treated as inactive compounds. 23

Molecular descriptors
Molecular descriptors can be dened as the quantitative and/or qualitative description of molecules of interest in terms of their constitution, connectivity and physicochemical properties. 24,25 They are of prime importance for quantitative structure-activity relationship (QSAR) studies. 26 Molecular descriptors can be easily calculated from GUI-based soware such as Dragon, 27,28 PaDEL-Descriptor soware, 29 QuBiLS-MIDAS, 30 QuBiLS-MAS 31 and CODESSA; 32 they can be derived programmatically via R or Python environments using packages/modules such as ChemoPy, 33 PyDPI, 34 RDKit 35 and rcdk; 36 and they can be obtained via the internet using webservers such as BioTriangle 37 and ChemDes. 38 Fingerprint descriptors provide descriptions of the constituting substructures inherently present in a molecule. This study makes use of the PaDEL-Descriptor soware 29 for computing several ngerprint classes. Until now, the current version of PaDEL has provided 1875 descriptors, consisting of 1444 1D and 2D descriptors and 431 3D descriptors, and 12 types of ngerprint (a total of 16 092 bits). In this study, we employed 12 types of ngerprint for describing the structural features of the investigated compounds as summarized in Table 1. Three of the twelve ngerprint classes pertain to the frequency count of the substructures presented in the investigated compounds (i.e. they contain the suffix count in the name of the ngerprint class, such as the substructure ngerprint count), while the remaining nine classes consider only the presence/absence of substructures or ngerprint bits in the investigated compounds.

Data ltering
Constant and near constant variables were employed to initially select ngerprint descriptors from a large data set of twelve ngerprint descriptor sets, which not only adds complexity but could potentially give rise to bias in the model. The constant of each ngerprint descriptor and bioactivity (anti-sickling) were calculated using a standard deviation (SD) of 0.1 as a cut-off value. The ngerprint descriptors with SD values greater than 0.1 were selected for further analysis. The numbers of descriptors aer lteration are shown in Table 2

Data balancing
As noted in the previous section, the data set was highly imbalanced as the ratio of active to inactive compounds was 1 : 2.6. From a machine learning point of view, such an imbalanced data set has a tendency to cause classiers to overt, as well as to perform poorly on the minority class. To alleviate this data imbalance issue, an undersampling technique was applied by randomly selecting a subset of 32 inactive compounds from the initial set of 83 inactive compounds. Aer obtaining the balanced data set consisting of 32 active and 32 inactive compounds, the total set of 64 compounds was divided into two subsets using an 8 : 2 ratio, consisting of 48 compounds in the internal set (24 active and 24 inactive) and 16 compounds in the external set (8 active and 8 inactive).

Data set modelability
The modelability of the data set is essentially dependent on the underlying relation of the chemical structures and their observed bioactivity. In particular, two highly similar compounds with striking differences in their bioactivity (i.e. one compound in a pair affords favorable bioactivity while the other affords poor bioactivity), otherwise known as an activity cliff, would be detrimental for machine learning algorithms in their attempts to correlate structures with related levels of bioactivity. On the other hand, similar compounds with similar bioactivities (i.e. where both compounds in a compound pair provide the same bioactivity class) would contribute favorably to the modelability of the data set. This so-called modelability index (MODI) was introduced by Golbraikh et al. 46 for the a priori estimation providing the feasibility of building robust predictive models for any given data set. This statistical metric can be computed as follows: Step 1. For any given pair of compounds, C i and C j dened by an m-dimensional vector, the normalized Euclidean distance (d ij 0 ) is computed as follows: where d ij , d i and n represent the distance scores between the two compounds, the mean Euclidean distance and the number of compounds, respectively.
Step 2. For each compound in a data set, the MODI can be computed by identifying its rst nearest neighbor (i.e. the compound with the smallest Euclidean distance) belonging to the same or a different class as follows: where N C is the number of classes (i.e. C ¼ 2 denotes active and inactive compounds), N same i is the number of compounds of the i th class that have their rst nearest neighbors belonging to the same i th class, and N total i is the number of compounds belonging to the i th class. A data set is considered to be modelable if the MODI index is greater than the threshold value of 0.65. The MODI index was computed using an in-house developed R code.

Statistical analysis
Statistical analysis was performed to investigate the difference patterns, features and trends that are present in individual descriptors between bioactivity classes (i.e. active and inactive) using 6 descriptive statistical parameters, comprising the minimum (Min), rst quartile (Q1), median, mean, third quartile (Q3) and maximum (Max) parameters. A box plot of descriptors was created using the R package ggplot2. 47 The normal distribution of each descriptor was assessed using Kolmogorov-Smirno tests from the ks.test function in the R stats package. Practically, the parametric t-test is applicable if the data follows a normal distribution, whereas for a nonnormal distribution the non-parametric approach, namely the Mann-Whitney U test, is recommended. Particularly, the wilcox.test function from the R stats package 48 was used.

Multivariate analysis
For a CSAR model, its prediction performance will depend not only on compound descriptors but also on the predictor used. This study employs random forest (RF) as the classier owing to its demonstrated success in previous models as well as its interpretability. RF is an ensemble classier that produces a number of decision trees, using a randomly selected subset of training samples and variables. The classication starts at the root node in which the data set at the node is split according to the value of the descriptors that are selected, such that descriptors of different activities are predominantly moved to different branches. The classication is obtained by averaging the results of all trees by a majority vote from each tree. 49,50 The RF classier was generated using the randomForest R package.
To accurately predict the anti-sickling activities of the compounds, it is necessary to tune two parameters of the RF model, i.e. the number of trees used for constructing the RF classier (n tree ) and the number of random candidate features (m try ). In this study, a 10-fold CV procedure was applied to tune the n tree parameter from the range of n tree˛{ 100, 200, ., 1000}, while the m try parameter was estimated using the tuneRF function in the randomForest R package. 51 In order to provide a better understanding of the anti-sickling activities of the compounds, informative descriptors were extracted from the RF model by means of its built-in feature importance estimator. In particular, the mean decrease of the Gini index (MDGI) was utilized to estimate the important descriptors, in which the descriptors with the largest MDGI values represent the most important features, as these descriptors contribute most to the prediction performance of the model.

Model validation
The balanced data set was then subjected to a 5-fold repeated cross-validation (5-fold repeated CV) scheme and external validation so as to assess the model intrapolation and extrapolation, respectively. Briey, data splitting was continuously resampled for 100 iterations (i.e. the data was reshuffled and re-stratied before each round) where each data split iteration divides the data set into internal and external sets using the 80/ 20 split ratio. Subsequently, the internal set (consisting of 48 compounds) was subjected to 5-fold repeated CV in which the data was partitioned into 5 folds, where one fold was retained as the testing set while the remaining folds were used to train the model. This process was repeated iteratively until all folds had the chance to be retained as the testing set. The partitioned 5 folds were reshuffled three times in a repeated CV fashion. Moreover, external validation was also performed on the external set and the decoy data set in order to assess the extrapolation capability of the model on unknown data that has not been previously seen by the training model.
The prediction of anti-sickling activity is essentially a binary (two-class) classication problem, i.e. whether the bioactivity of the compound is active or inactive. For this kind of binary classication problem, the following set of metrics, i.e. accuracy (Ac), sensitivity (Sn), specicity (Sp) and the Matthew's correlation coefficient (MCC), were used to evaluate the prediction performance: where TP, TN, FP and FN represent the instances of true positives, true negatives, false positives and false negatives, respectively. The value of MCC ranges from À1 to 1, in which an MCC of 1 indicates the best possible prediction scenario while an MCC of À1 indicates the worst possible prediction. On the other hand, an MCC of 0 is indicative of random prediction. Furthermore, receiver operating characteristic (ROC) curves were plotted to show the predictive capability of our CSAR models using the pROC package in the R soware. 52 The ROC curve presents the model behaviour of the true positive rate (sensitivity) against the false positive rate (1-specicity) in a visual way. The area under the ROC curve (AUC) was calculated to quantitatively and objectively measure the performance of the proposed CSAR models. A random classier has an area under the curve of 0.5, while the AUC for a perfect classier is equal to 1.

Applicability domain analysis
Applicability domain (AD) analysis allows the denition of chemical space boundaries in which the classication model can be reliably used to predict the putative bioactivity of the investigated compounds. 24,53 In particular, AD allows the relative estimation of the feasibility of predictions made on query compounds on the basis of how similar they are to the compounds used to train the model. There are several approaches that have been proposed to assess the AD of compounds. 54 Of these approaches, the principal component analysis (PCA) bounding box is an intuitive approach which is based on the conversion of the original data into a new orthogonal coordinate system that also corrects for correlations amongst the descriptors. Newly formed axes are dened as PCs presenting the maximum variance of the investigated compounds in the data set. The AD of the classication model presented herein is represented by the PCA scores plot in which the boundary spanned by compounds from the training set is considered to be the AD of the model. Thus, if compounds from the testing set are found to fall within this dened boundary, they are also considered to be within the model's AD, and vice versa (i.e. compounds from the testing set located outside the boundary of the training set space would be expected to be outside the model's coverage).

Reproducible research
To support the reproducibility of the constructed CSAR models as described in this study, all R codes and associated input les (e.g. ngerprint descriptors, SMILES notations, biological activity, etc.) used to create the results, gures and tables are publicly available on GitHub at https://github.com/chaninlab/ anti-sickling/.

Chemical space of the anti-sickling agents
Chemical space analysis was performed in order to explore the general characteristics of the active versus inactive classes of anti-sickling agents via the use of Lipinski's rule-of-ve descriptors. In particular, Lipinski's rule-of-ve descriptors are a renement of drug-likeness used to predict whether a chemical compound will exhibit pharmacological or biological activity as an orally active drug in humans, based on the observation that most medications are relatively large-sized lipophilic molecules, comprising the molecular weight (MW), Ghose-Crippen-Viswanadhan octanol-water partition coefficient (AlogP), number of hydrogen bond donors (nHBDon) and number of hydrogen bond acceptors (nHBAcc). 55 The MW represents the mass of a compound, typically used for obtaining interpretations and calculations. Furthermore, the appropriate size of a compound is important for its passage via the lipid membrane. AlogP is a well-known measure of molecular hydrophobicity (also known as lipophilicity), which is used for calculating the membrane penetration and permeability of compounds. nHBDon and nHBAcc are used to measure hydrogen bonding capacity. A visualization of the chemical space of AlogP as a function of MW is shown in (Fig. 3). A dense distribution of anti-sickling compounds was observed within the MW range of approximately 200-400 Da and within the AlogP range of approximately À2.5 to 4. In addition, a visualization of the overall distribution of the data values of Lipinski's descriptors is shown as a box plot in (Fig. 4). It can be seen that most compounds follow the criteria of Lipinski's rule where the MW is less than 500, except for 2 compounds (57 and 58). Furthermore, the AlogP and nHBDon values are less than 5, and nHBAcc is less than 10.
Analysis of the box plots revealed that there were slight differences between the bioactivity classes (i.e. active and inactive) using Lipinski's rule. In addition, the results of statistical analysis show signicant differences in MW and AlogP between the active and inactive compounds using the Mann-Whitney U test. The MWs of the active compounds were larger than those of the inactive compounds, which was observed from the mean value of the box plots. Similarly, the AlogP values of the active compounds were greater than those of  the inactive compounds, whereas the nHBDon and nHBAcc values of the active compounds were less than those of the inactive compounds.

CSAR modeling of the anti-sickling agents
Prior to initially establishing a prediction model, all activity cliffs must be detected, veried and treated using a score of the modelability of the data set or the MODI index. Herein, this index is used to identify the feasibility of obtaining a CSAR model for discriminating active compounds from inactive compounds. For binary data sets, if the MODI index is greater than 0.65, the data set should be reliable for classication modeling. Interestingly, 12 types of ngerprint met this criteria with their MODI index ranging from 0.70-0.84. The interpretative predictive model is more useful for providing the basis of the biological and chemical properties of the anti-sickling agents. Herein, a CSAR model based on RF is presented for discriminating between the active and inactive compounds of anti-sickling agents. Table 2 lists the results from 100 independent runs for the RF model with twelve different types of ngerprint over an internal validation test, 5-fold CV and an external validation test. Furthermore, the decoy set was used to assess the abilities of the CSAR models to accurately predict inactive compounds. From Table 2, it was found that the best averaged values of Ac ¼ 80.48 AE 5.46% and MCC ¼ 0.61 AE 0.11, as evaluated by a 5-fold CV procedure, were achieved using the substructure count ngerprint descriptor. Meanwhile, the E-state and substructure ngerprints performed well, with the second and third highest averaged values of Ac/MCC of 79.88 AE 7.19%/0.60 AE 0.14 and 79.10 AE 6.87%/0.58 AE 0.14, respectively. Interestingly, as for the external validation test, the substructure count ngerprint was also found to outperform other descriptors in terms of their average values of Ac ¼ 82.38 AE 8.99%, Sp ¼ 84.82 AE 11.32%, Sn ¼ 83.27 AE 11.29% and MCC ¼ 0.66 AE 0.17. However, this descriptor provided a moderate Ac value of 85.93 AE 3.29% on the decoy data set, and it was comparable to the substructure that had the best Ac value of 88.13 AE 3.12%. Considering the results from 5-fold CV and the external validation tests, it can thus be seen that the substructure count was superior to other ngerprint classes. Fig. 5 shows the AD of the classication model built using the substructure count as estimated using the PCA bounding box. The undersampling technique was applied using the Kennard-Stone algorithm to select a subset of 32 inactive compounds from the initial set of inactive compounds for balancing. Aerwards, the total set of 64 compounds was split into two sets, consisting of training (80%) and testing (20%) sets, using the Kennard-Stone algorithm. The training and testing sets were subjected to PCA analysis and PCA bounding box plots were constructed for AD analysis. It can be observed that the compounds in the testing set were nearly located at the boundaries of the compounds in the training set, thereby suggesting a well-dened AD for the CSAR model.

Mechanistic interpretation of feature importance
The analysis of feature importance can provide a better understanding of the mechanistic details governing anti-sickling activity. In order to select informative descriptors on substructure counts, this study utilized the RF model because of its builtin ability of feature importance estimation and its great prediction performance, as discussed above. Generally, two measures were used to rank the important features, namely the mean decrease of the Gini index and the mean decrease of the accuracy. Since Calle and Urrea 56 reported that the Gini index had more robust results compared to those from the accuracy, we utilized the mean decrease of the Gini index to rank the importance of the substructure count descriptors.
As suggested previously, 49,50 the m try parameter could be obtained from the square root of the total number of features or by using the default value of m try ¼ 11. In this study, 10 RF models were constructed by varying the m try parameter setting from 2 to 20 (m try˛{ 2, 3,5,7,9,11,13,15,17,19,20}) and xing the m tree parameter at 100. The use of multiple RF models might increase the reliability of the estimation of informative features. The descriptor importance for the substructure count, ranked by the mean decrease in the Gini index, is shown in Fig. 6, and detailed information for the 20 top-ranked informative descriptors is described in Table 3. The descriptor with the largest value of MDGI is the most important.
A further analysis was performed on each of these features by visualizing the prevalence of their functional moieties in the active versus inactive classes by means of a box plot, as shown in Fig. 7. The results showed that 10 out of the 20 top-ranked informative descriptors showed signicant differences between the active and inactive compounds using the Mann-Whitney U test. It could be stated that these informative descriptors are benecial for providing information on the This journal is © The Royal Society of Chemistry 2018 different characteristics of the active and inactive compounds. Notably, these signicant SubFPCs can be divided into three groups, encompassing compounds with aromaticity/ conjugation, compounds with the carbonyl group moiety and miscellaneous compounds.   Fig. 7 Box plots of anti-sickling agents using importance substructure fingerprints. A single asterisk (*) denotes significance at p # 0.05, double asterisks (**) denote significance at p # 0.001 and triple asterisks (***) denote significance at p # 0.0001.
Interestingly, three out of the ten signicant SubFPCs (SubFPC287, SubFPC171 and SubFPC5) belong to the general class of compounds with aromaticity/conjugation. The most important feature was SubFPC287, with an average Gini index value of 5.96, denoting the alternation of single and double bonds. This descriptor is commonly known as conjugation, in which the molecule contributes to a more stable structure due to the delocalization of charge through resonance and hybridization energy. 57 It was found that molecular conjugation is more prevalent in the active class compared to the inactive class. We observed that 23 out of 32 active compounds and 28 out of 83 inactive compounds possessed this property. Moreover, we also found that SubFPC5 is more prevalent in the active compounds, ranking fourth with an average Gini index value of 2.18, corresponding to the alkene in SubFPC287. The second important feature was SubFPC171, with an average Gini index value of 5.56. SubFPC171 is essentially an aromatic ring with an attached chloride atom, known as an aryl chloride. This moiety has been demonstrated to be thermally stable as well as being capable of exhibiting nucleophilicity, owing to its inherent electron-rich properties. 58 Interestingly, SubFPC171 can be found predominantly in almost all active compounds, except for a few (2d, 3d, 20a, 21a, 17b and 3c).
It was found that 27 out of 32 active compounds and 23 out of 83 inactive compounds possessed this moiety. Furthermore, the results of the analysis of the different types of aryl chloride in the active class revealed that compounds 22a, 6c and 18c contain monochlorobenzene, while the remaining compounds contain the dichlorobenzene ring. This feature is related to the increase in activity of the compounds for binding these moieties. 15 Four out of the ten signicant SubFPCs (SubFPC137, SubFPC139, SubFPC49 and SubFPC135) belonged to the general class of compounds containing the carbonyl group moiety. A carbonyl group is a carbon atom double-bonded to an oxygen atom. In particular, this moiety is polar due to the electronegativity of the oxygen atom, and it is more soluble in water as it forms H-bonds. The twelh important feature was SubFPC137, with an average Gini index value of 1.21 denoting a vinylogous ester (R-O-CH]CHCOR 0 ), which is an ester that is relatively similar to a double bond, containing the carbonyl and ethoxy groups. The ethoxy group is known as an ethyl phenyl ether, and it is much more similar to an ester than an ether due to the conjugation between the carbonyl group and the double bond. The resonance of this moiety is also similar to an ester, and therefore it presents a very unique electronic environment for the alkene group. 59 This moiety can be seen conspicuously in the active class. Specically, 19 out of 32 active compounds and 27 out of 83 inactive compounds contained this moiety. The thirteenth important feature was SubFPC139, with an average Gini index value of 1.14 denoting a vinylogous halide, which contains the carbonyl and halide groups. Interestingly, this SubFPC is correlated with SubFPC49, which has an average Gini index value of 0.83 denoting a ketone. We observed that these moieties were apparently found in the active class: 16 out of 32 active compounds possessed these moieties. Furthermore, the nineteenth important feature was SubFPC135, with an average Gini index value of 0.47 denoting the vinylogous carbonyl group, which consists of a carbonyl group and another atom (e.g. nitrogen, oxygen, sulfur or a halide). This moiety was obviously found in 24 out of 32 active compounds and 19 out of 83 inactive compounds.
Moreover, other signicant SubFPCs are miscellaneous descriptors (e.g. SubFPC303, SubFPC302 and SubFPC275). The third important feature was SubFPC303, with an average Gini index value of 4.50 denoting a Michael acceptor, which is a conjugated system with an electron-withdrawing group as an electrophile and a resonance-stabilizing activating group, which stabilizes the anionic intermediate such as an acrylate ester, acrylonitrile, acrylamide, maleimide, alkyl methacrylate, cyanoacrylate or vinyl sulfone. 60 It can be seen that 18 out of 32 active compounds contained a Michael acceptor whereas 3 out of 83 inactive compounds contained a Michael acceptor. The eleventh important feature was SubFPC302 with an average Gini index value of 1.21, denoting a rotatable bond. These are bonds which allow free rotation around themselves, dened as a single bond. 61 This moiety was important for the determination of molecular exibility. Notably, it was found that rotatable bonds were found in all of the compounds in the data set. In particular, they were highly prevalent in the active compounds. The eighteenth important feature was SubFPC275, with an average Gini index value of 0.50 denoting a heterocycle, which is a cyclic compound containing atoms of at least two different elements as members of its ring. It was observed that this moiety is more prevalent in the inactive class. Notably, 1 out of 32 active compounds (28a) and 20 out of 83 inactive compounds possessed this moiety. Therefore, the heterocyclic moiety may reduce anti-sickling activity.

Scaffold and substructure analysis
The investigated compounds were divided into 6 classes on the basis of their chemotypes, consisting of ethacrynic acid (ECA) analogs, benzyloxyacetic acid-based compounds, phenoxyacetic acid-based compounds, aromatic amide-based compounds, proline-based compounds and 2,2-dimethylchroman-based compounds (Table 4 and Fig. 8). Analysis of the structureactivity relationship (SAR) revealed that the length of the alkyl chain, as well as the functional moiety and substitutions on the benzene ring, may inuence the anti-sickling activity of compounds.
Compounds in the ECA class (1a-31a) exhibited the most potent anti-sickling activity when compared to the other chemotypes described herein. This is reected by the highest solubility ratio in the range of 0.961 to 1.224. In particular, ECA was noted for its ability to cross the RBC membrane. 20 The crucial chemical feature required for HbS binding was suggested to be the vinyl moiety and the substitution of halogen atoms on the benzene ring. 15,20 Notably, compound 11a (solubility ratio ¼ 1.224) was shown to be the most potent compound in the data set as it contains many signicant SubFPCs present in the active class. Furthermore, the results showed that the length of the alkyl chain and the functional moiety may inuence the anti-sickling activity of the ECA analogs. The cyclopentane moiety may enhance the anti- sickling activity of these compounds whereas the benzene ring leads to a decrease in the activity (e.g. 24a > 26a). Moreover, the addition of the vinyl group may increase the anti-sickling activity (e.g. 1a > 8a and 1a > 16a). On the other hand, the addition of the CH 3 S moiety may reduce the anti-sickling activity (e.g. 16a > 17a).  Table 4 from the analysis of the structure-activity relationship. It should be noted that the chemical structures of all compounds are provided in the ESI, Fig. 1 The type and number of chemical moieties that are substituted on the benzene ring have been shown to inuence the anti-sickling activity of the benzyloxyacetic acid analogs (1b-34b) and phenoxyacetic acid analogs. Mono-substitution with Br and Cl provided more potent activity compared to substitution with I in both classes of compounds (i.e. for benzyloxyacetic acid, Br > Cl > I ¼ 14b > 6b > 18b and for phenoxyacetic acid, Cl > Br > I ¼ 6c > 12c > 11c). This could be due to the large size of the I atom that may affect the access or interaction of compounds at the target site of action. A similar effect was also observed for aromatic amides, in which the chlorine analog (2d) was found to provide better activity than the bromide analog (4d). Likewise, the addition of a methyl CH 3 group onto the benzene ring of the Cl and Br benzyloxyacetic acid derivatives may reduce the anti-sickling activity owing to an increase in bulkiness (e.g. 8b > 9b and 14b > 16b). This nding is in accordance with a previous study 19 demonstrating that the anti-sickling activity of compounds may not be enhanced by the insertion of methyl and polar groups. In contrast, multiple substitution on the benzene ring of the phenoxyacetic acid core structure led to enhanced activity of the compounds when compared to monosubstitution (i.e. 8c z 9c z 10c > 6c and 3c z 4c > 1c).
The same phenomenon was also observed for aromatic amide compounds (i.e. di-Cl > mono-Cl > without Cl ¼ 3d > 2d > 1d). Moreover, the inuence on the anti-sickling activity caused by the length of the substituted alkyl chain was exemplied for proline, aromatic amide and 2,2-dimethylchroman compounds, in which longer chain derivatives were shown to provide more potent activity than those with a shorter chain length (i.e. 4d > 6d, 5e > 8e and 1f > 2f > 3f). Interestingly, the oxygen atom on the phenoxyacetic acid and benzyloxyacetic acid core structures may be required for the potent anti-sickling activity of these compounds. The replacement of the oxygen atom with a sulfur atom led to a decrease in activity, as found in the halogen derivatives of 2-(phenylthio)acetic acid (6b > 31b and 14b > 33b) and 2-(benzylthio)acetic acid (12c > 19c) derivatives.

Data-driven design of novel compounds
To apply the newfound knowledge from the feature importance and scaffold analyses, a set of compounds was designed by modifying existing compounds from the anti-sickling data set. The following criteria were considered for the selection of template compounds: (i) the selected compounds were labeled to show inactivity with a value less than 1.06, and (ii) selected compounds did not contain heterocyclic substructures that may cause a reduction in anti-sickling activity. These criteria led to the selection of four scaffolds (ECA, benzyloxyacetic acid, phenoxyacetic acid and aromatic amide). The designed compounds for each scaffold are shown in Fig. 9.
In the ECA scaffold class, compound 18a was modied by replacing the bromine atom with the vinyl group (18a 0 ). Notably, the vinyl moiety was suggested to be an important feature for the ECA class, as mentioned in the scaffold and substructure analysis. Compounds 3b, 4b and 5b from the benzyloxyacetic acid class, as well as 5c from the phenoxyacetic acid class, were modied via the attachment of a methyl group to the Cl atom (3b 0 , 4b 0 and 5b 0 and 5c 0 ) as mentioned in the feature importance (SubFPC171) and scaffold/substructure analysis. Compound 5d from the aromatic amide scaffold was modied by replacing the carboxyl group with a Cl atom (5d 0 ). In spite of this, it was found that Cl modication alone was not sufficient to change the bioactivity. Thus, the vinyl moiety was selected to replace the terminal methyl group in the aromatic amide scaffold. The classication model was then applied to the set of designed compounds in order to predict their possible antisickling activity. It was found that the activity class of the evaluated compounds changed from the inactive class to the active class. Thus, the results indicated that the CSAR model as well as the scaffold and substructure analysis are useful for the compound design.

Conclusion
The hallmark of SCD is HbS polymerization, and the consequent conformation change of the RBCs to that of a sickle shape is associated with increased hemolysis. A lucrative therapeutic strategy for SCD is to employ small-molecule inhibitors for disrupting HbS polymerization. A total of 115 compounds were compiled from the literature and the resulting data set was balanced and used for model construction. Several classes of ngerprint descriptors and machine learning algorithms were evaluated for their ability to robustly predict anti-sickling activity. The results indicated that substructure ngerprints, together with the RF method, afforded the best performance while also affording an interpretable set of descriptors. As such, the origin of anti-sickling activity was deduced by rationalizing the contributions of important substructures as selected by the RF-derived Gini index. Feature analysis of the active compounds revealed the importance of aromaticity/conjugation (i.e. SubFPC287, SubFPC171 and SubFPC5, corresponding to a conjugated double bond, an aryl chloride and an alkene, respectively), carbonyl groups (i.e. SubFPC137, SubFPC139, SubFPC49 and SubFPC135, corresponding to a vinylogous ester, a vinylogous halide, a ketone and a vinylogous carbonyl, respectively) and miscellaneous groups (e.g. SubFPC303 and SubFPC302, corresponding to a Michael acceptor and a rotatable bond, respectively). Moreover, analysis of the structureactivity relationship revealed that the length of the alkyl chain and the substitution on the benzene ring may affect the antisickling activity of these compounds. Thus, the knowledge gained from this study serves as general guidelines for the datadriven design of potentially active anti-sickling agents.

Conflicts of interest
There are no conicts to declare.