Joseph Kern, Shruti Venkatram, Manali Banerjee, Blair Brettmann and Rampi Ramprasad*
School of Materials Science and Engineering, College of Engineering, Georgia Institute of Technology, 771 Ferst Drive, J. Erskine Love Building, Atlanta, GA 30332-0245, USA. E-mail: rampi.ramprasad@mse.gatech.edu
First published on 31st October 2022
We present machine learning models trained on experimental data to predict room-temperature solubility for any polymer–solvent pair. The new models are a significant advancement over past data-driven work in terms of protocol, validity, and versatility. A generalizable fingerprinting method is used for the polymers and solvents, making it possible, in principle, to handle any polymer–solvent combination. Our data-driven approach achieves high accuracy when either both the polymer and solvent or just the polymer has been seen during the training phase. Model performance is modest, though, when a solvent (in a newly queried polymer–solvent pair) is not part of the training set. This is likely because the number of unique solvents in our data set is small (much smaller than the number of polymers). Nevertheless, as the data set increases in size, especially as the solvent set becomes more diverse, the overall predictive performance is expected to improve.
To enable informed solvent identification, several empirical methods have been proposed with varying degrees of success, including the Hildebrand and Hansen methods. In the Hildebrand method, a solvent is predicted to be good for a polymer when the two have similar Hildebrand parameters (which are related to the cohesive energy density), and bad when the parameters differ by more than a threshold.1,11 In the Hansen method, the differences between three parameters quantifying dispersion, dipolar, and hydrogen bonding interactions for the polymer and solvent are used to estimate the solubility of the polymer in the solvent.12 Previous work by Venkatram et al. showed that the Hansen method performed only marginally better than the Hildebrand method despite its greater complexity.13
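To make the Hildebrand criterion concrete, here is a minimal sketch in Python; the threshold and solubility parameters are rough literature values chosen for illustration, not values from this work:

```python
def hildebrand_soluble(delta_polymer: float, delta_solvent: float,
                       threshold: float = 4.0) -> bool:
    """Predict a good solvent when the Hildebrand parameters (in MPa**0.5)
    differ by less than a threshold. The ~4 MPa**0.5 cutoff is a common
    rule of thumb, not a value taken from this paper."""
    return abs(delta_polymer - delta_solvent) < threshold

# Approximate literature values: polystyrene ~18.5, toluene ~18.2 MPa**0.5
print(hildebrand_soluble(18.5, 18.2))  # True -> predicted good solvent
```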
Several computational approaches have been used to estimate these solubility parameters, including group contribution methods14 and a variety of machine learning techniques. Sanchez-Lengeling et al. used Gaussian process regression with Morgan and MACCS fingerprints to estimate Hansen solubility parameters for 31 polymers and 193 solvents, achieving reasonable R2 performance (0.56–0.83).15 Kurotani et al. used only four pieces of analytical data to predict Hansen solubility parameters for polymers as well, and while their R2 performance was lower than that of Sanchez-Lengeling et al., they posited that their approach may be useful for new polymers with unknown SMILES strings.16 Others have used descriptors derived from atomic structure and quantum chemical calculations for small molecules representing polymer repeat units to predict Hildebrand solubility parameters using kernel ridge regression and multi-linear regression models.17 Most recently, Liu et al. collected data on 81 polymers and 1221 solvents and created more easily interpretable regression models to predict Chi, Hildebrand, and Hansen parameters.18 They featurized their polymers and solvents using RDKit-generated chemical fingerprints, such as atom counts, densities, and weighted sums, and second-order features generated from the 3D structures of trimers and solvents, such as the LUMO, HOMO, and heat of formation.
Meanwhile, previous work by Chandrasekaran et al.,19 inspired by the promise of machine learning in the materials domain,20,21 showed that a deep neural network trained on a data set of 4595 polymers and 24 solvents could dramatically outperform the Hildebrand approach to estimating solubility.1,11 This method utilized chemical features to represent polymers and a one-hot (label-based) encoding to represent the solvents, which were used to train a multilayer perceptron neural network that could classify whether a particular polymer–solvent combination was soluble or insoluble. They also found that a Hildebrand Gaussian process model's classification accuracy was much worse than that of the neural network, correctly classifying good solvents only 50% of the time and bad solvents 70% of the time, as opposed to the over 90% accuracy of the neural network. The downside to their neural network approach, however, was that solubility estimations could be performed for only the 24 solvents within the training data due to the one-hot encoding; i.e., predictions could not be generalized to cases outside the list of 24 solvents.
In the present work, we demonstrate a method to generalize machine learning models to any solvent so that predictions can be made on previously unseen solvents. Moreover, we perform a critical analysis of the strengths and weaknesses of any such data-driven approach and identify situations where caution is mandated. First, a data set of 3373 polymers and 51 solvents was collected. Next, we structurally fingerprinted the polymers and solvents using a hierarchical methodology that is generalizable to any polymer–solvent pair, unlike the one-hot encoding method. Finally, these fingerprints and solubility data were used to train a random forest classifier and a deep neural network binary classifier that predict whether a polymer–solvent pair is soluble or insoluble. A production version of the random forest model has been deployed to our online polymer informatics platform, Polymer Genome (https://www.polymergenome.org).22,23
In the Methods section we provide details on the data set, the featurization of the polymers and solvents, and the model architectures. In the Results and discussion section, we analyze the performance of the two machine learning models with respect to the quality and diversity of the data they have been exposed to, the differences between the two models, areas of caution, and potential next steps.
The training data set consisted of 3373 polymers and 51 solvents (see Table S1, ESI†), making up 11913 soluble pairs and 8843 insoluble pairs. An additional 2909 polymers and 7 solvents (see Table S2, ESI†), making up 7736 soluble data points and 1129 insoluble data points, were tested with the final model. These 2909 polymers and 7 solvents did not have instances of being both soluble and insoluble; thus, they were excluded from the training data to prevent the model from incorrectly learning that a specific polymer or solvent is always one class (either soluble or insoluble). A full analysis of why this was done is available in the ESI (Fig. S1 and S2†).
For solvents, either one-hot encoding or structural fingerprinting was used. One-hot encoding creates 51 columns (corresponding to the number of training solvents). Each row represents a polymer/solvent combination, with the column corresponding to the solvent being 1 and all other solvent columns being 0. The structural fingerprint uses the same three hierarchical levels of descriptors, comprising 155 features, inspired by previous work on polymer fingerprinting.22,30 There are differences between the solvent and polymer fingerprints, however. For instance, the solvent fingerprints sample separate blocks, and some QSPR descriptors are not used, such as the fraction of atoms that are part of side chains and the length of the largest side chain.
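A minimal sketch of the one-hot scheme described above (the solvent names are illustrative; the actual 51-solvent list is given in Table S1, ESI†):

```python
import numpy as np

# Illustrative subset of the training solvents; the real encoding uses all 51.
SOLVENTS = ["water", "toluene", "chloroform", "methanol"]

def one_hot_solvent(name: str) -> np.ndarray:
    """Return a vector with a 1 in the queried solvent's column and 0 elsewhere.
    A solvent absent from the training list cannot be encoded -- the limitation
    that motivates the structural fingerprint."""
    vec = np.zeros(len(SOLVENTS))
    vec[SOLVENTS.index(name)] = 1.0
    return vec

print(one_hot_solvent("toluene"))  # [0. 1. 0. 0.]
```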
To determine whether our hierarchical solvent fingerprint adequately represents the solvents, and to visualize the high-dimensional fingerprint in two dimensions, we used a Uniform Manifold Approximation and Projection (UMAP) plot, shown in Fig. 1. From this plot, we find that similar solvents cluster together in the fingerprint space. Acids, like nitric acid or formic acid, are clustered together, as are cyclic compounds like benzene and phenol. There were two groupings of halides: ones with carbons that contain hydrogen atoms, like chloroform, and ones without hydrogen atoms, like trichloro(fluoro)methane. The second grouping also contains carbon disulfide. There is a small grouping of dimethyl solvents that contain P, S, and/or N along with a double-bonded O. These include solvents like dimethylformamide (DMF) and dimethylsulfoxide (DMSO). Finally, there are the alkanes, alcohols, and other solvents that contain only carbon or carbon and oxygen. Note that UMAP plots are stochastic and affected by hyperparameters such as n_neighbors (a constraint on the size of the local neighborhood UMAP considers), min_dist (the minimum distance apart points are allowed to be in the low-dimensional representation), and metric (how distance is computed in the ambient space of the input data). We used default values except for n_neighbors = 3 and min_dist = 0.99, with a cosine metric, for this plot.31 The n_neighbors value of 3 allowed us to observe the local manifold structure as opposed to the global structure. The min_dist of 0.99 was chosen to prevent points from being too close together for visualization purposes. The cosine metric was chosen because it has been found to be a better metric for chemical similarity calculations than Euclidean or Manhattan metrics.32
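A sketch of the projection settings described above, using the umap-learn package and assuming the solvent fingerprints are available as a matrix (random data stands in for the real 51 × 155 fingerprint array):

```python
import numpy as np
import umap  # umap-learn package

# Placeholder for the real (51 solvents x 155 features) fingerprint matrix.
fingerprints = np.random.rand(51, 155)

# Settings from the text: small n_neighbors emphasizes local structure,
# large min_dist spreads points out for visualization, and the cosine
# metric suits chemical-similarity comparisons.
reducer = umap.UMAP(n_neighbors=3, min_dist=0.99, metric="cosine")
embedding = reducer.fit_transform(fingerprints)  # (51, 2) coordinates to plot
```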
Performance is quantified using precision, recall, and the F1 score, defined in terms of true positives (TP), false positives (FP), and false negatives (FN):

$$\text{Precision} = \frac{TP}{TP + FP} \quad (1)$$

$$\text{Recall} = \frac{TP}{TP + FN} \quad (2)$$

$$F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \quad (3)$$
When assessing performance, the training data set of 3373 polymers and 51 solvents was split using five-fold cross validation. Three types of splits were done: a random split stratified by solubility, a split by polymer, and a split by solvent. The random split stratified by solubility was meant to measure how the model performed on previously seen polymers and solvents. The split by polymer was meant to measure how the model performed on unseen polymers but previously seen solvents. The split by solvent was meant to measure how the model performed on unseen solvents but previously seen polymers. Each split type created five folds of equal size. The model was trained on four of the folds and tested on the fifth, repeated five times so that each fold served as the test fold once.
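The three split types can be sketched with scikit-learn utilities; this is a plausible implementation under stated assumptions (synthetic data, illustrative group labels), not necessarily the authors' exact code:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, GroupKFold

rng = np.random.default_rng(0)
X = rng.random((1000, 310))               # placeholder polymer+solvent features
y = rng.integers(0, 2, 1000)              # 1 = soluble, 0 = insoluble
polymer_ids = rng.integers(0, 300, 1000)  # which polymer each row contains
solvent_ids = rng.integers(0, 51, 1000)   # which solvent each row contains

# (1) Random split stratified by solubility: polymers/solvents may be seen.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
splits_random = skf.split(X, y)

# (2) Split by polymer: no test polymer appears in the training folds.
splits_polymer = GroupKFold(n_splits=5).split(X, y, groups=polymer_ids)

# (3) Split by solvent: no test solvent appears in the training folds.
for train_idx, test_idx in GroupKFold(n_splits=5).split(X, y, groups=solvent_ids):
    pass  # train on train_idx rows, report per-class F1 on test_idx rows
```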
For the neural network, we used the same multilayer perceptron architecture and hyperparameters as described by Chandrasekaran et al. The neural network consists of two input branches, one for the solvent fingerprints and the other for the polymer fingerprints. Two hidden layers are used for the solvent branch and three are used for the polymer branch. Each hidden layer has 100 neurons. A final layer of 20 neurons is added to the end of each branch; the two branches are then concatenated and fed into a final set of layers, consisting of four hidden layers with 100 neurons each. The final output of the NN is a single neuron with a sigmoid activation function. An activation value of 1 means the polymer is soluble in the solvent and an activation of 0 means it is insoluble; the threshold to differentiate between the two is 0.5. Each hidden layer in the NN uses a parametrized rectified linear unit (PReLU) activation function.19 The input layer for the solvent is altered when the structural fingerprint is used instead of the one-hot encoding. This neural network will be referred to as SolNet2 to distinguish it from an earlier model (SolNet) that used one-hot fingerprinting for solvents.
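A sketch of this two-branch architecture in Keras; the framework choice and the polymer fingerprint width are assumptions (the original implementation is described by Chandrasekaran et al.19):

```python
from tensorflow import keras
from tensorflow.keras import layers

def branch(inputs, n_hidden: int):
    """A stack of 100-neuron PReLU hidden layers ending in a 20-neuron layer."""
    x = inputs
    for _ in range(n_hidden):
        x = layers.PReLU()(layers.Dense(100)(x))
    return layers.PReLU()(layers.Dense(20)(x))

solvent_in = keras.Input(shape=(155,))  # structural solvent fingerprint
polymer_in = keras.Input(shape=(945,))  # polymer fingerprint width: placeholder

# Two hidden layers for the solvent branch, three for the polymer branch.
x = layers.Concatenate()([branch(solvent_in, 2), branch(polymer_in, 3)])
for _ in range(4):  # final set: four hidden layers of 100 neurons each
    x = layers.PReLU()(layers.Dense(100)(x))
out = layers.Dense(1, activation="sigmoid")(x)  # >0.5 -> soluble

model = keras.Model([solvent_in, polymer_in], out)
model.compile(optimizer="adam", loss="binary_crossentropy")
```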
For the random forest classifier, scikit-learn was used.33,34 For hyperparameter optimization, we used BayesSearchCV (Bayesian optimization over hyperparameters, implemented in the scikit-optimize package35) with five-fold cross validation for 25 iterations, searching over max_depth = [None, 50, 100], max_features = [0.5, 0.75, ‘log2’, ‘sqrt’], min_samples_leaf = [2, 6], and n_estimators = [100, 500]. We performed hyperparameter optimization for each split type, but found that the model test F1 score did not significantly improve, whereas model training was dramatically slowed. As such, we used the default parameters of n_estimators = 100, max_depth = None, max_features = ‘sqrt’, and min_samples_leaf = 1 for the reported trees.
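A sketch of this setup, assuming the scikit-optimize BayesSearchCV API and an F1 scoring choice (the training data below is synthetic):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from skopt import BayesSearchCV

rng = np.random.default_rng(0)
X_train = rng.random((500, 310))   # placeholder features
y_train = rng.integers(0, 2, 500)  # placeholder solubility labels

# Search space from the text; lists are treated as categorical dimensions.
# Note: depending on the skopt version, None inside a categorical list may
# need special handling.
search = BayesSearchCV(
    RandomForestClassifier(),
    {
        "max_depth": [None, 50, 100],
        "max_features": [0.5, 0.75, "log2", "sqrt"],
        "min_samples_leaf": [2, 6],
        "n_estimators": [100, 500],
    },
    n_iter=25,
    cv=5,
    scoring="f1",  # assumption: F1 was the selection metric
)
# search.fit(X_train, y_train)  # slow; gains were found to be negligible

# Default parameters actually used for the reported trees:
model = RandomForestClassifier(n_estimators=100, max_depth=None,
                               max_features="sqrt", min_samples_leaf=1)
model.fit(X_train, y_train)
```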
The splitting method used for five-fold cross validation dictated the model F1 score. For instance, when a random split stratified by solubility was used to split the training and test sets, the F1 score for insoluble pairing predictions was 0.866 ± 0.020 for SolNet. When the training and test data were split so that no polymer in the training data was in the test data, the F1 score for insoluble pairing predictions dropped to 0.840 ± 0.010 for SolNet. This was likely because, in the former case, i.e., when a random split was used, the model had seen most of the polymers and solvents before in the training data. Even though each test polymer–solvent combination was unique, the model could more readily extrapolate to this new case based on similar experiences with the same polymer and solvent in the past. In a polymer split, the model had not seen the test polymers before, so it had to extrapolate solely from training polymers with similar chemistry to the test polymers. The differences in polymer chemistry would then have an impact on the predictive accuracy. This trend held for the models trained using a structural fingerprint too.
The model performed slightly better with the solvent structural encoding (SolNet2) and had tighter error bars. For example, the F1 score for insoluble predictions for a random split was 0.877 ± 0.004, as opposed to 0.866 ± 0.020 for SolNet. However, this difference is small, and the structural encoding requires a dramatically larger fingerprint vector. The main strength of the structural encoding was that it was able to generalize to solvents outside of the data set. In fact, SolNet is incapable of handling the solvent split case (which is why Fig. 2 does not have this case represented). SolNet2, on the other hand, can handle solvent splits. Nevertheless, even in this case, a solvent split resulted in poorer performance than the random split and polymer split (with a smaller F1 score of 0.612 ± 0.121 for insoluble pairings). This was likely due to three reasons: solvent scarcity, polymer scarcity within splits, and a greater imbalance in classes.
Regarding solvent scarcity, only 51 solvents were in the data set, in contrast to the 3373 polymers. During a polymer split, the test polymers are more likely to have chemically similar neighbors that have been paired with the specific solvent before, whereas this was unlikely in the solvent split due to solvent scarcity. This may explain why the polymer split F1 score was between 0.83 and 0.89 for each of the five test folds, versus 0.34 to 0.74 for the solvent split test folds.
The second potential reason for the low F1 score of the solvent split could be polymer scarcity within a train-test split. For instance, a polymer in the test set may have nine instances in the train set, but each instance in the train set may be soluble while the test pairing is insoluble. The model will then incorrectly classify the test polymer–solvent combination as soluble when it is actually insoluble, because it has never seen an instance of this polymer being insoluble. For each split there are between 122 and 194 polymers that have soluble pairings in the test set but only insoluble pairings in the train set, and between 48 and 994 polymers that have insoluble pairings in the test set but only soluble pairings in the train set. This does not fully explain the variation in F1 score between splits, though. For instance, while one split had 994 polymers with an insoluble pairing in the test set but only soluble pairings in the train set, the model correctly predicted the class in 83.6% of these test pairings, and the total insoluble F1 score for that split was 0.734. Conversely, splits with relatively few of these cases (where <5% of the data has a class for a polymer in the test split that has not been seen in the train split) have low F1 scores of 0.32 to 0.48. Thus, while polymer scarcity within a train-test split may contribute to the variation in the F1 score, it does not seem to be the dominant reason for the discrepancy.
The final possibility is class imbalance within the test splits. The F1 metric is a harmonic mean of precision and recall, and precision is highly affected by class imbalance.36 Some generated examples of the effect of data imbalance on F1 scores are shown in Fig. S3 in the ESI.† During stratified random splits, 57% of the test data is soluble and 43% is insoluble. Polymer splits are very close to this as well (55% to 59% of test data is soluble). For both these split types, the variability in class imbalance was low and the classes were not heavily imbalanced, so variability in precision due to class imbalance was expected to be low. For solvent splits, the soluble test data makes up 41%, 50%, 50%, 71%, and 75% of the data, with insoluble data being the remainder. When the soluble percentage is close to 50%, the precision values for soluble and insoluble predictions are almost equivalent (a 0.02 difference between precision values). When the soluble percentage is larger (71% or 75%), the precision values for soluble predictions are 0.81 and 0.90, respectively, whereas those for insoluble predictions are 0.40 and 0.34, respectively. This large difference is likely due to the class imbalance. However, after assessing the cross-fold variability in recall (shown in Fig. S4, ESI†), this effect does not seem to be the main cause of the variability in F1 score across folds. As such, we believe the dominant issue was solvent scarcity, as discussed further below.
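A small synthetic example of this effect, in the spirit of Fig. S3 (ESI†); the error rates and counts are invented for illustration:

```python
def insoluble_precision(n_soluble: int, n_insoluble: int,
                        recall_soluble: float = 0.9,
                        recall_insoluble: float = 0.9) -> float:
    """Precision of the insoluble class for a classifier with fixed
    per-class recall, as a function of the class counts."""
    tp = recall_insoluble * n_insoluble    # correctly flagged insoluble
    fp = (1 - recall_soluble) * n_soluble  # soluble mislabeled as insoluble
    return tp / (tp + fp)

# Same per-class behavior, different class balance:
print(insoluble_precision(500, 500))  # balanced (50/50)   -> 0.90
print(insoluble_precision(750, 250))  # imbalanced (75/25) -> 0.75
```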
The random forest classifier outperforms the SolNet2 model across all methods of splitting data. For instance, in the random split stratified by solubility the average F1 score was 0.935 ± 0.003 for soluble pairings vs. the SolNet2 performance of 0.904 ± 0.005. The largest improvement was in the group split by solvents, with a jump for insoluble predictions from 0.612 ± 0.121 in the SolNet2 model to 0.682 ± 0.097 in the random forest model and a jump for soluble predictions from 0.539 ± 0.195 in the SolNet2 model to 0.725 ± 0.137 in the random forest model. The superior performance of the random forest classifier could be because the random forest's ensemble method reduces the chance of over-fitting or because the dataset is too small for a neural network to outperform the random forest.
As seen in Fig. 4, there are, in general, two cases that occur depending on the solvent: one where the recall for one class is high and the other low (e.g., for many alcohols), and another where the soluble and insoluble recalls are roughly the same (e.g., for several aromatic compounds). In either case, the model performs best when (1) the left-out solvent is similar to some training solvents, (2) the polymers accompanying the left-out solvent are similar to the polymers accompanying the similar training solvents, and (3) the polymer–solvent combinations of the test cases fall in the same class as the similar pairs in the train set. The model performs worst when a test solvent has similar training solvents paired with similar training polymers, but the classifications of those similar training combinations are opposite (i.e., counter to requirement (3) above). For instance, despite the alcohols having many similar solvents in the training data (according to Fig. 1 and 4), the model often struggled to correctly classify soluble pairings. For the alcohol methanol, this is due to the presence of water in the training set. Based on Fig. 1 and a Tanimoto similarity score of 0.53 (eqn (S1), where 0 is completely different and 1 is chemically equivalent, ESI†), water is somewhat chemically similar to methanol. The water data also contains a significant number of training polymers that are chemically similar to the test polymers paired with methanol. For insoluble methanol–polymer combinations these chemically similar water–polymer combinations were the same class, but for soluble methanol–polymer pairings the chemically similar water–polymer combinations were the opposite class (i.e., insoluble). Thus, the model often performs poorly for these soluble methanol–polymer pairings because it has been trained on oppositely classified but chemically similar water–polymer pairs, hence the low soluble recall. A more in-depth learning curve analysis of the LOO analysis is shown in the “Learning Curve” section in the ESI.†
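The similarity comparison can be sketched with RDKit Morgan fingerprints; note that eqn (S1) (ESI†) defines this paper's own score, so the value below will not reproduce the quoted 0.53, and the solvent pair is chosen purely to illustrate the metric:

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def tanimoto(smiles_a: str, smiles_b: str) -> float:
    """Tanimoto similarity between Morgan fingerprints (radius 2, 2048 bits):
    0 means no shared bits, 1 means identical fingerprints."""
    fp_a, fp_b = (
        AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(s), 2, nBits=2048)
        for s in (smiles_a, smiles_b)
    )
    return DataStructs.TanimotoSimilarity(fp_a, fp_b)

print(tanimoto("CO", "CCO"))  # methanol vs. ethanol
```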
Finally, we assessed the random forest model's performance on the additional 2909 polymers and 7 solvents, making up 7736 soluble data points and 1129 insoluble data points, that were held out since the polymers and solvents only had one class associated with them. Confusion matrices for the predictions on this set are shown in Fig. 5. The top left matrix represents all data, the top right represents the pairings where both the polymer and solvent were unseen by the model, the bottom right represents the pairings where the solvents had been seen by the model but the polymers had not been, and the bottom left represents the pairings where the polymers had been seen by the model but the solvents had not been.
The recall for these predictions was 0.624 for insoluble points and 0.862 for soluble points. Due to the extremely large imbalance in soluble to insoluble data points (87% of the data is soluble), assessing precision and F1 score would not be useful. All insoluble pairs were instances where the polymer had not been seen by the model before, but the solvent had. This suggests we should have seen an insoluble recall closer to the polymer-split levels (0.83–0.89). The lower recall could be due to a number of factors, such as an extremely new polymer space or inaccuracies in the experimental data set. With regard to the new polymer space, these new polymers had 42 new, unique chemical fragments that were not seen in the training data and 32 unique atomic triplets, and the element iodine appeared only once in the training data vs. six times in the test data. On the other hand, there were only 31 test polymers with no similar training polymers (defined as a Tanimoto similarity score, eqn (S1) (ESI†), greater than 0.75). As such, it is also possible that, while similar polymers had been seen before, they had not been tested in solvents similar to the test solvents, or, if they had been tested in similar solvents, the true solubility was opposite in those similar solvents. With respect to inaccuracies in the experimental data set, this is hard to measure without manual experiments. We collaborated with experimentalists to test some combinations at room temperature and found that a small percentage did conflict. However, polymer solubility can be significantly impacted by molecular weight and concentration, so two experiments could have different results depending on these values. Future work could explore how this uncertainty impacts model performance, as has been done with Tg predictions on polymers, and it is important that future work considers these critical engineering parameters.37
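For reference, the per-class recall values quoted above can be computed as follows (synthetic labels shown; the actual evaluation covers the 8865 held-out pairs):

```python
from sklearn.metrics import confusion_matrix, recall_score

y_true = [1, 1, 0, 1, 0, 1]  # 1 = soluble, 0 = insoluble (synthetic labels)
y_pred = [1, 0, 0, 1, 1, 1]

print(confusion_matrix(y_true, y_pred))           # rows: true, columns: predicted
print(recall_score(y_true, y_pred, pos_label=0))  # insoluble recall: 0.5
print(recall_score(y_true, y_pred, pos_label=1))  # soluble recall: 0.75
```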
This work has a few key takeaways:
1. Our generalizable solvent fingerprint was as effective as a previously used one-hot encoding method and could predict on solvents unseen by the model during training, unlike the one-hot encoding method.
2. The models were more accurate for predictions on unseen polymers vs. unseen solvents, likely due to the larger polymer data set size (3373) compared to the solvent data set size (51).
3. The model performance was modest on unseen solvents, likely because the model either had not seen similar solvents during training, or because the polymer–solvent combinations seen during training were too dissimilar (either chemically or by classification) to the test polymer–solvent combinations.
4. The random forest model outperformed the neural network based models in all splitting methods.
5. The random forest models can make accurate predictions for polymer–solvent pairings where both the polymer and solvent had never been seen if they were trained on similar polymer–solvent pairings (chemically and by classification).
The purely data-driven framework described in this work can be systematically improved as it is exposed to an even larger quantity and diversity of data. It is critical that future data incorporate more solvents, and that the quantity and chemical diversity of polymers tested in each of these solvents be increased. Future work should also focus on including other features that affect solubility, such as concentration, molecular weight, and temperature, and could expand to include partial solubility and solvent mixtures. Additionally, lower-fidelity solubility predictions (e.g., Hildebrand predictions) may be useful for creating multi-fidelity models with increased data set size and diversity. Finally, an active learning approach might be beneficial, where the chemical space of both polymers and solvents is evaluated in order to choose representative samples to experiment on.
Footnote
† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d2cp03735a