Christopher D. Stubbs,a Yeonjoon Kim,b Ethan C. Quinn,a Raúl Pérez-Soto,a Eugene Y.-X. Chen*a and Seonah Kim*a
aDepartment of Chemistry, Colorado State University, Fort Collins, CO 80523-1872, USA. E-mail: eugene.chen@colostate.edu; seonah.kim@colostate.edu
bDepartment of Chemistry, Pukyong National University, Busan, Republic of Korea
First published on 24th December 2024
Polymer solubility has applications in many important and diverse fields, including microprocessor fabrication, environmental conservation, paint formulation, and drug delivery, but it remains under-explored compared to its relative importance. This can be seen in the relative scarcity of solvent-based systems for recycling plastics, despite a need for efficient and selective methods amid the looming plastics and climate crises. Towards this need for better predictive tools, this work examines the use of classical and deep machine learning (ML) models for predicting categorical solubility in homopolymers and copolymers, with model architectures including random forest (RF), decision tree (DT), naive Bayes, AdaBoost, and graph neural networks (GNNs). We achieve high accuracy for both our homopolymer (82%, RF) and copolymer models (92%, RF) on unseen polymer–solvent systems in our 5-fold cross-validation studies. The relevance and applicability of our homopolymer models are then verified through in-house experiments examining the solubility of common commercial plastics, followed by an explainable AI (XAI) analysis using Shapley Additive Explanations (SHAP), which explores the relative contribution of each feature toward model predictions. We then apply our homopolymer solubility prediction model to remove unwanted or hazardous additives in polyethylene (PE) and polystyrene (PS) waste. This work demonstrates the validity/feasibility of using ML to predict homopolymer solubility, provides novel ML models for the prediction of copolymer solubility, and explains homopolymer model predictions before applying the explained model to a globally relevant waste challenge.
Polymer solubility can be differentiated from small molecule solubility in many ways, significantly impacting its prediction and metrics.12,13 One such difference is the timescale/speed of solute diffusion: because polymer chains are large and susceptible to entanglement, the diffusion of the polymer solute is often far slower than the diffusion of the solvent – which is not universally true for small molecules. Another important distinction between small molecule and polymer solubility is the number of phases typically formed – while small molecules often form 2–3 phases during dissolution, polymers generally form between 4 and 6 phases – which can include a pure polymer layer, an infiltration layer, a gel layer, and a pure solvent layer; this adds significant complexity to theoretical models of solubility.12,13 A third distinction between small molecule and polymer solubility can be seen in the solubility metrics used. In general, because polymers are statistical ensembles of macromolecules, they have qualitatively different dissolution from small molecules; this makes quantifying their degree of solubility difficult as most measures of solubility are not well-defined.12,13 There are three typical measures of polymer solubility: the Flory–Huggins χ parameter, the Hildebrand solubility parameter, and the Hansen solubility parameter.13 Such parameters aim to approximate the change in free energy associated with polymer dissolution, but their methods vary. The Flory–Huggins χ parameter is most commonly defined using a lattice model of polymer solubility, where z solvents (the coordination number) surround each polymer segment, interacting with the polymer to cause an energy change of Δε at temperature T (eqn (1A)).12 In contrast, the Hildebrand solubility parameter (δ) describes polymer solubility using the cohesive energy density (CED), which is the energy required to break all intermolecular interactions per unit volume; in practice, the CED is typically approximated using the heat of vaporization (ΔHvap) and reference volume (V) at temperature T (eqn (1B)).13 The Hansen solubility parameters separate this CED into dispersive (δD), polar (δP), and hydrogen bonding (δH) components (eqn (1C)).13
Eqn (1): definitions of the most common measures of polymer solubility.
$$\chi = \frac{z\,\Delta\varepsilon}{k_{\mathrm{B}}T} \qquad (1\mathrm{A})$$
$$\delta = \sqrt{\mathrm{CED}} \approx \sqrt{\frac{\Delta H_{\mathrm{vap}} - RT}{V}} \qquad (1\mathrm{B})$$
$$\delta^{2} = \delta_{\mathrm{D}}^{2} + \delta_{\mathrm{P}}^{2} + \delta_{\mathrm{H}}^{2} \qquad (1\mathrm{C})$$
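As a brief worked illustration of eqn (1B) (not drawn from this paper; ΔHvap ≈ 38.0 kJ mol−1 and V ≈ 106.3 cm3 mol−1 for toluene at 298 K are assumed from standard reference data):

$$\delta_{\text{toluene}} \approx \sqrt{\frac{38\,000\ \mathrm{J\ mol^{-1}} - (8.314\ \mathrm{J\ mol^{-1}\ K^{-1}})(298\ \mathrm{K})}{1.063\times10^{-4}\ \mathrm{m^{3}\ mol^{-1}}}} \approx 18.3\ \mathrm{MPa^{1/2}}$$

which is close to the commonly tabulated Hildebrand parameter of ∼18.2 MPa1/2 for toluene.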
While all three parameters have found successful applications,2,14,15 each has its shortcomings. The frequently used lattice model of the χ parameter does not account for variations in chain packing across a polymer sample; while alternative formulations exist, there does not appear to be widespread consensus on the most broadly applicable definition.12 Furthermore, the CED in the Hildebrand and Hansen solubility parameters is often defined using the heat of vaporization – which is not defined for all polymers due to sample decomposition before vaporization.13 To avoid these shortcomings and others, we can consider whether a specific polymer is soluble or insoluble in a given solvent. This approach, termed binary solubility labels, is informative and succinct as it dramatically reduces the theoretical complexity of predicting polymer solubility. Despite this reduction in complexity, there are still significant computational challenges in predicting polymer solubility – primarily due to the size of polymer systems. Molecular dynamics (MD) and density functional theory (DFT) calculations are generally too costly to apply to polymer systems of reasonable chain lengths, and performing hundreds of these calculations would be prohibitively expensive. Furthermore, accounting for parameters such as polymer morphology or crystallinity adds even more complexity – motivating the search for a low-cost, high-accuracy approach to predicting polymer solubility.
To fulfill this need, we use machine learning (ML) – a broad class of statistical algorithms designed to make predictions from an input database. ML has received significant attention in recent years due to its speed and accuracy compared to ab initio or semi-empirical methods.16–19 This makes ML particularly well suited for systems where the methods mentioned above are too costly or insufficiently describe the system. In particular, ML is well suited to predicting polymer solubility for solvent-based recycling, given the broad scope of polymers that may exist in current and future waste streams. Chemical information about each polymer/solvent is represented through one or more chemical descriptors for each system. Chemical descriptors indirectly describe a molecule by tabulating electronic, steric, and structure-based properties, using these properties to infer a prediction target (in this case, the binary solubility label). To represent the polymer dissolution process, we argue that one should account for both polymer–solvent energetic interactions (e.g., number of aromatic rings) and for steric interactions and diffusion (e.g., Hall–Kier connectivity and shape indices20), both of which play a crucial role in polymer solubility.13
There have been several reported works on predicting polymer solubility using ML, which is unsurprising given its importance. Most such reports have focused on the previously described Flory–Huggins χ and Hildebrand/Hansen solubility parameters.21–23 In addition to the theoretical problems outlined above regarding polymer vaporization and lattice theory, previous literature has found that data on the Hildebrand and Hansen parameters is scarce and that models built upon them generally show subpar performance of 60–75% accuracy.24 Others have considered binary solubility labels with improved success; Ramprasad and coworkers published one example in 2020, in which the authors used a deep neural network to predict binary solubility labels at relatively high accuracy.25 That work examined a limited number of solvents (24) and a single architecture, limiting the model's predictive ability for unseen or novel solvents. A follow-up by Kern et al. predicted binary solubility labels using an RF model alongside an improved version of the previous neural network.26 However, the solvent scope (51 solvents) and ML architecture scope (2 architectures) were still limited.26
While these works represent informative and significant advances in the field and have informed parts of this work, several areas are not covered, which we address herein. These include homopolymer and copolymer solubility predictions, comparisons of multiple ML architectures, and a thorough analysis of the relationship between descriptor choice and model performance. In particular, the absence of copolymer solubility predictions in literature represents a significant disconnect between state-of-the-art computational polymer chemistry and experimental polymer chemistry, where copolymers play an important role in developing new materials. As polymers with two or more repeat units, copolymers have risen to this prominence due to their stoichiometry-modulated properties and their ability to rapidly self-assemble into sophisticated nanostructures with applications in energy storage and light responsive materials.27–29 While ML predictions have been reported for the thermal,30–32 mechanical,33,34 optical35 and morphological36–38 properties of copolymers, so far there have been no reports predicting copolymer solubility for a broad range of copolymers.39,40 We set out to remedy this gap in the literature, among others, by predicting copolymer solubility for a diverse set of copolymers and associated solvents through both classical and deep ML models.
This report uses classical and graph-based ML models to predict homopolymer and copolymer solubility at high accuracy over various descriptors. We performed in-house experiments to validate our homopolymer models and utilized an explainable artificial intelligence (XAI) analysis to explain model performance. We then predict selective solvents for removing additives from polyethylene and polystyrene, proposing multiple solvent systems for efficient additive removal. Our models make robust and accurate predictions from only user-supplied chemical names and cover a wide range of solvents and polymers, making solubility predictions rapid, accessible, and relevant.
In contrast, the copolymer database contained only 270 copolymer–solvent pairs, far smaller than the homopolymer database due to the relative scarcity of copolymer solubility data (Table 1). It should be noted that here, we define good solvents as those in which a polymer chain expands relative to the pure polymer (soluble), whereas in bad solvents, the polymer chain contracts (insoluble). This definition is primarily theoretical and relates to the energetics of solvation rather than dynamics such as diffusion, but we accept this compromise to increase the tractability of our approach.
For example, the number of hydrogen bond acceptors was chosen as this closely relates to solubility; similarly, the Morgan and RDKit fingerprints were selected to represent structural information directly, which is also closely related to solubility. For every polymer–solvent data point, descriptors were calculated for each monomer and solvent and concatenated together; the descriptors used for monomers/solvents were not necessarily the same. All classical ML models for homopolymers were near-identical to their copolymer counterparts, with copolymer models adding information about stoichiometry (as an array of the comonomer ratios) and sequencing (as an integer representing random/block/alternating – see Fig. S2 in the ESI†).
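As an illustrative sketch of this featurization scheme (the specific descriptors, SMILES, and sequencing encoding below are hypothetical examples, not the full descriptor set used in this work), a single polymer–solvent data point can be built by concatenating monomer and solvent descriptors, with stoichiometry and sequencing appended for copolymers:

```python
# Illustrative featurization of one polymer–solvent data point (hypothetical descriptor subset).
from rdkit import Chem
from rdkit.Chem import Descriptors
import numpy as np

def featurize(smiles: str) -> np.ndarray:
    """Compute a small set of RDKit descriptors for a monomer or solvent."""
    mol = Chem.MolFromSmiles(smiles)
    return np.array([
        Descriptors.NumHAcceptors(mol),  # hydrogen bond acceptors
        Descriptors.NumHDonors(mol),     # hydrogen bond donors
        Descriptors.TPSA(mol),           # topological polar surface area
        Descriptors.FractionCSP3(mol),   # fraction of sp3 carbons
    ])

monomer = featurize("C=Cc1ccccc1")   # styrene
solvent = featurize("Cc1ccccc1")     # toluene
homopolymer_x = np.concatenate([monomer, solvent])

# Copolymer data points append the comonomer descriptors, the comonomer ratio,
# and a sequencing flag (e.g., 0 = random, 1 = block, 2 = alternating; encoding is illustrative).
comonomer = featurize("C=C(C)C(=O)OC")   # methyl methacrylate
copolymer_x = np.concatenate([monomer, comonomer, solvent, [0.5, 0.5], [1]])
```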
The seven descriptor groups used for our classical ML models are referred to by their Python package or by the name of their descriptor class. The molecular descriptors are (1) RDKit, (2) Mordred, and (3) RDKit with Mordred, while the fingerprint descriptors include (4) Morgan fingerprint, (5) RDKit fingerprint, (6) RDKit descriptors with Morgan fingerprint, and (7) RDKit descriptors with RDKit fingerprint. The RDKit descriptors comprised 25 descriptors for each monomer and 24 descriptors per solvent, drawn from the broad array of molecular descriptors available in the RDKit Python package (e.g., topological shape indices, hydrogen bond acceptors/donors, etc.). The pairwise correlations of the RDKit descriptors are shown in Fig. S1 of the ESI,† which also includes the full list of RDKit descriptors. The Mordred descriptors bear many similarities to the RDKit descriptors in that both describe similar features, but the Mordred descriptors are of much higher dimensionality – 885 and 726 descriptors were used to represent each monomer and solvent, respectively. Although we cannot say this conclusively, this does not appear to lead to overfitting, based on the 5-fold cross-validation scores (see Results and discussion for additional details) and the comparable accuracy of much lower-dimensional descriptors (e.g., RDKit). The Morgan and RDKit fingerprint descriptors were used as implemented in the RDKit Python package, with a bit length of 32768 and a radius of 3. This bit length may appear arbitrary at first glance, but it was chosen after examining several smaller candidate bit lengths, which did not cover as much chemical space. In this work, we prioritized chemical space coverage over minimal bit length; this approach appears valid based on our classical model results. All concatenations of different descriptor groups were done linearly (no weighting) without any removal of descriptors, except in the combination of RDKit and Mordred, where all descriptors were manually checked to ensure there was no overlap or ‘double-counting’ of the same features in the model. All descriptors were split into train/test sets using the Scikit-Learn Python package with a fixed random seed (seed of ‘0’) for reproducibility.42
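A minimal sketch of this fingerprint featurization and split is shown below, assuming the stated bit length of 32768, radius of 3, and a Scikit-Learn split with a seed of ‘0’; the toy monomer–solvent pairs, labels, and test fraction are placeholders:

```python
# Morgan fingerprint featurization (32768 bits, radius 3) and a seeded train/test split.
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem
from sklearn.model_selection import train_test_split

def morgan_fp(smiles: str, n_bits: int = 32768, radius: int = 3) -> np.ndarray:
    mol = Chem.MolFromSmiles(smiles)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
    arr = np.zeros((n_bits,), dtype=np.int8)
    DataStructs.ConvertToNumpyArray(fp, arr)
    return arr

pairs = [("C=Cc1ccccc1", "Cc1ccccc1"), ("C=C", "CCO")]   # (monomer, solvent) examples
labels = np.array([1, 0])                                # 1 = soluble, 0 = insoluble (toy labels)

X = np.array([np.concatenate([morgan_fp(m), morgan_fp(s)]) for m, s in pairs])
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, random_state=0)            # fixed seed ('0') for reproducibility
```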
For our graph-based models, descriptor selection was primarily informed by our previous work utilizing a similar architecture to predict small molecule solubility and cetane number.43–45 The atom, bond, and global features used were comparable to previous work, but the input database and chemical space covered differed significantly.
Graph neural network (GNN) models were trained using a different scheme, with an 80/10/10 train/validation/test split and 5-fold cross-validation (70/20/10); the GNN architecture used was inspired by previously published reports.43–45 In our GNN architecture (see Fig. 4), atom/bond/global features are generated for each molecule from its graph representation and then embedded as a 128-dimensional vector (embeddings). These embeddings then undergo a series of message-passing operations to yield a final readout vector. The readout vectors for each monomer–solvent pair are concatenated together and further transformed to produce a soluble/insoluble label. It should be noted that the previously discussed classical model descriptors do not apply to the GNN models presented, as the model input differs too significantly. The GNN models were constructed and trained in Python 3.8.13 using the following packages: TensorFlow 2.9.1,46 Keras 2.9.0,47 RDKit 2022.3.5,48 and Neural Fingerprint (NFP) 0.3.0.49 Model metrics for all models were calculated either manually or via Scikit-Learn functions.42 The GNN used categorical cross-entropy as the loss function and the Adam optimizer with a learning rate of 1.0 × 10−4, a batch size of 1024, and 1000 epochs.
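The stated training configuration can be sketched as follows; only the loss, optimizer, learning rate, batch size, and epoch count are taken from the text, while the dense placeholder model and random data stand in for the full atom/bond/global message-passing network (whose NFP construction is omitted):

```python
# Minimal sketch of the GNN training configuration (categorical cross-entropy,
# Adam at 1e-4, batch size 1024, 1000 epochs); the model below is a placeholder,
# not the message-passing architecture itself.
import numpy as np
from tensorflow import keras

n_samples, readout_dim = 2048, 2 * 128          # concatenated monomer + solvent readouts (128 each)
X = np.random.rand(n_samples, readout_dim).astype("float32")
y = keras.utils.to_categorical(np.random.randint(0, 2, n_samples), num_classes=2)

model = keras.Sequential([
    keras.layers.Dense(128, activation="relu", input_shape=(readout_dim,)),
    keras.layers.Dense(2, activation="softmax"),  # soluble / insoluble
])
model.compile(
    loss="categorical_crossentropy",
    optimizer=keras.optimizers.Adam(learning_rate=1.0e-4),
    metrics=["accuracy"],
)
# Reduce epochs for a quick test; the stated hyperparameters are used here for illustration.
model.fit(X, y, batch_size=1024, epochs=1000, validation_split=0.1, verbose=0)
```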
In each homopolymer experiment, approximately 250 mg of each polymer was stirred in 10 mL of solvent for two hours; following this, the sample was filtered through a 2.5 μm pore-size filter, and the solvent was removed under vacuum. The filtrate and filtered polymer were weighed after drying in a vacuum oven. The mass of filtrate recovered relative to the initial dry polymer mass determined the experimental solubility. If more than 10% of the pre-dissolution mass was recovered as filtrate, the data point was labeled soluble (otherwise, insoluble). Any data points showing swelling were discarded, and visual observation of solution clarity and homogeneity was used to support the soluble/insoluble labeling. It should be noted that some initial experiments were performed with a different polymer mass (500 mg instead of 250 mg) but at the same concentration (25.0 mg polymer per 1 mL solvent). All experimental results can be found in Table S11 of the ESI.†
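The labeling rule above can be expressed as a small helper function (hypothetical, not taken from the paper's code):

```python
# Binary solubility labeling from dry masses measured after filtration:
# "soluble" if the recovered filtrate exceeds 10% of the pre-dissolution polymer mass.
def solubility_label(initial_polymer_mg: float, recovered_filtrate_mg: float,
                     threshold: float = 0.10) -> str:
    if recovered_filtrate_mg / initial_polymer_mg > threshold:
        return "soluble"
    return "insoluble"

print(solubility_label(250.0, 180.0))   # e.g., 250 mg polymer, 180 mg filtrate -> "soluble"
```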
| Dataset | Description | Datapoints | Correct predictions (polymer) | Correct predictions (additive) |
|---|---|---|---|---|
| A | Complete set of additive data | 33 | 33 | 24 |
| B | Data with differing polymer/additive solubility | 13 | 13 | 13 |
To predict polymer and additive (small molecule) solubility, our best homopolymer RF model was combined with a newly developed small molecule model, which used the same RF architecture and descriptors as the homopolymer model but was trained on a different database44 and prediction task (Gibbs free energy of solvation). Additional details on the training and evaluation of this model are located in the ESI† (“Small Molecule Random Forest Performance”). In contrast to the polymer solubility model, the additive solubility model was designed to exclusively predict small molecule solubility, reflected in its differing database and prediction task. All additive solubility predictions had negative ΔGsolvation and so were assigned a binary solubility label of soluble. This assignment matched the literature additive solubility for 6/9 additives in Dataset A and all additives in Dataset B, with the three exceptions having significant hydrogen bonding (azodicarbonamide and melamine) or high halogen content (decabromodiphenyl ether). The polymer model was used exclusively for polymer solubility prediction, and the additive model was used solely for additive solubility prediction.
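A hedged sketch of how the two models might be combined for this task is shown below; the model objects, feature vectors, and the zero-ΔG threshold-to-label conversion are illustrative assumptions consistent with the description above, not the authors' exact code:

```python
# Combine the polymer solubility classifier and the additive (small molecule)
# ΔG_solvation regressor to flag selective solvents: additive soluble, polymer not.
def is_selective_solvent(polymer_features, additive_features,
                         polymer_model, additive_model) -> bool:
    polymer_soluble = polymer_model.predict([polymer_features])[0] == 1   # 1 = soluble
    dg_solv = additive_model.predict([additive_features])[0]              # predicted ΔG of solvation
    additive_soluble = dg_solv < 0.0                                      # negative ΔG -> labeled soluble
    return additive_soluble and not polymer_soluble
```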
In addition to architecture-level comparisons, Fig. 2 also examines the test set accuracy of each descriptor-derived model. We see that the molecular descriptors (Mordred, RDKit) and their combinations significantly outperform fingerprint-based descriptors on average, with the RDKit descriptors performing best using an RF architecture (85% accuracy). The differences between each category's top performers range from <5% to over 20%, demonstrating significant variance in descriptor performance across different architectures. While in some cases (AdaBoost, DT) the fingerprint descriptors outperform the molecular descriptors, the peak fingerprint performance over all models is still 3% lower than the peak molecular performance. Furthermore, for all architectures except naive Bayes, adding molecular descriptors to fingerprint descriptors (the two rightmost bars) yields higher or near-equivalent accuracy compared to the uncombined fingerprint descriptors. This can be rationalized by considering that the top 4 performing descriptors (RDKit, RDKit + Mordred, Morgan FP + RDKit, RDKit FP + RDKit) contain both property-based and structural information, in contrast to the uncombined fingerprint descriptors, which carry only structural information. For instance, uncombined fingerprint descriptors can only implicitly encode molecular properties such as the number of hydrogen bond donors or aromatic rings, which can be highly relevant to solubility. Additionally, the top-performing fingerprint descriptors are those combined with molecular descriptors – omitting these combinations, the highest fingerprint-only accuracy is 77% (Fig. 2a). Comparatively, the most accurate molecular descriptor models achieve 85% accuracy, using solely RDKit descriptors with an RF architecture.
We can also analyze model performance relative to the dimensionality of each descriptor. Descriptor dimensionality is relevant to model prediction as descriptors with high dimensionality can potentially fail to generalize (overfit), weakening a key advantage of machine learning over alternative methods. We find molecular descriptors are relatively low dimensional (49 for RDKit, 1611 for Mordred), whereas fingerprint descriptors are relatively high dimensional (∼32000). Here, the descriptor dimensionality is determined by the number of entries needed to describe an individual datapoint to the model, which is generally the number of monomer descriptors plus the number of solvent descriptors (e.g. there are 25 + 24 = 49 descriptors for the RDKit models and 885 + 726 = 1611 descriptors for the Mordred models). For the molecular descriptors (Fig. 2a–d), we see a <2% accuracy difference between the RDKit and Mordred descriptors for all architectures except naive Bayes, despite a 30× increase in dimensionality. As there is a significant dimensionality increase with minimal change in accuracy, we argue that the dimensionality of the Mordred descriptors is appropriate. Moving to the fingerprint descriptors (Fig. 2a–d), we see a more significant variance in accuracy (2–10%) between descriptors despite relatively constant dimensionality (adding RDKit and Mordred to fingerprints increases dimensionality by <5%).
We performed a cross-validation analysis on the best-performing RF architecture to further analyze model generalizability. Specifically, in Fig. 3, we compare each RF model's test set accuracy to its k-fold cross-validation score, averaged over five folds of the model training set (k = 5). This analysis shows that the average cross-validation scores are high (>75%) and within 5% of the test set accuracy (Fig. 2) for all descriptors. Furthermore, for these RF models, we find that molecular descriptors outperform the fingerprint descriptors in every case, with the lowest-dimensional molecular descriptor model (RDKit) yielding a high cross-validation score (82%). The remaining molecular models (RF with Mordred and RF with RDKit + Mordred) have marginally higher (2–3%) cross-validation scores but much higher dimensionality, as previously discussed (49 for RDKit vs. 1611 for Mordred). Compared to the molecular descriptor models, the two uncombined fingerprint models (Morgan FP, RDKit FP) have much lower cross-validation scores (76% each). Adding RDKit descriptors to the Morgan FP and RDKit FP models improves their cross-validation scores by at least 5%. Additionally, the difference between their test set accuracy and their cross-validation score decreases – implying that adding molecular descriptors to fingerprint models makes these models more accurate and representative of the entire dataset, in agreement with our comments on Fig. 2 above. From this analysis, we conclude that an RF model with RDKit descriptors is precise, highly accurate, and potentially generalizable in predicting homopolymer solubility, and that adding molecular descriptors to fingerprint-based models can improve model accuracy and generalizability.
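The reported 5-fold cross-validation can be reproduced in outline with Scikit-Learn as follows; the synthetic features and labels are placeholders for the real featurized training set, and the RF hyperparameters are illustrative:

```python
# 5-fold cross-validation of an RF classifier with Scikit-Learn (toy data as placeholders).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X_train = rng.random((200, 49))            # e.g., 49 RDKit descriptors per data point
y_train = rng.integers(0, 2, size=200)     # binary solubility labels (toy)

rf = RandomForestClassifier(n_estimators=100, random_state=0)
scores = cross_val_score(rf, X_train, y_train, cv=5, scoring="accuracy")
print(f"mean 5-fold CV accuracy: {scores.mean():.2f} ± {scores.std():.2f}")
```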
To ensure that our homopolymer models applied to real-world plastics, we performed in-house experiments to measure the solubility of commercial plastics. These experiments examined the dissolution of single homopolymers in select solvents, with the homopolymers and solvents chosen based on their ubiquity. Our results are summarized in Table S11.† Our best classical homopolymer ML model (RF with RDKit descriptors) achieved 79% experimental accuracy, demonstrating the predictive power of our RF model when combined with the easily calculated and low-dimensional RDKit descriptors chosen.
In addition to the classical homopolymer ML models, a more sophisticated ML model (a graph neural network, GNN) was also used to predict homopolymer solubility. Previous work from our group (predicting bond dissociation enthalpy, small molecule solubility, and cetane number) inspired the GNN architecture shown in Fig. 4.43–45 Unlike their classical counterparts, graph neural networks (GNNs) are a class of neural networks that use chemical bonds and atom information directly as model input rather than indirectly through descriptors. In addition to atom and bond information, other global descriptors can be provided to add chemical context. In this case, we use the number of hydrogen bond acceptors/donors, the Labute accessible surface area (ASA), and the topological polarizable surface area (TPSA) as global descriptors for solubility. These descriptors were chosen for their relevance to solubility. For example, hydrogen bonding between a solute and solvent can provide a robust enthalpic contribution to the free energy of dissolution, compounded by the fact that these intermolecular interactions are significant for long polymer chains. The Labute ASA approximates the available surface area for solvent interaction, while the TPSA can approximate membrane permeability.53 The TPSA, in particular, has been used in previous models of polymer solubility.54 By using these molecular-level global descriptors alongside the atom-and-bond level molecular graph, our GNN is believed to capture both local and global solubility information.
Fig. 4 Overview of the GNN architecture used for homopolymer solubility.43–45
We find a 5-fold cross-validation accuracy of 81% for the GNN, compared to 82% for the best classical model (RF with RDKit descriptors). The similar performance between the GNN and our best classical model is somewhat surprising, especially given that the RF model with RDKit descriptors has far fewer descriptors and model parameters than the GNN. One possible explanation is that the input dataset is not yet large enough for the GNN to learn polymer solubility deeply and adjust parameters. Further comparisons with the classical case are challenging, as our GNN only has one architecture which cannot be trivially separated into individual components. Nevertheless, we developed this deep model to predict polymer solubility as it possesses multiple advantages over its classical counterparts, and a similar architecture has succeeded in predicting small molecule solubility.44 The first GNN advantage is flexibility: most classical ML models cannot be trivially modified without undoing previous model refinement and development, while most GNNs can be modified to suit the task or dataset.
Additionally, deep model performance is expected to scale well with database size – thus, we anticipate that with more input data points, the GNN model will outperform its classical counterparts, which are generally superior for smaller datasets. Lastly, deep ML models simply have more (hyper)parameters to tune, allowing them to potentially better approximate high-dimensional problems. With this in mind, we present the developed GNN as a proof-of-concept that can undergo significant fine-tuning to (potentially) improve accuracy over current classical ML models.
To explain experimental performance, we first analyzed the predicted solubility of polystyrene in toluene and poly(lauryl methacrylate) in ethyl acetate (Fig. 5). Our model prediction is accurate in both cases, though the most impactful features differ significantly in magnitude and type. We first consider the solubility of polystyrene in toluene. As expected for a polymer with a pendant aromatic ring, polystyrene dissolves in toluene. In terms of model predictions, solvent connectivity/shape indices (ChiNv and kappaN) and monomer TPSA significantly increase the probability of a soluble prediction, P (soluble), leading to an approximately 99% chance that the model will predict the combination to be soluble (Fig. 5). In contrast, the fraction of sp3 carbons in the monomer (0.00), the number of heteroatoms in the solvent (0), and the Chi4v of the monomer (0.59) decrease P (soluble), though with a much lower cumulative impact than the solvent connectivity/shape indices – leading to the high net P (soluble) of 0.99. Since styrene and toluene have aromatic rings that can favorably interact, this prediction is reasonable and agrees with both our experimental results and our input database of polymer solubility.
We next examine the solubility of poly(lauryl methacrylate) in ethyl acetate. In contrast to the polystyrene/toluene combination, monomer features, followed by solvent features, contribute most to the low P (soluble) of 0.35. This leads to an insoluble prediction that agrees with our input database. Monomer connectivity/shape indices and a large number of rotatable bonds are the most impactful features and strongly decrease P (soluble), while the monomer TPSA (26.30) and solvent chi1v (1.90) moderately increase the probability of a soluble prediction. This can be rationalized by examining the structures of the monomer and solvent. Lauryl methacrylate has a long twelve-carbon chain, which strongly influences its connectivity/shape indices and number of rotatable bonds, supporting the importance of these features. Ethyl acetate, on the other hand, is a relatively small molecule with only two C2 chains, reflected in the solvent's chi3v (0.35) and kappa2 (2.69). As one might expect, monomer and solvent shape play an important role in solubility for these two cases – but whether these trends hold for the entire dataset remains unclear.
To answer this question, we also performed an aggregate SHAP analysis (Fig. 6). We find that the trends in the polystyrene/toluene case generally hold for the entire dataset, as the solvent connectivity/shape indices and number of heteroatoms have the largest SHAP values, followed by the monomer TPSA, the monomer fraction of sp3 carbons, and the solvent Lipinski hydrogen bond donors. Since solvents are much smaller and generally diffuse faster than long polymer chains, solvent shape and connectivity descriptors play an important role in model prediction. Monomer shape and connectivity descriptors are also impactful, as they make up the majority of the 11th to 20th largest SHAP values (not shown), but they have smaller average SHAP values than the solvent shape/connectivity descriptors, which fall within the top 10 largest SHAP values and make up all of the top 5 (Fig. 6a).
In addition to the averaged analysis in Fig. 6a, we consider the distribution of SHAP values as feature values change (Fig. 6b). From this, we see distinct relationships between high/low feature values and positive/negative SHAP values, with vertical line widths representing population density. For solvent chi1v and chi0v, we observe that low feature values typically decrease the probability of a soluble prediction, whereas the opposite trend is present for solvent kappa2. While solvent kappa1 and kappa3 show similar trends to solvent chi1v/chi0v, the number of heteroatoms in the solvent shows two distinct clusters. This feature often minimally decreases P (soluble) in solvents with a low number of heteroatoms, but the overall P (soluble) increases in solvents with a larger number of heteroatoms.
In comparison, a high monomer TPSA value substantially decreases the probability of a soluble prediction, whereas a low value has an opposite and lesser impact, possibly due to a mismatch between monomer and solvent TPSA distributions in our dataset. The fraction of sp3 carbons in the monomer has a much less pronounced impact, with the bulk of data points having minimal SHAP values. Still, there are cases for which this descriptor becomes more important, as seen by the tailing toward both positive and negative SHAP values. Lastly, the number of Lipinski hydrogen bond donors per solvent minimally increases P (soluble) at low values but moderately decreases P (soluble) at high values – this is theorized to be due to a mismatch in polarity between monomer and solvent, as in the TPSA case.
From the SHAP analysis above, we can consider the most impactful model features and general trends in feature contributions. We find that, on average, the solvent connectivity, shape, and number of heteroatoms contribute the most to predicted polymer solubility, followed by the monomer TPSA and fraction of sp3 carbons. Surprisingly, the solvent's number of Lipinski hydrogen bond donors is less impactful, potentially because mismatches between solvent branching/polarity and monomer branching/polarity could lead to limited solvent diffusion – which severely limits the number of hydrogen bonds that can form. Although SHAP values measure feature contributions to model predictions rather than to experimental solubility, given the high accuracy of our model on a diverse set of monomers and solvents (431 and 175 unique, respectively), we are confident that our SHAP analysis can be used to propose design strategies for polymer solubility. Specifically, we recommend that a good solvent for a polymer should have a similar shape and degree of branching to the polymer's monomer (Fig. 5), and ideally, the polymer's monomer should have a low TPSA value (Fig. 6). As previously stated, solvent shape and degree of branching appear significantly more important than the number of hydrogen bond donors/acceptors. Our recommendations for choosing a poor solvent follow a similar rationale, and we recommend that a poor solvent should have a different shape and degree of branching compared to the polymer's monomer. Our findings agree with the familiar adage of ‘like dissolves like’ when considering molecular shape and connectivity, affirming that the recommendations from our SHAP analysis make chemical sense.
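The per-prediction and aggregate analyses above could be reproduced in outline with the shap package roughly as follows; the toy data, feature names, and RF settings are placeholders, not the authors' pipeline, and the shap return format varies between versions:

```python
# Hedged sketch of a SHAP analysis on an RF solubility classifier with TreeExplainer.
import numpy as np
import shap
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.random((200, 4))                       # placeholder features
y = rng.integers(0, 2, 200)                    # placeholder solubility labels
feature_names = ["solvent_chi1v", "solvent_kappa2",
                 "monomer_TPSA", "monomer_FractionCSP3"]   # illustrative names

rf = RandomForestClassifier(random_state=0).fit(X, y)
explainer = shap.TreeExplainer(rf)
sv = explainer.shap_values(X)
# Older shap versions return a list of per-class arrays; newer versions return one 3D array.
sv_soluble = sv[1] if isinstance(sv, list) else sv[..., 1]

# Aggregate view: mean |SHAP| ranking and per-feature value distributions, as in Fig. 6.
shap.summary_plot(sv_soluble, X, feature_names=feature_names)
```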
We first examine our polymer–additive–solvent system for polystyrene, one of the most well-studied polymers, chosen here for its relative simplicity. In this case, the polystyrene additive (stearic acid) is a fatty acid added to polystyrene as a lubricant.10 This lubricant reduces internal or external friction for the polymer, preventing polymer films from sticking or decreasing thermal damage under high shear stress.10 While stearic acid is non-toxic,55 any additive presence limits future recycling or upcycling applications, motivating its removal. To remove stearic acid, we utilize solid–liquid extraction by dissolving the additive (but not polystyrene) in diethyl ether, separating this solution from the polymer, and precipitating stearic acid from the solution by solvent evaporation, cooling, or addition of water.10 For this system, our small molecule and polymer solubility models agree with literature data, and we thus believe that our proposed solvent system is reasonable.
In addition to polystyrene, we predicted two selective solvent systems for polyethylene, which, like polystyrene, is a cornerstone of modern materials. However, the prevalence of polyethylene and the breadth of its applications make recycling and proper valorization difficult yet important. To limit the scope of these challenges, in this example we restrict ourselves to minimally branched polyethylene (HDPE), modeled as its ethylene monomer. While the material properties of branched PE have spawned rich applications in many industries, the broader aim of our polymer solubility models is to capture chemical effects first, rather than data-scarce material effects such as polymer chain branching (which are not yet sufficiently represented in our dataset). Despite this challenge, we report two examples where polyethylene is separated from two potentially hazardous additives, Sudan I and 2,4-dihydroxybenzophenone.56–59
The first system removes a colorant (Sudan I) from polyethylene by adding acetone and precipitating the additive using the same method as the polystyrene example above. Colorants are additives that change the visible color of polymers and potentially increase heat/light stability.10 While colorants such as Sudan I may be well-suited for their initial application, colorant removal is vital to ensuring optical clarity and consistency in recycled polymers, and Sudan I itself is a potential carcinogen, mutagen, and genotoxin.56
The second system presented uses ethanol to dissolve 2,4-dihydroxybenzophenone, a stabilizer. Stabilizers generally increase polymer resistance to light and heat in a more targeted and effective manner than colorants, potentially improving or maintaining mechanical properties.10 While this is generally beneficial, stabilizers such as 2,4-dihydroxybenzophenone may pose health or environmental risks. A related compound used in sunscreens, oxybenzone, has been banned in Hawaii for its potential danger to coral, and oxybenzone metabolizes to 2,4-dihydroxybenzophenone in vitro.57–59 While this is not conclusive evidence of risks in using 2,4-dihydroxybenzophenone, it is enough to merit an investigation of its removal. We, therefore, propose the removal of 2,4-dihydroxybenzophenone from polyethylene by the addition of ethanol followed by precipitation, as previously discussed.
While the above examples may appear trivial at first glance, given the deep well of knowledge regarding PE and PS, there is no fundamental limitation to applying our methodology to arbitrary polymers with arbitrary solvents other than data scarcity. Using a data-driven approach rather than intuition or prior knowledge, one could identify multiple candidate solvents to remove banned or hazardous additives from multiple polymers within hours rather than weeks or months of testing. Furthermore, one can use our approach in predictive modeling for materials design by identifying additive removal pathways at the design stage rather than after a plastic has been produced. Lastly, while in this work we focus on solid–liquid extraction, where only the additive dissolves, our models can also be used for dissolution-precipitation, where both the polymer and additive dissolve. Following dissolution, adding a nonsolvent or evaporating the solvent causes the polymer to precipitate – leading to the effective removal of the additive.10 This method would potentially be more efficient for densely packed or high molecular weight polymers but was not further examined due to limited data.
From the cross-validation results in Fig. 8, all the descriptors examined achieve a cross-validation score of over 75%. We also see similar trends in descriptor performance compared to the homopolymer RF models. This is somewhat surprising, as even though the model architecture (RF) and prediction target (polymer solubility) have not changed, the database and model input have. First, considering molecular descriptors, we find that RDKit again offers the best balance of model accuracy and descriptor dimensionality, as its cross-validation score is only 1% below the much higher-dimensional Mordred and RDKit + Mordred models. As in the homopolymer case, the uncombined fingerprint models (Morgan FP, RDKit FP) achieve the lowest overall performance, at 81% and 77% cross-validation scores. Furthermore, we see the same trend in the combined fingerprint models as in the homopolymer models – adding RDKit descriptors to fingerprint models improves their cross-validation score by at least 5%. This further supports the claim that our RF models and descriptors are robust and generalizable, as we see near-identical trends in descriptor performance despite significantly different database and model inputs. Although the improved accuracy of the copolymer models compared to the homopolymer models may appear counterintuitive, the smaller copolymer database is less chemically diverse than the larger homopolymer database. This makes for an easier prediction task and leads to higher model accuracy, explaining why our copolymer models perform well despite the additional structural complexity of copolymers.
To further expand the scope of copolymer architectures investigated, we also developed a new GNN architecture to predict copolymer solubility. Compared to our homopolymer GNN model, we use three neural networks (one for each component) instead of two, and we also use a weighted average to combine the two comonomer neural networks, taking into account the copolymer ratio (Fig. 9). As far as the authors know, this approach to predicting copolymer solubility has not been reported in the current literature. While the reported 5-fold cross-validation score of the copolymer GNN model (86%) is lower than the copolymer RF model with RDKit descriptors (92%), the presented GNN architecture has not been subject to years of refinement as the most common classical ML models have. Furthermore, similar to our homopolymer GNN, model performance is expected to improve with database size. Given this, GNN models of homopolymer and copolymer solubility may have the potential to surpass their classical counterparts, given sufficient honing.
Fig. 9 The modified copolymer architecture used for the copolymer GNN model. Note that the message passing structure and some solvent vector operations are omitted for clarity. See Fig. 4 for additional model details, such as the conversion of monomer/solvent descriptors into a 128-dimensional vector.
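The stoichiometry-weighted combination of the comonomer readouts can be illustrated on plain arrays standing in for the 128-dimensional readout vectors (an illustration of the idea, not the authors' exact implementation):

```python
# Stoichiometry-weighted combination of the two comonomer readout vectors.
import numpy as np

def combine_comonomer_readouts(readout_a: np.ndarray, readout_b: np.ndarray,
                               ratio_a: float) -> np.ndarray:
    """Weighted average of the comonomer readouts by molar ratio (ratio_b = 1 - ratio_a)."""
    return ratio_a * readout_a + (1.0 - ratio_a) * readout_b

readout_a = np.random.rand(128)   # comonomer A readout (placeholder)
readout_b = np.random.rand(128)   # comonomer B readout (placeholder)
copolymer_readout = combine_comonomer_readouts(readout_a, readout_b, ratio_a=0.7)
# The combined vector would then be concatenated with the solvent readout before the
# final soluble/insoluble classification layers (cf. Fig. 9).
```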
Footnote
† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d4dd00290c
This journal is © The Royal Society of Chemistry 2025