Takayoshi
Yoshimura
a,
Hiromoto
Kato
a,
Shunto
Oikawa
b,
Taichi
Inagaki
a,
Shigehito
Asano
c,
Tetsunori
Sugawara
c,
Tomoyuki
Miyao
b,
Takamitsu
Matsubara
b,
Hiroharu
Ajiro
b,
Mikiya
Fujii
b,
Yu-ya
Ohnishi
d and
Miho
Hatanaka
*ae
aGraduate School of Science and Technology, Keio University, 3-14-1 Hiyoshi, Kohoku-ku, Yokohama, Kanagawa 223-8522, Japan. E-mail: miho_hatanaka@keio.jp
bNara Institute of Science and Technology, 8916-5 Takayama-cho, Ikoma, Nara 630-0192, Japan
cFine Chemical Process Department, JSR Corporation, 100 Kawajiri-cho, Yokkaichi, Mie 510-8552, Japan
dMaterials Informatics Initiative, RD technology and digital transformation center, JSR Corporation, 3-103-9 Tonomachi, Kawasaki-ku, Kawasaki, Kanagawa 210-0821, Japan
eInstitute for Molecular Science, 38 NishigoNaka, Myodaiji, Okazaki, Aichi 444-8585, Japan
First published on 28th November 2024
Polymer informatics, which involves applying data-driven science to polymers, has attracted considerable research interest. However, developing adequate descriptors for polymers, particularly copolymers, to facilitate machine learning (ML) models with limited datasets remains a challenge. To address this issue, we computed sets of parameters, including reaction energies and activation barriers of elementary reactions in the early stage of radical polymerization, for 2500 radical–monomer pairs derived from 50 commercially available monomers and constructed an open database named “Copolymer Descriptor Database”. Furthermore, we built ML models using our descriptors as explanatory variables and physical properties such as the reactivity ratio, monomer conversion, monomer composition ratio, and molecular weight as objective variables. These models achieved high predictive accuracy, demonstrating the potential of our descriptors to advance the field of polymer informatics.
Another critical aspect of polymer informatics is the definition of appropriate descriptors. For instance, BIGSMILES24–27 and Polymer Markup Language28 have emerged as string-based descriptors for polymers, serving well in database construction and forming the backbone of data-driven research. However, because string-based descriptors do not directly represent molecular structures or properties, a vast amount of data is typically required to build machine learning (ML) models using these features as explanatory variables. Alternatively, attentive fingerprints of monomers and dimers have been proposed as descriptors for graph attention networks aimed at predicting the physical properties of copolymers.29 However, constructing such networks requires a substantial dataset of up to 4000 data points. Given the difficulties in accumulating extensive data in polymer-synthesis laboratories, even with automated experimental equipment, developing effective descriptors is imperative for constructing ML models capable of predicting the physical properties of polymers, even with limited datasets. In particular, in the search for polymers or copolymers with specific properties obtained by varying monomers or monomer pairs along with process variables (synthesis conditions), selecting appropriate descriptors for monomers or monomer pairs is crucial, because ML models must exhibit high prediction accuracy for untested monomers or monomer pairs to ensure reliable extrapolation accuracy. In our previous study,30 we demonstrated that incorporating density functional theory (DFT) parameters, including the activation barriers and reaction energies of the initial stage of radical polymerization, as descriptors, along with process variables, improved the extrapolation accuracies of copolymer properties (monomer conversion and monomer composition ratio) for monomer pairs not included in the training data. One of the notable achievements of our descriptor was that ML models trained on data from copolymers composed of several monomers could accurately predict the properties of copolymers containing an untested monomer with a different skeleton, a task that conventional descriptors found challenging.30
Therefore, in this study, we computed the descriptors of copolymers, including reaction energies and activation barriers, for 2500 radical–monomer pairs of 50 commercially available monomer species and compiled them into an open database named the “Copolymer Descriptor Database (CopDDB)”. The remainder of this paper is organized as follows: first, the radical–monomer pair descriptors and their calculation methods are described, followed by an explanation of the conversion of radical–monomer pair descriptors to those for copolymers. We then observed the chemical space defined by the descriptors and used them as explanatory variables of ML models in three case studies. In the first and second case studies, we constructed ML models to predict several physical properties using the descriptors in the CopDDB as explanatory variables and validated their predictive abilities. The objective variable in the first case study was the reactivity ratio r1 from the literature, which is an important parameter used to estimate the monomer composition ratio from the monomer ratio to be prepared (i.e., the copolymerization composition curve). The objective variables in the second case study were the physical properties of binary copolymers, such as the monomer conversion, the monomer composition ratio, and the molecular weights, measured under different monomers and process variables in our previous study.30 In the third case study, we applied Bayesian optimization (BO) with only one step, called one-shot BO, to find the appropriate process variables to achieve the desired physical property using an untested monomer. Through these three case studies described, we have demonstrated the usefulness of CopDDB descriptors.
The reactivity parameters represent the relative electronic energies of the elementary reactions at the initial stage of radical polymerization shown in Fig. 1. Polymerization begins with the addition of an initiator radical to a monomer, which is usually a barrierless process, followed by repeated C–C bond formation with another monomer. Therefore, the reaction energies for the addition of a model initiator radical (the methyl radical) to M1 at the head and tail positions (ΔEhead and ΔEtail in Fig. 1, respectively) were calculated. The conformations of the methyl radical adduct to M1 were generated using an automated reaction-path search method called the multicomponent artificial force-induced reaction (MC-AFIR) method.32,33 We randomly selected one of the M1 conformers and placed a methyl radical at a random position, then performed the AFIR calculation with the artificial force between M1 and the methyl radical. This process was repeated until three successive AFIR searches found the already obtained geometries. The most stable geometries of the head- and tail-adducts were used to calculate ΔEhead and ΔEtail. Next, the local minima (LMs) and transition states (TSs) along the C–C bond formation pathway (head-to-tail addition) between the head adduct ( in Fig. 1) and monomer M2 were computed. The reaction pathways for the head-to-tail addition, starting from 20 random initial alignments, were explored using the MC-AFIR method. The obtained pathways (AFIR pathways) were usually close to the real reaction pathways and the energy maximum point on the AFIR pathway could be appropriate for an initial geometry for the geometry optimization of TS. In our case, however, the reaction coordinate was very simple, and the reactive C–C bond distance in a TS was found to be close to 2.28 Å according to our preliminary calculations. Thus, we selected a geometry on the AFIR pathway where the reactive C–C bond distance was close to 2.28 Å, used it as the initial structure for the relaxation calculation by fixing the C–C bond distance, and then carried out the geometry optimization without any constraints (note that the criterion to select an initial geometry was not critical but often affected the computational time). The most stable TS was selected when multiple TSs were obtained. Intrinsic reaction coordinate (IRC) calculations34 were performed to confirm the TS and obtain the structures of the corresponding precursors and products. All precursors, TSs, and products were confirmed by frequency calculations. The energies of the precursor and TS relative to the dissociation limits of
and M2 (ΔEprecursor and ΔETS, respectively) and the activation barrier ΔEbarrier (i.e., the energy difference between the precursor and TS) were collected. All AFIR calculations and geometry optimizations (without constraints) were performed at the GFN2-xTB35 and B3LYP-D3/def2SVP36–39 levels, respectively. The energies and energy gradients were calculated at the GFN2-xTB level using the ORCA program40 and at the B3LYP-D3 level using the Gaussian16 program.41 These computations supported the AFIR calculations and geometry optimizations conducted through the GRRM program.42 The activation energies for subsequent polymer elongation were not collected, as the reactivity of the propagating radical is primarily considered to depend on the identity of the monomer unit at the propagating end rather than the chain length and composition. As shown in Table S2,† the activation barriers for C–C bond formation at the same propagating end (with different chains) were similar.
The electronic parameters include frontier orbital energies and energy gaps, calculated for the most stable conformers at the B3LYP-D3/def2SVP level.36–39 The singly occupied molecular orbital (SOMO) and lowest unoccupied molecular orbital (LUMO) energy levels of and the highest occupied molecular orbital (HOMO) and LUMO energy levels of M2 were measured. The energy gaps between the SOMO of
and the HOMO of M2, and between the SOMO of
and the LUMO of M2, were determined.
The geometrical parameters include the reactive C–C bond distance and the dihedral angle at the TS of the head-to-tail addition (<C1–C2–C3–C4 in Fig. 1), as well as the volumes and percent buried volumes (%Vbur)43 of the most stable conformers of and M2. The volume and %Vbur represent the bulkiness of the molecule itself and around the reactive center, respectively. The volume was calculated using the Gaussian16 program, and %Vbur was determined using our own program. In addition, conventional parameters obtained with the ChemDraw program were included, such as the sum of the molecular masses of
and M2, and the partition coefficients (log
P) for M1 and M2. In summary, CopDDB includes 24 descriptors: seven reactive parameters, six electronic parameters, eight geometrical parameters, and three conventional parameters, for 2500 radical–monomer pairs.
In this study, we present three case studies that utilize the ML approach with CopDDB parameters as descriptors. The first case study focused on the reactivity ratio r1, which represents the ratio of the reaction rate constants k11/k12. Thus, the DFT-based parameters (seven reactivity, six electronic, and eight geometrical parameters) for (, M1) and (
, M2) were used as descriptors for the M1, M2 monomer pairs, resulting in a total of 38 parameter sets. In the second case study, we focused on the physical properties of five binary copolymers synthesized by combining methyl methacrylate (MMA) with five monomers: styrene (St), glycidyl methacrylate (GMA), 4-acetoxystyrene (PACS), tetrahydrofurfuryl methacrylate (THFMA), and cyclohexyl methacrylate (CHMA). By classifying St, GMA, PACS, THFMA, and CHMA as M1 and MMA as M2, the parameter sets for (
, M1), (
, M2), and (
, M1) were utilized as descriptors for the binary copolymers. The same descriptor preprocessing was applied in the third study, which validated the prediction accuracy for another M1, 2-hydroxyethyl methacrylate (HEMA), using one-shot BO.
Here, a “good” descriptor set is expected to yield similar values for molecules with similar properties. For example, monomers with a styrene skeleton, such as St and PACS, are expected to exhibit similar properties compared to other monomers, such as acrylates and methacrylates. In other words, these monomers should be located close together in the chemical space spanned by the descriptor set. To verify this, we focused on the two-dimensional distribution of 2500 radical–monomer pair data points within the chemical space defined by all the descriptors in the CopDDB. As shown in Fig. 3a, the data points corresponding to St and PACS radicals (i.e., radical–monomer pairs of (St*, M1) and (PACS*, M1); M1 = 50 monomers) were localized in a specific region. Similarly, the data points for their monomers (i.e., (, St) and (
, PACS)) were also located relatively close by. In contrast, within the chemical space defined by a conventional descriptor set (a list of RDKit descriptors for two monomers M1 and M2; see Table S5† for the details), the data points for St and PACS were separated as shown in Fig. 3b. This trend was also observed for alkoxyacrylates (see Fig. S6†). This indicates that the CopDDB descriptors differ significantly from the RDKit descriptors, which were designed to describe the properties of general molecules, and that the CopDDB descriptors effectively capture the specific properties as monomers for radical copolymerization.
![]() | ||
Fig. 3 Visualization of 2500 radical–monomer pair data points within the chemical space spanned by all the descriptors in the CopDDB (a) and the RDKit descriptors listed in Table S5† (b). The data points were projected into two dimensions using t-distributed stochastic neighbor embedding (t-SNE),44 with standardized descriptors and a perplexity parameter of 50. In panel (a), the data points for (St*, M1), (PACS*, M1), (![]() ![]() ![]() |
Next, we examined the 2500 radical–monomer pair data points in the chemical space defined by the CopDDB descriptors, along with the representative descriptors. As shown in Fig. 4, data points with high or low activation barriers (depicted in red and blue, respectively) were scattered throughout the chemical space. Focusing on the reaction energy, there were some clusters of data corresponding to highly exothermic (in blue), moderately exothermic (in light green), and slightly endothermic reactions (in red). However, data points with different reaction energies were not widely separated. In contrast, for the SOMO–HOMO and SOMO–LUMO orbital energy gaps, the data points with similar gaps tended to be close together, while those with different gaps were generally further apart. These findings suggest that the orbital energy gaps are likely major factors in characterizing the chemical space defined by the CopDDB descriptors. Nonetheless, other descriptors also could contribute to representing the complexity of this multidimensional chemical space.
![]() | ||
Fig. 5 Distributions of mean r1 values: (a) for all 114 datasets and (b) for the 97 datasets after preprocessing. |
To evaluate the effectiveness of our descriptors, we constructed ML models to predict r1 values using two different sets of descriptors and compared their prediction accuracies. The first set, obtained from CopDDB, combined DFT-based descriptors for (, M1) and (
, M2). The second set, derived from the RDKit package, combined descriptors for M1 and M2 (details of the descriptors are shown in Tables S4 and S5†). Before constructing the ML models, the descriptors were preprocessed as follows. Descriptors with a correlation coefficient above 0.9 were reduced to one. For those with a correlation coefficient in the range of 0.8–0.9, we manually selected which descriptor to remove based on scatter plots. As a result, 22 DFT-based and 24 RDKit descriptors were retained. The 97 data points were divided into subsets of 87 (∼90%) for training and 10 (∼10%) for testing. Random forest (RF) regression models were constructed, with hyperparameters optimized through 5-fold cross-validation of the training data using the Optuna package.47 Model performance was validated using R2 scores.
Fig. 6 shows the y–y plots of r1 values predicted by RF models using the two sets of descriptors. The model using DFT-based descriptors achieved higher R2 scores for both training and test data (0.86 and 0.84, respectively) compared to the RDKit descriptors (0.80 and 0.65, respectively). Thus, our descriptors demonstrated excellent performance for ML models, offering better predictive accuracy.
![]() | ||
Fig. 6 y–y plots of RF models for r1 values constructed using different descriptor sets: (a) DFT-based descriptors from the CopDDB and (b) RDKit descriptors. y–y plots with different train/test splits are also shown in Fig. S7.† |
As discussed in our previous study,30 the DFT-based descriptors demonstrated higher extrapolation accuracy than the RDKit descriptors for predicting MMA_conv., M1_conv., and M1_CR. In this study, we extended this approach to predict molecular weights with high accuracy by utilizing the updated CopDDB descriptors and applying additional feature engineering through dimensional compression. We examined two types of descriptors for M1. One set comprised 66 parameters, which were combinations of the CopDDB descriptors (, M1), (
, MMA), and (MMA*, M1). The other consisted of nine parameters, derived from compressing the CopDDB descriptors (
, M1), (
, MMA), and (MMA*, M1) into three dimensions each using principal component analysis (PCA)48 and variational autoencoder (VAE).49 Details of the dimensional compression using the VAE are shown in Fig. S8.† To estimate extrapolation performance, leave-one-monomer (M1)-out cross-validation (LOOCV) was conducted, following the procedure outlined in Examination 2 of Fig. 3 in ref. 30. The ML models were built using Gaussian progress regression (GPR) with the sum of two Matern kernels, with ν values of 0.5 and 1.5.50
Table 1 shows the R2 scores of the GPR models built using the three types of parameters mentioned above (see Fig. S9† for the y–y plots). The prediction accuracies for M1_conv., MMA_conv., and M1_CR improved when using the compressed parameters, except for M1_conv. with parameters compressed by PCA. For the molecular weights Mn and Mw, the prediction accuracies were dramatically improved with the compressed parameters. This improvement could be attributed to the reduced number of descriptors, which suppressed overfitting of the training data. Although feature engineering, such as dimensional compression, is sometimes necessary, the descriptors in CopDDB remain valuable for developing ML models of various physical properties of copolymers.
Descriptors | Original parameters | 9 parameters compressed by PCA | 9 parameters compressed by VAE |
---|---|---|---|
a The predicted values were the averages of five values with different random numbers in the GPR model. M1 represents five monomer species: St, GMA, CHMA, PACS, and THFMA. | |||
MMA conv. | 0.60 | 0.67 | 0.72 |
M1 conv. | 0.67 | 0.57 | 0.80 |
M1_CR | 0.72 | 0.87 | 0.82 |
M n | 0.35 | 0.76 | 0.67 |
M w | 0.33 | 0.81 | 0.73 |
The three-dimensional latent descriptors compressed using PCA and VAE can be regarded as effective descriptors for improving prediction accuracy by achieving a better balance among all objective variables. To verify the importance of the DFT-based descriptors, the loadings of the compressed DFT-based descriptors obtained through PCA are shown in Fig. S10.† This figure confirms the significance of descriptors related to radical reactivity (SOMO and LUMO of radicals, and ΔEhead), transition state energies (ΔETS and ΔEbarrier), and the partition coefficients (logP). These factors are indeed crucial in polymer chain elongation reactions and provide information that cannot be obtained from experimental process variables alone. The partition coefficients reflect the hydrophilicity or hydrophobicity of monomers and polymers. The balance between these properties is considered to influence the stability of the copolymer's molecular structure, self-assembly, proton transfer with the solvent, and, ultimately, the molecular weights Mn and Mw.
We performed the BO with the target value set to a 50:
50 composition ratio (i.e., 50% M1_CR; M1 = HEMA) for the synthesized copolymer. The objective variable for the GPR model was defined as the squared difference between the target M1_CR and the measured M1_CR (note that this BO procedure followed our previous study,51 in which BO was applied for free radical copolymerization using single-molecular datasets). After training with the initial data, four candidate points, each consisting of five process variables, such as initiator concentration, proportion of HEMA (molar ratio of HEMA to HEMA and MMA in the preparation), reaction temperature, solvent-to-monomer (SM) ratio, and reaction time, were generated within the BO design space shown in Table S6.† These candidate points are summarized in Fig. 7, where the predicted objective variable by the GPR model is color-coded in the partial dependence plots (PDPs),52 which confirms that the proposed process variables were sampled within the optimal region (shown in purple in Fig. 7). Focusing on the proposed process variables, three variables such as initiator concentration, HEMA proportion, and reaction temperature were within a narrow range, indicating that only a specific range of these values could achieve the desired HEMA-CR. In particular, the proposed HEMA proportion was limited, ranging from 45.91% to 47.45%, which indicated that the lower proportion of HEMA than MMA in the preparation was required due to their different reactivities. In contrast, a wide range of values was chosen for SM ratio and reaction time.
![]() | ||
Fig. 7 Four proposed sets of process variables (shown as white stars) on the PDP color maps for each pair of five process variables within the ranges defined in Table S6.† Color-coded numerical values are the predicted means of the GPR, with colors closer to purple corresponding to values closer to the target HEMA_CR. The detailed process variables are shown in Table S7.† |
To validate the proposed process variables (i.e., candidate points), the MMA and HEMA copolymers were synthesized under the four proposed process variable sets. The four observed HEMA_CR values were 49.21%, 49.91%, 47.81%, and 49.21%, all of which were quite close to the desired ratio of 50% (see Table S8† for the detailed results). To validate the effect of reaction time, for which a wide range of values were proposed, we also performed the experiments where only the reaction time was changed to 1/2 and 1/3 of the proposed time. Indeed, the effect of reaction time on the HEMA-CR was quite small, though that on other properties such as the monomer conversions and molecular weights was large as shown in Table S8.† As shown above, we succeeded in proposing process variables to achieve the desired property of copolymers of MMA with an untested M1 monomer (M1 = HEMA) because we were able to transfer the search space of process variables for the tested M1 monomers (M1 = St, GMA, PACS, THFMA, and CHMA) to that for HEMA via CopDDB descriptors.
In our current database, the monomer scope is limited to monofunctional monomers, and the descriptor set includes only the energetics of the early stages of polymerization, without considering the energetics of later stages or variations involving different radical initiators and terminators. Expanding the descriptor set will be essential as we broaden the scope to include polymers with multifunctional monomers and physical properties influenced by the energetics of the later stages of polymerization.
Footnote |
† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d4dd00266k |
This journal is © The Royal Society of Chemistry 2025 |