Haoke Qiu ab, Lunyang Liu *a, Xuepeng Qiu bc, Xuemin Dai c, Xiangling Ji ab and Zhao-Yan Sun *ab
aState Key Laboratory of Polymer Physics and Chemistry, Changchun Institute of Applied Chemistry, Chinese Academy of Sciences, Changchun 130022, China. E-mail: lyliu@ciac.ac.cn; zysun@ciac.ac.cn
bSchool of Applied Chemistry and Engineering, University of Science and Technology of China, Hefei 230026, China
cCAS Key Laboratory of High-Performance Synthetic Rubber and its Composite Materials, Changchun Institute of Applied Chemistry, Chinese Academy of Sciences, Changchun 130022, China
First published on 6th December 2023
Language models exhibit a profound aptitude for addressing multimodal and multidomain challenges, a competency that eludes the majority of off-the-shelf machine learning models. Consequently, language models hold great potential for comprehending the intricate interplay between material compositions and diverse properties, thereby accelerating material design, particularly in the realm of polymers. While past limitations in polymer data hindered the use of data-intensive language models, the growing availability of standardized polymer data and effective data augmentation techniques now opens doors to previously uncharted territories. Here, we present a revolutionary model to enable rapid and precise prediction of Polymer properties via the power of Natural language and Chemical language (PolyNC). To showcase the efficacy of PolyNC, we have meticulously curated a labeled prompt–structure–property corpus encompassing 22 970 polymer data points on a series of essential polymer properties. Through the use of natural language prompts, PolyNC gains a comprehensive understanding of polymer properties, while employing chemical language (SMILES) to describe polymer structures. In a unified text-to-text manner, PolyNC consistently demonstrates exceptional performance on both regression tasks (such as property prediction) and the classification task (polymer classification). Simultaneous and interactive multitask learning enables PolyNC to holistically grasp the structure–property relationships of polymers. Through a combination of experiments and characterizations, the generalization ability of PolyNC has been demonstrated, with attention analysis further indicating that PolyNC effectively learns structural information about polymers from multimodal inputs. This work provides compelling evidence of the potential for deploying end-to-end language models in polymer research, representing a significant advancement in the AI community's dedicated pursuit of advancing polymer science.
Besides, sequence-based language models also offer a promising solution for polymer property prediction. Polymer structures can be effectively represented as language strings such as SMILES,28 SELFIES,29 and BigSMILES,30 and language models impose fewer restrictions on input format than conventional ML models, allowing custom and non-conventional polymeric patterns as inputs, especially when considering stoichiometry, molecular weight and additives. This alleviates the burden on researchers of crafting molecular representations that satisfy specific requirements, such as fixed input shapes, and makes language models highly promising for AI-driven polymer discovery. In the last five years, language-like models have emerged in the field of polymers. For instance, long short-term memory (LSTM)31,32 networks and recurrent neural networks (RNNs)33 have been employed to learn sequence representations of polymer structures. More recently, transformer models specialized for sequence-based tasks, which incorporate self-attention mechanisms,34 have achieved significant success on polymer science challenges in a pretrain-and-finetune manner. Notable examples include TransPolymer15 and polyBERT.16 TransPolymer employed the RoBERTa35 model to generate machine-learned fingerprints in an unsupervised fashion, utilizing 5 million unlabeled polymer SMILES. Similarly, polyBERT employed a DeBERTa-based36 transformer to convert SMILES strings into numerical representations suitable for downstream multi-task regression models. These remarkable findings demonstrate the significant efficacy of the transformer family in the field of polymers.
Indeed, polymer property prediction tasks resemble text-based language model tasks such as machine translation and text generation: text-based language models generate outputs conditioned on given text inputs, and in polymer property prediction, properties are predicted from given prompts and chemical structures. In the past, attempts to rely solely on language models for polymer property prediction were hindered by the scarcity of high-quality labeled polymer datasets,37 whereas the availability of high-quality open-source polymer datasets is now steadily increasing.38–41 More encouragingly, extensive work has shown that data augmentation effectively addresses the scarcity of polymer data,15,42,43 and that harnessing the intelligence of general language models is beneficial for comprehending scientific language.44–47 To the best of our knowledge, a completely end-to-end language-based approach that directly predicts polymer properties from natural and chemical languages, rather than using language models as intermediates connecting molecular structures to downstream models, is currently lacking. This concept parallels how chemists can infer fundamental properties of common molecular structures through visual inspection, without additional analytical characterization (Fig. 1). By integrating natural language, chemical language (e.g., SMILES) and chemical knowledge (properties), language-to-property AI agents hold promise for establishing a multi-domain, multi-task understanding that maps directly from a polymer structure to its diverse properties, presenting an opportunity to drive advancements in existing robo-chemists and autonomous laboratories.48–50 In addition to regression-based property prediction tasks, we also aim for a unified model that can simultaneously handle multiple task types, such as both classification and regression. This ability is one that the milestone models mentioned earlier have yet to explore, owing to the challenges stemming from data distribution shifts51 and the inherent specificity of ML models themselves.
Herein, we propose PolyNC, a fully end-to-end, multi-task language model for polymer discovery. Our model enables the execution of complex polymer discovery workflows with a single model, a previously unreported ability that surpasses even prevailing LLMs such as ChatGPT, Claude, Llama and PaLM, which lack the requisite domain knowledge. Given the limited information chemists can extract directly from SMILES, our model introduces a new paradigm for SMILES-based polymer discovery, design and development, offering remarkable convenience. Compared with descriptor-based and graph-based models, PolyNC exhibits impressive performance across the four tasks central to our polymer research: three property prediction (regression) tasks and one polymer classification task. Handling multi-task and multi-type problems in one model was hitherto unattainable with previous ML models. Notably, PolyNC's ability to generalize to unknown structures is particularly impressive, as confirmed through experimental validation. Attention analysis reveals that this generalization capacity stems from the model's comprehension of both natural language and chemical language. This work extends the powerful natural language understanding capabilities of AI to polymer research, marking an impressive step towards the development of expert models and human-level AI for understanding polymer knowledge.
(a) Glass transition temperature (Tg). Tg is a critical property that characterizes the transition from a rigid, glassy state to a more flexible, rubbery state in polymers, which is essential in understanding the processing, stability, and mechanical behavior of polymers, making it a key parameter in material design and applications.
(b) Band gap of polymer crystals (BC). The BC of polymer crystals refers to the energy difference between the highest occupied molecular orbital (HOMO) and the lowest unoccupied molecular orbital (LUMO) in the crystalline state of a polymer, which is crucial for developing polymer-based electronic and photonic devices, as it influences their performance in areas like organic photovoltaics and light-emitting diodes.
(c) Atomization energy (AE). AE represents the energy required to completely separate the constituent atoms of a polymer molecule. It reflects the strength of the chemical bonds within the polymer and provides insights into its stability and reactivity. Atomization energy is relevant in various aspects of polymer chemistry, including synthesis, degradation, and understanding the relationship between structure and properties.
(d) Heat resistance class (HRC). To assess the capability of PolyNC across various task types, we also established this classification task as an example. HRC refers to the ability of polymers to withstand high temperatures without significant degradation or loss of their essential properties, which is particularly important for high-end polymers like polyimides (PIs). We therefore focus on the heat resistance of PIs in this study, which also lets us probe the performance of PolyNC on classification tasks. Based on the Tg of PIs and industry standards,55 we classify them into three categories: class 1 indicates PIs with a Tg above 400 °C, class 2 represents PIs with a Tg ranging from 300 to 400 °C, and class 3 refers to PIs with a Tg below 300 °C.
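For illustration, these class boundaries translate into a simple rule. The following is a minimal sketch in Python; how values falling exactly on 300 °C or 400 °C are assigned is an assumption, since the text leaves the boundaries open.

```python
# Minimal sketch of the HRC assignment rule described above.
# Boundary handling (Tg exactly 300 or 400 °C) is assumed, not specified.
def heat_resistance_class(tg_celsius: float) -> int:
    """Map a glass transition temperature (°C) to a heat resistance class."""
    if tg_celsius > 400:
        return 1  # class 1: Tg above 400 °C
    elif tg_celsius >= 300:
        return 2  # class 2: Tg between 300 and 400 °C
    else:
        return 3  # class 3: Tg below 300 °C

assert heat_resistance_class(420.0) == 1
assert heat_resistance_class(350.0) == 2
assert heat_resistance_class(250.0) == 3
```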
Due to the data-hungry nature of language models, data augmentation is implemented under an equal mixing strategy56 to improve model performance (a possible augmentation procedure is sketched after Table 1). A summary of the four datasets for downstream tasks is shown in Table 1.
| Property | Source | Unit | # Entries (training/test) | # Aug. times | # Aug. entries |
|---|---|---|---|---|---|
| Tg | DFT & exp. | °C | 685 (615/70) | 10 | 6850 |
| BC | DFT | eV | 236 (212/24) | 20 | 4720 |
| AE | DFT | eV | 390 (351/39) | 15 | 5850 |
| HRC | Exp. | — | 370 (333/37) | 15 | 5550 |
| Total | DFT & exp. | — | 1681 | — | 22 970 |
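This excerpt does not spell out the augmentation procedure itself; a common choice for SMILES data, shown below as a hedged sketch with RDKit, is enumeration of randomized (non-canonical) SMILES. The polymer string and the oversampling loop are illustrative assumptions, not the authors' exact pipeline.

```python
# Hypothetical SMILES-enumeration augmentation (illustrative only; the
# authors' exact procedure is not described in this excerpt).
from rdkit import Chem

def augment_smiles(smiles: str, n_aug: int) -> list[str]:
    """Return up to n_aug distinct randomized SMILES for one structure."""
    mol = Chem.MolFromSmiles(smiles)
    variants: set[str] = set()
    # Oversample random renderings, then deduplicate.
    for _ in range(n_aug * 10):
        variants.add(Chem.MolToSmiles(mol, canonical=False, doRandom=True))
        if len(variants) >= n_aug:
            break
    return sorted(variants)

# e.g. 10 augmented strings per Tg entry, matching "# Aug. times" in Table 1
print(augment_smiles("[*]CC([*])c1ccccc1", n_aug=10))
```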
The augmented dataset was split into a 90% training set (20 673) and a 10% test set (2297). To the best of our knowledge, this is one of the largest labeled datasets available for polymer ML tasks. The input prompts are tokenized at the character level, dividing them into natural language tokens and chemical tokens; this tokenization strategy has been shown to provide better performance and stronger expressive capability. By inspecting the distribution of token counts in the training and validation sets (depicted in the ESI, S1†), we set the input token size to 150 and the output token size to 8 to accommodate all instances. To initialize our model, we considered both t5-base and Text + Chem T5.56,57 The former is a model pre-trained on natural language text, while the latter is pre-trained specifically for chemical text tasks such as molecular description. We found that initializing with weights carrying scientific domain knowledge significantly enhances the performance of the model on polymer domain tasks (as detailed in the Ablation studies section), which implies the transferability of PolyNC to other polymeric tasks.
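As a rough illustration of this length budget, the snippet below pads and truncates prompts to the 150-token input limit. Note that PolyNC uses character-level tokenization separating natural-language and chemical tokens; the stock t5-base SentencePiece tokenizer here is only a stand-in.

```python
# Illustrative length budget (150 input / 8 output tokens) with a stand-in
# tokenizer; PolyNC itself tokenizes at the character level.
from transformers import AutoTokenizer

MAX_INPUT_TOKENS, MAX_OUTPUT_TOKENS = 150, 8

tokenizer = AutoTokenizer.from_pretrained("t5-base")
prompt = "Predict the Tg of the following SMILES: [*]CC([*])c1ccccc1"
enc = tokenizer(prompt,
                max_length=MAX_INPUT_TOKENS,
                padding="max_length",
                truncation=True,
                return_tensors="pt")
print(enc["input_ids"].shape)  # torch.Size([1, 150])
```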
PolyNC uses a full encoder–decoder architecture, with 12 layers and 12 attention heads in each (Fig. 2). The encoder extracts semantic information from the multi-domain input, while the decoder analyzes this semantic information and generates outputs under the given conditions. In both, self-attention captures the relationships among tokens. The decoder additionally employs cross-attention to capture relationships between the input and output sequences, which helps in learning the mapping between natural language, chemical language and the corresponding properties. The output of each attention block passes through a fully connected network that performs a non-linear projection. We set the hidden dimension of PolyNC to 768 and the intermediate feed-forward dimension to 3072, with the ReLU activation function and a dropout rate of 10%. The output head for all tasks is identical to the output layer of t5-base from the huggingface transformers package.58 What sets our work apart from previous efforts (TransPolymer and polyBERT) is the ability to handle multitasking directly in a single unified model without separate regression or classification heads, enabling PolyNC to achieve seamless end-to-end property prediction. Training was performed on two RTX 3090 GPUs using a cosine learning-rate decay schedule with a 20% warmup ratio, which dynamically adjusts the learning rate to speed up convergence and avoid skipping over optimal solutions.
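The hyperparameters above can be expressed as a standard T5 configuration plus a warmup-then-cosine schedule. The sketch below uses stock huggingface utilities; the learning rate and total step count are assumptions, not the authors' exact training script.

```python
# Architecture and schedule from the text, as a hedged sketch.
# Learning rate and total step count are assumptions (not stated here).
import torch
from transformers import (T5Config, T5ForConditionalGeneration,
                          get_cosine_schedule_with_warmup)

config = T5Config(
    num_layers=12,           # encoder layers
    num_decoder_layers=12,   # decoder layers
    num_heads=12,            # attention heads per layer
    d_model=768,             # hidden dimension
    d_ff=3072,               # feed-forward intermediate dimension
    dropout_rate=0.1,        # 10% dropout
    feed_forward_proj="relu",
)
model = T5ForConditionalGeneration(config)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # lr assumed
total_steps = 10_000                                        # assumed
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.2 * total_steps),  # 20% warmup ratio
    num_training_steps=total_steps,
)
```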
Fig. 3 (a) Distributions of chemical space for each dataset based on t-SNE.59 It can be observed that the majority of molecules corresponding to each property are distinct, which aids the models in learning a more comprehensive mapping between molecular structures and properties from limited data. Additionally, for each individual task, the distribution of the training and testing sets is uniform, which helps assess the model's generalization ability. (b) Value distribution of each dataset. The y-axis is plotted on a logarithmic scale. This subfigure highlights the significant differences in the value ranges among the different properties. (c) R2 metric (↑). (d) MAE metric (↓). (e) RMSE metric (↓). PolyNC demonstrated impressive performance in each prediction task, particularly excelling in the multi-property prediction task, showcasing its powerful capability in handling multi-task scenarios.
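For context, a chemical-space map like Fig. 3a can be produced by embedding molecular fingerprints with t-SNE. The hedged sketch below assumes Morgan fingerprints as the featurization, which this excerpt does not specify.

```python
# Hedged sketch of a t-SNE chemical-space map (Fig. 3a style).
# The featurization (Morgan fingerprints) is an assumption.
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from sklearn.manifold import TSNE

smiles = ["[*]CC([*])c1ccccc1", "[*]CC([*])C(=O)OC", "[*]CC([*])C#N"]  # toy set
fps = np.array([AllChem.GetMorganFingerprintAsBitVect(
    Chem.MolFromSmiles(s), radius=2, nBits=2048) for s in smiles])
coords = TSNE(n_components=2, perplexity=2, random_state=0).fit_transform(fps)
print(coords.shape)  # (3, 2)
```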
Under the same chemical space, we trained and validated PolyNC and these baseline models. The descriptors for the scikit-learn-based baselines were computed with RDKit: for each molecule, the Descriptors.descList module yields 209 descriptors, and descriptors containing missing values were excluded, leaving a final set of 197 valid descriptors (the details are publicly available at https://github.com/HKQiu/Unified_ML4Polymers/blob/main/data/exp_val/data_with_descriptors.csv). The descriptors for the DeepChem-based baseline were computed with its default settings, giving 75 features per atom (the details are publicly available at https://github.com/deepchem/deepchem/blob/master/deepchem/feat/graph_features.py#L282). The coefficient of determination (R2), mean absolute error (MAE) and root mean squared error (RMSE) were used as evaluation metrics for these models, as detailed in S3.1.†
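The descriptor pipeline just described reduces to a few lines. The following is a minimal sketch of that procedure (compute every descriptor in Descriptors.descList, then drop columns with missing values), not the authors' exact script.

```python
# Minimal sketch of the RDKit descriptor pipeline described above:
# all Descriptors.descList descriptors, then drop columns with NaNs.
import pandas as pd
from rdkit import Chem
from rdkit.Chem import Descriptors

def descriptor_table(smiles_list: list[str]) -> pd.DataFrame:
    rows = []
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        rows.append({name: fn(mol) for name, fn in Descriptors.descList})
    # Exclude descriptors containing missing values (209 -> 197 in the text).
    return pd.DataFrame(rows).dropna(axis=1)

X = descriptor_table(["[*]CC([*])c1ccccc1", "[*]CC([*])C(=O)OC"])
print(X.shape)
```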
Single-property prediction tasks were used to evaluate model performance on each property. Because the properties of polymers are interdependent, accurately predicting multiple properties simultaneously is also of interest to polymer scientists. We therefore employed a multi-property prediction task (denoted 'All') to test the models' ability to predict multiple properties at once. This task is challenging because the value ranges of the different properties vary significantly, as shown in Fig. 3b, so it serves as an indication of a model's potential in handling multi-task scenarios.
The performance comparison of the different models is shown in Fig. 3c–e and S3.1.† PolyNC performed impressively on every single-property prediction task, showing that an entirely language-based model can achieve prediction accuracy comparable to other ML approaches. In the multi-property scenario, almost all ML models exhibited a decrease in predictive performance relative to the single-task setting, because traditional ML models struggle to fit datasets with different distributions.
Of note, both GCN and PolyNC improved on this task. GCN takes into account the topology and connectivity of molecules, allowing it to extract more useful information than handcrafted descriptors and giving it superior performance in the multi-task setting; this finding also underscores the importance of extracting as much raw information as possible from molecules. Similarly, by learning both natural language and chemical language, PolyNC develops a deeper understanding of the structure–property mapping of polymers and can learn multiple properties of diverse molecules simultaneously. This yields the best overall performance and highlights the significant potential of language models in constructing polymer property–structure landscapes.
We compared the performance of PolyNC with eight baseline models for the classification task: Logistic Regression (LRC), Naive Bayes (NBC), Support Vector Machine (SVC), AdaBoost (ADAC), Decision Tree (DTC), Random Forest (RFC), K-Nearest Neighbors (KNNC) and XGBoost (XGBC). XGBoost was implemented with the xgboost package,62 and the remaining ML models with the scikit-learn package.60 We used four evaluation metrics for classification: Accuracy, Precision, Recall, and F1 score. The HRC dataset is imbalanced, which we addressed by keeping the proportion of each class consistent across the training and testing sets and by assigning a weight to each class i when assessing model performance, as detailed in S3.2.†
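A hedged sketch of this evaluation protocol follows: a stratified split preserves class proportions, and the metrics are averaged with per-class weights. scikit-learn's "weighted" average (class support as the weight) is one reasonable reading of the weighting scheme, which is only detailed in S3.2.

```python
# Stratified split + class-weighted metrics; the "weighted" average here
# (support-weighted) is an assumption about the scheme detailed in S3.2.
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

X = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.6]]  # toy features
y = [1, 1, 2, 2, 3, 3]                          # toy HRC labels

# stratify=y keeps each class's proportion consistent across the splits
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5,
                                          stratify=y, random_state=0)

y_pred = [1, 2, 2]  # stand-in predictions for the three test samples
acc = accuracy_score(y_te, y_pred)
prec, rec, f1, _ = precision_recall_fscore_support(
    y_te, y_pred, average="weighted", zero_division=0)
print(f"Acc={acc:.2f}  P={prec:.2f}  R={rec:.2f}  F1={f1:.2f}")
```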
The performance results of each model are depicted in Fig. 4 and S3.† Impressively, PolyNC achieves the best performance in the HRC task, with all metrics exceeding 0.81, benefiting from the strength of language models in classification tasks such as sentiment classification.63 By comparison, the baseline models achieved an accuracy of around 0.7. Unlike LRC and SVC, PolyNC made no grossly inaccurate predictions, such as assigning class 3 to samples that actually belong to class 1 (or vice versa), even though class 1 had fewer samples and class 3 more. This demonstrates that PolyNC is not significantly affected by imbalanced sample sizes and thus avoids biased outputs. It is also worth noting that during training PolyNC learns all tasks (regression and classification) simultaneously, allowing it to capture correlations between properties. Although the mapping between Tg and HRC is never explicitly given to PolyNC, it spontaneously and implicitly learns these relationships from the latent data. As evidenced in Fig. S2d,† PolyNC tends to underestimate Tg, which in turn affects its classification performance, reflected in misclassifying 44% of class 1 as class 2 and 32% of class 2 as class 3. Despite this negative impact, it also serves as evidence that PolyNC learns the correlations between different tasks. Even under these confounding effects, PolyNC still outperformed the other baseline models in this classification task, making it a unified model capable of handling regression and classification simultaneously. As far as we know, an ML model that can simultaneously handle classification and regression problems in the field of polymers has not been reported before.
Fig. 5 (a) The Polymer Tree of each structure within the training and test datasets of Tg. Different molecules with distinct chemical structures occupy different branches of the Polymer Tree. For instance, the left portion of the graph primarily consists of structures with fewer heterocycles, predominantly linear polymers, while the right portion comprises structures with a higher number of heterocycles (see our repository https://github.com/HKQiu/Unified_ML4Polymers/tree/main/TMAP or https://try-tmap.gdb.tools/tmap/discreet-ammonite-of-mathematics for more details of the Polymer Tree). (b) and (c) Generalization ability of PolyNC in the estimation of Tg as an example. PolyNC demonstrates exceptional performance in predicting the Tg of unknown structures, with deviations of 5 °C and 20 °C for the two samples. Limited generalizability is a universal issue for off-the-shelf ML models,66 and PolyNC may have learned more appropriate polymer representations.
Then, we analyzed the attention scores of PolyNC with respect to the input sequences of these two examples. The encoder of PolyNC consists of 12 attention heads, each focusing on different contexts to extract distinct knowledge (all 12 attention heads of the encoder are depicted in S5†). The attention scores for the fifth and ninth attention heads of PI-1 and PI-2 are shown in Fig. 6. The fifth attention head primarily attends to the tokens adjacent to each token, capturing local environments, while attention head 9 mainly focuses on the tokens themselves. From Fig. 6, we can summarize PolyNC's ability to recognize complex SMILES in three aspects. (1) PolyNC exhibits higher attention scores on the feature groups (imide groups) of PI-1 and PI-2 (the pink and light yellow regions in the figure). (2) PolyNC also assigns higher attention scores to the natural language part and the polymerization sites corresponding to human cues and polymerization information (the light purple region in the figure). (3) The structural difference between PI-1 and PI-2 is an additional benzene ring in PI-2 (the green region in Fig. 6b). Adding a benzene ring changes the order of elements in the SMILES and makes it harder for humans to interpret. However, PolyNC recognizes the benzene ring structure in the complex SMILES, as evident in the corresponding attention matrices in the left panel of Fig. 6b (the black dashed boxes, which exhibit similar attention matrices along the diagonal). Specifically, since the purple benzene ring connects to a different chemical structure (imide structures), the lower edge of the purple attention matrix changes accordingly. These findings suggest that PolyNC possesses intelligent chemical perception, paving the way for recognizing complex SMILES expressions for further molecular property prediction and inverse design.
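For readers who want to reproduce this kind of analysis, huggingface T5 models expose per-layer attention maps through output_attentions=True. The sketch below uses a stand-in t5-base checkpoint and an assumed layer choice, since the text does not state which encoder layer the maps in Fig. 6 come from.

```python
# Hedged sketch of extracting encoder attention maps (Fig. 6 style).
# Checkpoint and layer choice are stand-ins/assumptions.
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")
model.eval()

prompt = "Predict the Tg of the following SMILES: [*]CC([*])c1ccccc1"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    enc_out = model.encoder(**inputs, output_attentions=True)

# One tensor per layer, each shaped (batch, num_heads, seq_len, seq_len).
attn = enc_out.attentions[-1]          # last encoder layer (assumed)
head5, head9 = attn[0, 4], attn[0, 8]  # 5th and 9th heads (0-indexed)
print(head5.shape, head9.shape)
```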
Fig. 6 The attention scores for the fifth (left) and ninth (right) attention heads of PI-1 (a) and PI-2 (b). It can be observed that the fifth attention head primarily attends to the tokens adjacent to each token, while attention head 9 mainly focuses on the tokens themselves. For linear polymers, the order of elements in the SMILES representation aligns with the order of atoms in the molecular structure, making it easy to read the molecular structure from the SMILES. For molecules with many rings, however, things become more complicated: by the rules of SMILES, rings are broken and flattened at specific atoms, so adjacent atoms can appear in non-adjacent positions in the string, which poses challenges for parsing SMILES. The input sequences of PI-1 and PI-2 were 'Predict the Tg of the following SMILES: [*]C1=CC=C(C2=CC=C(N3C(C(C=CC(C4=CC=CC(C(C5[*])=O)=C4C5=O)=C6)=C6C3=O)=O)C=C2)C=C1' and 'Predict the Tg of the following SMILES: [*]C1=CC=C(C2=CC=C(C3=CC=C(N4C(C(C=CC(C5=CC=CC(C(C6[*])=O)=C5C6=O)=C7)=C7C4=O)=O)C=C3)C=C2)C=C1', respectively. PolyNC succeeded in extracting and differentiating sub-functional group information and structural differences directly from SMILES. The attention scores for all 12 attention heads of the encoder can be found in S5.†
To assess the transferability of domain knowledge across domains, we tested two initial weight configurations: one based on natural language weights (t5-base) and the other based on chemical text tasks such as molecular description (Text + Chem T5).44,56 Starting from each configuration, PolyNC was trained under the same hyperparameters, and the learning curves are depicted in Fig. 7b. PolyNC based on the Text + Chem T5 configuration outperforms the t5-base configuration (denoted as T5). This indicates that domain knowledge is preserved during transfer, ultimately enabling the model to develop a stronger understanding of downstream tasks within the same training duration. Particularly encouraging, this finding also points to the potential of fine-tuning PolyNC as a foundation model for other polymeric tasks.
Generative language models produce content probabilistically. To ensure reproducible results, PolyNC is configured at inference time to select the token with the highest probability as the generated content, while also controlling influential factors such as token size. Moreover, language models should report scientific facts objectively, even when given different prompts. We verified the sensitivity of PolyNC to different SMILES of the same molecule, and the results (S7†) show that PolyNC largely returns consistent results across different SMILES of the same molecule. Of note, akin to language models like ChatGPT, PolyNC does not assess the plausibility of its input and will generate an inference for any given input. It is therefore imperative that PolyNC be employed under the supervision of proficient chemists to mitigate potential risks.
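Greedy decoding of this kind is the default when sampling is disabled in the huggingface generate API. The snippet below is a minimal sketch using a stand-in t5-base checkpoint and the 8-token output budget noted earlier.

```python
# Deterministic (greedy) decoding: sampling off, single beam, capped length.
# The checkpoint is a stand-in; substitute the released PolyNC weights.
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

prompt = "Predict the Tg of the following SMILES: [*]CC([*])c1ccccc1"
inputs = tokenizer(prompt, return_tensors="pt")
out = model.generate(**inputs,
                     do_sample=False,   # always pick the top-probability token
                     num_beams=1,       # greedy, not beam search
                     max_new_tokens=8)  # output token budget from the text
print(tokenizer.decode(out[0], skip_special_tokens=True))
```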
… 22 970 entries, where the '*' sign in SMILES represents the polymerization points.
Footnote
† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d3sc05079c
This journal is © The Royal Society of Chemistry 2024