Jiajun Zhou a, Yijie Yang a, Austin M. Mroz ab and Kim E. Jelfs *a
aDepartment of Chemistry, Molecular Sciences Research Hub, Imperial College London, White City Campus, Wood Lane, London, W12 0BZ, UK. E-mail: k.jelfs@imperial.ac.uk
bI-X Centre for AI in Science, Imperial College London, White City Campus, Wood Lane, London, W12 0BZ, UK
First published on 28th November 2024
Polymers play a crucial role in a wide array of applications due to their diverse and tunable properties. Establishing the relationship between polymer representations and their properties is central to the computational design and screening of potential polymers via machine learning, and the quality of the representation significantly influences the effectiveness of these computational methods. Here, we present a self-supervised contrastive learning paradigm, PolyCL, for learning robust and high-quality polymer representations without the need for labels. Our model combines explicit and implicit augmentation strategies for improved learning performance. The results demonstrate that our model achieves either better, or highly competitive, performance on transfer learning tasks as a feature extractor, without an overcomplicated training strategy or hyperparameter optimisation. To further enhance the efficacy of our model, we conducted extensive analyses of the augmentation combinations used in contrastive learning, identifying the most effective combination to maximise PolyCL's performance.
The design of polymer representations, which refers to the machine-readable way that the molecular features of polymers are encoded, is critical for the performance of property prediction models. Conventional approaches rely on handcrafted fingerprints, where structural information is captured through manually designed descriptors.16 Several types of handcrafted fingerprints17–19 and refined fingerprinting strategies10,15 within this category have found success in the polymer literature. However, handcrafted fingerprints are typically designed from the expert's chemical intuition and heuristic principles; in addition to the potential to introduce bias, their construction is labour-intensive and time-consuming.
Deep neural networks have been increasingly used to automatically extract dense molecular representations from polymers.20 This approach leverages the power of deep learning to alleviate the aforementioned challenges associated with manual feature extraction. Polymers can be abstracted into molecular graphs.13,21,22 Alternatively, molecules can be converted to one-dimensional sequence representations, such as SMILES.23 Here, polymer-SMILES representations are used, which include a special bracketed token "[*]" to represent connection points between monomers and so reproduce the repeating nature of these materials (example shown in Fig. 1). For example, a Long Short-Term Memory (LSTM) model was trained on polymers represented as SMILES strings for property prediction.10,24 In addition, BigSMILES was designed to extend SMILES for the representation of the stochasticity of polymer molecules by introducing additional notations.14 Deep neural networks have also been employed with tailored representations of polymers to predict specific properties, such as degree of polymerization, monomer stoichiometry, gas permeability and radius of gyration.25–27
Fig. 1 Example of (a) SMILES of vinyl chloride and (b) polymer-SMILES of polyvinyl chloride, along with the corresponding chemical structures.
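As a concrete illustration of the strings depicted in Fig. 1, the short Python sketch below (not part of the PolyCL code; the example repeat-unit string is an assumption) parses a monomer SMILES and a polymer-SMILES repeat unit with RDKit, which treats the "[*]" connection points as wildcard (dummy) atoms.

```python
# A minimal sketch (assumed example strings, not taken from the PolyCL repository)
# contrasting a monomer SMILES with a polymer-SMILES repeat unit that uses "[*]"
# connection points, parsed and canonicalised with RDKit.
from rdkit import Chem

monomer_smiles = "C=CCl"         # vinyl chloride, as in Fig. 1(a)
polymer_smiles = "[*]CC([*])Cl"  # poly(vinyl chloride) repeat unit, as in Fig. 1(b)

for label, smi in [("monomer", monomer_smiles), ("repeat unit", polymer_smiles)]:
    mol = Chem.MolFromSmiles(smi)            # "[*]" is parsed as a wildcard (dummy) atom
    print(f"{label:>12}: {smi}  ->  canonical: {Chem.MolToSmiles(mol)}")
```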
Machine learning-based predictive models are typically trained in a supervised fashion and act as automatic feature extractors. While this supervised training pattern is beneficial for specific downstream tasks, it may lead to learnt representations exhibiting domain-dependent characteristics and suffering from limited generalisability to other tasks.28–31 Further, supervised learning methods rely on labelled data of both high quantity and high quality. Within (polymer) chemistry, acquiring high-quality, labelled data is resource-intensive. The scarcity of labelled data in chemistry32 may lead to overfitting and, therefore, impair the model's generalisability to other data in the target domain.33 The limitations of supervised learning directly motivate self-supervised learning for chemical property prediction.
Self-supervised models learn from the inherent structure of the data, without the requirement of data labelling.34 In polymer science, the value of a universal, target-agnostic representation, a key feature of self-supervised learning, has already been observed.35 Initial demonstrations of self-supervised learning in polymer science have largely focused upon transformer architectures.36 Masked language modelling,37 where random tokens are obscured from the input and must be predicted by the transformer, served as the strategy guiding the pre-training of these transformers. This strategy proved effective in Transpolymer and polyBERT for producing machine-learnt polymer representations with transformer architectures.16,35 However, these works did not directly assess the effectiveness of the representation learnt by their pre-training strategies and only inferred the quality of the learned representation from the performance of the downstream tasks using the model. Yet, inferring representation quality from model performance is especially challenging when the model includes complex ML techniques, such as data augmentation (including non-canonical SMILES strings to increase the size of the dataset)35 and multi-task learning (training downstream tasks on all datasets simultaneously).16
Contrastive learning is among the most competitive forms of self-supervised learning, learning meaningful representations by comparing and contrasting data.38 The idea of contrastive learning is to pull similar data samples together and push dissimilar samples apart in the representation space.39 This idea has been demonstrated to achieve successful representation learning in molecular systems.40–42 In addition, contrastive learning is capable of incorporating additional modalities into the molecular representation via multi-modal alignment, using modality pairs such as structures and text descriptions,43 SMILES and IUPAC names,44 and SMILES and molecular graphs.45 However, to the best of our knowledge, contrastive learning has yet to be applied to polymer science.
As illustrated in Fig. 2, an efficient approach to contrastive learning entails the formation of positive pairs by creating two distinct views of the same, original polymer-SMILES molecule (here, termed the anchor molecule) through data augmentation. This process is critical, as it enables the model to recognise and reinforce the essential features of the molecule by comparing these different views. Concurrently, the anchor molecule and the other molecules in the current batch (together with their respective positive views) are automatically considered negative pairs.46 Thus, the construction of positive pairs is exceptionally important, because their formation directly determines the identification of negative pairs, which are imperative to helping the algorithm understand the relationship between different positive pairs and, ultimately, better map out the representation space.
In chemistry, common approaches to augmentation are explicit, allowing observable modifications to the representation structure (e.g. removing a token from a SMILES string). Typical explicit augmentation modes for molecular graphs include node dropping, edge masking/perturbation, attribute masking and subgraph extraction.40,45 Implementation of explicit augmentation methods for SMILES representations remains limited and under-explored.45 In addition, augmentation can also be implemented in an implicit fashion, where different perturbations to the embedding are applied during the training process (e.g. natural dropout).47,48 Despite the demonstrated effectiveness of implicit augmentation, this approach has also received limited attention. Furthermore, there is a need to understand the effects arising from the heterogeneous combination of both types of augmentation strategies (i.e. implicit and explicit).
Here, we present PolyCL, a contrastive learning framework for polymer representation learning with improved predictive performance. To construct effective positive pairs, we also propose a novel combinatorial augmentation strategy that includes both explicit and implicit augmentations. Our results show that PolyCL outperforms other supervised and pre-trained models in a lightweight and flexible transfer learning setting in which fine-tuning of PolyCL is not required. We emphasise that this construction eliminates the need to fine-tune an entire model (e.g. pre-trained model + prediction head); instead, PolyCL can be implemented independently as a feature extractor for polymer property prediction tasks. We find that the representation learnt by our contrastive learning strategy is of improved quality and exhibits practical robustness to non-canonicalised SMILES input, and we show how our polymer representation can be used for a variety of downstream tasks via simple transfer learning. The dataset and model are available at: http://github.com/JiajunZhou96/PolyCL.
The contrastive pre-training objective is the normalised temperature-scaled cross-entropy (NT-Xent) loss, computed for a positive pair (i, j) over a batch of N anchors (2N augmented views) as:

$$\ell_{i,j} \;=\; -\log \frac{\exp\!\left(\mathrm{sim}(z_i, z_j)/\tau\right)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp\!\left(\mathrm{sim}(z_i, z_k)/\tau\right)} \qquad (1)$$
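As a concrete illustration of eqn (1), the PyTorch sketch below re-implements an NT-Xent loss over a batch of N anchors with two augmented views each. It is a minimal re-implementation for clarity, not PolyCL's own code; the temperature value and the variable names in the usage comment are assumptions.

```python
# Minimal NT-Xent sketch for eqn (1): z1 and z2 are the projected embeddings of the
# two augmented views of the same batch of N polymers (shape [N, d] each).
import torch
import torch.nn.functional as F

def nt_xent_loss(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)          # [2N, d]; sim(.,.) = cosine similarity
    sim = (z @ z.t()) / temperature                             # [2N, 2N] similarity matrix
    self_mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float("-inf"))             # enforce the k != i indicator
    # positives: view i in z1 pairs with view i in z2, and vice versa
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)                        # mean of l_{i,j} over all 2N anchors

# usage (hypothetical names): loss = nt_xent_loss(projector(h_view_a), projector(h_view_b))
```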
Here, we used the pre-trained PolyBERT16 as our encoder, f(·), and maintained the default settings for all hyperparameters in the transformer architecture. The projector g(·) is a two-layer MLP that maps the pooled 600-dimensional representation h to a 128-dimensional latent vector z for similarity evaluations. During the contrastive pre-training, we enabled mixed precision training. AdamW52 was used as the optimiser with a learning rate of 1 × 10−5 to minimise the NT-Xent loss. A gradient clipping mechanism was employed with a max grad norm set to 1.0. We trained the model for 10 epochs in total, taking approximately 22 hours on 8 NVIDIA Tesla V100 GPUs.
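A sketch of one pre-training step under the settings listed above is shown below. The Hugging Face model identifier, the exact pooling and the projector layout are assumptions for illustration; the optimiser, learning rate, gradient clipping and mixed precision follow the values stated in the text, and `nt_xent_loss` is the sketch above.

```python
# Sketch of one contrastive pre-training step (assumed model id and pooling; AdamW at
# 1e-5, max grad norm 1.0 and mixed precision follow the settings described above).
import torch
from torch import nn
from transformers import AutoModel, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained("kuelumbus/polyBERT")        # assumed model id
encoder = AutoModel.from_pretrained("kuelumbus/polyBERT").to(device)   # 600-d hidden size
projector = nn.Sequential(nn.Linear(600, 600), nn.ReLU(), nn.Linear(600, 128)).to(device)

params = list(encoder.parameters()) + list(projector.parameters())
optimizer = torch.optim.AdamW(params, lr=1e-5)
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))

def embed(smiles_batch):
    """[CLS]-pooled 600-dimensional embeddings for a list of polymer-SMILES strings."""
    tokens = tokenizer(smiles_batch, padding=True, truncation=True,
                       return_tensors="pt").to(device)
    return encoder(**tokens).last_hidden_state[:, 0, :]

def training_step(view_a, view_b):
    """view_a / view_b: two augmented polymer-SMILES lists for the same anchor molecules."""
    optimizer.zero_grad()
    with torch.autocast(device_type=device, enabled=(device == "cuda")):
        loss = nt_xent_loss(projector(embed(view_a)), projector(embed(view_b)))
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)
    torch.nn.utils.clip_grad_norm_(params, max_norm=1.0)   # gradient clipping at 1.0
    scaler.step(optimizer)
    scaler.update()
    return loss.item()
```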
Augmentations can be empirically categorised into two modes: "explicit" and "implicit". Explicit augmentations are direct and observable modifications to the input data. As shown in Fig. 2, explicit augmentations include enumeration (Enumeration), token masking (Masking) and token dropping (Drop). Enumeration randomly generates one non-canonical version of a polymer SMILES based on its canonical SMILES string. Drop deletes 10% of the tokens in the SMILES string. Masking substitutes 10% of the tokens in the SMILES string with a special token. In this way, the same molecule can be transformed into two different SMILES strings to construct an effective positive pair.
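The three explicit augmentations can be sketched as follows; the regex tokeniser and the "[MASK]" symbol are simplifications for illustration (PolyCL itself relies on polyBERT's tokeniser), and the example repeat-unit string is assumed.

```python
# Illustrative versions of Enumeration, Drop and Masking for polymer-SMILES strings.
import random
import re
from rdkit import Chem

TOKEN_PATTERN = re.compile(r"\[[^\]]+\]|Br|Cl|.")   # coarse SMILES tokeniser (simplification)

def enumerate_smiles(smiles: str) -> str:
    """Return one random, non-canonical SMILES of the same molecule."""
    return Chem.MolToSmiles(Chem.MolFromSmiles(smiles), canonical=False, doRandom=True)

def drop_tokens(smiles: str, ratio: float = 0.1) -> str:
    """Delete ~10% of the tokens in the SMILES string."""
    tokens = TOKEN_PATTERN.findall(smiles)
    return "".join(t for t in tokens if random.random() > ratio)

def mask_tokens(smiles: str, ratio: float = 0.1, mask: str = "[MASK]") -> str:
    """Replace ~10% of the tokens with a special mask token."""
    tokens = TOKEN_PATTERN.findall(smiles)
    return "".join(mask if random.random() < ratio else t for t in tokens)

anchor = "[*]CC([*])Cl"                                          # assumed PVC repeat unit
positive_pair = (enumerate_smiles(anchor), mask_tokens(anchor))  # e.g. Enumeration-Masking
```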
Beyond explicit augmentations, the subtle modifications in how the input data is represented in the intermediate layers within the model are referred to as implicit augmentations, as shown in Fig. 2. Following the work of SimCSE,46 we used the inherent dropout module inside our transformer encoder to create differences in molecular embedding for the same input. With implicit augmentations enabled, the dropout ratio for hidden layers and attention probabilities in the configuration of the transformer encoder is 0.1; when disabled, both values are set to 0. In addition, we have also combined both explicit and implicit augmentations for the construction of positive pairs to study the cooperative effect of augmentation strategies.
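A minimal sketch of the implicit (dropout) augmentation, following the SimCSE idea described above, is shown below; it reuses the `embed`, `projector` and `nt_xent_loss` helpers from the earlier sketches, and the example strings are assumptions.

```python
# Implicit augmentation: the same batch is encoded twice with dropout (p = 0.1) active,
# so the two forward passes give slightly different embeddings that form a positive pair.
encoder.train()   # keep the transformer's hidden/attention dropout active

smiles_batch = ["[*]CC([*])Cl", "[*]CC([*])c1ccccc1"]   # assumed PVC and polystyrene repeat units
z1 = projector(embed(smiles_batch))   # first dropout realisation
z2 = projector(embed(smiles_batch))   # second, different dropout realisation
loss = nt_xent_loss(z1, z2)

# Disabling implicit augmentation corresponds to setting both the hidden-layer and
# attention dropout probabilities in the encoder configuration to 0.
```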
We employed a two-step strategy to identify the best-performing combinations of augmentations. First, we applied our contrastive pretraining approach to all possible combinations of explicit augmentations. Subsequently, we evaluated the predictive performance of the resulting models across all downstream tasks within a transfer learning framework. The contrastive learning model pretrained without augmentations served as the baseline for comparison. We assessed the number of downstream tasks in which each model outperformed the baseline. Based on this evaluation, we selected the most promising combinations of explicit augmentations as candidates. These combinations were then integrated with implicit augmentations for further analysis and evaluations of performance.
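The selection step can be summarised by a simple win count against the no-augmentation baseline, as in the sketch below; the R2 values shown are placeholders, not results from this work.

```python
# Count, for each augmentation combination, how many downstream tasks beat the
# no-augmentation baseline; keep the top combinations for pairing with implicit dropout.
baseline_r2 = {  # placeholder baseline R2 values per task (illustrative only)
    "Eea": 0.90, "Egb": 0.88, "Egc": 0.87, "Ei": 0.79, "EPS": 0.77, "Nc": 0.82, "Xc": 0.40}

explicit_combo_r2 = {  # placeholder results for two explicit combinations
    "Original-Drop":       {"Eea": 0.89, "Egb": 0.89, "Egc": 0.88, "Ei": 0.80, "EPS": 0.78, "Nc": 0.83, "Xc": 0.41},
    "Enumeration-Masking": {"Eea": 0.89, "Egb": 0.89, "Egc": 0.88, "Ei": 0.80, "EPS": 0.79, "Nc": 0.84, "Xc": 0.41}}

def wins_over_baseline(model_r2, baseline=baseline_r2):
    """Number of downstream tasks on which a pre-trained variant beats the baseline."""
    return sum(model_r2[task] > baseline[task] for task in baseline)

win_counts = {name: wins_over_baseline(r2) for name, r2 in explicit_combo_r2.items()}
best_explicit = sorted(win_counts, key=win_counts.get, reverse=True)
```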
The experimental setup employs an MLP regressor featuring a single hidden layer and ReLU activation, integrated with the PolyCL feature extractor. "[CLS]" pooling serves as the readout function, extracting a 600-dimensional polymer representation. Specifically, this approach transforms the token-level embeddings of each polymer sequence into a sentence-level embedding, wherein the sequence information is encapsulated by the appended "[CLS]" token. The hidden size within the MLP is consistent with the input size for all pre-trained models (including the benchmarking study of polyBERT and Transpolymer). A dropout ratio of 0.1 is applied. Polymer-SMILES strings are encoded by the tokeniser of polyBERT.16 An l2 loss function is implemented for the regression tasks. During the regression phase, AdamW52 was used as the optimiser with a learning rate of 0.001 and no weight decay. For each downstream dataset, a 5-fold cross-validation strategy is employed, accompanied by a 500-epoch training protocol. An early-stopping monitor is activated after 50 epochs, with a patience setting of 50 epochs. The performance on the unseen validation datasets is evaluated using the root-mean-square error (RMSE) and the coefficient of determination (R2). To show the general expressiveness of the learnt representation, all hyperparameters for the transfer learning performed with PolyCL are set by simple heuristics and are not tuned through a validation process.
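The downstream protocol described above can be sketched as follows, assuming the frozen [CLS]-pooled features have already been computed (e.g. with the `embed` helper above, under `torch.no_grad()` and with the encoder in eval mode); early stopping is omitted for brevity.

```python
# Frozen PolyCL features + one-hidden-layer MLP head (ReLU, dropout 0.1), l2 loss,
# AdamW (lr = 1e-3, no weight decay), 5-fold CV, reporting RMSE and R2 per fold.
import numpy as np
import torch
from torch import nn
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error, r2_score

def make_regressor(in_dim: int = 600) -> nn.Module:
    return nn.Sequential(nn.Linear(in_dim, in_dim), nn.ReLU(),
                         nn.Dropout(0.1), nn.Linear(in_dim, 1))

def evaluate_downstream(features: np.ndarray, targets: np.ndarray, epochs: int = 500):
    """5-fold cross-validation of the MLP head on frozen polymer features."""
    scores = []
    for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(features):
        x_tr = torch.tensor(features[train_idx], dtype=torch.float32)
        y_tr = torch.tensor(targets[train_idx], dtype=torch.float32).unsqueeze(1)
        x_va = torch.tensor(features[val_idx], dtype=torch.float32)
        model = make_regressor(features.shape[1])
        opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.0)
        for _ in range(epochs):                                   # early stopping omitted
            opt.zero_grad()
            loss = nn.functional.mse_loss(model(x_tr), y_tr)      # l2 loss
            loss.backward()
            opt.step()
        model.eval()
        with torch.no_grad():
            pred = model(x_va).squeeze(1).numpy()
        scores.append({"rmse": mean_squared_error(targets[val_idx], pred) ** 0.5,
                       "r2": r2_score(targets[val_idx], pred)})
    return scores
```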
Alignment measures the expected distance between the embeddings of positive pairs; a lower alignment loss indicates that semantically similar samples are mapped to nearby representations, as defined in eqn (2):

$$\ell_{\mathrm{align}} \;=\; \mathbb{E}_{(x,\,x^{+})\sim p_{\mathrm{pos}}}\big\lVert f(x)-f(x^{+})\big\rVert_2^{\,2} \qquad (2)$$
Uniformity is a measure of the distribution of the learnt representations on the unit hypersphere; it is defined as the log of the mean Gaussian potential between pairs of embeddings f(x) and f(y), where x and y are independent and identically distributed samples from the data distribution, as shown in eqn (3). A lower uniformity loss indicates that the learnt embedding distribution is capable of preserving maximal information:
$$\ell_{\mathrm{uniform}} \;=\; \log\, \mathbb{E}_{x,\,y\,\overset{\mathrm{i.i.d.}}{\sim}\,p_{\mathrm{data}}}\Big[\exp\!\big(-2\,\lVert f(x)-f(y)\rVert_2^{2}\big)\Big] \qquad (3)$$
To effectively evaluate the alignment and uniformity, a dataset distribution that has never been seen by any of the pre-trained models needed to be constructed to ensure a fair cross-model comparison. Here, we randomly sampled 60000 polymers from the excluded development dataset of polyBERT16 for evaluation. Each polymer was augmented once by the SMILES enumeration method to create a positive pair, thereby preserving semantics.
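Minimal PyTorch versions of the two metrics in eqn (2) and (3) are sketched below; the exponents α = 2 and t = 2 follow the common convention and are assumptions here, and the inputs are taken to be L2-normalised embeddings of the enumeration-based positive pairs described above.

```python
# Alignment (eqn (2)) and uniformity (eqn (3)) over a batch of embeddings.
import torch

def alignment(x: torch.Tensor, y: torch.Tensor, alpha: float = 2.0) -> torch.Tensor:
    """Mean distance between positive-pair embeddings x[i], y[i]; lower = better aligned."""
    return (x - y).norm(p=2, dim=1).pow(alpha).mean()

def uniformity(x: torch.Tensor, t: float = 2.0) -> torch.Tensor:
    """Log of the mean Gaussian potential over all embedding pairs; lower = more uniform."""
    return torch.pdist(x, p=2).pow(2).mul(-t).exp().mean().log()
```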
In the transfer learning phase, we extract the representation using the pre-trained model and then use a prediction head to predict any property of interest. The pre-trained transformer encoder is employed to encode polymers into their representations. Here, a simple prediction head h(·), constructed as a randomly initialised two-layer multi-layer perceptron (MLP), is trained to map the polymer representations to the predicted properties ŷ. However, the transfer learning process is flexible in selecting the predictive model that best serves the requirements of the downstream task.
There are two key advantages to this strategy. Firstly, this approach ensures that the representation is independent of any further fine-tuning of the underlying pre-trained model during transfer learning, allowing for a fairer evaluation of the representation's quality. Secondly, this approach better aligns with common real-world applications: typically, only the polymer representation is incorporated into subsequent models, rather than the pre-trained model being fine-tuned together with these later models. PolyCL should serve simply as a flexible representation generator and, therefore, requires no extra computational resources to fine-tune pre-trained parameters during usage. To explore how our model performs, we compared PolyCL with supervised models including random forest (RF), XGBoost (XGB), and neural networks (NN), each trained on either ECFP fingerprints56 or the domain-specific Polymer Genome (PG) fingerprints.12 We also implemented a cross-modal comparison between the above fingerprints and graph representations encoded by graph convolutional networks (GCN)57 and graph isomorphism networks (GIN).58 Finally, we compared with other machine-learnt representations produced via self-supervised learning strategies: polyBERT16 and Transpolymer.35
The results of our transfer learning are shown in Table 1. We conducted our transfer learning on seven different datasets sourced from Xu et al.,35 including band gap (both chain (Egc) and bulk (Egb)), electron affinity (Eea), ionisation energy (Ei), DFT-calculated dielectric constant (EPS), crystallisation tendency (Xc), and refractive index (Nc). Following previous works,12,35 we assessed the five-fold average R2 on the unseen validation datasets. Among seven supervised models and three self-supervised models, PolyCL achieves the overall best R2 and four individual best performances across the seven property datasets. PolyCL has a significant advantage in predictive performances over the second-best model in the ionisation energy (Ei), dielectric constant (EPS) and refractive index (Nc) datasets, by 2.4%, 2.4%, 4.3%, respectively. This performance shows that the chemical and structural information captured in the SMILES representation by our model can be generalised to different types of properties, and help to construct more efficient models for quantitative structure–activity relationships. Therefore, the contrastive learning strategy enables the generation of a more expressive representation.
Model | #Params | Eea | Egb | Egc | Ei | EPS | Nc | Xc | Avg. R2
---|---|---|---|---|---|---|---|---|---
RF (ECFP) | — | 0.8401 | 0.8643 | 0.8704 | 0.7421 | 0.6840 | 0.7540 | 0.4345 | 0.7413
XGB (ECFP) | — | 0.8350 | 0.8568 | 0.8679 | 0.7221 | 0.6728 | 0.7574 | 0.3842 | 0.7280
NN (ECFP) | 264k | 0.8543 | 0.8708 | 0.8838 | 0.7562 | 0.7473 | 0.8066 | 0.3975 | 0.7595
GP (PG)a | — | 0.90 | 0.91 | 0.90 | 0.77 | 0.68 | 0.79 | <0 | <0.71
NN (PG)a | — | 0.87 | 0.90 | 0.89 | 0.74 | 0.71 | 0.78 | <0 | <0.70
GCN | 70k | 0.8544 | 0.8043 | 0.7988 | 0.6646 | 0.7404 | 0.5238 | 0.3316 | 0.6739
GIN | 218k | 0.8829 | 0.8350 | 0.8181 | 0.7841 | 0.6925 | 0.6317 | 0.3902 | 0.7192
TransPolymer35 | 82.1M | 0.8943 | 0.8961 | 0.8756 | 0.7919 | 0.7568 | 0.8109 | 0.4552 | 0.7830
PolyBERT16 | 25.2M | 0.9065 | 0.8830 | 0.8783 | 0.7670 | 0.7694 | 0.8017 | 0.4367 | 0.7775
PolyCL | 25.2M | 0.9071 | 0.8884 | 0.8832 | 0.8112 | 0.7876 | 0.8460 | 0.4043 | 0.7897

a The R2 values of these two lines are directly taken from the single-task learning experiments of Kuenneth et al.12
Compared with supervised learning methods, the polymer representations produced by self-supervised learning achieved a higher overall performance (Avg. R2) and robustness across all datasets. Only Polymer Genome (PG) fingerprints12 reach comparable performance on given tasks, specifically the band gap of a chain (Egc) and in the bulk (Egb). However, this fingerprinting method is less robust, as seen from its predictions on other datasets, for example, crystallisation tendency (Xc). ECFP generally exhibits better overall performance and robustness than PG fingerprints; however, on specific prediction tasks, PG fingerprints tend to outperform ECFP owing to their higher target specificity. In addition, our implementation of graph neural networks suggests that graph representations remain an efficient way to represent polymers; however, they do not show better predictive performance than traditional fingerprinting methods.
Furthermore, we replaced the canonical polymer-SMILES in all downstream datasets with one of their enumerated SMILES strings and used these modified datasets to evaluate the models in the transfer learning setting described above. Since non-canonicalised SMILES are commonly encountered in practical applications, this modification reflects real-world usage. The results are summarised in Table 2. Compared with the other self-supervised learning strategies, PolyCL shows the most robust results across all datasets. Relative to the canonicalised datasets, the use of non-canonicalised datasets leads to an average performance decay of 4.7% and 8.3% for Transpolymer and polyBERT, respectively. In contrast, the performance of PolyCL is maintained at a comparable level, with a slight decrease of only 2.6%. These results demonstrate that the polymer representation obtained by PolyCL outperforms the other self-supervised learning strategies in terms of robustness and invariance to different molecular representations.
Model | Eea (Enum) | Egb (Enum) | Egc (Enum) | Ei (Enum) | EPS (Enum) | Nc (Enum) | Xc (Enum) | Avg. R2
---|---|---|---|---|---|---|---|---
TransPolymer | 0.8674 | 0.8593 | 0.8597 | 0.7720 | 0.6942 | 0.7788 | 0.3920 | 0.7462
polyBERT | 0.8618 | 0.8191 | 0.8298 | 0.6884 | 0.7304 | 0.8109 | 0.2525 | 0.7133
PolyCL | 0.8870 | 0.8814 | 0.8748 | 0.7862 | 0.7405 | 0.8165 | 0.3962 | 0.7689
As an additional assessment, we also assessed the results of fine-tuning, which means that all parameters in both the pre-trained model and the prediction head are unfrozen and fine-tuned (as shown in Fig. S3†). Although fine-tuning is not our focus here, we show that our model achieves competitive results compared with other self-supervised models, including polyBERT and Transpolymer, in this experimental setting.
In addition, explicit augmentations yielded decreased performance relative to the baseline for the electron affinity (Eea) and ionisation energy (Ei) datasets. This decline likely stems from two factors: the strong baseline performance and the sensitivity of these properties to the precise molecular orbital and electronic configuration. Explicit augmentations, which make direct, observable changes to the molecule (e.g. removing an atom or breaking a bond), may distort the parts of the representation that signal these electronic properties and therefore have a detrimental effect. Alternatively, this may also be an indication of the underlying difficulty of the downstream task. For example, Xc is a non-trivial polymer property to assess experimentally and computationally,49 whereas methods to assess electron affinity and ionisation potential are better established. Indeed, the group contribution method used to obtain Xc may be particularly susceptible to the data augmentation strategies that we use, which increases the complexity of this task and introduces noise. The remaining downstream properties are generally correlated with broader structural motifs, for which augmented data can benefit model robustness by introducing slight structural perturbations.
Considering all of the downstream datasets, we observe that some explicit augmentation combinations perform better than others (Fig. 3(a)). The combinations Original-Drop and Enumeration-Masking are the best explicit combinations, improving on the no-augmentation baseline in six of the seven downstream datasets. The next best combinations, Original-Enumeration and Enumeration-Drop, improved performance for five of the seven downstream datasets. From the results of the augmentation combination study, we observe that including either the original SMILES or the enumeration strategy improves performance. Intuitively, these two options preserve the original and complete semantics of the polymer molecules. This aligns with previous research, which has shown that enumeration as an augmentation strategy can enhance the performance of machine learning tasks.35,45,59,60 During Masking and Drop, although the local structure of the polymer-SMILES is preserved, these augmentation types impair the semantics. Therefore, the combinations that perform best preserve the full semantics in one branch, which serves as an anchor that helps the parallel branch recover its full semantics. This mechanism may encourage the contrastive pre-training objective to learn more effective representations.
Implicit augmentations (Fig. 3(b)) have a unique advantage in creating high-performing contrastive learning strategies, outperforming the baseline for five of the seven datasets; this is comparable to the high-performing explicit augmentation combinations. After confirming the effectiveness of implicit augmentations, we combined the best-performing explicit combinations with implicit augmentations (listed as mixed augmentations in Fig. 3(b)) to identify whether this resulted in improved performance. Adding implicit augmentations affected the explicit combinations in different ways. For Original-Enumeration and Enumeration-Masking, implicit augmentations further improved the expressiveness of the resulting representations, whereas Original-Drop and Enumeration-Drop lost efficacy.
Surprisingly, Enumeration-Masking with implicit dropout was the overall best-performing combination. This result suggests that diversifying the augmentation modes is beneficial for constructing the contrastive learning objective. We can intuitively explain why this combination works. As analysed above, Masking of the original SMILES in one branch conceals part of the information, while Enumeration in the other branch assists recovery of the original SMILES from its enumerated form. In addition, the semantics in both branches are further perturbed by dropout noise to create slight differences that encourage the comparison; together, these steps make the comparison both comprehensive and effective. We therefore chose this augmentation mode, the product of combining explicit and implicit augmentations, to train our final PolyCL model.
In contrast to the other augmentation combinations shown in Fig. 4, PolyCL applies Enumeration-Masking with implicit dropout. From Fig. 4, it can be seen that the alignment and uniformity converge towards the ideal quadrant during pre-training guided by the contrastive learning objective, indicating superior performance. This observation aligns with the transfer learning results in Section 3.2, and with conclusions from previous studies,46,53 which showed that improved alignment and uniformity are generally linked to improved performance of the pre-trained representation. The low alignment loss of PolyCL specifically demonstrates that effective alignment contributes to the robust predictive results in Section 3.2 when PolyCL representations are applied to enumerated datasets.
We have also evaluated the alignment and uniformity of the pre-trained models polyBERT and Transpolymer. PolyBERT has balanced alignment and uniformity, with both values lying in the middle region compared to the other results. For Transpolymer, the alignment loss is comparable to the best contrastive learning models, while the uniformity loss is similar to the Drop-only model.
While PolyCL outperforms the other pre-trained models under these two evaluations, it should be noted that polyBERT serves as the starting point (prior) for PolyCL. Properly executed contrastive pre-training therefore improves the model in both alignment and uniformity. This also emphasises the importance of the augmentation strategy, as not all augmentations result in an improvement of both metrics through the training process. Although there is no evident link between the transfer learning performance on specific tasks and the alignment and uniformity, the overall transfer learning performance appears positively correlated with the alignment and uniformity metrics.
In Fig. 5(b), polymers from a sampled downstream property dataset (Egc) were embedded in the representation space and coloured by the value of the band gap (chain) ground truth; the gradient across this representation shows that the sampled downstream property, the chain band gap (Egc), is highly related to the embedded structural features of the polymers. This gradient was also observed in the other property datasets (as shown in Fig. S2†). Due to the limited number of datapoints in each remaining dataset, the gradient is less evident than in the t-SNE visualisation of the Egc dataset (Fig. 5(b)).
Our results also suggest that the representation space effectively captures changes in key physical properties implied by the structural features that the original t-SNE was trained on. In Fig. 5(c), all available data is encoded into the representation space and coloured by the data source. Here, we observe that the initial, unsupervised dataset comprehensively covers the chemical space encompassed by all of the downstream datasets. We also visualised the molecular structures corresponding to randomly selected points in the embedding. The results show that the structural features learnt by contrastive learning align with human understanding, albeit with some divergence. It can be seen from the visualisation that neighbouring representations do not necessarily have similar structures in their molecular graphs. This discrepancy may be due to the different emphases of SMILES strings and molecular graphs in encoding molecular structures, and to the particular way in which the contrastive learning strategy learns the representations.
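A sketch of the visualisation in Fig. 5(b) is given below, assuming the frozen `embed` helper from the earlier sketches and placeholder variables `egc_smiles` and `egc_values` for the Egc dataset; it illustrates the workflow rather than the exact plotting code.

```python
# t-SNE of frozen PolyCL embeddings of the Egc dataset, coloured by the chain band gap.
import numpy as np
import torch
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

encoder.eval()
with torch.no_grad():
    features = np.vstack([embed(egc_smiles[i:i + 64]).cpu().numpy()
                          for i in range(0, len(egc_smiles), 64)])   # batched [CLS] features

coords = TSNE(n_components=2, random_state=0).fit_transform(features)
plt.scatter(coords[:, 0], coords[:, 1], c=egc_values, cmap="viridis", s=8)
plt.colorbar(label="Egc (chain band gap)")
plt.tight_layout()
plt.show()
```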
Footnote
† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d4dd00236a