Open Access Article
Francesco Piccoli†‡, Gabriel Vogel† and Jana M. Weber*
Department of Intelligent Systems, Delft University of Technology, Delft, 2629 HZ, The Netherlands. E-mail: j.m.weber@tudelft.nl; g.vogel@tudelft.nl
First published on 23rd January 2026
Recent advances in machine learning (ML) have shown promise in accelerating the discovery of polymers with desired properties by aiding in tasks such as virtual screening via property prediction. However, progress in polymer ML is hampered by the scarcity of high-quality labeled datasets, which are necessary for training supervised ML models. In this work, we study the use of the very recent ‘Joint Embedding Predictive Architecture’ (JEPA), a type of architecture developed for self-supervised learning (SSL), on polymer molecular graphs to understand whether pretraining with the proposed SSL strategy improves downstream performance when labeled data is scarce. We first pretrain our polymer-JEPA model on a large dataset of conjugated copolymer photocatalysts. The pretrained model is then fine-tuned on two distinct downstream tasks: predicting electron affinity in the same chemical space and classifying phase behavior in diblock copolymers, a different chemical space. Our results indicate that JEPA-based self-supervised pretraining enhances downstream performance, particularly when labeled data is very scarce, achieving improvements across both tested datasets. The method provides performance gains in cross-domain fine-tuning, highlighting its potential to extract general knowledge across different classes of polymers. By leveraging large amounts of unlabeled polymer structures for pretraining, the proposed strategy can further reduce the dependence on extensive labeled datasets.
In recent years, machine learning (ML) has shown potential in the discovery of new materials, including polymers.6,7 ML methods are increasingly applied in polymer science, particularly in two key areas: virtual screening of predefined candidate structures to predict properties and inverse polymer design to generate novel structures with desired properties.5,8,9
However, the application of ML in polymer science is still in its infancy, primarily due to the scarcity of high-quality, large, publicly available labeled datasets. This limitation arises from the time- and cost-intensive procedure of generating labeled polymer data (via experimental synthesis and testing or accurate molecular simulations).4,6,10 To overcome the problem of limited labeled data, several common strategies have been applied in the polymer domain. Transfer learning involves pretraining models on polymer properties with abundant labeled data and fine-tuning them for properties with limited data.11,12 Multitask learning is an effective approach to train predictive models on multiple properties with varying levels of labeled data, leveraging interdependencies between these properties.2,13,14 Lastly, self-supervised learning (SSL) makes it possible to pretrain models on large volumes of unlabeled data through tasks defined directly on the input data. The learned representations can then be fine-tuned on smaller labeled datasets.3,12,15–17
Among these strategies, SSL has been particularly transformative across different data structures such as images,18–21 natural language,22,23 and graphs.24–26 In the molecular domain, graph-based SSL has shown considerable success with small molecules.27–29
In the context of polymers, the focus has largely been on text-based SSL, learning representations through tasks derived from the textual pSMILES representation,3,12,16,17 with limited exploration of graph-based SSL approaches. Polymer graphs beyond the polymer repeat unit graph, including weighted edges that describe monomer ensembles, their topology and their stochasticity, as proposed in ref. 1, present unique structural characteristics that distinguish them from small molecular graphs, posing challenges for directly applying SSL techniques developed for small molecular graphs. A recent study proposed a self-supervised graph neural network for such polymer graphs.15 The authors employ two SSL tasks: one at the node/edge level, masking nodes and edges and learning to predict them, and the other at the graph level, predicting a pseudolabel corresponding to the molecular weight of the polymer, derived from the monomers' weights. They test both tasks separately and together, and they discover that pretraining via both tasks proves to be the most effective. This result aligns with findings in the literature30 that SSL on graphs works better when using both node-level and graph-level tasks together.
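The graph-level pseudolabel task of ref. 15 derives a target directly from the input: a polymer "molecular weight" computed from the monomers' weights. A minimal sketch, using hypothetical monomer weights and a stoichiometry-weighted mean as one plausible way to derive such a pseudolabel (the exact formula in ref. 15 may differ):

```python
# Toy pseudolabel: a copolymer "molecular weight" derived from monomer weights,
# here as a stoichiometry-weighted mean (all numbers are made up for illustration).
monomer_weight = {"A": 120.0, "B": 176.0}   # g/mol, hypothetical
stoichiometry = {"A": 0.25, "B": 0.75}      # a 1:3 A:B ratio
pseudolabel = sum(stoichiometry[m] * monomer_weight[m] for m in monomer_weight)
print(pseudolabel)  # 162.0
```

Because the pseudolabel is computed from the unlabeled structure itself, it provides a free graph-level supervision signal during pretraining.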
In this work, we study a new architecture family, the Joint Embedding Predictive Architecture (JEPA),31 originally developed for self-supervised representation learning on images. Unlike traditional graph-based SSL methods such as node or edge masking, which reconstruct masked features directly in the input graph space, JEPAs operate in an embedding space. Predicting in the embedding space facilitates the learning of semantically rich representations and avoids having to predict and reconstruct every (potentially noisy and hard-to-predict) detail of the input space, which in high-dimensional domains often leads to overfitting.31,32 JEPAs learn by predicting the embedding of a “target view” of the graph from the embedding of a “context view”, typically employing two encoders. We apply this architecture for self-supervised pretraining on stochastic polymer graphs to improve accuracy on downstream tasks (e.g. property prediction) in label-scarce data scenarios.
We first use the proposed method for pretraining on a larger unlabeled corpus of data, and then finetune the model on available labeled data in a supervised fashion. While some aspects of our analysis apply broadly to JEPAs across various types of graphs (i.e. in different domains) and extend the study of JEPAs for graphs initiated in ref. 32, other results and experiments are specific to JEPAs in the molecular graph domain, specifically for stochastic polymer graphs.
966 polymers, composed of nine A monomers and 862 B monomers. The polymers are created by combining each A monomer with each B monomer in three distinct chain architectures: alternating, random, and block. Additionally, they are further distinguished by three stoichiometry ratios: 1:1, 1:3, and 3:1. While this is a useful and extensive dataset, its variety is somewhat limited by the combinatorial approach used to build the polymers. Aldeghi and Coley1 further performed oligomer DFT calculations to provide estimates of two properties, the electron affinity (EA) and ionization potential (IP), for each copolymer. While this provides a valuable resource for benchmarking polymer informatics, it potentially inherits biases from DFT approximations. We use this dataset for pretraining and keep aside a part of it for finetuning.
Fig. 1 Polymer graph representation as introduced in ref. 1. The representation uses stochastic edges (dashed) reflecting the connection probabilities between monomers, i.e. reflecting the stochastic nature and chain architecture.
Fig. 2 (a) Conjugated copolymer photocatalyst dataset1 with different stoichiometries and chain architectures. (b) Diblock copolymer dataset compiled by Arora et al.,34 reporting experimentally observed phase behavior (e.g., lamellae, cylinders, gyroid) for 50 diblock copolymers across varying stoichiometries and chain architectures.
We test our approach on two distinct downstream tasks, both starting from a model pretrained on the aforementioned dataset. Firstly, we finetune on data from the same dataset (but different split). Secondly, we finetune on a different downstream task using a dataset of diblock copolymers.34 The dataset provides labels of the phase behavior (lamellae, hexagonal-packed cylinders, body-centred cubic spheres, a cubic gyroid, or disordered) of 49 diblock copolymers across various relative volume fractions, totaling 4780 labeled polymer samples (see Fig. 2b). Both datasets were downloaded from the open-source repository https://github.com/coleygroup/polymer-chemprop-data/, accessed on Jan. 30th, 2024.
• Motif-based subgraphing is a domain-specific approach to generate chemically meaningful subgraphs, such as functional groups or molecular subunits. We build on the BRICS (Breaking of Retrosynthetically Interesting Chemical Substructures) algorithm,36 specifically using the implementation in ref. 37. The motif-based subgraphing method ensures meaningful subgraphs but potentially limits the variability of subgraphs due to its deterministic nature.
• METIS35 is a popular subgraphing algorithm, using a clustering-based method, that partitions graphs into clusters while minimizing edge cuts and maximizing within-cluster links. Its widespread use is due to its low computational cost and the quality of the produced subgraphs. Despite being computationally efficient, METIS-based subgraphs lack chemical meaning compared to motif-based subgraphing.
• Random-walk subgraphing uses a stochastic approach to generate diverse subgraphs. It ensures greater flexibility and control over subgraph sizes while producing varying subgraphs at each training iteration. This method aligns well with the requirements for JEPA, particularly the need for dynamic changes to prevent overfitting.
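The random-walk strategy can be sketched in a few lines: grow a connected subgraph by walking over a neighbor map, restarting from already-visited nodes when the walk gets stuck. The toy ring graph and the restart rule below are illustrative assumptions, not the paper's implementation:

```python
import random

def random_walk_subgraph(adj, size, rng=None):
    """Grow a connected subgraph of `size` nodes by a random walk over `adj`.

    `adj` maps each node to its neighbors. Restarting from a random visited
    node keeps the subgraph connected even when the walk reaches a dead end.
    """
    rng = rng or random.Random(0)
    visited = [rng.choice(sorted(adj))]
    while len(visited) < size:
        frontier = adj[rng.choice(visited)]          # step (or restart) the walk
        candidates = [n for n in frontier if n not in visited]
        if candidates:
            visited.append(rng.choice(candidates))
    return set(visited)

# Toy 6-ring graph standing in for a polymer graph (node: neighbors).
ring = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
context = random_walk_subgraph(ring, size=4)   # ~60% of the toy graph as context
print(len(context))  # 4
```

Because the walk is reseeded every epoch, the context and target subgraphs differ across training iterations, which is exactly the variability the JEPA objective benefits from.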
The next step focuses on the prediction of the target subgraph embeddings through a multi-layer perceptron (MLP) from the context embedding and positional information of the target. We employed positional encoding (PE) via random-walk structural encoding (RWSE)38,39 at two levels: at the node level and at the subgraph (patch) level. The node-level PE allows us to maintain node positional information when working with subgraphs. The subgraph (patch)-level positional encoding contains information about the connectivity of two subgraphs, here the context and target subgraph. Given the output of the context encoder, sx, we wish to predict the m target subgraph representations sy(1), …, sy(m). To that end, for a given target subgraph embedding sy(i), the predictor hϕ takes as input sx summed with the linearly transformed target subgraph positional token pi:

ŝy(i) = hϕ(sx + Wpi) (1)

with W ∈ ℝdp×d, where dp is the dimension of the positional encoding token and d is the embedding dimension. The predictor outputs the predicted target embedding ŝy(i). Since we wish to make predictions for m target blocks, we apply our predictor m times, obtaining predictions ŝy(1), …, ŝy(m). In practice, the predictor hϕ is implemented via an MLP. For each data point, the loss is the average L2 distance (mean squared error, MSE) between the m predicted target subgraph representations and the m true target subgraph representations. The MSE loss in the embedding space can lead to fluctuating or increasing loss values, e.g., with changing magnitudes of embedding vectors over epochs. In Appendix E, we investigate the effect of layer normalization on training stability and convergence of JEPA pretraining.
The pretraining phase always entails training the JEPA architecture on 40% (17 186 entries) of the conjugated copolymer dataset.1 After pretraining, only the trained target encoder is utilized in the finetuning step to obtain the polymer graph embedding. For the downstream task (e.g. polymer property prediction), an MLP is employed on top of the polymer graph embedding obtained from the target encoder. Finetuning is done end-to-end: not only the MLP but also the target encoder weights are updated during optimization. This allows the polymer graph embeddings to adapt to the specific downstream task. As mentioned in Section 2.1, we perform the finetuning on two different datasets to investigate the difference between using the same chemical space for pretraining and finetuning and using two different polymer chemical spaces. The results of this study are presented in Sections 3.1 and 3.2.
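End-to-end finetuning, i.e. updating both the pretrained encoder and the fresh MLP head, can be illustrated with a linear toy model; the synthetic data, one-layer "encoder", and learning rate are assumptions for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)
A = 0.1 * rng.normal(size=(4, 8))   # toy "pretrained" target-encoder weights
w = np.zeros(4)                     # freshly initialized prediction head

x = rng.normal(size=(32, 8))        # toy polymer features (hypothetical)
y = x.sum(axis=1)                   # synthetic property labels (hypothetical task)

def mse(A, w):
    return np.mean(((x @ A.T) @ w - y) ** 2)

before = mse(A, w)
lr = 0.01
for _ in range(300):
    z = x @ A.T                     # polymer graph embedding from the encoder
    err = z @ w - y                 # prediction error of the head
    # End-to-end: gradients update BOTH the head `w` and the encoder `A`.
    w -= lr * 2 * z.T @ err / len(x)
    A -= lr * 2 * np.outer(w, err @ x) / len(x)

print(mse(A, w) < before)  # True: encoder and head both adapt to the task
```

Freezing the encoder would correspond to skipping the update of `A`, i.e. training only the head on fixed embeddings.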
and the pseudolabel loss, aggregated with equal weights. The pseudolabel loss is computed as the mean squared error (MSE) between the predicted and target pseudolabels,
311 polymers) labeled data points. In Fig. 5 we report results only up to the 8% scenario to better visualize the impact in low-data regimes. The pretrained model demonstrates performance improvements especially in scenarios up to a data size of 4% (1728 polymers) labeled data points. Beyond this threshold, however, the benefits of pretraining plateau. This suggests that the available labeled data is sufficient for supervised learning, rendering the transferred pretraining knowledge redundant. In practice, a small change in the R2 value (e.g. ±0.01) does not significantly impact molecular design tasks; however, in the low labeled data scenarios (i.e. 0.4% and 0.8%), pretraining leads to a significant improvement in property prediction performance.
Fig. 5 Effectiveness of our pretraining strategy for different finetune dataset sizes. The performance is evaluated on predicting the electron affinity of test data from the conjugated copolymer dataset1 with the performance measured as R2.
Fig. 6 Effectiveness of our pretraining strategy for transfer learning and different finetune dataset sizes. Self-supervised pretraining is performed on data from the conjugated copolymer dataset1 and finetuning is performed on test data from the diblock copolymer dataset.34 The classification performance for predicting the phase behavior is measured as AUPRC (area under the precision–recall curve).
We see particular promise in scenarios where a model pretrained on a large, unlabeled polymer space is fine-tuned across domains on a smaller, labeled dataset from another polymer space. We expect this to be especially advantageous for small datasets with complex structure–property relationships.
Overall, the performance increase using input space self-supervised pretraining and embedding space self-supervised pretraining (JEPA, our method) is comparable. Fig. 7 reveals that our method is slightly better in the very low data scenarios, and Gao et al.'s method15 performs better with more available labelled data.
Fig. 7 Comparison between our pretraining strategy and the best performing SSL model from ref. 15. For different finetune dataset sizes we compare the R2 for prediction of EA of the conjugated copolymer test data.1
Using an additional pseudolabel objective as described in Section 2.4 leads to a consistent improvement in ref. 15. While we observed significant improvements in single scenarios, overall the performance improvements attributed to using an additional pseudolabel objective are smaller for our method than for the input-space SSL approach as shown in Fig. 8. This suggests that our strategy potentially already captures relevant information related to the polymer molecular weight pseudolabel.
Fig. 8 The effect of using an additional pseudolabel objective in input space SSL15 and embedding space SSL (ours). For different finetune dataset sizes we compare the R2 for prediction of EA of the conjugated copolymer test data.1 We include both the scenario when only the wD-MPNN encoder weights are transferred (no PL) and the scenario when also the pseudolabel (molecular weight) predictor weights are transferred (PL).
While the polymer-JEPA demonstrated improvements over our baseline model without pretraining in low-data regimes, we observed that the tree-based models outperform the pretrained wD-MPNN in the very low labeled data regimes. This trend is evident in Fig. 9a, where the random forest and XGBoost models exhibited an advantage over the pretrained wD-MPNN when using 0.4% (192 points) and 0.8% (384 points) of the dataset for finetuning. The advantage of the RF model vanishes as the dataset size increases to 1.6%, and the advantage of the XGBoost model vanishes as the size increases to 4%. In practice, only in a small regime around 4.0% finetune data do we observe that the pretrained model outperforms both the random forest and the wD-MPNN without pretraining. There is no scenario in which the pretrained wD-MPNN model outperforms both the XGBoost model and the baseline wD-MPNN without pretraining.
Fig. 9 (a) Comparison between our pretraining–finetuning strategy and two tree-based baseline models (random forest and XGBoost) on the conjugated copolymer dataset,1 predicting the EA property, for different finetune dataset sizes. (b) Comparison of phase behavior classification performance between our pretraining–finetuning strategy and two tree-based baseline models (random forest and XGBoost) on the diblock copolymer dataset,34 for different finetune dataset sizes.
For the smaller diblock dataset (Fig. 9b), we observed that the tree-based baseline models outperform our model across all data scenarios. This may be due to highly informative fingerprints that correlate with the classification label. As Aldeghi and Coley1 point out, the mole fraction alone, which correlates with the volume fractions of the two blocks, is highly informative for determining the copolymer phase. They trained an RF model on mole fractions only, which outperformed the wD-MPNN (without data-scarce scenarios) despite providing no information about the chemistry. However, in many polymer systems the structure–property relationships are more complex and cannot be captured by a single, easily computed feature. While polymer fingerprinting combined with smaller (e.g. tree-based) models remains a strong baseline, several studies have demonstrated that learned representations and deep neural networks surpass classical models across a variety of molecular and polymer property prediction tasks.42–44 We anticipate that as polymer-specific representations mature, better encoding higher-order structural features such as branching, molecular-weight distributions, and sequence heterogeneity, representation learning approaches will deliver even greater performance gains.
Based on the results of our study, we advise also considering simpler models for small labeled datasets with relatively simple structure-to-property relationships. However, one key advantage of our proposed method lies in its ability to eliminate the reliance on handcrafted descriptors. By learning directly from the polymer graph structure, JEPA offers greater adaptability to new datasets without the need to tune fingerprints for specific polymer structures. Specifically, the fingerprinting used in ref. 1, which involves generating oligomer ensembles and averaging their fingerprints, requires thoughtful engineering by experts and significant computation time.
Lastly, we hypothesize that more diverse pretraining datasets and more finetune data could further elevate the performance of our JEPA (wD-MPNN) pretrained model, beyond the two tasks covered in this study.
Each experiment involved pretraining the model on 40% of the conjugated copolymer dataset (Section 2.3) and finetuning on 0.4% (192 data points) to simulate a label-scarce scenario. Overall, across ablation experiments, self-supervised pretraining with our model consistently improved downstream property prediction performance. For the ablation experiments related to context subgraph size, target subgraph size and number of targets, we used the random-walk algorithm for subgraph creation due to its direct control over subgraph size.
Within the searched subgraphing settings, we observe the highest performance increase using random walk subgraphing, a context size of 60% and one target with a subgraph size of 10% of the polymer graph. Yet, the model's sensitivity to subgraphing hyperparameters is comparatively low, indicating that (i) the resulting parameters should be interpreted as one out of many suitable configurations and that (ii), in the tested parameter space, JEPA pretraining provides performance improvements irrespective of the hyperparameter configuration.
| Context size | R2 ↑ | RMSE ↓ |
|---|---|---|
| No pretraining | 0.46 ± 0.15 | 0.44 ± 0.06 |
| 20% | 0.56 ± 0.07 | 0.39 ± 0.03 |
| 40% | 0.60 ± 0.06 | 0.37 ± 0.03 |
| 60% | 0.65 ± 0.03 | 0.35 ± 0.02 |
| 80% | 0.62 ± 0.07 | 0.37 ± 0.03 |
| 95% | 0.61 ± 0.05 | 0.37 ± 0.02 |
| Target size | R2 ↑ | RMSE ↓ |
|---|---|---|
| No pretraining | 0.46 ± 0.15 | 0.44 ± 0.06 |
| 5% | 0.61 ± 0.07 | 0.37 ± 0.03 |
| 10% | 0.66 ± 0.02 | 0.35 ± 0.01 |
| 15% | 0.65 ± 0.03 | 0.35 ± 0.02 |
| 20% | 0.63 ± 0.03 | 0.36 ± 0.02 |
| Number of targets | R2 ↑ | RMSE ↓ |
|---|---|---|
| No pretraining | 0.46 ± 0.15 | 0.44 ± 0.06 |
| 1 | 0.67 ± 0.01 | 0.34 ± 0.01 |
| 2 | 0.64 ± 0.03 | 0.36 ± 0.01 |
| 3 | 0.66 ± 0.02 | 0.35 ± 0.01 |
| 4 | 0.65 ± 0.05 | 0.35 ± 0.02 |
| 5 | 0.61 ± 0.04 | 0.37 ± 0.02 |
| Subgraphing | R2 ↑ | RMSE ↓ |
|---|---|---|
| No pretraining | 0.46 ± 0.15 | 0.44 ± 0.06 |
| Motif-based | 0.63 ± 0.05 | 0.36 ± 0.02 |
| METIS | 0.67 ± 0.04 | 0.34 ± 0.02 |
| Random walk (RW) | 0.67 ± 0.01 | 0.34 ± 0.01 |
Using a fixed context size of 60% and a target size of 10% (single target), we compared the random-walk, motif-based, and METIS subgraphing methods. Interestingly, the motif-based method, which leverages domain knowledge to produce chemically meaningful subgraphs, exhibits slightly lower performance than the other two algorithms. While motif-based subgraphing generates chemically meaningful subgraphs, it tends to produce a relatively small number of subgraphs in a deterministic fashion, potentially limiting model generalization by increasing the likelihood of encountering similar or identical subgraphs (both context and target) throughout training. The METIS algorithm, while also producing consistent subgraphs at each epoch, generates on average a higher number of subgraphs than the motif-based approach, introducing more variability across epochs. Finally, random-walk subgraphing generates different subgraphs at every epoch, thanks to the stochastic nature of the subgraphing process. Beyond variability, we investigated how well these methods adhere to the specified subgraph size. In Appendix D we provide a detailed analysis of the subgraph sizes produced by the different subgraphing types given the desired specification. While all methods systematically overshoot the specified subgraph size due to implementation constraints (chemical validity, connectivity preservation, and meaningfulness of partitions), their adherence to the specified size differs. The random-walk approach adhered best to the desired context size (close to 60%), whereas METIS adhered best to the specified target size (10%). In the previous sections, we identified too-small context subgraphs as the main driver of performance drops. Since our implementation produces larger subgraphs than specified for all methods, we naturally avoid too-small context subgraphs.
While we observe the best performance for the random-walk approach, other subgraphing methods remain competitive and may favor slightly different optimal configurations (e.g., context size, target size, or number of targets). The analysis indicates that for our dataset the pretraining effect is consistently significant across most settings, whereas further hyperparameter tuning offers only minor improvements.
Our experiments show that self-supervised pretraining on conjugated copolymer data consistently improves the downstream prediction accuracy for a dataset that describes a different polymer space (diblock copolymers), showcasing the ability to transfer general knowledge across polymer datasets of different applications. When pretraining and finetuning on the same polymer space of conjugated copolymers, our method helps in label-scarce data scenarios up to around 8% (ca. 3440 polymers) of labeled data availability. The performance improvement (in R2) varies from 39.8% in the smallest labeled data scenario to 0.4% in the scenario with 8% labelled data.
Comparing our polymer-JEPA self-supervised model (embedding space) with node/edge-masking self-supervised learning (input space), we observe that we achieve comparable performance on the tested downstream task. Further, both methods benefit from including a pseudotask prediction (molecular weight), with the benefit being less pronounced in our embedding space self-supervised pretraining strategy. Additionally, we showed that simple tree-based baselines like a random forest or XGBoost model outperform our deep learning models in certain scenarios in the two downstream tasks considered, especially when labelled data is very scarce. However, these methods rely on expert-engineered polymer fingerprints, which is not necessary with our method that learns from the polymer graph directly.
Looking ahead, the embedding-based nature of JEPA offers promising opportunities for integrating multimodal data and utilizing a variety of different experimental and synthetic datasets for pretraining. Generally, we hypothesize that more diverse pretraining datasets could contribute to further increasing the performance of our JEPA (wD-MPNN) model compared to not using a pretrained model, also beyond the two tasks covered in this study.
(1) In the case of a copolymer, the context subgraph should include elements from both monomers: predicting a part of monomer B is not possible if monomer B is missing from the context.
(2) Every edge and every node should be in at least one subgraph32,39 to include full global information and full input representation.
(3) The context patch (subgraph) should be larger, hence more informative, than the target patches (subgraphs) we are trying to predict from the context.45
(4) The target subgraphs should have minimal overlap with the context subgraph32,45 to make the prediction task less trivial.
(5) The context and targets subgraphs should change at every training loop to prevent overfitting.32,45
(6) For every directed edge evu in a subgraph, also include the edge euv, to comply with the encoder architecture (wD-MPNN1).
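Constraints (2) and (6) are mechanical enough to express as small checks and transforms over node and edge sets; the integer node identifiers below are toy stand-ins, not the paper's data structures:

```python
def symmetrize(edges):
    """Constraint (6): for every directed edge (v, u), also include (u, v)."""
    return set(edges) | {(u, v) for (v, u) in edges}

def covers_graph(subgraphs, nodes):
    """Constraint (2): every node appears in at least one subgraph."""
    return set().union(*subgraphs) >= set(nodes)

edges = {(0, 1), (1, 2)}
print(symmetrize(edges) == {(0, 1), (1, 0), (1, 2), (2, 1)})  # True
print(covers_graph([{0, 1}, {1, 2, 3}], range(4)))            # True
```

Checks like these can be run after subgraph extraction to validate that a sampled context–target configuration satisfies the requirements before it is fed to the encoders.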
| wD-MPNN | R2 ↑ | RMSE ↓ |
|---|---|---|
| Edge-centred | 0.998 ± 0.0003 | 0.029 ± 0.002 |
| Node-centred | 0.998 ± 0.0002 | 0.027 ± 0.001 |
| wD-MPNN | R2 ↑ | RMSE ↓ |
|---|---|---|
| Edge-centred | 0.998 ± 0.0007 | 0.022 ± 0.004 |
| Node-centred | 0.997 ± 0.0004 | 0.025 ± 0.003 |
Model II (in the main manuscript), in contrast, simplifies the architecture by replacing the transformer encoders with wD-MPNNs for both context and target encoding, while removing the initial GNN stage. Here, Econtext operates on the context subgraph, and Etarget directly processes the full graph, from which target subgraph embeddings are pooled. This design yields a more parameter-efficient model, avoids intermediate subgraph embeddings, and naturally preserves positional information.
Model II clearly outperforms Model I, demonstrating better stability and accuracy, particularly in low-data regimes. Several factors may explain this: (i) Model I is more complex, combining wD-MPNN and transformer encoders, and likely requires more data to converge effectively. (ii) Model I contextualizes targets only through small subgraphs, which may discard positional cues critical for polymer graphs composed of multiple monomer units. In contrast, Model II trains the target encoder on the full graph, preserving this information. (iii) All hyperparameters (subgraph sizes, target number, and partitioning strategy) were tuned for Model II, potentially biasing results in its favor.
Regarding pretraining, the two variants behave differently. Model II benefits consistently from pretraining across all finetuning dataset sizes. In contrast, Model I shows improvement only in the most data-scarce scenarios; as the amount of finetuning data increases, the effect of pretraining becomes redundant or even slightly detrimental, suggesting that noise may dominate the learned representations in this configuration.
| Method | Context | Target | Ratio | Adherence |
|---|---|---|---|---|
| Motif-based | 14.6 ± 2.8 (69.2%) | 4.8 ± 0.9 (23.2%) | 3.1 : 1 | Context 115%, target 232% |
| METIS | 14.6 ± 3.2 (68.5%) | 3.1 ± 0.7 (14.6%) | 4.7 : 1 | Context 114%, target 146% |
| Random walk | 13.4 ± 2.6 (62.9%) | 3.8 ± 1.4 (18.6%) | 3.5 : 1 | Context 105%, target 186% |
We observe that all methods produce larger-than-expected targets due to implementation constraints (chemical validity, connectivity preservation, and meaningfulness of partitions). Nevertheless, the methods differ in how closely they approximate the intended proportions: random walk adheres most closely to the context size, METIS yields the most accurate sizes for targets, and motif-based partitions result in the largest context and targets compared to desired sizes.
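Reading adherence as the ratio of achieved to specified subgraph size reproduces the tabulated values; for instance, for the random-walk context (62.9% achieved vs. the specified 60%):

```python
# Adherence = achieved / specified subgraph size, illustrated for the
# random-walk context subgraph (62.9% achieved vs. 60% specified).
achieved, specified = 62.9, 60.0      # percent of the polymer graph
adherence = 100 * achieved / specified
print(f"{adherence:.0f}%")  # 105%, matching the random-walk context entry
```

Values above 100% therefore quantify how far each method overshoots its size specification.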
With the normalization layer, JEPA shows the expected smooth decline of both training and validation losses, indicating numerically stable optimization. In contrast, when the normalization layer is removed, the loss curves fluctuate and generally increase over the epochs. This behavior arises because the magnitude of the latent embeddings can drift during training, leading to apparent divergence in the raw L2 loss scale. Despite this, we found the variant without normalization (closer to the original JEPA implementation) to achieve superior downstream performance. We attribute this to increased representational flexibility: normalization constrains the embedding space and may reduce expressiveness. Moreover, in JEPA-style objectives, context–target pairs are resampled each epoch via random-walk subgraphing, so the prediction targets continually change, preventing the loss from flattening to a stationary minimum. Similar patterns have been reported in recent JEPA implementations (e.g., C-JEPA) and reflect the stochastic nature of the training dynamics rather than instability.
As a reference, we also provide the learning curves for self-supervised pretraining using node/edge masking (Fig. 13a) and pseudolabel prediction (Fig. 13b) of polymer graphs, as introduced by Gao et al.15
Fig. 13 MSE loss curves for node/edge masking and pseudolabel prediction, as introduced by Gao et al.15 (a) MSE loss curve for masked node/edge prediction. (b) MSE loss curve for pseudolabel prediction.
We conduct a nine-fold cross validation, each fold holding out one A-monomer. The compared models include the JEPA-pretrained wD-MPNN, the same model fine-tuned only, the input-space SSL model of Gao et al.,15 and the tree-based baselines (random forest and XGBoost) trained on oligomer fingerprints, as described in the main manuscript. Because the monomer-holdout setup induces substantial variability between folds, results are summarized as boxplots across all folds and random seeds in Fig. 14 and 15. Additionally, we provide per-monomer RMSE curves for the different finetune dataset sizes, shown in Fig. 16.
Fig. 14 R2 performance under the monomer-holdout split for wD-MPNN without and with JEPA-based pretraining, tree-based baselines, and input-space SSL (Gao et al.15). Higher is better.
Fig. 16 Monomer-specific RMSE curves for JEPA pretraining, input-space SSL (Gao et al.15), and the non-pretrained baseline, illustrating variability in generalization across monomer types. The titles of the subfigures are the respective monomer SMILES of the held-out test A-monomer.
Fig. 17 RMSE performance under the random split for wD-MPNN without and with JEPA-based pretraining, tree-based baselines, and input-space SSL (Gao et al.15). Lower is better.
The tree-based baselines (random forest and XGBoost) trained on handcrafted fingerprints are less competitive under the monomer-holdout split than under the random split. This indicates that fingerprint-based models tend to be more sensitive to distribution shifts than graph-based methods that learn directly from molecular structure.
Interestingly, JEPA-based and input-space SSL models excel on different monomers. For instance, JEPA performs better for (*)c1ccc(*)cc1, whereas input-space SSL performs better for (*)c1cc2cc3sc(*)cc3cc2s1. For each held-out monomer, the plots also report the maximum Tanimoto similarity to any training-set monomer (upper-right corner). However, we observe no consistent relationship between similarity, RMSE magnitude, and the relative benefit of pretraining.
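The reported maximum Tanimoto similarity can be computed from binary fingerprints represented as sets of on-bits; the bit sets below are hypothetical stand-ins for real monomer fingerprints:

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity of two binary fingerprints given as sets of on-bits."""
    inter = len(fp_a & fp_b)
    return inter / (len(fp_a) + len(fp_b) - inter)

# Hypothetical on-bit sets: one held-out monomer vs. each training monomer.
held_out = {1, 4, 7, 9}
train = [{1, 4, 8}, {1, 4, 7, 9, 12}]
max_sim = max(tanimoto(held_out, fp) for fp in train)
print(round(max_sim, 2))  # 0.8
```

The per-plot value in Fig. 16 corresponds to this maximum taken over the full training set of monomers.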
Footnotes
† These authors contributed equally to this work.
‡ Francesco Piccoli completed this work before joining Amazon.
This journal is © The Royal Society of Chemistry 2026