Integrating equivariant architectures and charge supervision for data-efficient molecular property prediction

Zixiao Yang ab, Hanyu Gao c and Xian Kong *ab
aSouth China Advanced Institute for Soft Matter Science and Technology, School of Emergent Soft Matter, South China University of Technology, Guangzhou, China. E-mail: xk@scut.edu.cn
bGuangdong Provincial Key Laboratory of Functional and Intelligent Hybrid Materials and Devices, South China University of Technology, Guangzhou, China
cDepartment of Chemical and Biological Engineering, Hong Kong University of Science and Technology, Hong Kong, SAR, China

Received 15th September 2025 , Accepted 21st December 2025

First published on 24th December 2025


Abstract

Understanding and predicting molecular properties remains a central challenge in scientific machine learning, especially when training data are limited or task-specific supervision is scarce. We introduce the molecular equivariant transformer (MET), a symmetry-aware pretraining framework that leverages quantum-derived atomic charge distributions to guide molecular representation learning. MET combines an equivariant graph neural network (EGNN) with a transformer architecture to extract physically meaningful features from three-dimensional molecular geometries. Unlike previous models that rely purely on structural inputs or handcrafted descriptors, MET is pretrained to predict atomic partial charges, which are quantities grounded in quantum chemistry. This enables MET to capture essential electronic information without requiring downstream labels. We show that this pretraining scheme improves performance across diverse molecular property prediction tasks, particularly in low-data regimes. Analyses of the learned representations reveal chemically interpretable structure–property relationships, including the emergence of functional group patterns and smooth alignment with molecular dipoles. Ablation studies confirm that the EGNN encoder plays a crucial role in capturing transferable spatial features, while the transformer layers adapt these features to specific prediction tasks. This architecture draws direct analogies to quantum mechanical basis transformations, where MET learns to transition from coordinate-based to electron-based representations in a symmetry-preserving manner. By integrating domain knowledge with modern deep learning techniques, MET offers a unified and interpretable framework for data-efficient molecular modeling, with broad applications in computational chemistry, drug discovery, and materials science.



Design, System, Application

We propose a molecular design/optimization strategy that learns symmetry-preserving 3D representations by pretraining an equivariant GNN + transformer to predict quantum-derived atomic partial charges. Aligning the latent space with electronic distributions yields physically grounded, transferable descriptors that require minimal labeled data across downstream tasks. The desired system functionality is an interpretable, data-efficient property predictor that (i) preserves rotation/translation equivariance, (ii) captures long-range interactions through attention, and (iii) remains lightweight for routine screening. Key design constraints include access to reasonable 3D coordinates/conformers and charge labels for pretraining, with current validation focused on small organic molecules; the architecture is modular to incorporate multi-conformer inputs, alternative charge schemes, and task-specific heads. Immediate applications include rapid screening of orbital energies, dipoles, total energies, and reactive sites, as well as uncertainty-aware active learning in low-data regimes. Longer-term, the charge-aware latent space can seed closed-loop generative design, accelerate lead optimization and materials discovery, and integrate with physics-guided fine-tuning to extend to larger molecules and condensed-phase systems.

1 Introduction

Accurately predicting molecular properties remains a central challenge in computational chemistry and molecular design. While quantum mechanical methods such as density functional theory (DFT) offer reliable insights into electronic structure, their high computational cost limits their applicability in large-scale screenings.1 To balance efficiency and accuracy, hierarchical screening protocols are often employed, beginning with rapid approximate evaluations followed by refined quantum-level assessments.

Recent progress in deep learning has opened new avenues for molecular property prediction, hinging on the quality of molecular representations. Traditional encodings—such as SMILES strings and 2D molecular graphs—fail to capture the full complexity of molecular behavior, particularly when three-dimensional (3D) structure plays a critical role. Properties such as dipole moments, reactivity, and orbital energies depend sensitively on the spatial arrangement of atoms and electrons, motivating the development of 3D-aware representations. Graph neural networks (GNNs), including SchNet2 and MPNN,3 have advanced this goal by learning structural features from molecular graphs. However, many models neglect a key physical principle: equivariance under rotation and translation.4,5 Without this symmetry, predictions may vary under rigid transformations of the input, limiting generalizability in tasks involving stereochemistry or conformational flexibility.6 Moreover, most molecular machine learning models rely on unsupervised or self-supervised training due to the scarcity of labeled data. Models such as MolCLR7 and ChemBERTa8 extract general-purpose embeddings from large unlabeled datasets. However, these approaches often bypass a fundamental determinant of molecular properties: the electronic density. Direct supervision using physically meaningful quantities could yield more informative and compact representations, especially in data-limited scenarios. Recent work has injected chemical knowledge into contrastive learning frameworks—such as constructing element- and functional-group-based knowledge graphs—achieving consistent gains across multiple datasets and offering a promising direction for parallel enhancements. Although this method leads to high performance on downstream tasks, it suffers from a lack of accurate data, which makes large-scale pretraining difficult.9

In many practical molecular design and screening problems, the target properties are experimentally measured quantities such as reaction yields, catalytic activities, and solubilities in complex environments. These properties often have limited and noisy datasets—typically only a few hundred to a few thousand reliable labels—and no large corpus exists for property-specific pretraining. As a result, self-supervised approaches alone may not provide sufficiently informative representations, and property-specific models pretrained on massive quantum datasets are not available for most practical observables. In such cases, it is advantageous to pretrain a single encoder on a tractable quantum-derived quantity that captures essential electronic structure, and then fine-tune that encoder across diverse downstream tasks. Recent studies have shown that this cross-task transfer within the same small-molecule domain can substantially improve stability and accuracy in low-data regimes.10

In this work, we propose a streamlined and physically motivated framework that integrates two complementary strategies: symmetry-aware message passing via an equivariant graph neural network (EGNN)11,12 and global representation learning through a transformer architecture. Crucially, we introduce atomic partial charges—readily obtainable proxies for electron density—as supervised targets during pretraining. This charge-guided supervision grounds representation learning in quantum mechanics, resulting in descriptors that are compact, transferable, and data-efficient. Compared to state-of-the-art models that are architecturally complex and data-intensive, our method is intentionally lightweight and accessible, making it feasible for academic laboratories with modest computational resources. By embedding physical priors directly into the model design and training strategy, we offer a more interpretable and scalable approach to accurate molecular property prediction. Compared with deep wavefunction approaches that directly solve the electronic Schrödinger equation,13 our work focuses on data-driven supervised property prediction, offering a more practical trade-off between accuracy and computational cost for large-scale screening.

In summary, our approach bridges the gap between deep learning and first-principles chemistry by incorporating symmetry, geometry, and electron density into a unified molecular representation. This enables robust performance with limited data and presents a promising direction for the next generation of molecular machine learning models.

2 Methods

2.1 Model architecture

2.1.1 General architecture. We introduce a hierarchical molecular property prediction framework called the molecular equivariant transformer (MET), which integrates an equivariant graph neural network (EGNN) with a transformer architecture to capture both local geometric and global electronic features, as depicted in Fig. 1(a). The architecture addresses two critical challenges: (1) providing a universal structure that leverages physically interpretable DFT-derived partial charges for supervised pretraining and (2) mitigating the boundary effects common in traditional graph models. Overall, the combination of EGNN and transformer modules enables our model to effectively encode both the local geometric features and global electronic distributions of molecules, which are critical determinants of molecular properties, thereby enabling accurate and scalable predictions.
Fig. 1 Framework. (a) General flowchart of the network. (b) Diagram illustrating how the EGNN integrates 3D positional information into the main message flow. (c) Illustration of the transformer's data processing structure. The upper part of (a) shows the data stream during pretraining; when the pretrained model is applied, the pooling module is discarded and the lower part of (a) is activated.

The initial inputs to the network are the indices of atoms in each molecule, which are embedded into a latent space as atomic embeddings, as shown in Fig. 1(a). The EGNN then updates the atomic embeddings using both positional information and the embeddings themselves, whereas the transformer operates on the embeddings alone. After the transformer, the atomic embeddings in the latent space are pooled to a single dimension, corresponding to the charge on each atom. Once the pretraining phase is complete, the charge-pooling module is discarded, and an additional transformer module is introduced to process the molecular representations in the latent space for subsequent tasks. Below, we describe the network's operation and information processing in detail.

2.1.2 Equivariant graph neural network (EGNN). The EGNN module in our architecture, built on ComENet,14 enforces geometric equivariance via a rotation- and translation-invariant message passing mechanism. From a representation learning perspective, jointly learning invariant and equivariant components can better preserve geometric information and enhance downstream generalization, as evidenced by both theoretical and empirical studies in label-free settings.15 The network updates the node (atomic) embeddings by incorporating invariant geometric features derived from interatomic distances, bond angles, and torsional angles.16 In each message passing layer, the node embeddings $x_i^{(l)}$ are first transformed by a linear mapping and a non-linear activation:
 
$$z_i = \sigma\!\left(W_0\, x_i^{(l)}\right), \tag{1}$$
where $\sigma(\cdot)$ denotes the Swish activation function and $W_0$ is a learnable weight matrix. Subsequently, two parallel branches are used to incorporate different edge features. In the first branch, the edge features $f_{ij}^{(1)}$ (obtained from torsional and distance information) are transformed by a function $\phi_1$ (implemented as a two-layer MLP) to modulate the messages from neighboring nodes:
 
$$m_i^{(1)} = \sum_{j \in \mathcal{N}(i)} \phi_1\!\left(f_{ij}^{(1)}\right) \odot z_j, \tag{2}$$
Similarly, in the second branch, edge features $f_{ij}^{(2)}$ (derived from angular information) are processed via a transformation $\phi_2$:
 
$$m_i^{(2)} = \sum_{j \in \mathcal{N}(i)} \phi_2\!\left(f_{ij}^{(2)}\right) \odot z_j, \tag{3}$$

The outputs of the two branches are then concatenated and fused via a linear mapping, with a residual connection added from the initial transformed node feature:

$$h_i = W_{\mathrm{cat}}\left[\,m_i^{(1)} \,\|\, m_i^{(2)}\,\right] + z_i.$$
A further residual update is applied to enhance non-linearity, i.e. $h_i \leftarrow h_i + \sigma(W_{\mathrm{res}} h_i)$. After normalization using GraphNorm, we have $\bar{h}_i = \mathrm{GraphNorm}(h_i)$, and the updated node embedding for the next layer is obtained by
 
$$x_i^{(l+1)} = W_{\mathrm{final}}\,\bar{h}_i. \tag{4}$$

After a series of such message-passing layers, the final node features are encoded into a latent representation via an additional linear transformation. These enriched features, which encapsulate the molecular geometric and chemical information, are then further processed by a transformer module.
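The layer update in eqn (1)–(4) can be sketched in plain NumPy. This is an illustrative simplification rather than the actual ComENet implementation: the two-layer MLPs φ1 and φ2 are replaced here by precomputed per-edge gate vectors, GraphNorm by a feature-wise standardization, and all weight names are hypothetical.

```python
import numpy as np

def swish(x):
    # Swish activation: x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def egnn_layer(x, edges, f1, f2, W0, Wcat, Wres, Wfinal):
    """One simplified ComENet-style message-passing layer (eqn (1)-(4)).

    x:      (N, D) node embeddings
    edges:  list of directed (i, j) neighbor pairs
    f1, f2: dicts mapping (i, j) -> (D,) edge gates standing in for the
            learned phi_1/phi_2 outputs (torsion/distance and angle branches)
    """
    N, D = x.shape
    z = swish(x @ W0.T)                       # eqn (1)
    m1 = np.zeros((N, D))
    m2 = np.zeros((N, D))
    for (i, j) in edges:                      # eqn (2)-(3): gated neighbor sums
        m1[i] += f1[(i, j)] * z[j]
        m2[i] += f2[(i, j)] * z[j]
    h = np.concatenate([m1, m2], axis=1) @ Wcat.T + z   # fuse branches + residual
    h = h + swish(h @ Wres.T)                 # extra residual non-linearity
    h = (h - h.mean(0)) / (h.std(0) + 1e-6)   # GraphNorm stand-in
    return h @ Wfinal.T                       # eqn (4)
```

Note that node features stay (N, D) across the layer, so several such layers can be stacked before the final linear encoding into the latent space.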

2.1.3 Transformer component. Following the EGNN module, a transformer architecture is incorporated to aggregate global information across the entire molecular graph.17,18 The transformer processes the latent representations z (output from the EGNN's latent space conversion) to capture long-range dependencies and integrate global atomic distribution information, which is essential for accurate property prediction, as illustrated in Fig. 1(c).

In the transformer module, the query (Q), key (K), and value (V) matrices are computed by applying linear transformations to the input latent representations z: $Q^{(l)} = W_Q^{(l)} z^{(l)}$, $K^{(l)} = W_K^{(l)} z^{(l)}$, and $V^{(l)} = W_V^{(l)} z^{(l)}$, where $W_Q^{(l)}$, $W_K^{(l)}$, and $W_V^{(l)}$ are learnable weight matrices specific to the query, key, and value transformations in layer l, each of dimension $\mathcal{D} \times \mathcal{D}$ (where $\mathcal{D}$ is the dimension of the atomic latent space).19 These transformations project the input embeddings into different subspaces, enabling the model to compute attention scores that determine the relevance of each atom's information with respect to others in the molecular graph.

The self-attention mechanism computes a weighted sum of the value vectors V based on the compatibility between query Q and key K vectors,

 
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\mathsf{T}}}{\sqrt{d_k}}\right) V, \tag{5}$$
where Q, K, and V are the query, key, and value matrices, respectively, computed as linear transformations of the latent embeddings z, and $d_k$ is the key dimension. Dividing by $\sqrt{d_k}$ scales the dot product to prevent excessively large values. This mechanism allows the transformer to focus on relevant parts of the molecular graph, effectively capturing global interactions and dependencies among atoms.
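Eqn (5) is the standard scaled dot-product attention; a minimal NumPy sketch follows (the max-subtraction before the softmax is a common numerical-stability detail, not something stated in the text):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Eqn (5): softmax(Q K^T / sqrt(d_k)) V over per-atom embeddings."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # atom-atom compatibility
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V, weights
```

Each row of `weights` is a probability distribution over atoms, so every updated embedding is a convex combination of the value vectors of all atoms in the molecule.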

2.1.4 Task-specific output adaptation. During pretraining, the model predicts atomic charges. After the transformer layers, the atomic embeddings are pooled to a single dimension to yield per-atom partial charges. This charge-focused output configuration is essential for learning foundational molecular representations.

For downstream applications predicting other chemical properties, the output architecture is adapted to match the task granularity. The pretrained model's charge-pooling module is removed, decoupling it from the core representation, and a task-specific output module is introduced to process the latent-space representations. For downstream tasks such as HOMO/LUMO energy prediction, the pretrained model is adapted by replacing its pooling layers with task-specific MLP heads. Specifically, the output z from the EGNN backbone is processed as $\hat{y} = \mathrm{Pooling}(\mathrm{MLP}_{\mathrm{task}}(z))$, where $\mathrm{MLP}_{\mathrm{task}}$ transforms the node-level representations and the pooling step aggregates them into a molecule-level embedding. Fine-tuning was performed with a frozen EGNN backbone using the same procedure as pretraining; the rationale for freezing the backbone is discussed in section 3.3.

This module flexibly generates either: 1) atomic-level properties (e.g., reactive site identification) by transforming embeddings through dedicated decoders, or 2) molecular-level properties (e.g., HOMO/LUMO energies, reaction propensity) by aggregating atomic features (via pooling or attention) followed by prediction heads. This adaptive design maintains the hierarchical feature synergy from the EGNN and transformer while enabling diverse chemical predictions.
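The head replacement described above can be sketched as follows, with the frozen backbone represented only by its output z; the function and weight names are hypothetical:

```python
import numpy as np

def swish(x):
    # Swish activation, matching the backbone's non-linearity
    return x / (1.0 + np.exp(-x))

def molecule_head(z, W1, b1, W2, b2):
    """Hypothetical task head implementing y_hat = Pooling(MLP_task(z)).

    z: (N, D) per-atom latent representations from the frozen backbone.
    Returns one molecule-level scalar (e.g., a HOMO energy estimate).
    """
    h = swish(z @ W1.T + b1)   # node-level MLP transform
    y_atom = h @ W2.T + b2     # per-atom contribution, shape (N, 1)
    return y_atom.mean()       # mean pooling -> molecule-level prediction
```

For atomic-level targets (e.g., reactive site identification), the pooling step would simply be dropped, returning `y_atom` directly.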

2.2 Data selection and preprocessing

2.2.1 Dataset. The QM9 dataset20 was used, containing 133,885 small organic molecules with up to 9 heavy atoms. We selected molecules with complete electronic structure data, including atomic Mulliken partial charges21 (B3LYP/6-31G(d) level), HOMO/LUMO energies, and molecular dipole moments. A more detailed summary of the chemical composition of QM9 and QM7 is provided in Fig. S1, which shows representative structures, elemental distributions, molecular-weight histograms, and functional-group statistics. These analyses highlight that both datasets consist of small neutral organic molecules with similar elemental makeup and a broad variety of common functional motifs, making QM9 a chemically well-matched pretraining source for downstream evaluation on QM7. The dataset was randomly split into training and test sets using an 80% : 20% ratio.
2.2.2 Graph representation. Molecules were represented as undirected graphs, where nodes correspond to atoms and edges reflect spatial proximity. Two atoms are connected by an undirected edge if their Euclidean distance is below a cutoff radius rcut = 8 Å. This value was determined from a geometric analysis on a random 20% subset of QM9 (26,777 molecules), where the average number of neighbors per atom saturates beyond 8 Å (Table 1). From a chemical perspective, covalent bond lengths typically range between 0.7 and 2.5 Å. Thus, a cutoff of 8 Å captures interactions up to three times longer than a typical bond, encompassing both direct and medium-range structural correlations, while avoiding an excessive number of long-range edges that contribute little to local message passing. Geometries were taken from the QM9 and QM7 datasets and processed using a custom PyTorch-based data loader.
Table 1 Average number of neighboring atoms per atom and per molecule
rcut (Å) Avg. neighbours/atom Avg. neighbours/molecule
2.0 2.63 47.32
3.0 8.21 147.81
4.0 12.49 224.83
6.0 16.86 303.39
8.0 17.44 313.87
10.0 17.48 314.52
12.0 17.48 314.55
15.0 17.48 314.55
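The radius-graph construction with rcut = 8 Å can be reproduced in a few lines of NumPy (a stand-in for the custom PyTorch data loader; the helper name is ours). Sweeping rcut in this construction is what produces the saturation behavior summarized in Table 1.

```python
import numpy as np

def build_edges(coords, r_cut=8.0):
    """Connect every atom pair closer than r_cut (in Angstrom).

    Returns directed (i, j) pairs, so each undirected edge appears twice.
    """
    coords = np.asarray(coords, dtype=float)
    # Pairwise Euclidean distance matrix, shape (n, n)
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    i, j = np.where((d < r_cut) & (d > 0.0))   # drop self-loops
    return list(zip(i.tolist(), j.tolist()))

# Three collinear atoms at 0, 1.5 and 9 Angstrom: the outer pair (9 A apart)
# exceeds the 8 A cutoff, the other two pairs do not.
edges = build_edges([[0, 0, 0], [1.5, 0, 0], [9.0, 0, 0]])
```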


2.3 Training procedure

The MET model was pre-trained end-to-end for the task of predicting atomic partial charges.22,23 The training objective is defined as the mean squared error (MSE) between the predicted atomic charges $\hat{q}_i$ and the reference values $q_i$:
$$\mathcal{L}_{\mathrm{charge}} = \frac{1}{N}\sum_{i=1}^{N}\left(\hat{q}_i - q_i\right)^2,$$
which directly penalizes deviations in charge prediction. Compared to the popular self-supervised methods used in models such as MG-BERT24 or Uni-Mol+,25 which require complicated loss functions, MET uses a very simple loss function. This choice rests on the basic chemical insight that the charge distribution is a crucial factor for extracting molecular information.
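As a sketch, the objective is ordinary MSE over per-atom charges (hypothetical helper name):

```python
import numpy as np

def charge_mse(q_pred, q_true):
    """Pretraining loss: mean squared error over atomic partial charges."""
    q_pred = np.asarray(q_pred, dtype=float)
    q_true = np.asarray(q_true, dtype=float)
    return float(np.mean((q_pred - q_true) ** 2))

# Two atoms each off by 0.1 e -> loss of 0.01
loss = charge_mse([0.1, -0.1], [0.0, 0.0])
```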

3 Results

3.1 Benchmarking

To comprehensively evaluate the predictive performance and generalization capabilities of the molecular equivariant transformer (MET), we conducted benchmarks using two datasets from MoleculeNet:26 QM9 (ref. 20) and QM7 (ref. 27), which contain quantum mechanically computed properties (e.g., HOMO/LUMO energies, total energies). These datasets were strategically selected to assess MET's ability to generalize across various molecular properties and computational paradigms. Table 2 summarizes MET's performance across these datasets, with each evaluation repeated with three random seeds to ensure reproducibility and statistical robustness. Baseline models were taken from the Uni-Mol benchmark,28 reflecting state-of-the-art molecular representation methods.
Table 2 Performance comparison of MET against baseline models on transfer learning tasks. The table is transposed to facilitate comparison across models for each dataset. It presents the root mean square error (RMSE), where lower values are better
Dataset Unit D-MPNN Attentive FP N-Gram_RF N-Gram_XGB Pretrain GNN GEM Uni-Mol MET
Note: error bars represent the standard deviation across three runs with different random seeds. The evaluation metric for QM7 is total energy, and for QM9, it includes HOMO, LUMO, and energy gap. MET outperforms baseline models in all tasks.
QM9 Hartree 0.00814 0.00812 0.01037 0.00964 0.00922 0.00746 0.00467 0.00344 ± 0.00006
QM7 kcal mol−1 103.5 72.0 92.8 81.9 113.2 58.9 41.8 35.3 ± 5.8


On QM9, our primary dataset comprising approximately 134,000 small organic molecules, MET was pretrained using atomic partial charges as supervised labels. This physically motivated pretraining strategy enabled efficient transfer learning for downstream property prediction tasks, notably HOMO, LUMO, and energy gap estimation. As shown in Table 2, MET achieved a remarkably low RMSE of 0.00344 hartree on the three tasks, significantly surpassing baseline models. This highlights MET's effective utilization of charge-based representations, even for tasks beyond its direct pretraining objective.

Despite the differences in both molecular composition and quantum-chemical protocol compared with QM9, MET achieved an RMSE of 35.3 ± 5.8 kcal mol−1 on QM7 (Table 2), improving upon the best Uni-Mol baseline (41.8 kcal mol−1) by about 16%. To put this error into context, the reference energies in QM7 span a range from −2192.0 to −404.9 kcal mol−1 with a standard deviation of 223.9 kcal mol−1; the MET RMSE therefore corresponds to roughly 2% of the total energy range and about 0.16 times the dataset standard deviation. This accuracy is not yet sufficient to replace high-level DFT calculations when chemically precise absolute energies are required, but it is adequate for coarse ranking and pre-screening and demonstrates that the charge-supervised representation transfers reliably to a distinct dataset computed at a different level of theory.
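The two context figures quoted above follow directly from the stated numbers; a quick arithmetic check:

```python
# Values quoted in the text for QM7 (all in kcal/mol)
rmse = 35.3                      # MET test RMSE
e_min, e_max = -2192.0, -404.9   # reference total-energy range
std = 223.9                      # dataset standard deviation

frac_of_range = rmse / (e_max - e_min)  # fraction of the total energy range
frac_of_std = rmse / std                # fraction of one standard deviation
# frac_of_range comes to about 0.02 and frac_of_std to about 0.16, as stated.
```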

Collectively, these benchmarking results underscore MET's exceptional capability for predicting molecular properties by leveraging physically meaningful atomic charges and symmetry-aware architectures. MET effectively reduces data demands and computational overhead through transfer learning and fine-tuning strategies, establishing a powerful yet computationally accessible framework for molecular property prediction in computational chemistry.

3.2 Structure analysis

To systematically evaluate the architectural dependencies and the effectiveness of MET's pretraining strategy, we conducted three comparative analyses. First, we examined the role of symmetry-aware processing by replacing the EGNN backbone with a conventional, non-equivariant GNN to assess how explicit rotational and translational equivariance contributes to the quality and transferability of molecular representations. Second, we compared the performance of pretrained MET models against models trained from scratch across varying dataset sizes to highlight the necessity and benefits of pretraining, particularly in small-data regimes. Finally, we assessed how atomic embedding dimensionality influences both pretraining and downstream prediction fidelity, identifying optimal configurations that balance expressivity, stability, and computational efficiency.
3.2.1 Role of equivariance: EGNN versus GNN. We isolated the architectural contribution of symmetry-aware processing by replacing MET's EGNN layers with conventional graph neural network (GNN) layers, which lack explicit rotational and translational equivariance and rely solely on distance-based edge features. On the atomic charge prediction pretraining task, the EGNN-based MET achieved a validation R2 of 0.999, while the GNN-based variant reached only 0.983 (Fig. 2(a)). This significant gap underscores the critical role of equivariant layers in capturing geometric and spatial dependencies that are essential for constructing transferable molecular representations. By embedding physical symmetries directly into the message-passing process, EGNN layers enhance predictive performance and data efficiency, particularly during fine-tuning where labeled data may be scarce.
Fig. 2 The influence of architecture and pretraining: (a) true versus predicted charge values on a 1k QM9 test set for the EGNN model (blue) and GNN model (red), both trained on 80k QM9 molecules. The abscissa indicates the true value and the ordinate the model prediction; the closer the points lie to the diagonal, the more accurate the predictions. (b) MET pretrained on atomic charges (blue) outperforms direct training (red) across dataset scales (1000–100,000 molecules), demonstrating improved generalization from physically motivated inductive biases. (c) True versus predicted energy values on a QM7 test set of 1000 samples for the pretrained model (blue) and the directly trained model (red), each trained on 500 molecules. (d) Dipole moment prediction accuracy declines sharply at smaller embedding dimensions and plateaus beyond 128 dimensions, indicating that 128 dimensions provide an optimal balance. Blue: charge prediction accuracy (R2) remained stable across embedding dimensions. Square marks: dataset of 1000 molecules; pentagon marks: 5000 molecules; triangle marks: 100,000 molecules.
3.2.2 Pretraining versus training from scratch. To evaluate the necessity and effectiveness of MET's pretraining strategy, we systematically compared the performance of MET models pretrained on atomic partial charges against models trained entirely from scratch. We first employed dipole moment prediction from the QM9 dataset as a representative downstream task, noting that this task shares the same molecular configurations used during pretraining, albeit targeting a distinct molecular property. As shown in Fig. 2(b), the pretrained MET consistently outperformed models trained from scratch, particularly in small- to moderate-sized datasets (up to approximately 20,000 molecules). This superior performance in data-scarce regimes highlights that pretraining effectively incorporates physically informed inductive biases, enhancing robustness and mitigating overfitting. However, when dataset sizes surpassed roughly 20,000 molecules, the benefits of pretraining diminished as models trained from scratch gradually achieved comparable performance. This shift underscores that sufficiently large datasets enable end-to-end trained models to directly capture task-specific molecular features without explicit inductive priors.

To further probe the generalization capability and data efficiency imparted by MET's pretraining, we conducted an additional evaluation using the QM7 dataset, predicting molecular conformational energies computed at the B3LYP/6-31G* level. To intentionally simulate a data-limited scenario, we randomly selected only 500 molecules from QM7 for training. Remarkably, the pretrained MET model attained a predictive performance of R2 = 0.548, surpassing the R2 = 0.336 achieved by the model trained from scratch (Fig. 2(c)). These results convincingly demonstrate that MET's pretraining strategy significantly enhances predictive reliability, even when transferring across distinct molecular datasets and properties, thus affirming the practical advantages of leveraging physically motivated pretraining in resource-constrained scenarios.

3.2.3 Embedding dimension analysis. We investigated the impact of atomic embedding dimensionality on both atomic charge prediction (pretraining objective) and downstream dipole moment prediction accuracy. Embedding dimensions were systematically varied across 8, 16, 32, 64, and 128 dimensions, and the corresponding predictive performance, quantified by R2, was analyzed (Fig. 2(d)).

For the pretraining task of atomic charge prediction, predictive performance remained robust and exhibited minimal sensitivity to changes in embedding dimensionality. This insensitivity likely arises because atomic charges are predominantly influenced by local electronic environments and fundamental chemical features, which can be effectively captured by relatively compact representations even at lower dimensions. Consequently, additional embedding dimensions offer negligible incremental information for this inherently local and chemically constrained prediction task.

In contrast, downstream dipole moment predictions exhibited substantial sensitivity to embedding dimensionality. Predictive performance deteriorated sharply when the embedding dimension dropped below 64 and improved progressively with increasing dimensionality, reaching a plateau at 128 dimensions. Notably, similar trends were consistently observed across datasets of different sizes (1000, 5000, and 100,000 molecules). Although absolute predictive performance improved with increasing dataset size, the saturation at 128 dimensions remained consistent, indicating a fundamental representational capacity limit independent of dataset size. This suggests that the complexity of global molecular properties such as dipole moments requires richer representations capable of capturing nuanced interatomic interactions and spatial correlations, which are effectively encoded within higher-dimensional embeddings.

Importantly, these dimensional analyses provide deeper insights into MET's representational capabilities and limitations. While local chemical properties (atomic charges) benefit less from extensive embedding dimensionality, global electronic properties require more expressive embeddings to fully capture the complex, long-range correlations within molecular structures. The saturation observed at 128 dimensions suggests an optimal trade-off, balancing model expressivity with computational efficiency and preventing parameter redundancy. Accordingly, 128-dimensional embeddings were employed throughout subsequent studies, ensuring robust and computationally efficient model performance.

These results emphasize the complementary roles of pretraining, symmetry-aware EGNN backbones, and embedding dimensionality in determining MET's performance. Pretraining on quantum-derived partial charges provides inductive priors that mitigate overfitting in low-data regimes, while EGNN layers deliver symmetry-preserving representations that generalize across molecular conformations and computational conditions. Appropriate control of embedding dimensionality further balances capacity and stability, ensuring that MET performs robustly across diverse molecular property prediction tasks. These insights facilitate informed architectural decisions and highlight key theoretical considerations underpinning successful molecular representation learning strategies.

3.3 Component ablation test

While the preceding structure analysis highlighted the influence of pretraining and embedding dimensionality on model performance, the specific contributions of MET's architectural components to downstream predictive tasks remained unclear. To assess each component's relative importance, we conducted a detailed ablation study, systematically evaluating their contributions to downstream property predictions. Here, we focused on predicting HOMO energies using a representative subset of 1000 molecules from the QM9 dataset. Specifically, we progressively unfroze layers—from the transformer (top-most layers) down to the embedding layer (bottom-most layer)—to investigate the influence of increasing the number of trainable parameters on predictive performance. This approach enabled us to precisely assess the incremental value added by each architectural component and provided insight into MET's internal representational hierarchy.

Fig. 3 illustrates the evolution of the model's predictive capability, quantified by the validation R2 scores, alongside the corresponding number of trainable parameters as additional layers are progressively unfrozen. Starting from the left, only the final transformer layers were trainable, representing minimal adaptation to the downstream task. This configuration exhibited the lowest predictive performance, highlighting that fine-tuning exclusively at the transformer level is insufficient to achieve high-quality predictions. Predictive performance significantly improved as more layers became trainable, initially encompassing the transformer, encoder, and linear layers. This observed improvement clearly indicates that while the pretrained layers encode useful generic molecular features, their representations still require refinement and adaptation through task-specific fine-tuning to achieve optimal predictions.


image file: d5me00173k-f3.tif
Fig. 3 Ablation study of layer freezing for HOMO prediction, conducted on the same dataset of 1000 QM9 molecules with consistent construction. Performance (RMSE) is shown across varying numbers of frozen layers (horizontal axis: frozen layers counted from bottom to top). The model architecture comprises embedding (L1), EGNN (L2–L5), linear (L6–L9), and transformer (L10) layers.

Notably, the largest gains in performance emerged upon unfreezing layers up to, but not including, the equivariant graph neural network (EGNN) layers. Once the EGNN layers became trainable, the incremental improvement in prediction accuracy plateaued, suggesting that EGNN layers inherently capture core molecular features crucial for accurate property prediction. This result confirms that EGNNs effectively encode geometric and spatial invariances that generalize across molecular systems, making them particularly suitable for transfer learning scenarios involving small-scale fine-tuning datasets. In other words, EGNN layers, by virtue of their symmetry-aware inductive biases, substantially reduce the amount of data required for effective downstream fine-tuning.

Furthermore, fully unfreezing the embedding layer, a setting effectively analogous to training from scratch, led to a notable degradation in predictive performance, with accuracy even poorer than when only the transformer layers were trainable. This decline is attributable to the substantial increase in trainable parameters overwhelming the limited dataset, resulting in severe overfitting. These findings therefore emphasize the necessity of pretraining, demonstrating that pretrained embeddings and EGNN features are vital for stabilizing model training and maintaining robust predictive capabilities, especially when fine-tuning on small datasets.

This ablation analysis reveals that the EGNN component serves as a fundamental encoder of chemically meaningful and transferable geometric representations, conferring data efficiency and generalization advantages in fine-tuning scenarios. Additionally, the detrimental effect of fully trainable embeddings underscores the critical role of pretraining in providing stable initializations, especially in data-scarce regimes. These insights provide valuable guidelines for future model development and practical deployment, underscoring how thoughtful selection of trainable parameters during fine-tuning can balance representational flexibility with model generalizability.

3.4 Alignment and uniformity analysis

To further investigate MET's capability to generate meaningful molecular representations, we conducted a comprehensive alignment analysis, examining how chemically relevant features and properties are organized in the model's latent embedding space. Alignment performance refers to the extent to which molecular representations systematically capture and cluster chemically or physically similar molecules, thereby enhancing interpretability and predictive reliability.

We first applied the dimensionality reduction technique t-distributed stochastic neighbor embedding (t-SNE)29 to visualize MET-derived latent vectors from the QM9 dataset. As shown in Fig. 4(a), we plot the two t-SNE components, coloring each molecule by its dipole moment. Notably, the embeddings occupy the latent space uniformly, forming a roughly spherical distribution indicative of an efficient use of representational capacity. Importantly, molecules with different dipole magnitudes exhibit systematic spatial arrangements: decreases in the second t-SNE component correspond closely to reductions in molecular dipole moment. This systematic relationship strongly suggests that the pretrained latent vectors inherently encode electronic distribution features, reflecting our strategy of using partial charges as supervised labels during pretraining.
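A Fig. 4(a)-style visualization can be sketched with scikit-learn's t-SNE. The random latent vectors and dipole labels below are stand-ins for MET's 128-dimensional embeddings and the QM9 dipole moments, used only to show the projection and coloring workflow:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 128))     # placeholder for MET embeddings
dipole = np.abs(rng.normal(size=200))    # placeholder for dipole labels

# Project to two dimensions; perplexity must stay below the sample count.
coords = TSNE(n_components=2, perplexity=30, init="pca",
              random_state=0).fit_transform(latent)

# Coloring the scatter by dipole reproduces the alignment plot, e.g.:
#   plt.scatter(coords[:, 0], coords[:, 1], c=dipole, cmap="viridis")
```

With the real embeddings, structure in the colored scatter (rather than a random speckle) is what signals alignment between the latent space and the property.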


image file: d5me00173k-f4.tif
Fig. 4 (a) t-SNE alignment test on the full QM9 dataset, colored by dipole moment. (b) and (c) t-SNE alignment tests on the full QM9 dataset, labeled by the presence of carboxy and fluoro functional groups, respectively. (d) t-SNE alignment test on a small, purpose-built dataset for analysing molecular properties in latent space.

To further probe the chemical interpretability of the embeddings, we visualized the presence or absence of specific functional groups in the latent space (Fig. 4(b) and (c)). As illustrative examples, we chose fluoro- and carboxy-containing molecules from the QM9 dataset. In both cases, clear separations between molecules bearing these functional groups and those without are apparent. It is noteworthy that similar clustering behaviors were consistently observed for other representative functional groups (e.g., hydroxy, cyano). These distinct and chemically meaningful clusters demonstrate strong alignment, confirming that MET's embedding space accurately reflects structural and functional chemical information.

To systematically evaluate MET's ability to distinguish among chemically similar functional groups, we constructed a targeted test set of molecules with controlled functional variations. Specifically, we fixed an alkane backbone (C8H18) and systematically replaced hydrogen atoms with representative functional groups (fluoro, hydroxy, cyano, and carboxy), yielding 50 distinct molecules per group. The resulting latent-space t-SNE analysis (Fig. 4(d)) reveals clear separation among these functional classes. In particular, MET embeddings group fluoro- and hydroxy-substituted molecules closely together, consistent with established chemical knowledge that both groups exert strong electron-withdrawing inductive effects. Similarly, molecules bearing cyano and carboxy functionalities cluster adjacent to each other, consistent with their shared resonance-driven electron-withdrawing properties. This chemically meaningful arrangement further validates MET's effective use of partial-charge pretraining, reinforcing its robustness and interpretability.
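The probe-set construction can be sketched as follows. The SMILES templates and single-substitution scheme are an illustrative simplification of the paper's actual set (which contains 50 molecules per group, varying substitution positions and multiplicities), and symmetric backbone positions here yield chemically equivalent duplicates:

```python
# Fix an octane (C8H18) backbone and substitute one hydrogen per molecule
# with a functional group, written directly as SMILES fragments.
GROUPS = {"fluoro": "F", "hydroxy": "O", "cyano": "C#N", "carboxy": "C(=O)O"}

def substituted_octanes(group_smiles: str) -> list[str]:
    """One singly substituted octane SMILES per backbone carbon."""
    mols = []
    for pos in range(8):
        atoms = ["C"] * 8                  # linear C8 backbone
        atoms[pos] = f"C({group_smiles})"  # attach the group at this carbon
        mols.append("".join(atoms))
    return mols

probe_set = {name: substituted_octanes(smi) for name, smi in GROUPS.items()}
```

Feeding such a controlled set through the encoder and projecting with t-SNE isolates the effect of the functional group, since the backbone is held fixed across all classes.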

The alignment analysis highlights MET's capability to generate chemically informative and systematically organized molecular representations.30 By employing physically meaningful pretraining targets, MET achieves superior alignment performance, thus facilitating interpretability and enhancing its predictive utility in molecular property predictions.
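The alignment and uniformity notions invoked in this section can be quantified with the metrics of Wang and Isola;30 a minimal NumPy sketch follows, with the exponents α and t set to that paper's defaults (the helper names are our own):

```python
import numpy as np

def l2_normalize(z: np.ndarray) -> np.ndarray:
    """Project embeddings onto the unit hypersphere, as the metrics assume."""
    return z / np.linalg.norm(z, axis=1, keepdims=True)

def alignment(x: np.ndarray, y: np.ndarray, alpha: float = 2) -> float:
    """Mean distance between positive-pair embeddings (lower is better)."""
    return float(np.mean(np.linalg.norm(x - y, axis=1) ** alpha))

def uniformity(x: np.ndarray, t: float = 2) -> float:
    """Log of the mean pairwise Gaussian potential (lower = more uniform)."""
    sq = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    i, j = np.triu_indices(len(x), k=1)   # distinct pairs only
    return float(np.log(np.mean(np.exp(-t * sq[i, j]))))
```

Applied to MET embeddings, `alignment` would take pairs of representations that should match (e.g. two conformers or augmentations of the same molecule), while `uniformity` scores how evenly the whole set covers the hypersphere.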

4 Discussion and conclusion

This work presents a symmetry-preserving, pretrained molecular modeling framework that utilizes atomic partial charges as physically grounded supervision targets. By combining an equivariant graph neural network (EGNN) with a transformer-based architecture, the model captures both geometric invariants and electronic structure features. It demonstrates strong predictive performance with limited training data and requires only modest computational resources, making it accessible and practical for a wide range of applications.

The effectiveness of the model can be understood from the perspective of quantum chemistry. Molecular properties are governed by both atomic coordinates and the associated electronic distribution. In our approach, the EGNN acts as a geometric encoder that learns symmetry-consistent features from the spatial configuration of atoms. The transformer component further refines these features into representations that reflect underlying electronic properties. This two-step process resembles the transformation of a molecular wavefunction from real space into a basis that encodes chemically meaningful quantities, such as partial charges. Through this design, the model acquires a latent representation that simultaneously preserves spatial symmetries and captures essential electronic information, enabling robust transfer across diverse molecular tasks.
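The symmetry-consistency argument above can be illustrated with a minimal numerical sketch of an E(n)-equivariant update in the spirit of Satorras et al. (ref. 4): scalar coefficients replace the learned MLPs of a real EGNN layer, but the equivariance structure, with messages built from rotation-invariant quantities and coordinate updates directed along relative vectors, is the same.

```python
import numpy as np

def egnn_layer(h: np.ndarray, x: np.ndarray, w: float = 0.1):
    """One toy E(n)-equivariant update for N atoms with scalar features h
    (shape (N,)) and coordinates x (shape (N, 3))."""
    diff = x[:, None, :] - x[None, :, :]        # (N, N, 3) relative vectors
    d2 = np.sum(diff ** 2, axis=-1)             # invariant squared distances
    m = np.exp(-d2) * (h[:, None] + h[None, :]) # toy invariant messages
    np.fill_diagonal(m, 0.0)                    # no self-messages
    h_new = h + m.sum(axis=1)                   # invariant feature update
    x_new = x + w * np.sum(diff * m[..., None], axis=1)  # equivariant update
    return h_new, x_new
```

Because the messages depend only on invariants, rotating the input coordinates leaves the updated features unchanged and rotates the updated coordinates by the same matrix, which is exactly the property that lets the geometric features transfer across molecular orientations and conformations.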

We have shown that the pretrained model performs well across a variety of downstream prediction tasks, including dipole moments and frontier orbital energies. Analyses of the embedding space revealed that it encodes both global molecular characteristics and local functional group information. The benefit of pretraining is particularly pronounced in data-limited settings, where fine-tuning models initialized from pretrained representations leads to substantially better performance compared to training from scratch. We also demonstrated successful generalization to the QM7 dataset, even with a training set comprising only 500 molecules, further highlighting the versatility of the proposed method.

Beyond strong performance in small-data settings, MET offers distinct advantages for tasks that require a unified latent representation across molecular properties. While property-specific models are optimized for a single target, MET learns a chemically meaningful embedding space through charge-supervised pretraining. This embedding reflects both global electronic features and local structural motifs, enabling consistent organization of molecules by dipole moment and functional groups. Such a representation can be reused across diverse objectives—property prediction, optimization, or generative design—without retraining separate models. In this way, MET serves as a general-purpose molecular encoder that supports cross-task transfer, interpretable learning, and multi-objective workflows, even when property-specific data is scarce or unavailable.

Despite these strengths, the current framework has certain limitations. It has not yet been tested on larger molecules or materials systems, and its performance on experimental datasets such as those involving solubility or pharmacokinetics requires further investigation. Future efforts should focus on extending the model to incorporate multi-conformational inputs, improving atomic coordinate generation, and integrating hybrid learning strategies that combine supervised learning with self-supervised or physically constrained objectives.

In conclusion, this study provides a physically informed machine learning strategy for molecular property prediction. By grounding the learning process in atomic charges and enforcing spatial symmetries, the proposed model achieves both accuracy and generalizability. This work establishes a foundation for future developments in physics-aware molecular modeling, with promising implications for applications in drug discovery, materials design, and chemical informatics.

Conflicts of interest

There are no conflicts to declare.

Data availability

All code and preprocessed data are available in a public repository: https://github.com/mint258/MET.

Supplementary information (SI): the SI (Fig. S1) summarizes the chemical composition and coverage of the QM9 (pretraining) and QM7 (evaluation) datasets, including representative molecules, heavy-atom counts by element, molecular-weight distributions, and functional-group frequency charts, showing that both datasets comprise small neutral organic molecules with similar elemental makeup and a broad variety of common functional motifs. See DOI: https://doi.org/10.1039/d5me00173k.

Acknowledgements

This research has been supported by the Major Research Plan of the National Natural Science Foundation of China (92372104), Guangdong Science and Technology Department (2023QN10C439), Guangdong Basic and Applied Basic Research Foundation (2022A1515110016, 2022B1515020013 and 2024B1515040023), Guangzhou Municipal Science and Technology Bureau (2023A04J1364), the 111 Project (B18023), Fundamental Research Funds for the Central Universities (2024ZYGXZR043). This work is partially supported by High Performance Computing Platform of South China University of Technology.

References

  1. F. A. Faber, et al., Machine learning prediction errors better than DFT accuracy, arXiv, 2017, preprint, arXiv:1702.05532, DOI: 10.48550/arXiv.1702.05532.
  2. K. Schütt, et al., SchNet: a continuous-filter convolutional neural network for modeling quantum interactions, Adv. Neural Inf. Process. Syst., 2017, 30, 991–1001.
  3. J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals and G. E. Dahl, Neural message passing for quantum chemistry, Proceedings of the 34th International Conference on Machine Learning (ICML), Proc. Mach. Learn. Res., 2017, 70, 1263–1272.
  4. V. Garcia Satorras, E. Hoogeboom and M. Welling, E(n) equivariant graph neural networks, arXiv, 2021, preprint, arXiv:2102.09844, DOI: 10.48550/arXiv.2102.09844.
  5. F. B. Fuchs, D. E. Worrall, V. Fischer and M. Welling, SE(3)-Transformers: 3D roto-translation equivariant attention networks, arXiv, 2020, preprint, arXiv:2006.10503, DOI: 10.48550/arXiv.2006.10503.
  6. J. Gasteiger, J. Groß and S. Günnemann, Directional message passing for molecular graphs, arXiv, 2020, preprint, arXiv:2003.03123, DOI: 10.48550/arXiv.2003.03123.
  7. Y. Wang, J. Wang, Z. Cao and A. Barati Farimani, Molecular contrastive learning of representations via graph neural networks, Nat. Mach. Intell., 2022, 1–9.
  8. W. Ahmad, E. Simon, S. Chithrananda, G. Grand and B. Ramsundar, ChemBERTa-2: towards chemical foundation models, arXiv, 2022, preprint, arXiv:2209.01712, DOI: 10.48550/arXiv.2209.01712.
  9. Y. Fang, Q. Zhang, H. Yang, X. Zhuang, S. Deng, W. Zhang, M. Qin, Z. Chen, X. Fan and H. Chen, Molecular contrastive learning with chemical element knowledge graph, Proc. AAAI Conf. Artif. Intell., 2022, 36(4), 3968–3976.
  10. C. Li, et al., Applying pretrained molecular representation for reaction prediction and efficient catalyst screening, Cell Rep. Phys. Sci., 2025, 6(11), 102931.
  11. J. Gasteiger, S. Giri, J. T. Margraf and S. Günnemann, Fast and uncertainty-aware directional message passing for non-equilibrium molecules, arXiv, 2020, preprint, arXiv:2011.14115, DOI: 10.48550/arXiv.2011.14115.
  12. J. Gasteiger, J. Groß and S. Günnemann, GemNet: universal directional graph neural networks for molecules, NeurIPS, 2021; arXiv, preprint, arXiv:2106.08903, DOI: 10.48550/arXiv.2106.08903.
  13. J. Hermann, Z. Schätzle and F. Noé, Deep-neural-network solution of the electronic Schrödinger equation, Nat. Chem., 2020, 12, 891–897.
  14. L. Wang, Y. Liu, Y. Lin, H. Liu and S. Ji, ComENet: towards complete and efficient message passing for 3D molecular graphs, Adv. Neural Inf. Process. Syst., 2022, 35, 650–664.
  15. R. Winter, M. Bertolini, T. Le, F. Noé and D.-A. Clevert, Unsupervised learning of group invariant and equivariant representations, Adv. Neural Inf. Process. Syst., 2022, 35, 31942–31956.
  16. Z. Liu, L. He, Y. Liu and S. Ji, Spherical message passing for 3D molecular graphs, arXiv, 2022, preprint, arXiv:2102.05013, DOI: 10.48550/arXiv.2102.05013.
  17. Ł. Maziarka, et al., Molecule attention transformer, arXiv, 2020, preprint, arXiv:2002.08264, DOI: 10.48550/arXiv.2002.08264.
  18. C. Ying, T. Cai, S. Luo, S. Zheng, G. Ke, D. He, Y. Shen and T.-Y. Liu, Do transformers really perform bad for graph representation?, arXiv, 2021, preprint, arXiv:2106.05234, DOI: 10.48550/arXiv.2106.05234.
  19. A. Vaswani, et al., Attention is all you need, Adv. Neural Inf. Process. Syst., 2017, 30, 5998–6008.
  20. R. Ramakrishnan, P. O. Dral, M. Rupp and O. A. von Lilienfeld, Quantum chemistry structures and properties of 134 kilo molecules, Sci. Data, 2014, 1, 140022.
  21. R. S. Mulliken, Electronic population analysis on LCAO–MO molecular wave functions. I, J. Chem. Phys., 1955, 23, 1833–1840.
  22. O. T. Unke and M. Meuwly, PhysNet: a neural network for predicting energies, forces, dipole moments, and partial charges, J. Chem. Theory Comput., 2019, 15(6), 3678–3693.
  23. O. T. Unke, S. Chmiela, M. Gastegger, et al., SpookyNet: learning force fields with electronic degrees of freedom and nonlocal effects, Nat. Commun., 2021, 12, 7273.
  24. P. BehnamGhader, H. Zakerinia and M. Soleymani Baghshah, MG-BERT: multi-graph augmented BERT for masked language modeling, Proceedings of the Fifteenth Workshop on Graph-Based Methods for Natural Language Processing (TextGraphs-15), 2021, pp. 125–131, DOI: 10.18653/v1/2021.textgraphs-1.12.
  25. S. Lu, Z. Gao, D. He, L. Zhang and G. Ke, Highly accurate quantum chemical property prediction with Uni-Mol+, arXiv, 2023, preprint, arXiv:2303.16982, DOI: 10.48550/arXiv.2303.16982.
  26. Z. Wu, et al., MoleculeNet: a benchmark for molecular machine learning, Chem. Sci., 2018, 9, 513–530.
  27. M. Rupp, A. Tkatchenko, K.-R. Müller and O. A. von Lilienfeld, Fast and accurate modeling of molecular atomization energies with machine learning, Phys. Rev. Lett., 2012, 108, 058301.
  28. G. Zhou, et al., Uni-Mol: a universal 3D molecular representation learning framework, ChemRxiv, 2023, preprint, DOI: 10.26434/chemrxiv-2022-jjm0j-v4.
  29. L. van der Maaten and G. Hinton, Visualizing data using t-SNE, J. Mach. Learn. Res., 2008, 9, 2579–2605.
  30. T. Wang and P. Isola, Understanding contrastive representation learning through alignment and uniformity on the hypersphere, arXiv, 2020, preprint, arXiv:2005.10242, DOI: 10.48550/arXiv.2005.10242.

This journal is © The Royal Society of Chemistry 2026