Zixiao Yang ab, Hanyu Gao c and Xian Kong *ab
aSouth China Advanced Institute for Soft Matter Science and Technology, School of Emergent Soft Matter, South China University of Technology, Guangzhou, China. E-mail: xk@scut.edu.cn
bGuangdong Provincial Key Laboratory of Functional and Intelligent Hybrid Materials and Devices, South China University of Technology, Guangzhou, China
cDepartment of Chemical and Biological Engineering, Hong Kong University of Science and Technology, Hong Kong, SAR, China
First published on 24th December 2025
Understanding and predicting molecular properties remains a central challenge in scientific machine learning, especially when training data are limited or task-specific supervision is scarce. We introduce the molecular equivariant transformer (MET), a symmetry-aware pretraining framework that leverages quantum-derived atomic charge distributions to guide molecular representation learning. MET combines an equivariant graph neural network (EGNN) with a transformer architecture to extract physically meaningful features from three-dimensional molecular geometries. Unlike previous models that rely purely on structural inputs or handcrafted descriptors, MET is pretrained to predict atomic partial charges, which are quantities grounded in quantum chemistry. This enables MET to capture essential electronic information without requiring downstream labels. We show that this pretraining scheme improves performance across diverse molecular property prediction tasks, particularly in low-data regimes. Analyses of the learned representations reveal chemically interpretable structure–property relationships, including the emergence of functional group patterns and smooth alignment with molecular dipoles. Ablation studies confirm that the EGNN encoder plays a crucial role in capturing transferable spatial features, while the transformer layers adapt these features to specific prediction tasks. This architecture draws direct analogies to quantum mechanical basis transformations, where MET learns to transition from coordinate-based to electron-based representations in a symmetry-preserving manner. By integrating domain knowledge with modern deep learning techniques, MET offers a unified and interpretable framework for data-efficient molecular modeling, with broad applications in computational chemistry, drug discovery, and materials science.
Design, System, Application
We propose a molecular design/optimization strategy that learns symmetry-preserving 3D representations by pretraining an equivariant GNN + transformer to predict quantum-derived atomic partial charges. Aligning the latent space with electronic distributions yields physically grounded, transferable descriptors that require minimal labeled data across downstream tasks. The desired system functionality is an interpretable, data-efficient property predictor that (i) preserves rotation/translation equivariance, (ii) captures long-range interactions through attention, and (iii) remains lightweight for routine screening. Key design constraints include access to reasonable 3D coordinates/conformers and charge labels for pretraining, with current validation focused on small organic molecules; the architecture is modular to incorporate multi-conformer inputs, alternative charge schemes, and task-specific heads. Immediate applications include rapid screening of orbital energies, dipoles, total energies, and reactive sites, as well as uncertainty-aware active learning in low-data regimes. Longer-term, the charge-aware latent space can seed closed-loop generative design, accelerate lead optimization and materials discovery, and integrate with physics-guided fine-tuning to extend to larger molecules and condensed-phase systems.
Recent progress in deep learning has opened new avenues for molecular property prediction, whose success hinges on the quality of molecular representations. Traditional encodings—such as SMILES strings and 2D molecular graphs—fail to capture the full complexity of molecular behavior, particularly when three-dimensional (3D) structure plays a critical role. Properties such as dipole moments, reactivity, and orbital energies depend sensitively on the spatial arrangement of atoms and electrons, motivating the development of 3D-aware representations. Graph neural networks (GNNs), including SchNet2 and MPNN,3 have advanced this goal by learning structural features from molecular graphs. However, many models neglect a key physical principle: equivariance under rotation and translation.4,5 Without this symmetry, predictions may vary under rigid transformations of the input, limiting generalizability in tasks involving stereochemistry or conformational flexibility.6 Moreover, most molecular machine learning models rely on unsupervised or self-supervised training due to the scarcity of labeled data. Models such as MolCLR7 and ChemBERTa8 extract general-purpose embeddings from large unlabeled datasets. However, these approaches often bypass a fundamental determinant of molecular properties: the electronic density. Direct supervision using physically meaningful quantities could yield more informative and compact representations, especially in data-limited scenarios. Recent work has injected chemical knowledge into contrastive learning frameworks—such as constructing element- and functional-group-based knowledge graphs—achieving consistent gains across multiple datasets and offering a promising direction for parallel enhancements. Although this method achieves high performance on downstream tasks, it suffers from a lack of accurate data, which makes large-scale pretraining difficult.9
In many practical molecular design and screening problems, the target properties are experimentally measured quantities such as reaction yields, catalytic activities, and solubilities in complex environments. These properties often have limited and noisy datasets—typically only a few hundred to a few thousand reliable labels—and no large corpus exists for property-specific pretraining. As a result, self-supervised approaches alone may not provide sufficiently informative representations, and property-specific models pretrained on massive quantum datasets are not available for most practical observables. In such cases, it is advantageous to pretrain a single encoder on a tractable quantum-derived quantity that captures essential electronic structure, and then fine-tune that encoder across diverse downstream tasks. Recent studies have shown that this cross-task transfer within the same small-molecule domain can substantially improve stability and accuracy in low-data regimes.10
In this work, we propose a streamlined and physically motivated framework that integrates two complementary strategies: symmetry-aware message passing via an equivariant graph neural network (EGNN)11,12 and global representation learning through a transformer architecture. Crucially, we introduce atomic partial charges—readily obtainable proxies for electron density—as supervised targets during pretraining. This charge-guided supervision grounds representation learning in quantum mechanics, resulting in descriptors that are compact, transferable, and data-efficient. Compared to state-of-the-art models that are architecturally complex and data-intensive, our method is intentionally lightweight and accessible, making it feasible for academic laboratories with modest computational resources. By embedding physical priors directly into the model design and training strategy, we offer a more interpretable and scalable approach to accurate molecular property prediction. Compared with deep wavefunction approaches that directly solve the electronic Schrödinger equation,13 our work focuses on data-driven supervised property prediction, offering a more practical trade-off between accuracy and computational cost for large-scale screening.
In summary, our approach bridges the gap between deep learning and first-principles chemistry by incorporating symmetry, geometry, and electron density into a unified molecular representation. This enables robust performance with limited data and presents a promising direction for the next generation of molecular machine learning models.
The initial inputs to the network are the indices of atoms in each molecule, which are embedded into a latent space as atomic embeddings, as shown in Fig. 1(a). EGNN then combines atomic embeddings with positional information and the embeddings themselves, whereas the transformer combines embeddings only with themselves. After the transformer, the atomic embeddings in the latent space are pooled to a single dimension, corresponding to the charges on each atom. Once the pretraining phase is complete, the charge pooling module is discarded, and an additional transformer module is introduced to process the molecular representations in the latent space for subsequent tasks. Below, we describe the network's operation and information processing in detail.
| z_i = σ(W_0 x_i^(l)), | (1) |
[Equations (2) and (3), which define the two message branches m_i^(1) and m_i^(2), were rendered as images in the original and are not recoverable here.]
The outputs of the two branches are then concatenated and fused via a linear mapping, with a residual connection added from the initial transformed node feature:
| h_i = W_cat[m_i^(1) ∥ m_i^(2)] + z_i. |
The fused feature is then normalized, ĥ_i = GraphNorm(h_i), and the updated node embedding for the next layer is obtained by

| x_i^(l+1) = W_final ĥ_i. | (4) |
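A minimal PyTorch sketch of one such message-passing layer may clarify the data flow. Since the branch equations (2) and (3) survive only as image placeholders, the two branch MLPs below (`phi1`, distance-aware; `phi2`, feature-only) are assumptions in the spirit of standard EGNN messages, and all module names (`W0`, `W_cat`, `W_final`) are illustrative:

```python
import torch
import torch.nn as nn

class EGNNLayer(nn.Module):
    """Sketch of one layer following eqn (1)-(4); the two branch MLPs are
    assumed forms standing in for the unrecoverable eqns (2) and (3)."""
    def __init__(self, dim):
        super().__init__()
        self.W0 = nn.Linear(dim, dim)                                      # eqn (1)
        self.phi1 = nn.Sequential(nn.Linear(2 * dim + 1, dim), nn.SiLU())  # distance-aware branch
        self.phi2 = nn.Sequential(nn.Linear(2 * dim, dim), nn.SiLU())      # feature-only branch
        self.W_cat = nn.Linear(2 * dim, dim)                               # fuse the two branches
        self.norm = nn.LayerNorm(dim)                                      # stand-in for GraphNorm
        self.W_final = nn.Linear(dim, dim)                                 # eqn (4)

    def forward(self, x, pos, edge_index):
        src, dst = edge_index                                  # messages flow j (src) -> i (dst)
        z = torch.nn.functional.silu(self.W0(x))               # sigma taken as SiLU here
        d2 = ((pos[src] - pos[dst]) ** 2).sum(-1, keepdim=True)
        m1 = self.phi1(torch.cat([z[dst], z[src], d2], dim=-1))
        m2 = self.phi2(torch.cat([z[dst], z[src]], dim=-1))
        agg1 = torch.zeros_like(z).index_add_(0, dst, m1)      # sum messages at each receiver
        agg2 = torch.zeros_like(z).index_add_(0, dst, m2)
        h = self.W_cat(torch.cat([agg1, agg2], dim=-1)) + z    # concat, fuse, residual from z_i
        return self.W_final(self.norm(h))
```

Note that only squared interatomic distances enter the update, which keeps the node features invariant under rigid rotations and translations of `pos`.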
After a series of such message-passing layers, the final node features are encoded into a latent representation via an additional linear transformation. These enriched features, which encapsulate the molecular geometric and chemical information, are then further processed by a transformer module.
In the transformer module, the query (Q), key (K), and value (V) matrices are computed by applying linear transformations to the input latent representations z: Q^(l) = W_Q^(l) z^(l), K^(l) = W_K^(l) z^(l), and V^(l) = W_V^(l) z^(l), where W_Q^(l), W_K^(l), and W_V^(l) are learnable weight matrices specific to the query, key, and value transformations in layer l, each of dimension d × d (where d is the dimension of the atomic latent space).19 These transformations project the input embeddings into different subspaces, enabling the model to compute attention scores that determine the relevance of each atom's information with respect to others in the molecular graph.
The self-attention mechanism computes a weighted sum of the value vectors V based on the compatibility between query Q and key K vectors,

| Attention(Q, K, V) = softmax(QK^T / √d) V, | (5) |

where the division by √d scales the dot products to prevent excessively large values. This mechanism allows the transformer to focus on relevant parts of the molecular graph, effectively capturing global interactions and dependencies among atoms.
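The attention step above can be sketched directly; the function and variable names are illustrative, and the weight matrices are d × d as in the text:

```python
import torch

def self_attention(z, W_Q, W_K, W_V):
    """Scaled dot-product attention over atom embeddings z (n_atoms x d),
    matching eqn (5)."""
    Q, K, V = z @ W_Q, z @ W_K, z @ W_V
    d = Q.shape[-1]
    weights = torch.softmax(Q @ K.T / d ** 0.5, dim=-1)  # each row sums to 1
    return weights @ V                                   # weighted sum of values
```

Because every atom attends to every other atom, this step captures global dependencies that local message passing alone would need many layers to propagate.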
For downstream applications predicting other chemical properties, the output architecture is adapted to match the task granularity. The pretrained model's charge-pooling module is removed, decoupling it from the core representation. A task-specific output module is then introduced to process the latent-space representations. For downstream tasks such as HOMO/LUMO energy prediction, the pretrained model was adapted by replacing its pooling layers with task-specific MLP heads. Specifically, the output z from the EGNN backbone was processed as ŷ = Pooling(MLP_task(z)), where MLP_task transforms the node-level representations and the pooling step aggregates them into a molecule-level embedding. Fine-tuning was performed with a frozen EGNN backbone using the same optimization procedure as pretraining; the rationale for freezing the EGNN backbone is discussed in section 3.3.
This module flexibly generates either: 1) atomic-level properties (e.g., reactive site identification) by transforming embeddings through dedicated decoders, or 2) molecular-level properties (e.g., HOMO/LUMO energies, reaction propensity) by aggregating atomic features (via pooling or attention) followed by prediction heads. This adaptive design maintains the hierarchical feature synergy from the EGNN and transformer while enabling diverse chemical predictions.
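A minimal sketch of such a molecule-level head, assuming mean pooling over atoms grouped by a per-atom molecule index (`batch`); the class and variable names are hypothetical:

```python
import torch
import torch.nn as nn

class PropertyHead(nn.Module):
    """Task head implementing y = Pooling(MLP_task(z)) with mean pooling."""
    def __init__(self, dim, out_dim=1):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.SiLU(), nn.Linear(dim, out_dim))

    def forward(self, z, batch):
        y = self.mlp(z)                                         # node-level transform
        n_mol = int(batch.max()) + 1
        total = torch.zeros(n_mol, y.shape[-1]).index_add_(0, batch, y)
        count = torch.zeros(n_mol, 1).index_add_(0, batch, torch.ones(len(batch), 1))
        return total / count                                    # mean pool per molecule

# Freezing the pretrained backbone before fine-tuning (backbone name illustrative):
# for p in egnn_backbone.parameters():
#     p.requires_grad = False
```

For atomic-level targets (e.g. reactive sites), the pooling step is simply omitted and the MLP output is read per atom.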
The QM9 dataset comprises 133 885 small organic molecules with up to 9 heavy atoms. We selected molecules with complete electronic structure data, including atomic Mulliken population partial charges21 (B3LYP/6-31G(d) level), HOMO/LUMO energies, and molecular dipole moments. A more detailed summary of the chemical composition of QM9 and QM7 is provided in Fig. S1, which shows representative structures, elemental distributions, molecular-weight histograms, and functional-group statistics. These analyses highlight that both datasets consist of small neutral organic molecules with similar elemental makeup and a broad variety of common functional motifs, making QM9 a chemically well-matched pretraining source for downstream evaluation on QM7. The dataset was randomly split into training and test sets using an 80% : 20% ratio.
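The random 80% : 20% split can be sketched as follows (the seed and function name are illustrative, not from the paper):

```python
import numpy as np

def train_test_split_indices(n, train_frac=0.8, seed=0):
    """Random index split into train/test subsets, as used for QM9 here."""
    idx = np.random.default_rng(seed).permutation(n)  # shuffle molecule indices
    n_train = int(train_frac * n)
    return idx[:n_train], idx[n_train:]
```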
777 molecules), where the average number of neighbours per atom saturates beyond 8 Å (Table 1). From a chemical perspective, covalent bond lengths typically range between 0.7 and 2.5 Å. Thus, a cutoff of 8 Å captures interactions up to three times longer than a typical bond, encompassing both direct and medium-range structural correlations, while avoiding an excessive number of long-range edges that contribute little to local message passing. Geometries were taken from the QM9 and QM7 datasets and processed using a custom PyTorch-based data loader.
| r cut (Å) | Avg. neighbours/atom | Avg. neighbours/molecule |
|---|---|---|
| 2.0 | 2.63 | 47.32 |
| 3.0 | 8.21 | 147.81 |
| 4.0 | 12.49 | 224.83 |
| 6.0 | 16.86 | 303.39 |
| 8.0 | 17.44 | 313.87 |
| 10.0 | 17.48 | 314.52 |
| 12.0 | 17.48 | 314.55 |
| 15.0 | 17.48 | 314.55 |
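The neighbour statistics in Table 1 come down to counting atom pairs within the cutoff; a minimal sketch for a single molecule (function name illustrative):

```python
import numpy as np

def avg_neighbours_per_atom(coords, r_cut):
    """Average number of neighbours within r_cut (Å) for one molecule;
    coords is an (n_atoms x 3) array of Cartesian positions."""
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    neighbours = (d > 0) & (d < r_cut)   # exclude self-pairs
    return neighbours.sum(axis=1).mean()
```

Averaging this quantity over all molecules at each cutoff reproduces the saturation behaviour seen beyond 8 Å.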
…_i and the reference values q_i:

| Dataset | Unit | D-MPNN | Attentive FP | N-Gram_RF | N-Gram_XGB | Pretrain GNN | GEM | Uni-Mol | MET |
|---|---|---|---|---|---|---|---|---|---|
| QM9 | Hartree | 0.00814 | 0.00812 | 0.01037 | 0.00964 | 0.00922 | 0.00746 | 0.00467 | 0.00344 ± 0.00006 |
| QM7 | kcal mol−1 | 103.5 | 72.0 | 92.8 | 81.9 | 113.2 | 58.9 | 41.8 | 35.3 ± 5.8 |

Note: error bars represent the standard deviation across three runs with different random seeds. The evaluation metric for QM7 is total energy, and for QM9 it covers HOMO, LUMO, and energy gap. MET outperforms the baseline models on all tasks.
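The metric reported in the table is the root-mean-square error, computed in the usual way:

```python
import numpy as np

def rmse(pred, ref):
    """Root-mean-square error between predicted and reference values."""
    pred, ref = np.asarray(pred, float), np.asarray(ref, float)
    return float(np.sqrt(np.mean((pred - ref) ** 2)))
```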
On QM9, our primary dataset comprising approximately 134 000 small organic molecules, MET was pretrained using atomic partial charges as supervised labels. This physically motivated pretraining strategy enabled efficient transfer learning for downstream property prediction tasks, notably HOMO, LUMO, and energy gap estimation. As shown in Table 2, MET achieved a remarkably low RMSE of 0.00344 hartree across the three tasks, significantly surpassing the baseline models. This highlights MET's effective utilization of charge-based representations, even for tasks beyond its direct pretraining objective.
Despite the differences in both molecular composition and quantum-chemical protocol compared with QM9, MET achieved an RMSE of 35.3 ± 5.8 kcal mol−1 on QM7 (Table 1), improving upon the best Uni-Mol baseline (41.8 kcal mol−1) by about 16%. To put this error into context, the reference energies in QM7 span a range from −2192.0 to −404.9 kcal mol−1 with a standard deviation of 223.9 kcal mol−1; the MET RMSE therefore corresponds to roughly 2% of the total energy range and about 0.16 times the dataset standard deviation. This accuracy is not yet sufficient to replace high-level DFT calculations when chemically precise absolute energies are required, but it is adequate for coarse ranking and pre-screening and demonstrates that the charge-supervised representation transfers reliably to a distinct dataset computed at a different level of theory.
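The two quoted ratios follow directly from the numbers in the text:

```python
rmse_qm7 = 35.3                       # kcal/mol, from Table 2
e_min, e_max, sigma = -2192.0, -404.9, 223.9
energy_range = e_max - e_min          # 1787.1 kcal/mol
print(round(rmse_qm7 / energy_range, 3))  # 0.02  -> ~2% of the range
print(round(rmse_qm7 / sigma, 2))         # 0.16  -> ~0.16 sigma
```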
Collectively, these benchmarking results underscore MET's exceptional capability for predicting molecular properties by leveraging physically meaningful atomic charges and symmetry-aware architectures. MET effectively reduces data demands and computational overhead through transfer learning and fine-tuning strategies, establishing a powerful yet computationally accessible framework for molecular property prediction in computational chemistry.
000 molecules). This superior performance in data-scarce regimes highlights that pretraining effectively incorporates physically informed inductive biases, enhancing robustness and mitigating overfitting. However, when dataset sizes surpassed roughly 20 000 molecules, the benefits of pretraining diminished as models trained from scratch gradually achieved comparable performance. This shift underscores that sufficiently large datasets enable end-to-end trained models to directly capture task-specific molecular features without explicit inductive priors.
To further probe the generalization capability and data efficiency imparted by MET's pretraining, we conducted an additional evaluation using the QM7 dataset, predicting molecular conformational energies computed at the B3LYP/6-31G* level. To intentionally simulate a data-limited scenario, we randomly selected only 500 molecules from QM7 for training. Remarkably, the pretrained MET model attained a predictive performance of R2 = 0.548, surpassing the R2 = 0.336 achieved by the model trained from scratch (Fig. 2(c)). These results convincingly demonstrate that MET's pretraining strategy significantly enhances predictive reliability, even when transferring across distinct molecular datasets and properties, thus affirming the practical advantages of leveraging physically motivated pretraining in resource-constrained scenarios.
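The coefficient of determination used in this comparison is the standard R², computed as:

```python
import numpy as np

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = ((y_true - y_pred) ** 2).sum()
    ss_tot = ((y_true - y_true.mean()) ** 2).sum()
    return 1.0 - ss_res / ss_tot
```

An R² of 0 corresponds to predicting the dataset mean, so the pretrained model's 0.548 versus 0.336 from scratch is a substantial gain at 500 training molecules.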
For the pretraining task of atomic charge prediction, predictive performance remained robust and exhibited minimal sensitivity to changes in embedding dimensionality. This insensitivity likely arises because atomic charges are predominantly influenced by local electronic environments and fundamental chemical features, which can be effectively captured by relatively compact representations even at lower dimensions. Consequently, additional embedding dimensions offer negligible incremental information for this inherently local and chemically constrained prediction task.
In contrast, downstream dipole moment predictions exhibited substantial sensitivity to embedding dimensionality. Predictive performance deteriorated sharply when the embedding dimension dropped below 64 and improved progressively with increasing dimensionality, reaching a plateau at 128 dimensions. Notably, similar trends were consistently observed across datasets of different sizes (1000, 5000, and 100 000 molecules). Although absolute predictive performance improved with increasing dataset size, the saturation at 128 dimensions remained consistent, indicating a fundamental representational capacity limit independent of dataset size. This suggests that the complexity of global molecular properties such as dipole moments requires richer representations capable of capturing nuanced interatomic interactions and spatial correlations, which are effectively encoded within higher-dimensional embeddings.
Importantly, these dimensional analyses provide deeper insights into MET's representational capabilities and limitations. While local chemical properties (atomic charges) benefit less from extensive embedding dimensionality, global electronic properties require more expressive embeddings to fully capture the complex, long-range correlations within molecular structures. The saturation observed at 128 dimensions suggests an optimal trade-off, balancing model expressivity with computational efficiency and preventing parameter redundancy. Accordingly, 128-dimensional embeddings were employed throughout subsequent studies, ensuring robust and computationally efficient model performance.
These results emphasize the complementary roles of pretraining, symmetry-aware EGNN backbones, and embedding dimensionality in determining MET's performance. Pretraining on quantum-derived partial charges provides inductive priors that mitigate overfitting in low-data regimes, while EGNN layers deliver symmetry-preserving representations that generalize across molecular conformations and computational conditions. Appropriate control of embedding dimensionality further balances capacity and stability, ensuring that MET performs robustly across diverse molecular property prediction tasks. These insights facilitate informed architectural decisions and highlight key theoretical considerations underpinning successful molecular representation learning strategies.
Fig. 3 illustrates the evolution of the model's predictive capability, quantified by the validation R2 scores, alongside the corresponding number of trainable parameters as additional layers are progressively unfrozen. Starting from the left, only the final transformer layers were trainable, representing minimal adaptation to the downstream task. This configuration exhibited the lowest predictive performance, highlighting that fine-tuning exclusively at the transformer level is insufficient to achieve high-quality predictions. Predictive performance significantly improved as more layers became trainable, initially encompassing the transformer, encoder, and linear layers. This observed improvement clearly indicates that while the pretrained layers encode useful generic molecular features, their representations still require refinement and adaptation through task-specific fine-tuning to achieve optimal predictions.
Notably, the largest gains in performance emerged upon unfreezing layers up to, but not including, the equivariant graph neural network (EGNN) layers. Once the EGNN layers became trainable, the incremental improvement in prediction accuracy plateaued, suggesting that EGNN layers inherently capture core molecular features crucial for accurate property prediction. This result confirms that EGNNs effectively encode geometric and spatial invariances that generalize across molecular systems, making them particularly suitable for transfer learning scenarios involving small-scale fine-tuning datasets. In other words, EGNN layers, by virtue of their symmetry-aware inductive biases, substantially reduce the amount of data required for effective downstream fine-tuning.
Furthermore, fully unfreezing the embedding layers—effectively transforming the model into a scenario analogous to training from scratch—led to a notable degradation in predictive performance, with accuracy becoming even poorer than scenarios with only the transformer layers unfrozen. This performance decline is attributable to the substantial increase in trainable parameters overwhelming the limited dataset, resulting in severe overfitting. Consequently, these findings emphasize the necessity of pretraining, demonstrating that pretrained embeddings and EGNN features are vital for stabilizing model training and maintaining robust predictive capabilities, especially when fine-tuning on small datasets.
This ablation analysis reveals that the EGNN component serves as a fundamental encoder of chemically meaningful and transferable geometric representations, conferring data efficiency and generalization advantages in fine-tuning scenarios. Additionally, the detrimental effect of fully trainable embeddings underscores the critical role of pretraining in providing stable initializations, especially in data-scarce regimes. These insights provide valuable guidelines for future model development and practical deployment, underscoring how thoughtful selection of trainable parameters during fine-tuning can balance representational flexibility with model generalizability.
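The progressive-unfreezing protocol behind this ablation can be sketched with a small helper that freezes everything and then re-enables gradients only for named sub-modules; the prefix names are illustrative of how the MET sub-modules might be registered:

```python
import torch.nn as nn

def set_trainable(model, prefixes):
    """Freeze all parameters, then unfreeze those whose qualified names start
    with any of the given prefixes (e.g. ('transformer',) or
    ('transformer', 'encoder', 'egnn'))."""
    for name, p in model.named_parameters():
        p.requires_grad = any(name.startswith(pre) for pre in prefixes)
```

Sweeping the prefix set from the output side toward the embeddings, and recording validation R² at each step, reproduces the layer-wise analysis of Fig. 3.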
We first applied the dimensionality reduction technique t-distributed Stochastic Neighbor Embedding (t-SNE)29 to visualize MET-derived latent vectors from the QM9 dataset. As shown in Fig. 4(a), we plot the two t-SNE components, coloring each molecule by its dipole moment. Notably, the embeddings occupy the latent space uniformly, forming a roughly spherical distribution that maximizes representation efficiency. Importantly, molecules with different dipole magnitudes exhibit systematic spatial arrangements; specifically, decreases in the second t-SNE component correspond closely to reductions in molecular dipole values. This systematic relationship strongly suggests that the pretrained latent vectors inherently encode electronic distribution features, reflecting our strategy of using partial charges as supervised labels during pretraining.
To further probe the chemical interpretability of the embeddings, we visualized the presence or absence of specific functional groups in the latent space (Fig. 4(b) and (c)). As illustrative examples, we chose fluoro- and carboxy-containing molecules from the QM9 dataset. In both cases, clear separations between molecules bearing these functional groups and those without are apparent. It is noteworthy that similar clustering behaviors were consistently observed for other representative functional groups (e.g., hydroxy, cyano). These distinct and chemically meaningful clusters demonstrate a strong alignment between latent-space structure and functional-group chemistry, confirming that MET's embedding space accurately reflects structural and functional chemical information.
To systematically evaluate MET's ability to distinguish among chemically similar functional groups, we constructed a targeted test set of molecules with controlled functional variations. Specifically, we fixed an alkane backbone (C8H18) and systematically replaced hydrogen atoms with representative functional groups (fluoro, hydroxy, cyano, and carboxy), yielding 50 distinct molecules per group. The resulting latent space t-SNE analysis (Fig. 4(d)) reveals clear separation among these functional classes. Particularly, MET embeddings group fluoro- and hydroxy-substituted molecules closely together, aligning with established chemical knowledge that both groups exhibit strong electron-withdrawing inductive effects. Similarly, molecules bearing cyano and carboxy functionalities cluster adjacent to each other, consistent with their shared resonance-driven electron-withdrawing properties. This chemically meaningful embedding arrangement further validates MET's effective utilization of partial charge pretraining, reinforcing its robustness and interpretability.
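The 2D projections in Fig. 4 can be reproduced with scikit-learn's t-SNE; here `latents` is a hypothetical (n_molecules × 128) array of MET embeddings and the function name is illustrative:

```python
import numpy as np
from sklearn.manifold import TSNE

def project_2d(latents, seed=0):
    """Project high-dimensional embeddings to 2D for visualization."""
    perplexity = min(30, len(latents) - 1)  # perplexity must stay below n_samples
    return TSNE(n_components=2, perplexity=perplexity,
                random_state=seed).fit_transform(np.asarray(latents))
```

Coloring the resulting 2D points by dipole moment or by functional-group membership then yields plots analogous to Fig. 4(a)-(d).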
The alignment analysis highlights MET's capability to generate chemically informative and systematically organized molecular representations.30 By employing physically meaningful pretraining targets, MET achieves superior alignment performance, thus facilitating interpretability and enhancing its predictive utility in molecular property predictions.
The effectiveness of the model can be understood from the perspective of quantum chemistry. Molecular properties are governed by both atomic coordinates and the associated electronic distribution. In our approach, the EGNN acts as a geometric encoder that learns symmetry-consistent features from the spatial configuration of atoms. The transformer component further refines these features into representations that reflect underlying electronic properties. This two-step process resembles the transformation of a molecular wavefunction from real space into a basis that encodes chemically meaningful quantities, such as partial charges. Through this design, the model acquires a latent representation that simultaneously preserves spatial symmetries and captures essential electronic information, enabling robust transfer across diverse molecular tasks.
We have shown that the pretrained model performs well across a variety of downstream prediction tasks, including dipole moments and frontier orbital energies. Analyses of the embedding space revealed that it encodes both global molecular characteristics and local functional group information. The benefit of pretraining is particularly pronounced in data-limited settings, where fine-tuning models initialized from pretrained representations leads to substantially better performance compared to training from scratch. We also demonstrated successful generalization to the QM7 dataset, even with a training set comprising only 500 molecules, further highlighting the versatility of the proposed method.
Beyond strong performance in small-data settings, MET offers distinct advantages for tasks that require a unified latent representation across molecular properties. While property-specific models are optimized for a single target, MET learns a chemically meaningful embedding space through charge-supervised pretraining. This embedding reflects both global electronic features and local structural motifs, enabling consistent organization of molecules by dipole moment and functional groups. Such a representation can be reused across diverse objectives—property prediction, optimization, or generative design—without retraining separate models. In this way, MET serves as a general-purpose molecular encoder that supports cross-task transfer, interpretable learning, and multi-objective workflows, even when property-specific data is scarce or unavailable.
Despite these strengths, the current framework has certain limitations. It has not yet been tested on larger molecules or materials systems, and its performance on experimental datasets such as those involving solubility or pharmacokinetics requires further investigation. Future efforts should focus on extending the model to incorporate multi-conformational inputs, improving atomic coordinate generation, and integrating hybrid learning strategies that combine supervised learning with self-supervised or physically constrained objectives.
In conclusion, this study provides a physically informed machine learning strategy for molecular property prediction. By grounding the learning process in atomic charges and enforcing spatial symmetries, the proposed model achieves both accuracy and generalizability. This work establishes a foundation for future developments in physics-aware molecular modeling, with promising implications for applications in drug discovery, materials design, and chemical informatics.
Supplementary information (SI): the SI (Fig. S1) summarizes the chemical composition and coverage of the QM9 (pretraining) and QM7 (evaluation) datasets, including representative molecules, heavy-atom counts by element, molecular-weight distributions, and functional-group frequency charts, showing that both datasets comprise small neutral organic molecules with similar elemental makeup and a broad variety of common functional motifs. See DOI: https://doi.org/10.1039/d5me00173k.
This journal is © The Royal Society of Chemistry 2026