Mas Pieter Klein a, Irina Rudenko† *b, Evgeny A. Pidko† *a and Ivan Bushmarinov† *c
a Department of Chemical Engineering, TU Delft, Delft, Netherlands. E-mail: mpklein@tudelft.nl; E.A.Pidko@tudelft.nl
b Avride Inc., Austin, TX, USA. E-mail: irina.vl.rudenko@gmail.com
c Perplexity AI, Belgrade, Serbia. E-mail: ivan.bushmarinov@perplexity.ai
First published on 19th May 2026
Molecular properties of chemical compounds are governed not by a single unique arrangement of atoms (2D molecular graph) but by ensembles of three-dimensional conformers, yet most molecular representations for machine learning either ignore conformational diversity or use it only implicitly to augment molecular graphs. Here we introduce ConforFormer, a geometry-first foundation model capable of learning conformation-robust molecular embeddings directly from 3D atomic coordinates. By aligning representations across multiple conformers of the same molecule through a novel contrastive objective, ConforFormer produces compact, task-agnostic embeddings that can be generated once and directly applied to downstream tasks, including property prediction and structural similarity, without extensive fine-tuning. Across a range of quantum-chemical and bioactivity benchmarks, these frozen embeddings achieve competitive performance without task-specific fine-tuning, while offering improved stability on small datasets. Beyond property prediction, the learned embedding space allows molecular conformers and isomers to be discriminated with high precision, substantially outperforming classical fingerprint-based similarity measures. This implies that explicit exposure to conformational relationships induces representations that generalize beyond the conformer recognition task itself, capturing chemically meaningful structural constraints directly from 3D geometries. More broadly, our results suggest that incorporating conformation-awareness as a foundational learning task provides a fundamental route towards transferable, geometry-centered molecular representations that are particularly relevant for complex chemical systems, where conventional graph-based representations are ambiguous or ill-defined.
In chemistry, such models aim to learn transferable internal representations of the chemical space during pre-training that can be reused across a wide range of prediction tasks, reducing reliance on task-specific supervision. Importantly, the usefulness of a chemical foundation model is determined by the representations it learns during pre-training. Existing pre-trained chemical models are typically used to initialize weights for supervised prediction tasks, which are solved by fine-tuning the whole model for the objective.4–8 While this approach can achieve state-of-the-art performance on benchmarks, it often shows limited robustness on real-world chemical datasets, which in laboratory settings rarely exceed a few hundred experimentally measured points.9 This suggests that the common pre-training objectives do not always yield representations that are sufficiently stable or transferable under realistic chemical data constraints.
Most chemical foundation models still operate on simplified 2D representations of molecules, ignoring the conformational and configurational diversity that governs their real chemical behavior. In reality, each compound exists as an ensemble of 3D structures (conformers) whose distribution determines properties such as binding affinities, docking poses, and chemical reactivity.10–12 Typically, conformers differ from each other by rotations around single bonds, inversion of nitrogen lone pairs and other movements allowed by molecular flexibility. Conformers are distinct from isomers, which are also 3D geometries with the same composition, but one isomer cannot be produced from another without rearranging chemical bonds, i.e. without a chemical reaction. Capturing the distribution of 3D geometries accessible to a molecule is essential for property prediction, yet the explicit incorporation of conformational understanding as a foundational learning objective remains largely unexplored in current chemical foundation models.
From a chemical perspective, conformers of the same molecule represent distinct geometrical realizations of an equivalent chemical entity, making their alignment a natural target for contrastive learning that would enforce equivalence across conformational space. Contrastive learning has emerged as a powerful strategy to enhance foundation models and refine embeddings without explicit labels by regularizing the embedding space so that distance correlates with semantic similarity. By structuring the embedding space to bring similar objects closer while pushing dissimilar ones apart, models learn more informative, general-purpose representations. Methods developed at Amazon13,14 illustrate how contrastive approaches can refine embeddings across modalities, improving downstream task performance. A notable example is Microsoft E5,15 trained in a weakly supervised manner on naturally occurring document pairs such as questions and answers from forums.
To our knowledge, no chemical embedding model has incorporated conformational equivalence into its foundational learning objectives. Here we introduce ConforFormer, a foundation model that explicitly accounts for conformational diversity by aligning embeddings across multiple conformations of a molecule to produce compact, informative representations suitable for downstream tasks. In this work, we present (1) a new compact embedding for chemical structures, learnable from 3D geometries, (2) a novel contrastive learning process necessary to build it, (3) a benchmark evaluating the model's ability to distinguish pharmaceutically relevant molecules, and (4) the performance of the resulting embeddings on established chemical benchmarks.
Following ChemBERTa,19 Uni-Mol introduced a special CLS token to aggregate global molecular information. This token is assigned an “empty” atom type and placed at the geometric center of the molecule. It is processed alongside atomic tokens by the transformer but is excluded from the atom masking or distance prediction tasks. During downstream use, the embedding associated with the CLS token serves as a fixed-length representation of the entire molecule and is commonly passed to task-specific prediction heads.
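The placement of the CLS token can be illustrated with a minimal sketch. It only shows the geometric step described above (an extra token at the unweighted mean of the atomic coordinates); the `"[CLS]"` label and the `add_cls_token` helper are hypothetical names chosen for illustration and are not taken from the Uni-Mol code base.

```python
from typing import List, Tuple

# (element symbol, xyz coordinates in angstrom)
Atom = Tuple[str, Tuple[float, float, float]]

CLS_TYPE = "[CLS]"  # hypothetical label standing in for the "empty" atom type


def add_cls_token(atoms: List[Atom]) -> List[Atom]:
    """Prepend a CLS pseudo-atom placed at the geometric center of the molecule."""
    n = len(atoms)
    if n == 0:
        raise ValueError("molecule has no atoms")
    # Unweighted mean of the coordinates (geometric center, not center of mass).
    center = tuple(sum(xyz[k] for _, xyz in atoms) / n for k in range(3))
    return [(CLS_TYPE, center)] + list(atoms)


# Toy water-like geometry, made up for the example.
water = [("O", (0.0, 0.0, 0.0)), ("H", (0.96, 0.0, 0.0)), ("H", (-0.24, 0.93, 0.0))]
tokens = add_cls_token(water)
```

After the transformer has processed all tokens, the vector at position 0 would be read out as the molecule-level embedding.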
We should note that Uni-Mol does not treat a molecule as a single fixed geometry at downstream inference time. Within the standard Uni-Mol protocol, molecular property predictions are obtained by averaging the predictions over up to 10 conformers generated per molecule. The resulting conformer ensemble reduces sensitivity to any particular RDKit-generated geometry and partially accounts for conformational variability within the limited sampled set.
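The conformer-averaging step can be sketched as follows. This is an illustration of the idea only, assuming a per-conformer predictor is available as a plain function; `ensemble_predict` and `toy_model` are hypothetical names, and the toy predictor stands in for a real property head.

```python
from statistics import fmean
from typing import Callable, Sequence

Conformer = Sequence[Sequence[float]]  # one xyz row per atom


def ensemble_predict(
    predict_one: Callable[[Conformer], float],
    conformers: Sequence[Conformer],
    max_conformers: int = 10,
) -> float:
    """Average a per-conformer prediction over at most `max_conformers` geometries."""
    if not conformers:
        raise ValueError("at least one conformer is required")
    used = conformers[:max_conformers]
    return fmean(predict_one(c) for c in used)


def toy_model(conformer: Conformer) -> float:
    """Stand-in predictor: sum of x-coordinates (not a real property model)."""
    return sum(row[0] for row in conformer)


confs = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0)],
]
averaged = ensemble_predict(toy_model, confs)
```

Averaging acts on the predictions, not on the representation: each conformer is still embedded independently.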
Herein, we selected the original Uni-Mol as the backbone architecture, because it relies exclusively on the atom types and their 3D geometrical arrangements as input. This geometry-centric design makes Uni-Mol a suitable foundation for exploring learning objectives that explicitly account for the conformational diversity and fluxionality at the representation level.
A practical next step is to use the pre-trained model to produce a fixed representation of a data item (“embedding”) that can be reused across downstream tasks and for similarity search. General-purpose embeddings reduce the computational cost of solving regression and classification tasks, for example, when combined with graph-based matching algorithms such as HNSW.28 Chemistry applications range from data-scarce regimes (e.g. reactivity and selectivity datasets with hundreds to thousands of labeled examples) to data-rich regimes (e.g. large screening candidate libraries or long molecular dynamics trajectories containing billions of structures). In both settings, repeatedly fine-tuning large models end-to-end is often impractical. Herein, we therefore focus on building compact, task-agnostic molecular embeddings that remain useful across multiple downstream tasks.
Real molecules sample distributions of geometries within extended regions of a multidimensional potential energy surface (Fig. 1, left panel). For stable organic molecules in the ground state, the molecular graph (in other words, atom connectivity) is generally preserved across the accessible conformational ensemble. Distinct conformers correspond to different local minima with the same topology separated only by low energy barriers due to e.g. bond rotations or pyramidal inversions. Chemically distinct species correspond to regions separated by sufficiently high energy barriers and their interconversion involves a chemical reaction. Structural formulae label these regions, and for most ground-state organic compounds, they capture connectivity rather efficiently. However, they do not uniquely specify the molecular geometry and do not readily encode the intrinsic variability in the 3D configurational fluxionality that often determines the chemical and physical properties of interest.
Fig. 1 Schematic illustration of the ConforFormer framework: model architecture with pretraining objectives.
The notion of the chemical bond as a physical or conceptual entity has been the subject of long-standing discussion in the chemistry community.29–34 Although the more than century-old Lewis model35 provides an exceptionally successful language for chemical reasoning, bond assignment is ultimately a model-dependent interpretation of the underlying electronic structure. Bader's Quantum Theory of Atoms in Molecules36,37 provides an influential electron density-based framework for analyzing molecular structure and interatomic interactions. At the same time, discussions in the theoretical chemistry community have emphasized the role of Lewis structures and connectivity-based reasoning as conceptual scaffolds with remarkable practical resilience.31–33
In the present proof of concept, we deliberately focus on organic molecules, where graph- or string-based representations provide a sufficiently accurate and robust description. This makes the comparison conservative, because ConforFormer is evaluated in a regime where conventional graph-based methods are expected to work well. However, their limitations become apparent for more chemically complex systems, particularly organometallic and supramolecular compounds, where bonding patterns may be ambiguous, fluxional behavior is common, and chemically relevant distinctions often arise from “subtle” geometric rearrangements. The broader relevance of such systems should therefore be understood as motivation for future extensions rather than as a demonstrated application of the current workflow. Organometallic compounds with agostic, ηn-coordinated, or fluxional bonding illustrate this point. A 3D structure or structural ensemble provides a more direct description of such compounds, while the conventions for assigning and drawing individual bonds may be ambiguous.38 These challenges are especially pronounced for tasks such as predicting catalytic activity and selectivity, where 3D structure and conformational accessibility play a decisive role.39,40 We note that extending the present approach to such systems will require structural ensembles and chemically meaningful positive/negative labels from e.g. quantum-chemical sampling or molecular dynamics.
One may alternatively address these challenges by introducing richer molecular-graph encodings that better capture coordination, stereochemistry, and fluxional bonding. This is an important avenue of work. Here we explore a complementary approach, in which molecular graphs are not used as model input. This does not imply that molecular identity is defined without graph information during dataset construction. Instead, the model learns representations directly from atom types and 3D geometries. Graph information is only used during data preparation to associate conformers of the same molecule and construct contrastive training pairs. We show that the resulting representation retains discriminative properties typically associated with graph-based fingerprints, while being explicitly defined on structural ensembles.
Formally, let $\mathcal{M}$ denote the set of molecules in the training corpus and let $\mathcal{C}(m)$ denote the set of conformers associated with molecule $m \in \mathcal{M}$. During training, pairs of conformers sampled from the same molecule are treated as positive pairs, while conformers originating from different molecules form negative pairs. The learning objective enforces alignment of embeddings within each equivalence class $\mathcal{C}(m)$, while maintaining separation between different molecular identities. Such a formulation does not assume functional equivalence of individual conformers. Instead, it defines a representation space, in which molecular identity is stable with respect to geometric variability. The contrastive learning objective implementing this formulation is introduced in the next section.
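The pair labeling can be sketched in a few lines. The sketch assumes each conformer in a batch carries an identifier of its parent molecule, used only for labeling and never as model input; `positive_pairs` and the `mol_*` identifiers are hypothetical names for illustration.

```python
from itertools import combinations
from typing import List, Set, Tuple


def positive_pairs(mol_ids: List[str]) -> Set[Tuple[int, int]]:
    """Ordered index pairs (i, j), i != j, whose conformers share a parent molecule."""
    pairs: Set[Tuple[int, int]] = set()
    for i, j in combinations(range(len(mol_ids)), 2):
        if mol_ids[i] == mol_ids[j]:
            pairs.add((i, j))
            pairs.add((j, i))
    return pairs


# Three molecules with two conformers each; one label per conformer.
batch = ["mol_A", "mol_A", "mol_B", "mol_B", "mol_C", "mol_C"]
pos = positive_pairs(batch)

# Every other ordered pair (i, j) with i != j is a negative pair.
n_ordered = len(batch) * (len(batch) - 1)
n_negative = n_ordered - len(pos)
```

With two conformers per molecule, each conformer has exactly one positive partner and all remaining conformers in the batch act as negatives.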
The model is trained to distinguish pairs of conformers among various molecules. Importantly, molecular graph information is not provided as the input to the model and it is only used at the data generation stage to label positive and negative pairs. This design enforces alignment based exclusively on 3D geometries and atomic identity.
We employ the normalized temperature-scaled cross-entropy (NT-Xent) loss function41 to teach the model to put the embeddings of different 3D representations close in the embedding space. This objective explicitly regularizes the embedding space such that representations of distinct 3D realizations of the same molecular identity are brought closer together, while embeddings of different molecules remain separated.
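Let $\mathcal{X}$ denote the set of molecules and $f: \mathcal{X} \to \mathbb{R}^d$ an embedding function with embedding dimension $d = 512$. For vectors $u, v \in \mathbb{R}^d$, we define a cosine-style similarity $\mathrm{sim}(u, v) \in [0, 1]$. The [0,1] range is imposed by the embedding normalization. For molecules $x, x' \in \mathcal{X}$, we write $\mathrm{sim}(f(x), f(x'))$ and denote $z_i := f(x_i)$.
In each training batch, we sample $n = 128$ unique molecules and generate two distinct 3D representations (views) for each, yielding $2n$ embeddings $\{z_i\}_{i=1}^{2n}$ with index set $\mathcal{I} = \{1, \dots, 2n\}$. Let $\mathcal{P} \subset \mathcal{I} \times \mathcal{I}$ denote the set of ordered positive pairs, where $(i, j) \in \mathcal{P}$ if and only if $i \neq j$ and both indices correspond to two conformers of the same molecule. For each positive pair $(i, j) \in \mathcal{P}$, the NT-Xent loss is defined as:
$$\ell_{i,j} = -\log \frac{\exp\left(\mathrm{sim}(z_i, z_j)/\tau\right)}{\sum_{k \in \mathcal{I},\, k \neq i} \exp\left(\mathrm{sim}(z_i, z_k)/\tau\right)},$$
where $\tau > 0$ is the temperature.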
Higher values of τ reduce sensitivity to small embedding differences by flattening the softmax distribution. In all experiments reported here, we set τ = 0.07; additional temperature ablations are provided in the SI (Section E).
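A minimal, dependency-free sketch of the NT-Xent objective may help make the roles of the positive pair, the negatives and τ concrete. It is written under stated assumptions: plain cosine similarity stands in for the similarity used in this work, the batch loss is taken as the mean over all ordered positive pairs, and the function names and toy 2D embeddings are made up for illustration.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Plain cosine similarity (assumed stand-in for the paper's similarity)."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def nt_xent_pair(z: List[Sequence[float]], i: int, j: int, tau: float = 0.07) -> float:
    """Negative log-softmax of sim(z_i, z_j)/tau over all candidates k != i."""
    logits = [cosine(z[i], z[k]) / tau for k in range(len(z)) if k != i]
    m = max(logits)  # log-sum-exp shift for numerical stability
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denominator - cosine(z[i], z[j]) / tau


def nt_xent_batch(z: List[Sequence[float]], mol_ids: List[str], tau: float = 0.07) -> float:
    """Mean pair loss over ordered positive pairs (averaging is an assumption)."""
    pairs = [
        (i, j)
        for i in range(len(z))
        for j in range(len(z))
        if i != j and mol_ids[i] == mol_ids[j]
    ]
    return sum(nt_xent_pair(z, i, j, tau) for i, j in pairs) / len(pairs)


z = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
aligned = nt_xent_batch(z, ["A", "A", "B", "B"])     # conformers embedded close together
misaligned = nt_xent_batch(z, ["A", "B", "A", "B"])  # conformers embedded far apart
```

The loss is small when the two conformers of a molecule are each other's nearest neighbours and grows when a conformer of a different molecule is closer, which is the behaviour the objective rewards and penalizes, respectively.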
The contrastive loss is combined with the original Uni-Mol pre-training objectives to yield the total loss $\mathcal{L}_{\text{total}}$, in which $\mathcal{L}_{\text{token}}$ corresponds to the loss for masked token prediction, $\mathcal{L}_{\text{coord}}$ is the loss associated with the coordinates denoising task, and $\mathcal{L}_{\text{dist}}$ is the loss associated with the masked distance prediction. These objectives and their original batching protocol were introduced in ref. 4 and further detailed in ref. 22. Models trained with the additional contrastive objective are referred to as ConforFormer throughout the work.
Models trained with the additional contrastive objective on these datasets are referred to as ConforFormer-UniMol and ConforFormer-OMol, respectively. All model variants are trained using identical backbone architectures, embedding dimensions and optimization settings to ensure that observed differences arise solely from the pre-training objectives and data sources rather than architectural or procedural changes.
For downstream evaluation, we focus on the quality and stability of learned molecular embeddings. Unless explicitly stated otherwise, the Uni-Mol encoder parameters are frozen and representations are extracted from the CLS token. These frozen embeddings are then used as input to lightweight task-specific models or similarity analyses without additional fine-tuning of the backbone. This evaluation protocol follows standard transfer-learning practices for large pre-trained transformer models, as discussed in Section 2.3. Under this framework, task-specific evaluation is conducted using MoleculeNet. ROC-AUC is used as the metric for classification benchmarks, and root-mean-squared deviation (RMSD) for regression benchmarks. The model's predictions are compared against classes or true values, which are identical across all conformers. Reducing the number of conformers used when training task-specific models was found to have little effect on final performance across all benchmarks (Tables S5 and S6). This, along with other ablation studies, can be found in Section E of the SI.
As a reference, we also report the results for XGBoost45 trained on RDKit ECFP4 (1024-bit) Morgan fingerprints,46 which we identified in our screening as the strongest 2D baseline (denoted as the XGBoost ECFP4 (1024-bit) baseline). We additionally screened RDKit Morgan ECFP4 and ECFP6 fingerprints folded into 2048 and 16 384 bits, as well as Open Babel FP2, FP3, and FP4 fingerprints. This baseline predicts directly from the molecular graph via engineered local substructure features, whereas the Uni-Mol and ConforFormer embeddings are learned from 3D structures and then decoded by a lightweight predictor. Even without any end-to-end training aimed at obtaining a useful representation, the Uni-Mol backbone produces embeddings that are competitive with the fingerprint baseline on multiple tasks. Nevertheless, the ConforFormer contrastive loss further improves performance on the regression tasks (ConforFormer-UniMol, Fig. 3). Specifically, models trained with the conformer-alignment contrastive objective consistently yield higher-quality frozen embeddings than the Uni-Mol baselines trained without this objective. This effect is most pronounced for geometry-sensitive datasets, including QM8 and QM9, where contrastively trained models achieve markedly lower error than Uni-Mol.
The Uni-Mol dataset is substantially skewed towards organic compounds and utilizes low-quality RDKit MMFF geometries. This limitation can be addressed by training on the recently released OMol dataset,42 which is computed at the ωB97M-V/def2-TZVPD level and has conformation data for 8.2 M unique molecules (see SI, Section B for details). We therefore analyzed whether the improved quality of geometries alone or explicitly learning conformational relationships during pre-training could give rise to systematic improvements in frozen transfer. When trained on the OMol subset without the conformer-alignment contrastive objective (UniMol-OMol), performance remains broadly comparable to the Uni-Mol replicate under the frozen regime (see SI). In contrast, adding the conformer-alignment objective yields consistently better embeddings. ConforFormer-OMol performs best on 4 out of 6 quantum-chemical regression benchmarks (Fig. 3) and 5 out of 8 classification benchmarks (Fig. 4), while remaining below the frozen Uni-Mol replicate baseline on BACE and ClinTox. These two datasets have a low number of datapoints, and the best model for them in our setup was in fact the XGBoost ECFP4 (1024-bit) baseline. Importantly, ConforFormer-OMol shows a significant improvement over Uni-Mol frozen embeddings and ConforFormer-UniMol on the challenging MUV and HIV benchmarks, as well as much more stable performance than ConforFormer-UniMol, with no benchmark except QM9 yielding numbers significantly outside the literature range. This indicates that diverse pre-training data with higher-quality molecular geometries, available via the OMol dataset, is beneficial for the quality of the embeddings.
To ensure a controlled downstream comparison, we evaluated ConforFormer-OMol using the geometries from the MoleculeNet benchmark released by the Uni-Mol team (Table S4). This avoids conflating representation differences with differences in geometry generation pipelines at evaluation time. While higher-quality conformer generation could plausibly improve absolute metrics for all 3D-based approaches, a systematic study of geometry generation protocols is beyond the scope of the present work. We therefore emphasize representation-level differences under a consistent evaluation setup rather than maximizing absolute task performance. In this context, it is also expected that fully fine-tuned large models can achieve higher absolute performance on some tasks because end-to-end optimization can adapt the representation directly to downstream labels and better exploit task-specific supervision.
We also note that the Uni-Mol baseline already accounts for conformational variability at inference through prediction averaging over a conformer ensemble (up to 10 conformers per molecule). Thus, this does not contribute to the improvements observed for ConforFormer. Instead, the consistent gains under frozen evaluation are aligned with a representation-level mechanism. Conformer alignment during pre-training strengthens the representation itself and regularizes the embedding space, increasing robustness to geometric variability while remaining sensitive to conformational diversity. We believe that the ConforFormer-OMol representation captures some fluxional behavior beyond the 10 explicitly supplied conformers.
To summarize, the presented frozen evaluation demonstrates that conformer-aligned pre-training yields denser, more transferable 3D-derived molecular embeddings. Without task-specific fine-tuning, they provide competitive prediction of quantum-chemical properties and show strong performance on multiple pharmaceutically relevant classification benchmarks. This provides direct evidence that learning conformational relationships improves the chemistry captured by the representation, rather than simply improving downstream optimization.
To construct this benchmark, we used a portion of ZINC20 (ref. 47) not overlapping with the Uni-Mol or OMol datasets, selected subsets of isomeric molecules and pre-generated batches containing isomers and conformers for consistent evaluation. Specifically, each batch contained 128 unique molecules, which are all isomers of each other. Each isomer had exactly 2 conformers, resulting in 256 datapoints per batch. An 80/10/10 train/validation/test split was employed for the dataset so that the performance of models trained specifically on it could be evaluated; metrics below are all reported on the test split. Overall, PharmaIsomer contains 3 261 807 960 datapoints in 12 741 440 batches (see Section C of the SI for details). The dataset is freely available under CC-BY license.48
The dataset contains four types of molecular pairs: backbone isomers, where the molecules have a different bond order with the same composition (99.50% of all pairs); conformers (0.39%); optical isomers, where molecules are mirror images of each other (0.05%); and diastereomers, where the molecular topology is the same but the relative configuration of optical centers and/or double bonds is different (0.06%).
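The conformer share can be cross-checked from the batch layout with a short calculation. This is a sketch under one assumption not stated explicitly in the text: that pair-type shares are counted over unordered pairs of datapoints within a batch of 128 molecules with 2 conformers each. Under that assumption the arithmetic reproduces the reported 0.39%.

```python
from math import comb

molecules_per_batch = 128
conformers_per_molecule = 2

datapoints = molecules_per_batch * conformers_per_molecule  # 256 per batch
all_pairs = comb(datapoints, 2)                             # unordered pairs in a batch
conformer_pairs = molecules_per_batch * comb(conformers_per_molecule, 2)

conformer_share = 100 * conformer_pairs / all_pairs
print(f"{conformer_pairs} of {all_pairs} pairs are conformer pairs ({conformer_share:.2f}%)")
```

The strong class imbalance (roughly one conformer pair per 255 pairs) is why precision at fixed recall, rather than accuracy, is the informative metric for this benchmark.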
Fig. 5 Distribution of cosine similarities between CLS token values extracted from Uni-Mol, ConforFormer-UniMol and ConforFormer-OMol, as measured on the PharmaIsomer benchmark.
After including a contrastive objective, ConforFormer-UniMol and ConforFormer-OMol learn to cleanly separate conformers and isomers without any additional training. Thus, besides becoming more useful for property prediction, the embeddings can also be competitive in similarity search tasks. To establish this, we needed a more formal evaluation of the model's capability to distinguish conformers and isomers.
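Let $\mathcal{D} = \{(p_i, y_i)\}_{i=1}^{N}$ be a dataset of molecule pairs, where $p_i = (x_i, x_i')$ and $y_i \in \{0, 1\}$ indicates the pair type: $y_i = 1$ for conformers and $y_i = 0$ for isomers. Define index sets $\mathcal{I}_C = \{i : y_i = 1\}$ and $\mathcal{I}_I = \{i : y_i = 0\}$, with $N_C = |\mathcal{I}_C|$ and $N_I = |\mathcal{I}_I|$ (so $N = N_C + N_I$).
Reusing the same similarity as in pretraining, define
$$s_i = \mathrm{sim}(f(x_i), f(x_i')).$$
For a threshold $\theta \in [0, 1]$, predict conformer when $s_i \geq \theta$:
$$\hat{y}_i(\theta) := \mathbb{1}[s_i \geq \theta].$$
Confusion counts and metrics are then defined as follows:
$$\mathrm{TP}(\theta) = \sum_{i \in \mathcal{I}_C} \hat{y}_i(\theta), \quad \mathrm{FP}(\theta) = \sum_{i \in \mathcal{I}_I} \hat{y}_i(\theta), \quad \mathrm{FN}(\theta) = N_C - \mathrm{TP}(\theta),$$
$$\mathrm{Precision}(\theta) = \frac{\mathrm{TP}(\theta)}{\mathrm{TP}(\theta) + \mathrm{FP}(\theta)}, \quad \mathrm{Recall}(\theta) = \frac{\mathrm{TP}(\theta)}{N_C}.$$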
The precision/recall curves constructed by sweeping over θ ∈ [0, 1] can be found in Fig. 6. In this analysis, we treat enantiomers (mirror isomers) as the same molecule; the Uni-Mol backbone is based on a distance matrix and therefore has E(3) symmetry23 and treats enantiomers as identical by design. For the Uni-Mol replicate, the precision at 50% recall was just 8%; for ConforFormer-OMol it was above 83%, with most of the errors coming from the model's limited ability to recognize diastereomers (on backbone isomers its precision at 50% recall was 94%).
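The threshold sweep can be sketched as below. The scores and labels are made-up toy values, not benchmark data, and "precision at 50% recall" is implemented here as the best precision among thresholds whose recall is at least 50%, which is one common convention and an assumption about the exact definition used.

```python
from typing import List, Tuple


def precision_recall(scores: List[float], labels: List[int], theta: float) -> Tuple[float, float]:
    """Precision and recall when pairs with score >= theta are called conformers."""
    predicted = [s >= theta for s in scores]
    tp = sum(1 for p, y in zip(predicted, labels) if p and y == 1)
    fp = sum(1 for p, y in zip(predicted, labels) if p and y == 0)
    n_conformers = sum(labels)
    precision = tp / (tp + fp) if tp + fp else 1.0  # convention when nothing is predicted
    recall = tp / n_conformers
    return precision, recall


def precision_at_recall(scores: List[float], labels: List[int], target: float = 0.5) -> float:
    """Best precision over thresholds that keep recall at or above `target`."""
    best = 0.0
    for theta in sorted(set(scores)):
        precision, recall = precision_recall(scores, labels, theta)
        if recall >= target:
            best = max(best, precision)
    return best


# Toy similarities: label 1 = conformer pair, label 0 = isomer pair.
scores = [0.99, 0.95, 0.93, 0.60, 0.26, 0.10]
labels = [1, 1, 0, 1, 0, 0]

p, r = precision_recall(scores, labels, theta=0.9)
p50 = precision_at_recall(scores, labels, target=0.5)
```

Only candidate thresholds equal to observed scores need to be checked, because precision and recall change only when θ crosses one of them.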
Notably, post-training the model on the train part of the PharmaIsomer dataset saturates the backbone part of the benchmark with 99.9% precision at 50% recall but still reaches just 56% precision at 50% recall for diastereomers. For both isomers and diastereomers, the precision of the model is on par with or higher than that of a baseline utilizing the Tanimoto similarity between the FP2 fingerprints of the molecule pairs as the similarity score $s_i^{T}$. This representation has 100% recall by design at $s_i^{T} = 1$, but it cannot be adjusted to obtain higher precision.
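For reference, the Tanimoto similarity used by the fingerprint baseline can be sketched on sets of on-bit positions. The bit positions below are invented for illustration and are not actual FP2 output; `tanimoto` is a hypothetical helper, not an Open Babel call.

```python
from typing import Set


def tanimoto(bits_a: Set[int], bits_b: Set[int]) -> float:
    """Tanimoto (Jaccard) similarity between two sets of on-bit positions."""
    if not bits_a and not bits_b:
        return 1.0  # convention chosen here for two empty fingerprints
    return len(bits_a & bits_b) / len(bits_a | bits_b)


# Hypothetical fingerprints: conf_1 and conf_2 stand for two conformers of one
# molecule (same graph, hence identical bits); isomer differs in one bit.
conf_1 = {3, 17, 42, 101}
conf_2 = {3, 17, 42, 101}
isomer = {3, 17, 42, 256}

same_molecule = tanimoto(conf_1, conf_2)
different_molecule = tanimoto(conf_1, isomer)
```

Because a graph-derived fingerprint is identical for all conformers of a molecule, every conformer pair scores exactly 1, which gives full recall at that threshold; any isomer pair that happens to share the same bits also scores 1, and no stricter threshold exists to remove it.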
The precision and recall curves (Fig. 6) for recognizing isomers of molecules outside both the Uni-Mol and OMol training datasets show that our model has acquired the ability to make inferences about unique chemical structures without being directly trained on molecular graphs. While the Uni-Mol replicate model seems to rely more on the overall shape of the molecule in making these assessments, ConforFormer-OMol recognizes similarity based on the underlying molecular graph, which it inferred from training with the novel contrastive objective. See Fig. 7 for an example of conformers with very dissimilar shapes and Fig. 8 for a pair of isomers with an overall similar shape. Both pairs have the same similarity of 0.93 in the Uni-Mol embedding space but differ strongly (0.99 vs. 0.26) in the ConforFormer-OMol one. Section F of the SI contains other examples of the models' disagreements in similarity evaluations for conformer and isomer pairs. Beyond the conformer/isomer pairs, molecules that have close embeddings but are not isomers tend to be chemically similar. At cosine similarities of ConforFormer-OMol embeddings >0.90, the molecular pairs invariably share most of the backbone; at similarities >0.75, the molecules typically share some major structural motif. A detailed study of the practical applicability of searching for such close neighbors in drug discovery and related tasks can be the focus of a future study.
Fig. 7 A pair of conformers of the same molecule having similarity of 0.93 in the Uni-Mol embedding space and 0.99 in ConforFormer-OMol (indicating they belong to the same molecule).
These frozen embeddings are directly usable and show competitive performance across quantum-chemical, physico-chemical and bioactivity benchmarks. We observe systematic improvements when conformer alignment is included in the training objectives, especially when high-quality geometries are used during the pre-training (ConforFormer-OMol). Our results suggest that the conformation-aware pre-training can produce representations that transfer robustly beyond the pre-training objective, including in small-data regimes where fully unfrozen fine-tuning would give rise to instabilities.
Beyond property prediction, we find that the learned embedding space readily supports chemist-interpretable similarity analysis without task-specific retraining. On the PharmaIsomer benchmark, ConforFormer embeddings cleanly separate conformers from isomers and substantially outperform classical fingerprint similarity and embeddings obtained without the contrastive objective in terms of precision at a given recall. For example, at 50% recall, precision increases from about 8% for a Uni-Mol replicate to >83% for ConforFormer-OMol, with most residual errors attributable to challenging stereochemical cases (notably, diastereomers). These results point to an emergent ability to encode graph-like structural constraints from 3D geometries alone, even though the model is not trained with an explicit objective to distinguish conformers from isomers.
From a practical perspective, direct access to such robust molecular embeddings provides a computationally efficient alternative to retraining a large backbone for each downstream application. For example, similarity search and screening can be performed directly in the learned embedding space, avoiding the need for task-specific objectives or dedicated fingerprint engineering. In line with this, our similarity analysis on the PharmaIsomer dataset shows that nearest-neighbor relationships inferred directly from the embeddings provide an efficient and chemically reasonable notion of closeness, while remaining far cheaper than full large-scale retraining on often proprietary pharma-related molecular datasets. Future work will focus on improving stereochemical sensitivity, where the current E(3)-invariant design is limiting, on better modeling of conformational distributions and geometry quality, and on extending conformer/isomer labeling via molecular dynamics simulations to more complex and fluxional chemical systems (including organometallic and coordination compounds), potentially augmented by additional training objectives.
Footnote
† Equal contribution.
This journal is © The Royal Society of Chemistry 2026