Lucy Vost,ᵃ Vijil Chenthamarakshan,ᵇ Payel Dasᵇ and Charlotte M. Deane*ᵃ

ᵃ Department of Statistics, University of Oxford, Oxford, UK. E-mail: deane@stats.ox.ac.uk
ᵇ IBM Research, Yorktown Heights, New York, USA
First published on 24th March 2025
Traditional drug design methods are costly and time-consuming due to their reliance on trial-and-error processes. As a result, computational methods, including diffusion models designed for molecule generation tasks, have gained significant traction. Despite their potential, such models have faced criticism for producing physically implausible outputs. To address this problem, we propose a conditional training framework that yields a model capable of generating molecules of varying and controllable levels of structural plausibility. The framework consists of adding distorted molecules to training datasets and annotating each molecule with a label representing the extent of its distortion, and hence its quality. By training the model to distinguish between favourable and unfavourable molecular conformations alongside the standard molecule generation training process, we can selectively sample molecules from the high-quality region of learned space, improving the validity of generated molecules. In addition to the two datasets standardly used by molecule generation methods (QM9 and GEOM), we also test our method on a druglike dataset derived from ZINC. We apply our conditional method to EDM, the first E(3) equivariant diffusion model for molecule generation, as well as to two further models built on EDM: a more recent diffusion model and a flow matching model. We demonstrate improvements in validity as assessed by RDKit parsability and the PoseBusters test suite; more broadly, our findings highlight the effectiveness of conditioning on low-quality data to improve the sampling of high-quality data.
While many models historically operated in 1D or 2D space,2–4 focus has recently shifted towards developing models capable of directly outputting both atom types and coordinates in 3D. Autoregressive models were once prominent in this domain, generating 3D molecules by adding atoms and bonds iteratively.5–7 However, such models suffer from an accumulation of errors during the generation process and do not fully capture the complexities of real-world scenarios due to their sequential nature, potentially losing global context.8,9 To address these limitations, recent studies have turned to diffusion models, which iteratively denoise data points sampled from a prior distribution to generate samples. Unlike autoregressive models, diffusion-based methods can simultaneously model local and global interactions between atoms. Nevertheless, diffusion in molecule generation has faced criticism for yielding implausible outputs.10,11 There have been ongoing efforts to improve the performance of models trained on small molecules such as those found in the QM9 dataset, and as such the models currently available are capable of reliably generating molecules of this size.12–16 However, achieving success in generating larger molecules, as encountered in datasets like GEOM,17 remains challenging without incorporating additional techniques such as energy minimisation or docking.18
In this paper, we focus on enhancing the ability of a diffusion model to generate plausible 3D druglike molecules. To achieve this, we use the property-conditioning method developed by Hoogeboom et al.13 Instead of conditioning a model on pre-existing properties, we condition on conformer quality, training the model to not only generate molecules, but also to distinguish high- and low-quality chemical structures (Fig. 1).
To achieve this, we generate distorted versions of each of the three datasets we evaluate the method on: QM9, GEOM, and a subset of ZINC. We sample molecules from each dataset and apply random offsets to their original coordinates, based on a maximum distortion value. Each distorted molecule is assigned a label representing the degree of warping applied and is added back to the dataset. Non-distorted molecules are also labeled, identifying them as high-quality conformers. Using these datasets of molecules with varying levels of quality, we train property-conditioned models, encouraging the model to learn to label molecule validity while simultaneously training it to generate molecules.
First, we evaluate our conditioning method with EDM, the first E(3) equivariant diffusion model for molecule generation.13 We then test it on two additional models: a geometry-complete diffusion model14 and a flow matching method,19 both designed to enhance the structural plausibility of generated molecules. Since existing models already achieve strong performance on QM9, leaving little room or need for improvement, we focus on evaluating our approach on slightly larger, more chemically complex molecules. To this end, we employ two datasets of druglike molecules: the GEOM dataset, and another derived from the ZINC database.
Our findings demonstrate that across the models tested, conditioning a model with low-quality conformers enables it to discern between favourable and unfavourable molecular conformations. This allows us to target the area of the learned space corresponding to high-quality molecules, resulting in an improvement of the validity of generated molecules. More broadly, this demonstrates the potential of supplementing molecule generation methodologies not solely with examples of desired molecules but also with instances exemplifying undesired outcomes.
D ∼ U(0, Dmax)

This value represents the maximum distance in angstroms that can be added to the atoms of that molecule: in other words, the sampled distortion value determines the maximum extent of perturbation applied to the molecule's structure. Following this, random offsets were generated within the range −D to D for each dimension of every atom's coordinates:

offsetx, offsety, offsetz ∼ U(−D, D)

These offsets were then applied to the original coordinates,

x̃i = xi + offsetx;  ỹi = yi + offsety;  z̃i = zi + offsetz

resulting in a ‘distorted’ version of the molecule. This distorted molecule, along with its corresponding sampled distortion value D, was then added to the training set. Following the generation of the distorted datasets, we use the property-conditioning training protocol outlined by Hoogeboom et al.13 to train on them, using the distortion factor D as the property of interest, and follow the sampling protocol to generate molecules corresponding to D = 0 Å.
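The distortion procedure above can be sketched in a few lines (a minimal illustration with numpy; the function name and data layout are ours, not the authors' code):

```python
import numpy as np

def distort_molecule(coords: np.ndarray, d_max: float, rng: np.random.Generator):
    """Sample D ~ U(0, d_max), then add an independent offset ~ U(-D, D)
    to each coordinate of each atom. Returns the distorted coordinates
    and the distortion label D used for conditioning."""
    d = rng.uniform(0.0, d_max)
    offsets = rng.uniform(-d, d, size=coords.shape)  # one offset per atom per axis
    return coords + offsets, d

rng = np.random.default_rng(0)
coords = np.zeros((5, 3))                 # toy 5-atom molecule at the origin
distorted, d = distort_molecule(coords, 0.25, rng)
```

Note that D upper-bounds the displacement of each individual coordinate, not the total per-atom displacement, which can reach √3·D.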
The QM9 dataset has been extensively used to develop and validate machine learning models for molecular property prediction. However, it has also recently become the central benchmark for de novo molecule generation, particularly in the development of diffusion models,13,14 and as such, new diffusion models are capable of reliably generating molecules similar to those found in QM9.
Similar to Peng et al.,22 we use a version of GEOM from which hydrogens have been removed (GEOMno h), as the positions of hydrogen atoms can often be inferred with a high level of confidence.33 This not only reduces the computational demand of training, but also facilitates more effective learning of heavy atom placements. This makes GEOMno h the quickest of the three druglike datasets to train on, and we therefore use it for conducting ablation tests.
We generate a training set by selecting a subset of 660000 molecules from the druglike catalog of the ZINC database. Unlike GEOM, this subset is curated without repeat conformers. Hydrogen atoms are not included, and the average molecule comprises 26.8 heavy atoms.
RDKit sanitisation and PoseBusters pass rates, % (confidence intervals in parentheses), for EDM trained on each dataset without (baseline) and with distortion factor conditioning. The five columns from ‘All atoms connected’ to ‘Internal energy’ are individual PoseBusters tests.

| Dataset | RDKit sanitisation, % | All atoms connected | Bond lengths | Bond angles | Internal steric clash | Internal energy | All tests passed |
|---|---|---|---|---|---|---|---|
| **Baseline** | | | | | | | |
| QM9 | 92.2 (90.5–93.8) | 100.0 (100.0–100.0) | 100.0 (100.0–100.0) | 99.9 (99.7–100.0) | 100.0 (100.0–100.0) | 88.1 (85.9–90.1) | 81.1 (78.7–83.5) |
| GEOMno h | 84.7 (82.5–86.9) | 74.4 (71.7–77.1) | 65.6 (62.5–68.7) | 73.8 (70.8–76.7) | 97.8 (96.7–98.7) | 75.2 (72.3–78.0) | 62.2 (58.3–66.1) |
| ZINC | 70.6 (67.8–73.3) | 62.4 (59.4–65.3) | 65.0 (61.5–68.4) | 75.3 (72.2–78.5) | 79.5 (76.6–82.4) | 78.5 (75.5–81.4) | 40.0 (37.0–43.0) |
| **Distortion factor conditioning** | | | | | | | |
| QM9 | 65.0 (62.0–68.0) | 84.0 (81.7–86.2) | 96.6 (95.2–98.0) | 95.8 (94.3–97.2) | 98.9 (98.0–99.7) | 91.1 (88.8–93.2) | 46.9 (43.9–50.0) |
| GEOMno h | 92.4 (90.7–94.0) | 89.4 (87.5–91.3) | 96.6 (95.5–97.8) | 93.7 (92.1–95.2) | 100.0 (100.0–100.0) | 87.7 (85.5–89.7) | 68.8 (65.9–71.1) |
| ZINC | 95.3 (93.8–96.6) | 95.7 (94.3–96.9) | 94.1 (92.5–95.6) | 93.7 (92.1–95.2) | 99.0 (98.4–99.6) | 89.4 (87.4–91.4) | 74.5 (71.7–77.3) |
Our baseline analysis reveals a clear relationship between molecule size and model performance. The QM9 dataset, comprising molecules with up to 9 heavy atoms, had the highest baseline performance, with RDKit and PoseBusters pass rates of 92.2% and 81.1%, respectively. The non-conditioned EDM model performed exceptionally well with QM9, surpassing all conditional variants across both evaluation metrics. This superior performance may be attributed to EDM's specific development for QM9, coupled with the dataset's smaller molecular size, which appears to enable better discrimination between high-quality and low-quality conformers without requiring examples of the latter.
The GEOMno h dataset also showed strong baseline performance as assessed with RDKit, with generated molecules achieving a pass rate of 84.7%. However, the more stringent PoseBusters tests presented more of a challenge, with an overall pass rate of 62.2%. This can particularly be attributed to the tests concerning bond lengths (65.6% pass rate), bond angles (73.8%), connectivity (74.4%) and internal energy (75.2%). Conditional training improved these metrics substantially, reaching pass rates of 96.6% (bond lengths), 93.7% (bond angles), 89.4% (connectivity) and 87.7% (internal energy). Minor improvements were also seen in the steric clash test (97.8% to 100%).
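As a rough illustration of what a bond-length check of this kind does, the sketch below compares each bond against the sum of covalent radii. The radii, tolerance, and function are illustrative stand-ins, not PoseBusters' actual implementation:

```python
import numpy as np

# Approximate single-bond covalent radii in angstroms (illustrative values)
COVALENT_RADII = {"H": 0.31, "C": 0.76, "N": 0.71, "O": 0.66, "F": 0.57}

def bond_lengths_plausible(coords, elements, bonds, tol=0.25):
    """Return False if any bonded pair deviates from the covalent-radius
    sum by more than a fraction `tol` of that sum."""
    for i, j in bonds:
        expected = COVALENT_RADII[elements[i]] + COVALENT_RADII[elements[j]]
        observed = float(np.linalg.norm(np.asarray(coords[i]) - np.asarray(coords[j])))
        if abs(observed - expected) > tol * expected:
            return False
    return True

# A C-C bond at a typical 1.54 A passes; the same bond stretched to 2.5 A fails.
good = bond_lengths_plausible([[0, 0, 0], [1.54, 0, 0]], ["C", "C"], [(0, 1)])
bad = bond_lengths_plausible([[0, 0, 0], [2.50, 0, 0]], ["C", "C"], [(0, 1)])
```

A distorted molecule with offsets of a few tenths of an angstrom will typically fail a check like this, which is why the bond-length and bond-angle tests respond so strongly to conditioning.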
The baseline model trained on the ZINC subset exhibited markedly lower performance than the other two datasets, with an RDKit sanitisation pass rate of 70.6% and a PoseBusters pass rate of 40.0%. The most prevalent failures occurred in bond lengths and atom connectivity, with pass rates of only 65.0% and 62.4%, respectively. This performance decline relative to GEOMno h may be attributed to two factors. Firstly, the ZINC dataset's increased diversity, featuring unique conformers rather than multiple conformers per molecule (as found in GEOMno h), may cause the model to prioritise learning atom types over optimising 3D conformer generation. Secondly, and perhaps more significantly, the compositional differences between datasets may play a crucial role: while the ZINC subset exclusively contains medium-sized compounds, GEOMno h incorporates the entire QM9 dataset, resulting in a smaller average molecule size.
Conditional training on the ZINC dataset yielded improved RDKit sanitisation rates and PoseBusters scores. The most notable improvement was observed in the ZINC model's atom connectivity pass rate, which increased from 62.4% to 95.7%, but improvements were also seen in the pass rates of the tests assessing bond lengths (65.0% to 94.1%), bond angles (75.3% to 93.7%), internal steric clash (79.5% to 99.0%), and internal energy (78.5% to 89.4%).
These findings show that while the baseline, non-conditional EDM model excels at generating small compounds, its performance declines when restricted to medium-sized molecules, often producing physically implausible structures. In the next section, we present ablation tests exploring the impact of varying distortion magnitudes and the ratio of distorted to non-distorted molecules when applying our conditioning method.
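For reference, the parenthesised ranges in the tables are confidence intervals on binomial pass rates. An interval of this kind can be computed with a Wilson score interval, sketched below under the assumption of n = 1000 generated molecules per model; the paper does not state its exact interval method or sample size:

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion, as (lo, hi) fractions."""
    p = successes / n
    denom = 1.0 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1.0 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half

# A 40.0% pass rate over 1000 samples gives roughly (37.0%, 43.1%), close to
# the (37.0-43.0) range reported for the ZINC baseline under this assumption.
lo, hi = wilson_interval(400, 1000)
```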
Ablation over the maximum distortion Dmax and the ratio of distorted to non-distorted molecules (GEOMno h, 100 sampled molecules per model; confidence intervals in parentheses). The first two ratio columns are 1:20 and 1:50; the third ratio label is illegible in the source.

| Dmax (Å) | 1:20 RDKit sanitisation, % | 1:20 PoseBusters pass rate, % | 1:50 RDKit sanitisation, % | 1:50 PoseBusters pass rate, % | 1:? RDKit sanitisation, % | 1:? PoseBusters pass rate, % |
|---|---|---|---|---|---|---|
| 0.1 | 96 (92–99) | 73 (64–81) | 96 (92–99) | 77 (69–85) | 96 (92–99) | 77 (68–85) |
| 0.25 | 95 (90–99) | 52 (42–62) | 97 (92–99) | 81 (73–88) | 96 (92–99) | 77 (68–85) |
| 0.5 | 97 (93–100) | 75 (66–83) | 97 (93–100) | 78 (70–86) | 95 (90–99) | 68 (59–77) |
| 1 | 93 (88–97) | 57 (47–67) | 89 (78–97) | 54 (38–70) | 62 (52–71) | 8 (3–14) |
We introduced varying numbers of distorted molecules at different distortion levels (ranging from 0 Å, indicating no distortion, to the maximum distortion, Dmax) into the original GEOMno h dataset. We defined dataset ratios based on the number of distorted and original molecules: for example, a 1:50 ratio indicates one distorted molecule was added for every fifty original molecules. We evaluated each model's performance by training conditioned models and sampling 100 molecules, ensuring that the samples were drawn from the low-distortion-factor region of the learned space (formally, enforcing D = 0 Å).
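Putting the ratio and distortion parameters together, building a conditioned training set might look like the following sketch (names and data layout are ours; originals are labelled D = 0 and one distorted copy is added per `ratio` originals):

```python
import numpy as np

def build_conditioned_dataset(mols, d_max=0.25, ratio=50, seed=0):
    """Return (coords, distortion_label) pairs: every original molecule
    labelled 0.0, plus len(mols)//ratio distorted copies labelled with
    the distortion level D ~ U(0, d_max) applied to them."""
    rng = np.random.default_rng(seed)
    dataset = [(coords, 0.0) for coords in mols]          # originals: D = 0
    for idx in rng.choice(len(mols), size=len(mols) // ratio, replace=False):
        d = rng.uniform(0.0, d_max)
        noisy = mols[idx] + rng.uniform(-d, d, size=mols[idx].shape)
        dataset.append((noisy, d))
    return dataset

mols = [np.zeros((8, 3)) for _ in range(100)]             # toy dataset
ds = build_conditioned_dataset(mols)                      # 100 originals + 2 distorted
```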
The model trained on a dataset with a ratio of 1:50 distorted molecules and a maximum distortion of 0.25 Å exhibited the joint-highest RDKit parsability rate of 97%, and the highest PoseBusters pass rate at 81%. While several other models also reached 97% RDKit sanitisation rates (namely 1:20, Dmax = 0.5 Å and 1:50, Dmax = 0.5 Å), these models exhibited slightly lower PoseBusters pass rates (75% and 78%, respectively). Increasing or decreasing Dmax further resulted in PoseBusters performance decreasing across all ratios, primarily due to failures in the internal energy test.
This observation suggests that if the training includes molecules that are too distorted, the model does not effectively learn to distinguish between subtly flawed and acceptable molecular structures. Distorted molecules should therefore still bear some resemblance to realistic conformers, albeit with deliberately infeasible bond lengths and angles. On the other hand, insufficient distortion compromises the effectiveness of the conditioning classifier, and the models struggle to distinguish between high-quality and low-quality conformations, leading to poor performance in generating desirable molecules.
These results demonstrate the concept of conditioned training on negative data, and give an idea of the extent of distortion and frequency of distorted molecules to add. We used a ratio of 1:50 and Dmax = 0.25 Å for all subsequent tests, but note that the optimal values of these parameters are likely to differ between datasets.
We also examined the quality of molecules generated when sampling from the low-quality region of the learned space (formally, D = Dmax Å). The results of this are shown in the ESI.† The molecules sampled using D = Dmax Å are, as expected, worse than both the conditioned models and the baseline model in terms of PoseBusters pass rates, with the highest reaching only 53%. This poor performance is mainly attributed to failures in the internal energy test.
Having established the parameters for conditional training datasets in terms of quantity of distorted molecules and extent of distortion, and demonstrated that our conditioning method enhances the structural plausibility of generated molecules when EDM is trained on ZINC or GEOMno h, we now move on to testing this approach on other models.
RDKit sanitisation and PoseBusters pass rates, % (confidence intervals in parentheses), for (a) GCDM and (b) MolFM trained on GEOMno h and ZINC.

| Model | Training | Dataset | RDKit sanitisation, % | PoseBusters pass rate, % |
|---|---|---|---|---|
| (a) GCDM | Baseline | GEOMno h | 100.0 (100.0–100.0) | 77.8 (75.2–80.4) |
| | Baseline | ZINC | 56.3 (53.3–59.3) | 40.8 (38.0–43.6) |
| | Conditional | GEOMno h | 99.9 (99.7–100.0) | 79.7 (77.2–82.1) |
| | Conditional | ZINC | 97.2 (95.8–98.4) | 66.5 (62.5–70.4) |
| (b) MolFM | Baseline | GEOMno h | 98.6 (97.5–99.6) | 80.8 (77.3–84.1) |
| | Baseline | ZINC | 72.0 (69.2–74.8) | 42.3 (39.2–45.4) |
| | Conditional | GEOMno h | 94.5 (93.0–95.9) | 46.8 (43.7–49.8) |
| | Conditional | ZINC | 93.3 (91.6–94.9) | 45.9 (42.6–49.1) |
For the GEOM dataset, the GCDM conditional model shows very marginal improvements in PoseBusters performance over the baseline (mostly due to the internal energy test, for which the pass rate increases from 86.2% to 88.6%). However, since baseline performance is already high, conditioning has limited overall impact. GCDM trained on the ZINC subset, on the other hand, shows a much more substantial improvement with conditioning. The baseline model struggles with molecule connectivity: the pass rate for this test increases from 63.7% to 94.4% when our conditioning method is applied, resulting in an overall boost in both RDKit sanitisation and PoseBusters pass rate.
Training MolFM with the conditional method does not improve the plausibility of generated molecules on GEOMno h, where many of the generated molecules suffer from connectivity issues. It does, however, improve the plausibility of generated molecules on the ZINC dataset, by a margin similar to that shown by EDM.
In conclusion, our conditioning method, developed and tested with EDM, enhances molecular plausibility across different models on the ZINC dataset without modification. These results suggest that the conditioning approach is broadly applicable.
The method shows strongest improvements for diffusion-based models, particularly those built on the EDM framework, which comprise a significant portion of current 3D molecule generation approaches. The approach is also more effective for datasets containing larger, drug-like molecules, as demonstrated by our results with GEOMno h and ZINC. This can be explained mechanistically: larger molecules have more complex conformational spaces, where explicit examples of invalid states help define the boundary between high-quality and low-quality conformers. Conversely, the method provides limited benefits for datasets like QM9, where molecules are small (up to 9 heavy atoms) and have constrained conformational spaces. In these cases, models appear able to learn to distinguish valid from invalid conformations from the training data alone.
Our findings underscore the importance of considering the quality of conformers in molecule generation processes. The results show that by training models to discern between favorable and unfavorable molecular conformations, we can selectively sample from the high-quality region of learned space, resulting in significant improvements in the validity of generated molecules.
Moving forward, further research could explore additional conditioning methods and datasets to continue improving the quality and diversity of generated molecules. Additionally, investigating the applicability of our approach to other areas of molecular design and exploration could yield valuable insights for drug discovery and beyond. Overall, our study provides a promising avenue for generating valid drug-sized molecules efficiently and effectively.
Footnote
† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d4dd00331d
This journal is © The Royal Society of Chemistry 2025 |