Open Access Article
Michal
Michalski
*a and
Slawomir
Berski
*b
aInstitute for Research in Biomedicine (IRB Barcelona), Barcelona Institute of Science and Technology, 08028 Barcelona, Spain. E-mail: michal.michalski@irbbarcelona.org
bFaculty of Chemistry, University of Wroclaw, 50383 Wroclaw, Poland. E-mail: slawomir.berski@uwr.edu.pl
First published on 13th May 2026
Chemical bonding can be characterized within quantum chemical topology (QCT), which provides a real-space description via the topological analysis of the electron density and the electron localization function (ELF). While QCT has traditionally been applied on a molecule-by-molecule basis, recent advances in machine learning (ML) and the availability of large quantum chemical datasets now enable bonding analysis at scale. Here, we integrate ELF-based topological descriptors with ML to establish a data-driven framework for mapping chemical bonding across the QM9 dataset. Wavefunctions computed at the B3LYP/6-31G(2df,p) level were used to extract ELF basin populations, which were paired with geometric and bonding descriptors to construct a bond-level dataset. Statistical analysis revealed relationships between ELF populations, bond lengths, and local chemical environments. Regression models were trained to predict ELF electron populations directly from molecular geometry. The best performance was obtained when local environmental descriptors were included, reducing the prediction error by a factor of two relative to models using only the bond type and bond length. These results demonstrate that real-space bonding parameters, such as bond electron populations, can be predicted from simple structural features, enabling scalable and interpretable exploration of chemical bonding across large chemical spaces.
In parallel with theoretical developments, the past decade has witnessed a transformative shift toward data-driven molecular modelling, driven by advances in machine learning (ML) and the emergence of large, curated quantum chemical datasets.21–23 Collections such as QM7,24,25 QM8,26,27 and particularly QM928 have provided standardized benchmarks for predicting molecular properties directly from atomic configurations. The QM9 dataset, which consists of numerous small organic molecules made up of elements up to the second row, provides high-quality DFT-calculated geometries. These datasets have enabled the training of ML models, ranging from kernel-based regressors29,30 to message-passing neural networks,31,32 that reproduce molecular properties such as total energies,33 dipole moments,34 and electronic descriptors35 directly from atomic configurations at near-chemical accuracy but several orders of magnitude faster. More recently, the development of large-scale molecular repositories, such as ChEMBL,36 PubChem,37 ZINC database,38 OMol25,39 NAPROC,40 and COCONUT41,42 extending into the millions of chemical structures, has opened new opportunities for exploring the chemical space of up to medium-sized molecules.
The present work integrates QCT with ML to establish a data-driven framework for understanding chemical bonding at a large scale. Specifically, ELF-derived topological descriptors were computed for the QM9 dataset, allowing the extraction of bond-level information such as ELF basin populations (
), interatomic distances (r), bond types, and chemical environments. Throughout this work, ELF basin populations corresponding to bonding basins are termed bond populations, using a bond-centric terminology to improve accessibility for audiences outside the QCT community. These descriptors were used to train and validate ML models capable of predicting bond populations directly from molecular geometry. By combining real-space bonding analysis with ML, this approach provides an interpretable pathway for mapping the chemical bonding across millions of molecules, bridging the gap between QCT and modern data science.
885 small organic molecules built from combinations of five chemical elements: carbon (C), hydrogen (H), nitrogen (N), oxygen (O), and fluorine (F). Each entry is accompanied by quantum-chemically derived geometric, energetic, electronic, and thermodynamic properties, providing a rich library for data-driven exploration of molecular structure–property relationships. The dataset encompasses 6095 constitutional isomers, representing a chemically diverse yet systematically defined subset of the vast organic chemical space. Owing to its computational accuracy, internal consistency, and broad chemical coverage, QM9 has become one of the most widely used benchmark datasets for the development, validation, and benchmarking of ML models in theoretical chemistry.
To extend the QM9 dataset with topological information derived from real-space electron density, wavefunction (WFN) files were generated using the Gaussian 16 (G16, version C.01) program package.43 Calculations were performed at the density functional theory (DFT) level employing the B3LYP functional44–47 and the 6-31G(2df,p)48–50 basis set, fully consistent with the computational protocol of the original QM9 study. The optimized molecular geometries available in QM9 were used directly as input to ensure comparability of electronic properties across the dataset. The resulting WFN files served as input for topological analysis of the ELF using the TopMod09 program package.51 The ELF calculations were carried out on a cubic grid with a step size of 0.05 bohr.
The basin populations (
) obtained from the integration of electron density for ELF-localization basins were processed to generate a ML dataset. The TopMod09 output files (.top) and corresponding wavefunction files (.wfn) were parsed to extract basin information and populations and to compute geometric and electronic descriptors for each ELF basin, corresponding to the description of chemical bonds. For each bonding attractor, the two nearest atomic centres were identified, and the corresponding interatomic distances were calculated. Valence bonding basins associated with the same atomic pair were merged, and the resulting data, comprising bond identity, defined as the pair of atoms forming a given chemical bond, electron (basin) population, bond length, and molecular identifier, were compiled into a unified dataset. Additionally, for each bond, the local chemical environment of the bonded atoms was described by identifying neighbouring atomic centres within a fixed spatial cut-off of 2.0 Å and enumerating their pairwise connections. The resulting environment labels capture the immediate bonding context around each atom, reflecting both coordination and compositional patterns within the molecule.
ML training was performed to predict ELF-derived
from geometric and bonding descriptors. The dataset was preprocessed by one-hot encoding the bond type and using the bond length as a continuous feature. Model hyperparameters were optimized through five-fold stratified cross-validation, in which stratification by bond type ensured balanced representation of different chemical bonds across training (80%) and test (20%) subsets, thereby preventing bias toward more frequent bond classes. Three regression algorithms were employed: (i) ridge regression, evaluated using regularization parameters α (0.1 to 10), with and without intercept fitting and with optional positivity constraints; (ii) gradient boosting random forest (GBRF) regression, implemented using the LightGBM framework with GPU acceleration, where the number of leaves (11 to 51), learning rate (0.001 to 0.1), and maximum tree depth (−1 to 10) were optimized; and (iii) a feed-forward neural network (FF-NN), constructed in Keras with the PyTorch backend, consisting of two hidden layers with neuron counts (512 to 5120) and dropout rates (0.25 to 0.75). The FF-NN was trained using the Adam optimizer with mean squared error (MSE) as the loss function and early stopping (patience 20 epochs) to prevent overfitting. After cross-validation, the production model was trained with the best-performing hyperparameter configurations in the full dataset. Model performance was assessed using mean absolute error (MAE). All calculations were carried out in Python 3.10 using scikit-learn 1.7.2,52 LightGBM 4.6.0,53 and Keras 3.11.354 with the PyTorch 3.655 backend.
![]() | ||
| Fig. 1 Probability density distributions of equilibrium bond lengths (r) in the QM9 dataset computed at the B3LYP/6-31G(2df,p) level. | ||
The C–C bond-length probability density displays a broader, asymmetric profile with a dominant maximum near 1.53 Å, consistent with the canonical single-bond distance observed in alkanes, but extending toward shorter values, where the tail of the distribution captures partial π-bonding and conjugated motifs.57 The C–N and C–O bonds exhibit similar behaviour, with probability maxima at 1.46 Å and 1.41 Å, respectively, accompanied by secondary shoulders at shorter distances corresponding to amide, imine, and carbonyl substructures.58 In both cases, the continuous nature of the distributions suggests a modulation of the bond length as a function of local hybridization and electronic delocalization rather than discrete bond classes. Additionally, both the C–N and C–O distributions display noticeable broadening, which can be attributed to their occurrence not only in typical functional groups but also within heterocyclic ring systems,59 where geometric constraints and aromatic delocalization further modulate bond lengths. The C–F bonds form a narrow distribution centred at 1.34 Å, reflecting the rigidity of this strongly polarized covalent bond.60 In contrast, the N–O bonds exhibit a distinct bimodal profile, with maxima near 1.22 Å and 1.41 Å, corresponding to N–O double bonds, as found in nitroso and nitro groups,61 and N–O single bonds, as in nitrate and nitrite groups,62 respectively. Although the N–N bonds in the QM9 dataset display an apparently symmetric, unimodal bond-length distribution centred at approximately 1.34 Å, from a chemical perspective, one would formally expect a bimodal or even trimodal distribution reflecting the coexistence of single, double, and triple N–N bonds. Typical equilibrium bond lengths for these interactions are well separated: N–N single bonds, as found in hydrazine-like motifs, occur near 1.45 Å,63 N
N double bonds, characteristic of azo and diazene fragments, are observed around 1.25 Å,64 and N
N triple bonds, exemplified by dinitrogen, exhibit a much shorter distance of approximately 1.10 Å.65 The absence of a clearly resolved trimodal structure in the QM9 statistics can be attributed to the limited chemical diversity of nitrogen-rich species in the dataset, as well as to the dominance of substituted or conjugated environments in which formal bond orders are partially delocalized. In such cases, resonance and hyperconjugation smear discrete bond-order classes into a continuous distribution of intermediate bond lengths. Consequently, the observed unimodal profile centred near 1.34 Å reflects an averaging over multiple bonding regimes rather than the exclusive presence of a single bond type.
For each bond type, the shape and dispersion of the probability density serve as statistical measures of structural variability and chemical-context dependence within the QM9 molecular space. Narrow distributions correspond to strongly localized σ-bonds with a minimal dependence on the surrounding molecular framework (C–H, N–H, O–H, and C–F), whereas broader or multimodal distributions reflect flexible bonding environments, resonance effects, and hybridization diversity (C–C, C–O, C–N, N–O, and N–N). The close correspondence between the modal bond lengths from QM9 and the experiment across all bond types confirms the geometric accuracy calculated at the B3LYP/6-31G(2df,p) level.
The analysis of
provides complementary insight into the electronic structure of chemical bonds within the QM9 dataset (Fig. 2). Each histogram represents the probability density function of the integrated electron population associated with a given bonding basin, as obtained from the topological ELF partitioning. The resulting distributions exhibit distinct and chemically interpretable features for each bond type, reflecting differences in covalent character, polarity, and electronic delocalization. The C–H, O–H, and N–H bonds display narrow, unimodal distributions with probability maxima near 2.0 e−, consistent with two electrons localized in a strongly covalent σ-bond. The limited width of these peaks indicates a high degree of uniformity in electron sharing across the dataset, in agreement with the largely single-bonded character of hydrogen-containing groups.
![]() | ||
Fig. 2 Probability density distributions of ELF populations ( ) in the QM9 dataset computed at the B3LYP/6-31G(2df,p) level. | ||
The C–C bonds show a broader and slightly asymmetric probability distribution centred near 2.0 e−, with a tail extending toward higher values up to 5.5 e−, corresponding to π-delocalized systems and double or triple bond environments. This variation reflects the diversity of carbon–carbon bonding motifs in QM9, ranging from single σ-bonds in alkanes to conjugated and cumulene-like structures exhibiting increased electron localization between carbon centres. Similarly, the C–N and C–O bonds display broad, multimodal distributions spanning from 1.0 to 5.0 e−, consistent with the coexistence of single and double bonds in both C–N and C–O and triple bonds in C–N. The presence of a significant population density below 2.0 e− indicates partial electron transfer toward the more electronegative atom, particularly pronounced for C–O bonds.
The C–F bonds yield a narrow but systematically shifted peak at 1.14 e−, highlighting their strongly polarized nature and reduced shared-electron character due to the high electronegativity of fluorine. A similar trend is observed for N–O bonds, which exhibit maxima around 1.02 e−, consistent with the mixed covalent-ionic character typical of nitro and oxo functional groups. Previously, one of us investigated the nature of the N–O bond, using the topological analysis of the ELF.66 Such a bond, formally being a single bond, may exhibit a different topology of the ELF, which is characterized by a single disynaptic V(N,O) attractor, two monosynaptic V(N) and V(O) attractors, a single monosynaptic attractor V(N) and a lack of any valence attractors in the bonding region. The N–N bonds show a broader distribution extending from 1.0 to 3.0 e−, reflecting their variable bond order and tendency toward delocalization. Across all bond types, the shape and dispersion of ELF population distributions provide statistical measures of electron-pair localization and its modulation by bond polarity and hybridization. High-probability peaks near 2.0 e− correspond to well-localized σ-bonds with electron sharing, whereas broader or multimodal distributions indicate variable bonding multiplicity, conjugation, or polarization effects.
The combined analysis of
and its correlation with bond lengths (r) provides a quantitative description of the relationship between geometric and electronic structures across the QM9 dataset. The probability distributions of
reveal the expected localization of approximately two electrons for most covalent σ-bonds, while the
(r) correlation plots demonstrate how the electron population varies with internuclear distance (Fig. 3). For each bond type, a linear regression was fitted to the
(r) data, and the corresponding coefficient of determination (R2) was used as a measure of correlation strength. High R2 values indicate a strong and chemically meaningful dependence of ELF population on bond length, whereas low or vanishing R2 values identify bond types where electron localization is dominated by polarity or nonbonding contributions rather than by distance.
![]() | ||
Fig. 3 Correlation between ELF population ( ) and bond length (r) in the QM9 dataset computed at the B3LYP/6-31G(2df,p) level. The dashed horizontal line indicates = 0.75 e−. | ||
The strongest correlations were observed for C–C, C–N, C–O, N–N, and N–O bonds, all with R2 > 0.85, reflecting the expected inverse relationship between basin population and internuclear distance: shorter bonds correspond to higher electron populations within the bonding basin. In particular, C–O (R2 = 0.955) and N–N (R2 = 0.913) display nearly linear trends, consistent with smooth transitions between single, double, and triple bonding regimes in these systems. The C–C and C–N correlations (R2 = 0.86–0.90) similarly capture bond-order modulation arising from conjugation and hybridization effects. In contrast, bonds involving hydrogen: C–H, O–H, and N–H, exhibit narrow distributions of both r and
, resulting in weak or negligible correlations (R2 < 0.25). This behaviour reflects the nearly constant bond lengths and uniformly localized σ-bonds in hydrogen-containing fragments, where population fluctuations are minimal. The C–F bonds also display poor correlation (R2 = 0.026) due to their strong polarity, as most of the electron density is concentrated near the fluorine atom, effectively decoupling the ELF population from geometric variation.
The occurrence of low ELF basin populations (
< 0.75 e−) can be attributed to electronic delocalization or topological artifacts inherent in the ELF partitioning procedure (Fig. 4). In many instances, the electron density that would normally be localized between two atoms is instead partially delocalized over two or more attractors, leading to the appearance of several weakly localized basins rather than a single well-defined bonding domain. Consequently, integration of the electron density over each individual basin yields a population smaller than the canonical value, even though the cumulative density within the entire delocalized region remains consistent with a shared-electron interaction. In strongly polarized bonds, such as C–N and N–O, the bonding electron density is displaced toward the more electronegative atom and partially incorporated into its adjacent lone-pair basin, resulting in a residual interatomic basin with a low population. A representative case in the QM9 dataset is entry 066597 (Fig. 4A and C), where the electron density is delocalized around the nitrogen atom across two C–N bonds and one C–C bond. The C–C bond is described by two disynaptic basins, V(C,C), with populations of 2.30 and 0.41 e−, respectively, while the two C–N bonds correspond to V(C,N) basins with populations of 3.98 and 2.99 e−. A similar effect is observed for QM9 entry 131625 (Fig. 4B and D), where multiple monosynaptic basins, V(N) and V(O), with populations from 0.23 e− to 4.56 e−, collectively describe the lone pairs on nitrogen and oxygen atoms, as well as the associated N–O bond. Thus, comparable effects are observed in geometrically constrained or highly strained systems, as well as in multicentre arrangements, where the ELF field exhibits diffuse or shallow maxima, reflecting numerical or topological segmentation rather than distinct localized electron pairs. Collectively, these observations indicate that basins with low populations arise in situations where electron localization is delocalized or not adequately captured by the synaptic assignment.
This correlation analysis serves as a filtering criterion for identifying chemically meaningful bonding basins for subsequent ML modelling. Bonds exhibiting high R2 values correspond to shared-electron interactions that show well-defined relationships between bond length and electron population, whereas those with low R2 values represent either highly polarized, ionic, or topologically ambiguous cases in which ELF attractors do not correspond to conventional two-centre bonds. The continuous and monotonic trends observed for C–C, C–N, and C–O systems further demonstrate that ELF populations encode a measure of bond order consistent with classical chemical intuition. Taken together, the geometric, electronic, and correlative analyses establish a statistical framework, linking molecular geometry, electron localization, and bond order across the QM9 dataset. These results provide the conditions for constructing data-driven models of bonding suitable for ML applications.
Three representative regression models were selected to validate different levels of model complexity, interpretability, and performance. Ridge regression was included as a baseline to quantify the extent to which ELF populations can be described through simple linear relationships between geometric descriptors and electron localization. The GBRF model, based on an ensemble of decision trees, was chosen to capture nonlinear and nonadditive dependencies between features while retaining interpretability. Finally, the FF-NN model was employed as a flexible, high-capacity nonlinear model capable of learning complex, continuous mappings between molecular geometry and ELF populations. Together, these three methods provide a balanced comparison across linear, ensemble-based, and neural network paradigms, enabling systematic assessment of both the complexity of
prediction and the degree of nonlinearity required for accurate modelling. All models were trained and validated on the QM9 dataset using five-fold cross-validation, and performance was evaluated using the MAE metric.
In the first stage, models were trained using only the bond descriptors, bond identity and bond length, without incorporating any explicit information about the bond environment. Hyperparameter optimization was carried out via grid search. For ridge regression, the optimal configuration corresponded to a regularization strength of α = 10, with intercept fitting disabled and no positivity constraint. For the GBRF regression model, the lowest MAE was obtained for num_leaves = 51, learning_rate = 0.1, and max_depth = −1. For the FF-NN model, the optimal setup used 5120 neurons per hidden layer, a dropout rate of 0.25, and a learning rate of 0.001. Among these models (Fig. 5a), the ridge regression provided a baseline performance with an MAE of 0.19 ± 0.001 e−, reflecting its limited ability to capture nonlinear relationships between geometric and electronic features. The FF-NN achieved an MAE of 0.18 ± 0.001 e−; however, the difference relative to ridge regression does not support a meaningful improvement in predictive accuracy. The best performance was obtained with the GBRF model, which reached an MAE of 0.08 ± 0.001 e−, demonstrating its capacity to model complex, non-additive dependencies between geometric descriptors and ELF populations.
In the second stage, the same regression framework was extended by including local chemical environment descriptors for each bond, representing the connectivity and elemental composition of neighbouring atoms. The GBRF model was optimized independently for this extended feature set, and the resulting best-performing parameters are num_leaves = 51, learning_rate = 0.1, and max_depth = −1. The incorporation of environmental information led to a marked improvement in predictive accuracy (Fig. 5a): whereas the GBRF model without environmental descriptors yielded an MAE of 0.08 ± 0.001 e−, inclusion of environmental features reduced the error to 0.04 ± 0.001 e−. This improvement highlights the role of the local bonding context in determining electron localization, as neighbouring atoms influence ELF-derived bond populations. Based on these findings, the GBRF regression model with environment descriptors was selected as the production model for further analysis.
To further evaluate the data efficiency and scalability of the trained models, learning curves were analysed as a function of training set size (Fig. S1 and S2, SI). These results explain how predictive performance evolves with increasing data availability. The GBRF model exhibits high data efficiency, achieving near-optimal performance even at small training fractions, with only marginal improvements upon increasing dataset size. This trend indicates that the model is capable of extracting the dominant structure–property relationships from relatively limited data. In contrast, the FF-NN model shows an improvement in accuracy by increasing the training set size up to approximately 50% of the dataset, beyond which performance saturates, suggesting that model capacity becomes the limiting factor rather than data availability. The ridge regression model displays limited scalability, with increasing error as the training set size grows, reflecting its inability to capture nonlinear dependencies in increasingly diverse chemical environments. Notably, the inclusion of environmental descriptors in the GBRF model leads to a consistent improvement in performance with increasing dataset size, highlighting the importance of local chemical context and its dependence on sufficient sampling of chemical space.
To gain deeper insight into the chemical interpretability of the trained model, detailed feature importance and residual analyses were conducted (Fig. 5b–d). The feature importance analysis revealed that the bond length constitutes the single most influential descriptor, contributing approximately 37% of the total predictive variance (Fig. 5b). The remaining 63% arises from descriptors representing the elemental identities of the bonded atoms (1%) and the chemical environment (62%), which reflects modulation of electron density near the chemical bond. While the bond length is the most influential single descriptor, the local chemical environment plays a dominant role in determining electron localization within a bond. In effect, accurate prediction requires knowing not only how long the bond is but also where it exists within the molecular framework. Complementing these findings, the residual analysis (Fig. 5c) showed an approximately Gaussian error distribution centred near zero, with minimal systematic bias across bond classes, indicating no major systematic deficiencies, with remaining deviations likely arising from chemically complex bonding cases. To place these errors in the context, the distributions of target ELF populations were analysed (Fig. 5d). The true ELF populations span a broad range, from values near 0.5 e− for highly polarized bonds to more than 5.5 e− for the most electron-rich bonds. The predicted distributions closely follow the true distributions across all bond types, with good agreement in both medians and overall shapes, and only minor deviations are observed in the tails. In particular, the model reproduces the lower ELF populations observed in polarized bonds such as N–O and C–O, as well as the higher populations associated with more covalent bonds such as C–C and C–N, while also reflecting the substantial intra-type variability evident from the broad distributions. Taken together with MAE ≈ 0.04 e−, these results confirm that the model performs well relative to the variability of the dataset and that the predictions remain reliable across the range of chemical bonds present in the QM9 dataset.
(1) The exploratory data study of topological analysis of the ELF revealed consistency between bond lengths and populations across all bond types in the QM9 dataset. In addition, the
(r) correlation displays a uniform inverse trend, and shorter bonds carry higher basin populations, with strong linear correlations for C–C, C–N, C–O, N–N and N–O (R2 > 0.85) and weak correlations for C–H, O–H, N–H, and C–F bonds, where limited geometric variability dominates.
(2) The regression analyses showed that ELF-derived bond populations can be predicted from the geometric structure alone, with nonlinear ML algorithms effectively capturing the dependencies between the bond length, bond type, and local environment. Among the tested models, GBRF achieved the best performance, reaching the lowest MAE when environmental features were included. Its balance between predictive accuracy and interpretability highlights GBRF as a transparent model for large-scale chemical space exploration.
(3) Beyond methodological considerations, the present work establishes the interplay between molecular geometry and the real-space bonding topology, which can be formalized through data-driven modelling. Although the current analysis is focused only on the QM9 dataset, the proposed framework is readily generalizable to other curated molecular databases, thereby enabling systematic investigations of chemical bonding across broader regions of chemical space. Also, the integration of geometrical descriptors within ML workflows facilitates the prediction and classification of chemical bonds across extensive molecular datasets, by reducing the simulation cost needed for computationally demanding conventional wavefunction-based analyses.
All scripts used in this work are publicly available at https://github.com/Parecido/ELF-AI. The repository contains Python scripts for dataset preparation, cross-validation, hyperparameter optimization, and training of the production model. Example wavefunction (.wfn) and topology (.top) files are also provided to illustrate the data format and workflow. The raw data supporting the conclusions will be made available by the authors on request.
| This journal is © the Owner Societies 2026 |