Sascha Thinius
Fraunhofer Institute for Manufacturing Technology and Advanced Materials, IFAM, Wiener Strasse 12, 28359 Bremen, Germany. E-mail: sascha.thinius@ifam.fraunhofer.de; sascha.thinius.87@gmail.com
First published on 17th November 2025
This research investigates predicting the gap between the Highest Occupied Molecular Orbital and the Lowest Unoccupied Molecular Orbital (HOMO–LUMO, abbreviated HL) of natural compounds, a crucial property for understanding molecular electronic behavior relevant to cheminformatics and materials science. To address the high computational cost of traditional methods, this study develops a high-throughput, machine learning (ML)-based approach. Using 407 000 molecules from the COCONUT database, RDKit was employed to calculate and select molecular descriptors. The computational workflow, managed by Toil and CWL on a high-performance computing (HPC) Slurm cluster, utilized Geometry, Frequency, Noncovalent, eXtended Tight Binding (GFN2-xTB) for electronic structure calculations with Boltzmann weighting across multiple conformational states. Three ensemble methods, namely Gradient Boosting Regression (GBR), eXtreme Gradient Boosting Regression (XGBR) and Random Forest Regression (RFR), as well as a Multi-layer Perceptron Regressor (MLPR), were compared based on their ability to accurately predict HL-gaps in this chemical space. Key findings reveal molecular polarizability, particularly SMR_VSA descriptors, as crucial for HL-gap determination in all models. Aromatic rings and functional groups, such as ketones, also significantly influence the HL-gap prediction. While the MLPR model demonstrated good overall predictive performance, accuracy varied across molecular subsets. Challenges were observed in predicting HL-gaps for molecules containing aliphatic carboxylic acids, alcohols, and amines in molecular systems with complex electronic structure. This work emphasizes the importance of polarizability and structural features in HL-gap predictive modeling, showcasing the potential of machine learning while also highlighting limitations in handling specific structural motifs. These limitations point towards promising perspectives for further model improvements.
While machine learning models have shown promise in predicting molecular properties, their application to HL-gap prediction in large and diverse datasets of natural products remains unexplored. Furthermore, an accurate prediction of the HL-gap for natural products is particularly challenging due to their structural complexity and diversity.
Several studies have demonstrated the potential of ML models to accurately and efficiently estimate various molecular characteristics, including electronic properties crucial for understanding chemical behavior.1–10 The application of ML in this domain has ranged from traditional models using curated feature sets1,2,6,9,11–13 to sophisticated deep learning architectures that learn directly from molecular structures.1,2,14–16 Early and contemporary studies have successfully used models like Random Forests,1,8,9,11,17 Support Vector Machines,1,12 and Gradient Boosting Regressors4,17–19 with pre-calculated molecular descriptors and fingerprints9,12,13,17 to achieve strong predictive performance on diverse datasets.2,7,9,12,17,18,20,21 Even simple linear models have shown success in narrow chemical spaces such as polycyclic aromatic hydrocarbons.22,23 While powerful, these methods' performance is intrinsically tied to the quality and relevance of the features. For instance, Pereira et al.9 explored random forest models for predicting HOMO and LUMO energies, achieving good accuracy with molecular descriptors combined with semi-empirical orbital energies. Schmidt et al.24 explored various ML algorithms, including linear and kernel-based regression, decision trees, and neural networks, for predicting properties like crystal structure and thermal conductivity, emphasizing the trade-off between accelerated research and the challenges of interpretability and data quality.
Reiser et al.25 reviewed the application of graph neural networks (GNNs) in materials science. The field has seen a significant shift towards deep learning, particularly with the advent of GNNs that can leverage complete atomic-level representations. These end-to-end models learn relevant features directly from the molecular graph, mitigating the need for manual feature engineering.1,11,14–16,19,20,26 Seminal works on the QM9 benchmark dataset established the high performance of the following methods:
■ SchNet,16 a continuous-filter convolutional neural network that operates on atomic types and Cartesian coordinates and has been successfully applied not only to QM9 (ref. 16) but also to complex systems like oligothiophenes,1,15 achieving the best performance among the compared GNNs, particularly for larger molecules.
■ Message Passing Neural Networks (MPNNs),27 a general framework for learning on graphs, with variants like deep (D)MPNN8 also showing excellent performance.
■ MatErials Graph Network (MEGNet),2,26 a universal GNN framework for predicting properties of both molecules and crystals, incorporating global state variables and demonstrating transfer learning capabilities.
■ Other advanced architectures like Deep Tensor Neural Networks (DTNNs)11,15,21,28 have also proven effective for predicting electronic properties and designing novel molecules.
■ Generative models, such as the recurrent neural network (RNN) with transfer learning employed by Yuan et al.19 to generate novel oligomers with targeted HL-gaps, demonstrating both the potential of deep generative models and the inherent trade-off between chemical-space exploration and property optimization.
■ Finally, Montavon et al.6 introduced a deep multi-task neural network for predicting multiple electronic properties.
A key challenge in applying these data-hungry models is the scarcity of high-quality data for specific or complex chemical systems. To address this, transfer learning has emerged as a powerful strategy. By pre-training a model on a large, general dataset (e.g., PubChemQC) and then fine-tuning it on a smaller, specific dataset, researchers have successfully predicted properties for conjugated oligomers,3 porphyrins8 and organic photovoltaics.4,29
While these studies highlight the remarkable potential of ML for predicting electronic properties, challenges remain in addressing data requirements, interpretability, and the accurate prediction of properties across vast and highly diverse chemical spaces, such as the natural products domain beyond the limited complexity of the QM9 dataset. This work aims to address this latter challenge by developing a high-throughput workflow and robust ML model for predicting the HOMO–LUMO gaps of over 400 000 natural products. This study aims not only to develop predictive models but also to gain insights into the key molecular features that influence the HL-gap, contributing to a deeper understanding of structure–property relationships.
The Common Workflow Language31,32 (CWL) is a highly flexible language widely used in bioinformatics to create computational workflows, in contrast to other workflow systems33–35 more common in computational chemistry. The only prerequisite for workflow integration is that the computational task must be executable on the command line. To ensure consistency and package isolation, the software packages were installed with Python package managers into a virtual environment exposing a Python Click interface. Specifically, the following essential packages were installed in this way: RDKit,36 pandas,37 Atomic Simulation Environment (ASE)38 and xTB.39
In addition, a modest effort was required to integrate functions for conformer generation, Boltzmann weighting, the xTB wrapper, I/O handling, and the Click interface into the virtual environment as a Python package (https://github.com/sthinius87/HL-gaps-pub). The CWL input files are written in YAML format. All code developed is published at Zenodo (https://zenodo.org/records/15113790) via GitHub (https://github.com/sthinius87/HL-gaps-pub).40
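As an illustration of this packaging pattern, a minimal Click entry point might look as follows; the option names mirror the task parameters listed further below but are illustrative assumptions, not the exact CLI of the published package.

```python
# Hedged sketch of a Click command-line entry point; names are illustrative.
import click

@click.command()
@click.option("--smiles", required=True, help="Input SMILES string.")
@click.option("--n-conformers", default=10, show_default=True,
              help="Number of conformers to generate with RDKit.")
@click.option("--accuracy", default=1.0, show_default=True,
              help="xTB SCC convergence setting.")
@click.option("--etemp", default=300.0, show_default=True,
              help="Electronic temperature for Fermi smearing [K].")
@click.option("--method", default="GFN2-xTB", show_default=True,
              help="xTB method flavor.")
def hlgap(smiles, n_conformers, accuracy, etemp, method):
    """Compute the Boltzmann-weighted HL-gap for one molecule."""
    ...  # conformer generation, optimization, SCC, weighting (see below)

if __name__ == "__main__":
    hlgap()
```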
The workflow's core functionality is encapsulated within a virtual environment. This environment houses Python code that, triggered via the Click interface, initiates a series of computational tasks to evaluate a molecule's HL-gap. First, using the RDKit cheminformatics toolkit, the workflow generates diverse molecular conformations, exploring the potential spatial arrangements of atoms within a molecule. To account for conformational flexibility, a set of 10 conformers was chosen to balance adequate conformational-space sampling against the computational cost inherent in a high-throughput study of this scale. This approach, combined with Boltzmann weighting, provides a thermodynamically averaged property that is often more representative than relying on a single lowest-energy conformer, whose ranking might be inaccurate or which may not be the sole contributor to the molecule's properties at room temperature. Following the initial generation of conformers with RDKit, each conformation was subjected to a geometry optimization to find its nearest local energy minimum. This optimization was performed at the GFN2-xTB level of theory, employing the Broyden–Fletcher–Goldfarb–Shanno (BFGS) optimizer. After optimization, the electronic properties of each of these now stable conformations were computed at the same GFN2-xTB level through self-consistent charges (SCC) to determine the HOMO–LUMO gap. More advanced methods, like DFT, would also be conceivable, but would exceed the limits of our computational resources. The accuracy of the HL-gaps calculated with the GFN2-xTB method against higher-level theoretical benchmarks (e.g., DFT) or experimental data was not assessed in this study. Finally, a Boltzmann weighting scheme is applied to assess the relative stability and population of each conformation at a given temperature. The parameters for each task are passed to the code via the Click interface and are as follows (a minimal sketch of the per-molecule task is given after the list):
■ Number of conformers (RDKit)
■ Accuracy (xTB: SCC convergence criteria)
■ Electronic temperature (xTB: fermi smearing)
■ Calculation method (xTB: current code flavors).
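The sketch below condenses these steps, assuming the RDKit, ASE and xtb-python APIs; it is illustrative rather than the published package, which additionally handles the accuracy and electronic-temperature settings, I/O, and error recovery.

```python
# Minimal sketch of the per-molecule task: conformers (RDKit), BFGS/GFN2-xTB
# optimization (ASE), SCC gap (xtb-python), Boltzmann-weighted average.
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from ase import Atoms
from ase.optimize import BFGS
from xtb.ase.calculator import XTB           # xtb-python ASE interface
from xtb.interface import Calculator, Param  # low-level SCC access

EV_PER_HARTREE = 27.211386
ANGSTROM_TO_BOHR = 1.8897259886
KB_EV = 8.617333262e-5  # Boltzmann constant [eV/K]

def scc_gap(atoms) -> float:
    """Single-point GFN2-xTB SCC; HL-gap in eV from orbital eigenvalues."""
    calc = Calculator(Param.GFN2xTB, atoms.get_atomic_numbers(),
                      atoms.get_positions() * ANGSTROM_TO_BOHR)
    res = calc.singlepoint()
    eps = np.asarray(res.get_orbital_eigenvalues())  # Hartree
    occ = np.asarray(res.get_orbital_occupations())
    return (eps[occ < 0.5].min() - eps[occ >= 0.5].max()) * EV_PER_HARTREE

def boltzmann_weighted_gap(smiles: str, n_conf: int = 10,
                           temperature: float = 298.15) -> float:
    """Conformer ensemble -> local optimization -> weighted HL-gap [eV]."""
    mol = Chem.AddHs(Chem.MolFromSmiles(smiles))
    AllChem.EmbedMultipleConfs(mol, numConfs=n_conf, params=AllChem.ETKDGv3())
    symbols = [a.GetSymbol() for a in mol.GetAtoms()]
    energies, gaps = [], []
    for conf in mol.GetConformers():
        atoms = Atoms(symbols, positions=conf.GetPositions())
        atoms.calc = XTB(method="GFN2-xTB")
        BFGS(atoms, logfile=None).run(fmax=0.05)       # nearest local minimum
        energies.append(atoms.get_potential_energy())  # eV
        gaps.append(scc_gap(atoms))
    e = np.asarray(energies) - np.min(energies)
    w = np.exp(-e / (KB_EV * temperature))             # Boltzmann weights
    return float(np.dot(w, gaps) / w.sum())
```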
This workflow was then applied to ∼407k SMILES strings, yielding ∼406k results after curation of the dataset.
Four regression algorithms were selected for this study: three powerful tree-based ensembles (GBR, RFR, and XGB) and an MLPR. This selection allows for a robust comparison between two distinct and widely used classes of machine learning algorithms: tree-based ensembles and artificial neural networks (ANN). The literature confirms that both algorithmic classes are frequently employed and serve as strong baselines for predicting molecular properties. For example, GBR has been successfully applied, and found to be the best-performing model, for predicting properties of non-fullerene acceptors.18 Similarly, MLPR and other neural network architectures are a common choice for modeling electronic properties in large molecular datasets, from early deep learning demonstrations6,7,9,11 to more recent ANN studies.18 This choice allows for a valuable comparison between these two established algorithmic approaches on a large-scale natural product dataset using pre-calculated molecular descriptors.
For the hyperparameter optimization, a randomized search with 2000 iterations and 3-fold cross-validation was applied. Multiple parameters were involved in the randomized search. For the GBR, the most critical parameters are the learning rate (learning_rate), the number of boosting stages (n_estimators), the fraction of samples used for fitting the individual base learners (subsample) and the number of nodes in the tree (max_depth), whereas for the MLPR the number of neurons and layers (hidden_layer_sizes), the L2-regularization term (alpha) and the exponential decay rate for estimates of the first moment vector in the Adam41 solver (beta_1) were considered in the optimization. For the RFR, the most critical parameters are n_estimators, max_depth, the minimum number of samples required to split a node (min_samples_split), and the number of features to consider for a split (max_features), whereas for the XGB model, the learning_rate, n_estimators, max_depth, and the L1 (reg_alpha) and L2 (reg_lambda) regularization terms were considered in the optimization. The optimized sets of parameters can be found in the SI. Furthermore, the train–test split ratio was set to 0.7:0.3. For the initial transformation of the data, the StandardScaler was applied.
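A condensed sketch of this search for the GBR case is shown below; the parameter ranges are assumptions for illustration (the exact grids used are listed in the SI).

```python
# Illustrative randomized hyperparameter search: 2000 iterations, 3-fold CV,
# with StandardScaler applied inside the pipeline to avoid data leakage.
from scipy.stats import loguniform, randint
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

param_dist = {  # assumed ranges, not the published grids
    "gradientboostingregressor__learning_rate": loguniform(1e-3, 0.3),
    "gradientboostingregressor__n_estimators": randint(100, 2000),
    "gradientboostingregressor__subsample": [0.5, 0.7, 0.9, 1.0],
    "gradientboostingregressor__max_depth": randint(2, 10),
}
pipe = make_pipeline(StandardScaler(), GradientBoostingRegressor())
search = RandomizedSearchCV(pipe, param_dist, n_iter=2000, cv=3,
                            scoring="neg_mean_absolute_error", n_jobs=-1)
# search.fit(X_train, y_train)  # X: descriptor matrix, y: HL-gaps [eV]
```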
To ensure the random data partitioning was representative, the statistical distributions of key molecular descriptors and the target property were analyzed across multiple splits and found to be virtually indistinguishable between the training and test sets (see SI, Fig. S3). This confirms the absence of systematic bias in the data split. However, this high-level statistical similarity masks the underlying structural novelty of the test set, which serves as the true measure of the model's generalization ability. The test set is composed of over 120 000 unique molecular structures that the model has not encountered during training. The critical test is whether the model can generalize beyond the specific examples it has seen to accurately predict properties for these new chemical entities. Additionally, the analysis of the MLPR model's performance on distinct structural subgroups, presented later in Section 2.4.3, provides strong evidence for this robust generalization. The model maintains high predictive accuracy across various challenging structural elements, including different numbers of aromatic rings and complex functional groups. This demonstrates that the model is not merely interpolating based on overall statistical similarity but has learned the fundamental relationships between molecular structure and the HOMO–LUMO gap.
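Such a distribution check is straightforward to reproduce; a minimal sketch, assuming the descriptors are held in pandas DataFrames, compares each column with the two-sample Kolmogorov–Smirnov statistic.

```python
# Per-descriptor comparison of train/test distributions; values of the
# KS statistic near 0 indicate matching distributions (no split bias).
import pandas as pd
from scipy.stats import ks_2samp

def split_drift(train: pd.DataFrame, test: pd.DataFrame) -> pd.Series:
    """Two-sample KS statistic per descriptor column, largest first."""
    return pd.Series({col: ks_2samp(train[col], test[col]).statistic
                      for col in train.columns}).sort_values(ascending=False)
```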
When evaluating the GBR and the MLPR model, the metrics of both models were calculated using a 10-fold shuffle-split cross-validation strategy with a 0.7 to 0.3 train-test ratio as shown in Table 1. Comparing the models reveals a nuanced picture. A fair comparison requires establishing a baseline of predictive accuracy. All four models demonstrate strong absolute performance on the unseen test data, achieving R2 scores above 0.94 and MAE values below 0.21 eV. This confirms their validity as powerful predictive tools for this chemical space. Notably, the XGB model emerged as the top-performing model in terms of absolute accuracy, yielding the highest test R2 (0.958) and the lowest MAE (0.180 eV). The model's robustness was evaluated by analyzing the generalization gap—the difference in performance between the training and test sets. A comparison across all four models reveals significant differences, as detailed in Table 1. While all three tree-based ensemble models show a large performance drop from the training data to the test data, the Random Forest model shows the most substantial gap in the R2 score (ΔR2 ≈ 0.048). In contrast, the MLPR model displays the greatest generalization stability, with the smallest performance gap in its R2 score (ΔR2 ≈ 0.022). This conclusion is strongly corroborated when analyzing the absolute error metrics. The tree-based models exhibit a pronounced increase in error on the test set. For instance, the XGB model's MAE increases by over 150% (from 0.069 eV to 0.180 eV), and the Random Forest model's MAE shows a more than six-fold increase (from 0.030 eV to 0.194 eV). The MLPR model, however, shows a much smaller and more controlled relative increase in its MAE of only 24% (from 0.169 eV to 0.210 eV). A similar trend is observed for the MSE and RMSE. This expanded analysis reveals a clear trade-off. For applications where achieving the lowest possible prediction error is the sole priority, the XGB model is the superior choice based on its test set MAE. However, for the goal of this study—developing a reliable and robustly generalizable model—the MLPR's demonstrated stability across multiple metrics makes it the most suitable candidate for the subsequent in-depth feature importance and error analyses.
| Metrics | R2-score | MSE [eV2] | MAE [eV] | RMSE [eV] |
|---|---|---|---|---|
| MLPR-train | 0.9688 ± 0.0009 | 0.0519 ± 0.0017 | 0.1686 ± 0.0025 | 0.2279 ± 0.0037 |
| MLPR-test | 0.9470 ± 0.0008 | 0.0886 ± 0.0010 | 0.2099 ± 0.0012 | 0.2976 ± 0.0016 |
| GBR-train | 0.9917 ± 0.0001 | 0.0138 ± 0.0001 | 0.0905 ± 0.0002 | 0.1173 ± 0.0003 |
| GBR-test | 0.9562 ± 0.0006 | 0.0732 ± 0.0007 | 0.1865 ± 0.0005 | 0.2706 ± 0.0013 |
| XGB-train | 0.9943 ± 0.0001 | 0.0094 ± 0.0001 | 0.0694 ± 0.0003 | 0.0972 ± 0.0004 |
| XGB-test | 0.9580 ± 0.0005 | 0.0702 ± 0.0005 | 0.1799 ± 0.0005 | 0.2650 ± 0.0010 |
| RFR-train | 0.9989 ± 0.0001 | 0.0019 ± 0.0001 | 0.0295 ± 0.0001 | 0.0432 ± 0.0001 |
| RFR-test | 0.9505 ± 0.0006 | 0.0828 ± 0.0005 | 0.1940 ± 0.0006 | 0.2878 ± 0.0009 |
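For reference, a condensed sketch of the evaluation protocol behind Table 1 follows; X, y and model are assumed to be the descriptor matrix, the HL-gap targets and a freshly instantiated estimator.

```python
# 10-fold shuffle-split cross-validation with a 0.7/0.3 train-test ratio,
# collecting R2, MAE and RMSE on each held-out fold (cf. Table 1).
import numpy as np
from sklearn.model_selection import ShuffleSplit
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

cv = ShuffleSplit(n_splits=10, train_size=0.7, test_size=0.3, random_state=0)
scores = []
for tr, te in cv.split(X):
    model.fit(X[tr], y[tr])
    pred = model.predict(X[te])
    scores.append((r2_score(y[te], pred),
                   mean_absolute_error(y[te], pred),
                   np.sqrt(mean_squared_error(y[te], pred))))
mean, std = np.mean(scores, axis=0), np.std(scores, axis=0)  # as in Table 1
```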
The permutation importance of the MLPR model, shown in Fig. 2, reveals that a combination of structural, polarizability, and electronic descriptors governs the HL-gap prediction. By a significant margin, the most influential feature is NumAromaticCarbocycles, indicating that the presence and number of aromatic rings is a primary determinant. This is followed by descriptors related to molecular polarizability42 (SMR_VSA7 and SMR_VSA10) and shape (HallKierAlpha43). The electrotopological state descriptors MinAbsEStateIndex and MaxAbsEStateIndex, and the presence of specific fragments44 such as tertiary amines (fr_NH0), ketones (fr_ketone), and esters (fr_ester), also play a significant role. A clear consensus emerges across all models: molecular polarizability is a fundamental driver of the HL-gap (see Table 2). The SMR_VSA7 and SMR_VSA10 descriptors appear in the top features for every model. Similarly, molecular shape (HallKierAlpha) and the presence of specific functional groups like ketones (fr_ketone) are consistently identified as significant contributors. This agreement between methodologically distinct models provides strong confidence that these features have a true physical relationship with the HOMO–LUMO gap.
| Rank | MLPR (permutation) | GBR (Gini) | RFR (Gini) | XGB (Gini) |
|---|---|---|---|---|
| 1 | NumAromaticCarbocycles | SMR_VSA7 | SMR_VSA7 | SMR_VSA7 |
| 2 | SMR_VSA7 | SMR_VSA10 | HallKierAlpha | NumRadicalElectrons |
| 3 | SMR_VSA10 | fr_ketone | SMR_VSA10 | fr_ketone |
| 4 | HallKierAlpha | SlogP_VSA8 (ref. 42 and 45) | MinAbsEStateIndex | SMR_VSA10 |
| 5 | MinAbsEStateIndex | MinAbsEStateIndex | MaxAbsEStateIndex | SlogP_VSA12 (ref. 42 and 45) |
However, the models exhibit highly divergent strategies in how they weigh these features. The GBR model shows an extreme reliance on its top two polarizability descriptors, which together account for over 67% of its total feature importance. In contrast, the Random Forest model displays a more balanced approach, giving high importance to both polarizability and molecular shape. The top-performing XGBoost model reveals another unique strategy, identifying NumRadicalElectrons as its second most important feature—a descriptor not ranked highly by any other model. This suggests XGBoost successfully leveraged a feature that is critical for a specific, yet impactful, subset of molecules. In conclusion, this comparative analysis highlights that while all models correctly identify polarizability and shape as critical, their varied performance stems from different learning strategies. The tree models, particularly GBR, fixated on the strongest individual signals, while the MLPR learned a more holistic representation by balancing structural counts with electronic properties, which is key to its superior generalization stability.
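Both importance types compared in Table 2 can be computed with scikit-learn; the sketch below assumes fitted mlpr and rfr estimators, held-out X_test/y_test arrays (scaled as during training), and a feature_names list.

```python
# Permutation importance (model-agnostic) vs. impurity-based importance.
from sklearn.inspection import permutation_importance

perm = permutation_importance(mlpr, X_test, y_test, n_repeats=10,
                              scoring="r2", random_state=0)
top_mlpr = sorted(zip(feature_names, perm.importances_mean),
                  key=lambda t: -t[1])[:5]   # cf. Table 2, MLPR column
top_rfr = sorted(zip(feature_names, rfr.feature_importances_),
                 key=lambda t: -t[1])[:5]    # impurity-based (Gini) ranking
```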
Nevertheless, the model inherits weaknesses due to heteroscedasticity and shows potential for improvement in the lower gap region. The residual plot shows some evidence of heteroscedasticity, particularly at lower predicted values. This means that the variance of the errors is not constant across the range of predictions. The model tends to have larger errors for molecules with smaller HL-gaps; the points are more scattered in this region, indicating lower accuracy. To test this, the metrics were re-evaluated for the HL-gap ranges below and above 6 eV, as it is possible that the heteroscedasticity arises from the data distribution itself, with 98% of the molecules having HL-gaps <6 eV. Even if the underlying error distribution is homoscedastic (constant variance), the sheer number of points in a dense region makes it more likely to observe larger errors. Table 3 shows, however, that the absolute errors (MSE, MAE, RMSE) are larger for the ≥6 eV subset. This directly contradicts the initial interpretation of the heatmap, where we observed higher precision at higher HL-gaps.
| Range | Count | R2 | MSE [eV2] | MAE [eV] | RMSE [eV] |
|---|---|---|---|---|---|
| <6 eV | 398 003 | 0.917 | 0.064 | 0.183 | 0.254 |
| ≥6 eV | 8200 | 0.934 | 0.135 | 0.257 | 0.368 |
| All | 406 203 | 0.947 | 0.089 | 0.210 | 0.298 |
This reinforces the point that was discussed above: the apparent higher precision at higher gaps in the heatmap was likely an artifact of the lower data density in that region. Even though the model makes larger absolute errors for higher gaps, there are fewer data points to show this spread, creating the illusion of tighter clustering around the diagonal. The higher R2 for the ≥6 eV subset is misleading because R2 is sensitive to the variance of the target variable. Since the ≥6 eV subset likely has a larger variance, it can lead to a higher R2 even with larger absolute errors. This illustrates how R2 does not directly reflect the accuracy of predictions but rather their relative performance in capturing the variance of the data. This analysis highlighted the importance of considering multiple metrics, particularly when interpreting visualizations like heatmaps and emphasized the need for caution when dealing with unbalanced data, where smaller data clusters can disproportionately influence visual trends.
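Formally, this sensitivity of R2 follows directly from its definition,

$$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2} = 1 - \frac{\mathrm{MSE}}{\mathrm{Var}(y)},$$

so for a fixed error level, a subset with larger target variance necessarily yields a higher R2.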
| Subset | R2 | MSE [eV2] | MAE [eV] | RMSE [eV] |
|---|---|---|---|---|
| NumRadicalElectrons | 0.506 | 0.319 | 0.434 | 0.565 |
| NumAliphaticHeterocycles | 0.954 | 0.068 | 0.190 | 0.262 |
| NumAromaticCarbocycles | 0.896 | 0.054 | 0.170 | 0.231 |
| NumAromaticHeterocycles | 0.868 | 0.051 | 0.166 | 0.227 |
| fr_Al_COO | 0.888 | 0.070 | 0.192 | 0.265 |
| fr_Al_OH_noTert | 0.852 | 0.101 | 0.229 | 0.318 |
| fr_Ar_N | 0.860 | 0.060 | 0.180 | 0.245 |
| fr_NH0 | 0.914 | 0.072 | 0.194 | 0.268 |
| fr_NH2 | 0.936 | 0.104 | 0.226 | 0.322 |
| fr_allylic_oxid | 0.934 | 0.068 | 0.189 | 0.260 |
| fr_aniline | 0.875 | 0.059 | 0.177 | 0.243 |
| fr_aryl_methyl | 0.868 | 0.046 | 0.156 | 0.214 |
| fr_bicyclic | 0.955 | 0.060 | 0.177 | 0.244 |
| fr_ester | 0.914 | 0.063 | 0.185 | 0.251 |
| fr_ketone | 0.885 | 0.063 | 0.185 | 0.251 |
| fr_para_hydroxylation | 0.871 | 0.057 | 0.175 | 0.238 |
| fr_piperdine | 0.941 | 0.089 | 0.213 | 0.298 |
| Whole set | 0.947 | 0.089 | 0.210 | 0.298 |
All subsets perform worse than the full dataset, which is expected. The full model is trained on all molecules and learns to capture the combined effects of all features; subsets, by focusing on a single feature, lose this comprehensive perspective. Some subsets perform surprisingly well, indicating that those specific features are strong indicators of the HL-gap for molecules possessing them. NumRadicalElectrons is a clear outlier: its very low R2 (0.506) and high error metrics indicate that it is not a good predictor of the HL-gap on its own. This is also expected, as it is a very specific property not generally related to the HL-gap. The NumAliphaticHeterocycles and fr_bicyclic subsets show R2 values very close to the full dataset (0.954 and 0.955, respectively). This suggests that the presence of aliphatic heterocycles or bicyclic structures is strongly correlated with the HL-gap, and the model captures this well. The fr_NH2, fr_allylic_oxid and fr_piperdine subsets also perform relatively well (R2 > 0.93), indicating that these functional groups also have a significant influence on the HL-gap. The subsets fr_NH0 and fr_ester have fair R2 values (0.914), but the MSE, MAE, and RMSE are somewhat higher than for the full dataset, suggesting that while the general trend is captured, the predictions are less precise. The NumAromaticCarbocycles, fr_Al_COO and fr_ketone subsets show moderate performance (R2 around 0.88–0.90). This indicates that while these features do influence the HL-gap, their effect is less pronounced or more complex compared to the features in the higher-performing subsets. The subsets NumAromaticHeterocycles, fr_Al_OH_noTert, fr_Ar_N, fr_para_hydroxylation, fr_aniline and fr_aryl_methyl have the lowest R2 values among the fragment counts (around 0.85–0.87). This suggests that these features have a weaker or more intricate relationship with the HL-gap, or that their effect is more context-dependent, meaning influenced by other parts of the molecule.
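The per-subset metrics in Table 4 follow from a simple masking of the test set; a sketch is given below, assuming a descriptor DataFrame desc aligned with y_true/y_pred arrays from the test set.

```python
# Illustrative subset evaluation behind Table 4: molecules are selected by a
# nonzero RDKit descriptor/fragment count and the metrics are recomputed.
from sklearn.metrics import r2_score, mean_absolute_error

def subset_metrics(feature: str):
    mask = desc[feature].to_numpy() > 0  # molecules containing the motif
    return (int(mask.sum()),
            r2_score(y_true[mask], y_pred[mask]),
            mean_absolute_error(y_true[mask], y_pred[mask]))

print(subset_metrics("fr_ketone"))  # (n, R2, MAE) for the ketone subset
```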
The heatmap (Fig. 4) displays the HL-gap error distribution mapped into ranges, providing deeper insight into where and why larger prediction errors occur. Based on the suggestion that errors greater than 0.4 eV are likely unusable for rigorous scientific work, range quality assignments (in eV) might be defined as follows; a sketch of the corresponding error binning is given after the list.
■ Excellent precision (0.0–0.1) – errors in this range are exceptionally small and likely inconsequential for most rigorous scientific applications.
■ High precision (0.1–0.2) – errors in this range are still quite small and should be suitable for most scientific studies.
■ Acceptable precision (0.2–0.4) – errors in this range might introduce some uncertainty but are likely tolerable for many scientific investigations, particularly in complex systems and high throughput screening applications.
■ Marginal precision (0.4–0.8) – errors in this range are becoming substantial and may limit the reliability of conclusions drawn from the data. Careful consideration and potentially additional validation are necessary.
■ Low precision (0.8–1.2) – errors in this range are likely to compromise the accuracy of scientific results. This range is likely unsuitable for quantitative applications.
■ Poor precision (1.2–2.0) – errors in this range are likely to lead to unreliable or misleading results. Significant improvements in model accuracy are needed for this range to be useful.
■ Negligible precision (2.0–10.0) – errors in this range are so large that the data is essentially unusable for any scientific purpose.
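This binning can be reproduced as follows, assuming y_true/y_pred are the test-set HL-gaps and predictions in eV.

```python
# Map absolute prediction errors onto the precision ranges defined above.
import numpy as np
import pandas as pd

bins = [0.0, 0.1, 0.2, 0.4, 0.8, 1.2, 2.0, 10.0]  # range edges [eV]
labels = ["excellent", "high", "acceptable", "marginal",
          "low", "poor", "negligible"]
abs_err = np.abs(y_true - y_pred)
ranges = pd.Series(pd.cut(abs_err, bins=bins, labels=labels,
                          include_lowest=True))
share = ranges.value_counts(normalize=True).mul(100).round(2)  # percent
```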
For most molecular groups, the majority of molecules fall within the “Excellent Precision” and “High Precision” ranges. This suggests that the model performs reasonably well overall. A noticeable variation in the distribution of errors is observed across different molecular groups. Some groups have a higher proportion of molecules in the “Acceptable Precision” range and beyond, indicating potential challenges for specific chemical functionalities. The fr_Al_COO, fr_NH2 and fr_Al_OH_noTert groups appear to have a relatively higher proportion of molecules in the “Marginal Precision” range and beyond, signifying that predictions for molecules containing these groups might be less reliable. Analysis of the HL-gap prediction model revealed a notable trend. Molecular groups with smaller representation in the dataset tended to exhibit poorer predictive performance. This observation is not coincidental but rather reflects the influence of sample size on model accuracy and robustness. Smaller molecular groups suffer from reduced statistical power, limiting the model's ability to discern true relationships between specific chemical features and HL-gap values. This limitation arises from the increased susceptibility of smaller groups to noise, random variations, and the disproportionate impact of outliers. The poorer performance observed for smaller groups does not necessarily indicate a weaker or more complex relationship between their chemical features and HL-gap. Instead, it may reflect the model's inability to effectively capture these relationships due to data scarcity.
The best-performing subsets reach combined percentages (up to "Acceptable precision") above 90%. With 94.55%, fr_aryl_methyl is the best-performing subgroup overall, followed by NumAromaticCarbocycles (93.08%), implying excellent overall performance. Subsequently, NumAromaticHeterocycles (91.31%), fr_bicyclic (91.12%), fr_aniline (91.08%) and fr_para_hydroxylation (90.47%) demonstrate strong predictive capabilities. Subsets with combined percentages in the high 80s indicate good but slightly less precise predictions; these include fr_ester (89.61%), fr_Ar_N (89.32%), fr_ketone (89.21%), NumAliphaticHeterocycles (88.88%), fr_NH0 (88.88%), fr_allylic_oxid (88.84%) and fr_piperdine (87.46%). While still reasonable, fr_Al_COO (84.69%) is noticeably lower than the top performers. The subgroup fr_NH2 (78.38%) shows a lower combined percentage than most other groups, suggesting potential challenges in accurate prediction, and fr_Al_OH_noTert (74.66%) stands out with the lowest combined percentage, indicating that the model might struggle with this specific functional group. Notably, the majority of molecules falling in the low- and poor-precision ranges have a complicated electronic structure, such as ionic or radical character, multiple donor and acceptor functional groups, or multiple heteroatoms up to 3rd- and 4th-period non-metals or metalloids. Example images of such molecules can be found in the git repository.
The performance of our model is highly consistent with other studies that have utilized similar descriptor-based machine learning approaches on large-scale molecular datasets.7,9,12,13,19,46 Notably, Pereira et al.9 reported a very similar MAE of 0.21 eV and RMSE of 0.30 eV using Random Forest and MLPR models on over 111 000 organic molecules with DFT-calculated properties. Our MAE is also comparable to the 0.19 eV achieved by Xu et al.22 using a linear model on Polycyclic Aromatic Hydrocarbons (PAHs). Furthermore, our RMSE is more favorable than the 0.36–0.43 eV range reported by Nakata et al.12 using SVM and Ridge Regression on a subset of the PubChemQC12 database. These comparable error metrics suggest that our model achieves a robust and expected level of performance for its methodological class.
The current state-of-the-art in this field, however, is dominated by deep learning models, particularly GNNs, that learn features directly from molecular topology and 3D coordinates. These models consistently achieve significantly lower prediction errors. Seminal works on the QM9 benchmark dataset established the high performance of these methods, with models such as SchNet,1,15,16 MPNNs,8,14,15 MEGNet,2,26 and other graph convolutional networks reporting MAEs for the HOMO–LUMO gap in the remarkably low range of 0.06–0.09 eV.1,2,4,8,14,15 This superior accuracy has been replicated by advanced architectures like PaiNN, which reported an exceptional MAE of just 0.01 eV on the Harvard Organic Photovoltaic (HOPV) dataset.47
The primary factors driving this performance disparity are the feature representation and the dataset characteristics. Our work employs traditional ML models that rely on pre-calculated 2D molecular descriptors. This methodological choice is shared by several studies reporting similar error magnitudes.7,9,12 In contrast, the highest-performing models are overwhelmingly GNNs that learn richer, tailored feature representations directly from the 3D molecular graph, using atomic types and Cartesian coordinates as inputs. This end-to-end learning allows the model to capture more nuanced and relevant structural information than is possible with predefined 2D descriptors.
Furthermore, our study tackles the COCONUT database, a large-scale collection of over 400 000 structurally diverse and complex natural products. This presents a significant learning challenge compared to the benchmark QM9 dataset, which consists of ∼134 000 smaller, less complex organic molecules and is the basis for many of the lowest reported errors.2,4,5,7,14–16,20 Many other high-performance models are trained on smaller, chemically homogeneous datasets focused on specific molecular classes.1,11,17–19 The structural complexity and diversity inherent in our natural product dataset likely establish a higher error floor. The challenges of complex datasets are highlighted by Deng et al.,3 where a GNN approach on conjugated oligomers still resulted in a high MAE of 0.54 eV.
To further assess generalization, the trained MLPR model was additionally evaluated on molecules from the external QM9 dataset. This dataset serves as a true "out-of-distribution" test set, as its molecules were not part of the training data. The model's predictions were then compared against two distinct, higher-precision quantum chemical benchmarks provided by the QM9 dataset. These benchmarks are: first, the DFT55 gaps, calculated at the B3LYP/6-31G(2df,p) level of theory, and second, the higher-accuracy Quasi-Particle (QP) GW56 gaps, which are considered a more rigorous "gold standard" for electronic gaps. The results of this external validation are presented in Fig. 5.
The comparison to the DFT reference (Fig. 5, left) demonstrates the model's successful generalization. The model's predictions show a strong correlation with the DFT values and, crucially, reproduce the distinct two-cluster structure of the data, which separates saturated (high-gap) from conjugated (low-gap) systems. This confirms that the model, using only 2D descriptors, has learned the fundamental structure–property relationships governing the HL-gap. As the MLPR model was trained on xTB data, its predictions carry the known systematic bias of that method, resulting in a mean underestimation of the DFT gaps by 3.94 eV.
The comparison to the higher-level GW reference (Fig. 5, right) provides a more complete picture. As expected, the model's predictions show a larger systematic underestimation of 7.75 eV relative to the GW gaps. However, it is well established that DFT itself (particularly with the B3LYP functional) significantly underestimates the more sophisticated GW gap. Therefore, this large deviation is not a failure of the model; rather, it correctly reflects the combined, systematic underestimation of both the GFN2-xTB training data and the DFT benchmark relative to the GW standard.
In parallel with exploring new architectures, several refinements could enhance the current modeling framework. A deeper investigation into the chemical structures of molecules with high prediction errors—particularly for the challenging functional groups identified in this study, such as aliphatic carboxylic acids, alcohols, and amines—could reveal specific structural motifs or electronic interactions that the current descriptors fail to capture. The observation that smaller, under-represented molecular groups exhibited poorer predictive performance underscores the need to address data scarcity. Future work should prioritize strategies such as targeted data augmentation techniques, gathering more data for these groups, or employing weighting schemes during regression to account for potential biases in the training data. A primary strategy is the use of deep generative models,19 such as Variational Autoencoders (VAEs)48,49 or Generative Adversarial Networks (GANs),49 which can learn to produce novel, yet chemically valid, molecular structures within a specific chemical domain. This approach would allow for the targeted generation of new molecules belonging to the poorly predicted classes, directly enriching and balancing the training set. Alternatively, data augmentation can be performed in the descriptor space. Techniques like the Synthetic Minority Over-sampling Technique50 (SMOTE) and its variants50 have been successfully adapted for QSAR datasets, where they create synthetic minority class samples by interpolating between existing data points in the high-dimensional feature space.
To make data acquisition more efficient, an active learning51–53 loop represents another promising direction. In this paradigm, the model's own uncertainty estimates are used to intelligently select the most informative molecules for which to perform expensive quantum chemical calculations. This ensures that computational resources are focused on the areas of chemical space where the model would benefit most from new information. These established strategies, from generative models to active learning, provide a clear and feasible path toward significantly improving model robustness and predictive accuracy for the challenging molecular subgroups identified in this work.
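As a concrete illustration, an uncertainty-driven selection step might use the disagreement among the trees of a random forest as an uncertainty proxy; run_xtb_gap() below is a hypothetical stand-in for the quantum chemical oracle, not a function of the published package.

```python
# Minimal sketch of one uncertainty-based active-learning query step.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def select_most_informative(model: RandomForestRegressor,
                            X_pool: np.ndarray, n_query: int = 100):
    """Rank unlabeled molecules by the spread of per-tree predictions."""
    per_tree = np.stack([t.predict(X_pool) for t in model.estimators_])
    uncertainty = per_tree.std(axis=0)         # eV, tree disagreement
    return np.argsort(uncertainty)[-n_query:]  # indices of highest spread

# One loop iteration (illustrative):
# idx = select_most_informative(rf, X_pool)
# y_new = [run_xtb_gap(smiles_pool[i]) for i in idx]  # expensive labels
# rf.fit(np.vstack([X_train, X_pool[idx]]), np.append(y_train, y_new))
```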
Other avenues for improvement include exploring novel descriptor-selection strategies, incorporating domain knowledge through expert-curated features, and adjusting the computational workflow to incorporate higher-precision quantum chemical methods for the target property, improving the reliability and practical utility of findings currently based on semi-empirical xTB data. A small but structurally diverse subset of molecules, particularly those where the current model shows high error or those belonging to challenging chemical groups, could be re-evaluated using DFT or the recently developed general-purpose extended tight-binding method54 (g-xTB). Comparing the ML model's predictions not only to the xTB target values but also to these more accurate DFT-level results would serve two key purposes. First, it would provide a valuable cross-check on the physical trends identified by the model. Second, it would help to disentangle the model's prediction error from the inherent error of the underlying semi-empirical method. This would provide a more robust assessment of the model's performance and its applicability for practical high-throughput screening campaigns.
Using ∼407 000 molecules from the COCONUT database and a streamlined computational workflow, the efficacy of combining xTB calculations with advanced machine learning algorithms was demonstrated. The findings highlight the critical role of molecular polarizability, specifically SMR_VSA descriptors, in determining the HL-gap across all models. All tested machine learning models, including GBR, MLPR, XGB, and RFR, achieved good overall predictive performance, though the MLPR model showed a slight advantage in generalization ability. A comprehensive external validation confirmed this, as the MLPR model successfully predicted gaps for the QM9 dataset, with its predictions faithfully capturing the underlying chemical trends and the known systematic biases of its training method when compared to DFT and GW benchmarks. Challenges remain in accurately predicting HL-gaps for molecules containing multiple functional groups, notably aliphatic carboxylic acids, alcohols, and amines. Analysis of feature importance and performance across molecular subsets revealed that aromatic carbocycles and polarizability are strong predictors of the HL-gap, while the presence of multiple interacting functional groups or complex electronic structures often leads to reduced accuracy. These observations underscore the importance of considering both electronic and structural features in HL-gap modeling and suggest that further model refinement, particularly in addressing the complexities of specific functional group interactions and complex electronic structures, holds significant promise for future improvements in predictive accuracy. This study therefore contributes a reliable, high-throughput methodology and provides a quantitative performance baseline, paving the way for future large-scale screening of electronic properties in the vast and biomedically important chemical space of natural products.
Supplementary information: parameters for the ML models and additional plots referenced in the main text. See DOI: https://doi.org/10.1039/d5dd00186b.