David Dalmau,a Matthew S. Sigmanb and Juan V. Alegre-Requena*a
aDepartamento de Química Inorgánica, Instituto de Síntesis Química y Catálisis Homogénea (ISQCH), CSIC-Universidad de Zaragoza, C/Pedro Cerbuna 12, 50009 Zaragoza, Spain. E-mail: jv.alegre@csic.es
bDepartment of Chemistry, University of Utah, 315 South 1400 East, Salt Lake City, Utah 84112, USA
First published on 15th April 2025
Data-driven methodologies are transforming chemical research by providing chemists with digital tools that accelerate discovery and promote sustainability. In this context, non-linear machine learning algorithms are among the most disruptive technologies in the field and have proven effective for handling large datasets. However, in data-limited scenarios, linear regression has traditionally prevailed due to its simplicity and robustness, while non-linear models have been met with skepticism over concerns related to interpretability and overfitting. In this study, we introduce ready-to-use, automated workflows designed to overcome these challenges. These frameworks mitigate overfitting through Bayesian hyperparameter optimization by incorporating an objective function that accounts for overfitting in both interpolation and extrapolation. Benchmarking on eight diverse chemical datasets, ranging from 18 to 44 data points, demonstrates that when properly tuned and regularized, non-linear models can perform on par with or outperform linear regression. Furthermore, interpretability assessments and de novo predictions reveal that non-linear models capture underlying chemical relationships similarly to their linear counterparts. Ultimately, the automated non-linear workflows presented have the potential to become valuable tools in a chemist's toolbox for studying problems in low-data regimes alongside traditional linear models.
However, modeling small datasets in chemical research presents inherent challenges. Such datasets are particularly susceptible to underfitting, where models fail to capture underlying relationships, and overfitting, where models overly adapt to data by capturing noise or irrelevant patterns.14 These issues stem from the limited number of data points, the complexity of algorithms relative to dataset size, and the presence of noise, all of which hinder a model's ability to generalize effectively.15
Multivariate linear regression (MVL) is arguably the most widely used method in low-data scenarios due to its simplicity, robustness, and consistent performance with small datasets.16 MVL models often strike a bias-variance tradeoff that helps mitigate overfitting while remaining intuitively interpretable.14 Although more advanced ML algorithms such as random forests (RF), gradient boosting (GB), and neural networks (NN) can achieve higher predictive accuracy,17,18 their effectiveness in low-data scenarios is often limited by their sensitivity to overfitting and their more difficult interpretation.19 These models also require careful hyperparameter tuning and regularization to generalize effectively.20–22
To fully harness the capabilities of non-linear ML algorithms in low-data scenarios, it is essential to address these challenges. To this end, we have developed a fully automated workflow integrated into the ROBERT software. The approach is specifically designed to mitigate overfitting, reduce human intervention, eliminate biases in model selection, and enhance the interpretability of complex models. Our goal is to demonstrate that, even in low-data regimes, non-linear algorithms can be as effective as MVL when properly tuned and regularized. This new workflow not only broadens the scope of ML applications in chemistry but also aims to incorporate non-linear algorithms as part of the chemists' toolbox for studying low-data scenarios (Fig. 1).
In line with previous studies,23 we observed that the most limiting factor in applying non-linear models to low-data regimes is overfitting. Even though we aimed to maximize validation performance across multiple train-validation splits during hyperparameter optimization, we often observed a significant degree of overfitting in databases with fewer than 50 data points when using non-linear algorithms.
A wide array of techniques has been designed to measure overfitting, with CV being one of the most widely used.24 In this context, introducing similar techniques during hyperparameter optimization should help reduce overfitting in the selected model. To test this hypothesis, we redesigned the program's hyperparameter optimization to use a combined Root Mean Squared Error (RMSE) calculated from different CV methods (Fig. 2A). This metric evaluates a model's generalization capability by averaging both interpolation and extrapolation CV performance. Interpolation is tested using a 10-times repeated 5-fold CV (10× 5-fold CV) process on the training and validation data, while extrapolation is assessed via a selective sorted 5-fold CV approach. This method sorts and partitions the data based on the target value (y) and considers the highest RMSE between the top and bottom partitions, a common practice for evaluating extrapolative performance.25,26 In principle, this dual approach should not only identify models that perform well during training but also filter out those models that struggle with unseen data.
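The combined metric described above can be sketched as follows. This is an illustrative helper written for this discussion, not ROBERT's actual implementation: interpolation error comes from a 10-times repeated 5-fold CV, and extrapolation error from a y-sorted 5-fold partition in which the worse of the top and bottom folds is kept.

```python
import numpy as np
from sklearn.base import clone
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import RepeatedKFold, cross_val_score


def combined_rmse(model, X, y):
    """Average of interpolation RMSE (10x 5-fold CV) and extrapolation RMSE
    (worst of the bottom/top folds in a y-sorted 5-fold partition)."""
    # Interpolation: 10-times repeated 5-fold CV on the data.
    cv = RepeatedKFold(n_splits=5, n_repeats=10, random_state=0)
    interp = -cross_val_score(
        model, X, y, cv=cv, scoring="neg_root_mean_squared_error"
    ).mean()

    # Extrapolation: sort by the target value and hold out the extreme folds.
    order = np.argsort(y)
    folds = np.array_split(order, 5)
    extrap = 0.0
    for held_out in (folds[0], folds[-1]):  # bottom and top partitions
        train = np.setdiff1d(order, held_out)
        m = clone(model).fit(X[train], y[train])
        rmse = np.sqrt(mean_squared_error(y[held_out], m.predict(X[held_out])))
        extrap = max(extrap, rmse)  # keep the worst extrapolation case

    return 0.5 * (interp + extrap)
```

A model that interpolates well but degrades sharply on the extreme folds is penalized by the extrapolation term, which is the behavior this objective is designed to filter out.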
Using Bayesian optimization,27,28 the new version of ROBERT systematically tunes hyperparameters using the combined RMSE metric as its objective function. As illustrated in Fig. 2B, this iterative exploration of the hyperparameter space consistently reduces the combined RMSE score, ensuring that the resulting model minimizes overfitting as much as possible. One optimization is performed for each selected algorithm, and the model with the best combined RMSE is used in the subsequent step of the workflow. Additionally, to prevent data leakage,29 the methodology reserves 20% of the initial data (or a minimum of four data points) as an external test set, which is evaluated after hyperparameter optimization. The choice of the test set split is set to an “even” distribution by default, ensuring balanced representation of the target values. This approach helps maintain model generalizability, especially in cases of imbalanced datasets, while preventing overrepresentation of certain data ranges in the test set.
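The "even" test-set selection described above can be sketched as follows; the helper name and exact rounding are assumptions, but the idea is the reported one: reserve roughly 20% of the points (at least four), chosen at evenly spaced positions along the sorted target values so the test set spans the full y range.

```python
import numpy as np


def even_test_split(y, test_fraction=0.2, min_test=4):
    """Pick test indices evenly spaced across the sorted target values."""
    y = np.asarray(y)
    n_test = max(min_test, round(test_fraction * len(y)))
    order = np.argsort(y)
    # Evenly spaced positions along the sorted targets (always includes
    # the lowest and highest y values).
    positions = np.linspace(0, len(y) - 1, n_test).round().astype(int)
    test_idx = order[positions]
    train_idx = np.setdiff1d(np.arange(len(y)), test_idx)
    return train_idx, test_idx
```

Because the extreme target values always land in the test set, this split probes generalization across the whole prediction range rather than only in densely populated regions.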
The performance of three non-linear algorithms, RF, GB, and NN, was evaluated against MVL using scaled RMSE, which is expressed as a percentage of the target value range and helps interpret model performance relative to the range of predictions. To ensure fair comparisons while evaluating the train and validation set results, no specific train-validation splits were considered, as metrics can heavily depend on the selected split.37 Instead, we used 10× 5-fold CV, which mitigates splitting effects and human bias. To further avoid bias, the external test sets were selected using a systematic method that evenly distributes y values across the prediction range.
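For clarity, the scaled RMSE used above normalizes the error by the target range; a minimal illustrative helper (not ROBERT's code) is:

```python
import numpy as np


def scaled_rmse(y_true, y_pred):
    """RMSE expressed as a percentage of the target-value range."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    return 100.0 * rmse / (y_true.max() - y_true.min())
```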
Promisingly, the 10× 5-fold CV results show that the non-linear NN algorithm produces competitive results compared to the classic MVL model (Fig. 3, bottom-left). The NN model performs as well as or better than MVL for half of the examples (D, E, F and H), which range from 21 to 44 data points. Similarly, the best results for predicting external test sets are achieved using non-linear algorithms in five examples (A, C, F, G and H), with dataset sizes between 19 and 44 points (Fig. 3, bottom-right). Overall, these results support the inclusion of non-linear algorithms alongside MVL in data-driven approaches for small datasets.
Considering the widespread use of RF in chemistry,38 it is noteworthy that this algorithm yielded the best results in only one case. This may be a consequence of introducing an extrapolation term during hyperoptimization, as tree-based models are known to have limitations for extrapolating beyond the training data range.39 However, further analysis revealed that including this term leads to better models, as it prevents the occurrence of large errors in some of the examples (Fig. S1–9†). Based on the results, the higher errors observed for RF in examples A–H are mitigated and no longer represent a serious limitation when larger databases are used (Fig. S10 and S11†). See also the Evaluating combined metric for BO and dataset size section of the ESI† for additional discussion.
To further enhance algorithm evaluation, a new scoring system was developed on a scale of ten (Fig. 4A). The score is provided with the PDF report that ROBERT generates after each analysis and is based on three key aspects: predictive ability and overfitting, prediction uncertainty, and detection of spurious predictions.
The first component is the most important, accounting for up to eight points. It includes (1 and 2) evaluating predictions from the 10× 5-fold CV and external test set using scaled RMSE, (3) assessing the difference between the two scaled RMSE values to detect overfitting, and (5) measuring the model's extrapolation ability using the lowest and highest folds in a sorted CV (Fig. 4A, top). The second component assesses prediction uncertainty by analyzing the average standard deviation (SD) of the predicted values obtained in the different CV repetitions (4). The final component identifies potentially flawed models by evaluating RMSE differences in the 10× 5-fold CV after applying data modifications such as y-shuffling40 and one-hot encoding,41 and using a baseline error based on the y-mean test (6). A comprehensive explanation of the score is included in the ROBERT score section of the ESI† and in the ROBERT documentation.42 This scoring framework ensures that models are evaluated based on their predictive ability, level of overfitting, consistency of predictions, and robustness against flawed models.
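The y-shuffling check in the final component can be sketched as below. The helper name is hypothetical, but the logic follows the stated idea: a model that scores nearly as well on randomly permuted targets as on the real ones is likely fitting noise, so a large gap between the two CV errors is reassuring.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score


def y_shuffle_gap(model, X, y, seed=0):
    """CV RMSE with shuffled targets minus CV RMSE with real targets.

    A healthy model yields a clearly positive gap; a gap near zero
    flags a potentially spurious model.
    """
    cv = KFold(n_splits=5, shuffle=True, random_state=seed)
    scoring = "neg_root_mean_squared_error"
    real = -cross_val_score(model, X, y, cv=cv, scoring=scoring).mean()
    y_shuffled = np.random.default_rng(seed).permutation(y)
    shuffled = -cross_val_score(model, X, y_shuffled, cv=cv, scoring=scoring).mean()
    return shuffled - real
```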
Fig. 4B presents the ROBERT scores for the eight datasets from Fig. 3. Even under this more critical and restrictive evaluation method, non-linear algorithms perform as well as or better than MVL in five examples (C, D, E, F and G). These results align with previous findings and further support the inclusion of non-linear workflows alongside MVL in model selection.
First, to evaluate the interpretability of the NN algorithm, we assessed whether it captures the same underlying relationships as the MVL model using SHAP analysis.43 On the left side of the SHAP summary plot for the NN model, the descriptors are ordered from most important at the top to least important at the bottom, exactly mirroring the MVL model's findings. Similarly, pink and blue dots on the left side of the plot indicate that both MVL and NN identified the same inverse and direct correlations with the target value (+ and − symbols in the dashed line box, Fig. 5). These findings suggest that both linear and non-linear models capture similar data trends. It is important to note that SHAP uses local linear models to approximate the decision-making process of the NN and therefore does not directly provide information on the NN's internal structure.44
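SHAP itself is an external library, but the two comparisons made above, descriptor ranking and direction of effect, can be approximated with a lightweight stand-in: scikit-learn's permutation importance for the ranking, and the sign of a simple feature-prediction correlation for the direction. This is a simplification of the SHAP analysis, not a substitute for it.

```python
import numpy as np
from sklearn.inspection import permutation_importance


def ranking_and_signs(fitted_model, X, y, seed=0):
    """Descriptor ranking (most important first) and direction of each effect."""
    imp = permutation_importance(fitted_model, X, y, n_repeats=10,
                                 random_state=seed)
    ranking = np.argsort(imp.importances_mean)[::-1]
    preds = fitted_model.predict(X)
    # Sign of the correlation between each descriptor and the predictions
    # approximates a direct (+) or inverse (-) relationship.
    signs = [np.sign(np.corrcoef(X[:, j], preds)[0, 1])
             for j in range(X.shape[1])]
    return ranking, signs
```

Running this on both the fitted MVL and NN models and comparing the returned rankings and signs mirrors, in a crude way, the side-by-side SHAP summary plots discussed above.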
Additionally, we compared the predictive accuracy of MVL and NN algorithms on the de novo molecule targets used in case H, using the values reported in the original study as the MVL baseline (Fig. 6). The RMSE values obtained for both models are nearly identical (5.32 and 5.31 M⁻¹ min⁻¹), demonstrating that a non-linear model can perform as well as the original MVL model.
A scoring system was developed to evaluate models beyond traditional metrics, assigning a score out of 10. This score accounts for various factors, including overfitting, predictive ability, uncertainty, and the detection of spurious results.
Interpretability assessments using SHAP analysis reveal that non-linear models capture underlying chemical relationships similarly to their linear counterparts. Furthermore, both model types lead to analogous de novo predictions, suggesting their potential utility in chemical discovery when using small databases.
Overall, the automated non-linear workflows presented have the potential to become part of a chemist's toolbox for studying problems in low-data regimes. These techniques provide alternative algorithms that can be used alongside traditional linear models in data-driven studies.
Footnote
† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d5sc00996k
This journal is © The Royal Society of Chemistry 2025