Machine learning workflows beyond linear models in low-data regimes†
Abstract
Data-driven methodologies are transforming chemical research by providing chemists with digital tools that accelerate discovery and promote sustainability. In this context, non-linear machine learning algorithms are among the most disruptive technologies in the field and have proven effective for handling large datasets. However, in data-limited scenarios, linear regression has traditionally prevailed due to its simplicity and robustness, while non-linear models have been met with skepticism over concerns related to interpretability and overfitting. In this study, we introduce ready-to-use, automated workflows designed to overcome these challenges. These frameworks mitigate overfitting through Bayesian hyperparameter optimization by incorporating an objective function that accounts for overfitting in both interpolation and extrapolation. Benchmarking on eight diverse chemical datasets, ranging from 18 to 44 data points, demonstrates that when properly tuned and regularized, non-linear models can perform on par with or outperform linear regression. Furthermore, interpretability assessments and de novo predictions reveal that non-linear models capture underlying chemical relationships similarly to their linear counterparts. Ultimately, the automated non-linear workflows presented have the potential to become valuable tools in a chemist's toolbox for studying problems in low-data regimes alongside traditional linear models.