Designing compact training sets for data-driven molecular property prediction through optimal exploitation and exploration
Abstract
In this paper, we consider the problem of designing a compact training set comprising the most informative molecules from a specified library to build data-driven molecular property models. Specifically, using (i) sparse generalized group additivity and (ii) kernel ridge regression as two representative classes of models, we propose a method that combines rigorous model-based design of experiments with cheminformatics-based diversity-maximizing subset selection within an ε-greedy framework to systematically minimize the amount of data needed to train these models. We demonstrate the effectiveness of the algorithm on several databases, including QM7, NIST, and a dataset of surface intermediates, for calculating thermodynamic properties (heat of atomization and enthalpy of formation). For sparse group-additive models, balancing exploration (diversity-maximizing selection) with exploitation (D-optimality selection) enables learning from a fraction of the data (sometimes as little as 15%) while achieving accuracy comparable to five-fold cross-validation on the entire set. In contrast, our results indicate that kernel methods prefer diversity-maximizing selection.
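As a high-level illustration of the selection strategy outlined above, the sketch below alternates between exploration and exploitation in an ε-greedy loop: with probability ε it adds the candidate molecule farthest (in descriptor space) from the molecules already selected, and otherwise it adds the candidate that most increases the log-determinant of the design-matrix information, a D-optimality criterion. The descriptor matrix, the value of ε, the ridge stabilizer, and the function name select_training_set are illustrative assumptions, not the authors' exact implementation.

```python
# Minimal sketch of epsilon-greedy training-set selection (assumed form):
# exploration = diversity-maximizing (max-min distance) pick,
# exploitation = greedy D-optimality (log-det) pick.
import numpy as np


def select_training_set(X, n_select, epsilon=0.3, rng=None):
    """Greedily select `n_select` rows (molecules) from candidate matrix X."""
    rng = np.random.default_rng(rng)
    n, d = X.shape
    selected = [int(rng.integers(n))]           # seed with a random molecule
    remaining = set(range(n)) - set(selected)

    while len(selected) < n_select and remaining:
        pool = np.fromiter(remaining, dtype=int)
        if rng.random() < epsilon:
            # Exploration: pick the candidate farthest from the current set.
            dists = np.linalg.norm(
                X[pool][:, None, :] - X[selected][None, :, :], axis=-1
            )
            pick = pool[int(np.argmax(dists.min(axis=1)))]
        else:
            # Exploitation: pick the candidate maximizing
            # logdet(X_S^T X_S + ridge*I) after its addition; the small ridge
            # keeps the matrix invertible while the design is rank-deficient.
            ridge = 1e-6 * np.eye(d)
            base = X[selected].T @ X[selected] + ridge
            scores = [
                np.linalg.slogdet(base + np.outer(X[i], X[i]))[1] for i in pool
            ]
            pick = pool[int(np.argmax(scores))]
        selected.append(int(pick))
        remaining.discard(int(pick))
    return selected


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 8))               # stand-in molecular descriptors
    idx = select_training_set(X, n_select=30, epsilon=0.3, rng=1)
    print(f"selected {len(idx)} molecules, first ten indices:", idx[:10])
```

Both criteria draw from the same candidate pool, so a single ε controls the exploration/exploitation balance referred to in the abstract; setting ε = 1 recovers purely diversity-maximizing selection, the regime the results suggest is preferred by kernel methods.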