Issue 5, 2019

Designing compact training sets for data-driven molecular property prediction through optimal exploitation and exploration

Abstract

In this paper, we consider the problem of designing a compact training set, comprising the most informative molecules from a specified library, for building data-driven molecular property models. Using (i) sparse generalized group additivity and (ii) kernel ridge regression as two representative classes of models, we propose a method that combines rigorous model-based design of experiments with cheminformatics-based diversity-maximizing subset selection within the ε-greedy framework to systematically minimize the amount of data needed to train these models. We demonstrate the effectiveness of the algorithm on several databases, including QM7, NIST, and a dataset of surface intermediates, for calculating thermodynamic properties (heat of atomization and enthalpy of formation). For sparse group additive models, balancing exploration (diversity-maximizing selection) and exploitation (D-optimality selection) allows models trained on a fraction of the data (sometimes as little as 15%) to achieve accuracy similar to that of five-fold cross validation on the entire set. In contrast, our results indicate that kernel methods prefer diversity-maximizing selection.
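
The interplay of exploration and exploitation described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `epsilon_greedy_select`, the descriptor matrix `X` (one row per molecule), the small `ridge` term, and the use of Euclidean distance as a stand-in for a cheminformatics similarity measure are all assumptions made for illustration. With probability `epsilon` the sketch takes a diversity-maximizing step (the candidate farthest from the current selection); otherwise it takes a greedy D-optimality step (the candidate that most increases the determinant of the information matrix).

```python
import numpy as np


def epsilon_greedy_select(X, n_select, epsilon=0.3, ridge=1e-6, seed=0):
    """Pick n_select row indices of X with an epsilon-greedy rule.

    With probability epsilon: exploration (max-min distance pick).
    Otherwise: exploitation (greedy D-optimality pick).
    All names and defaults here are hypothetical, for illustration only.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    selected = [int(rng.integers(n))]  # random starting molecule
    # Information matrix of the selected rows; the ridge term (an assumption)
    # keeps it invertible while fewer than p rows have been chosen.
    info = ridge * np.eye(p) + np.outer(X[selected[0]], X[selected[0]])
    # Distance from every candidate to its nearest selected row.
    min_dist = np.linalg.norm(X - X[selected[0]], axis=1)

    while len(selected) < n_select:
        remaining = np.setdiff1d(np.arange(n), selected)
        if rng.random() < epsilon:
            # Exploration: candidate farthest from the current selection.
            pick = remaining[np.argmax(min_dist[remaining])]
        else:
            # Exploitation: by the matrix determinant lemma,
            # det(M + x x^T) = det(M) * (1 + x^T M^{-1} x),
            # so the largest x^T M^{-1} x gives the largest determinant gain.
            solved = np.linalg.solve(info, X[remaining].T)  # shape (p, m)
            gain = np.einsum("ij,ji->i", X[remaining], solved)
            pick = remaining[np.argmax(gain)]
        pick = int(pick)
        selected.append(pick)
        info += np.outer(X[pick], X[pick])
        min_dist = np.minimum(min_dist, np.linalg.norm(X - X[pick], axis=1))
    return selected
```

Setting `epsilon=1.0` reduces the sketch to pure diversity-maximizing selection, and `epsilon=0.0` to pure greedy D-optimal selection; intermediate values mix the two.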

Article information

Article type: Paper
Submitted: 07 Jul 2019
Accepted: 27 Aug 2019
First published: 27 Aug 2019

Mol. Syst. Des. Eng., 2019, 4, 1048-1057

B. Li and S. Rangarajan, Mol. Syst. Des. Eng., 2019, 4, 1048 DOI: 10.1039/C9ME00078J
