Large property models: a new generative machine-learning formulation for molecules
Abstract
Generative models for the inverse design of molecules with particular properties have been heavily hyped, but they have yet to demonstrate significant gains over machine-learning-augmented expert intuition. A major challenge for such models is their limited accuracy in predicting molecules with targeted properties in the data-scarce regime, which is the regime typical of the prized outliers that inverse models are hoped to discover. For example, activity data for a drug target or stability data for a material may number only in the tens to hundreds of samples, which is insufficient to learn an accurate and reasonably general property-to-structure inverse mapping from scratch. We have hypothesized that the property-to-structure mapping becomes unique when a sufficient number of properties are supplied to the models during training. If true, this hypothesis has several important corollaries. It would imply that data-scarce properties can be completely determined from a set of more accessible molecular properties. It would also imply that a generative model trained on multiple properties would exhibit an accuracy phase transition after reaching a sufficient size, a process analogous to what has been observed for large language models. To interrogate these behaviors, we have built the first transformers trained on the property-to-molecular-graph task, which we dub "large property models" (LPMs). A key ingredient is supplementing these models during training with relatively basic but abundant chemical property data. The motivation for the large-property-model paradigm, the model architectures, and case studies are presented here.
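To make the property-to-molecular-graph framing concrete, the following is a minimal sketch (not the authors' implementation) of how a property-conditioned transformer decoder could be set up: a vector of cheap, abundant molecular properties is embedded and serves as the memory that the decoder attends to while autoregressively emitting tokens of a molecular-graph string such as SMILES. The class name PropertyToMoleculeModel and parameters such as n_properties and vocab_size are illustrative assumptions, not the paper's API.

```python
# Hypothetical sketch of a property-conditioned transformer decoder
# (illustrative only; not the architecture reported in the article).
import torch
import torch.nn as nn


class PropertyToMoleculeModel(nn.Module):
    def __init__(self, n_properties: int, vocab_size: int,
                 d_model: int = 256, nhead: int = 8, num_layers: int = 6):
        super().__init__()
        # Project each scalar property to one "memory" token the decoder can attend to.
        self.property_proj = nn.Linear(1, d_model)
        # Token embeddings for the molecular-graph string (e.g. a SMILES vocabulary).
        # Positional encodings are omitted here for brevity.
        self.token_emb = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, properties: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
        # properties: (batch, n_properties); tokens: (batch, seq_len) of token ids.
        memory = self.property_proj(properties.unsqueeze(-1))  # (batch, n_properties, d_model)
        tgt = self.token_emb(tokens)                           # (batch, seq_len, d_model)
        causal_mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        hidden = self.decoder(tgt, memory, tgt_mask=causal_mask)
        return self.lm_head(hidden)                            # (batch, seq_len, vocab_size)


# Toy usage: 12 supplied properties conditioning a 64-token vocabulary.
model = PropertyToMoleculeModel(n_properties=12, vocab_size=64)
props = torch.randn(4, 12)                 # batch of property vectors
toks = torch.randint(0, 64, (4, 30))       # batch of tokenized molecule strings
logits = model(props, toks)                # (4, 30, 64), trainable with cross-entropy
```

In this framing, supplementing training with many basic but abundant properties simply widens the conditioning vector, which is the ingredient the abstract identifies as key to making the inverse mapping well posed.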