How to actively learn chemical reaction yields in real-time using stopping criteria†
Abstract
Chemical reactions are central to the creation of new materials, drug design and many other fields. Obtaining high reaction yields is important to reduce cost and to increase efficiency and the purity of the obtained product. Active learning (AL) is an attractive way to reduce the number of experiments needed when screening for high reaction yields in organic chemistry. Unfortunately, the majority of AL studies are based on “retro-AL”, in which all the reactions are already available. One problem of “real-time” AL is deciding when to stop the AL loop without creating an external labeled test set to analyze the performance of the model. This work addresses that problem with a stopping criterion, namely the stabilization prediction (SP) (Bloodgood et al., Proceedings of the Thirteenth Conference on Computational Natural Language Learning, 2009, 39–47). It uses an unlabeled equivalent of a test set, called a stop set, to evaluate the accuracy of the AL loop indirectly. To benchmark the stability of this method and investigate its applicability in chemistry, we investigated two datasets from the organic chemistry literature, four estimators, three types of descriptors, two sizes of queries per iteration (QPI) and the stop set size. We find that the method is most stable with a Support Vector Classification (SVC) estimator, 50 QPI and a stop set containing 30% of the data. This configuration gives the best compromise between an early stop (consuming less than 50% of the data) and an accuracy that is reliable over 10 different runs when compared with the accuracy obtained by classical supervised machine learning. We hope that this method will be of use in creating “real-time” AL in chemistry.
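To make the idea of a stop set concrete, the following is a minimal sketch of a stabilization-based stopping check, not the implementation used in this work. It assumes that agreement between successive models is measured with Cohen's kappa on the predicted stop-set labels, and that the loop stops once the mean agreement over a short window of iterations reaches a threshold. The names `cohen_kappa` and `should_stop`, and the default values `threshold=0.99` and `window=3`, are illustrative assumptions.

```python
from collections import Counter


def cohen_kappa(a, b):
    """Cohen's kappa between two equal-length sequences of predicted labels."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(a)
    # Fraction of stop-set items on which the two models agree.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Agreement expected by chance, from each model's label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        # Both models assign the same single class to every item.
        return 1.0
    return (observed - expected) / (1.0 - expected)


def should_stop(prediction_history, threshold=0.99, window=3):
    """Return True when stop-set predictions have stabilized.

    prediction_history holds one list of predicted stop-set labels per AL
    iteration, oldest first. threshold and window are assumed defaults.
    """
    if len(prediction_history) < window + 1:
        return False  # not enough iterations to fill the window
    recent = prediction_history[-(window + 1):]
    # Agreement between each pair of consecutive models in the window.
    kappas = [cohen_kappa(p, q) for p, q in zip(recent, recent[1:])]
    return sum(kappas) / window >= threshold


# Hypothetical stop-set predictions (1 = high yield, 0 = low yield).
history = [
    [1, 0, 1, 0],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 1, 0, 0],
]
print(should_stop(history[:4]), should_stop(history))
```

In an AL loop, one would append the retrained model's predictions on the stop set to `prediction_history` after each batch of queries and end the loop the first time `should_stop` returns true. No stop-set label is ever needed, which is what makes this kind of check usable in a real-time setting.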