Retrosynthesis with attention-based NMT model and chemical analysis of “wrong” predictions†
Abstract
We consider retrosynthesis to be a machine translation problem. Accordingly, we apply an attention-based and completely data-driven model named Tensor2Tensor to a data set comprising approximately 50 000 diverse reactions extracted from the United States patent literature. The model significantly outperforms the seq2seq model (37.4%), with top-1 accuracy reaching 54.1%. We also offer a novel insight into the causes of grammatically invalid SMILES, and conduct a test in which experienced chemists select and analyze the “wrong” predictions that may be chemically plausible but differ from the ground truth. The effectiveness of our model is found to be underestimated and the “true” top-1 accuracy reaches as high as 64.6%.