"Amide − amine + alcohol = carboxylic acid." Chemical reactions as linear algebraic analogies in graph neural networks
Abstract
In deep learning, and especially in chemistry applications, there is increasing urgency to uncover the hidden learning mechanisms of models often dubbed “black boxes.” In this work, we show that graph models built on computational chemical data behave similarly to natural language processing (NLP) models built on text data. Crucially, we show that atom-embeddings, i.e., the per-atom activation patterns of a graph neural network, exhibit arithmetic properties that represent valid reaction formulas. This closely parallels how word-embeddings can be combined to form word analogies that preserve the semantic meaning behind the words, as in the famous example “King” − “Man” + “Woman” = “Queen.” For instance, we show that the reaction from an alcohol to a carbonyl is represented by a constant vector in the embedding space, implicitly representing “−H2.” This vector is independent of the particular alcohol reactant and carbonyl product, and thus represents a consistent chemical transformation. Other directions in the embedding space correspond to distinct chemical changes (e.g., the tautomerization direction). In contrast to natural language processing, we can explain the observed chemical analogies through algebraic manipulations of the local chemical composition surrounding each atom-embedding. Furthermore, these observations find applications in transfer learning, for instance in the formal structure and prediction of atomistic properties such as 1H-NMR and 13C-NMR. This work is in line with the recent push for interpretable explanations of graph neural network models of chemistry, and uncovers a latent model of chemistry that is highly structured, consistent, and analogous to chemical syntax.
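The embedding arithmetic described above can be illustrated with a minimal sketch. The vectors and feature dimensions below are hand-constructed toy assumptions (real atom-embeddings come from a trained graph neural network); they only demonstrate the analogy mechanism: subtract and add embeddings, then resolve the result by cosine similarity to the nearest candidate.

```python
import numpy as np

# Toy, hand-constructed embeddings (illustrative assumption, not model output).
# Dimensions loosely encode hypothetical functional-group features:
# [C=O present, N-H present, O-H present]
embeddings = {
    "amide":           np.array([1.0, 1.0, 0.0]),  # C=O and N-H
    "amine":           np.array([0.0, 1.0, 0.0]),  # N-H
    "alcohol":         np.array([0.0, 0.0, 1.0]),  # O-H
    "carboxylic_acid": np.array([1.0, 0.0, 1.0]),  # C=O and O-H
    "ketone":          np.array([1.0, 0.0, 0.0]),  # C=O only
}

def solve_analogy(a, b, c, embeddings):
    """Return the key whose embedding is closest (by cosine) to a - b + c,
    excluding the three query terms themselves."""
    target = embeddings[a] - embeddings[b] + embeddings[c]

    def cos(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)

    candidates = {k: v for k, v in embeddings.items() if k not in (a, b, c)}
    return max(candidates, key=lambda k: cos(candidates[k], target))

# amide - amine + alcohol: [1,1,0] - [0,1,0] + [0,0,1] = [1,0,1]
print(solve_analogy("amide", "amine", "alcohol", embeddings))
# -> carboxylic_acid
```

In this toy setup the analogy of the title resolves exactly; in the actual learned embedding space the same resolution is approximate and recovered via nearest-neighbor search.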