Generalizable classification of crystal structure error types using graph attention networks
Abstract
Modern chemical applications of machine learning rely on massive training datasets collected through computational simulations or data mining. The quality of such datasets is increasingly called into question by the discovery of errors in the most popular crystal structure databases. While methods exist to detect the presence of an error, determining its cause is not straightforward. We propose a graph neural network-based approach to classify the types of errors present in crystal structures, including proton omissions, charge balancing errors, and crystallographic disorder. A training dataset comprising >11k metal–organic frameworks (MOFs) labelled by error type was generated through domain expert inspection. Models using chemically intuitive features, such as atomic number and oxidation state, achieved high classification accuracies ranging from 85 to 95%. Despite being trained only on MOFs, the models generalized to unseen databases of molecules and metal complexes, with accuracies exceeding 96% for proton and disorder error classification in random samples of drug molecules and metal complexes. Further, graph explainability analysis indicated that these models frequently identify chemically problematic subgraph structures, analogous to those a chemist would flag, as important to the error label prediction.
- This article is part of the themed collection: Journal of Materials Chemistry A HOT Papers
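To make the approach described in the abstract concrete, the following is a minimal sketch of a single-head graph attention layer applied to a toy bonded fragment, with each atom described by a scaled atomic number and oxidation state. It is not the authors' implementation: the Zn–O–H fragment, the feature scaling, the weight values, the mean-pool readout, and all function names are assumptions chosen only for illustration, and the weights are untrained, so the resulting probability carries no chemical meaning.

```python
import math

def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x

def matvec(W, h):
    # Multiply matrix W (list of rows) by vector h.
    return [sum(w * x for w, x in zip(row, h)) for row in W]

def attention_weights(i, neighbors, z, a):
    # Softmax-normalised attention of node i over itself and its neighbours.
    idx = [i] + neighbors[i]
    scores = [leaky_relu(sum(ak * v for ak, v in zip(a, z[i] + z[j])))
              for j in idx]
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return {j: e / total for j, e in zip(idx, exps)}

def gat_layer(features, neighbors, W, a):
    # One attention layer: transform features, then attention-weighted sum.
    z = [matvec(W, h) for h in features]
    out = []
    for i in range(len(features)):
        alpha = attention_weights(i, neighbors, z, a)
        out.append([sum(alpha[j] * z[j][k] for j in alpha)
                    for k in range(len(W))])
    return out

def predict_error_probability(features, neighbors, W, a, w_out, b_out):
    # Mean-pool node embeddings into a graph embedding, then apply a sigmoid.
    h = gat_layer(features, neighbors, W, a)
    pooled = [sum(col) / len(h) for col in zip(*h)]
    logit = sum(w * x for w, x in zip(w_out, pooled)) + b_out
    return 1.0 / (1.0 + math.exp(-logit))

# Hypothetical Zn-O-H fragment: [atomic number / 100, oxidation state / 4].
features = [[0.30, 0.50], [0.08, -0.50], [0.01, 0.25]]
neighbors = {0: [1], 1: [0, 2], 2: [1]}  # bonds Zn-O and O-H

# Arbitrary, untrained parameters (assumptions for illustration only).
W = [[0.5, -0.3], [0.2, 0.8]]
a = [0.1, -0.2, 0.3, 0.4]
w_out, b_out = [1.0, -1.0], 0.0

hidden = gat_layer(features, neighbors, W, a)
prob = predict_error_probability(features, neighbors, W, a, w_out, b_out)
print("Per-atom embeddings:", hidden)
print("Illustrative error probability:", prob)
```

In a sketch like this, the per-neighbour attention weights are also a natural starting point for asking which atoms or bonds contributed most to a prediction, although the explainability analysis mentioned in the abstract may rely on a different method.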