Richard Dybowski*
St John's College, University of Cambridge, Cambridge CB2 1TP, UK. E-mail: rd460@cam.ac.uk
First published on 9th November 2020
There has been an upsurge of interest in applying machine-learning (ML) techniques to chemistry, and a number of these applications have achieved impressive predictive accuracies; however, they have done so without providing any insight into what has been learnt from the training data. The interpretation of ML systems (i.e., a statement of what an ML system has learnt from data) is still in its infancy, but interpretation can lead to scientific discovery, and examples of this are given in the areas of drug discovery and quantum chemistry. It is proposed that a research programme be designed that systematically compares the various model-agnostic and model-specific approaches to interpretable ML within a range of chemical scenarios.
Numerous techniques have been developed under the heading of ML including neural networks, support vector machines and random forests,4 but, in line with the concept of ‘statistical learning’,5 we will also regard all forms of statistical regression as being under the ML umbrella.
A pooling layer abstracts the values of a feature map. This successive use of convolutional and pooling layers produces a hierarchy of abstracted features from an image that are invariant to translation, hence the successful use of CNNs for the recognition of images such as faces.
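The abstraction performed by a pooling layer can be made concrete with a minimal sketch. The example below (illustrative, not from any cited work) applies 2 × 2 max pooling to a toy feature map and shows that a feature shifted within a pooling window produces the same pooled output, which is the source of the translation invariance mentioned above.

```python
import numpy as np

def max_pool_2x2(feature_map):
    """Downsample a 2D feature map by taking the max of each 2x2 block."""
    h, w = feature_map.shape
    return feature_map[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

# A detected 'feature' (activation 9) at one position...
a = np.array([[9, 0, 0, 0],
              [0, 0, 0, 0],
              [0, 0, 0, 0],
              [0, 0, 0, 0]])
b = np.roll(a, 1, axis=1)  # same feature shifted one pixel right

# ...pools to the same abstracted map: shifts within a window are absorbed.
print(max_pool_2x2(a))                                   # [[9 0], [0 0]]
print(np.array_equal(max_pool_2x2(a), max_pool_2x2(b)))  # True
```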
There are many examples of where ML has been applied within chemistry,10 including the design of crystalline structures,11 planning retrosynthesis routes,12 and reaction optimisation.13 Here, we will briefly look at two; namely, drug discovery and quantum chemistry.
Once a target biomolecule (usually a protein) that is associated with a disease enters a pharmaceutical company's pipeline, it can take about 12 years to develop a marketable drug, but the failure rate during this process is high and costly. Each new drug that does reach the market represents research and development costs of close to one billion US dollars; therefore, early drug validation is vital, and this has led to the rise of computational (in silico) techniques.14 Consequently, the traditional techniques for drug discovery and target validation15,16 have been augmented with the use of machine learning to reduce the number of candidates by predicting whether a chemical substance will have activity at a given target.17 An example is the use of an ANN to aid the design of anti-bladder cancer agents.18
The standard approach to developing a neural network to predict whether a compound S will be active with respect to a target protein P is to train the network using a collection {S1,…,Sn} of compounds with known activity toward P, but how can a ligand–protein activity predictor be trained if there are no known ligands for target protein P? AtomNet19 is a CNN designed to predict ligand–protein activities when no ligand activity for a target protein is available. This is done by training the CNN using known activities across a range of ligand–protein complexes. The thesis of AtomNet is as follows: (i) a complex ligand–protein interaction can be viewed as a combination of smaller and smaller pieces of chemical information; (ii) a CNN can model hierarchical combinations of simpler items of information; (iii) therefore, a CNN can model complex ligand–protein interactions.
Ligand–protein associations were encoded for AtomNet using ligand–protein interaction fingerprints. The network significantly outperformed a variant of AutoDock Vina;20 for example, AtomNet achieved an AUC (area under the ROC curve) greater than 0.9 for 57.8% of the targets in the DUD-E dataset (Directory of Useful Decoys, Enhanced), whose decoys act as presumed inactives.21
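The AUC figures quoted above have a simple probabilistic reading: the AUC is the probability that a randomly chosen active compound is ranked above a randomly chosen decoy. A minimal sketch (with hypothetical scores, not AtomNet's) computes it directly from that definition via the Mann–Whitney statistic:

```python
import numpy as np

def roc_auc(labels, scores):
    """AUC = probability that a random active outranks a random decoy
    (Mann-Whitney U statistic, ties counted as half)."""
    labels = np.asarray(labels, dtype=bool)
    pos, neg = np.asarray(scores)[labels], np.asarray(scores)[~labels]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

# Hypothetical predicted activities for 4 actives (1) and 4 decoys (0)
y = [1, 1, 1, 1, 0, 0, 0, 0]
s = [0.9, 0.8, 0.7, 0.3, 0.6, 0.4, 0.2, 0.1]
print(roc_auc(y, s))  # 0.875: two decoys (0.6, 0.4) outrank the weakest active (0.3)
```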
When using ML for drug discovery, it is important that the ligand–protein data used to test the ML system are not biased.22
| Method | Runtime scaling |
| --- | --- |
| Configuration interaction (up to quadruple excitations) | O(N¹⁰) |
| Coupled cluster (CCSD(T)) | O(N⁷) |
| Configuration interaction (single and double excitations) | O(N⁶) |
| Møller–Plesset second-order perturbation theory | O(N⁵) |
| Hartree–Fock | O(N⁴) |
| Density functional theory (Kohn–Sham) | O(N³–N⁴) |
| Tight binding | O(N³) |
| Molecular mechanics | O(N²) |
Fig. 3 Schematic representation of quantum chemical and ML approximations with respect to computational cost and accuracy, which generalises the literature.24 DFT = density functional theory; SQC = semi-quantitative quantum chemistry.
The set of nuclear charges {Z_I} and atomic Cartesian coordinates {R_I} uniquely determines the Hamiltonian H of any chemical compound; in atomic units,

H = −(1/2)∑_i ∇_i² − ∑_{i,I} Z_I/|r_i − R_I| + ∑_{i<j} 1/|r_i − r_j| + ∑_{I<J} Z_I Z_J/|R_I − R_J|,

where r_i are the electronic coordinates.
An example of the use of ML for quantum chemistry is the mapping of {Z,R} to E by encoding the information in {Z,R} using a Coulomb matrix M:25

M_II = 0.5 Z_I^2.4,   M_IJ = Z_I Z_J/|R_I − R_J|  (I ≠ J)
The data set {(M_k, E_k^ref)} consisted of 7165 organic compounds, encoded as Coulomb matrices M, along with their energies E^ref calculated with the PBE0 hybrid density functional. Cross-validation gave a mean absolute prediction error of 9.9 kcal mol−1.
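The encoding and regression steps can be sketched as follows. This is an illustrative toy version of the approach of ref. 25 (Coulomb matrix plus Gaussian-kernel ridge regression); the molecule, hyperparameters and the vectorisation of M are simplifying assumptions, not the published protocol.

```python
import numpy as np

def coulomb_matrix(Z, R):
    """M_II = 0.5 Z_I^2.4 (diagonal); M_IJ = Z_I Z_J / |R_I - R_J| (off-diagonal)."""
    Z, R = np.asarray(Z, float), np.asarray(R, float)
    D = np.linalg.norm(R[:, None, :] - R[None, :, :], axis=-1)
    with np.errstate(divide='ignore'):
        M = np.outer(Z, Z) / D
    np.fill_diagonal(M, 0.5 * Z ** 2.4)
    return M

# Coulomb matrix for H2 at a 0.74 Angstrom bond length (toy example)
M = coulomb_matrix([1, 1], [[0, 0, 0], [0.74, 0, 0]])

def krr_fit_predict(X_train, y_train, X_test, sigma=1.0, lam=1e-8):
    """Gaussian-kernel ridge regression on vectorised molecular descriptors."""
    def K(A, B):
        d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
        return np.exp(-d ** 2 / (2 * sigma ** 2))
    alpha = np.linalg.solve(K(X_train, X_train) + lam * np.eye(len(X_train)), y_train)
    return K(X_test, X_train) @ alpha
```

In practice the Coulomb matrices are flattened (and symmetrised against atom-index permutations) before being used as the descriptor vectors X.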
The idea of using AI for scientific discovery is not new,26 and there has recently been interest in using ML to provide scientific insights as well as making accurate predictions.27
Interpretability is an ill-defined concept,28 but it will suffice for us to use the definition that interpretable ML is the use of ML models for the extraction of relevant knowledge about domain relationships contained in data.29 Consequently, interpretability refers to the extent to which a human expert can comprehend what an ML system has learnt from data; for example, “What is this ML system telling us?” From an interpretation we have insight, and from insight we can hopefully make a scientific discovery.
There are several categories of interpretability in the context of ML. In the following list, f(x) will be written as f(x; θ̂), where θ̂ is the set of ML parameters estimated from training data.
(a) Observe the input–output behaviour of f(x; θ̂); for example, by observing how f(x; θ̂) varies as x is varied.
(b) Inspect the values of parameters within the internal structure of f(x; θ̂). Here, f(x; θ̂) is either intrinsically interpretable or interpretable by design. This allows the mapping to be understood as a series of steps going from input x to output f(x; θ̂) that are comprehensible to a domain expert. This can be regarded as an ‘explanation’ of how f(x; θ̂) was derived from x.
(c) Determine the prototypical value x* of x for a given specific value f* of f(x; θ̂). Conceptually, this can be regarded as the x that maximises the conditional probability p(x|f*). x* need not have been previously encountered in a training set. An example of this approach is activation maximization.30
A simple example of an intrinsically interpretable ML system is the linear regression model

ŷ = θ̂_0 + θ̂_1x_1 + ⋯ + θ̂_dx_d,

whose fitted coefficients θ̂_j state directly how the prediction changes with each feature x_j.
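A minimal sketch of this intrinsic interpretability, using synthetic data (the generating coefficients are chosen for illustration):

```python
import numpy as np

# Hypothetical data: y depends strongly on x1 and weakly (negatively) on x2.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Least-squares fit: minimise ||A theta - y||^2 with an intercept column.
A = np.column_stack([np.ones(len(X)), X])
theta, *_ = np.linalg.lstsq(A, y, rcond=None)

# The fitted parameters ARE the interpretation: each theta_j is the change
# in the prediction per unit change in feature j, all else held fixed.
print(theta)  # approximately [0, 3, -0.5]
```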
The potential of using interpretable ML for chemistry is starting to grow. For example, Bayesian neural networks have been optimised to predict the dissociation time of the unmethylated and tetramethylated 1,2-dioxetane molecules from only the initial nuclear geometries and velocities.32 Conceptual information was extracted from the large amount of data produced by simulations.
We now look at two other examples of interpretable ML: one from drug discovery; the other from quantum chemistry.
There are two types of model-agnostic techniques. The first is the association-based technique, in which associations are determined between the inputs to system f(x) and the outputs from the system. One example of this is the partial dependence plot,5 which creates sets of ordered pairs {(x_s^(i), f(x_s^(i)))}, where the feature subset x_s ⊂ x. Another way to examine how f(x) changes as a feature x_i changes is to use the ‘gradient × input’, x̃_i ∂f(x)/∂x_i, evaluated at a particular value x̃_i of x_i. An extension of this is to integrate the gradient along a path for x_i from the observed value x̃_i to a baseline value x′_i. This is called an ‘integrated gradient’:

IG_i(x̃) = (x̃_i − x′_i) ∫_0^1 [∂f(x′ + α(x̃ − x′))/∂x_i] dα
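The integral above is evaluated numerically in practice. A self-contained sketch, approximating it with a midpoint Riemann sum and finite-difference gradients (the toy function f is hypothetical; real applications use the network's analytic gradients):

```python
import numpy as np

def integrated_gradient(f, x, baseline, i, steps=1000):
    """Approximate (x_i - x'_i) * integral_0^1 df/dx_i(x' + a(x - x')) da
    with a midpoint Riemann sum; df/dx_i by central finite differences."""
    x, baseline = np.asarray(x, float), np.asarray(baseline, float)
    alphas = (np.arange(steps) + 0.5) / steps
    h, total = 1e-6, 0.0
    for a in alphas:
        p = baseline + a * (x - baseline)
        e = np.zeros_like(p); e[i] = h
        total += (f(p + e) - f(p - e)) / (2 * h)
    return (x[i] - baseline[i]) * total / steps

# Toy model f(x) = x0^2 + 2*x1, attributed from a zero baseline.
f = lambda x: x[0] ** 2 + 2 * x[1]
x, x0 = np.array([1.0, 1.0]), np.zeros(2)
attrs = [integrated_gradient(f, x, x0, i) for i in range(2)]
print(attrs)  # ~[1.0, 2.0]; attributions sum to f(x) - f(x0) = 3 (completeness)
```

The completeness property shown in the final comment (attributions summing to the change in output) is what makes integrated gradients attractive for assigning responsibility to individual input features such as atoms or substructure bits.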
Integrated gradients33 determined the chemical substructures (toxicophores) that are important for differentiating toxic and non-toxic compounds. The relevant substructures identified in 12 compounds randomly sampled from the Tox21 Challenge data set are shown in Fig. 5. The DNN consisted of four hidden layers, each with 2048 nodes. The molecular structures were encoded using ECFPs, the training and test sets had 12060 and 647 examples, respectively, and the resulting AUC was 0.78.
Fig. 5 Six randomly drawn Tox21 samples. Dark red indicates atoms responsible for a positive classification, whereas dark blue atoms contribute to a negative classification.33
An alternative to the detection of features relevant to a classification performed by a DNN is to start at an output node and work back to the input nodes. This is done with Layer-Wise Relevance Propagation (LRP),34 which uses the network weights and the neural activations of a DNN to propagate the output (at layer M) back through the network to the input layer (layer 1). The backward pass is a conservative relevance-redistribution procedure in which the neurons in layer l (1 ≤ l < M) that contribute most to layer l + 1 receive the most ‘relevance’ from it. LRP has, so far, only been used to detect relevant features in pixel-based images and has not been used for the interpretation of chemistry-oriented ML systems. For example, rather than using fingerprints, such as ECFP, for molecular-structure input, 2D molecular drawings have been used as inputs to a CNN (achieving a predictive accuracy of AUC 0.766),35 but LRP was not used for interpretation.
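The redistribution step can be sketched for a single dense layer using the LRP ε-rule; the weights below are toy values chosen for illustration, and a full application would chain this rule backwards through every layer of the network.

```python
import numpy as np

def lrp_dense(a, W, b, R_out, eps=1e-6):
    """LRP epsilon-rule for one dense layer: redistribute the relevance R_out
    of the outputs back to the inputs in proportion to each input's
    contribution z_ij = a_i * W_ij to the pre-activation z_j."""
    z = a @ W + b                     # pre-activations, shape (n_out,)
    z = z + eps * np.sign(z)          # stabiliser avoids division by ~0
    s = R_out / z
    return a * (W @ s)                # input relevances, shape (n_in,)

# Toy layer: input 1 drives the output strongly; input 2 barely contributes.
a = np.array([1.0, 1.0])
W = np.array([[5.0], [0.1]])
R_in = lrp_dense(a, W, np.zeros(1), R_out=np.array([1.0]))
print(R_in)        # most relevance flows back to the first input
print(R_in.sum())  # ~1.0: relevance is (approximately) conserved
```

The conservation shown in the last line is the ‘conservative’ property mentioned above: total relevance is preserved from layer to layer as it propagates back to the inputs.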
Graph convolutional neural networks (GCNNs) are a variant of CNNs that enable 3D graphs to be used as inputs. Consequently, if the nuclei and bonds of a compound are regarded as the vertices and edges of a 3D graph, then 3D molecular structures can be used as inputs to GCNNs. One approach was to first slide convolutional filters over atom pairs to obtain atom-pair representations;33 pooling was then used to produce simple substructure representations, which were fed into the next convolutional layer. The predictive accuracy of the resulting GCNN was an AUC of 0.714. Interpretation was performed by omitting the pooling steps and feeding the substructures directly into the fully connected network.
The other type of model-agnostic approach is the use of surrogates (Fig. 6); namely, using a function ĝ(x) that approximates the black box f(x) but is intrinsically interpretable. Examples of parsimonious, intrinsically interpretable models include linear regression, logistic regression and decision trees. Such models can be either global or local. Given a set of vectors {x^(1),…,x^(n)} that we wish to apply to f(x), the global approach applies each of these vectors to the same surrogate model ĝ(x). In contrast, the local approach uses a different surrogate model ĝ_i(x) for each vector x^(i). An example is the LIME technique,36 which trains ĝ_i(x) on data in the ‘neighbourhood’ of x^(i), thereby providing interpretability specifically for the input–output pair (x^(i), ĝ_i(x^(i))).
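A minimal sketch of the local-surrogate idea, in the spirit of LIME but much simplified: sample points around x^(i), weight them by proximity, and fit a weighted linear model whose slope is the local explanation. The black-box function and kernel width are illustrative assumptions.

```python
import numpy as np

def local_surrogate(f, x, width=0.5, n=500, seed=0):
    """Sample points near x, weight them by a proximity kernel, and fit a
    weighted linear surrogate g_i(x) that mimics the black box f locally."""
    rng = np.random.default_rng(seed)
    Xs = x + rng.normal(scale=width, size=(n, len(x)))
    ys = np.array([f(p) for p in Xs])
    w = np.exp(-np.sum((Xs - x) ** 2, axis=1) / (2 * width ** 2))  # proximity
    A = np.column_stack([np.ones(n), Xs])
    # Weighted least squares: minimise sum_k w_k (g(x_k) - f(x_k))^2
    theta = np.linalg.lstsq(A * np.sqrt(w)[:, None], ys * np.sqrt(w), rcond=None)[0]
    return theta  # [intercept, local slopes]

# Black box f(x) = x0^2; near x0 = 2 the local slope should be ~4.
f = lambda p: p[0] ** 2
theta = local_surrogate(f, np.array([2.0]))
print(theta[1])  # ~4: the surrogate's slope explains f's local behaviour
```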
Fig. 6 (a) Black box f(x) trained from data {x,y}. (b) Surrogate model ĝ(x) of the black box trained from the same data; ĝ(x) approximates f(x).
Rather than resorting to fuzzy logic, neural networks such as SchNet (described below) are constructed by combining science-based subsystems in a plausible manner. This is an example of the model-specific approach to interpretable ML (Fig. 4).
A strategy for molecular energy E prediction39 is to represent each atom i by a vector c_i in B-dimensional space. A deep tensor neural network (DTNN) called SchNet, shown in Fig. 7, repeatedly refines c_i by pair-wise interactions between atoms i and j, from an initial vector c_i^(0) for atom i to a final vector c_i^(T):

c_i^(t+1) = c_i^(t) + ∑_{j≠i} v_ij | (1)
Fig. 7 The architecture of SchNet.39 The iteration loop implements eqn (1), and the interaction module (a neural network) implements eqn (2). |
The term v_ij is obtained from atom vector c_j and interatomic distance d_ij using a feedforward neural network with a tanh activation function:

v_ij = tanh[W^fc((W^cf c_j + b^f1) ∘ (W^df d_ij + b^f2))] | (2)
After T iterations, an energy contribution Ei for atom i is predicted for the final vector c(T)i, and the total energy E is the sum of the predicted contributions Ei.
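The interaction update of eqns (1) and (2) and the final energy sum can be sketched in a few lines of NumPy. This is an illustrative simplification: the weights are random rather than learned, the embedding dimension and atom count are arbitrary, a scalar distance stands in for the Gaussian-expanded distance features used in practice, and the per-atom energy head is reduced to a single linear readout.

```python
import numpy as np

rng = np.random.default_rng(0)
B, n_atoms, T = 8, 3, 2                      # embedding dim, atoms, iterations

# Hypothetical (random) weights of the interaction module of eqn (2)
W_fc = rng.normal(scale=0.1, size=(B, B))
W_cf = rng.normal(scale=0.1, size=(B, B))
W_df = rng.normal(scale=0.1, size=B)
b_f1, b_f2 = np.zeros(B), np.zeros(B)

def interaction(c_j, d_ij):
    """v_ij = tanh[W_fc((W_cf c_j + b_f1) o (W_df d_ij + b_f2))]  (eqn 2)"""
    return np.tanh(W_fc @ ((W_cf @ c_j + b_f1) * (W_df * d_ij + b_f2)))

# eqn (1): c_i^(t+1) = c_i^(t) + sum_{j != i} v_ij, iterated T times
C = rng.normal(size=(n_atoms, B))            # initial embeddings c_i^(0)
D = np.array([[0.0, 1.1, 2.2],               # interatomic distances d_ij
              [1.1, 0.0, 1.5],
              [2.2, 1.5, 0.0]])
for _ in range(T):
    C = C + np.array([sum(interaction(C[j], D[i, j])
                          for j in range(n_atoms) if j != i)
                      for i in range(n_atoms)])

# Per-atom contributions E_i from the final c_i^(T); total energy E = sum E_i
w_out = rng.normal(scale=0.1, size=B)
E = sum(w_out @ C[i] for i in range(n_atoms))
```

Because E is assembled from explicit per-atom terms E_i, the decomposition itself is inspectable, which is what makes this architecture interpretable by design.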
The DTNN, trained using stochastic gradient descent, achieved a mean absolute error of 1.0 kcal mol−1 on the GDB datasets.
The constructive nature of the DTNN allows interpretation of how E is obtained, and the estimation of E allows energy isosurfaces to be constructed (Fig. 8).
Fig. 8 Chemical potentials for methane, propane, pyrazine, benzene, toluene, and phloroglucinol determined from SchNet.39
Returning to the Schrödinger equation (in Dirac notation),

H|Ψ_n⟩ = E_n|Ψ_n⟩

expanding the wavefunction in a basis of atomic orbitals {ϕ_i} converts it into the generalised matrix eigenvalue problem

HC = SCε

with the Hamiltonian and overlap matrices

H_ij = ⟨ϕ_i|H|ϕ_j⟩
S_ij = ⟨ϕ_i|ϕ_j⟩
SchNOrb40 was developed to predict H and S using ML. The first part of SchNOrb is identical to SchNet: it starts from initial representations of atom types and positions and constructs representations of the chemical environments of atoms and atom pairs (again as in SchNet), but it then uses these representations to predict the energy E and the Hamiltonian matrix H, respectively.
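Once H and S have been predicted, the orbital energies ε follow from the generalised eigenvalue problem HC = SCε. A sketch using Löwdin orthogonalisation (the 2 × 2 matrices are a toy two-orbital problem, not SchNOrb output):

```python
import numpy as np

def orbital_energies(H, S):
    """Solve HC = SCe via Lowdin orthogonalisation: with X = S^(-1/2),
    the standard eigenproblem (X H X)C' = C'e yields the energies e."""
    s_val, s_vec = np.linalg.eigh(S)
    X = s_vec @ np.diag(s_val ** -0.5) @ s_vec.T      # X = S^(-1/2)
    eps, C_prime = np.linalg.eigh(X @ H @ X)
    return eps, X @ C_prime                           # energies, coefficients

# Toy symmetric two-orbital H and S (minimal-basis homonuclear diatomic)
H = np.array([[-1.0, -0.5], [-0.5, -1.0]])
S = np.array([[1.0, 0.25], [0.25, 1.0]])
eps, C = orbital_energies(H, S)
print(eps)  # bonding/antibonding pair: [-1.2, -0.6667]
```

For this symmetric case the energies reduce to (H11 ± H12)/(1 ± S12), i.e. −1.5/1.25 = −1.2 and −0.5/0.75 ≈ −0.667, confirming the solver.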
The SchNet and SchNOrb systems illustrate how DNNs can be customized to specific scientific applications so that the DNN architecture promotes properties that are desirable in the data modelled by the network. Interpretation occurs by unpacking these networks. This strategy has already found success in several scientific areas including plasma physics41 and epidemiology.42 Another example is the use of a customised DNN43 to encode the hierarchical structure of a gene ontology to provide insight into the structure and function of a cell.
Roscher et al.27 divided the various forms of ML for scientific discovery into four groups. Group 1 includes approaches without any means of interpretability. With Group 2, a first level of interpretability is added by employing domain knowledge to design the models or explain the outcomes. Group 3 deals with specific tools included in the respective algorithms or applied to their outputs to make them interpretable, and Group 4 lists approaches where scientific insights are gained by explaining the machine learning model itself. These categories should help to structure the above proposed systematic comparison.
As regards the model-specific and model-agnostic approaches, it is anticipated that the agnostic method is more widely applicable because of the greater prevalence of black-box systems.
The architectures of SchNet and SchNOrb are not learned but are designed with prior knowledge about the underlying physical process. In contrast, the aim of SciNet45 is to learn, without prior scientific knowledge, the underlying physics of a system from combinations of (a) observations (experimental data) taken from the physical system, (b) questions asked about the system, and (c) the correct answers to those questions in the context of the observations. This is done by using the latent neurons of an autoencoder network3 to learn and represent underlying physical parameters of the system. A network's learned representation is interpreted by analysing how the latent neurons respond to changes in the values of known physical parameters. For example, when given a time series of the positions of the Sun and the Moon, as seen from Earth, SciNet deduced Copernicus' heliocentric model of the solar system. This route to scientific discovery has not yet been applied to chemical systems but, given its potential, it is suggested that a clear methodology be developed that extends the SciNet-type approach to help chemists uncover new ideas and the links between them.
The exploration of the potential of interpretable ML to the sciences is growing, with applications in genomics,46 many-body systems,47 neuroscience48 and chemistry. Although this chemical review has focused on interpretation with respect to drug discovery and quantum chemistry, the potential of ML has been explored in other areas of chemistry, such as the use of ML for computational heterogeneous catalysis49 and retrosynthesis,50 and the use of interpretable ML in these and other fields is expected to prove to be immensely useful.
But a cautionary note. Reproducibility is fundamental to scientific research; thus, it is crucial that scientific discoveries arising from ML are reproducible, and this need must be factored into any methodology built for ML-based discovery. Furthermore, the provision of raw research data with a publication is essential to overcome the “reproducibility crisis”.51
In February 2020, the Alan Turing Institute held a workshop that announced the Nobel Turing Challenge; namely, “the production of AI systems capable of making Nobel-quality scientific discoveries highly autonomously at a level comparable, and possibly superior, to the best human scientists by 2050”. Chemistry is within the remit of the Challenge, and it is anticipated that interpretable ML will play a vital role toward the production of an ‘AI Chemist’.
This journal is © The Royal Society of Chemistry and the Centre National de la Recherche Scientifique 2020