Qi Yuan, Filip T. Szczypiński and Kim E. Jelfs*
Department of Chemistry, Molecular Sciences Research Hub, White City Campus, Imperial College London, Wood Lane, London, UK. E-mail: k.jelfs@imperial.ac.uk
First published on 11th February 2022
The development of accurate and explicable machine learning models to predict the properties of topologically complex systems is a challenge in materials science. Porous organic cages, a class of polycyclic molecular materials, have potential applications in molecular separations, catalysis and encapsulation. For most applications of porous organic cages, having a permanent internal cavity in the absence of solvent, a property termed "shape persistence", is critical. Here, we report the development of Graph Neural Networks (GNNs) to predict the shape persistence of organic cages. Graph neural networks are a class of neural networks in which the data, in our case the organic cages, are represented by graphs. The performance of the GNN models was measured against a previously reported computational database of organic cages formed through a range of [4 + 6] reactions with a variety of reaction chemistries. The reported GNNs have improved prediction accuracy and transferability compared to random forest predictions. Beyond the improvement in predictive power, we explored the explicability of the GNNs by computing integrated gradients of the GNN input. The contribution of monomers and molecular fragments to the shape persistence of the organic cages could be quantitatively evaluated with integrated gradients. With the added explicability of the GNNs, it was possible not only to accurately predict the properties of organic materials, but also to interpret the predictions of the deep learning models and provide structural insights for the discovery of future materials.
Machine learning (ML) has many potential uses within materials discovery, including reducing the cost of property prediction compared to carrying out computational simulations (especially via quantum mechanical methods) and removing the need for specialist modelling packages. This allows researchers to focus experimental synthesis and measurement effort on the most promising materials, reducing wasted laboratory resources,13,14 as well as helping to facilitate the exploration of larger chemical spaces.15–17 Apart from the widely reported ML models for molecular discovery, especially drug discovery, the application of ML to porous materials such as MOFs has gained significant interest.18 Various structural and geometrical descriptors for MOFs have been developed for the prediction of their gas sorption,19 open-source databases recording structures with experimental and/or computational properties have been published, and the diversity of the covered chemical space has been examined.20 The development and application of ML to modelling the properties of organic cages, on the other hand, is less well reported.
We have previously developed a computational database of >60000 organic cages formed through a range of reaction chemistries via a [4 + 6] reaction of four tritopic and six ditopic building blocks and studied their behaviour using molecular dynamics (MD) simulations.21 Each cage was allowed to evolve at an elevated temperature to sample different regions of the conformational potential energy surface. A subset of structures along the MD trajectory was then optimised and the resulting lowest energy conformations were analysed for shape persistence, following heuristic rules based on the expected number of windows and the observed window diameters. We then modelled the computed shape persistence of these cages using random forest models with Morgan fingerprints – vectors indicating the presence of specific substructures within a molecule – as the input features to the model. We found the models to be very effective when applied to systems with the same reaction chemistry; for example, a random forest model trained on cages formed from imine chemistry was effective at predicting shape persistence in other imine cages.21 However, the random forest model did not translate well between cages formed from different reaction mechanisms: an imine-trained random forest model was not as effective at predicting the shape persistence of a cage formed by alkyne metathesis chemistry. This was not surprising given that, experimentally, extremely small changes to the synthesis, for example adding a single CH2 group to one cage precursor, can completely invert the shape persistence behaviour.11 The predictions of the random forest models could not be attributed to specific monomers of the cage or fragments of the monomers, because the feature importance analysis did not show a strong preference for any specific molecular features.21 Recently developed graph neural networks (GNNs), which encode molecular information into neural graph fingerprints with machine-learned, continuous numeric vector representations, have exhibited improved predictive performance on various tasks, including chemical reactivity,22 compound–protein interaction23 and partial charge assignment,24 because of the flexibility of such fingerprints, especially when a larger dataset is available.25
An additional benefit of prediction via GNNs is that it is possible to identify key building blocks or molecular fragments contributing to the models' predictions by calculating attribution scores of the input features. Sundararajan et al. developed the integrated gradients approach to compute the contribution of input features for ML tasks and highlighted a case study of explaining molecular binding mechanisms using integrated gradients.26 McCloskey et al. calculated the attribution scores of molecular fragments with a hypothesised binding mechanism and proposed a sanity check to determine whether a hypothesised mechanism can be learned.27 The explicability of ML models for predictive tasks in material and molecular discovery has gained increasing research interest, since explainable models can not only provide insight into the monomers and fragments that drive a prediction, helping future discovery, but also reveal possible pitfalls of models whose predictions are accurate but where the underlying chemical mechanism has not been learnt.
In this study, we developed GNN models to predict the shape persistence of organic cages formed via different [4 + 6] reaction chemistries: imine condensation, amide condensation, and alkene/alkyne metathesis. Graph representations of the organic cages were developed and neural fingerprints for cages were trained using the GNN architecture. The modular structure of organic cages allows us to represent the cage structure purely by combining separate representations of the building blocks and linkers. In this way, the connectivity information is represented more simply, avoiding the redundancy that would be present if the entire polycyclic macromolecular cage graph were used. Furthermore, this approach allows representations to be easily created for cages – whether by synthetic end-users or for future development – that use different condensation chemistries with already existing precursors. The shape persistence of the organic cages was accurately predicted using the GNN model, with a significant improvement in generalisability towards unseen monomers compared to prior work with random forest models. In addition, to obtain explicability of the predictions of the GNNs, integrated gradients were implemented and computed for the precursors of the organic cages and for fragments of the precursors. It was therefore possible to quantify the contribution of precursors, as well as fragments, to the shape persistence of organic cages and provide insight for the design of future cage precursors.
(1)
If α < 0.035 and the cavity size was greater than 1 Å, the cage was labelled as "not collapsed"; otherwise, it was labelled "undetermined". Only the "not collapsed" (four windows and α < 0.035) and "collapsed" (wrong number of windows) cages were used to train the ML models in this study. There are 16921 "collapsed" cages (46.8%), 12167 "not collapsed" cages (33.7%) and 7020 "undetermined" cages in the database. Summary statistics of the cage collapse labels for the different chemical reactions in this study are provided in Table S1.† An example cavity with the corresponding window can be seen on the right of Fig. 1.
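As an illustration, the labelling heuristic described above can be written as a short helper function. This is a minimal sketch rather than the workflow used to build the database: the function and argument names are ours, and the number of windows, the window-size deviation α from eqn (1) and the cavity diameter are assumed to have been computed beforehand by a structure-analysis step.

```python
def label_cage(n_windows, alpha, cavity_diameter,
               expected_windows=4, alpha_threshold=0.035, min_cavity=1.0):
    """Assign a collapse label following the heuristic described in the text.

    n_windows:       number of windows found in the optimised cage.
    alpha:           window-size deviation parameter from eqn (1).
    cavity_diameter: internal cavity diameter in angstroms.
    """
    if n_windows != expected_windows:
        return "collapsed"                 # wrong number of windows
    if alpha < alpha_threshold and cavity_diameter > min_cavity:
        return "not collapsed"             # four windows, low deviation, open cavity
    return "undetermined"                  # excluded from model training


# Hypothetical example values:
print(label_cage(n_windows=4, alpha=0.012, cavity_diameter=3.4))  # not collapsed
print(label_cage(n_windows=3, alpha=0.012, cavity_diameter=3.4))  # collapsed
```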
(2)
Fig. 2 GNN encoding of the molecular features of the building blocks and linkers of organic cages in this study.
Explicit bonding features were ignored in our featurisation, as we believe that detailed information about the type and valence of the atoms involved captures all relevant bonding information implicitly, without introducing further redundancy or the ambiguity associated with the ease of bond rotation in various functional groups.
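To make this concrete, atom-level node features of this kind could be assembled with RDKit as sketched below. The element vocabulary and the exact feature list are illustrative assumptions rather than the precise encoding of Fig. 2; only atom-level descriptors are used and explicit bond features are deliberately omitted, as discussed above.

```python
import numpy as np
from rdkit import Chem

# Illustrative element vocabulary; the real encoding may cover a different set.
ELEMENTS = ["C", "N", "O", "H", "S", "F", "Cl", "Br"]

def atom_features(atom):
    """One-hot element type plus simple valence/aromaticity descriptors."""
    one_hot = [int(atom.GetSymbol() == element) for element in ELEMENTS]
    extras = [
        atom.GetTotalValence(),
        atom.GetDegree(),
        atom.GetTotalNumHs(),
        int(atom.GetIsAromatic()),
    ]
    return np.array(one_hot + extras, dtype=np.float32)

def mol_to_graph(smiles):
    """Return node features and an edge list; no bond features are encoded."""
    mol = Chem.MolFromSmiles(smiles)
    node_features = np.stack([atom_features(atom) for atom in mol.GetAtoms()])
    edges = [(bond.GetBeginAtomIdx(), bond.GetEndAtomIdx()) for bond in mol.GetBonds()]
    return node_features, edges
```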
Similar to our previous work,21 the neural fingerprints for the organic cages in this study were obtained by concatenating the molecular vectors of the building blocks and linkers. These neural fingerprints were then processed by a multi-layer neural network followed by a prediction layer (see Fig. 3). Parameters for the multi-layer neural network and the GNN were updated during the training process; as a result, the neural fingerprints of the cage components were also updated. We used the cross-entropy loss as the loss function, which operates on C output neurons for C classes, so the outputs can be used directly for the corresponding prediction. Hence, the architecture of the prediction layer is determined by the predictive task: for classification tasks, such as predicting organic cage shape persistence, the output layer has two neurons z_i (i = 1, 2). Each of the two output neurons was interpreted as the organic cage being "collapsed" or "not collapsed", and the outputs were processed using the softmax function:
σ(z_i) = exp(z_i) / Σ_{j=1}^{C} exp(z_j)    (3)
Fig. 3 Architecture of the GNN in this study: monomers (building blocks and linkers) of the organic cages were encoded to numeric vectors using a graph neural network (see Fig. 2), the vectors were then concatenated and processed by a multi-layer neural network to output a shape persistence prediction. The prediction by the two neurons in the output layers was processed using the softmax function to obtain the final classification.
The neuron with the larger softmax output σ(z_i) was treated as the predicted label.
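The overall pipeline of Fig. 3 can be sketched in PyTorch as follows. This is a schematic re-implementation under our own assumptions (a simple sum-aggregation message-passing encoder shared between building block and linker, with illustrative layer sizes), not the released code available at the GitHub link given later in the text.

```python
import torch
import torch.nn as nn

class MoleculeEncoder(nn.Module):
    """Message-passing encoder producing a neural fingerprint for one monomer."""
    def __init__(self, n_atom_features, hidden=64, n_layers=3):
        super().__init__()
        self.embed = nn.Linear(n_atom_features, hidden)
        self.updates = nn.ModuleList(
            [nn.Linear(hidden, hidden) for _ in range(n_layers)]
        )

    def forward(self, x, adj):
        # x: (n_atoms, n_atom_features); adj: (n_atoms, n_atoms) adjacency matrix.
        h = torch.relu(self.embed(x))
        for layer in self.updates:
            h = torch.relu(layer(h + adj @ h))   # sum messages from neighbours, update
        return h.sum(dim=0)                      # readout: fingerprint vector

class CageClassifier(nn.Module):
    """Concatenate building-block and linker fingerprints and predict collapse."""
    def __init__(self, n_atom_features, hidden=64):
        super().__init__()
        self.encoder = MoleculeEncoder(n_atom_features, hidden)
        self.head = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2),                # two output neurons z_1, z_2
        )

    def forward(self, bb_graph, linker_graph):
        fingerprint = torch.cat([self.encoder(*bb_graph), self.encoder(*linker_graph)])
        return self.head(fingerprint)            # logits; softmax gives eqn (3)

# Training would use nn.CrossEntropyLoss (which applies the softmax internally);
# at inference, the neuron with the larger softmax output gives the predicted label.
```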
The performance of the GNN model on the classification task of “collapsed” and “not collapsed” cages was evaluated using the accuracy, precision and recall scores on the test sets, defined as follows:
Accuracy = (true positives + true negatives) / (true positives + true negatives + false positives + false negatives)    (4)

Precision = true positives / (true positives + false positives)    (5)

Recall = true positives / (true positives + false negatives)    (6)
In this study, the “collapsed” organic cages were regarded as “positive” in our predictions. “True positive” represents the data where cages were “collapsed” from both the GNN model prediction and as labelled in the database; “false positive” represents the data where cages were “collapsed” according to the GNN model prediction but “not collapsed” as labelled in the database; “true negative” represents the data where cages were “not collapsed” from both the GNN prediction and as labelled in the database; “false negative” represents the cages that were “not collapsed” according to the GNN prediction but were “collapsed” as labelled in the database.
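For clarity, these metrics can be computed directly from the four counts defined above; a minimal sketch (the string label encoding is our assumption) is:

```python
import numpy as np

def classification_metrics(y_true, y_pred, positive="collapsed"):
    """Accuracy, precision and recall with 'collapsed' as the positive class."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == positive) & (y_true == positive))   # true positives
    fp = np.sum((y_pred == positive) & (y_true != positive))   # false positives
    tn = np.sum((y_pred != positive) & (y_true != positive))   # true negatives
    fn = np.sum((y_pred != positive) & (y_true == positive))   # false negatives
    accuracy = (tp + tn) / len(y_true)                          # eqn (4)
    precision = tp / (tp + fp)                                  # eqn (5)
    recall = tp / (tp + fn)                                     # eqn (6)
    return accuracy, precision, recall
```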
The attribution scores in this study were calculated and represented using integrated gradients. The formal definition of attribution scores, as well as the axiomatic justification that integrated gradients satisfy certain desirable properties, is provided by Sundararajan et al.26 Briefly, let the function F: ℝ^n → [0,1] represent a deep neural network. Given an input feature x and some baseline feature x′, the integrated gradient along the ith dimension of x was defined as follows:
IG_i(x) = (x_i − x′_i) × ∫_0^1 [∂F(x′ + α(x − x′))/∂x_i] dα    (7)
The integrated gradient attribution was defined relative to a baseline, and the selection of the baseline is essential to the causal analysis of ML models.31 A robust baseline input should give uninformative predictions; for the classification task here, the ML model should output a probability of approximately 0.5 for the baseline input, which requires a comparable number of "collapsed" and "not collapsed" cages in the training set to mimic a distribution of p = 0.5 for cage collapse. Here, we used zero vectors as the baseline molecule input and augmented the training set with baseline cages to achieve uninformative predictions for them. The detailed implementation is provided in Section S2.†
Once the integrated gradients for all input atoms of an organic cage were calculated, the contributions of the building block and the linker of the cage were calculated by summing the integrated gradients for the atoms belonging to the building block and the linker, respectively. The attribution of fragments within the building block and linker molecules was visualised using the atomic integrated gradients of the molecules.
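A minimal sketch of this procedure is given below, approximating the integral in eqn (7) with a Riemann sum and then aggregating the atomic attributions per monomer. It assumes, for simplicity, that `model` returns the "collapsed" softmax score as a scalar for a flat atom-feature tensor; the implementation provided below operates on the full graph-structured cage input instead.

```python
import torch

def integrated_gradients(model, x, baseline=None, steps=50):
    """Approximate eqn (7) along the straight-line path from baseline x' to input x."""
    if baseline is None:
        baseline = torch.zeros_like(x)          # zero-vector baseline, as in the text
    total_grad = torch.zeros_like(x)
    for k in range(1, steps + 1):
        point = baseline + (k / steps) * (x - baseline)
        point.requires_grad_(True)
        score = model(point)                    # scalar "collapsed" score
        grad, = torch.autograd.grad(score, point)
        total_grad += grad
    return (x - baseline) * total_grad / steps  # (x_i - x'_i) times average path gradient

def monomer_attributions(atom_attributions, bb_atoms, linker_atoms):
    """Sum atomic attributions over the building-block and linker atoms."""
    per_atom = atom_attributions.sum(dim=1)     # collapse the feature dimension
    return per_atom[bb_atoms].sum().item(), per_atom[linker_atoms].sum().item()
```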
The GNN model and the computation of the integrated gradients were implemented in Python 3.7.5 with PyTorch 1.1.0; the source code is provided at http://github.com/qyuan7/Cage_GNN.
| | GNN | Random forest |
|---|---|---|
| Accuracy | 0.89 | 0.88 |
| Precision | 0.90 | 0.89 |
| Recall | 0.90 | 0.91 |
The All-vs-One task, where the data for cages in all but one row of Table 1 were used as the training set and the remaining row was used as the test set, is more challenging than the All-vs-All task, as most of the precursors in the test set were not included in the training set (except for the amine2 linkers and amine3 building blocks, which were used by two rows in Table 1). The All-vs-One task provides a better evaluation of the transferability of the ML models towards families of precursors with different functional groups, which is of greater practical relevance for the design of future organic cages. The accuracy scores for the GNN and random forest models are shown in Table 3, and the corresponding precision and recall scores are provided in Table S2.† The results are for a model tested on a single data set within a row, i.e. on cages formed by a single reaction chemistry type, with the data sets in the other rows used as the training set. For example, for the test on aldehyde3amine2 cages (row 1), all the precursor pairs in the other rows (amine3aldehyde2, alkene3alkene2, etc.) were used as the training set, and only the aldehyde3amine2 cages were used as the test set.
| Building block | Linker | Test accuracy (random forest) | Test accuracy (GNN) |
|---|---|---|---|
| Aldehyde3 | Amine2 | 0.61 | **0.72** |
| Amine3 | Aldehyde2 | 0.72 | **0.73** |
| Alkene3 | Alkene2 | 0.63 | **0.81** |
| Alkyne3 | Alkyne2 | 0.41 | **0.77** |
| Acid3 | Amine2 | 0.71 | **0.76** |
| Amine3 | Acid2 | 0.73 | **0.79** |
| Aggregated score (95% CI) | | 0.64 ± 0.10 | **0.76 ± 0.03** |

a The model with better performance for each task is highlighted in bold.
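The All-vs-One protocol described above amounts to a leave-one-reaction-out split, which could be expressed as below; the record schema (a `reaction` key naming the precursor pair) is hypothetical and used only to illustrate the protocol.

```python
def all_vs_one_splits(cages):
    """Yield (held_out, train, test) splits: each reaction chemistry is held out
    in turn as the test set, and all other chemistries form the training set.

    `cages` is a list of records with a 'reaction' key (hypothetical schema),
    e.g. 'aldehyde3amine2', 'amine3aldehyde2', 'alkene3alkene2', ...
    """
    reactions = sorted({cage["reaction"] for cage in cages})
    for held_out in reactions:
        train = [cage for cage in cages if cage["reaction"] != held_out]
        test = [cage for cage in cages if cage["reaction"] == held_out]
        yield held_out, train, test
```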
For the All-vs-One task, the GNN model consistently outperformed the random forest model, and by a larger margin than in the All-vs-All task. The biggest improvement in the predictive performance of the GNN model compared to the random forest benchmark was for the alkene3alkene2 cages (alkene metathesis of a trialkene and dialkene) and the alkyne3alkyne2 cages (alkyne metathesis of a trialkyne and dialkyne). As shown in Table 1, the building blocks for the alkene3alkene2 and alkyne3alkyne2 cages were not used for the other cages, and the benchmark random forest model failed to give reasonably accurate predictions of the shape persistence of these cages; the benchmark random forest model therefore transfers poorly to building blocks that were not present in the training set. The GNN model, on the other hand, was equally accurate for the predictions of the alkene3alkene2 and alkyne3alkyne2 cages compared to the other groups of cages. The consistent improvement in predictive power of the GNN model over the random forest model indicates that the GNN model has better transferability to novel cage precursors and different reaction types. In addition, the improved performance of the GNN model for the alkene3alkene2 and alkyne3alkyne2 cages suggests that the GNN model has learnt, during training, some of the structural features of the precursors that lead to collapse, providing the model with some "chemical intuition", which can be investigated further by explaining and interpreting the predictions of the GNN model using integrated gradients.
It is worth discussing here possible extensions to different cage topologies. In this work, we focused on the Tri4Di6 topology, which is only one of the six topologies resulting from a condensation of tritopic and ditopic precursors that we have previously enumerated.28 We foresee our approach being re-employed for cages of other topologies, not necessarily limited to tri- and ditopic precursors. As it is difficult to predict a priori which topology will be preferred, we would recommend training separate models for other possible topologies using the same cage representation as the one used in this work (see Fig. 3). In the future, however, we hope for a more general model, whose output layer would contain neurons giving the probability of cage collapse in each possible topology, alongside a prediction of which topology is the thermodynamically most stable reaction outcome.
In this study, the integrated gradient of the cage input feature x for an atom is defined relative to the baseline input x′ in eqn (7); it is therefore important that the GNN model F gives uninformative predictions for the baseline input. For the classification task in this study, the baseline input should yield a probability close to 0.5, i.e. the baseline cage composed of zero vectors should have a neutral probability of being "collapsed". When calculating the integrated gradients, we used a data augmentation technique on the training set, as described in Section S2 of the ESI† and in the work by McCloskey et al.27 The distribution of the predicted softmax scores for the "collapsed" neuron in the output layer (which can be interpreted as the probability of the cage being collapsed) on the baseline cages used in training the GNN model for calculating the integrated gradients is shown in Fig. 5(b). The softmax score of the baseline cages centres around 0.5, with a mean value of 0.501 and a standard deviation of 0.008. This result indicates that the GNN model gives neutral predictions for the baseline cages; consequently, for a cage classified as "collapsed" because its softmax score for the "collapsed" neuron exceeds 0.5, the majority of the attribution for the increased softmax score can be ascribed to the molecular features of the building block and linker molecules of the cage.
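The sanity check described here (a near-0.5 mean score on the baseline cages) can be expressed as a short evaluation loop. The sketch below assumes the two-neuron classifier interface sketched earlier and that index 0 of the output corresponds to the "collapsed" neuron; both are our conventions, not necessarily those of the released code.

```python
import torch

@torch.no_grad()
def baseline_sanity_check(model, baseline_graphs):
    """Mean and standard deviation of the 'collapsed' softmax score on baseline cages.

    baseline_graphs: iterable of (building_block_graph, linker_graph) pairs whose
    node features are all zero vectors (the baseline input x').
    """
    scores = []
    for bb_graph, linker_graph in baseline_graphs:
        logits = model(bb_graph, linker_graph)
        scores.append(torch.softmax(logits, dim=-1)[0].item())  # index 0 = "collapsed"
    scores = torch.tensor(scores)
    # For an uninformative baseline the mean should sit close to 0.5
    # (0.501 +/- 0.008 was observed for the trained model in this study).
    return scores.mean().item(), scores.std().item()
```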
We calculated the integrated gradients of the precursors in the test sets for the All-vs-One tasks and ranked the building blocks and linkers according to their integrated gradient attribution scores. The top 5 building blocks BB1–5 with the largest integrated gradients for the aldehyde3amine2 cages (imine reaction of a trialdehyde and diamine) are shown in Fig. 6, together with the percentage of aldehyde3amine2 cages containing each building block that were "collapsed" in the All-vs-One test set. Almost all the building blocks in Fig. 6 have a probability greater than 90% of being "collapsed", indicating that cages with these building blocks are very likely to be "collapsed" and that these building blocks should be avoided in the design of future organic cages if shape persistence is desired. The top 5 linker molecules L1–5 for the aldehyde3amine2 cages with the largest integrated gradient attribution to "collapsed" cages are shown in Fig. 7, with the percentage of "collapsed" aldehyde3amine2 cages containing each linker. Apart from L1, the cages in the test set containing these linkers have a high probability of being "collapsed". The integrated gradients of the building block and linker molecules can thus serve as an indicator of organic cages being "collapsed": using building block/linker molecules with high integrated gradient attributions means there is a high probability of collapsed cages. It might be tempting to assume that precursors with the smallest gradient attributions would indicate "not collapsed" cages. The "bottom 5" building blocks and linkers of the aldehyde3amine2 cages are shown in Fig. S11 and S12.† Cages with these building blocks and linkers still have a considerable probability of being "collapsed", thus the integrated gradient is of only limited use for identifying less collapse-inducing precursors. The limitation of the GNN model in identifying "not collapsed" cages can also be observed from the low specificity scores for the All-vs-One tasks, shown in Table S2.† We therefore focus on the collapse-inducing precursors in this study.
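The ranking used to identify BB1–5 and L1–5 can be reproduced by grouping the test-set cages by precursor and averaging their attribution scores; the record schema below (`precursor`, `attribution`, `label` keys) is hypothetical and purely illustrative.

```python
from collections import defaultdict

def rank_precursors(records):
    """Rank precursors by mean integrated-gradient attribution and report the
    fraction of 'collapsed' cages that contain each precursor."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["precursor"]].append(record)
    summary = []
    for precursor, rows in grouped.items():
        mean_attribution = sum(r["attribution"] for r in rows) / len(rows)
        collapsed_fraction = sum(r["label"] == "collapsed" for r in rows) / len(rows)
        summary.append((precursor, mean_attribution, collapsed_fraction))
    # Largest attribution first; the top entries correspond to the most
    # collapse-inducing building blocks/linkers (cf. Fig. 6 and 7).
    return sorted(summary, key=lambda item: item[1], reverse=True)
```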
The top building blocks and linkers with the largest integrated gradients for the other groups of organic cages, together with the probability of a cage being "collapsed" with such precursors, are provided in Section S4 of the ESI.† For the acid3amine2 cages (amide condensation of a tricarboxylic acid and diamine), the integrated gradient attributions of the top building blocks correlated poorly with cage shape persistence, which could be because the carboxylic acid functional group was used less in the database (Table 1), so the GNN model had poorer transferability to cages with tricarboxylic acid building blocks. Further improvement of the GNN model for cages formed via the amide condensation reaction would require a larger dataset labelled in the same way as the current one. The relationship between cage shape persistence and the average integrated gradient attribution scores of the building block/linker molecules in the All-vs-One task is shown in Fig. S11 and S12.† Qualitative agreement between cage collapse and high integrated gradient scores can be found for cages formed via imine condensation, alkene metathesis and alkyne metathesis, which could provide initial insight into the shape persistence of organic cages formed via such reaction chemistries (see Fig. S13 and S14†).
If specific precursor fragments could be identified as collapse-inducing from the above analysis, such fragments could usefully be avoided in the design of novel precursors. Atoms in the cage components with an integrated gradient attribution score greater than 0.01 are highlighted in red in Fig. 6 and 7. The majority of the integrated gradient attribution is located in the central core of the building blocks, and such fragments could contribute to the poor shape persistence of the corresponding cages. It is thus possible to identify molecular fragments/central cores with a high attribution to the collapse of organic cages, and therefore to avoid or alter such fragments when selecting precursors for cage synthesis. To validate whether the identified central cores correlate with the shape persistence of all the organic cages in this study, we performed a substructure match of the cores across all the cages and calculated the probability that a cage whose precursors contain these backbones is collapsed, which is also shown in Fig. 6. Meanwhile, the linkers in Fig. 7 (apart from the outlier L1) contain more saturated carbon chains and hence more internal degrees of freedom. Furthermore, the amine part of the imine bond (resulting from the condensation that gives the cages in the aldehyde3amine2 set) contains one more flexible methylene unit than the aldehyde contribution. As a result, the fragments with high integrated gradients for linkers L2–5 span both the linker backbone and the functional group, making it difficult to attribute the GNN prediction to any particular motif within those molecules; substructure matching of the linker molecules was therefore not performed.
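The substructure validation can be sketched with RDKit as below, matching a candidate core against the building blocks and computing the fraction of matching cages that are collapsed; the SMARTS pattern and record schema are illustrative assumptions.

```python
from rdkit import Chem

def collapse_rate_for_core(core_smarts, cages):
    """Fraction of collapsed cages among those whose building block contains
    a given central core.

    core_smarts: SMARTS pattern for the candidate collapse-inducing core.
    cages:       list of records with 'bb_smiles' and 'label' keys
                 (hypothetical schema used for illustration).
    """
    core = Chem.MolFromSmarts(core_smarts)
    matched = [cage for cage in cages
               if Chem.MolFromSmiles(cage["bb_smiles"]).HasSubstructMatch(core)]
    if not matched:
        return None
    return sum(cage["label"] == "collapsed" for cage in matched) / len(matched)
```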
To investigate whether the integrated gradient analysis can help chemists design cages with improved shape persistence, we replaced the collapse-inducing core of building block BB2 with a simple rigid benzene ring (yielding BB2mod). Under MD and geometry optimisation conditions identical to those used for the training set, the modified cage provided a shape-persistent cavity (see Fig. 8). This demonstrates not only that the GNN model shows higher accuracy than the previously reported models, but also that the integrated gradient analysis is a powerful tool for molecular design.
The computational study of supramolecular systems such as organic cages is time consuming with physical simulations, and the development of ML techniques has the potential to provide data-driven solutions that might accelerate the evaluation of supramolecular systems. However, in many cases ML models are regarded as powerful black boxes, providing limited insight to advance the materials discovery process further. In this study, we aimed to develop an explainable GNN model, both to ensure the transferability of our model and to provide guidance for further materials discovery.
Footnote
† Electronic supplementary information (ESI) available. See DOI: 10.1039/d1dd00039j |
This journal is © The Royal Society of Chemistry 2022 |