Yilin
Yang‡
,
Mingjie
Liu‡
and
John R.
Kitchin
*
Department of Chemical Engineering, Carnegie Mellon University, 5000 Forbes Ave., Pittsburgh, PA 15213, USA. E-mail: jkitchin@andrew.cmu.edu
First published on 8th August 2022
With the popularity of machine learning growing in the field of catalysis there are increasing numbers of catalyst databases becoming available. These databases provide us with the opportunity to search for catalysts with desired properties, which could lead to the discovery of new catalysts. However, while there are search methods for molecules based on similarity metrics, for solid-state catalyst systems there is not yet a straightforward search method. In this work, we propose a neural network embeddings based similarity search method that is applicable for both molecules and solid-state catalyst systems. We illustrate how the search method works and show search examples for the QM9, Materials Project (MP) and Open Catalyst 2020 (OC20) databases. We show that the configurations found present similarity in terms of geometry, composition, energy and in the electronic density of states. These results imply the neural network embeddings have encoded effective information that could be used to retrieve molecules and materials with similar properties.
Usually, researchers may want to search for molecules or materials with similar properties in applications like discovering new drugs or cheaper materials.4–6 Many similarity search methods have been developed for this purpose.7,8 In general, a similarity search approach consists of three essential components: a molecular representation method, a quantitative metric to measure the similarity of two molecules, and a search algorithm. The search process usually starts with one or more query molecules (e.g., configurations that have desired properties). Then the representation method converts them into numerical representations which could be used to calculate the pair-wise similarities. After that, the search algorithm retrieves candidates on the basis of the similarity measurement. The retrieved molecules are ranked by their similarities to the query molecule(s) in descending order. Significant efforts have been spent on designing fingerprints to represent molecules. For example, SMIfp (SMILES fingerprint) converts a molecule into a 34-dimension scalar fingerprint.9 Each element of the fingerprint counts the occurrences of 34 symbols in SMILES, where SMILES (Simplified Molecular Input Line Entry System) is a chemical language and information system used to represent different atoms and bonds with ASCII characters.10,11
The substructure-based fingerprint is also a popular choice to represent molecules. Each item of the fingerprint encodes whether or not a substructure is present in a molecule. Typical examples include the Molecular ACCess System (MACCS) and the Barnard Chemical Information Ltd. (BCI) fingerprint.12,13 MACCS uses 166 structural fragments as the keys while BCI contains 1052 substructures. These fingerprints rely on a pre-built library of substructures as the keys, which limits their applications only for molecules which have well-defined bondings. Also, they cannot be used for chemical systems represented with the periodic boundary condition (PBC). However, catalyst systems, such as complex adsorbates on solid surface, are usually represented with PBC and there is no well-defined bonding between atoms. Therefore, these methods cannot be used. To cope with the PBC problem, several molecular representation methods have been proposed such as the Atom-Centered Symmetry Functions (ACSF),14 and Smooth Overlap of Atomic Positions (SOAP).15 However, these methods have very poor scaling regarding the number of elements in the database. Hence, a method that deals with PBC and that has good scaling regarding the number of elements is needed.
In the past decade, the development of deep learning methods has changed the way we can represent data like text and images. Deep learning models like the convolutional neural network (CNN) and recurrent neural network (RNN) have been widely applied in computer vision and language processing tasks.16–19 For most of the deep learning models, the last layer of the deep neural network represents the input data as a numerical vector which contains rich information about the data. This vector representation is also called an embedding. Since the neural network output usually depends linearly on the embedding, we can also regard the embedding as a nonlinear dimensional transform of the input into a space where the output is linear. More importantly, this numerical vector is a fingerprint of the input that may be useful in search. The promising performance of the deep learning models in various tasks implies the embedding must represent the data in a reasonable way. Therefore, these neural network embeddings have been applied in many information retrieval systems involving images and text.20–22
For molecular data, several graph neural network (GNN) models have been proposed to learn the embeddings to represent the atomic configurations, such as the CGCNN and the GemNet model.23,24 The atomic embeddings contain information including the element type of the central atom, positions, and elemental information of the neighboring atoms. When applied in specific tasks (e.g., energy and force prediction), it is reasonable to think that neural networks could be trained to generate atomic embeddings in a space where the specific property (e.g., atomic energy) is linearly related to the embedding vectors. Therefore, the distance between the embeddings in this space could serve as a similarity measure of a specific property.
In addition to the molecular representation part, the other important component in the molecular similarity search is the search algorithm. Exhaustive search for similar vectors to a high-dimensional query vector in a large database is both time- and resource-consuming. Therefore, many approximate nearest neighbor (ANN) search methods have been proposed to find approximate results with much less time and fewer resources.25 The ANN search methods can be classified as hashing-based, quantization-based, tree-based, or graph-based methods according to the techniques used to accelerate the search process.26 Typical examples include locality-sensitive hashing, SPTAG, and ScaNN.27–29 There are also packages released to implement these ANN search algorithms such that more research work can be benefited from these ANN search methods. For example, Facebook AI Similarity Search (FAISS) is a library containing implementations for several ANN search algorithms.30
In this work, we show a method based on neural network embeddings to search for similar structures. We demonstrate that the method can be applied to any atomistic system including organic molecules, bulk materials, and adsorption systems. When combined with ANN search methods, neural network embeddings can be used to retrieve similar atomic structures efficiently in large databases. We also show that the similarity is related to the specific property which is used to train the neural network models. Therefore, this method has the potential to search for similar molecular structures in a property-oriented way.
![]() | ||
Fig. 2 The overall architecture of the modified GemNet-dT. The red dotted parts are from the original GemNet-dT and the blue parts are from the modified GemNet-dT. xca is the distance between atom a and atom c. φcab is the angle ∠cab. RBF and CBF are the spherical Fourier–Bessel bases with polynomial envelopes developed by Klicpera et al.24 Therefore, e(ca)RBF and e(cab)CBF are the distances and angles expanded into those bases. z is the element information of all atoms in the chemical system. m(l)ca is the edge embedding between atom c and atom a. Edge embeddings for all arbitrary atom c regarding atom a, will be used to predict the atomic forces of atom a. h(l)a is the atomic embedding of atom a. It will be used to predict the atomic energy of atom a and it is the atomic fingerprint used in this work. (mlca is the edge embedding between atom c and atom a. While it is not used in this work, it can be used to predict the contribution to the atomic forces of atom a from atom c). |
To obtain the GemNet atomic embedding, we trained the GemNet on the energies of each entry in the database. Once the GemNet is trained, the output atomic embedding, h(l)a, which is used to predict the atomic energy Ea, will be the atomic embedding used to describe the local environment of the atom in the chemical system.
In this work, the similarity search is operated on two aspects, at the atomic level and the molecular level. At the atomic level, we can directly use ANN search on the atom embeddings based on the Euclidean distance. However, for the nearest molecules search, we need to convert the similarity of atom embeddings into the similarity of the molecules. This process is shown in Algorithm 1. Basically, for each atom in a molecule, we search for k approximate nearest atom embeddings in a database. These atom embeddings are considered matched for the query atom. The corresponding molecules containing these matched atoms are added to a candidate set. After looping over all atoms in the query molecule, we rank the molecules in the candidate set according to the number of matched atoms in descending order (sum of Euclidean distances between the matched atom embeddings is used to break ties). Top n molecules are the n approximate nearest neighbors. Large k prefers the candidates that are globally similar to the query molecule while small k favors the molecules containing local environments with high similarity to the query molecule. We used k around 10 times n in our work.
![]() | ||
Fig. 3 Retrieved similar molecules (top 5) for benzene. Figure (a) is benzene used as the query molecule. Figures (b) to (f) show the nearest molecules in the QM9 dataset. |
Because QM9 is a molecule database, methods developed for drug discovery can also be used. Therefore, to understand the search results from our method, we compared them with the results obtained using a typical molecular similarity search method, which served as a baseline for the search results. For the molecular similarity search, we used the Morgan Fingerprint with a diameter of four,37,38 and the similarity metric used was the Tanimoto similarity.39 The Morgan fingerprint is a bit vector that essentially tells whether there are certain local structure (e.g. aromaticity, double bond, hydroxyl group, etc.) in the molecule. The Tanimoto similarity is calculated by eqn (1).
![]() | (1) |
Therefore, if two molecules contain a lot of the same local structures, the Tanimoto similarity will be high and the molecules will be considered similar.
The search results of the Morgan fingerprints and the Tanimoto coefficient are shown in Fig. 4. The main difference from the GemNet result is in Fig. 4b and c, where larger rings are retrieved in the search results. These two molecules are less similar to benzene from the aspect of atom numbers and bond angles in the ring structure. According to the top 5 nearest molecules, GemNet embedding retrieves more similar molecules than Morgan fingerprint. This shows that the kind of vector fingerprint used in the search is important.
In addition to the qualitative evaluation of the similarity by visual comparison over the elemental and geometric features, we also analyze the similarity among the molecules by investigating their relevance in the energetic embedding space. We built Gaussian process regression (GPR) models using the found molecules as the training set. The hypothesis is that training on similar molecules will result in a more accurate model more quickly for a query than training on random molecules.
We used the FLARE package as the implementation of the GPR models.40 The hyperparameters for the GPR model are provided in the ESI.† During the training process, we iteratively added the found molecules one by one into the training set and updated the GP model. Then we used the GPR model to predict the energy of benzene and compared the prediction with the true label. The results of the GPR models are shown in Fig. 5. We included the results from the training set searched using GemNet embeddings and Morgan fingerprints, as well as a set of random molecules from the QM9 dataset as the baseline. In Fig. 5, we can see that as we add more configurations into the training set, the prediction error and standard deviation are generally decreasing. However, using molecules found in different ways, the GPR models have different performances. The GPR model trained on the molecules retrieved by GemNet embeddings has the smallest prediction error (0.04 eV) and standard deviation (0.02 eV). The GPR model from Morgan fingerprints has a larger error (3.64 eV) and standard deviation (17.14 eV). The GPR model from the random molecules has the largest error and prediction uncertainty, which is 5.76 eV and 43.44 eV respectively with up to 15 configurations. These results imply that GemNet embedding has a representation of the atomic environments that is more relevant to the energetic property of the molecules. This is not surprising since the GemNet model was trained on the energy data of the molecules and the atomic energy prediction is a linear regression on the atom embeddings.
In addition to the search for a whole molecule, we can also use the GemNet embeddings to search for substructures. Here, we demonstrate an example of searching for a molecule containing similar substructures to the hydroxyl group of the butanol and the amino group of the glycine. The search procedure is similar to the method for benzene but has an additional step to join the search results from hydroxyl and amino groups, which is similar to an and operator on two sets. Fig. 6 shows the query substructures and the searched molecule. The atoms in the query and matched substructures are marked with crosses. Both the hydroxyl and amino groups are retrieved in the resulting molecule. In addition, the retrieved hydroxyl and amino groups are somehow similar to the queries. For the hydroxyl groups, they are both at the end of a three-carbon chain for the query and searched molecules. For the amino groups, they are at the end of a two-carbon chain and there is a –OH group at the other end.
Similar to the QM9 case, the whole Materials Project database was split into the training and validation sets randomly with a ratio of 0.8:
0.2. A GemNet model was trained on the training set. The energy MAE on the training and validation set was 0.62 eV and 1.42 eV respectively. There is an apparent gap between the accuracy of the GemNet model on the training and validation set. We attribute this gap to the configuration extrapolation in the validation set. At the point we stop the training, there was no increase of the MAE on the validation set along with the training steps, which implied the model was not located in the overfitting region. We then used the trained model to search for similar atomic environments in the Materials Project training dataset.
As an example query, we search for an oxygen atom in a Al2Cu3O6 bulk cell. The query and found atoms are shown in Fig. 7. The distances of the searched atoms to the query atom and their ranks are shown in Fig. 7. The query oxygen atom is atom 9 in Fig. 7a, which is closely neighboring to a copper atom (atom 3). There is also an aluminum atom (atom 0) near the query oxygen atom. These three atoms form an angle around 135° with the aluminum and copper atoms at two ends and the oxygen atom at the vertex. There is also another oxygen atom (atom 5) at the opposite position to the query oxygen atom across the copper atom. These geometric features also appear in the searched atoms in Fig. 7b (atoms 10, 13, 15, 17) and Fig. 7c (atoms 6, 8). Periodic boundary conditions should be considered when examining the geometric similarity for atom 10 and atom 13 in Fig. 7b. In Fig. 7d, atom 17 is also the found atom and it is ranked as 7th in all atomic environments although its neighboring environment looks not so similar to the query atom. This is because there are no more similar atoms like the previous ones in the remaining pool.
As shown in Fig. 7, the Euclidean distance of the atom embeddings for the searched atoms in Fig. 7b and c to the query oxygen atom ranges between 0.06–0.11. This distance jumps to 0.17 for atom 17 in Fig. 7d. The distance of the atom embeddings also implies that atom 17 of Fig. 7d is not so similar to the query oxygen atom from the view of the GemNet model. It is also worth noting that during the searching process, we did not explicitly restrict the searching pool to be oxygen atoms. This atomic identity feature was already encoded into the GemNet atomic embeddings, and this is why the retrieved atoms are all oxygen atoms in Fig. 7 although with different local environments. For more examples, please refer to the ESI.†
The first query example is an oxygen atom adsorbed on a tilted hollow site consisting of two Pd atoms and one Ag atom. The local configurations are shown in Fig. 8a. We seek examples from the database that are similar to this query. During the searching, we did not explicitly provide information about the element types of the central and surrounding atoms. Only the GemNet embeddings were used to measure the similarities. According to the search result in Fig. 8, (a zoomed-in view can be found in the ESI†) this elemental information as well as the geometric information of the adsorption site has already been encoded into the GemNet embeddings. On the one hand, the retrieved atoms are all oxygen atoms. On the other hand, the adsorption sites are all hollow sites with two Pd atoms.
![]() | ||
Fig. 8 Configurations of the query and found atoms (marked as crossed). Configuration (a) is the query oxygen and configurations (b) to (d) are the retrieved atoms. |
In addition to these apparent similar geometric features, we also present the similarity between the query atom and the searched atoms via the density of states projected to these atoms (ADOS). The ADOS data was calculated by the Vienna Ab initio Simulation Package (VASP).41 The ADOS data is shown in Fig. 9. For the searched atoms, their ADOS curves almost overlap with the query oxygen atom. Their cosine similarities are all above 0.6 (1.0 would be identical ADOS). As a comparison, we show the ADOS data of four randomly selected oxygen atoms (see detailed configurations in the ESI†) in the OC20 dataset in Fig. 10. These random atoms have different ADOS from the query atom and their cosine similarities are generally smaller than the searched ones. This example shows that the GemNet embeddings are able to search for elementally and geometrically similar local environments for a single atom adsorbed on metallic surfaces. These similarities also lead to the similarity in the density of states (DOS). This example also implies a potential application of searching for similar local structures using the projected DOS with vector search methods, since similar DOS suggests similar elemental and geometric environments, as well as potentially similar chemical properties. Storing DOS data when building a database with some extra resources would be beneficial to this kind of application in the future.
![]() | ||
Fig. 9 Density of states projected onto the p-orbital of the query and searched oxygen atoms. Figures (a) to (d) correspond the configurations (a) to (d) in Fig. 8. The blue curve is the original DOS energy and density data. The orange line is the linearly interpreted data from the original DOS data to make the energy stamps to be the same across the configurations. Cosine similarity was calculated using the interpreted data. |
Next, we demonstrate that our method not only works for simple atoms like oxygen, but also for larger adsorbates like acetylene. In the OC20 dataset, we search for similar atoms with embeddings similar to that of the two carbon atoms in the query acetylene. We did not include the searching for similar atom embeddings to the hydrogen atoms since the carbon atom is the main feature of acetylene. Ignoring hydrogen atoms is also adopted in other molecular fingerprints like the SMILES.10 The query and searched configurations are shown in Fig. 11. The query object is an acetylene molecule adsorbed on a hollow site formed by three Cu atoms. The retrieved adsorption configurations are similar to the query one. The first point is that the found adsorbates are all acetylene without explicitly setting the search pool to be acetylene molecules. The second point is that the adsorption sites of the top two results (Fig. 11b and c) are hollow sites with three Cu atoms which are the same as the query one. This is not so clear in Fig. 11b, more details of the local structure can be found in the ESI.†
![]() | ||
Fig. 11 Configurations of the query (config. a) and top 3 retrieved acetylene adsorption configurations (config. b to d). The query and matched carbon atoms are marked as crossed. |
Similar to the oxygen case, we also compare the ADOS of the query and searched configurations. Fig. 12 shows the ADOS of a selected carbon atom of acetylene molecule in these systems. We can see the ADOS of the searched configurations are similar to the query one, and their cosine similarities are all above 0.65 which is much higher than that of four randomly selected configurations shown in Fig. 13. The similarities in terms of the adsorbates, adsorption sites, and DOS between the query and searched configurations suggest that our method also works well for adsorption systems with large adsorbates.
![]() | ||
Fig. 12 Density of states projected onto the p-orbital of the selected query and searched carbon atoms. Figures (a) to (d) correspond the configurations (a) to (d) in Fig. 11. The blue curve is the original DOS energy and density data. The orange line is the linearly interpreted data from the original DOS data to make the energy stamps to be the same across the configurations. Cosine similarity was calculated using the interpreted data. |
The results from the OC20 examples demonstrate that the method is able to find atoms in similar chemical environments illustrated by their similar ADOS. This could be very useful for catalyst design. One potential direction would be to find cheaper alternative catalyst materials which could maintain similar chemical environments for the adsorbates.
We illustrated the idea with examples across organic molecules, bulk systems, and solid surfaces with adsorbates, and showed the ability of this framework to find similar configurations in different databases. We presented the similarities from different aspects: elemental types, geometric features, energetic relevance, and the electronic density of states. These examples also reflect the generalizability of this framework for different types of atomistic systems.
Footnotes |
† Electronic supplementary information (ESI) available. See https://doi.org/10.1039/d2dd00055e |
‡ These authors contributed equally to this work. |
This journal is © The Royal Society of Chemistry 2022 |