DeepStruc: towards structure solution from pair distribution function data using deep generative models

Structure solution of nanostructured materials that have limited long-range order remains a bottleneck in materials development. We present a deep learning algorithm, DeepStruc, that can solve a simple monometallic nanoparticle structure directly from a Pair Distribution Function (PDF) obtained from total scattering data by using a conditional variational autoencoder. We first apply DeepStruc to PDFs from seven different structure types of monometallic nanoparticles, and show that structures can be solved from both simulated and experimental PDFs, including PDFs from nanoparticles that are not present in the training distribution. We also apply DeepStruc to a system of hcp, fcc and stacking faulted nanoparticles, where DeepStruc recognizes stacking faulted nanoparticles as an interpolation between hcp and fcc nanoparticles and is able to solve stacking faulted structures from PDFs. Our findings suggest that DeepStruc is a step towards a general approach for structure solution of nanomaterials.


B: Fitting parameters and mean absolute error (MAE) measures of reconstructed MMNPs for analysis of simulated pair distribution functions (PDFs)
In the PDF comparisons a scale factor, a contraction/expansion factor and an isotropic atomic displacement parameter (ADP) are refined. This approach is taken in brute-force methods such as cluster mining1 to decide on the best cluster for a given measured PDF. It accounts for small differences in bond length, scale factor and thermal motion without changing the geometric arrangement of the atoms in the clusters. The number of atoms in the particles was calculated as:

$$N = \frac{V \cdot \mathrm{APF}}{\frac{4}{3}\pi r^{3}},$$

where V is the volume of the particle, r is the atomic radius and APF is the atomic packing fraction, which for fcc is 0.740. This yields 203 atoms (1.8 nm), 371 atoms (2.2 nm) and 1368 atoms (3.4 nm).
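As a worked check of this formula, the quoted atom counts are reproduced with an atomic radius of about 1.39 Å. This radius is back-calculated from the numbers above; the element and exact radius are not stated in the text, so it is our assumption:

```python
# Worked check of the atom-count formula N = V * APF / V_atom.
# The atomic radius of 1.385 Angstrom is back-calculated from the quoted
# atom counts and is an assumption; it is not stated in the text.
import numpy as np

def n_atoms(diameter_nm, r_atom=1.385, apf=0.740):
    """Estimate the number of atoms in a spherical fcc particle."""
    R = diameter_nm * 10.0 / 2.0                  # particle radius in Angstrom
    v_particle = 4.0 / 3.0 * np.pi * R ** 3       # V in the formula above
    v_atom = 4.0 / 3.0 * np.pi * r_atom ** 3      # volume of a single atom
    return v_particle * apf / v_atom

for d in (1.8, 2.2, 3.4):
    print(f"{d} nm -> ~{n_atoms(d):.0f} atoms")   # 203, 371, 1368
```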

D: Implementation of the sampling and fitting process of the predicted structures from latent space
During the inference process, a PDF is embedded into the latent space as a normal distribution by the prior neural network. Samples drawn from this normal distribution can be decoded into predicted structures, which are then fitted to the PDF. For the simulated test set (seven structure types and stacking faulted structures), only a single sample was drawn from the latent space for each PDF. For the experimental data, in order to explore the latent space more extensively, we scale the standard deviation (σ) of the normal distribution by factors of 3, 5 and 7.
We then sampled 1000 predicted structures from each of the three scaled normal distributions (σ×3, σ×5 and σ×7) and fitted them to the data. For the experimental data we report the best-fitting structure within each of the distributions (Supplementary Information section E). Fig. S3 shows histograms of the x- and y-positions in latent space when sampling around a mean latent-space position of (9.7, 14.8) using the σ×3 distribution for Au144(PET)60.
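A minimal NumPy sketch of this sampling scheme is given below. It assumes a prior network that outputs a per-dimension mean and standard deviation; the σ value used here is illustrative, as only the mean position is quoted in the text:

```python
# Sketch of the latent-space sampling described above. The predicted
# standard deviation (sigma) is hypothetical; only the mean latent-space
# position (9.7, 14.8) for Au144(PET)60 is quoted in the text.
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([9.7, 14.8])       # mean latent position (from the text)
sigma = np.array([0.1, 0.1])     # hypothetical predicted std. dev.

def sample_latent(mu, sigma, scale, n=1000):
    """Draw n latent points from N(mu, (scale * sigma)^2 I)."""
    return rng.normal(loc=mu, scale=scale * sigma, size=(n, mu.size))

# 1000 candidate latent points per scaled distribution; each point is
# decoded into a structure and fitted to the experimental PDF, and the
# best fit within each distribution is reported.
candidates = {k: sample_latent(mu, sigma, k) for k in (3, 5, 7)}
```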
Here we sample from a normal distribution, but this could be optimized in many ways, e.g. by drawing structures from a uniform distribution, by sampling from latent-space regions that are well populated by the test set, by favouring sampling from regions of the latent space that are populated by underrepresented structures, or by introducing chemical knowledge into the sampling process. Another way to alter the latent space would be to deviate from the isotropic normal distributions. The latent distribution of the CVAE is constrained to be a symmetric normal distribution (with a diagonal covariance matrix). In some cases this might be limiting, and one could consider allowing additional flexibility in the latent-space structure. One way of doing this is to relax the diagonal covariance assumption, so that asymmetric normal distributions can be modelled. This has previously been studied by Jakub et al.,3 where more complex structures for the covariance matrix are modelled for VAE-type models. This additional flexibility of the latent distribution increases the complexity of the learning problem but could also provide more efficient exploration of the latent space.
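One possible realization of such a non-diagonal latent distribution is to let the encoding network predict a full Cholesky factor of the covariance matrix. The PyTorch sketch below illustrates the idea; the module and layer names are our own and not part of DeepStruc:

```python
# Hedged sketch of a full-covariance latent Gaussian: the network predicts
# a mean mu and the entries of a lower-triangular Cholesky factor L, with
# Sigma = L L^T, instead of only a diagonal standard deviation.
import torch
from torch import nn

class FullCovLatent(nn.Module):
    """Hypothetical head mapping encoder features h to N(mu, L L^T)."""

    def __init__(self, h_dim, z_dim):
        super().__init__()
        self.z_dim = z_dim
        self.mu = nn.Linear(h_dim, z_dim)                        # mean
        self.chol = nn.Linear(h_dim, z_dim * (z_dim + 1) // 2)   # L entries

    def forward(self, h):
        mu = self.mu(h)
        vals = self.chol(h)
        rows, cols = torch.tril_indices(self.z_dim, self.z_dim)
        off = rows != cols
        L = torch.zeros(h.size(0), self.z_dim, self.z_dim, device=h.device)
        L[:, rows[off], cols[off]] = vals[:, off]                # off-diagonal
        diag = torch.arange(self.z_dim, device=h.device)
        # softplus keeps the diagonal positive, so L is a valid Cholesky factor
        L[:, diag, diag] = nn.functional.softplus(vals[:, ~off]) + 1e-5
        dist = torch.distributions.MultivariateNormal(mu, scale_tril=L)
        return dist.rsample(), dist                              # reparameterized z
```

The KL term of the CVAE objective can still be evaluated with torch.distributions.kl_divergence, at the cost of O(z_dim²) extra parameters per latent distribution.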

E: Comparing DeepStruc with baseline algorithms
To evaluate the results, we compare DeepStruc with two different approaches that can be used to identify the structural model.
The first approach, further described in part E.1, is the brute-force approach proposed by Banerjee et al.,1 which fits all constructed MMNP structures to the dataset and reports the Rwp value. The second approach, further described in part E.2, is a tree-based supervised learning algorithm trained on the same set of MMNP structures as DeepStruc, but using 100 PDFs simulated with different parameters for each MMNP structure. The range of simulation parameters used is shown in Table S5. While the brute-force method directly provides the best fit, it is computationally expensive. The tree-based algorithm can, like DeepStruc, predict the chemical structure from its PDF in less than a second. However, neither the brute-force approach nor the tree-based algorithm can map out a low-dimensional space of chemical structures that can be used to analyse similarities between structures. They are also constrained to the structural database, whereas the regular CVAE without a graph-based input and the DeepStruc algorithm have generative capabilities, which make it possible to interpolate and to extrapolate slightly beyond the training distribution. Fig. S2 compares the results of the brute-force, tree-based and DeepStruc approaches. The fits from the brute-force approach have a slightly lower Rwp value than those from the tree-based approach and DeepStruc. However, once the Machine Learning (ML) algorithms have been trained, the brute-force approach is at least 3 orders of magnitude slower and consumes at least 4 orders of magnitude more CO2 (see section E.3). The training process of the ML algorithms only has to be performed once, whereas the brute-force approach has to be repeated for every experimental dataset that is collected.
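For reference, Rwp denotes the standard weighted agreement factor for PDF fits, commonly defined in terms of the observed and calculated PDFs as

$$R_{wp} = \sqrt{\frac{\sum_i \left[G_{\mathrm{obs}}(r_i) - G_{\mathrm{calc}}(r_i)\right]^2}{\sum_i G_{\mathrm{obs}}(r_i)^2}},$$

so that lower values indicate better agreement between model and data.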

E.1.: Brute-force modelling
Brute-force cluster mining was done by creating MMNPs with the Atomic Simulation Environment (ASE) Python library6 and fitting them to the data iteratively in DiffPy-CMI,5 from 0 to 30 Å, as proposed by Banerjee et al.1 The advantage of the brute-force approach is that it directly yields the Rwp value of the fit of each structure to the dataset. The disadvantage is that it is computationally expensive. In this project, the results of the brute-force approach are used as a baseline for the ML predictions.
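A condensed sketch of such a brute-force loop is shown below, using ASE's cluster builders and the diffpy.srreal PDF calculator from DiffPy-CMI. The candidate list, file names and the scale-only refinement are illustrative simplifications; the full approach also refines an expansion factor and an ADP (cf. section B):

```python
# Hedged sketch of the brute-force baseline: build candidate MMNP clusters
# with ASE, simulate their PDFs with DiffPy-CMI, and rank them by Rwp
# against the measured PDF. The candidate grid here is illustrative, not
# the exact set used by Banerjee et al.
import numpy as np
import ase.io
from ase.cluster import Icosahedron, Decahedron, Octahedron
from diffpy.structure import loadStructure
from diffpy.srreal.pdfcalculator import PDFCalculator

r_obs, g_obs = np.loadtxt("measured.gr", unpack=True)   # measured PDF

candidates = {
    "ico_3shell": Icosahedron("Au", noshells=3),
    "dec_331": Decahedron("Au", p=3, q=3, r=1),
    "oct_5": Octahedron("Au", length=5),
}

pc = PDFCalculator(rmin=0.0, rmax=30.0, rstep=r_obs[1] - r_obs[0])

def rwp(g_obs, g_calc):
    """Weighted residual between observed and (LSQ-scaled) calculated PDF."""
    scale = np.dot(g_obs, g_calc) / np.dot(g_calc, g_calc)
    return np.sqrt(np.sum((g_obs - scale * g_calc) ** 2) / np.sum(g_obs ** 2))

results = {}
for name, cluster in candidates.items():
    ase.io.write(f"{name}.xyz", cluster)          # bridge ASE -> diffpy
    stru = loadStructure(f"{name}.xyz")
    r_calc, g_calc = pc(stru)
    g_interp = np.interp(r_obs, r_calc, g_calc)   # put on the measured r-grid
    results[name] = rwp(g_obs, g_interp)

best = min(results, key=results.get)
print(f"best candidate: {best}, Rwp = {results[best]:.3f}")
```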

E.2.: Using a tree-based classification algorithm to predict an MMNP from a PDF
A gradient boosting decision tree (GBDT) algorithm was trained to classify the MMNP structure from a PDF.7 For each MMNP, 130 PDFs were simulated with parameters in the range shown in Table S5. The GBDT algorithm was trained on 100 of the PDFs, 15 of the PDFs were used for validating the model during the training process, and 15 of the PDFs were used to calculate the accuracy of the model after the training process (test set). The model was trained with a learning rate of 0.15, a maximum depth of 3 and an early-stopping criterion of 5 rounds with no improvement; a minimal training sketch is given at the end of this section.

Table S2 | DiffPy-CMI simulation parameters for the PDF data used to train the tree-based classifier.
The loss curve (Fig. S5) shows that the GBDT did not improve significantly after about 100 epochs. The model had difficulty with the icosahedral structure, which we believe is because it was underrepresented in the training data; this could be fixed with data augmentation.
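A minimal sketch of such a GBDT baseline with the stated hyperparameters is shown below. XGBoost and the file names are our assumptions, as the text only specifies that a GBDT was used:

```python
# Hedged sketch of the GBDT baseline: classify the MMNP structure type
# from a PDF represented as a fixed-length G(r) vector. XGBoost is an
# assumed implementation; the file names are illustrative.
import numpy as np
from xgboost import XGBClassifier

# X: simulated PDFs (n_samples, n_r_points); y: integer structure labels.
X_train, y_train = np.load("pdfs_train.npy"), np.load("labels_train.npy")
X_val, y_val = np.load("pdfs_val.npy"), np.load("labels_val.npy")

model = XGBClassifier(
    learning_rate=0.15,       # as stated in the text
    max_depth=3,              # as stated in the text
    n_estimators=1000,        # upper bound; early stopping decides the rest
    early_stopping_rounds=5,  # 5 rounds with no improvement (xgboost >= 1.6)
)
model.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)
print("validation accuracy:", model.score(X_val, y_val))
```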

G: Simulation parameters of the PDFs
All PDFs used for conditioning DeepStruc were simulated using the DiffPy-CMI library.9

Table S5 | DiffPy-CMI simulation parameters for the PDFs of the seven structure types used in Figs. 1-5.

[…] four neighbours, each with a bond distance of approximately 2.76 Å. The graph representation is a mathematical approach to describing a chemical structure which maintains the interatomic relationships (edges) under translation, rotation and permutation. By adding the xyz-coordinates to the nodes and linking them through the adjacency matrix, the geometrical information is transferred to the learning process of DeepStruc.
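A minimal sketch of this graph construction is given below, assuming a simple distance cutoff to define the edges; the 3.0 Å cutoff is illustrative:

```python
# Sketch of the node/adjacency construction described above: nodes carry
# the xyz-coordinates and the adjacency matrix links atoms whose
# interatomic distance falls within a bonding cutoff (3.0 Angstrom is an
# illustrative choice that captures a ~2.76 Angstrom bond distance).
import numpy as np

def structure_to_graph(xyz, cutoff=3.0):
    """Return node features (N, 3) and adjacency matrix (N, N)."""
    dists = np.linalg.norm(xyz[:, None, :] - xyz[None, :, :], axis=-1)
    adjacency = (dists < cutoff) & (dists > 0)   # exclude self-loops
    return xyz, adjacency.astype(float)
```

Because the adjacency matrix depends only on interatomic distances, the edges are preserved under translation and rotation of the cluster, which is the invariance the text refers to.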