Andy S. Anker,a Keith T. Butler,b Raghavendra Selvan,cd and Kirsten M. Ø. Jensen*a
aDepartment of Chemistry and Nano-Science Center, University of Copenhagen, 2100 Copenhagen Ø, Denmark. E-mail: kirsten@chem.ku.dk
bDepartment of Chemistry, University College London, Gower Street, London WC1E 6BT, UK
cDepartment of Computer Science, University of Copenhagen, 2100 Copenhagen Ø, Denmark
dDepartment of Neuroscience, University of Copenhagen, 2200 Copenhagen N, Denmark
First published on 22nd November 2023
The rapid growth of materials chemistry data, driven by advancements in large-scale radiation facilities as well as laboratory instruments, has outpaced conventional data analysis and modelling methods, which can require enormous manual effort. To address this bottleneck, we investigate the application of supervised and unsupervised machine learning (ML) techniques for scattering and spectroscopy data analysis in materials chemistry research. Our perspective focuses on ML applications in powder diffraction (PD), pair distribution function (PDF), small-angle scattering (SAS), inelastic neutron scattering (INS), and X-ray absorption spectroscopy (XAS) data, but the lessons that we learn are generally applicable across materials chemistry. We review the ability of ML to identify physical and structural models and extract information efficiently and accurately from experimental data. Furthermore, we discuss the challenges associated with supervised ML and highlight how unsupervised ML can mitigate these limitations, thus enhancing experimental materials chemistry data analysis. Our perspective emphasises the transformative potential of ML in materials chemistry characterisation and identifies promising directions for future applications. The perspective aims to guide newcomers to ML-based experimental data analysis.
Most applications of ML in materials chemistry apply supervised ML methods. Supervised ML is broadly the task of predicting a label based on a given set of input features. As will be exemplified throughout the perspective, we observe three main applications of supervised ML for the analysis of scattering and spectroscopy data: (1) identifying a physical model from a scattering or spectroscopy dataset (main application 1, Fig. 2). Here, scattering or spectroscopy data are the input features, and the model is supervised to relate the datasets to the physical models, which are the labels. (2) Predicting scattering or spectroscopy data from a physical model. This can be achieved by using the data as labels and the physical model as input features (main application 2, Fig. 2). (3) Bypassing the model refinement step to directly obtain structural information (main application 3, Fig. 2). This is done by training the supervised ML model on data with varying structural parameters.
To train an ML model using supervised methods, one needs a dataset consisting of many pairs of labels and input features. This dataset, consisting of e.g., structure models and simulated data, is generally split into a training, validation, and test set, often in a 3:1:1 ratio. While it is critical that the data closely mirror real-world, experimental data, labelled experimental datasets that can be used for training are not widely available. Due to this, one often resorts to simulated data that are designed to resemble experimental data. The model is trained on the training set while being continuously evaluated on the validation set, using a user-defined objective function called the loss function. Depending on the chosen class of models, the training will often improve on the training set until the model can fit any trend in the data, including noise (overtraining). The validation set is used to ensure that training is stopped before the model is overtrained. Once training is complete, the test set, which has not been used during training or validation, is employed to estimate the accuracy of the model on future unseen data (generalisation). It is critical that the test set closely mirrors experimental data in order to trust the reported accuracy. An intriguing possibility is to gather extensive experimental datasets from known structural models, which could serve as a training set for a structure-to-signal ML model (Fig. 2B) that thereby learns to include experimental effects that are otherwise challenging to simulate. The quality and size of the training set thus play a crucial role in the model's efficiency and accuracy, with larger, higher-quality datasets typically yielding better results. A model's ability to interpolate and extrapolate, i.e., to make predictions within the range of the training data and beyond it, is generally influenced by the ML algorithm and the range and diversity of the training set. Many factors therefore need consideration when selecting and training an ML model. These include the choice of ML algorithm (tree-based methods, neural networks (NNs), genetic algorithms, etc.),28,29 the number of parameters in the ML model, and both the quality and quantity of the training set. The model's ease of training and deployment can be influenced by the choice of ML algorithm. Interpretability of the model depends strongly on the algorithm used: for example, an individual decision tree is easily interpretable, whereas a deep neural network with millions of parameters is not and requires post hoc methods to understand its operation.30 When it comes to scalability, NNs have many more trainable parameters than tree-based methods. This makes tree-based methods efficient learners in the small-data regime; however, NNs often prove more effective at handling larger datasets. NNs are today commonly trained on large datasets, for example as the backbone of the GPT models31–33 and AlphaFold.34 This superiority in scalability might explain why NNs have become the predominant ML algorithm for structural analysis as large databases of training data have become increasingly available.
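As a concrete illustration of the 3:1:1 split described above, the following minimal Python sketch uses scikit-learn. The array sizes, the seven-class labels, and all variable names are illustrative placeholders, not part of any published workflow.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative arrays: 5000 simulated patterns (input features) and
# their structural labels; in practice these would come from a
# simulation pipeline designed to mimic experimental data.
X = np.random.rand(5000, 512)       # e.g. 512-point scattering patterns
y = np.random.randint(0, 7, 5000)   # e.g. 7 hypothetical structure classes

# 3:1:1 split: first hold out 40% of the data for validation + test...
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.4, random_state=42, stratify=y)
# ...then split that 40% evenly into validation and test sets.
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, random_state=42, stratify=y_tmp)

print(len(X_train), len(X_val), len(X_test))  # 3000 1000 1000
```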
While training an ML model can be computationally expensive, this is a one-time cost. Subsequent predictions using the ML model can be computationally inexpensive and integrated into web-based solutions, or can be performed at synchrotron or neutron facilities for real-time data analysis while experiments are in progress.
However, supervised ML is limited by its reliance on paired input data and labels for training, which can be challenging to obtain for experimental data analysis. As will be discussed and exemplified below, we observe three common challenges encountered when analysing experimental scattering and spectroscopy data with supervised ML. These are illustrated in Fig. 3: challenge (1), handling data with contributions from multiple chemical components; challenge (2), handling data arising from structures not present in the training database; and challenge (3), accounting for experimental data that contain signals not included in the simulated data. In all three scenarios, the labelled data are inadequate for solving the problem at hand, making unsupervised ML methods a more suitable alternative, or a complementary tool. Unsupervised ML models work without paired labels and input features, using only input features or intermediate input-derived labels, such as in autoencoders.35 Unsupervised ML is often used to represent complex data in a low-dimensional space (dimensionality reduction), enabling the analysis of similarities between high-dimensional datasets, clustering, and the extraction of underlying data trends that are difficult to comprehend in the input representation space.28 Unsupervised methods can also be applied to ‘demix’ data, i.e., to separate the signal from each component in a multi-phase scattering or spectroscopy dataset.
In the following sections, we use selected examples to provide an overview of how supervised ML has been used to identify structural models and structural information from experimental powder diffraction (PD), PDF, small-angle scattering (SAS), inelastic neutron scattering (INS) and X-ray absorption spectroscopy (XAS) data, or to predict the dataset from a structure. We also outline and exemplify how unsupervised ML has been applied to address the three challenges presented in Fig. 3, and we discuss the potential future impact of ML in the analysis of experimental materials chemistry data.
Conventional analysis of XAS data requires expertise in complex data analysis as well as extensive manual work. To address this, Zheng et al. created a large XANES database, XASdb, with more than 800 000 computed reference XANES entries from over 40 000 materials from the open-science Materials Project database.36 Their supervised ML model, ELSIE, illustrated in Fig. 4, was used for the analysis of XANES data. Given a XANES spectrum as input, it outputs a list of the chemical compounds whose spectra are most similar to the target spectrum (main application 1, Fig. 2). From these compounds, chemical information such as oxidation state and coordination environment can be extracted. ELSIE predicts the chemical compound with 69.2% top-5 accuracy on a test set of 13 simulated XANES spectra, while the correct oxidation state is predicted with 84.6% accuracy and the coordination environment with 76.9% accuracy. As we illustrate with challenge 3 in Fig. 3, the ML model's accuracy is lower on experimental data. On six experimental XANES spectra, ELSIE predicts the oxidation state with 83.3% accuracy, the coordination environment with 83.3% accuracy and the chemical compound with 33.3% top-5 accuracy.37 While the predictions from ELSIE demonstrate some accuracy, they are yet to achieve the reliability of conventional methods that rely on direct comparison to measured references, i.e., XANES spectra measured on compounds with well-known oxidation states. However, reference comparison, though grounded in empirical data, can also be inaccurate, as both oxidation state and structure affect the XANES pattern, which makes it challenging to choose the chemical compounds from which the reference pattern is measured. These results highlight the impact of large databases like the open-science Materials Project36 and JARVIS.38 As these databases grow, they will likely catalyse supervised ML analysis of scattering and spectroscopy data in materials chemistry, for example by achieving higher accuracy in oxidation state determination from a XANES pattern. An optimal path forward might combine the ever-improving predictive capabilities of ML models like ELSIE with the established reliability of conventional reference matching.
Fig. 4 Workflow schema of the Ensemble-Learned Spectra IdEntification (ELSIE) algorithm. The ELSIE algorithm consists of two steps. In the first step, the absorbing species is identified and used to narrow down the candidate computed reference spectra. In the second step, the algorithm yields a rank-ordered list of computational spectra according to similarity with respect to the target spectrum. The figure is adapted from Zheng et al.37 (Under Creative Commons Attribution 4.0 International License https://creativecommons.org/licenses/by/4.0/).
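The second step of such a workflow, ranking reference spectra by similarity to a target spectrum, can be sketched in a few lines of Python. This is a minimal illustration of the general idea, not the published ELSIE implementation: all spectra are interpolated onto a common energy grid and ranked by cosine similarity.

```python
import numpy as np

def rank_references(target_e, target_mu, references, n_grid=200):
    """Rank reference spectra by cosine similarity to a target XANES
    spectrum. `references` maps a compound name to an (energy,
    absorption) pair; all spectra are interpolated onto a common
    energy grid before comparison."""
    grid = np.linspace(target_e.min(), target_e.max(), n_grid)
    t = np.interp(grid, target_e, target_mu)
    t /= np.linalg.norm(t)
    scores = {}
    for name, (e, mu) in references.items():
        r = np.interp(grid, e, mu)
        r /= np.linalg.norm(r)
        scores[name] = float(np.dot(t, r))
    # Highest similarity first, e.g. for a top-5 shortlist.
    return sorted(scores.items(), key=lambda kv: -kv[1])
```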
The above example shows that large XANES databases, like XASdb or the XAS data distributed via the Materials Project,39 can be used to address the spectrum-to-structure problem, as illustrated with main application 1 in Fig. 2. However, such databases can also be used to address the inverse problem, structure-to-spectrum, as illustrated with main application 2 in Fig. 2. Calculating a XANES spectrum from a structure can be computationally demanding (CPU hours), but by using a supervised ML model to predict the spectrum, this process can be done in milliseconds to seconds.40–42 One advantage of direct spectrum calculations is that they can be performed for hypothetical structures, whereas structure-to-spectrum ML models are limited by the composition of their training set.
Supervised ML can also be used to directly predict chemical information (main application 3, Fig. 2), such as the average size, shape, morphology and oxidation state of metallic nanoparticles43,44 and metallic oxides,45 Bader charge,46 mean nearest-neighbour distances,46 and local chemical environment47 from a XANES spectrum, or to predict the radial distribution function from experimental EXAFS data.48–52
Analysing XAS data from samples containing more than one chemical species remains a challenge (challenge 1 in Fig. 3): supervised ML models trained on data simulated from a single chemical species are constrained to experimental data from individual chemical species, and attempting to account for all possible chemistries, e.g. by training on simulated data from mixed samples, leads to a combinatorial explosion. Instead, linear unsupervised ML techniques like principal component analysis (PCA) and non-negative matrix factorisation (NMF) have been used to discover trends in large XAS datasets and separate them into signals from their respective chemical components.53–58 For example, Tanimoto et al. used NMF to identify and map spatial domains from absorption spectra in 2D-XAS images of lithium-ion batteries.56 The authors recognised that NMF can be challenged by background effects, as these can be predominant in some of the NMF-extracted components. Therefore, they subtracted a reference X-ray absorption spectrum obtained on Li0.5CoO2, which also contains the background signal. This trick enables the NMF method to distinguish small differences in the spectra.
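In practice, NMF-based demixing amounts to factorising a non-negative matrix of measured spectra into component spectra and mixing weights. A minimal sketch with scikit-learn follows; the matrix sizes, the two-component choice and the synthetic data are illustrative assumptions, not taken from any of the studies above.

```python
import numpy as np
from sklearn.decomposition import NMF

# Illustrative data matrix: 500 measured spectra of 300 energy points,
# assumed to be non-negative mixtures of a few pure-phase spectra.
rng = np.random.default_rng(0)
pure = rng.random((2, 300))            # two hypothetical pure components
weights = rng.random((500, 2))         # mixing fractions per measurement
X = weights @ pure + 0.01 * rng.random((500, 300))  # noisy mixtures

# Note: NMF requires non-negative input, so any background subtraction
# (as in the Tanimoto et al. trick) must keep the matrix non-negative.
model = NMF(n_components=2, init="nndsvda", max_iter=1000, random_state=0)
W = model.fit_transform(X)   # per-spectrum component fractions
H = model.components_        # demixed component spectra
```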
Traditional SAS data fitting is done by refining a structural model against the data, as illustrated in Fig. 1. The structural model must describe the particle shape, size, and size distribution, as well as possible agglomeration of e.g., nanoparticles or large molecules in the sample, and considerable work is often needed to decide on a suitable structural model. This is usually done by manually comparing candidate models to the data, a step that can be time-consuming and prone to errors. For example, the structure refinement can become stuck in local minima.60 Here, ML can assist by providing a more efficient approach to determine the starting model for structure refinement. The Computational Reverse-Engineering Analysis for Scattering Experiments Genetic Algorithm (CREASE-GA) tool, developed by members of Prof. A. Jayaraman's research group, can reconstruct 3D structures from SAS patterns using a genetic algorithm (main application 1, Fig. 2).61–67 CREASE-GA compares the goodness-of-fit between the experimental SAXS pattern and simulated SAS patterns derived from a population of 3D structures. A genetic algorithm29 is then used to update the 3D structure population to better describe the experimental SAS pattern. This process continues until convergence, determining the 3D structure of the sample in question. Originally, the SAS patterns from the 3D structure population were calculated using the Debye scattering equation. This posed a computational bottleneck for CREASE-GA, as the computational time of the Debye scattering equation scales with the square of the number of scatterers.60 To address this, the authors have recently managed to accelerate CREASE by over 95% by employing NNs to estimate the SAS patterns (main application 2, Fig. 2).62,65 While NNs cannot match the accuracy of the Debye scattering equation in simulating SAS patterns, they offer an additional advantage: the NN learns concurrently from the 3D structure population and their corresponding SAS patterns to predict 3D structures that align more closely with the experimental SAS pattern. Heil et al.65 show that the accelerated CREASE method achieves similar, and sometimes superior, results when modelling the particle size distribution and degree of aggregation from experimental data obtained from a one-component (melanin) nanoparticle solution and a binary (melanin and silica) nanoparticle assembly.
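The quadratic bottleneck mentioned above is easy to see in code. Below is a minimal numpy implementation of the Debye scattering equation for identical point scatterers (all form factors set to 1); this is a pedagogical sketch, not the CREASE implementation.

```python
import numpy as np
from scipy.spatial.distance import pdist

def debye_intensity(xyz, q):
    """Debye scattering equation, I(Q) = sum_ij sin(Q r_ij)/(Q r_ij),
    for N identical point scatterers. `xyz` is an (N, 3) array of
    positions in angstrom; `q` is a 1D array of momentum transfers in
    1/angstrom. The cost scales as N^2, which is the bottleneck that
    the NN-accelerated CREASE avoids."""
    r = pdist(xyz)                       # all N(N-1)/2 pair distances
    qr = np.outer(q, r)
    # np.sinc(x) = sin(pi x)/(pi x), so sinc(qr/pi) = sin(qr)/qr.
    cross = 2.0 * np.sinc(qr / np.pi).sum(axis=1)
    return len(xyz) + cross              # self terms (i == j) give N

q = np.linspace(0.01, 1.0, 200)
xyz = np.random.rand(300, 3) * 50.0      # a toy 300-atom configuration
I = debye_intensity(xyz, q)
```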
Supervised ML has also proved to be an efficient tool for the direct extraction of parameters from SAS data (main application 3, Fig. 2) that might be difficult or time-consuming for humans to determine, such as orientation,68 shape,69–71 or the model for SAS form factor fitting.72–74 For example, the Scattering Ai aNalysis (SCAN) tool can predict the model for SAS form factor fitting from a SAXS pattern obtained from a nanoparticle. With the SCAN tool, the user can choose from a range of ML algorithms, including tree-based algorithms and NNs. These algorithms individually achieve accuracies between 27.4% and 95.9%, quantified on a test set of simulated SAXS data; however, when the ML models are combined, they achieve an accuracy of 97.3% on the same test set. We are grateful to the authors for making SCAN open source, which has made it possible to implement it as a Hugging Face app.75 This makes it easily usable, also for users without programming experience, as illustrated in Fig. 5. Since SCAN can analyse SAXS data in seconds in an automated manner, an obvious use case would be to follow nanoparticle shape changes during an in situ SAXS experiment. This is not possible with conventional structure refinement methods, which require user input.
Fig. 5 (A) The SCAN74 tool directly predicts structural information such as particle shape from a SAXS pattern. (B) Overview of the SCAN74 tool's ease of use for predicting structural information from a SAS pattern through the Hugging Face app.75 Simply click “Browse files”, wait for the model to predict the structural information, and, if needed, download the detailed information in an Excel sheet.
Recent advances in ML techniques offer promising new opportunities in PD data analysis. For example, it has been demonstrated that a sample's crystal system and space group can be predicted from X-ray PD or electron diffraction data using NNs21,26,76–78 and tree-based techniques (main application 1, Fig. 2).79 Suzuki et al. demonstrated that an advantage of the tree-based ML approaches is that they are interpretable.79 Interpretability enables us to understand the ML model's prediction mechanism and thus analyse when it predicts differently from a human expert. This can either indicate when the ML model is wrong and needs to be corrected or reveal when it uncovers unexpected correlations that may lead to scientific insights. In the study by Suzuki et al., it was revealed that the ML model leveraged specific parameters—namely, the number of peaks present in the PD and the Q-value of the 3rd peak—to discern whether the data were derived from cubic or non-cubic structures. This approach mirrors the analytical strategies typically adopted by human researchers and hence builds a degree of trust in the predictions generated by the ML model.
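The interpretability argument of Suzuki et al. can be made concrete with a toy example: a shallow decision tree trained on two hand-crafted features, the number of peaks and the Q-value of the third peak, whose learned rule can be printed and read directly. All data below are synthetic stand-ins, not the features or thresholds of the published model.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in features: [number of peaks, Q of the 3rd peak].
# High-symmetry (cubic) structures tend to show fewer distinct peaks.
rng = np.random.default_rng(1)
n = 400
n_peaks = np.where(rng.random(n) < 0.5,
                   rng.integers(5, 15, n),    # "cubic-like" patterns
                   rng.integers(20, 60, n))   # "non-cubic-like" patterns
q3 = rng.uniform(1.0, 4.0, n)
X = np.column_stack([n_peaks, q3])
y = (n_peaks >= 20).astype(int)   # 0 = cubic, 1 = non-cubic (toy rule)

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
# Unlike a deep NN, the learned decision rule can be read directly:
print(export_text(tree, feature_names=["n_peaks", "Q_3rd_peak"]))
```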
However, the above ML models can only be used to determine the crystal system or the space group from PD data. To identify a full structural model for e.g., structure refinement, the unit cell and unit cell contents must also be predicted. Garcia-Cardona et al.80 made progress towards this for neutron diffraction data, where the crystal system (cubic, tetragonal, trigonal, monoclinic, and triclinic) could be predicted with an accuracy of 92.65% (main application 1, Fig. 2) using convolutional NNs, a type of NN that captures the relationship between neighbouring data points, e.g. neighbouring intensities in the diffraction pattern. Subsequently, another supervised tree-based ML model was used to predict the unit cell parameters (unit cell lengths and angles) from the data.80 The authors note that the ML models perform well on simulated data, but more sophisticated models are required before they are applicable to experimental data (challenge 3, Fig. 3). Progress is being made in developing ML models capable of precisely identifying a full structure model, including the unit cell content, as required for e.g., Rietveld refinement of experimental PD data.25,81–83 One such example is the probabilistic convolutional NN known as XRD-AutoAnalyzer,83 which achieves 93.4% accuracy on phase identification from experimental PD patterns while providing uncertainties between 0 and 100%.84 Here, an X-ray PD pattern is measured over 10–60° (using a Cu Kα source) and fed into XRD-AutoAnalyzer, which then identifies a structure model. Should the prediction uncertainty surpass 50%, additional measurements are necessary to definitively identify the structure. In order to determine which additional measurement to perform, class activation maps, a type of interpretable ML, are used on XRD-AutoAnalyzer to highlight regions that are critical for differentiating between the most probable phase and the phase with the next-highest probability. The procedure continues iteratively with new X-ray PD measurements until XRD-AutoAnalyzer can confirm the structure model with a confidence exceeding 50%.
If working in a more restricted chemical space with well-defined components, it is possible to use supervised ML models for direct prediction of structural parameters for the phases included in the space. Dong et al. demonstrated that it is possible to directly predict structural information such as scale factor, lattice parameter and crystallite size (main application 3, Fig. 2) from PD patterns from a system of 5 different metal oxides using a convolutional NN that they call the Parameter Quantification Network (PQ-Net).2 They obtained an experimental X-ray diffraction computed tomography dataset of a multi-phase Ni–Pd/CeO2–ZrO2/Al2O3 sample containing about 20 000 diffraction patterns with signals from multiple phases. Treating such a large quantity of data with conventional Rietveld refinements takes significant computer time. To overcome this limitation, PQ-Net was trained on simulated PD data with varying scale factors, lattice parameters and crystallite sizes for NiO, PdO, CeO2, ZrO2 and θ-Al2O3. A 2nd degree Chebyshev polynomial background and Poisson noise were also added to the training data. After training, PQ-Net can identify the crystalline phase, scale factor, lattice parameter and crystallite size for each experimental PD pattern in the dataset, orders of magnitude faster than conventional Rietveld refinement. As seen in Fig. 6, the results of using PQ-Net are comparable to those determined through Rietveld methods on experimental data. A limitation of the PQ-Net approach is that it is tied to its training set, necessitating training prior to each experiment. If unexpected phases emerge during experiments, they will result in large goodness-of-fit values. For each new experiment, or when new structure types are encountered, new training is therefore required.
Fig. 6 Crystallite size (colourbar axis corresponding to nm) and lattice parameter a (colourbar axis corresponding to Å) maps for CeO2 and ZrO2 obtained with the Rietveld method, results obtained with the PQ-Net, their absolute difference for the experimental multi-phase NiO–PdO–CeO2–ZrO2–Al2O3 system and the uncertainty maps of the deep ensemble PQ-Net.2 (Under Creative Commons Attribution 4.0 International License https://creativecommons.org/licenses/by/4.0/).
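Architecturally, a PQ-Net-style model is a 1D convolutional NN that maps a diffraction pattern to a handful of continuous parameters. The PyTorch sketch below conveys the idea only; the layer sizes, channel counts and three-parameter output are illustrative and not those of the published PQ-Net.

```python
import torch
import torch.nn as nn

class ParamRegressor(nn.Module):
    """Toy 1D CNN mapping a diffraction pattern (1 x 2048 points) to
    three continuous parameters, e.g. scale factor, lattice parameter
    and crystallite size. All sizes are illustrative."""
    def __init__(self, n_params=3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=9, stride=2, padding=4), nn.ReLU(),
            nn.Conv1d(16, 32, kernel_size=9, stride=2, padding=4), nn.ReLU(),
            nn.AdaptiveAvgPool1d(8),
        )
        self.head = nn.Sequential(
            nn.Flatten(), nn.Linear(32 * 8, 64), nn.ReLU(),
            nn.Linear(64, n_params),
        )

    def forward(self, x):
        return self.head(self.features(x))

model = ParamRegressor()
patterns = torch.randn(4, 1, 2048)   # a batch of 4 simulated patterns
params = model(patterns)             # shape (4, 3)
```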
Dong et al. required 100 000 datasets for training to achieve good results predicting structural information on experimental data from a chemical system with five components. For larger and more complex systems with more possible components, the supervised ML model must be trained on even more data of both individual structural models and combinations of these. An example of this approach is the work by Lee et al., who used a supervised ML model for phase identification in a quaternary chemical system consisting of Sr, Al, Li, and O. This system spans simple to ternary oxides and multiple different polymorphs, and in total 170 inorganic compounds appear in the chemical space.85 Here, the ML model was trained on 1 785 405 synthetic, combinatorically mixed PD patterns. After training, the model was able to identify phases and give rough estimates of phase fractions in multicomponent systems from XRD data.
Instead of training the supervised ML model on large databases of combinations of phases, unsupervised ML methods, such as PCA and NMF, can demix multi-phase PD patterns into individual phase patterns, as also discussed for XAS data above.86,87 Here, a set of experimental diffraction patterns is given as input to the unsupervised ML algorithm, which decomposes it into its constituent parts. However, the PCA and NMF algorithms may encounter difficulties if the PD pattern of a chemical phase changes during a reaction, for example through peak shifts from a unit cell change, variations in peak intensity from a change in thermal vibrations, or a change in crystallite size, leading to different peak widths. For example, Stanev et al. encountered peak shifts induced by an alloying process in PD data.88 To address this, they implemented a strategy that combined NMF with cross-correlation analysis of the demixed PD patterns, thus enabling the clustering of patterns that originated from the same chemical phase.
Other unsupervised ML methods can also be applied to demix signals into their constituent components (tackling challenge 2, Fig. 3). Chen et al. employed deep reasoning networks to map the crystal-structure phase diagram of Bi–Cu–V oxide using experimental PD data.89 Based on PD data from Bi–Cu–V oxides prepared in various compositions, the ML model was trained to demix the phases in the PD patterns, and subsequently map the crystal-structure phase diagram of Bi–Cu–V oxide.89 Once trained, the deep reasoning network can take a PD pattern from a sample in the composition space as input, and demix signals from multiple phases into their constituent components. Using a linear combination of the components, the PD pattern can be reconstructed, and the phase diagram can be constructed with phase concentrations. The authors demonstrated this process on X-ray PD patterns from a phase diagram of Bi–Cu–V containing 19 chemical phases.
In a slightly different application of supervised methods, we have recently demonstrated how explainable supervised ML can be used to extract information on the local atomic arrangement present in a sample.95 The aim of PDF analysis of e.g. nanostructured materials is often to identify models for the main structural motifs in a material. Our algorithm, ML-MotEx, provides this information by using SHAP (SHapley Additive exPlanation)96,97 values to identify which atoms in a given starting model are important for the fit quality. The ML-MotEx algorithm is illustrated in Fig. 7. The starting model should be chosen to contain the main atomic arrangements expected to be found in the sample. If analysing the structure of e.g. an amorphous material, the starting model may be a related crystalline structure. However, it can sometimes be challenging to generate a good starting model, which is a significant drawback of ML-MotEx. Based on the starting model, thousands of structure fragments are generated by iteratively removing atoms (step 1), and a PDF fit is done for each of the fragments (step 2). A supervised ML model is then trained on the thousands of PDF fits (step 3), and ultimately, each atom is assigned an atom contribution value that describes how much it contributes to the goodness-of-fit (step 4). By analysing the SHAP values, it is thus possible to identify which motifs in the starting model are important for describing the data. ML-MotEx has so far been used to identify the structure of ionic clusters in solution,95 extract structural motifs in amorphous metal oxides,98 and identify stacking faulted domains in MnO2 from both X-ray PD and PDF data.99
Fig. 7 Use of ML-MotEx. Firstly, a starting model is provided. Using this starting model, a structure catalogue is generated, and the structures in the catalogue are fitted to the experimental data in question. An ML algorithm is then trained to predict Rwp values, and finally quantified values of feature importance for the fit quality are calculated. The figure is from Anker et al.95 (Under Creative Commons Attribution 4.0 International License https://creativecommons.org/licenses/by/4.0/).
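Steps 3 and 4 of this workflow can be sketched compactly: train a tree-based regressor to predict the goodness-of-fit from keep/remove flags for each atom, then read off per-atom SHAP values. The sketch below uses the shap library; the synthetic data, the 20-atom starting model and the gradient-boosting choice are illustrative assumptions, not the published ML-MotEx code.

```python
import numpy as np
import shap
from sklearn.ensemble import GradientBoostingRegressor

# Illustrative inputs: each row flags which of 20 atoms of the starting
# model were kept (1) or removed (0) in a fragment; the target is the
# Rwp goodness-of-fit of that fragment's PDF fit. Here, keeping the
# first five atoms (a hypothetical "motif") improves the fit.
rng = np.random.default_rng(2)
X = rng.integers(0, 2, size=(2000, 20)).astype(float)
rwp = 0.5 - 0.02 * X[:, :5].sum(axis=1) + 0.01 * rng.random(2000)

model = GradientBoostingRegressor().fit(X, rwp)
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)          # shape (2000, 20)
# The mean SHAP value per atom indicates its contribution to the fit
# quality; strongly negative values mark atoms that lower Rwp.
atom_contribution = shap_values.mean(axis=0)
```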
The supervised ML methods used for structure identification from both PD and PDF data discussed so far are limited to identifying structural models that are part of the structural database on which they have been trained. Ultimately, the aim of a PDF experiment may be to solve the structure of new nanomaterials. To explore structural models beyond any existing structural database (challenge 2 in Fig. 3), some classes of unsupervised ML could be useful. We have recently used a graph-based conditional variational autoencoder,35 DeepStruc (Fig. 8A), to determine the atomic structure of metallic nanoparticles of up to 200 atoms from PDF data.100,101 Given a PDF, DeepStruc can output a particle structure, and we obtained mean absolute errors of 0.093 ± 0.058 Å on the atomic positions in metallic nanoparticles from simulated PDFs. Fig. 8B shows the results of applying DeepStruc to experimental PDFs obtained from three chemical systems, consisting of two magic-sized clusters, (I) Au144(p-MBA)60102 and (II) Au144(PET)60,102 and (III) a 1.8 nm Pt nanoparticle.103 All three structures match the structures found in the literature and provide good data fits. Although DeepStruc is supervised in the sense that it is trained on structure and PDF pairs, it also shares characteristics with unsupervised ML, as it learns to probabilistically map cluster structures and PDFs into a two-dimensional, chemically meaningful space, which we refer to as the latent space. By inspecting the latent space, it is possible to find relations between different types of cluster models. DeepStruc places decahedral (orange) structures in the latent space between face-centred cubic (fcc) (light blue) and hexagonal close-packed (hcp) (pink) structures. This spatial arrangement can be explained by considering that decahedral structures are constructed from five tetrahedrally shaped fcc crystals separated by {111} twin boundaries.13,104,105 The twin boundaries, resembling stacking faulted regions of fcc, justify their location in the latent space between fcc and hcp.48,95,96 The capability of DeepStruc to interpolate between cluster structures arises from each structure in the latent space being predicted probabilistically rather than deterministically. This has been demonstrated in Anker et al.,100 where we show that generative models28 are necessary to go beyond the structural database used for training the ML model. Specifically, we showed that a generative model, like DeepStruc, can interpolate between structural models, as shown in Fig. 8C, while still yielding sensible results. More traditional, deterministic models could not interpolate between structures and thereby could not go beyond the structural database when predicting a structural model from a PDF.
Fig. 8 DeepStruc is a conditional variational autoencoder that can solve the structure of a small mono-metallic nanoparticle from a PDF. (A) DeepStruc predicts the xyz-coordinates of the mono-metallic nanoparticle structure with a PDF provided as the conditional input. The encoder uses the structure and PDF as input, while the prior only takes the PDF as input. A latent space embedding is given as input to the decoder, which produces the corresponding mono-metallic nanoparticle xyz-coordinates as the structural output. During the training of DeepStruc, both the blue and green regions are used, while only the green region is used for structure prediction during inference. (B) PDF fits of the reconstructed structures from three different nanoparticle systems: (I) the Au144(p-MBA)60 PDF,102 (II) the Au144(PET)60 PDF102 using a reconstructed icosahedral structure, and (III) a 1.8 nm Pt nanoparticle PDF from Quinson et al.103 (A) and (B) are adapted from Kjær & Anker et al.101 (Under Creative Commons Attribution 3.0 Unported License https://creativecommons.org/licenses/by/3.0/). (C) Structures generated by decoding different extents of interpolation of the latent variables obtained for PDF-A and PDF-B. The generated structures start from structure-A and progressively evolve towards structure-B. This work uses a conditional variational autoencoder similar to DeepStruc, which we compare with a deterministic autoencoder. (C) is from Anker & Kjær et al.100 (Under Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License https://creativecommons.org/licenses/by-nc-nd/4.0/).
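The interpolation in Fig. 8C reduces to decoding convex combinations of two latent vectors. The following schematic sketch assumes a trained encoder/decoder pair with the call signatures shown; these interfaces are hypothetical and do not correspond to DeepStruc's actual API.

```python
import numpy as np

def interpolate_structures(encoder, decoder, pdf_a, pdf_b, n_steps=5):
    """Decode structures along a straight line between the latent
    embeddings of two PDFs. `encoder` and `decoder` are assumed to be
    trained callables: encoder(pdf) -> latent vector z, and
    decoder(z) -> xyz coordinates. This mirrors Fig. 8C schematically:
    the decoded structures evolve from structure-A towards structure-B."""
    z_a, z_b = encoder(pdf_a), encoder(pdf_b)
    structures = []
    for t in np.linspace(0.0, 1.0, n_steps):
        z = (1.0 - t) * z_a + t * z_b   # convex combination in latent space
        structures.append(decoder(z))
    return structures
```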
DeepStruc is integrated with the Hugging Face platform, enabling users to rapidly determine the structure of small metallic nanoparticles from PDFs using a simple two-click process.106 The Hugging Face integration provides a user-friendly experience, without requiring data storage or complex software installations.
Unsupervised ML algorithms have also been employed to either uncover trends in PDFs obtained from multiple samples or to separate the signals from different phases in a PDF (challenge 2, Fig. 3).107–109 NMF has proven especially useful and has been used to analyse PDFs obtained from various materials and conditions, including battery materials, amorphous solid dispersions, and data collected under high pressure. It has also been used to extract the interface PDF between an Fe and an Fe3O4 phase.110–114 Recently, efforts have also been made to develop an efficient and accurate NMF algorithm that can be used during data measurement.115,116 This NMF algorithm is available at PDF-in-the-cloud.117,118
The appropriate spin wave model of the half-doped bilayer manganite Pr(Ca,Sr)2Mn2O7 (abbreviated PCSMO) has, for example, been debated, with both a Goodenough spin wave model (Fig. 9A)119,120 and a Dimer spin wave model (Fig. 9A)121 being considered. Due to the subtle differences between the two models, determining which model the INS spectra correspond to has been challenging, as it requires a meticulous manual fitting process. After extensive experimentation and careful data treatment, it was ultimately determined that the Goodenough spin wave model best describes the experimental data (Fig. 9B).122
Fig. 9 Determining the spin wave model from experimental INS data using ML. (A) Two magnetic exchange models in a single sheet of Mn ions in a half-doped manganite: (left) the Goodenough model and (right) the Dimer model.27 (B) 2D representation of experimental data of PCSMO measured at 4 K using the MAPS spectrometer.122 The INS spectra are arranged in terms of incident neutron energy (Ei) and bins of energy transfer ω = 0.10–0.16Ei, etc.125 (C) Schematic representation of the DUQ method. The input initially passes through a series of convolutional NNs (orange block) to extract features. In standard logistic regression, the outputs from the convolutional NNs are classified by summing the weights connecting each filter, fi, to the class C of interest; this is a simple binary classification. The DUQ classifier instead outputs a weight vector associated with the input that is correlated to the class predictions. If all the weights in the weight vector are close to a class (based on the distances, Kc, from the weight vector to the centre, ec, of clusters of training examples), the prediction has a large certainty, while the certainty is lower with a larger spread of weight vectors. (C) is from Butler et al.27 (Under Creative Commons Attribution 4.0 International License https://creativecommons.org/licenses/by/4.0/). (D) The DUQ classifier cannot identify the spin wave model of an experimental dataset with high certainty. However, Exp2SimGAN matches the experimental dataset to the simulated training set of the DUQ classifier, enabling classification of the spin wave model with high certainty. (A), (B) and (D) are from Anker et al.125 (Under Creative Commons Attribution 3.0 Unported License https://creativecommons.org/licenses/by/3.0/).
To ease this task, a supervised ML model has been developed to assist in analysing INS data. By training supervised ML models on simulated INS spectra calculated using physics-driven equations, Hamiltonians can be predicted from INS data. Specifically, Butler et al. demonstrated that NNs can predict magnetic Hamiltonians or classify the spin wave model from simulated INS data of PCSMO, saving significant time compared to manual data analysis.27 They first used a logistic regression123 model, illustrated in Fig. 9C, which makes a simple binary classification, either Goodenough or Dimer, but gives no indication of the reliability of the prediction. It was thereby challenging to judge when to trust the model. To resolve this problem, they used a deterministic uncertainty quantification (DUQ) classifier (Fig. 9C)27,122 to perform classification with uncertainty estimates instead. The DUQ classifier outputs a weight vector associated with the input that is correlated to the class predictions. If all the weights in the weight vector are close to a class, the prediction has a large certainty, while the certainty is lower with a larger spread of weight vectors.
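The distance-based certainty idea behind DUQ can be illustrated in a few lines: compare a feature/weight vector against learned class centroids with an RBF kernel, so that inputs far from every centroid (e.g. experimental data unlike the simulations) receive low certainty for all classes. The kernel form, length scale and centroids below are illustrative, not the published implementation.

```python
import numpy as np

def duq_certainty(feature_vec, class_centroids, length_scale=0.1):
    """Distance-based certainty in the spirit of DUQ: an RBF kernel
    between the model's feature/weight vector for an input and the
    centroid e_c of each class's training examples. The class with the
    largest kernel value is predicted; a value near 1 means high
    certainty, while a vector far from every centroid scores low for
    all classes."""
    K = {c: float(np.exp(-np.sum((feature_vec - e) ** 2)
                         / (2 * length_scale ** 2)))
         for c, e in class_centroids.items()}
    prediction = max(K, key=K.get)
    return prediction, K

# Toy two-class example (hypothetical 2D feature space):
centroids = {"Goodenough": np.array([1.0, 0.0]),
             "Dimer": np.array([0.0, 1.0])}
print(duq_certainty(np.array([0.9, 0.1]), centroids))
```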
To reliably predict the spin wave model from experimental INS data (Fig. 9B), the DUQ classifier was trained on computationally expensive, resolution-convoluted INS spectra. Physics-driven simulations may not always capture the experimental noise, instrumental effects or other artefacts from phenomena not described by the underlying theory (challenge 3 in Fig. 3). In this example, the computationally inexpensive, resolution-unconvolved INS simulations did not capture any instrumental effects. To address this challenge, we introduced an unsupervised image-to-image algorithm, Exp2SimGAN, which is a generative adversarial network124 (GAN) capable of learning the simulated and experimental data distributions and transforming between them, e.g. transforming a simulated dataset into one that resembles an experimental dataset, or vice versa.125 By using Exp2SimGAN to convert experimental INS spectra into simulated-like data, the DUQ classifier, trained on computationally inexpensive, resolution-unconvolved INS spectra, can be applied to the experimental INS data (Fig. 9D).125 This approach helps bridge the gap between simulations and experimental data, allowing for more accurate and efficient analysis (tackling challenge 3, Fig. 3).
Samarakoon et al. demonstrated an alternative approach for the analysis of INS data using autoencoders.126–128 They showed that autoencoders can eliminate background signals and artefacts from the experimental INS spectrum by compressing it into a latent space. Once in the latent space, the magnetic behaviour can be categorised, and the autoencoder can solve the inverse problem by extracting the Hamiltonians from the experimental INS spectrum. This is achieved by decoding the INS spectrum from the latent space positions. As a result, the autoencoder works as a fast surrogate model for INS simulations, accelerating the fitting procedure of the experimental INS spectrum. Later work integrates ML modelling approaches into INS experiments, enabling real-time analysis of INS data.129
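The surrogate and denoising roles described above both rest on the standard autoencoder bottleneck: compress the spectrum to a few latent variables and reconstruct it. A minimal PyTorch sketch follows; the dimensions and architecture are illustrative, not those of Samarakoon et al.

```python
import torch
import torch.nn as nn

class SpectrumAutoencoder(nn.Module):
    """Toy autoencoder: a 1024-point INS spectrum is compressed to a
    2D latent space and reconstructed. Background and artefacts that
    the narrow bottleneck cannot represent tend to be filtered out of
    the reconstruction, and the latent coordinates can be used to
    categorise magnetic behaviour."""
    def __init__(self, n_points=1024, n_latent=2):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_points, 128), nn.ReLU(), nn.Linear(128, n_latent))
        self.decoder = nn.Sequential(
            nn.Linear(n_latent, 128), nn.ReLU(), nn.Linear(128, n_points))

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z

model = SpectrumAutoencoder()
spectra = torch.randn(8, 1024)            # a batch of 8 toy spectra
reconstruction, latent = model(spectra)   # denoised output + embedding
```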
Supervised ML is now widely used to identify structural models from data (main application 1, Fig. 2), to predict data from structural models (main application 2, Fig. 2), or to directly provide structural information from data (main application 3, Fig. 2). However, we have highlighted three challenges that supervised ML faces in automating the analysis of scattering and spectroscopy data (Fig. 3). Challenge 1: handling datasets originating from a mix of chemical phases. Here, unsupervised ML, especially NMF, has successfully been used to demix datasets into constituent components. We anticipate the emergence of combined methods, where unsupervised ML first demixes complex datasets, after which the components are independently analysed using supervised ML. Challenge 2: handling data from a structural model that is not part of a database. Here, generative modelling appears promising for interpolating between structural models in a database. Challenge 3: handling experimental data. For ML models to significantly impact the data analysis of scattering and spectroscopy data, they must perform well on experimental data and not only on simulated data. Often in materials chemistry, supervised ML models are trained on physics-driven simulations which do not include instrumental artefacts, noise or other phenomena not directly described by the underlying physics. Here, new methods are needed to make simulated data resemble experimental data. Unsupervised image-to-image algorithms could potentially address this challenge.125
However, using ML to resolve more complicated challenges in materials chemistry is still hampered by the limited size of datasets connecting structure and spectroscopy/scattering signals. One way to handle limited data is to constrain the ML model with chemical knowledge. Here, physics-informed NNs serve as an inspiration, as they embed partial differential equations as constraints in the NN optimisation problem, for example when using an NN as a surrogate model for the Schrödinger equation.130 As a result, the range of potential solutions is limited to a size manageable for ML with the available information. However, not all chemistry can be expressed as differentiable equations, necessitating the development of similar approaches that can incorporate chemical knowledge into the ML architecture as ‘chemistry-driven ML’. Equivariant graph-based NNs show promise, as they leverage group representation theory to design architectures that are equivariant to specified symmetry groups, making them well-suited for analysing chemical systems with underlying symmetries.131 We expect another impact to come from interpretable and explainable ML, which enables researchers to understand the underlying mechanisms behind predictions, build trust in ML model outcomes, and uncover unexpected correlations that may lead to scientific insights. For those interested, we refer to a recent review paper by Oviedo et al.30 for more about interpretable and explainable ML in materials chemistry.
Currently, it is not mandatory to publish data, code, and software requirements alongside research papers, making it difficult for other researchers to apply trained ML models to their own experimental data. A paradigm shift from publishing papers with code to publishing code with papers may thus be needed. For the ML developer, we refer to N. Artrith et al.132 for best practices in ML for chemistry. We suggest that publishing code with papers would greatly benefit the field, allowing materials chemists to analyse data easily or automatically without domain expertise.
If we unceasingly share ML models, expand open-source databases, and bridge the gap between simulated and experimental data, the next decade holds promise to integrate analysis of scattering and spectroscopy data with ML into self-driving laboratories. Self-driving laboratories are currently receiving much attention for e.g. identifying new, improved photocatalysts for hydrogen production from water,133 synthesising pharmaceutical compounds,134 and optimising nanostructure synthesis based on their optical properties.135,136 As illustrated in Fig. 10, the self-driving laboratory will synthesise a material, perform a scattering or spectroscopy experiment, and the data can be automatically analysed with ML. The findings will then be fed into an active learning framework that suggests the next experiment based on structural insight.