E. A. Eronen,*a A. Vladyka,a Ch. J. Sahleb and J. Niskanen*a
aDepartment of Physics and Astronomy, University of Turku, FI-20014 Turun yliopisto, Finland. E-mail: eemeli.a.eronen@utu.fi; johannes.niskanen@utu.fi
bESRF, The European Synchrotron, 71 Avenue des Martyrs, CS40220, 38043 Grenoble Cedex 9, France
First published on 15th August 2024
Machine learning can reveal new insights into X-ray spectroscopy of liquids when the local atomistic environment is presented to the model in a suitable way. Many unique structural descriptor families have been developed for this purpose. We benchmark the performance of six different descriptor families using a computational data set of 24 200 sulfur Kβ X-ray emission spectra of aqueous sulfuric acid simulated at six different concentrations. We train a feed-forward neural network to predict the spectra from the corresponding descriptor vectors and find that the local many-body tensor representation, smooth overlap of atomic positions and atom-centered symmetry functions excel in this comparison. We find a similar hierarchy when applying emulator-based component analysis to identify and separate the spectrally relevant structural characteristics from the irrelevant ones. In this decomposition, the spectra depend dominantly on the concentration of the system, whereas adding the second most significant degree of freedom allows for distinction of the protonation state of the acid molecule.
Because raw atomic coordinates R are unsuitable for contemporary ML, numerous families of descriptors D(R) have been developed to encode the structural information into a useful input.18–31 While each of these representations might perform well for some tasks, they are not necessarily equally fit for every situation. Besides ML performance, interpretability of the descriptor is a key consideration for studies of the actual structural information content of, e.g., X-ray spectra.17 The study of the spectrum-to-structure inverse problem is indeed highly dependent on the descriptor, which needs to include physically relevant features that are meaningful not only to ML models but also to human researchers.
Spectrum prediction by ML, i.e. finding a suitable function for the spectrum S(D(R)), is a more viable task than structure prediction (finding a function R(S)),13 possibly because the forward mapping need not be injective, which leaves its inverse ill-defined. Moreover, some structural characteristics of the system are spectrally irrelevant. While the behavior of a spectrum can be captured by a well-performing feed-forward neural network (NN), the knowledge remains hidden in the respective weight matrices and bias vectors. To this end, the NN is useful only for predicting the outcome for new input, i.e. for emulation. Emulator-based component analysis (ECA)14 is an approach to extract knowledge contained in an ML model and a given data set. With the help of a fast emulator such as an NN, the method iteratively finds the structural dimensionality reduction with maximal explained spectral variance. The resulting basis vectors provide an exhaustive input feature selection, which identifies not only the spectrally relevant features but also their collaborative effect.17 Moreover, approximate structural reconstruction from spectra can be done by first reconstructing the few spectrally dominant latent coordinates, and then taking an expansion with the respective basis vectors.16 Our previous work on N K-edge X-ray absorption spectroscopy of aqueous triglycine17 showed that ECA greatly outperforms principal component analysis (PCA)32 of structural data in terms of covered spectral variance. Furthermore, a recent study using encoder–decoder neural networks supports the validity of ECA.33 While the advantage of this method is that it requires no prior hypothesis, it imposes another requirement on a descriptor: decomposability into a few dominant contributions.17
We explore the structural information content of simulated sulfur Kβ X-ray emission spectra (XES) of aqueous sulfuric acid. To create the data set, we sample atomistic local structures from ab initio molecular dynamics (AIMD) simulations at six different concentrations and calculate spectra for these structures. We first assess a total of six structural descriptor families in terms of spectrum prediction performance by an NN. For a fair comparison between the descriptor families, we allocate equal computational resources to the joint hyperparameter–NN architecture search for each of them. Next, we identify the spectrally dominant structural degrees of freedom using ECA, and study the performance of the best descriptor of each family in this task. A sulfuric acid molecule can exist in one of three protonation states, which we find to be distinguishable from the XES after a rank-two decomposition, in which the first rank covers the intermolecular interaction given by the concentration. Our results highlight the need for identification of the relevant structural degrees of freedom for reliable interpretation of X-ray spectra. Moreover, they raise a call for methods of obtaining simple structural information from contemporary structural descriptors, which may have a notably abstract mathematical form.
| Num. conc. | Molarity [M] | Box L [Å] | Duration [ps] | Sampling [fs] | N snapshots | N spectra | SO42− [%] | HSO4− [%] | H2SO4 [%] |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1v63 | 0.9 | 12.50 | 100 | 62.5 | 1600 | 1600 | 96.2 | 3.8 | 0.0 |
| 6v54 | 4.9 | 12.66 | 50 | 62.5 | 800 | 4800 | 71.4 | 28.5 | 0.1 |
| 12v36 | 10.1 | 12.55 | 50 | 125 | 400 | 4800 | 17.2 | 79.3 | 3.5 |
| 20v20 | 15.3 | 12.95 | 50 | 250 | 200 | 4000 | 2.0 | 74.8 | 23.1 |
| 21v7 | 17.5 | 12.58 | 50 | 250 | 200 | 4200 | 0.0 | 28.7 | 70.8 |
| 24v0 | 18.6 | 12.90 | 50 | 250 | 200 | 4800 | 0.0 | 0.0 | 99.6 |
| Total |  |  |  |  | 3400 | 24 200 |  |  |  |
We computed the XES for every sulfur site of each sampled snapshot using the projector-augmented-wave (PAW) method40 with a plane wave basis and density functional theory (DFT) as implemented in GPAW version 22.1.0.40–42 We used periodic boundary conditions, the PBE exchange-correlation functional and a 600 eV plane wave energy cutoff (for justification, see ESI†).
The spectrum calculations applied transition-potential DFT43 in a fashion motivated in ref. 44. First, we computed the neutral ground state of each snapshot and the emission lines for each site on a relative energy scale. We then calibrated the individual spectra on the absolute energy scale by a Δ-DFT calculation for the highest transition. This procedure builds on a calculation with one valence vacancy and on corresponding calculations with a full core hole at each site, repeated for each snapshot. We convolved the obtained energy–intensity pairs (a stick spectrum) with a Gaussian of 1.5 eV full width at half maximum and then presented the spectra on a grid with a bin width of 0.075 eV.
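As an illustration of this broadening step, the following minimal Python sketch convolves a stick spectrum with a Gaussian of given FWHM on a regular grid; the grid limits and padding are our assumptions, not parameters of the original workflow.

```python
import numpy as np

def broaden(energies, intensities, fwhm=1.5, bin_width=0.075):
    """Convolve a stick spectrum (energy-intensity pairs) with a Gaussian."""
    sigma = fwhm / (2.0 * np.sqrt(2.0 * np.log(2.0)))  # convert FWHM to std dev
    grid = np.arange(energies.min() - 5.0, energies.max() + 5.0, bin_width)
    spectrum = np.zeros_like(grid)
    for e, i in zip(energies, intensities):
        spectrum += i * np.exp(-0.5 * ((grid - e) / sigma) ** 2)
    # Normalize the Gaussian kernel so each stick keeps its integrated intensity
    spectrum /= sigma * np.sqrt(2.0 * np.pi)
    return grid, spectrum
```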
Fig. 1 Simulation results of aqueous sulfuric acid. (a) and (b) Two sample structures for 0.9 M and 15.3 M, respectively, prepared with the Jmol software.46 Only molecules within 3 Å distance from the central molecule are shown for clarity. (c) and (d) Computational ensemble mean X-ray emission spectrum for each concentration with the standard deviation σ shown as grey shaded area. As an example we show the location of the features Kβx, Kβ1,3 and Kβ′′45 for the highest concentration spectrum in panel (d). In addition, the coarsened grid points used for the target spectra of the machine-learning-based analysis are shown on the mean spectra. |
We evaluated the protonation state of each acid molecule as in ref. 34. An oxygen atom was considered to belong to the molecule if it was at most 2.0 Å from the respective sulfur atom. A hydrogen atom, in turn, was considered a part of the acid molecule if it was closer than 1.3 Å to any of its oxygen atoms. Throughout the simulations, the acid molecules always had four oxygen atoms, with the distribution of forms: 5878 SO42−, 9435 HSO4−, 8847 H2SO4, and 40 H3SO4+, corresponding to protonation states from 0 to 3, respectively.
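In code, this assignment reduces to the two distance cutoffs quoted above; a minimal sketch (minimum-image corrections for the periodic box are omitted for brevity):

```python
import numpy as np

def protonation_state(pos_S, pos_O, pos_H, r_SO=2.0, r_OH=1.3):
    """Count protons belonging to one acid molecule (cutoffs from the text).

    pos_S: (3,) sulfur position; pos_O, pos_H: (N, 3) oxygen and hydrogen
    positions. Periodic boundary conditions are ignored in this sketch.
    """
    # Oxygen atoms within 2.0 A of the sulfur belong to the molecule
    own_O = pos_O[np.linalg.norm(pos_O - pos_S, axis=1) <= r_SO]
    # A hydrogen belongs to the molecule if closer than 1.3 A to any own oxygen
    d_OH = np.linalg.norm(pos_H[:, None, :] - own_O[None, :, :], axis=2)
    n_H = int((d_OH.min(axis=1) < r_OH).sum())
    return n_H  # 0 = SO4(2-), 1 = HSO4(-), 2 = H2SO4, 3 = H3SO4(+)
```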
In this work we study six different descriptor families for encoding the local atomistic structure R around the emission site Sem into a vector of features D(R). We used the implementations within the DScribe package (version 2.0.1)19,47 for the local version of the many-body tensor representation (LMBTR),20 the smooth overlap of atomic positions (SOAP)22 and the atom-centered symmetry functions (ACSF).21 For the many-body distribution functionals (MBDF)30 we used the implementation provided with the original publication. In addition, we implemented the descriptor introduced in ref. 28, calling it "Gaussian tensors" (GT) hereafter. Finally, we used a sorted variant of the Coulomb matrix (CM),23 similar to the bag of bonds24 and analogous to the implementation in ref. 16. We used the emission site Sem as the only center when building the LMBTR, SOAP, ACSF and GT descriptors.
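For example, a SOAP descriptor centered at the emission site can be built along the following lines. This is a sketch only: the hyperparameter values are illustrative (the optimized values come from the joint search described below), the file name is hypothetical, and the parameter names follow the DScribe 2.x API.

```python
from ase.io import read
from dscribe.descriptors import SOAP

# Illustrative hyperparameters only, not the optimized values of this work
soap = SOAP(
    species=["H", "O", "S"],
    periodic=True,       # the simulation boxes are periodic
    r_cut=6.0,           # assumed cutoff radius in Angstrom
    n_max=8,             # radial basis size
    l_max=6,             # angular basis size
)

atoms = read("snapshot.xyz")                # hypothetical snapshot file
s_index = [a.index for a in atoms if a.symbol == "S"][0]
D = soap.create(atoms, centers=[s_index])   # emission site as the only center
```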
We split the data set randomly into 80% (19360 data points) for training and the remaining 20% (4840 data points) for testing. We calculated feature-wise z-score standardization parameters for the raw D(R) (NN input) and S (NN output) features using the training set and then applied this scaling to all data prior to any further procedures. Such feature scaling is common practice in ML in general,48 and also in MD with atomic ML potentials in particular.49 Furthermore, it can be shown that the standardization does not limit the emulation performance of an NN, but is still essential for achieving unbiased L2 regularization during training (see ESI†).
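A sketch of this standardization, with the parameters fit on the training split only (the array shapes and the random placeholder data are illustrative):

```python
import numpy as np

def fit_standardizer(X_train):
    """Feature-wise z-score parameters from the training split only."""
    mu, sd = X_train.mean(axis=0), X_train.std(axis=0)
    sd = np.where(sd == 0.0, 1.0, sd)  # guard constant features
    return mu, sd

rng = np.random.default_rng(0)
D_all = rng.normal(size=(24200, 420))      # placeholder descriptor matrix
idx = rng.permutation(len(D_all))
train, test = idx[:19360], idx[19360:]     # 80/20 split as in the text
mu, sd = fit_standardizer(D_all[train])
D_std = (D_all - mu) / sd                  # the same scaling applied to all data
```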
We used fully connected feed-forward NNs implemented in PyTorch (version 2.0.1)50 for ML. This NN architecture allows for a range of possible hyperparameters that we searched over: the weight decay term α (from 10−13 to 1), the number of hidden layers (2, 3, 4 or 5), the hidden layer width (16, 32, 64, 128 or 256 neurons) and the learning rate (0.0001 or 0.00025). As the activation function of the neurons, we used the exponential linear unit (ELU).51 The networks were trained in mini-batches of 200 data points by maximizing the R2 score (coefficient of determination; generalized covered variance) using the Adam optimizer.52 We applied early stopping by checking every 200 epochs whether the validation score had stopped improving.
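A minimal sketch of such a network follows; the width, depth, learning rate and weight decay shown are single examples from the searched grid, and the input/output sizes are assumptions for illustration. Note that for standardized targets, minimizing the mean squared error is equivalent to maximizing the R2 score, since the total sum of squares is fixed.

```python
import torch
import torch.nn as nn

def make_mlp(n_in, n_out, width=128, depth=3):
    """Fully connected feed-forward NN with ELU activations."""
    layers, d = [], n_in
    for _ in range(depth):
        layers += [nn.Linear(d, width), nn.ELU()]
        d = width
    layers.append(nn.Linear(d, n_out))
    return nn.Sequential(*layers)

model = make_mlp(n_in=420, n_out=80)   # assumed feature and spectral grid sizes
optimizer = torch.optim.Adam(model.parameters(), lr=2.5e-4, weight_decay=1e-6)
loss_fn = nn.MSELoss()                 # minimizing MSE maximizes R2 here
```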
We calculated the average score from a five-fold cross-validation (CV) on the training set for each selected descriptor–NN hyperparameter combination. Because building the descriptor consumed significantly more computational resources than training one NN model, we trained ten different NN architectures for every single descriptor hyperparametrization. The same procedure was repeated for each of the six descriptor families with equal total processor (CPU) time for the random grid search, by allocating a total of 1440 CPU hours per descriptor family on Intel Xeon Gold 6230 processors at the Puhti cluster computer, CSC, Finland. Different descriptor families have different free parameters to search over, as detailed in the ESI.† We found that slight variation in the results persisted within the searched grid space owing to randomness in, for example, the shuffling of the mini-batches and the initialization of NN weights. In the end, we chose the descriptor–NN hyperparameters with the highest CV score and trained the final model using the full training set. In this final training we used 80% of the re-shuffled training data (15488 points) for actual training and the remaining 20% (3872 points) for validation to determine the early stopping condition.
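The random search itself can be sketched as below; `evaluate_cv` is a hypothetical user-supplied helper standing for the five-fold CV described above, and in the full workflow the descriptor hyperparameters (see ESI†) would be sampled jointly in the same loop.

```python
import random

# Hyperparameter grid from the text
grid = {
    "alpha": [10.0**e for e in range(-13, 1)],   # weight decay
    "depth": [2, 3, 4, 5],                       # hidden layers
    "width": [16, 32, 64, 128, 256],             # neurons per layer
    "lr":    [1e-4, 2.5e-4],                     # learning rate
}

best_score, best_hp = -float("inf"), None
for _ in range(10):  # ten NN architectures per descriptor parametrization
    hp = {key: random.choice(values) for key, values in grid.items()}
    score = evaluate_cv(hp)  # mean R2 over five CV folds (user-supplied)
    if score > best_score:
        best_score, best_hp = score, hp
```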
We carried out ECA decomposition14 of the structural descriptor space using the PyTorch implementation of the algorithm.53 In the ECA procedure, basis vectors V = {vj}j=1…k are searched for to achieve a rank-k approximation of D(R):

$$ D'_k(R) \;=\; \sum_{j=1}^{k} t_j\, v_j, \qquad t_j \;=\; D(R) \cdot v_j \qquad (1) $$

where the scores tj are projections of the descriptor onto the basis vectors, which are optimized for maximal spectral variance covered by the emulator predictions.
We found that the randomly selected initial guess of the ECA component vectors sometimes affected the resulting fit. Therefore, we ran the ECA procedure with 25 different random initial guesses for each component and always chose the vector resulting in the highest R2 score, before moving on to optimizing the next component. This is likely a symptom of the high-dimensional feature vectors, which encode a wealth of atomistic information within each (correlated) element of the vector. This complexity is inherited from the physical problem itself.
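To make the procedure concrete, the following is a strongly simplified sketch of the search for one ECA component (in practice the published implementation53 should be used). Here `emulator` is the trained NN, `D` and `S` are standardized training descriptors and spectra, and `V` holds the previously found components; for brevity the new vector is only normalized, not explicitly orthogonalized against `V`, and the learning rate and step counts are arbitrary choices.

```python
import torch

def next_eca_vector(emulator, D, S, V, restarts=25, steps=300):
    """Search for the next ECA basis vector by gradient ascent on R2."""
    ss_tot = ((S - S.mean(0)) ** 2).sum()
    best_v, best_r2 = None, -float("inf")
    for _ in range(restarts):                     # 25 random initial guesses
        v = torch.randn(D.shape[1], requires_grad=True)
        opt = torch.optim.Adam([v], lr=1e-2)
        for _ in range(steps):
            opt.zero_grad()
            B = torch.cat([V, (v / v.norm())[None, :]], dim=0)
            D_k = (D @ B.T) @ B                   # rank-k expansion, eqn (1)
            r2 = 1.0 - ((emulator(D_k) - S) ** 2).sum() / ss_tot
            (-r2).backward()                      # ascend on the R2 score
            opt.step()
        if r2.item() > best_r2:                   # keep the best restart
            best_r2, best_v = r2.item(), (v / v.norm()).detach()
    return best_v, best_r2

# First component: pass V = torch.zeros(0, D.shape[1])
```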
We used the training set for both NN training and ECA decomposition. The generalizability of the outcome was assessed with the test set in both cases. To allow for apples-to-apples comparison, we used the R2 score for all NN training, testing, and ECA.
Even after extensive model selection, the descriptors yield varying prediction performance for the target spectra, as presented in Table 2. Practically equal accuracy is obtained with the best-performing descriptors LMBTR, SOAP and ACSF. The MBDF descriptor provides intermediate performance among the studied ones, whereas GT and CM yield R2 scores more than 0.1 units lower than those of the most accurate descriptors. The tendency of an ML model to overfit is commonly measured by the difference between the train and test scores. Our results hint at an increasing trend in this difference with decreasing accuracy. Fig. 2a illustrates the distribution of the R2 scores for z-score inverse transformed (absolute intensity) spectral features using the LMBTR emulator, with an overall test set R2 of 0.950. Additionally, typical prediction quality along this distribution is presented in Fig. 2b–d. A similar figure for the z-score-standardized spectral space, with R2 of 0.928, is available in the ESI.†
| Descriptor | N features | R2 (train) | R2 (test) | Difference |
| --- | --- | --- | --- | --- |
| LMBTR | 420 | 0.944 | 0.928 | 0.015 |
| SOAP | 2700 | 0.961 | 0.928 | 0.033 |
| ACSF | 543 | 0.952 | 0.923 | 0.029 |
| MBDF | 330 | 0.915 | 0.878 | 0.036 |
| GT | 1275 | 0.857 | 0.814 | 0.043 |
| CM | 595 | 0.889 | 0.806 | 0.083 |
Fig. 2 Absolute intensity scale spectrum prediction performance of the best NN-LMBTR model, measured using the test set. (a) Distribution of R2 scores for each data point. Examples of known and predicted spectra with the R2 score closest to (b) the 1st decile, (c) the median, and (d) the 9th decile. For each of the three cases, the location along the R2 distribution is shown in panel (a) as a correspondingly colored circle. The scores in this figure differ from those in the rest of the study because they are evaluated on the absolute intensity scale, whereas ML was applied to z-score standardized spectra. For a similar plot using the standardized spectra, see ESI.†
We measure the ECA decomposition performance with the R2 score, shown as a function of the rank of the decomposition in Fig. 3. The scores for the training set and for the test set both rise monotonically and approach a plateau near the respective emulator performance. In general, the R2 scores of high-rank (≥5) ECA are roughly ordered according to the overall ML accuracy of the respective emulators. The design principle of ECA aims at maximal covered spectral variance at any given rank, manifested by the diminishing improvement as a function of k observed in Fig. 3. Consequently, the high-k scores tk are not reconstructable from the spectra, as these degrees of freedom are irrelevant for their emulation. With components of negligible effect on the outcome, full structural reconstruction from spectra is impossible, as a structural descriptor is completely defined by expansion (1) taken to full rank. The intended rapid reduction of dimensionality motivates the study of low-rank (e.g. k ≤ 3) decompositions, for which LMBTR performs the best. We have noticed jumps in the R2 curves as a function of the decomposition rank k, seen in Fig. 3 for SOAP and CM. This phenomenon is unpredictable and potentially related to the initial guess of the ECA component vector. Apart from the obvious complexity of the problem, the detailed origin of and cure for this behaviour remain unknown to us.
Next, we analyse the ECA results using the LMBTR descriptor, which contains simple physical information as part of it, namely element-wise interatomic distances from the emission site Sem. Instead of presenting the numeric values of these distances, the descriptor encodes the information on a predefined grid as a sum of Gaussian functions centered at the respective positions. The corresponding features of the first ECA component vector, z-score inverse transformed into the descriptor space, are shown in Fig. 4a. In the sense of the aforementioned representation, these curves reflect the change in the interatomic distances most relevant to the target spectrum shape. The figure shows that the S Kβ XES is affected by notably distant molecules. As the typical protonating-H distance from the S atom of the acid molecule is 2.2 Å, the part corresponding to the Sem–H distribution shows notable relevance in the region above 3 Å. This can be attributed to the different hydrogen number density in the system reflecting the concentration, with possibly minor effects coming from hydrogen bonding in the system. The concentration dependence is further indicated by the opposite effect of the Sem–S curve at 4–6 Å. Additionally, the Sem–H distribution shows a weaker effect in the region between 1 Å and 3 Å, which likely arises from the tails of the descriptor Gaussians corresponding to the hydrogen atoms protonating the acid molecule.
The first two ECA components correspond to structural features which define the majority of the spectral variance (Rk=12 = 0.682, Rk=22 = 0.788). Recent results indicate an orders-of-magnitude better R2 score for covered spectral variance by ECA in comparison to that obtained by PCA of the structural descriptor.17 In the case of S Kβ XES of aqueous H2SO4 such a drastic difference is not observed: here a 2-component structural decomposition by PCA resulted in R2 = 0.453 for spectral variance, which is still notably less than the Rk=22 = 0.788 obtained by ECA. This indicates a more direct relation between structural characteristics and spectral response in the current case.
To study the separability of spectrally dominant structural features, we projected each data point in the test set onto the two-dimensional ECA space. We focus on the two most obvious characteristics: the protonation state of an acid molecule and the concentration of the system, indicated by the coloring of each point in the resulting scatter plots shown in Fig. 4b and c, respectively. The protonation state is not the ruling structural characteristic behind the variation of the S Kβ XES, as it can be only partially identified by the first ECA score t1 (Fig. 4b). However, the interplay of t1 and t2 disentangles these classes almost completely. This result is supported by the fact that the first component vector shows only a weak effect in Fig. 4a along the part of the curve which corresponds to the protonating hydrogen atoms. In contrast, the score t1 describes the concentration of the system, seen as the spectral change in Fig. 1c and d and in our analysis of the respective Sem–H curve in Fig. 4a. Ultimately, the need for the second degree of freedom to identify the protonation state is congruent with the overlap between the spectral-region intensity histograms of the protonation states, reported in a previous work on aqueous H2SO4.5
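These projections are plain dot products with the component vectors. A sketch of a plot like Fig. 4b, where `D_test_std`, `V` and `protonation` are assumed names for the standardized test descriptors, the two ECA component vectors and the per-point protonation states:

```python
import matplotlib.pyplot as plt

t = D_test_std @ V.T                     # ECA scores: columns are t1 and t2
plt.scatter(t[:, 0], t[:, 1], c=protonation, cmap="viridis", s=4)
plt.xlabel("$t_1$")
plt.ylabel("$t_2$")
plt.colorbar(label="protonation state")
plt.show()
```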
Our results suggest that, in X-ray spectroscopy of liquids (with ≈2 × 10⁴ data points), equal ML performance can be obtained with LMBTR, SOAP and ACSF when the descriptor hyperparameters are selected jointly with the NN architecture. Among the studied descriptor families, we obtained intermediate performance with MBDF, whereas we could not achieve competitive accuracy with CM and GT. The result thus highlights the need for suitable encoding of information by the descriptor. Although the CM can even be converted back to the original structure with the loss of only the handedness of the system, and although the descriptor family performed quite well for Ge Kβ XES of amorphous GeO2,16 it does not excel with the current liquid system.
Picking a descriptor family poses a serious model selection problem, because each family has its own inherent tunable hyperparameters (see ESI†). Owing to the expected interplay between the descriptor hyperparameters and the optimal NN architecture, this optimization is ideally done jointly with that of the NN, which multiplies the required computational effort. Without prior knowledge, descriptors with more free hyperparameters are more flexible than those with fewer, and their parameter-optimized forms therefore have a higher prior potential for accuracy as well. Because this tuning is ultimately left to the user, we accounted for the discrepancy in descriptor design by applying equal computation time for refining each descriptor, regardless of how many free hyperparameters the implementation had. We propose this practice for fair assessment of the ever-growing multitude of structural descriptors. Furthermore, we find that the top-level performance within a descriptor family is typically achieved with several drastically different parametrizations. We therefore conclude that diminishing CV-score gains provide a reasonable stopping condition for the joint randomized hyperparameter search, provided the allowed hyperparameter space is sufficiently large.
In this work we chose to measure performance using the R2 score, which is a widely applied metric for the information captured by a model, utilized in, e.g., PCA. The score is well-suited for spectrum interpretation because it is independent of the units and of the absolute scale. The sulfur atom has a Kβ baseline spectrum in the SO4 moiety, and therefore the respective variation evaluated by the R2 score yields an informative measure for the interpretation of spectra. When using z-score standardized output, the R2 gives each output feature equal importance in the spectral interpretation, whereas raw spectral intensity favors features of large variation, observed typically for features with large overall intensity. We motivate the choice of standardization by the nonlinearity of the structure–spectrum relationship: a weak spectral feature may be indicative of a more interesting or more widely present structural characteristic than its absolute intensity might suggest. We note, however, that the analysis methods applied in this work do not necessitate the use of either output standardization or the R2 score.
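For concreteness, a generalized (multi-output) R2 of this kind can be computed as one scalar over all spectral features; a minimal sketch, assuming this single-scalar convention:

```python
import numpy as np

def r2_generalized(S_true, S_pred):
    """Generalized covered variance over all output features at once."""
    ss_res = np.sum((S_true - S_pred) ** 2)
    ss_tot = np.sum((S_true - S_true.mean(axis=0)) ** 2)
    return 1.0 - ss_res / ss_tot
```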
Analogous to PCA, ECA works as a dimensionality reduction tool. Instead of maximizing the covered structural variance, however, the method maximizes the spectral variance covered by a decomposition in the structural space. As a result, the basis vectors of ECA can be used to identify descriptor features which affect (or do not affect) the shape of the target spectra, or even for approximate structural reconstruction from spectra.16 The method is capable of a remarkable reduction of dimensionality,14,16,17 improving on similar methods such as partial least squares fitting with singular value decomposition,57 as demonstrated in ref. 14. The first ECA component vector, shown in Fig. 4a, represents the dominant structural effect behind the variation of the spectrum. Although higher-rank ECA components may have cancelling contributions to those of lower ranks, these refinements are not equally relevant for spectrum interpretation, as manifested by the associated diminishing spectral effect. We also note that the overall sign of an ECA vector can be chosen arbitrarily (adjusting the sign of the corresponding score accordingly), but the relative signs of its components (e.g. the curves in Fig. 4a) are always fixed.
Structural interpretation of spectra sets several requirements for a descriptor. The representation needs to allow for accurate emulation, effective decomposition, and back-conversion to simple physical information. Although all of the studied descriptors are calculated from local atomistic structures, several factors complicate recovering such information from them. These include smearing the exact values on a grid and summation of information from many atoms into one feature, possibly with distance-dependent weights. In addition, some of the descriptors rely on basis functions and may potentially have an abstract mathematical form. In this line of thought, interpretation of descriptors calls for future efforts.
Machine learning by NNs requires large data sets, which have only recently become feasible owing to the increase in computational resources and developments in simulation tools. Advances in ML potentials for molecular dynamics49,58,59 and in electronic structure calculations60 could help generate more extensive and more accurate training data, leading to improved performance of spectrum emulation and subsequent analyses.
We observed a similar hierarchy in the comparison of the descriptor families for structural dimensionality reduction guided by covered spectral variance. The LMBTR stood out especially in the low-rank decompositions of the applied emulator-based component analysis. Although the system manifests significant complexity, the analysis method managed to condense the spectral dependence into two dimensions with R2 = 0.788 for an independent test set. The results indicated that even distant atoms have a significant effect on the XES, although it probes local bound orbitals around the emission site. The dominant underlying coordinate t1 followed the concentration of the system, whereas the inclusion of the second most relevant degree of freedom t2 allowed for a clear distinction of the protonation state of the acid molecule. Altogether, our results highlight the loss of structural information upon formation of a spectrum, which has implications for the justified interpretation of spectra using simulations.
Structural descriptors facilitate accurate prediction of X-ray spectra by a neural network. Advances in simulation methods can be anticipated to extend and improve the data sets, allowing for studies of even more complex systems and analyses of higher accuracy. Conversion of the descriptor back to simple atomistic information needs specific research efforts, as results presented in terms of these mathematically sophisticated representations can be difficult for a human to interpret.
Footnote |
† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d4cp02454k |