Ricardo Grau-Crespo*ab,
Said Hamad
c,
Salvador R. G. Balestra
d,
Ramsey Issae,
Taylor D. Sparks
e,
Arantxa Fernandesf,
Ben L. Griffithsf,
Robert F. Moranf,
David McKayg and
Sharon E. Ashbrook
*f
aSchool of Engineering and Materials Science, Queen Mary University of London, London E1 4NS, UK. E-mail: r.grau-crespo@qmul.ac.uk
bDepartment, of Chemistry, University of Reading, Whiteknights, Reading, RG1 6DX, UK
cDepartamento de Sistemas Físicos, Químicos y Naturales, Universidad Pablo de Olavide, Ctra. Utrera Km. 1, Sevilla 41013, Spain
dDepartamento de Física Atómica, Molecular y Nuclear, Área de Física Teórica, Universidad de Sevilla, Av. Reina Mercedes s/n, 41012 Seville, Spain
eDepartment of Materials Science and Engineering, University of Utah, Salt Lake City, UT 84112, USA
fSchool of Chemistry, EaStCHEM and Centre of Magnetic Resonance, North Haugh, St Andrews, KY16 9ST, UK. E-mail: sema@st-andrews.ac.uk
gEPCC, Bayes Centre, University of Edinburgh, 47 Potterrow, Edinburgh, EH8 9BT, UK
First published on 16th September 2025
Understanding the atomic-scale local properties of solid solutions is crucial for deciphering their structure–property relationships. In this work, we present a computational approach that combines solid-state nuclear magnetic resonance (NMR) spectroscopy with density functional theory (DFT) calculations to investigate local chemical environments in solid solutions. Previous canonical ensemble models, which only sample configurations at a fixed composition of the simulation cell, fail to capture local compositional fluctuations that can significantly influence the NMR spectra. To address this limitation, we employ a grand-canonical ensemble approach enabling a more comprehensive representation of the contributions of all possible local chemical environments to the NMR spectrum, using a La2(Zr1−xSnx)2O7 pyrochlore solid solution as a case study. To mitigate the high computational cost of such simulations, we also explore ensemble truncation strategies and the use of machine learning (ML) to aid predictions of NMR chemical shifts, achieving a significant reduction in computational cost while maintaining most of the predictive power. Our results show that combining the grand-canonical approach with machine learning and ensemble truncation offers an efficient framework for modelling and interpreting NMR spectra in disordered crystalline materials.
Although the theory to obtain NMR chemical shifts from quantum-mechanical simulations in a periodic solid is well established,4–6 the interpretation of NMR spectra from disordered solid solutions remains highly challenging, primarily because spectra represent statistical averages over numerous distinct chemical environments. Ensemble-based approaches combining density functional theory (DFT) calculations with statistical mechanics have demonstrated considerable promise in simulating NMR spectra of site-disordered solids and interpreting the complex spectral lineshapes obtained.7–9 Such approaches usually involve systematically enumerating chemical configurations within a symmetry-adapted configurational ensemble, allowing for statistical averaging of computed chemical shifts to reconstruct the experimental NMR spectrum (usually assuming this is acquired under magic-angle spinning (MAS) conditions). Despite their success, the methods used in previous studies face two significant limitations: firstly, finite supercell sizes restrict the range of possible local chemical environments; secondly, these approaches are computationally demanding, rapidly becoming impractical as the configurational space expands.
Here, we propose a computational strategy based on grand-canonical ensembles (as in the so-called quasi-chemical approximation)10 to overcome these challenges. In contrast to canonical-ensemble methods used before for solid-state NMR modelling, where the simulation supercell has the same composition as the solid, the grand-canonical ensemble enables the sampling of configurations with varying compositions. This flexibility provides comprehensive access to the entire range of local chemical environments, including those involving large compositional fluctuations away from the average composition of the solid, without the need for a very large simulation supercell, which is problematic due to the high computational cost of DFT evaluations of NMR chemical shifts. We illustrate the utility of this method by applying it to interpret the complex lineshapes seen in the 119Sn MAS NMR spectra of a La2(Zr1−xSnx)2O7 pyrochlore solid solution. Pyrochlore solid solutions provide a useful test case, combining intricate local structural variations with detailed (and typically multinuclear) experimental NMR data to benchmark computational approaches.5,7,11,12 Moreover, we address the critical computational bottleneck associated with extensive DFT-based configurational sampling by controlled ensemble truncations and by integrating machine-learning (ML) techniques.
We systematically generate all the symmetrically inequivalent configurations for each number n of B-site substitutions in a pyrochlore unit cell La16Zr16−nSnnO56, using the Site Occupancy Disorder (SOD) program.16,17 The total number of configurations with n Sn substitutions:
![]() | (1) |
and the number of those that are inequivalent (Mn) are listed in Table 1, for each n. Here N is the number of sites over which substitutions are considered (i.e., N = 16 for the B sites in the pyrochlore cell) and x = n/N is the molar fraction of substitutions.
Chemical formula | x | n | Wn | Mn |
---|---|---|---|---|
La2Zr2O7 | 0 | 0 | 1 | 1 |
La2(Sn0.0625Zr0.9375)2O7 | 0.0625 | 1 | 16 | 1 |
La2(Sn0.125Zr0.875)2O7 | 0.125 | 2 | 120 | 3 |
La2(Sn0.1875Zr0.8125)2O7 | 0.1875 | 3 | 560 | 8 |
La2(Sn0.25Zr0.75)2O7 | 0.25 | 4 | 1820 | 22 |
La2(Sn0.3125Zr0.875)2O7 | 0.3125 | 5 | 4368 | 35 |
La2(Sn0.375Zr0.625)2O7 | 0.375 | 6 | 8008 | 65 |
La2(Sn0.4375Zr0.5625)2O7 | 0.4375 | 7 | 11![]() |
82 |
La2(Sn0.5Zr0.5)2O7 | 0.5 | 8 | 12![]() |
97 |
La2(Sn0.5625Zr0.4375)2O7 | 0.5625 | 9 | 11![]() |
82 |
La2(Sn0.625Zr0.375)2O7 | 0.625 | 10 | 8008 | 65 |
La2(Sn0.875Zr0.3125)2O7 | 0.6875 | 11 | 4368 | 35 |
La2(Sn0.75Zr0.25)2O7 | 0.75 | 12 | 1820 | 22 |
La2(Sn0.8125Zr0.1875)2O7 | 0.8125 | 13 | 560 | 8 |
La2(Sn0.875Zr0.125)2O7 | 0.875 | 14 | 120 | 3 |
La2(Sn0.9375Zr0.0625)2O7 | 0.9375 | 15 | 16 | 1 |
La2Sn2O7 | 1 | 16 | 1 | 1 |
Total | 65![]() |
531 |
The thermodynamics of mixing and the NMR spectra in this system were simulated using a grand-canonical ensemble approach, which includes configurations of different compositions with probabilities that depend on the relative chemical potential of the two species being mixed. Whereas the canonical ensemble only considers configurations with n = xN substitutions, the grand-canonical ensemble considers configurations with all possible compositions n = 0, …, N. The probability of the mth configuration with n substitutions in the grand-canonical ensemble is:
![]() | (2) |
![]() | (3) |
In the limit of full disorder (formally infinite temperature), it is not necessary to calculate a chemical potential to obtain the probabilities, as in that case it can be demonstrated that:
Pnm = Ωnm xn(1 − x)N−n. | (4) |
The dependence on x in the equation above is such that the largest contributions, in terms of occurrence probabilities, come from configurations with n close to xN. The cumulative probabilities of all configurations with n substitutions are:
![]() | (5) |
which has a maximum when n = xN. For example, Fig. 2 shows the result obtained if we take a hypothetical large cell with N = 100 and model the composition x = 0.2. Since in this large cell the canonical ensemble with n = xN (n = 20 in this case) is likely to contain most meaningful chemical environments, the canonical ensemble works well here as a first approximation. However, in the case of much smaller cells (such as those often required when predicting NMR parameters with periodic DFT), the canonical ensemble may be a poor representation of the disordered solid, as it can miss possible local environments and/or fail to reproduce the correct statistical weights associated with each configuration.
![]() | ||
Fig. 2 Cumulative probability Pn of all configurations with n substitutions in a hypothetical supercell with N = 100 sites, as a function of n, when the overall fraction of substitutions is x = 0.2. |
The grand-canonical algorithm used here has been implemented within the SOD code.16 It is useful to note that, in the context of solid solutions (and in other mixtures), the ensemble described above is often called the semi-grand-canonical ensemble,19,20 to emphasise that it is only the relative proportion between the two species that is allowed to change, while the total number remains constrained by the number of available crystal sites. We prefer to refer to this ensemble simply as grand-canonical, as this more general term includes the case when some of the species disordered over the crystal sites are vacancies or interstitials, e.g. H distribution in equilibrium with an external H2 gas, which can be treated within the same formalism.21,22
The computation of NMR parameters used the DFT settings described above and employed the gauge-including projector augmented wave (GIPAW) approach6,23 to reconstruct the all-electron wave functions in the presence of an induced magnetic field. On average, each optimisation + NMR calculation of a configuration required ∼90 minutes using 896 cores on ARCHER2,28 or around 40 million CPU hours for all configurations. The calculations produced the absolute shielding tensor (σcalc) in the crystal frame. Diagonalising the symmetric part yielded the principal components σcalc11, σcalc22 and σcalc33, from which we can determine the isotropic shielding constant, σcalciso = (σcalc11 + σcalc22 + σcalc33)/3. To allow comparison with experimental data, we derived the corresponding isotropic chemical shift, δcalciso, using σcalciso.
For each configuration we generated the predicted 119Sn MAS NMR spectra by including an individual Gaussian function for each δcalciso value obtained in the calculation of the corresponding configuration. A fixed line broadening of 0.85 ppm was applied to each Gaussian function, which was chosen based on the linewidth seen for the single signal in the 119Sn MAS NMR spectrum of the end member La2Sn2O7. The spectra were then weighted by the respective probability of that configuration before being combined to generate a final spectrum for each composition.
The fact that the grand-canonical ensemble is a more complete representation of the disordered solid than the canonical ensemble is best illustrated by comparing the corresponding configurational entropies. For the canonical ensemble, the maximum mixing (configurational) entropy per site, obtained in the limit of full disorder (T → ∞), is:
![]() | (6) |
which matches the exact value:
![]() | (7) |
only in the limit of an infinite supercell (N → ∞). In contrast, it can be demonstrated that the mixing entropy obtained from the grand-canonical ensemble:
![]() | (8) |
leads to the exact value for any supercell size, i.e.,
![]() | (9) |
This result is a consequence of the fact that the grand-canonical ensemble admits all possible configurations of substitutions, whereas the canonical ensemble only admits the configurations that are compatible with the canonical composition within a cell, an observation that is important for our discussion of the 119Sn MAS NMR spectra below.
We now illustrate the performance of the grand-canonical ensemble model in the case of the La2(SnxZr1−x)2O7 solid solution and compare this with the canonical ensemble model. Fig. 4a shows the configurational entropy (calculated from the DFT energies) for x = 0.5, as a function of temperature, for each ensemble. The maximum entropy (in the limit of infinite temperature) per formula unit that can be obtained from the canonical model in a cell with 8 formula units corresponds to , where W = 16!/(8!8!) = 12
870 is the total number of configurations with that composition in the cell. However, the exact maximum entropy should be 2kB
ln
2 ≈ 11.5 J mol−1 K−1, because there are two B sites per La2(Sn0.5Zr0.5)2O7 formula unit, each with 1/2 occupancy by Sn and 1/2 occupancy by Zr in the formula unit. Therefore, the canonical model with this cell size can only give ∼85% of the possible configurational entropy for the x = 0.5 composition. This can be interpreted as the canonical model missing some local chemical environments/configurations, which are responsible for the unaccounted 15% of the configurational entropy. That incompleteness has important consequences for the discussion below about the agreement between calculated and experimental NMR spectra: the missing configurations in the canonical ensemble correspond mainly to less probable environments, but some of these have distinctive chemical shifts that can affect the spectra. The grand-canonical model, on the other hand, gives a better description of the entropy, including the correct high-temperature limit of the configurational entropy, and therefore leads to a more accurate thermodynamic analysis (and to a better account of possible chemical environments, as will be discussed below). At any temperature high enough for cation diffusion to take place, say above at least 600 K, the grand-canonical configurational entropy is practically identical to the configurational entropy of a perfectly disordered system.
Fig. 4b shows the calculated mixing enthalpies and free energies at T = 873 K, as a function of composition. Enthalpies were obtained from the average of DFT energies as , i.e. neglecting pressure and vibrational effects which typically contribute little to the mixing value, ΔHmix = H(X) − (1 − x)H(0) − xH(1), in oxide solid solutions.34,35 Whereas the canonical ensemble leads to acceptable results for the mixing enthalpy, the calculation of the mixing free energy (ΔGmix = ΔHmix − TSmix) is of course more problematic, due to the underestimation of the mixing entropy in the canonical ensemble.
The mixing thermodynamics analysis shows that, while the system is not an ideal solid solution, because it has a (small) positive mixing enthalpy, it does exhibit an almost perfect level of disorder for equilibration temperatures above ∼600 K. Since the synthesis procedure involves temperatures much higher than that, and the cation distribution does not equilibrate upon cooling to room temperature (due to high barriers for cation diffusion), these results suggest that the solid solution will remain very highly disordered when cooled to room temperature.
Upon substitution of Zr, a change in isotropic chemical shift is expected, owing to the difference in electronegativity between Sn4+ and Zr4+ cations. The number of Sn NNN can vary from 0 to 6; therefore, in principle, 7 signals might be expected in the spectrum. However, when 2, 3 or 4 Sn/Zr are substituted, there are 3 different ways of arranging these in each case, leading to a total of 13 different NNN environments, although the shift differences between varying arrangements of the same number of Sn/Zr may be smaller and less well resolved in the spectrum. Fig. 5a shows that as Zr is substituted into Y2Sn2O7 (e.g., x = 0.875), additional signals are seen at decreasing shift (−649 and −657 ppm), which could be assumed to result from Sn with 1 and 2 Zr NNN, respectively. However, further increase in the Zr content does not lead to additional resonances as might be expected, and for the lowest level of Sn present (x = 0.125) only three signals are still seen, but now at −646, −652 and −658 ppm.
The powder XRD measurements (see above) confirmed the presence of a solid solution and showed no evidence for phase separation into Sn-rich and Zr-rich pyrochlores, in agreement with the DFT predictions in this work. Previous experimental work on La2(SnxZr1−x)2O7 using 17O MAS NMR spectroscopy concluded that the B-site cation distribution was close to random, albeit with some evidence for a weak preference for the formation of Sn–O–Sn and Zr–O–Zr over Sn–O–Zr bonds (i.e., very low levels of Sn and Zr clustering),32 suggesting all 13 possible NNN will likely be present to some extent for Sn. This leads to the conclusion that signals from Sn with higher numbers of Zr NNN are overlapped with those arising from 1, 2 or 3 Zr on the surrounding B sites, in agreement with previous work on Y2(SnxTi1−x)2O7,3,7 where multiple resonances were seen in 89Y MAS NMR spectra, but the 119Sn spectra showed broad and overlapped signals. This behaviour was ascribed to two competing contributions to the isotropic chemical shift: the decrease resulting from the substitution of the less electronegative Ti4+ cation, and a concomitant increase arising from the change in cell size on substitution of the smaller Ti4+ cation. Although Zr also has a lower electronegativity than Sn (1.33 cf. 1.96), Zr4+ is much more similar in size to Sn4+ (86 ppm cf. 83 ppm) than Ti4+ (74.5 ppm), resulting in the overlap of some signals in the 119Sn MAS NMR spectrum of La2(SnxZr1−x)2O7, but with better resolution than what was seen for Y2(SnxTi1−x)2O7. Clearly, however, it would not be straightforward to deconvolute the complex and overlapped lineshapes observed into signals from the 13 (or even just 8) NNN environments, or indeed to predict exactly where each of these signals is likely to appear. Any such analysis also neglects the effects on the shifts from changes in the longer-range structure and/or those in local geometry (e.g., bond angles and bond distances). This makes any analysis of the experimental 119Sn MAS NMR spectra, and the extraction of the information they contain on cation disorder, very difficult to carry out without the aid of supporting computational studies.
![]() | (10) |
To understand the difference at low Sn concentrations, let us focus on the case with x = 0.125 (Fig. 6). In the canonical ensemble simulation, because the cell employed only has two Sn atoms, the spectrum for a configuration with that composition can only show two peaks, corresponding to Sn with z = 0 and z = 1 Sn NNN, respectively. However, by using the grand-canonical ensemble, we can account for all contributions for any z between 0 and 6. The Sn cations with z = 0 contribute one of the two large peaks, at values between −647 ppm and −655 ppm (this peak has a small shoulder at ca. −651 pm, which results from lower-probability z = 0 configurations with higher Sn occupancies in the coordination spheres beyond NNN; that distinction seems exaggerated by our model, as it appears unresolved in experiment). The second large peak, between −655 ppm and −660 ppm, is the z = 1 peak, which is present in both the canonical simulation and in experiment. Crucially, in the grand-canonical simulation we also observe that a third peak appears at more negative chemical shifts, between −660 ppm and −665 ppm. That peak is very clear in the experimental spectrum but could not be obtained with the canonical ensemble approach. The peak is the result of clusters of 3 Sn atoms, where a given Sn atom has z = 2 NNN Sn atoms, as shown in Fig. 6. In particular, the peak between −660 ppm and −665 ppm appears when an Sn atom has two NNN Sn atoms in a “trans” configuration (i.e. occupying opposite positions around the central Sn atom in the coordination environment scheme in Fig. 1b). Sn atoms with two NNN Sn atoms in “cis” configuration (both NNN Sn atoms in adjacent positions) contribute a peak that appears at less negative chemical shifts, roughly in between the peaks with z = 0 and z = 1; that overlap makes the z = 2 “cis” peak difficult to resolve in experiment, in contrast with the z = 2 “trans” peak, which is clearly visible (Fig. 5a, bottom).
Clusters of 3 Sn atoms are absent in the canonical representation of the composition with x = 0.125 for the simulation cell size used in this work, since in this case only 2 Sn atoms are included per cell, and therefore all z ≥ 2 contributions are missing. In principle, modelling this compositional fluctuation in the canonical approach is possible, but it would require a larger simulation cell. For example, doubling the cell in only one direction (which would be a minimal but not ideal extension as it would break the cubic symmetry) would give an N = 32 supercell, where the composition x = 0.125 would be represented by n = 4 Sn atoms and therefore z = 2, 3 clusters would be possible. However, such large cells quickly become impractical (in terms of computational cost), both because of the combinatorial explosion of possible configurations and the increased computational cost of optimising and predicting the NMR parameters for each configuration in the ensemble at DFT level.
The grand-canonical approach thus provides an elegant route to account for compositional fluctuations in atomic local environment in the simulation of the NMR spectra of solid solutions, without the need for prohibitively large simulation cells, as evidenced in the case of the x = 0.125 composition. Still, there are substantial discrepancies between model and experiment at other (mainly intermediate) compositions. For example, at x = 0.375 and x = 0.5 the simulated spectra seem to have more peaks than those observed experimentally. We believe the discrepancies between model and experiment are mainly due to the factors described below.
First, while the DFT-GIPAW approach is extremely powerful and valuable, it is still approximate and we find that the absolute values of the chemical shifts, as well as some relative values, are not well reproduced. This pyrochlore oxide solid solution is a particularly challenging case as the shift differences are quite small and therefore difficult to resolve. At intermediate concentrations, when the number of possible chemical environments contributing to the spectra becomes higher, prediction errors accumulate more densely, leading to less well-defined peaks and stronger discrepancies with experiment.
Second, while the grand-canonical approach corrects the leading finite-size error of the canonical model in terms of configurational statistics (in the sense that it allows for a more flexible composition of the first Sn–B coordination sphere, regardless of the overall composition), it still suffers from some finite-size effects. The second and third Sn–B coordination spheres correspond to distances (between 6.5 and 7 Å, and between 7.5 and 8 Å, respectively) that are longer than half the length of the simulation cell, thus distorting somewhat the statistical weighting of the smaller but still important longer-range contributions. This effect also accumulates more problematically at intermediate compositions compared to near the solid solution's end-members, further contributing to discrepancies with experiment.
Third, the ion dynamics in the flat potential energy surface of this pyrochlore is not accounted for by our use of (T = 0 K) optimised geometries. In our “static” approach, each DFT simulation is performed at a specific local relaxation pattern corresponding to the energy minimum for the given Sn/Zr occupancy configuration. However, we have observed that DFT predicts shallow minima for this solid (to the extent that the precise DFT-optimised local geometries are often sensitive to the details of the optimisation algorithm). Finite-temperature measurements might therefore be influenced by dynamic effects that are not well accounted for in our model.
There are other potential sources of error. For example, chemical shift anisotropies (CSAs) were ignored in our analysis. They are between 60 and 70 ppm, which is about 10 kHz at 9.4 T. Since we spin the samples at 14 kHz, we would expect CSA to have a very small effect. There is also the possibility that the experimental samples have defects in concentrations high enough to affect the NMR spectra, as well as other complexities resulting from kinetic effects in the synthesis. With the computational tools we are developing these effects would be a good area for future investigation, but in the present case they are unlikely to be more important than the more critical simulation limitations described above.
We have also examined the potential contribution of stress-volume effects. In the current formulation, each configuration (n, m) contributes to the ensemble in proportion to the probability Pnm of eqn (2), with an energy calculated via a DFT full relaxation of geometry and cell parameters under periodic boundary conditions. However, if the equilibrium cell parameters vary strongly with composition, one can expect that configurations with n far from xN would require an additional energy penalty, due to the stress of accommodating the compositional fluctuation, leading to them having lower probability of occurrence with respect to the uncorrected grand-canonical formalism. We have implemented an approximate (volumetric) correction for this effect: if we know both the equilibrium cell volume V and bulk modulus B as a function of x, the energy penalty added to each Enm as a stress-volume correction (SVC) is:
![]() | (11) |
For our La2(SnxZr1−x)2O7 case study, using bulk moduli and cell volumes obtained from linear interpolation between the DFT values of the endmembers, we found that this SVC correction had no significant effect on the results, which can be explained by the similar radii of Sn and Zr ions and the weak variation of volume with composition in this system. However, the correction given by eqn (11), or more sophisticated versions of it, might be needed in other systems where stress associated with compositional fluctuations plays a more important role.
A final caveat is that, while the grand-canonical formalism allows the adoption of a relatively small simulation cell, it nevertheless still requires a very significant computational effort. Even when one is interested in modelling a specific solid solution composition, it is necessary to calculate energies and NMR parameters for all the configurations at different cell compositions. The number of configurations scales exponentially with the supercell size (as 2N for a binary solid solution) and can only be slightly reduced by exploiting the crystal symmetry. We therefore need to explore routes to accelerate the simulations for this computation method to become a realistic analysis tool.
We can illustrate this idea with the example of the pyrochlore with composition La2Sn0.25Zr1.75O7, which corresponds to x = 0.125. In the canonical ensemble, this corresponds to n = 2 Sn/Zr substitutions in the cell with N = 16 sites. Configurations with other values of n, which are omitted in the canonical ensemble, are expected to be relevant as this is a small cell, and we are close to an endmember. Fig. 7a shows the cumulative probabilities , calculated using the grand-canonical ensemble, at 873 K, as well as those expected in the limit of full disorder (formally corresponding to infinite temperature, but obtained with the binomial distribution) at the same compositions. Because the system is highly disordered, the grand-canonical cumulative probabilities follow the binomial distribution well. As expected, for composition x = 0.125 the maximum probability lies close to n = 2 substitutions over the 16 B sites, whereas for composition x = 0.75 the maximum probability lies close to n = 12 substitutions over the 16 sites. The cumulative probability of configurations with a given n becomes much lower the farther n is from xN. We can therefore truncate the number of configurations employed in the grand-canonical analysis, by retaining only configurations with n close to the canonical values. In fact, it seems that it is often sufficient to take one or two n values at each side of xN to achieve a converged result. In Fig. 7b and c, we show the effect of performing the truncated grand-canonical analysis, where instead of using the full range of configurations (0 ≤ n ≤ 16), we use only 0 ≤ n ≤ 4 (i.e. n = xN ± 2) for x = 0.125 and 11 ≤ n ≤ 13 (n = xN ± 1) for x = 0.75. The spectra obtained using these truncated ensembles are very similar to those obtained using the full range of configurations, but they are obtained at a fraction (∼7% and ∼23% in the two cases discussed) of the computational cost of evaluating the full ensemble. The advantage introduced by ensemble truncation seems higher for compositions near the solid solution endmembers.
We considered five different ML models, for which the results are shown in Table 2, comparing the mean absolute errors (MAE) between the DFT-calculated 119Sn chemical shifts and those predicted with ML. The number of configurations included in the training set is 20% of the total number of configurations (we took all the configurations with extreme compositions, i.e., n = 0, 1, 2, and 16, 15, 14, plus a random selection from the rest of the compositions to complete 20%). The MAE was evaluated for all the other configurations not in the training set (i.e. 80% of the configurations in the ensemble). The best performing model was the random forest, with a MAE just below 3 ppm.
Model | MAE (ppm) |
---|---|
Random forest | 2.99 |
Gradient boosting | 3.04 |
SVR | 3.37 |
Bayesian ridge | 3.88 |
Linear regression | 3.88 |
Since the DFT optimisation and NMR calculations of all the configurations in the full ensemble are so computationally demanding (in our case, ca. 40 million CPU hours), if the ML predictions could give results of similar quality to those from the DFT calculations, we would have a method for more efficient prediction of NMR spectra of solid solutions with DFT quality in a much shorter time scale. To explore this, we calculated the 119Sn MAS NMR spectra, with the grand-canonical method, using both the DFT-calculated 119Sn isotropic shifts, and those predicted with the random forest ML model. As Fig. 8 shows, the spectra calculated with the ML approach shows close similarity with the one calculated in the full-DFT approach. Most of the peaks that appear in the full-DFT simulations are also present in the ML-accelerated simulations, and they are present at the same, or only slightly shifted, chemical shifts.
Still, there is plenty of room for improvement in the ML model. For some compositions the discrepancies with the full-DFT calculations are significant: for example, for x = 0.375 the ML-extended model predicts two well-defined peaks, whereas the full-DFT spectrum has a more much complex structure. Interestingly, the ML-aided result at this composition looks more like the experimental spectrum (in Fig. 5) than the full-DFT result does. The reason for this error cancellation is likely to be related to dynamic effects, which, as mentioned earlier, lead to effective averaging over different local relaxation patterns in the experimental spectra: the ML simulation ends up being closer to experiment by avoiding local relaxation effects (it uses the geometry of the unrelaxed lattice to save the cost of DFT geometry optimisations, and that might resemble the dynamic average better than the DFT local minima). This hypothesis deserves further investigation in future studies, perhaps by combining molecular dynamics with ML interatomic potentials and fast ML-aided NMR predictions. In any case, the limitations of the current ML model to reproduce the target NMR chemical shifts from DFT still need to be addressed. For example, we might need to use more expressive descriptors of local chemical environments, such as SOAP (Smooth Overlap of Atomic Positions) feature vectors,37 which have been successfully used for NMR predictions elsewhere;38–41 and it might be possible to improve our sampling strategy in the creation of the training dataset, including by using active learning approaches.42
Despite its limitations, our initial ML results already suggest that conducting the computationally intensive DFT calculations for all configurations may not be necessary if a sufficiently representative training set is available. Instead, a fraction of the original number of DFT calculations (∼20%) appears to be sufficient in this case. Furthermore, our initial results show how future studies could be expanded to include larger simulation supercells to account better for the statistical distribution of chemical shifts without the artifacts introduced by periodic boundary conditions.
The combination of grand-canonical ensembles and DFT simulations is, however, a computationally expensive approach to the simulation of the NMR spectra of solid solutions. We have demonstrated that ensemble truncation strategies significantly reduce the computational cost of these simulations without significantly sacrificing accuracy, if specific compositions are being studied. Machine learning techniques allow further enhancement in efficiency by decreasing the number of necessary DFT calculations and predicting NMR chemical shifts for uncomputed structures. Furthermore, these techniques open the possibility of performing future studies in larger simulation cells, to afford a more complete description of the configurational statistics of chemical shifts.
The presented methodology not only refines the interpretation of experimental solid-state NMR spectra but also establishes a scalable and transferable framework for studying other complex solid solutions.
This journal is © The Royal Society of Chemistry 2025 |