Machine Learning-Based Correction for Spin-Orbit Coupling Effects in NMR Chemical Shift Calculations

As one of the most powerful analytical methods for molecular and solid-state structure elucidation, NMR spectroscopy is an integral part of chemical laboratories associated with a great research interest in its computational simulation. Particularly when heavy atoms are present, a relativistic treatment is essential in the calculations as these influence also the nearby light atoms. In this work, we present a Δ-machine learning method that approximates the contribution to 13C and 1H NMR chemical shifts that stems from spin-orbit (SO) coupling effects. It is built on computed reference data at the spin-orbit zeroth-order regular approximation (ZORA) DFT level for a set of 6388 structures with 38 740 13C and 64 436 1H NMR chemical shifts. The scope of the methods covers the 17 most important heavy p-block elements that exhibit heavy atom on the light atom (HALA) effects to covalently bound carbon or hydrogen atoms. Evaluated on the test data set, the approach is able to recover roughly 85% of the SO contribution for 13C and 70% for 1H from a scalar-relativistic PBE0/ZORA-def2-TZVP calculation at virtually no extra computational costs. Moreover, the method is transferable to other baseline DFT methods even without retraining the model and performs well for realistic organotin and -lead compounds. Finally, we show that using a combination of the new approach with our previous Δ-ML method for correlation contributions to NMR chemical shifts, the mean absolute NMR shift deviations from non-relativistic DFT calculations to experimental values can be halved.


Statistical Quantities
The following quantities were used for the statistical evaluation of the chemical shift data of the benchmark studies with n data points.δ x denotes the predicted value (e.g., the purely SR chemical shift) and δ r the reference, (e.g., the SO one).

Structures in the data set
The geometrical distortion procedure is used to generate structures with variations in bond lengths, angles and dihedral angles from an optimized parent structure.The process was used in the same way for the Δ corr -ML method and is described in more detail in the respective Supporting Information.S1 For the data set of the Δ SO -ML model, three distorted structures were generated from each parent structure.They were chosen automatically from random generations to select one per energy window of 2.5-5.0 kcal•mol -1 , 10.0-15.0kcal•mol -1 , and 30.0-40.0 kcal•mol -1 higher than the parent structure.
A complete detailed list of all 1597 compounds would be out of scope for this Supporting Information.However, all structures are provided in the supporting .ziparchive as Cartesian coordinate (.xyz) files including combined files with a collection of optimized structures.All molecules consist of at least one carbon atom in a typical organic environment with at least one of the heavy atoms.The size of the molecules range between 3 and 46 atoms with a maximum number of 24 carbon or 27 hydrogen atoms per compound.The number of structures and chemical shifts that contain a certain element or group of elements is listed in Table S1.A more detailed list with extra information for every compound is provided in the supplementary raw data (.xlsx) file.

Structure and division of the data set
Like the ML architecture, the structure of the data set is also adapted from the previous Δ corr -ML model.It consists of 1597 compounds, each of which features four structures (one optimized and three distorted ones).During the data acquisition process, each structure is screened for carbon or hydrogen atoms that make up a data point.There are three ways to shuffle the data before it is divided into training and test set: • compounds mode: Shuffles only the compounds and it is ensured that all atoms from all structures of one compound are in the same data subset.
• structures mode: Shuffles the structures and it is ensured that all atoms of one structure are in the same data subset.However, different structures of the same compound can be part of the training and the test set.
• atoms mode: Shuffles all atoms, disregarding their structural origin.Atoms from one structure can be found in the training and the test set.
For the Δ corr -ML method, we have shown that the data bias that stems from including different degrees of geometric distortion of the same compound in both data subsets (structures mode) is reasonable.Switching to compounds mode does lead to an expected slight decrease of the generality of the model (exhibited by a worse performance on the test data set).However, this is a small effect rendering the structures mode reasonable for the statistical investigations.Using the atoms mode is inherently questionable as all information of the molecular origin of the atoms is lost as the data points form a molecule are spread over both data subsets.Hence, the structure mode was applied in all investigations presented in this work (as it was the case for the Δ corr -ML method).Note that for this reason, a somewhat reduced performance of the Δ SO -ML method is expected for structures with low similarity to the training data set as it is the case for the investigated organotin and -lead compounds and for the compounds in the 17HAC test set.For more information on the data set structure and shuffling modes, we refer to the investigations in the Supporting Information for the Δ corr -ML method.S1

Input feature vector
The relative NMR chemical shift δ of a compound (c), that can be measured in an experiment relative to an external reference compound (ref. ), is also computed as the difference of the respective absolute shielding constants σ: In all NMR shift calculations of this work, a tetrymethylsilane (TMS) molecule was treated in the exact same way as the actual compound.This holds both for the structures in the data set and for the conformer ensembles for the comparison with experimental data.
A complete list of the descriptors that have been included in the input feature vector of the neural network of the Δ SO -ML scheme is given in Table S2.In addition to the list, definitions for the magnetic input features are given in the following.All descriptors in this group are derived from the chemical shielding tensor that is obtained from a standard (low-level ) NMR shielding calculation.The definitions for the magnetic quantities are taken from the Δ corr -ML model and are recapitulated below.
The total NMR shielding tensor σ is defined as with The total isotropic shielding constant σ iso is the trace of the shielding tensor: The span Ω, skew κ, asymmetry η and anisotropy ∆ result from σ and σ iso as defined in Table S2.Furthermore, the dia-and paramagnetic parts σ iso,dia and σ iso,para of the total isotropic shielding constant are also included in the input feature vector.
Dia-and paramagnetic part of the shielding constant σ iso,dia , σ iso,para ✓ ✓ * For 1 H: HA bound directly to 1 H or via two bonds (max.3 HAs), for 13 C: HA bound directly to 13 C (max. 4 HAs).
3 Supplementary Data

Analysis of the data set
Complementarily to the analyses of the data set in the manuscript (see Fig. 3), more visualisations are presented herein.In the following figures, the color codes corresponds to the distance of the 13 C/ 1 H nuclei to the closest heavy atom (HA) in terms of numbers of covalent bonds.For a clearer picture, the color code is depicted in Figure S1.
A representation of the complete data set for 1 H NMR is shown in Figure S2 (analogous to Fig. 3).Separate and more detailed analyses of the data set for all three distance categories are given in Figures S3 to S5

Te Se
(a)

Organotin and organolead compounds
For the evaluations of the organotin and -lead compounds, two benchmark sets were investigated.The organotin compounds were taken from the SnS51 set S2 and for the organolead compounds, a successor study is in progress during the publication process of this work.The benchmark study for 207 Pb NMR chemical shifts contains a similar number of experimentally accessible organolead structures, for which conformer ensembles were generated.In both cases (Sn and Pb), only the conformer lowest in Gibbs free energy was taken from each ensemble and the Sn(CH 3 ) 4 and Pb(CH 3 ) 4 compounds were included as well.Then, the reference value of ∆ SO δ was calculated at the level used throughout the study (SO-ZORA-PBE0/TZ2P).Some compounds were omitted due to technical issues including convergence problems of the SO-PBE0 calculation and the presence of atoms that were not included in any training data structure (such as transition metal atoms).
In the case of the SnS51 set, the following compounds were not included here: 6, 7, 8, 9, 10, 12, 13, 14, 15, 18, 22, 29, 41, 47, 48.In total, this yields 34 Sncontaining compounds with a total number of 817 calculated 13 C and 1170 1 H NMR shifts (without averaging of chemically equivalent atoms).For the Pb test set, some compounds were excluded for the same reason as for the Sn-containing compounds.For a detailed analysis of the structures, we refer to the upcoming publication.In total, 48 compounds were included with 1415 calculated 13 C and 2059 1 H NMR shifts (without averaging of chemically equivalent atoms).All structures are included as Cartesian coordinate files (.xyz) in the supplementary .ziparchive.
In addition to the results presented in Fig. 8, the evaluation of the 1 H NMR shift data for the SO-HALA effects of Sn and Pb atoms is presented in Figure S9.any SO contribution (PBE0/ZORA-def2-TZVP, full colors) and with the MLpredicted ∆ SO δ values (brighter colors).The data is averaged over 1 H nuclei bound to a HA via one/two (H-X 0/1 -HA), three (H-X 2 -HA), and four or more bonds (H-X 3 -HA) and over the full data (all).

17HAC test set
The 17HAC test set was compiled to address the heavy atom effect on 13 C (HAC) for all 17 HAs included in this study.For this, 13 C NMR data for 63 (metal-)organic compounds were collected yielding a total number of 236 experimental 13 C NMR shifts.For each heavy element, at least three compounds that contain it are included in the 17HAC set.The 13 C NMR shifts originate from atoms at different distances from the HA (all categories shown in Figure S1(a) are represented) and the subsequent computational investigation revealed that the magnitude of the HALA effect to 13 C varies throughout the range of compounds.
A complete list of the compounds included in the 17HAC set and some further details is given in Table S3.The chemical shift data itself and the experimental references are provided in the supplementary raw data (.xlsx) file.
The workflow used for the low-level NMR chemical shift calculations is a commonly applied way to incorporate solutions to the typical error sources of conformational flexibility and solvation effects into the computation.For each compound, a conformer ensemble was generated and refined (here with CREST and CENSO, see the manuscript) and the results of the subsequent NMR shielding calculations applying the implicit CPCM solvation model were Boltzmann averaged (at 298 K) to obtain the chemical shielding constant for each atom.The reference was tetramethylsilane (TMS) for all compounds and was treated in the same way in the respective solvent.The final computed NMR chemical shift was obtained via equation ( 7) and the values for all chemically equivalent 13 C nuclei were averaged to be comparable to the experimental value.To disentangle the (ML) correlation and SO contributions to the NMR shift from the solvation contribution, the ML models are always applied to chemical shift values from calculations without a solvent model before the contributions are added to the values including solvation.Therefore, all chemical shifts were recalculated without applying the CPCM model before both Δ-ML methods were applied (separately) to each chemical shift value of every conformer.In this way, the final chemical shift value can be divided into the low-level value (including conformational and solvation effects) and the correlation as well as the spin-orbit contribution for the results obtained by the Δ corr -and the Δ SO -ML methods, respectively.For this purpose, all chemical shift calculations for the data set of the Δ corr -ML method were performed at the PBE0/ZORA-def2-TZVP level of theory and this new low-level data was used to retrain the Δ corr -ML model.Finally, after the presented workflow was applied to every compound in the 17HAC set, the statistics were obtained that can be found in Table 3. Table S3: List of the compounds of the 17HAC data set including the solvent used in the NMR measurement, the number of conformers obtained from the workflow described above, the number of experimentally obtained 13 C NMR chemical shifts, and their respecitve reference.

Figure S1 :
Figure S1: Example molecule showing the color code of the distance categories used throughout.(a) 13 C, division in one (blue), two (yellow), and three or more (gray) covalent bonds between 13 C and the HA (= Se, Te), (b) 1 H, division in one or two (blue), three (yellow), and four or more (gray) covalent bonds between 1 H and the HA.The indifferent H/C atoms are shown in white.(a) is equivalent to Fig. 3(a).

Figure S2 :Figure S3 :Figure S4 :Figure S5 :
Figure S2: Complete data set with 64436 1 H NMR chemical shifts showing the relation between the purely scalar-relativistic NMR shift δ SR calculated with the lowlevel PBE0/ZORA-def2-TZVP method and the relativistic reference δ SO = δ SR + ∆ SO δ with the SO contribution calculated at the PBE0/TZ2P level.For the color code, see Figure S1(b).

Figure S6 :
Figure S6: NMR shifts from 1 H nuclei connected to a HA via one or two bonds showing the correlation between δ SR calculated with the low-level PBE0/ZORA-def2-TZVP method and the target value ∆ SO δ calculated at the PBE0/TZ2P level.

Figure S7 :
Figure S7: NMR shifts from 1 H nuclei connected to a HA via three bonds showing the correlation between δ SR calculated with the low-level PBE0/ZORA-def2-TZVP method and the target value ∆ SO δ calculated at the PBE0/TZ2P level.

Figure S8 :
Figure S8: NMR shifts from 1 H nuclei connected to a HA via four or more bonds showing the correlation between δ SR calculated with the low-level PBE0/ZORA-def2-TZVP method and the target value ∆ SO δ calculated at the PBE0/TZ2P level.

Figure S9 :
Figure S9: Comparison of 1 H NMR metrics for the H-Sn and H-Pb test structures withoutany SO contribution (PBE0/ZORA-def2-TZVP, full colors) and with the MLpredicted ∆ SO δ values (brighter colors).The data is averaged over 1 H nuclei bound to a HA via one/two (H-X 0/1 -HA), three (H-X 2 -HA), and four or more bonds (H-X 3 -HA) and over the full data (all).

Table S1 :
Numbers of structures (optimized and distorted) and chemical shifts that are included in the complete data set and consist of at least one atom of the respective element of group of elements.

Table S2 :
List of variables included in the input feature vector of the Δ SO -ML model for 1 H and 13 C chemical shift SO contribution prediction in three groups.The term n th bonding sphere refers to all atoms within a distance of n covalent bonds to the origin atom.Details of the equations in the magnetic properties group can be found above.SO δ = δ SO − δ SR with δ = σ ref. − σ Each for 1 st , 2 nd , 3 rd , and 4 th (and 5 th for 1 H) bonding sphere: H, C, N, O, S, Cl, Zn, Ga, Ge, As, Se, Br, Cd, In, Sn, Sb, Te, I, Hg, Tl, Pb, Bi ✓ Number of atoms of type . . . in the second bonding sphere: H, C, N, O, S, Cl, Zn, Ga, Ge, As, Se, Br, Cd, In, Sn, Sb, Te, I, Hg, Tl, Pb, Bi ✓ ✓ Atom-centered symmetry functions (ACSF) G 1 with cutoff 5.0 angstroms including: H, C, N, O, Si, P, S, Cl, Zn, Ga, Ge, As, Se, Br, Cd, In, Sn, Sb, Te, I, Hg, Tl, Pb, Bi ✓ ✓Each for 1 st and 2 nd bonding sphere: