Open Access Article
Jens Wagner, Kerstin Münnemann, Thomas Specht, Hans Hasse and Fabian Jirasek*
Laboratory of Engineering Thermodynamics (LTD), RPTU University Kaiserslautern-Landau, Germany. E-mail: fabian.jirasek@rptu.de; Tel: +49 (0)631 205 4685
First published on 20th February 2026
Elucidating unknown mixtures is a critical challenge in chemistry and chemical engineering. Nuclear magnetic resonance (NMR) spectroscopy is a powerful analytical technique generally suited for this purpose. However, component-wise elucidation with NMR is tedious for complex mixtures, requires expert knowledge, and often yields ambiguous results. In contrast, identifying and quantifying structural groups in a mixture from NMR spectra is much more straightforward. In prior work, we have introduced ‘NMR fingerprinting’ for the automated elucidation of carbon-, hydrogen-, and oxygen-containing structural groups in unknown mixtures based on standard NMR experiments and a support vector classification (SVC) from machine learning (ML). In the present work, we present a substantially advanced NMR fingerprinting method that employs a deep set model (DSM), addressing major shortcomings of the SVC, and integrates additional information from 2D NMR experiments. The DSM was trained on experimental NMR spectra of pure components from open-source databases, augmented with synthetic spectral data, and comprises invariant and equivariant network structures to ensure predictions independent of the input order of the NMR signals. Tested on experimental pure-component test data, the DSM performs excellently, significantly outperforming our previous approaches. Furthermore, we demonstrate the applicability of the DSM to unknown mixtures by predicting the structural groups from NMR spectra of test mixtures measured using a benchtop NMR spectrometer. The predictions agree very well with the true mixture compositions, highlighting the method's potential for efficient automated mixture analysis and providing a reliable basis for downstream tasks, such as thermodynamic modeling using group-contribution methods.
NMR spectroscopy has also successfully been applied for the qualitative and quantitative analysis of mixtures.11–16 If the mixture components are known, a variety of automated quantification methods are available, even if signals in the NMR spectra overlap.17–23 However, elucidating unknown components in mixtures remains a significant challenge whose solution often depends on expert knowledge and becomes infeasible for complex mixtures. When a mixture contains unknown components with overlapping signals, even the first step, separating the relevant signals (the so-called deconvolution of the NMR spectrum), becomes inherently ambiguous, although some ML approaches for automated deconvolution of NMR spectra have been introduced.24–26
An alternative approach to elucidating components in complex mixtures that avoids the ambiguities of assigning the signals to the unknown components is dereplication,27–32 which identifies individual components by comparing the NMR spectrum of the mixtures to those of pure compounds retrieved from reference databases. However, the limited coverage of these databases confines dereplication to those molecules already represented within them.33 Moreover, methods relying solely on spectral comparisons remain sensitive to experimental conditions due to inherent biases in the reference data.4 Consequently, no broadly applicable solution currently exists for the automated elucidation of unknown components in mixtures by NMR spectroscopy.
While, for the reasons discussed above, elucidating components in unknown mixtures by NMR spectroscopy still poses a significant challenge, identifying the structural groups that constitute these components is considerably more straightforward. This group-based task, which we call ‘NMR fingerprinting’, is based on the fact that in an NMR spectrum, the chemical shift of an analyzed nucleus reflects its electronic environment, thereby revealing the structural group containing it. Traditionally, chemical shift tables that outline characteristic ranges in NMR spectra have been used to assign structural groups to NMR signals.34 However, overlapping characteristic ranges in chemical shift tables lead to ambiguity in assigning structural groups based solely on them. Also, the “static” nature of these tables leads to problems in practice.
From an ML perspective, assigning the correct structural group to signals in an NMR spectrum represents a classification problem. Therefore, we have recently developed a support vector classification (SVC) for the automated NMR fingerprinting of carbon-, hydrogen-, and oxygen-containing structural groups in unknown mixtures based on standard NMR experiments.35,36 Trained on thousands of pure-component spectra from the open-source databases Biological Magnetic Resonance Data Bank (BMRB)37 and NMRShiftDB,38 the SVC automatically assigns structural groups to signals in 13C NMR spectra, leveraging additional information from 1H and 13C DEPT (distortionless enhancement by polarization transfer) NMR spectroscopy. Utilizing SMARTS39 strings as a machine-readable representation of the respective structural groups during model training enables straightforward modification and extension of the considered structural group list. Applied to test mixtures, the predictions by the SVC achieved good agreement with the true mixture compositions, making it a reliable method for the structural group elucidation of unknown mixtures. The results of NMR fingerprinting can subsequently be used for the rational definition of pseudo-components40 and thermodynamic modeling using group-contribution methods,41–45 enabling the conceptual design of fluid separation processes.46,47
However, due to the characteristics of NMR data, SVC-based NMR fingerprinting has significant limitations in its application. Specifically, the signals in the NMR spectrum classified into structural groups can vary substantially in number, depending on the complexity and number of different components in the mixture of interest. This poses a challenge for developing SVCs for NMR fingerprinting, as an SVC requires inputs of constant length, which we have solved by binning the NMR spectra in multiple regions of defined chemical shift width. However, binning leads to the problem that signals with very similar chemical shifts, which in consequence are assigned to the same bin, cannot be distinguished, leading to classification errors. Furthermore, while there are natural choices for ordering the NMR signals in the input of the ML models, particularly with increasing chemical shift, it is not guaranteed that all data sets consistently comply with this ordering. Similarly, there is no inherent physical order of the different NMR spectra, e.g., 1H, 13C, 13C DEPT. Since SVCs are not permutation-invariant, i.e., their results depend on the input order, this poses another source of error for the NMR fingerprinting.
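The binning limitation described above can be illustrated with a minimal sketch. The bin width of 2 ppm, the 240 ppm spectral range, and the chemical shifts are hypothetical values chosen for illustration and are not taken from the SVC of the prior work.

```python
def bin_signals(shifts_ppm, bin_width_ppm=2.0, max_ppm=240.0):
    """Map a variable-length list of 13C shifts onto a fixed-length count vector."""
    n_bins = int(max_ppm / bin_width_ppm)
    vector = [0] * n_bins
    for shift in shifts_ppm:
        index = min(int(shift // bin_width_ppm), n_bins - 1)
        vector[index] += 1
    return vector

# Hypothetical signals: the first two are only 0.6 ppm apart.
shifts = [29.1, 29.7, 63.4]
features = bin_signals(shifts)
occupied = [i for i, count in enumerate(features) if count > 0]
```

The feature vector has a constant length regardless of the number of signals, but the two close signals end up in the same bin and can no longer receive different group labels.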
Within the realm of ML, these properties suggest that NMR signals and their corresponding nuclei information are best modeled as elements of sets rather than as fixed-length data instances.48
In this work, we overcome these limitations by developing a classification model based on a deep-set architecture.49 Deep set models (DSM) are a specialized neural network (NN) class within the field of geometric deep learning,50 specifically designed to preserve the symmetries inherent in set-structured data while introducing only minimal additional model complexity.48 Our DSM incorporates both invariant and equivariant network structures, ensuring that predictions are independent of input size and permutation, allowing the model to efficiently handle the unordered and variable-sized nature of NMR signals and nuclei. To fully capture the set-based characteristics of the NMR data, we extend our approach by incorporating information on the carbon–hydrogen correlations from 1H–13C HSQC (heteronuclear single quantum coherence) NMR spectroscopy as the first 2D NMR experiment in our mixture analysis. In doing so, the HSQC information is not directly used as additional input to the DSM but instead serves to construct the set structure of the model input by linking the information gathered from the 1H and 13C NMR experiments.
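How HSQC cross-signals can link 1H information to individual 13C signals may be sketched as follows. The function name, the matching tolerance of 0.5 ppm, and the example shifts are assumptions made for illustration; the procedure actually used to extract the spectral information is described in the SI.

```python
def build_signal_sets(c13_shifts, hsqc_peaks, tol_c13=0.5):
    """Attach to each 13C signal the 1H shifts of its HSQC cross-signals."""
    signal_sets = []
    for c_shift in c13_shifts:
        # A cross-signal belongs to a 13C signal if its 13C coordinate is close enough.
        protons = [h for (c, h) in hsqc_peaks if abs(c - c_shift) <= tol_c13]
        signal_sets.append({"c13": c_shift, "h1": sorted(protons)})
    return signal_sets

# Hypothetical values in ppm: (delta_13C, delta_1H) cross-signals.
c13 = [18.1, 57.8, 171.0]
hsqc = [(57.9, 3.65), (18.0, 1.18)]
sets = build_signal_sets(c13, hsqc)
```

Each 13C signal then carries its own, possibly empty, set of attached 1H shifts; the signal at 171.0 ppm in this toy example has no cross-signal, as expected for a carbon without attached hydrogen.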
Additionally, we address the challenge of limited and incomplete training data in the used open-source NMR databases by augmenting incomplete NMR spectra with information derived from magnetically identical nuclei and predicted spectra using the open-source tools RDKit51 and NMRium.52 In this way, we have obtained complete spectral information for 2767 pure components, which we have used to train the model and rigorously test its predictive performance exclusively on unseen experimental NMR spectra. Finally, we have applied the model to test mixtures whose spectra were measured using a 60 MHz benchtop NMR device, demonstrating the approach in practical low-field NMR applications.
Fig. 1 Overview of the NMR fingerprinting method based on a deep set model (DSM)49 architecture for predicting the structural groups in unknown mixtures from NMR spectra and assigning them to signals in the 13C NMR spectrum. The DSM was trained on pure-component NMR spectra from the open-source databases BMRB37 and NMRShiftDB,38 with missing information augmented from magnetically identical nuclei and predicted spectra using the open-source tools RDKit51 and NMRium.52 The architecture of the DSM and its input, which is obtained from NMR experiments, are described in Section Deep-set architecture.
Currently, the NMR fingerprinting method distinguishes 13 structural main groups. The method can also distinguish between different substitution degrees, so that, in total, 30 different subgroups can be identified, which are the same as in our previous works36,44 and summarized in Table 1. The quantification of the identified structural groups is finally achieved through signal integration in the 13C NMR spectra.36,44
| Label | Structural group | SMARTS representation |
|---|---|---|
| CH3 | Methyl | [CX4;D1;!$(C[!#6])] |
| CHx | Alkyl; x ∈ {0, 1, 2} | [CX4;D2,D3,D4;!$(C[!#6]);!R] |
| CHcyx | Cyclic alkyl; x ∈ {0, 1, 2} | [CX4;!$(C[!#6]);R] |
| CHxOH | Alcohol; x ∈ {0, 1, 2, 3} | [CX4;!$(C[OX2H0][CX3H1,CX3](=O))][OX2H] |
| CHxO | Ether; x ∈ {0, 1, 2, 3} | [CX4;$(C[OD2]);!$(C[OX2H0][CX3H1,CX3](=O));!$(C[OX2H])] |
| CHx= | Aliphatic double bond; x ∈ {0, 1, 2} | [CX3;!$(C~[!#6])] |
| CHxar | Aromatic carbon; x ∈ {0, 1} | [cX3;!$(c~[!#6])] |
| RO-CHxar | Aromatic carbon with oxygen substituent; x ∈ {0, 1} | [cX3;!$(c=O);$(c~[#8X2])] |
| COOR | Ester/lactone/anhydride carbonyl | [CX3H1,#6X3](=O)[#8X2H0] |
| ROOCHx | Alkyl next to ester/lactone oxygen; x ∈ {0, 1, 2, 3} | [CX4;$(C[OX2H0;$(O(C(=O)))])] |
| COOH | Carboxylic acid | [CX3](=O)[OX2H1] |
| COald | Aldehyde | [CX3H1;!$(C[!#6])](=O) |
| COket | Ketone | [#6X3H0;!$([#6][!#6])](=O) |
The DSM developed in this work combines invariant and equivariant network structures. In the first step, the input information x13Ci for the 13C nuclei and x1Hi for the 1H nuclei is independently processed by dedicated embedding networks ϕ13C and ϕ1H, respectively. Unlike classical neural networks, these embeddings are computed in parallel rather than jointly,49 ensuring that the set-based nature of the nuclei data is respected. Subsequently, the nuclei embeddings are aggregated using the summation as a permutation-invariant function α, leading to the intermediate prediction α(xi) for each structural group based only on NMR-spectroscopic information on the respective nuclei. By employing parallel embeddings and the permutation-invariant function α, the DSM ensures an invariant prediction independent of the input order of the nuclei information.
In the second step, the intermediate predictions α(xi) are refined within the context of all structural groups in the studied sample to account for mutual influences on their respective NMR signals. Therefore, the intermediate predictions α(xi) for each signal in the 13C NMR spectrum are processed in parallel by the main embedding network ϕ, directing them to the equivariant layer σ(α). The equivariant layer σ(α) is a specialized NN layer that combines a standard per-element feed-forward layer σ with summation-based aggregation α,54 allowing the interaction of the embedded structural group predictions α(xi) while maintaining the relation between input and output.48 This summation-based aggregation captures inter-signal relationships in the context of all signals, which is the simplest form of contextualization and does not explicitly encode pairwise interactions between individual signals, as employed, for example, in self-attention-based architectures.48 Finally, through parallel processing by the prediction network ρ, which uses the additional input regarding the presence of labile protons L, the prediction ρ(xi) for each 13C NMR signal is obtained, independent of the input order of the signals.
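The two symmetry properties used in the architecture, invariance of the sum aggregation and equivariance of the contextualizing layer, can be checked with a small standard-library sketch. The toy embedding `phi`, the scalar coefficients `lam` and `gamma`, the weights, and the inputs are all hypothetical stand-ins; they do not reproduce the trained networks ϕ13C, ϕ1H, ϕ, or ρ.

```python
import math

def phi(x, w):
    """Toy per-element embedding: one tanh unit per weight row (stand-in for an MLP)."""
    return [math.tanh(sum(wi * xi for wi, xi in zip(row, x))) for row in w]

def sum_pool(vectors):
    """Permutation-invariant aggregation: element-wise sum over a set of vectors."""
    return [sum(column) for column in zip(*vectors)]

def equivariant_layer(elements, lam=1.0, gamma=0.5):
    """Per-element transform plus a shared context term from the sum over all elements."""
    context = sum_pool(elements)
    return [[math.tanh(lam * e + gamma * c) for e, c in zip(element, context)]
            for element in elements]

# Hypothetical weights and inputs, for illustration only.
W = [[0.3, -0.2], [0.1, 0.4]]
nuclei = [[1.0, 0.5], [0.2, 0.9], [0.7, 0.1]]
pooled = sum_pool([phi(x, W) for x in nuclei])
pooled_shuffled = sum_pool([phi(x, W) for x in reversed(nuclei)])

signals = [[0.1, 0.2], [0.4, -0.3], [0.0, 0.5]]
out = equivariant_layer(signals)
out_shuffled = equivariant_layer(signals[::-1])
```

Up to floating-point rounding, `pooled` equals `pooled_shuffled` (invariance), and `out_shuffled` is `out` in reversed order (equivariance): reordering the input signals reorders the outputs in the same way without changing any individual prediction.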
The DSM does not provide absolute predictions for structural groups; instead, it assigns a probability to each group in Table 1 for every 13C NMR signal, with many groups receiving a probability of zero. This probability is interpreted as the model's confidence in the corresponding group assignment. The structural group with the highest probability (i.e., highest model confidence) is selected as the absolute prediction.
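The step from group probabilities to an absolute prediction can be sketched as below. The sketch assumes that the probabilities are obtained from raw network outputs by a softmax normalization, which is common with a cross-entropy loss but is not stated explicitly here; the logits are hypothetical and only three of the 13 labels are used for brevity.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to one."""
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def assign_group(logits, labels):
    """Return the most probable label and its probability (model confidence)."""
    probs = softmax(logits)
    best = max(range(len(labels)), key=probs.__getitem__)
    return labels[best], probs[best]

labels = ["CH3", "CHxOH", "COket"]
label, confidence = assign_group([2.0, 0.5, -1.0], labels)
```

Keeping the full probability vector alongside the selected label makes it possible to flag signals whose top two groups have similar confidence.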
However, some of the NMR spectra from these databases are incomplete, lacking assignments of chemical shifts to the respective nuclei. Upon closer examination, these omissions generally fall into two categories. Sometimes, only one of multiple magnetically equivalent nuclei has an assigned chemical shift. This partial assignment is likely attributable to non-standardized data structures within the databases, which manage redundant information inconsistently. In other cases, none of the magnetically equivalent nuclei have assigned chemical shifts, suggesting that the missing assignment is probably the result of human error during NMR spectra recording or evaluation. Since the DSM classifies 13C signals in the context of all structural groups in the sample rather than individually, it is essential to provide complete spectral information of the components as input. To address the issue of missing spectral data, we implemented a two-step augmentation process:
(1) Missing chemical shifts were supplemented by automatically identifying magnetically equivalent nuclei within each component using RDKit and adopting the spectral information of those equivalent nuclei for which it was available.
(2) Any remaining gaps in the spectra were filled using data from synthetic spectra predicted for each pure component with NMRium. In this step, no completely synthetic spectra were used; only existing but incomplete experimental spectra were augmented.
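The two augmentation steps can be sketched as a simple fill-in procedure. The data layout (atom index to shift), the function name, and the example values are hypothetical; the equivalence classes and predicted shifts are assumed to be supplied by external tools and are hard-coded here.

```python
def augment_shifts(assigned, equivalence_classes, predicted):
    """Fill missing shifts: first from equivalent nuclei, then from predicted spectra."""
    shifts = dict(assigned)
    source = {atom: "experimental" for atom in assigned}
    # Step 1: copy an experimental shift to unassigned magnetically equivalent nuclei.
    for atoms in equivalence_classes:
        known = [shifts[a] for a in atoms if a in shifts]
        if known:
            for a in atoms:
                if a not in shifts:
                    shifts[a] = known[0]
                    source[a] = "equivalent"
    # Step 2: fall back to predicted shifts for whatever is still missing.
    for atom, value in predicted.items():
        if atom not in shifts:
            shifts[atom] = value
            source[atom] = "synthetic"
    return shifts, source

# Hypothetical atom indices and shifts in ppm.
assigned = {0: 14.1}
classes = [[0, 3], [1, 2]]
predicted = {0: 13.8, 1: 22.9, 2: 22.9, 3: 13.8}
shifts, source = augment_shifts(assigned, classes, predicted)
```

Experimental values always take precedence, and tracking the `source` of each shift makes it straightforward to exclude components containing synthetic data from a test set.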
Through these augmentation steps, the number of pure components with complete spectral information increased from 839 to 2767, substantially extending the data set available for model training. Additional details on the data augmentation from synthetic spectra are provided in the SI.
In Fig. 2, the final augmented data set covering 2767 pure components and consisting of a total of 40 838 structural groups is visualized considering the information from the 13C NMR spectrum. The analogous presentation of the data set for the respective 1H NMR-spectroscopic information is provided in Fig. S.1 in the SI.
Fig. 2a denotes the number of each of the 13 distinguished structural groups Ng (cf. Table 1) in the augmented pure-component data set, broken down to segments in the 13C NMR spectrum where their respective signals occur. It is important to note that the segmentation of the spectrum used in Fig. 2 is solely for visualization purposes but not used in the DSM, which is in contrast to our previous SVC-based approach, where spectral segmentation was required.35,36 The structural groups exhibit significant overlap in their chemical shift distributions, i.e., they are not confined to specific regions but span a wide range of the 13C NMR spectrum.
Fig. 2b gives an overview of the proportion Psyn = Nsyng/Ng of the number of structural groups Nsyng that incorporate synthetic data for either 13C, 1H, or both. Separate visualizations showing the distribution of Psyn for structural groups containing synthetic data exclusively for 13C or 1H are provided in Fig. S.2 in the SI. Overall, structural groups containing synthetic spectral data account for 11.80% of the entire data set, with 0.84% containing synthetic data for 13C and 11.14% for 1H. Most structural groups with synthetic data are concentrated in the regions below 80 ppm and between 110 and 140 ppm in the 13C NMR spectrum. This distribution likely results from the high density of various structural groups, i.e., aliphatic, cyclic, and double-bonded carbon groups, in these regions for organic molecules, which complicates signal differentiation and accurate assignment of chemical shifts δ1H in the crowded 1H NMR spectrum (cf. SI for details). Furthermore, augmentations with synthetic data are necessary for ketone carbonyls with signals exceeding 220 ppm, as experimental spectra do not extend to these elevated chemical shifts δ13C by default.
In the SI, we provide a detailed analysis of the influence of synthetic NMR data on the training and predictive performance of the DSM, thereby demonstrating the robustness of the model with respect to the composition of the training data.
The generated data set was randomly split into a training, a validation, and a test set, comprising 80%, 10%, and 10% of the pure components from our data set, respectively. The test set was constrained to include only pure components with entirely experimental spectral data, i.e., not including synthetic data, to demonstrate the model's performance in the most realistic scenario. Furthermore, the training set was augmented by synthetic binary and ternary mixture data obtained by simply “mixing” the spectra of the respective pure components in the training set. As a result, each pure component present in the training set appeared three times in the final training set: once with its pure component spectra, once with the spectra of a binary mixture with a randomly chosen other component from the training set, and once with the spectra of a ternary mixture with two randomly chosen other components. In Table S.1 in the SI, we provide an analysis demonstrating the robustness of the DSM to different random splits of the data set.
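A minimal sketch of the splitting and mixing scheme is given below. It assumes that the test set is drawn first from the fully experimental components and that "mixing" means taking the union of the labeled signals of the involved components; both are interpretations of the description above, and the toy data set is hypothetical.

```python
import random

def split_components(components, seed=0):
    """80/10/10 split; the test set draws only from fully experimental components."""
    rng = random.Random(seed)
    n_test = round(0.1 * len(components))
    n_val = round(0.1 * len(components))
    experimental = [c for c in components if not c["has_synthetic"]]
    test = rng.sample(experimental, n_test)
    test_ids = {c["id"] for c in test}
    rest = [c for c in components if c["id"] not in test_ids]
    rng.shuffle(rest)
    return rest[n_val:], rest[:n_val], test

def mix(components):
    """'Mix' spectra by taking the union of the components' labeled signals."""
    return [signal for c in components for signal in c["signals"]]

def augment_with_mixtures(train, seed=0):
    """Each training component appears pure, in one binary, and in one ternary mixture."""
    rng = random.Random(seed)
    samples = []
    for c in train:
        others = [o for o in train if o is not c]
        samples.append(mix([c]))
        samples.append(mix([c] + rng.sample(others, 1)))
        samples.append(mix([c] + rng.sample(others, 2)))
    return samples

# Hypothetical toy data set: 20 components with one labeled 13C signal each.
data = [{"id": i, "has_synthetic": i % 4 == 0, "signals": [(10.0 + i, "CH3")]}
        for i in range(20)]
train, val, test = split_components(data)
samples = augment_with_mixtures(train)
```

Because the mixtures are built only from training components, no information from validation or test components leaks into the augmented training set.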
All models and scripts for training and evaluation were implemented in Python 3.6.8 using PyTorch 2.2.1.55 Training was performed on an A40 GPU using the CrossEntropyLoss function with default PyTorch settings. The Adam optimizer was employed for weight optimization, and a learning rate scheduler with a decay factor of 0.1 and a patience of 20 epochs based on validation loss was utilized. Training was terminated early if the validation loss did not improve for 30 consecutive epochs, and the model achieving the lowest validation loss was selected. Typical training times ranged between one and two hours, while typical inference times were between four and six milliseconds.
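The interplay of plateau-based learning-rate decay and early stopping can be made concrete by replaying a validation-loss curve through the two counters. This is a simplified standard-library sketch of the logic, not the training script of this work: it ignores details such as improvement thresholds, and the loss curve is hypothetical.

```python
def run_schedule(val_losses, lr=1e-4, factor=0.1, lr_patience=20, stop_patience=30):
    """Replay validation losses through plateau-based LR decay and early stopping."""
    best_loss, best_epoch = float("inf"), -1
    epochs_since_best = 0    # drives early stopping
    epochs_since_decay = 0   # drives learning-rate decay
    epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_since_best = 0
            epochs_since_decay = 0
        else:
            epochs_since_best += 1
            epochs_since_decay += 1
        if epochs_since_decay > lr_patience:
            lr *= factor              # decay once the plateau exceeds the patience
            epochs_since_decay = 0
        if epochs_since_best >= stop_patience:
            break                     # stop; the best epoch's model is kept
    return best_epoch, best_loss, lr, epoch

# Hypothetical curve: three improving epochs followed by a long plateau.
losses = [1.0, 0.8, 0.6] + [0.7] * 40
best_epoch, best_loss, final_lr, stopped_at = run_schedule(losses)
```

With a decay patience of 20 and a stopping patience of 30, a plateau leads to exactly one learning-rate reduction before training terminates, so the model gets one chance to improve at the lower rate.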
Hyperparameter optimization, including the weight decay λ of the Adam optimizer, the initial learning rate, the batch size, and the number of layers and nodes in each network, was performed using a grid search based on validation loss. In the SI, we discuss the sensitivity of the model to the varied hyperparameters and present the validation loss results. The following hyperparameters were selected as final settings: a weight decay of λ = 5 × 10−4, an initial learning rate of 1 × 10−4, and a batch size of one. The network architectures were defined with three layers containing eight nodes each for ϕ13C and ϕ1H and two layers containing 256 nodes each for ϕ and ρ. In all networks, the Sigmoid Linear Unit (SiLU) activation function with default PyTorch settings was applied. In all cases, the number of nodes for the equivariant layer σ(α) was chosen to match those of the networks ϕ and ρ. The input dimensions of ϕ13C and ϕ1H were set to five according to the input data dimension, while the network ρ included an additional node, to account for the boolean variable L indicating the presence of labile protons, and had an output dimension of 13, corresponding to the number of distinct structural groups.
The predictive performance of the DSM on unseen test data was evaluated using the F1 score F1,g for each structural group g:
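$$F_{1,g} = \frac{2\,\mathrm{TP}_g}{2\,\mathrm{TP}_g + \mathrm{FP}_g + \mathrm{FN}_g} \qquad (1)$$

where TP_g, FP_g, and FN_g denote the numbers of true positive, false positive, and false negative predictions for structural group g, respectively.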
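A per-group F1 score can be computed from true and predicted labels as in the following sketch, which uses the standard definition F1 = 2TP/(2TP + FP + FN). The label lists are hypothetical.

```python
def f1_per_group(y_true, y_pred):
    """Compute the F1 score of every group from paired true/predicted labels."""
    scores = {}
    for g in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == g and p == g for t, p in zip(y_true, y_pred))
        fp = sum(t != g and p == g for t, p in zip(y_true, y_pred))
        fn = sum(t == g and p != g for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[g] = 2 * tp / denom if denom else 0.0
    return scores

# Hypothetical labels for five 13C signals.
y_true = ["CH3", "CH3", "CHxOH", "COket", "CHxOH"]
y_pred = ["CH3", "CHxOH", "CHxOH", "COket", "CHxOH"]
scores = f1_per_group(y_true, y_pred)
```

Reporting the score per group rather than as a single average exposes weaknesses for rare groups that an overall accuracy would hide.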
For comparison, we have also retrained and evaluated the SVC from our previous work36 using this work's data set and the same partitioning of the data into training, validation, and test sets, as employed for the DSM. Further information on the training and evaluation of the SVC is provided in the SI.
Furthermore, a final version of the DSM was trained by randomly using the data for 90% of the pure components from our data set for training and the remaining 10% for validation. Unlike the primary evaluation approach described above, this model was not evaluated on a separate pure-component test set. Instead, it was directly applied to experimentally studied test mixtures to demonstrate its practical applicability in predicting the structural groups in real mixtures, as detailed below.
| Mixture | Component i | xi / mol mol−1 |
|---|---|---|
| I | Water | 0.9266 |
| | Acetone | 0.0244 |
| | Tartaric acid | 0.0246 |
| | 1,4-Butanediol | 0.0244 |
| II | Diethyl ether | 0.7005 |
| | Butanal | 0.1494 |
| | Butyl acetate | 0.1501 |
| III | Cyclohexane | 0.5001 |
| | Hexene | 0.3000 |
| | Diglyme | 0.1999 |
| IV | Anisole | 0.7998 |
| | 1-Octanol | 0.1198 |
| | 3-Methylbutan-2-one | 0.0804 |
1H NMR, 1H–13C HSQC NMR, 13C NMR, and 13C DEPT NMR spectra with pulse angles of 90° and 135° were recorded for each test mixture using a 60 MHz benchtop NMR spectrometer (Spinsolve 60 Ultra, Magritek). The settings of the NMR experiments, spectral processing procedures, and extraction of spectral information are reported in the SI.
In 1H–13C HSQC NMR spectra of mixtures, significant signal overlap is a common challenge, obscuring the cross-signals between 1H and 13C nuclei at low concentrations, especially when using benchtop NMR spectrometers with limited sensitivity and resolution. In our experiments, we encountered this exact problem: despite clear evidence of 1H bonded to 13C nuclei as determined by the substitution degree via 13C DEPT NMR, in some cases, the absence of observable cross-signals in the 1H–13C HSQC spectra prevented the determination of the chemical shifts δ1Hi necessary for applying the DSM. To address this challenge, we have developed a model for the relationship between the chemical shifts δ13Ci of 13C nuclei and the chemical shifts δ1Hi of their connected 1H nuclei using linear regression, which we have fitted to our comprehensive data set for pure components. This regression model enables the determination of the most likely value of δ1Hi of the connected 1H nuclei for a given δ13Ci. In cases where no cross-signal for a 13C signal was identified in the 1H–13C HSQC spectrum of the studied mixture but the 13C DEPT results indicated that there should be one, we have supplemented the missing experimental spectral information by estimating δ1Hi from the respective δ13Ci using the regression model. Further details on the regression model are provided in the SI.
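The fallback for missing cross-signals can be sketched as an ordinary least-squares fit of δ1H against δ13C. The sketch assumes a single-variable linear model; the actual form of the regression is described in the SI, and the four data points below are hypothetical rather than taken from the data set of this work.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical (delta_13C, delta_1H) pairs in ppm.
c13 = [14.0, 23.0, 62.0, 128.0]
h1 = [0.9, 1.3, 3.6, 7.3]
slope, intercept = fit_line(c13, h1)

def estimate_h1(delta_c13):
    """Estimate the 1H shift of hydrogens attached to a carbon at delta_c13."""
    return slope * delta_c13 + intercept

estimate = estimate_h1(70.0)
```

Such an estimate is only needed when DEPT indicates attached hydrogens but no HSQC cross-signal is observed; measured values are used whenever they are available.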
The processed spectral information was then fed as input to the DSM to identify the structural groups in the test mixtures. The task here is to assign a group from Table 1 to each signal in the 13C NMR spectrum. The resolution, even of the benchtop NMR spectrometer, is generally high enough that signals of two different groups can be distinguished. Although coinciding signals cannot be strictly excluded, we do not consider this case here. While the identification and assignment of structural groups is fully automated, the subsequent quantification step is currently performed manually. Quantitative information on the identified structural groups was finally obtained by manually integrating their signals in the 13C NMR spectrum and calculating the group mole fractions xg from the signal areas Ag, see eqn (2):
$$x_g = \frac{A_g}{\sum_{g'} A_{g'}} \qquad (2)$$
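The normalization of signal areas to group mole fractions amounts to dividing each area by the sum over all groups, as in this short sketch. The areas are hypothetical, and the area of a group with several signals is assumed to have been summed beforehand.

```python
def group_mole_fractions(areas):
    """Normalize integrated 13C signal areas per group to group mole fractions."""
    total = sum(areas.values())
    return {group: area / total for group, area in areas.items()}

# Hypothetical integrated areas in arbitrary units.
areas = {"CH3": 2.0, "CHxO": 2.0, "CHxOH": 1.0}
x = group_mole_fractions(areas)
```

Since only ratios of areas enter, the result does not depend on the absolute intensity scale of the spectrum.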
For the test mixtures studied in this work, no overlapping 13C NMR signals were observed that would affect peak integration and, consequently, the predicted mole fractions xg. In the event of signal overlap in the analysis of a mixture, integration could be performed by integrating the observable peak envelopes according to their shapes. In cases of more pronounced overlap, recently proposed ML-based deconvolution methods24–26 could be employed to facilitate signal separation prior to integration. If individual peaks cannot be reliably distinguished even after such treatment, the overlapping signals could be treated as a single contribution and assigned to all corresponding predicted structural groups.
Furthermore, we have incorporated 1H–13C HSQC NMR data as the first 2D NMR information into the NMR fingerprinting approach. This integration opens the possibility of mapping the structural groups identified in the 13C spectrum to the corresponding signals in the 1H spectrum, improving the deconvolution and interpretability of complex 1H spectra of mixtures. However, achieving such mapping remains challenging due to extensive signal overlap and low resolution of 1H spectra measured using benchtop NMR spectrometers. Therefore, realizing the mapping to 1H spectra when applying the NMR fingerprinting method to high-field NMR spectrometers could be a goal of future work.
In scenarios where HSQC acquisition is impractical, e.g. when minimal measurement time is required, the DSM-based approach cannot be applied, as the HSQC information is essential to establish the set-based structure of the NMR input. In such cases, our previous SVC-based NMR fingerprinting approach,36 which operates solely on 1D NMR data, can be used.
Evaluation on experimental test data for unseen pure components demonstrates the excellent performance of the DSM in predicting the structural groups of pure components, significantly exceeding our previous NMR fingerprinting approach, which was based on an SVC. To demonstrate its applicability to unknown mixtures, we have applied the DSM to NMR spectra of test mixtures measured using a simple benchtop NMR device. The results show remarkable agreement with the true mixture compositions, demonstrating the potential of the DSM-based NMR fingerprinting method for efficient automated mixture analysis, including in low-field NMR settings, which provides the basis for thermodynamic modeling of unknown mixtures via group-contribution methods.45,47
Despite the high-quality predictions, the current method still has limitations. Most importantly, it is restricted to structural groups consisting of only carbon, oxygen, and hydrogen. Expanding the range of structural groups poses challenges due to the increasing overlap in their spectral ranges, which can, in principle, be mitigated by combining multiple spectroscopic techniques, as demonstrated for Fourier-transform infrared (FT-IR) spectroscopy.56,57 However, the flexible architecture of the DSM allows for the incorporation of additional NMR data, e.g., from heteronuclear multiple bond correlation (HMBC) NMR, which can enhance distinguishability and enable the inclusion of further structural groups, such as those containing nitrogen, without requiring spectroscopic methods beyond NMR in future work. Furthermore, the successful integration of synthetic data in this work suggests that augmenting the training data with synthetic NMR spectra can further improve the model's predictive capabilities for new structural groups with limited available data.
Supplementary information (SI): generation of synthetic NMR data; data distribution in the 1H and 13C spectrum; sensitivity studies; comparison of DSM with support vector classification; experimental methods; augmented data set, training scripts, and final model. See DOI: https://doi.org/10.1039/d5dd00490j.
This journal is © The Royal Society of Chemistry 2026