Open Access Article
Michael Nicolaou
,
Hans M. Senn
,
Emma Gibson
*,
Mario González-Jiménez
and
Laia Vilà-Nadal
*
The School of Chemistry University of Glasgow Joseph Black Building, University Avenue, Glasgow G12 8QQ, UK. E-mail: Emma.Gibson@glasgow.ac.uk; laia.vila-nadal@glasgow.ac.uk
First published on 20th January 2026
Volatile Organic Compounds (VOCs) are abundant in nature and play vital roles in industries such as food, fragrance, and pharmaceuticals. Aromatic VOCs like vanillin are especially valuable, driving research into sustainable chemical processes, including the conversion of biomass into high-value chemicals. Understanding the molecular structure and vibrational behavior of these compounds is essential for designing and optimising such processes. In this work, we explore how computational modelling can be used to predict and interpret vibrational spectra of VOCs. We also introduce a statistical approach using Bayesian inference to improve how theoretical predictions are matched to experimental observations. This combined strategy enhances the reliability and clarity of spectral interpretation, offering a more consistent framework for studying complex organic molecules.
In parallel, the structural complexity of lignin—a heterogeneous biopolymer found in plant cell walls—continues to pose a major challenge in sustainable chemical conversion. Its irregular monomer composition and variable linkages, which differ across plant species and extraction methods, complicate both structural identification and the assessment of chemical treatments.2,10 As a result, analytical tools such as GC-MS,6,7 NMR,10,15,16 and IR/Raman spectroscopy17,18 are essential for characterising lignin and its depolymerisation products.
Computational Density Functional Theory (DFT) methods provide an invaluable tool for innumerable applications, such as mechanistic studies of catalytical environments,19–21 aiding in the understanding of catalytical processes and identification of catalytical products. DFT modelling is, however, fundamentally approximate, and needs experimental benchmarking when choosing a system to model and caution when interpreting the findings. This is further compounded by a notoriously expansive catalogue of available methodology parameters,22,23 (choice of functional, basis set or other factors such as dispersion) and even strategies enhancing DFT with other approaches, such as QM/MM methods24 and machine learning.25 This can be especially daunting when studying convoluted systems, such as vibrational fingerprints, where intensities are approximate and frequencies are off-set.26 These challenges are especially evident in vibrational spectroscopy, where the harmonic approximation27 is commonly used to reduce computational cost. While practical, this simplification introduces frequency errors, as it neglects anharmonicity. Calculation of vibrational frequencies also ignores phenomena like combination bands and Fermi resonance. This becomes exacerbated when studying systems in the solid or liquid state, where environmental effects are involved, such as hydrogen bonding and π-stacking of aromatic rings. While solid state calculations are possible using periodic DFT, it escalates the complexity and thus computational cost of a calculation, and is limited by the requirement of having prior knowledge of the system's structure. Static scaling factors are typically applied to correct frequency offsets, but interpretation still requires manual assignment—often a tedious and uncertain process. Recently, the integration of statistical models and machine learning has propelled the rise of digital chemistry—a data-driven paradigm that enhances the design, analysis, and interpretation of chemical systems.28,29 In this context, IR spectral prediction is becoming increasingly automated, accurate, and scalable, offering powerful tools for materials discovery, reaction monitoring, and structural elucidation.30 To overcome these issues, we introduce a Bayesian framework that enhances vibrational mode assignment from single molecule DFT calculations to non-gas phase IR spectra by statistically linking theoretical frequencies with experimental bands. Bayes' theorem allows us to quantify the likelihood of each potential match, offering a structured and reproducible alternative to subjective interpretation.
In this work, we have combined experimental results with our calculated vibrational data to benchmark a range of DFT methods for modelling aromatic VOCs and identify the most time-efficient and accurate approach. Following the protocol established by Alecu et al.,26 we report universal scaling factors for these methods. Furthermore, we create a framework and demonstrate how Bayesian inference can be used as a tool to enhance and strengthen spectral interpretation. The following ten aromatic compounds were used to test our proposed method: vanillin, 4-hydroxybenzaldehyde, syringaldehyde, cinnamaldehyde, cuminaldehyde, eugenol, estragole, coumarin, dihydrocoumarin, and umbelliferone, resulting in facilitated spectral interpretation and vibrational mode identification (Fig. 1).
Molecular geometry optimisation and vibrational frequency calculations of vanillin molecules were performed at different exchange–correlation (XC) functional levels of theory (S1) across the GGA (Generalised-Gradient Approximation), mGGA (meta-GGA), hybrid (HF exchange energy contribution), meta-hybrid, range-separated hybrid and double-hybrid (HF and post-HF XC contribution) “rungs” (BP86,33,34 PBE,35 OPBE,35–37 M06-L38, B3LYP,33,38,39 ωB97X-D,40,41 PBE0,35,42 M06,43 M06-2X43 and PBE0-DH44), as well as using ab initio post-HF second-order Møller–Plesset perturbation theory (MP2),45–49 using the 6-311++G(2d,2p) triple-ζ split-valence polarised Pople Gaussian-type orbital (GTO) basis set for H, C and O50,51 with diffuse orbitals for all atoms.
To determine the point of basis set convergence and study the effects of basis set parameters, such as polarisation and diffuse functions, the same calculations were also performed at the M06-2X XC functional level of theory using different basis sets (3-21G, 6-31G, 6-31G(d,p), 6-311G, 6-311G(d,p), 6-311+G, cc-pVDZ, cc-pVTZ, cc-pVQZ, aug-cc-pVTZ, aug-cc-pVQZ,52 def2-TZVP, def2-QZVP53), as M06-2X has been extensively used in studies on similar molecules and suggested to be suitable for small-medium molecules54 and main-group thermochemistry.43,55
All calculations were performed using spin-restricted (“closed shell”) orbitals and structures pre-optimised using Universal Force-Field (UFF)56 Molecular Mechanics (MM).
Vibrational frequency calculations (IR and Raman), were performed using the harmonic approximation model (S3).27
DFT-calculated vibrational frequencies were convolved and plotted using Python.57–60 Gaussian broadening was applied (S8) around the calculated frequencies with a standard deviation of 8, representing a full width at half maximum (FWHM) of approximately 19 cm−1. All spectra were normalised when plotted.
Experimental IR spectra were measured using a dry-air purged Bruker Vertex 70 spectrometer equipped with a Globar lamp, a Deuterated L-alanine doped Tri-Glycine Sulphate (DLaTGS) detector, a potassium bromide (KBr) beamsplitter, and a diamond ATR accessory (Bruker Platinum ATR Unit A225), at a resolution of 1 cm−1, averaging over 16 scans at a range of 350–4000 cm−1. The gas phase IR spectrum of vanillin was taken from the NIST Chemistry WebBook.61
Fig. 2 shows the geometry of vanillin as found in the crystallographic file. The average bond length for each of the 11 bonds was taken as the average bond length of 4 vanillin molecules present in the periodic crystal unit.
![]() | ||
| Fig. 2 Structure of vanillin in single crystal (CCDC62 CID 1306628 (ref. 63)). The coloured boxes highlight characteristic functional groups of vanillin. | ||
To gauge the efficiency of each method, the Mean Absolute Deviation (MAD) (S4) between calculated and reference bond lengths was compared with the job CPU time for the geometry optimisation of a pre-optimised input structure.
The calculation of vibrational modes and frequencies employs harmonic approximation,27 in which normal modes and frequencies are calculated based on the second derivatives of the energy with respect to atomic displacements. This is done to avoid the calculation of higher-order terms and simplify the calculation. Additionally, DFT calculations were performed on a single molecule, which is different from the experimental IR spectrum of a crystal or liquid.
As a result, computed vibrational frequencies differ from those observed experimentally via IR spectroscopy. It is thus important to gauge the expected magnitude of this deviation when attempting to make such a comparison. For errors due to anharmonicity, a scaling factor, which depends on the method used, is conventionally applied.
In their study, Alecu et al.26 showed that fundamental frequency scaling factors, which can be universally applied to frequency calculations to reduce the error by that method, can be obtained from the comparison of calculated frequency values to the experimentally observed frequencies of a suggested database containing 38 modes of vibration across 15 molecules, referred to in their study as “F38/10”.26 The scaling factor (λF) can then be calculated as:
![]() | (1) |
![]() | (2) |
The scaling factor for M06-2X/def2-TZVP was calculated for validation. A scaling factor of 0.945 was obtained which is in close agreement to the 0.946 reported by the authors, exhibiting a scaled RMS deviation of the dataset of 48 cm−1, showing a clear improvement from the 147 cm−1 unscaled deviation. A clear improvement in the accuracy of the methods when calculating the frequencies of the F38/10 database (S13) can be seen with the application of the reported scaling factors.
The RMS was optimised with respect to λF (S13) and the scaled RMS deviations present similar values to those of scaling factors reported, with the significant exception of cases using basis sets lacking polarisation (3-21G, 6-31G, 6-311G, 6-311++G), which still exhibit large RMS deviations after being scaled. We selected 17 bands in the solid state IR spectrum of vanillin for comparison with theoretical values. These can be seen as the labelled bands in Table 2.
Frequency calculations on the optimised vanillin structures were performed using the selected methods. The frequencies for the fundamental modes of vibration were scaled using the scaling factors obtained from literature where available (M06-2X/: 6-31G(d,p), 6-311G(d,p), aug-cc-pVTZ, def2-TZVP and def2-QZVP)26,64,65 and using the scaling factors calculated in the present work for the other methods (Table 1). The internal coordinates and contributions of involved bonds, angles and dihedral angles for each normal mode were also recorded.
| Methodology | λF | RMS deviationa (cm−1) | |
|---|---|---|---|
| Scaled | Unscaled | ||
| a Calculated from the F38/10 database26 (S13).b Reproduced and compared with literature. | |||
| Functionals using 6-311++G(2d,2p) | |||
| MP2-FC | 0.955 | 77 | 136 |
| BP86 | 0.989 | 34 | 43 |
| PBE | 0.986 | 33 | 47 |
| OPBE | 0.969 | 42 | 88 |
| M06-L | 0.959 | 32 | 107 |
| B3LYP | 0.963 | 28 | 103 |
| ωB97X-D | 0.948 | 41 | 137 |
| PBE0 | 0.950 | 34 | 130 |
| M06 | 0.956 | 45 | 118 |
| M06-2X | 0.944 | 44 | 148 |
| PBE0-DH | 0.935 | 42 | 170 |
![]() |
|||
| Basis sets used with M06-2X | |||
| 3-21G | 0.969 | 139 | 158 |
| 6-31G | 0.954 | 136 | 178 |
| 6-311G | 0.960 | 109 | 147 |
| 6-311++G | 0.962 | 126 | 158 |
| cc-pVDZ | 0.947 | 54 | 144 |
| cc-pVTZ | 0.945 | 47 | 147 |
| cc-pVQZ | 0.944 | 49 | 149 |
| aug-cc-pVQZ | 0.945 | 49 | 148 |
| def2-TZVPb | 0.945 | 48 | 147 |
The calculated frequencies, after applying the optimised scaling factor for the respective method, were matched against the experimental bands and the difference in wavenumbers was used to calculate the Mean Signed Deviation (MSD), Mean Absolute Deviation (MAD) and Standard Deviation (STD) (S12).
Fig. 3 shows the plot of frequency MAD (cm−1) across the characteristic bands against CPU time.
The lowest MADs (10–15 cm−1) are obtained when using mGGA or higher-rung functionals and with a polarised double/triple-ζ or higher basis sets.
The hybrid functionals M06, M06-2X, PBE0 and ωB97X-D, as well as the mGGA functional M06-L, have the fastest job completion times (2–5 hours) while yielding low MADs. The fastest job completions at basis set convergence are observed using the 6-311++G(2d,2,p), def2-TZVP or cc-pVDZ basis sets (1–5 hours). PBE0 in particular shows the overall lowest error and fastest calculation time of all hybrid functionals.
![]() | (3) |
Fig. 4 illustrates the workflow of Bayesian inference in this study. In the context of this paper, a hypothesis was taken to be a unique set of assignments of all theoretical bands to experimentally observed bands within the spectrum range of 680–1800 cm−1, within a range of ±40 cm−1 (45 cm−1 in the case of p-hydroxy benzaldehyde) of the theoretical band (S19). In each iteration, the evidence was treated as the probability of a theoretical band being observed at its predicted frequency, given the experimental assignment specified by the hypothesis. This probability was then modelled using a Gaussian distribution centered on the expected calculated value, with the mean and standard deviation determined from the benchmarking of the computational method.
![]() | ||
| Fig. 5 Aligned experimental solid state (black), gas phase61 (blue) and DFT (red) IR spectra of vanillin. DFT calculation performed using PBE0/6-311++G(2d,2p) with a scaling factor of 0.950. Highlighted regions indicate assigned bands used in evaluating deviation statistics from DFT calculated frequencies to solid state bands. | ||
Table 2 shows the assigned vibrational modes to the absorption bands observed in the experimental solid state, gas phase and DFT calculated IR spectra of vanillin.
| Band | Wavenumber (cm−1) | Band assignmenta | Solid state intensity | |||
|---|---|---|---|---|---|---|
| Solid state | Gas phase | DFT | Transmittance (%) | Relative | ||
| a ν refers to stretching, δ refers to in-plane bending, γ refers to out-of-plane bending. “symm.” and “asymm.” indicate symmetrical or asymmetrical vibration between the atoms of the same group. “i. ph.” and “o. ph.” indicate in and out out-of phase vibrations between different groups.b Experimentally not clearly observable or combined into larger band (e.g. band shoulder). | ||||||
| 1 | 812 | 818 | 794 | γ(C–H) 2 adjacent aromatic symm. (wagging) | 68 | m |
| 2 | 859 | 874 | 854 | γ(C–H) lone aromatic | 62 | s |
| 3 | 1025 | 1034 | 1024 | ν(O–CH3), δ(C–C–C) aromatic ring bend | 60 | s |
| — | N/Ab | N/Ab | 1087 | δ(C–H) o.ph. adjacent aromatic (Scissoring) | — | — |
| 4 | 1121 | 1118 | 1123 | δ(C–H) aromatic, δ(O–C–H) aldehyde, ν(C–OCH3) | 53 | vs |
| 5 | 1151 | 1150 | 1151 | δ(CH3), δ(C–O–H) | 49 | vs |
| 6 | 1170 | 1182 | 1173 | δ(C–H) aromatic, δ(O–C–H) aldehyde, ν(C–OCH3) | 61 | s |
| — | N/Ab | N/Ab | 1225 | o. ph. (ν(C–OCH3), ν(C–OH)), δ(C–H) aromatic | — | — |
| 7 | 1264 | N/Ab | 1241 | ν(C–OH), δ(C–O–H), δ(C–H) aromatic | 57 | s |
| 8 | 1298 | 1274 | 1274 | i. ph. (ν(C–OCH3), ν(C–OH)), ν(C–C) aromatic, δ(C–H) aromatic | 62 | s |
| — | N/Ab | N/Ab | 1341 | δ(C–H) aldehyde, δ(C–H) aromatic, δ(C–O–H) | — | — |
| 9 | 1371 | N/Ab | 1378 | ν(C–C–C) aromatic (“Kekulé’s”), δ(O–C–H) ether, δ(C–O–H), δ(C–H) aldehyde | 78 | m |
| 10 | 1397 | 1390 | 1402 | δ(CH3) symm. (“Umbrella”), δ(C–O–H), δ(C–H) aldehyde | 76 | m |
| 11 | 1428 | 1438 | 1433 | δ(CH3) asymm., ν(C–C–C) aromatic (“semi-circular”) | 61 | s |
| 12 | 1509 | 1510 | 1485 | ν(C–C–C) aromatic (“semi-circular”) | 64 | s |
| 13 | 1586 | 1598 | 1579 | ν(C–C–C) aromatic (“quadrant”) | 60 | s |
| 14 | 1662 | 1718 | 1696 | ν(C O) aldehyde |
54 | vs |
| 15 | 2861 | 2722 | 2769 | ν(C–H) aldehyde | 84 | w |
| — | N/Ab | 2810 | 2892 | ν(C–H) methyl symm. | — | — |
| 16 | 2945 | 2962 | 2957 | ν(C–H) methyl asymm. | 83 | w |
| 17 | 3020 | 3014 | 3013 | ν(C–H) methyl asymm. | 82 | w |
| — | N/Ab | 3074 | 3065 | ν(C–H) lone aromatic | — | — |
| — | 3146 | 3578 | 3621 | ν(O–H) alcohol | 79 | w |
In regards to the solid state noticeable medium bands can be found at 812 and 859 cm−1, which correspond to the out-of-plane aromatic C–H bending fingerprint of the 1,3,4-tri substituted ring, involving the in-phase wagging mode of the adjacent 5,6-position and the bending mode of the 2-position C–H bonds respectively.66 A strong band at 1025 cm−1 is attributed to the C–OCH3 ether stretching vibration and ring deformation mode. Very strong bands observed at 1121, 1151 and 1170 cm−1 respectively are assigned to modes caused primarily by in-plane bending modes of the aromatic C–H bonds, but also alcohol and ether bending and ether stretching modes. It should be noted that the gas phase and DFT bands corresponding to these bands show lower intensity. Strong bands at 1264 and 1298 cm−1 are assigned to alcohol C–O stretching, and in-phase ether and alcohol C–O stretching modes. A medium band observed at 1371 cm−1 is assigned to the specific “Kekulé” ring stretching mode, as well as related bends. The medium and strong bands observed at 1397 and 1428 cm−1 respectively are attributed to methyl CH3 (ether) in and out-of-phase bends. Strong bands at 1509 and 1586 cm−1 correspond to characteristic ring “Semi-circular” and “Quadrant” vibrations. The strong band 1662 is caused by the aldehyde C
O stretching mode. A weak doublet at 2861 cm−1 corresponds to the aldehyde C–H stretching vibration. Weak bands can be observed at 2945 and 3020, which are loosely assigned stretching vibrations of the methyl and aromatic C–H bonds. The gas phase and DFT spectra provide clearer profiles of these vibrations, allowing clearer distinction between the symmetric and asymmetric methyl C–H, and a ring lone C–H stretching mode. A characteristic broad band at 3146 cm−1 corresponds to the O–H stretching mode, seen as sharp stronger bands at higher frequencies in the gas phase and DFT-calculated spectra.
In the context of this paper, a hypothesis was taken to be an assignment of all represented theoretical vibrations to experimental bands. The evidence presented in each iteration was calculated as the likelihood that a theoretical vibration would have the observed frequency it has, given that it represented the experimental band assigned in the specific hypothesis. This probability was modelled as a Gaussian probability density distribution around an expected value (experimental band shifted by the MSD) and the STD calculated during the benchmark. The result of this procedure is a relative probability value for each hypothesis, which quantifies the likelihood of that hypothesis to be correct, given the expected accuracy of the DFT calculation, thus identifying band assignments that are more plausible than others (S19). Those hypotheses were then examined, and the most reasonable hypothesis was considered based on IR intensity and chemical sense. Fig. 6 shows the experimental-to-theoretical comparison of the chosen modes for all studied molecules. The x-axis corresponds to the experimentally obtained band frequencies for the assigned bands, and the y-axis corresponds to the scaled calculated frequencies for the vibrational modes (scaling factor of 0.950 for PBE0/6-311++G(2d,2p)).
A systematic underestimation of the frequencies was observed, as most assigned frequencies predict a lower value than the experimental band. While some of the vibrations show great agreement between theory and experiment, aromatic C–H wagging modes and C–O stretching modes are calculated within 10–30 cm−1 of the experiment, leading to a deviation of 1–4% for C–H wagging and 2% for C–O stretching modes. CH3 deformations show excellent agreement for vanillin, whereas a consistently negative deviation of around 30 cm−1 (2–3%) is observed for syringaldehyde, cuminaldehyde, eugenol, estragole and dihydrocoumarin (CH2 deformations). Aromatic ring “semi-circular” and “quadrant” modes show good agreement across all the molecules with most calculations falling within 40 cm−1 below the experimental bands (∼2.5%). The MSD for all molecules, except umbelliferone, is negative, indicating that for the chosen set of molecules and vibrations, the mean error metrics could also be improved by applying a higher scaling factor. This is expected when considering the comparison between single molecule calculations to solid or liquid state systems, in which the bond vibrations and affected by intermolecular interactions.
The calculated values across all identified modes exhibit MADs ranging from 10 cm−1 to 20 cm−1 (0.9–1.8%) (Fig. 7), showing good agreement between the experimental spectra and assigned theoretical frequencies.
![]() | ||
| Fig. 7 DFT-calculated (PBE0/6-311++G(2d,2p)) MAD (%) (opaque) and MSD (%) (transparent) across all studied vibrational modes for all compounds compared to experimental IR bands. | ||
Using this approach to assign theoretical frequencies to experimental bands allows for more coherent identification of recurring key vibrations for the studied molecules. These vibrations include the fingerprint aromatic C–H out-of-plane bending modes (650–900 cm−1), phenyl and aryl ether C–O stretching modes (1000–1300 cm−1), CH3 and CH2 deformation (1300–1500 cm−1) and ring C
C stretching (“semi-circular” and “quadrant”; 1400–1650 cm−1) modes.
Optimised scaling factors, following the method proposed by Alecu26 et al. are reported in this paper for method/basis set combinations for which scaling factors were not previously available in the literature, to the best of our knowledge.
A spectral analysis of vanillin has been performed, in which the characteristic DFT-calculated vibrational modes are assigned to experimentally observed solid state IR bands. This was enabled by comparing between DFT, gas phase61 and solid state, allowing for better association between experiment and theory, and for gauging of the expected deviations when comparing single molecule DFT to non-gas phase experimental IR spectra. Using the normal modes obtained by the frequency calculation, the nature of each vibrational mode is thus explained, allowing for better understanding of the characteristic groups of vanillin responsible for each vibration.
A Bayesian approach was then applied to identify important vibrational modes in a selection of crucial volatile aromatic compounds (p-hydroxy benzaldehyde, syringaldehyde, cinnamaldehyde, cuminaldehyde, eugenol, estragole, coumarin, dihydrocoumarin and umbelliferone). The theoretical frequencies were assigned to experimental bands based on the calculated deviation statistics, using Bayes' theorem (S7) (S19). The performance of our approach was assessed by comparing the theoretical frequencies with their matched experimental bands, showing good performance with MAD (%) values between 1 and 2% for all compounds.
Key vibrations found in aromatic systems (aromatic C–H bending and ring C
C stretching modes) were identified (where observable) in all molecules (S18) and compared with the experimental spectra.
A negative MSD across all molecules, except umbelliferone, in the studied spectral range suggests that a higher scaling factor for the method would improve the fit of these frequencies to their experimental spectra.
In summary, an efficient computational method and a universal scaling factor (PBE0/6-311++G(2d,2p) with a scaling factor of 0.950) are reported and used to analyse the solid state experimental IR spectrum of vanillin. Bayesian inference is used to provide a framework for matching single molecule theoretical to experimental solid and liquid state IR vibrational frequencies of a selection of important volatile organic chemicals and the consistency of the method is tested by comparing indicative vibrational mode frequencies to experimentally assigned IR bands. Key vibrational modes of all compounds are identified and matched on the experimental IR spectrum.
Supplementary information (SI): the methods used, data collected and processed and information used in the current study. See DOI: https://doi.org/10.1039/d5dd00453e.
| This journal is © The Royal Society of Chemistry 2026 |