Przemysław Pastwa and Piotr Bruździak*
Department of Physical Chemistry, Gdańsk University of Technology, Narutowicza 11-12, 80-233 Gdańsk, Poland. E-mail: piotr.bruzdziak@pg.edu.pl
First published on 30th May 2025
This paper introduces VaporFit, open-source software for automated atmospheric interference correction in Fourier-transform infrared (FTIR) spectroscopy, based on a refined correction algorithm. It significantly improves the accuracy and reproducibility of chemical and biological FTIR analysis by removing variable contributions from water vapor and carbon dioxide that often obscure spectral features. Traditional methods rely on subtraction of a single reference spectrum and therefore struggle with atmospheric variability. VaporFit instead employs a multispectral least-squares approach that automatically optimizes subtraction coefficients based on multiple atmospheric measurements recorded throughout the experiment. The software provides a user-friendly graphical user interface (GUI) and built-in tools, including objective smoothness metrics and a principal component analysis (PCA) module, to facilitate parameter selection and evaluation of correction quality. Furthermore, we offer practical recommendations for data acquisition strategies tailored for effective atmospheric correction. VaporFit, the user guide, and sample data sets are freely available at https://zenodo.org/records/15411176 and https://github.com/piobruzdpg/VaporFit/releases/tag/v1.0.
This interference mainly stems from water vapor (H2O, D2O, or HDO when heavy water is used) and carbon dioxide (CO2), as well as, in some cases, other volatile compounds present in the samples handled in the laboratory. Each component absorbs light independently, and their proportions depend on factors beyond the experimenter's control, such as ambient humidity, the number of people in the room, the frequency of opening the sample compartment, the purity of purging gases, solvent content, and even the stability of the infrared source. To minimize this interference, instruments are typically purged with dry gas (nitrogen or dried air). However, purging is imperfect, since the purging gas may itself contain impurities. Pressure fluctuations, caused by frequent chamber openings or gas regulator malfunctions, can also introduce inconsistencies in the properties of the internal atmosphere.
In this article, we introduce VaporFit, free, open-source software based on a new version of the atmosphere correction algorithm. The new, streamlined version of the algorithm has been stripped of elements that proved unnecessary (e.g., considering the baseline at the stage of optimizing correction parameters). The design of the Python script has been changed, and the entire core of the algorithm is now enclosed in a single class, which makes it considerably easier to reuse in users' own projects and to modify. We also explain why the algorithm works and what factors limit its applicability. The most important functionality from the perspective of an average user is the graphical user interface (GUI), which makes correction accessible to people less experienced in programming. The GUI currently includes tools for a more rational selection of smoothing parameters and a principal component analysis (PCA) module for visual assessment of correction quality.
[Equation (1)]
Fig. 1 Scheme of the iterative correction of spectra affected by atmospheric contribution. All symbols are consistent with eqn (1).
Unlike classical subtraction, which relies on a single reference spectrum, this algorithm dynamically combines multiple vapor spectra with optimized coefficients a_n. The core idea is an iterative process: starting with initial coefficients (set to 0.1 in the current version of the VaporFit code), the algorithm calculates a currently corrected spectrum. This spectrum is then smoothed to provide an estimate (Ȳν) of what the ideal, atmosphere-free spectrum should look like. The difference between the currently corrected spectrum and this smoothed estimate forms the residual (rν). The least-squares method then adjusts the coefficients a_n to minimize this residual, driving the currently corrected spectrum closer to its smoothed version and thereby removing sharp atmospheric features while preserving the broad sample bands (see an example in Fig. 2). The corrected spectrum is then the result of applying the final, optimized coefficients a_n. The estimate Ȳν is obtained by Savitzky–Golay (SG) smoothing.6 The SG method requires two key parameters:
• Polynomial order – the degree of the polynomial used for local approximation,
• Window size – the number of adjacent points used for fitting.
Proper selection of these parameters may be crucial for the algorithm's effectiveness (see Fig. 3).
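The iterative scheme described above can be summarized compactly as a minimization problem. The block below is only a sketch of that description: the symbols S_ν (measured spectrum) and A_{n,ν} (the n-th atmospheric spectrum) are notation assumed here for illustration, while a_n, Ȳν, and rν follow the text. It should not be read as a verbatim restatement of eqn (1).

```latex
\begin{aligned}
Y_\nu(\mathbf{a}) &= S_\nu - \sum_{n} a_n A_{n,\nu}
  && \text{(currently corrected spectrum)} \\
\bar{Y}_\nu(\mathbf{a}) &= \mathrm{SG}\!\left[\, Y_\nu(\mathbf{a}) \,\right]
  && \text{(Savitzky--Golay smoothed estimate)} \\
r_\nu(\mathbf{a}) &= Y_\nu(\mathbf{a}) - \bar{Y}_\nu(\mathbf{a})
  && \text{(residual)} \\
\hat{\mathbf{a}} &= \arg\min_{\mathbf{a}} \sum_{\nu} r_\nu(\mathbf{a})^{2}
  && \text{(least-squares optimum)}
\end{aligned}
```

In this form, the polynomial order and window size enter only through the smoothing operator SG[·], which is why their choice determines which features are treated as "sharp" and end up in the residual.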
Fig. 3 Influence of Savitzky–Golay smoothing parameters on spectral smoothing in the OH/COO− band region of the 10th spectrum of the betaine test set (the same as in Fig. 2).
• Streamlining of the algorithm code by removing unnecessary elements and functionalities.
• Introduction of a GUI and additional tools for optimizing SG parameters and evaluating result quality.
• Code improvement for better understanding of results and enabling its use in custom projects (see Section 3.1.1).
• Precise identification of the rationale behind the algorithm's effectiveness and its limitations (see Sections 3.2.3 and 3.4.1).
• Demonstration of its effectiveness in correcting other types of FTIR spectral interference, such as the CO2 band (see Section 3.2.3).
• Comparison to other popular methods for this type of correction (see Section 3.3).
• Unified recommendations, based on our current experience, for conducting measurements and using the algorithm (see Section 3.4).
To improve accessibility, we have translated the original command-line script into an open-source desktop application featuring a user-friendly graphical interface. The new software, VaporFit, includes support for batch processing, visual inspection of input/output spectra, and export of correction parameters, significantly lowering the barrier to adoption by non-expert users.
The previous version of the algorithm required the user to specify SG parameters, but the quality of the resulting correction could only be assessed visually after the calculations were completed. VaporFit introduces several tools to facilitate the selection of these parameters. The program now performs parallel corrections in the background for several window sizes around the one selected in the GUI and plots quantitative indicators for each of them. These plots allow for a more rational assessment of which combination of smoothing parameters yields the smoothest series of spectra. For series typically measured in our laboratory, the default parameters (polynomial order 3, window size 11) are usually optimal. However, these parameters may differ for spectra with a significantly larger or smaller band full width at half maximum (FWHM) or a different spectral resolution compared to those presented in this work.
To enhance the selection process for Savitzky–Golay smoothing parameters, VaporFit provides objective smoothness metrics. These include:
• Spectral Smoothness Index (SSI, eqn (2), where yi are the spectral values at points i, and N is the number of data points):
[Equation (2)]
• Second derivative variance (SDV).
• Standard deviation of residual signal (SD, where residual is the result of the subtraction between the corrected spectrum and its smoothed version).
In general, lower values for these metrics indicate smoother signals, thus aiding in optimal parameter choice. However, interpretation depends on specific signal characteristics, requiring users to develop their own assessment strategies.
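As a rough illustration of how such indicators behave, the sketch below evaluates two of them for several SG window sizes. The function names, the SDV definition used here (variance of the discrete second difference), and the synthetic spectrum are assumptions made for illustration and are not taken from the VaporFit code. The SSI is not included because its definition is given by eqn (2).

```python
import numpy as np
from scipy.signal import savgol_filter


def second_derivative_variance(y):
    """Variance of the discrete second difference (assumed SDV definition)."""
    return float(np.var(np.diff(y, n=2)))


def residual_sd(y, window, polyorder=3):
    """Standard deviation of (spectrum - its SG-smoothed version)."""
    return float(np.std(y - savgol_filter(y, window, polyorder)))


# Synthetic stand-in data: one broad band, with and without narrow "vapor" lines.
x = np.linspace(1500.0, 1800.0, 601)  # 0.5 cm-1 point spacing (assumed)
broad = np.exp(-0.5 * ((x - 1650.0) / 25.0) ** 2)
lines = sum(0.05 * np.exp(-0.5 * ((x - c) / 0.8) ** 2)
            for c in range(1520, 1800, 15))
contaminated = broad + lines

# Scan a few window sizes around the default of 11 points.
for window in (7, 9, 11, 13, 15):
    print(f"window {window:2d}: "
          f"SD clean = {residual_sd(broad, window):.2e}, "
          f"SD contaminated = {residual_sd(contaminated, window):.2e}")

print("SDV clean       :", second_derivative_variance(broad))
print("SDV contaminated:", second_derivative_variance(contaminated))
```

Both indicators are markedly lower for the spectrum without narrow lines, which is the behavior the metrics are meant to capture when comparing corrections obtained with different parameters.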
Principal component analysis (PCA) is another tool that VaporFit uses to visually check how well atmospheric correction works across a whole series of spectra. We recommend selecting SG parameters based on a visual comparison of the pre- and post-correction principal components (PCs) of spectral series. If atmospheric interference is present, it appears as contamination in principal components before correction (see Fig. 4). After successful correction, these bands should ideally disappear. In addition to principal component shapes, the PCA module provides explained variance values, estimating the minimum number of components required before and after correction. However, atmospheric spectra rarely appear as pure components, as their contributions often correlate with sample spectrum changes. Thus, variance trends should be interpreted cautiously, and a reduction in variance does not always indicate complete atmospheric component removal.
Fig. 4 The first two principal components (PCs) of the betaine test set shown in Fig. 2 were obtained both before and after correction. Subtraction coefficients were determined using the following SG parameters: polynomial order 3, window size 11. The contribution of atmospheric spectra is mainly visible between 1800 and 1500 cm−1 before correction and disappears after correction.
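The kind of check described above can be sketched with a plain SVD-based PCA. The example below uses a synthetic series rather than the betaine data, and the "after" series is an idealized, perfectly corrected one. The helper name `pca_components` and all numerical values are assumptions chosen for illustration, not part of VaporFit.

```python
import numpy as np


def pca_components(spectra):
    """Return explained-variance ratios and principal components (rows) via SVD."""
    centered = spectra - spectra.mean(axis=0)
    _, singular_values, components = np.linalg.svd(centered, full_matrices=False)
    variance = singular_values ** 2
    return variance / variance.sum(), components


# Synthetic stand-in series: a broad band whose intensity grows along the
# series, plus narrow lines entering each spectrum with a random weight.
rng = np.random.default_rng(0)
x = np.linspace(1500.0, 1800.0, 601)
broad = np.exp(-0.5 * ((x - 1650.0) / 25.0) ** 2)
lines = sum(np.exp(-0.5 * ((x - c) / 0.8) ** 2) for c in range(1520, 1800, 15))
band_weight = np.linspace(0.2, 1.0, 12)
vapor_weight = rng.uniform(-0.05, 0.05, 12)
noise = rng.normal(0.0, 1e-4, (12, x.size))

before = np.outer(band_weight, broad) + np.outer(vapor_weight, lines) + noise
after = np.outer(band_weight, broad) + noise  # idealized corrected series

ratio_before, pcs_before = pca_components(before)
ratio_after, pcs_after = pca_components(after)
print("explained variance before:", np.round(ratio_before[:3], 6))
print("explained variance after :", np.round(ratio_after[:3], 6))
```

In this idealized case the narrow-line contribution shows up as a second component before correction and drops to the noise level afterwards. In real series the atmospheric contribution is usually mixed into several components, so the shapes of the PCs are more informative than the variance values alone.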
Users can also inspect correction coefficients determined for each atmospheric spectrum, aiding experiment monitoring and further analysis of changes in atmospheric composition during experiments (see Fig. 5).
Fig. 5 Optimized subtraction coefficients for two atmospheric spectra, which were used to correct the spectra in Fig. 2. Colors correspond to the middle panel of Fig. 2. The initial atmospheric spectrum was measured at the beginning of the experiment, and the final one was measured at the end. Clearly, the contribution of the initial atmospheric spectrum diminishes over time, while the second one increases. The change is gradual, but that is not always the case.
• wavenb – 1D array of wavenumbers.
• spectrum – 1D or 2D array of the measured spectrum or spectra.
• atm_spectra – 2D array of atmospheric reference spectra (e.g., water vapor, CO2).
• sg_poly, sg_points (optional) – Savitzky–Golay smoothing parameters (polynomial order and window size).
• fit() – performs non-linear least-squares fitting of atmospheric spectra to the measured spectrum and returns the best-fit scaling coefficients.
• atm_subtract() – applies the fitted parameters to subtract atmospheric contributions and returns the corrected spectrum.
An internal method, residuals(params), is used during the optimization procedure to compute the smoothed residual signal that is minimized in the fitting process. Although it is publicly accessible, it is intended for internal use only and is not meant to be called directly by the user.
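To make the interface described above more concrete, the following is a minimal stand-in written for this text. It is not the VaporFit class: the class name `AtmCorrectionSketch`, the internals, the row-wise layout of `atm_spectra`, and the synthetic data are all assumptions. Only the argument names (lower-cased here), the method names, and the starting value of 0.1 are taken from the description.

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.signal import savgol_filter


class AtmCorrectionSketch:
    """Illustrative stand-in with the interface described in the text (hypothetical)."""

    def __init__(self, wavenb, spectrum, atm_spectra, sg_poly=3, sg_points=11):
        self.wavenb = np.asarray(wavenb, dtype=float)
        self.spectrum = np.asarray(spectrum, dtype=float)  # 1D only, for brevity
        # Assumed layout: one atmospheric spectrum per row.
        self.atm_spectra = np.atleast_2d(np.asarray(atm_spectra, dtype=float))
        self.sg_poly = sg_poly
        self.sg_points = sg_points
        self.params = None

    def _corrected(self, params):
        return self.spectrum - params @ self.atm_spectra

    def residuals(self, params):
        """Corrected spectrum minus its SG-smoothed version."""
        corrected = self._corrected(params)
        return corrected - savgol_filter(corrected, self.sg_points, self.sg_poly)

    def fit(self):
        """Least-squares fit of the subtraction coefficients, starting from 0.1."""
        start = np.full(self.atm_spectra.shape[0], 0.1)
        self.params = least_squares(self.residuals, start).x
        return self.params

    def atm_subtract(self):
        """Return the spectrum corrected with the fitted coefficients."""
        if self.params is None:
            self.fit()
        return self._corrected(self.params)


# Synthetic test case: broad sample band plus two different "vapor" spectra.
x = np.linspace(1500.0, 1800.0, 601)
sample = np.exp(-0.5 * ((x - 1650.0) / 25.0) ** 2)
centers = np.arange(1520.0, 1800.0, 15.0)


def vapor(amplitudes):
    return sum(a * np.exp(-0.5 * ((x - c) / 0.8) ** 2)
               for a, c in zip(amplitudes, centers))


atm = np.vstack([vapor(np.linspace(1.0, 0.4, centers.size)),
                 vapor(np.linspace(0.4, 1.0, centers.size))])
measured = sample + 0.7 * atm[0] + 0.3 * atm[1]

model = AtmCorrectionSketch(x, measured, atm)
coeffs = model.fit()
corrected = model.atm_subtract()
print("fitted coefficients:", coeffs)
print("max deviation from the clean band:", np.max(np.abs(corrected - sample)))
```

With these synthetic inputs the fitted coefficients should come out close to the values used to build `measured` (0.7 and 0.3), because the broad band is almost unaffected by the smoothing step while the narrow lines are not.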
VaporFit source code or its fragments can be customized for specific research needs. It is available under the GNU GPL v3.0 license, which includes an additional citation requirement. VaporFit in its current version uses the following Python packages: NumPy,7 SciPy,8 Matplotlib.9
The requirement for a difference in the smoothability of the corrected and correcting signals has a very important consequence. Such correction would not be possible if the FWHM of the bands of both types of signals or spectra were similar, as both would be similarly affected or not affected at all by the smoothing step. In that case, the difference before and after smoothing (residual) would always be either zero or completely random. On the other hand, even if the spectrum contamination is very large, almost ideal correction is still possible if the FWHM of the bands of both signals are significantly different. An example is the heavily contaminated series of betaine solution spectra in Fig. 2, where the bending OH and carbonyl bands and other skeletal bands of betaine have much larger FWHM than the gaseous water bands, and still the correction is very good.
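The dependence on band width can be checked numerically. The sketch below applies SG smoothing (order 3, 11 points) to single Gaussian bands of different FWHM and reports how much of each band ends up in the residual. The Gaussian band shape, the 0.5 cm−1 point spacing, and the helper names are assumptions made for illustration.

```python
import numpy as np
from scipy.signal import savgol_filter


def gaussian(x, center, fwhm):
    """Unit-height Gaussian band with the given full width at half maximum."""
    sigma = fwhm / (2.0 * np.sqrt(2.0 * np.log(2.0)))
    return np.exp(-0.5 * ((x - center) / sigma) ** 2)


def smoothing_residual(fwhm, window=11, polyorder=3):
    """Largest absolute difference between a band and its SG-smoothed version."""
    x = np.arange(1500.0, 1800.0, 0.5)  # assumed 0.5 cm-1 point spacing
    band = gaussian(x, 1650.0, fwhm)
    return float(np.max(np.abs(band - savgol_filter(band, window, polyorder))))


for fwhm in (1.0, 2.0, 5.0, 20.0, 60.0):
    print(f"FWHM {fwhm:5.1f} cm-1 -> max |residual| = {smoothing_residual(fwhm):.2e}")
```

Narrow, gas-like bands leave a large residual, whereas broad bands pass through the smoothing step almost unchanged. When the two widths become comparable, their residuals become comparable as well, and the least-squares step can no longer tell the two contributions apart.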
For these reasons, it is possible to use the algorithm to remove other types of interference. Initially developed for FTIR spectrum correction in the amide I band of proteins, VaporFit has also proven effective for other spectral regions, including the CO2 asymmetric stretching band (∼2400 cm−1). This capability enables efficient removal of CO2 interference, which can obscure key vibrational bands of CN groups, sulfur-based groups, or, as in Fig. 6, D2O stretching bands.
At first glance, manual atmospheric correction may seem simple: recording an empty chamber spectrum and subtracting it from sample spectra using a scaling factor. However, two major challenges arise. First, for large spectral series, manually adjusting the subtraction coefficient is tedious and time-consuming, requiring trial and error. Second, atmospheric conditions fluctuate, altering spectral shape and intensities. As shown in Fig. 7(a), the composition of the atmosphere inside the instrument or sample chamber can change due to sample evaporation. In this case, the sample contained large amounts of D2O. The atmospheric spectra for temperatures of 40.0 °C and 50.0 °C in this figure contain bands of gaseous heavy and ordinary water, as well as likely semi-heavy water. This type of measurement is common in protein structure studies, so this type of atmospheric variability should be taken into account. In such a case, it would be practically impossible to correctly subtract all contributions using a single atmospheric spectrum.
It should also be emphasized that the ro-vibrational bands of gaseous phases are very narrow and react strongly to any environmental changes (e.g., temperature, humidity) and clearly distort the spectral image in the regions of approximately 3600 cm−1 and 1600 cm−1, even if the atmosphere itself does not change its chemical composition. This variability can be illustrated by the example of atmospheric spectra from Fig. 2. Both spectra were measured with a time difference of approximately 1 hour. We calculated the differences between both spectra in the ranges of the OH and CO2 bands, and the results are presented in Fig. 8. Although the instrument was purged with dry nitrogen, and from the user's perspective, the conditions during the measurement did not change, an hour's difference in the measurement of these two atmospheric spectra made direct subtraction of one from the other ineffective. Instead of improving spectral quality, subtraction can introduce sharp differential bands, adding noise rather than eliminating it. Thus, a single atmospheric spectrum is rarely effective unless background, atmospheric, and sample spectra are recorded in quick succession.
Fig. 8 Differences between two atmospheric spectra from the betaine spectra test set. Subtraction coefficients were chosen such that the average absorbance value of all points was close to zero. (a) Difference in the amide I band range (as in Fig. 2), (b) difference in the CO2 band range. Although both spectra were measured approximately 1 hour apart, complete subtraction of the spectra is not possible.
VaporFit also performs better than the automatic atmospheric correction methods available in spectrometer software, which are typically based on a database of atmospheric spectra provided for a range of measurement parameters for a given series of instruments. They work well, but only if the time difference between the background spectrum and the actual spectrum is small (i.e., when the amount of water vapor has not changed significantly) or when the conditions inside and outside the instrument remain practically unchanged, which is rather rare. These methods are excellent for routine, fast, and undemanding measurements, but they can introduce artifacts that become problematic for very precise measurements, for example, in studies of interactions in solutions or in determining protein secondary structure. This is clearly visible in Fig. 9, although it should be noted that the series was measured approximately 30 minutes after the dry nitrogen purge was started, so the effect is more pronounced than in most situations. Nevertheless, the distorted CO2 bands, the distortions around 2000 cm−1, and a number of small bands in the water vapor range overlapping with the solution bands clearly cannot be corrected effectively with standard methods. Our team's experience shows that artifacts in spectra that require high quality, even if barely visible, can affect subsequent analysis steps. An example is protein spectra in the amide I band range, whose deconvolution depends on the effectiveness of the correction. Small irregularities and artifacts on the band contour can determine the position, or even the existence, of small component bands. For this reason, automatic correction is never used in our laboratory for this type of measurement.
A greater problem arises when measuring series in which the temperature is varied. In such cases, the variability of atmospheric spectra and their composition cannot be ignored. Subtracting a single atmospheric spectrum throughout the entire series is not meaningful, and with the manual single-spectrum subtraction method the only solution is usually to measure a background spectrum before, and an atmospheric spectrum after, each sample spectrum. It should be emphasized that even when purging the instrument with dry nitrogen or another gas, the stability of the atmosphere during continuous heating of the cuvette or the measurement accessory stage is very poor, and a spectrum measured even in such a configuration will have clear atmospheric bands, similar to Fig. 8. The procedure proposed in this publication, i.e., measuring one starting background spectrum and several (at least two) atmospheric spectra covering the entire temperature variability (see Section 3.4 and Fig. 10), provides much better results at a much lower cost of work and time. An example is the temperature series of transmission spectra of lysozyme solutions in D2O presented in Fig. 7, for which it was possible to measure and correct the protein spectra in the range of the amide I′, amide II, and amide II′′ bands. The background spectrum for the series was measured once at 30.0 °C. Each of the atmospheric spectra simultaneously compensated for any background fluctuations related to temperature changes and atmosphere composition.
In the context of striving for the most accurate atmospheric correction, it is also worth mentioning the concept of measurements with increased spectral resolution (oversampling), as suggested by Goormaghtigh et al.4 This approach, which involves recording spectra with a resolution higher than nominally required for broad sample bands, aims to better characterize narrow, sharp gaseous bands, which facilitates their differentiation from the sample signal. Although this strategy is valuable, it is associated with experimental challenges, such as extended measurement time and potentially greater sensitivity to dynamic changes in atmosphere composition during acquisition – a problem that we illustrated with the example of the difficulty in compensating for two atmospheric spectra measured at a time interval (Fig. 8). We believe that the multispectral algorithm implemented in VaporFit, through its ability to adaptively select combinations of reference atmospheric spectra, could be a valuable complement to data collected by the oversampling method. This would allow for more effective handling of atmosphere variability even with high-resolution measurements, while minimizing the difficulties associated with manual correction of single, very “sharp” reference spectra.
• We advise recording the background spectrum once at the beginning of the experiment for serial measurements (as in Fig. 10). Although seemingly counterintuitive, this approach often increases consistency within a spectral series. However, because atmospheric absorption can make the raw recorded spectra look noisy, obscuring important bands and hindering real-time quality assessment, this approach can make measurements stressful. To mitigate this, humidity reduction techniques should be used.
• We recommend recording at least two atmospheric spectra, one near the beginning and one at the end of an experiment involving a series of spectra (see Fig. 11 for a comparison of the results of correction with a single atmospheric spectrum and with two different ones). In most cases, their linear combination effectively corrects all spectra measured between them. Correction with just one atmospheric spectrum is possible, but a single spectrum cannot capture the variations caused by temperature, pressure, and humidity fluctuations in the laboratory.
Fig. 11 Comparison of correction quality using a single atmospheric spectrum versus two atmospheric spectra. The dataset is the same as in Fig. 2.
• VaporFit performs best when correcting with a small number of atmospheric spectra (2–5), making experimental planning easier. Using too many atmospheric spectra may lead to overfitting, introducing baseline fluctuations and unnecessary noise instead of improving correction accuracy. We suspect that the reason for this is the limitations of the least-squares method used for optimizing correction parameters.
• Correction is most effective when the FWHM of sample bands is significantly larger than that of atmospheric bands. In other words, the more distinct the sample spectrum is from atmospheric interference, the easier the correction process.
• A key requirement for the method to work correctly is that the spectrum must be smoothable, as Ȳν in eqn (1) should reflect the real spectrum devoid of atmospheric components. This means bands should be relatively broad or recorded at sufficiently high resolution. Standard resolution of 2–4 cm−1 for protein and aqueous solution spectra should be sufficient.
• Proposed default SG parameters in VaporFit (3/11, i.e., polynomial of degree 3 and 11 smoothing points) work very well for the measurements mentioned in this publication.
• SG parameters, used for Ȳν estimation, influence correction accuracy. If the smoothing is too weak or too strong, the algorithm may converge to an unsatisfactory solution: the estimated spectrum is then either not smooth enough or too smooth, and atmospheric spectra may be subtracted with essentially random coefficients. Excessively high parameter values lead to an unrealistically estimated spectrum Ȳν, increasing noise in the final corrected spectra (see Fig. 3). We recommend setting the lowest practical polynomial order (typically 3 for FTIR spectra recorded at 2–4 cm−1 resolution). For fingerprint FTIR spectra, the best SG window size is usually between 5 and 21 points, depending on the FWHM and spectral resolution of the main bands. This parameter primarily determines the smoothing effect. For correction with a single atmosphere spectrum, the algorithm is relatively insensitive to SG parameter selection.
• Atmospheric correction should be performed before ATR correction. The atmospheric spectrum is primarily related to gases in the optical path inside the instrument, not the atmosphere above the crystal, and therefore does not depend on the optical properties of the crystal or the sample. ATR correction can treat atmospheric bands as sample bands, thus incorrectly assigning them variability specific to the sample's refractive index. Correcting such a spectrum may later be impossible or very difficult.
• The method may be less effective for spectra with strong oscillations, irregularities, or high local variability, though typical FTIR measurements rarely pose such issues. Naturally, the atmospheric spectrum, which is going to be subtracted, is not subject to this restriction.
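The recommendation to record at least two atmospheric spectra can be illustrated with a small numerical experiment. The sketch below is not the VaporFit implementation. Because SG smoothing is a linear filter, the minimization is written here as an ordinary linear least-squares problem, a shortcut chosen for brevity. The helper names, the synthetic "start" and "end" atmospheric spectra, and the assumption that line intensities drift between them are all illustrative.

```python
import numpy as np
from scipy.signal import savgol_filter


def rough_part(y, window=11, polyorder=3):
    """Part of a signal removed by SG smoothing (signal minus smoothed signal)."""
    return y - savgol_filter(y, window, polyorder, axis=-1)


def correct(spectrum, atm_spectra):
    """Subtract the combination of atmospheric spectra that leaves the smoothest result."""
    atm = np.atleast_2d(atm_spectra)
    coeffs, *_ = np.linalg.lstsq(rough_part(atm).T, rough_part(spectrum), rcond=None)
    return spectrum - coeffs @ atm, coeffs


x = np.linspace(1500.0, 1800.0, 601)
sample = np.exp(-0.5 * ((x - 1650.0) / 25.0) ** 2)
centers = np.arange(1520.0, 1800.0, 15.0)


def vapor(amplitudes):
    return sum(a * np.exp(-0.5 * ((x - c) / 0.8) ** 2)
               for a, c in zip(amplitudes, centers))


# Assumed drift: relative line intensities differ between start and end.
atm_start = vapor(np.linspace(1.0, 0.4, centers.size))
atm_end = vapor(np.linspace(0.4, 1.0, centers.size))
measured = sample + 0.5 * atm_start + 0.5 * atm_end  # mid-experiment spectrum

single, _ = correct(measured, atm_start)
double, coeffs = correct(measured, np.vstack([atm_start, atm_end]))
err_single = np.max(np.abs(single - sample))
err_double = np.max(np.abs(double - sample))
print("max error, one atmospheric spectrum :", err_single)
print("max error, two atmospheric spectra  :", err_double)
print("coefficients (start, end):", coeffs)
```

In this synthetic case a single reference leaves residual lines of both signs, while the pair of references removes them almost completely, in line with the comparison shown in Fig. 11.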
A limitation resulting from the optimization method used is the relatively small number of atmospheric spectra that can be subtracted simultaneously. We estimate that 5 spectra is a reasonable upper limit, although much depends on the measurement conditions, and this number may be higher or lower. In the vast majority of cases, two spectra covering the variability of conditions throughout the experiment (according to our proposed measurement scheme) are sufficient for satisfactory correction. The ideal solution, however, would be the possibility of building one's own database of atmospheric spectra characteristic of a given laboratory and spectrometer, which would eliminate the need to measure atmospheric spectra during each experiment. Such a database would be built from many atmospheric spectra measured over time (a month or a year), considering factors such as air humidity, measurement parameters, and possibly others. Here, we see potential for method development, which could be based on more advanced optimization algorithms or machine learning methods that would handle fitting several hundred or several thousand spectra more effectively. The current streamlined version of the algorithm should be much easier to adapt for this type of modification. In our opinion, building such a database would only make sense at a “local” level, i.e., within one measurement station, as the variability of conditions between laboratories and the variability in instrument quality are too large and would require even more work.
As indicated in Section 3.2.3 (“Why it works?”), the primary reason why the algorithm works is the difference in the smoothability of the corrected and atmospheric spectra. One of the steps of the algorithm is smoothing using the Savitzky–Golay method, which works very well for FTIR spectra of solutions, typical organic compounds, and biomolecules. However, there are phenomena (e.g., Fano resonance) or other types of spectroscopy in which signals are characterized by bands with asymmetric shapes and sharp peaks. In such situations, smoothing with the method implemented in the algorithm may be ineffective and lead to the formation of artifacts, as with suboptimal values of SG parameters (see Fig. 3). However, it would probably be possible to use other less destructive signal smoothing methods, such as denoising using wavelets. The current form of the algorithm extracted as a class should facilitate this type of modification.
We suspect that VaporFit, or the core algorithm itself, could become an invaluable tool in the data preparation stage for machine learning and related algorithms. It effectively removes unnecessary variability in spectra, resulting in cleaner input data that is better suited for advanced analytical techniques, including machine learning methods. Models based on, among other things, the analysis of environmental data collected by FTIR or the construction of such databases would thus become more reliable, as the variance element associated with this type of spectral contamination would be entirely absent during model training. This is crucial because the spectrum of water vapor or other gaseous interferences is very similar across different samples and could be misinterpreted by an ML algorithm as a characteristic feature for a given class of compounds, leading to errors in identification. Spectra cleaned with VaporFit are particularly useful for algorithms that identify functional groups directly from FTIR data, significantly enhancing the efficiency of spectral interpretation. Data prepared in this way are particularly applicable in environmental analyses, e.g., for identifying pollutants in complex samples.10,11 Similarly to the previously described method proposed by Goormaghtigh,4 ML models would significantly benefit from the pre-processing step utilizing VaporFit, among other tools.
This journal is © the Owner Societies 2025