Open Access Article
Amy L. Farmer (a), Kelly Brown (a), Sophie E. T. Kendall-Price (a), Partha Malakar (b), Gregory M. Greetham (b) and Neil T. Hunt* (a)
(a) Department of Chemistry and York Biomedical Research Institute, University of York, York, UK. E-mail: neil.hunt@york.ac.uk
(b) STFC Central Laser Facility, Research Complex at Harwell, Harwell Science and Innovation Campus, Didcot, UK
First published on 8th January 2026
The dynamic three-dimensional structures of proteins dictate their function, but accessing structures in solution at physiological temperatures is challenging. Ultrafast 2D-IR spectroscopy of the protein amide I band produces a spectral fingerprint that derives directly from the 3D backbone structure within minutes, using microlitres of label-free samples, in aqueous (H2O) solution and with picosecond time resolution. However, transforming 2D-IR fingerprints into quantitative, solution-phase protein structures relies on decoding the fundamental link between the atomistic structure and the 2D spectrum. We demonstrate a top-down approach to solution-phase protein structure determination that combines 2D-IR spectral libraries with machine learning (ML). Using a dataset consisting of 6732 spectra of 35 proteins in H2O that span a range of structures, Support-Vector Machine (SVM) models classified unknown protein samples according to structural content and measured quantities of α-helix and β-sheet with an RMS error of ≤7%. The potential for hybrid 2D-IR-ML tools to predict the number and length of helices in a protein, and identify the presence of parallel and antiparallel β-sheets from the 2D-IR fingerprint is also demonstrated. These results lay the groundwork for rapid, quantitative analysis of dynamic protein structures under physiologically relevant conditions.
Since its first demonstration,7 ultrafast two-dimensional IR (2D-IR) spectroscopy has been used to probe protein structure. The protein amide I vibrational mode, essentially the C=O stretch of the peptide unit, is extremely sensitive to the three-dimensional conformation of the folded peptide chain. Inter-residue vibrational coupling, hydrogen bonding, solvation, structural dynamics, and the local electrostatic environment all contribute to the form of the amide I absorption band.7–17 By spreading the amide I signature over two frequency axes, 2D-IR measures a detailed fingerprint that is directly linked to the unique structure of the protein, while picosecond time resolution offers dynamic insight. Furthermore, 2D-IR suppresses the background H2O signal that inhibits linear absorption spectroscopy, enabling measurements under more physiologically relevant conditions.13 Changes in protein structure and dynamics upon drug or ligand binding and melting have been measured with 2D-IR,18–24 as have proteins in complex biological matrices such as blood serum, including clinical samples.18,19,25 Progress in 2D-IR experimental methods has reduced spectral acquisition times to just a few minutes, while data pre-processing has enabled accurate standardisation of spectra from different samples.26–28
This combination of bond-level resolution and sensitivity to small changes in label-free molecular structure and dynamics means that 2D-IR offers a promising route to accurate protein structure determination in solution. But decoding the structure–spectrum relationship that links the 3D arrangement of atoms to the spectral fingerprint remains a barrier to quantitative interpretation of the spectra. Bottom-up experimental methods using bespoke peptides, small model proteins or isotopic labelling have allowed specific regions of the amide I band to be identified with certain types of structural elements.7–9,14,17,26 Transition dipole strength analysis has also been used to examine protein secondary structure,15,16,29,30 including to determine the maximum α-helical length in a protein.30 Singular value decomposition (SVD) was used to quantify protein structure from 2D-IR spectra of proteins in D2O in 2011,31 achieving accuracies approaching those of CD, but was not pursued further. Simulations based on molecular dynamics and frequency maps are also instructive,32–37 but challenges arise from computational cost and a lack of experimental data for validation.38
Here, we describe an alternative, top-down approach that combines the spectral information density of 2D-IR with the pattern-recognition strengths of machine learning (ML). We have created the first label-free 2D-IR protein spectral library in H2O-based buffer solutions, containing 6732 spectra of 35 different proteins. The proteins were selected to encompass a range of secondary structure configurations, and each has a high-resolution structural analysis available via the protein data bank (PDB).39 Our goal, inspired by ML applications to simulated spectral datasets40–42 and multidimensional NMR,43 was to use ML firstly to classify protein spectra according to structural type, and then to quantify secondary structure content. We show that both aims are achievable, and that 2D-IR-ML has the potential to go further and predict the number of helices present in a protein structure and identify the presence of parallel and antiparallel β-sheets. Based on this, we consider the scope for 2D-IR to play a meaningful future role in determining dynamic protein structures in solution.
Fig. 1 Schematic diagram outlining how the protein library (a) was formatted into an input data frame for ML analysis (b).
Each spectrum was labelled with either a class assignment, the proportions of α-helix and β-sheet, the number of α-helices, or the proportions of parallel and antiparallel β-sheet (Table S1) depending on the aim of the ML task (see below). Protein structural properties were determined through the Dictionary of Protein Secondary Structure (DSSP) algorithm.45
To avoid leaking testing data into the training process, we used a nested cross-validation framework (nested-CV, Fig. S1). In each iteration of the outer-loop CV, the data are split into a training set and a test set; an inner-loop CV tunes the model's parameters using only the training set, and the tuned model's final performance is then assessed on the outer-loop test set. This process is repeated for all folds of the outer-loop CV. As the library contains multiple spectra of each of the 35 proteins, a group CV was used for both loops, in which each group contained all of the spectra of one protein (35 groups in total).
The inner-loop consisted of a pipeline within a 5-fold group CV containing two transformers and a final predictor. The first transformer was a standard scaler that standardised each feature to zero mean and unit variance. The second was a feature selection module in which the most relevant features for accurate learning were identified. This was used to minimise the risk of overfitting and improve model performance. The feature selection method and final predictor were varied according to the task and these are outlined below. All of the models were implemented through a custom Python script using the scikit-learn package.46
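The inner-loop pipeline described above can be sketched in scikit-learn as follows. This is a minimal illustration and not the authors' script: the synthetic data, the hyperparameter grid, and the choice of `SelectKBest` with `f_classif` as the feature selection step are all assumptions made for the example.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV, GroupKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for a spectral library (assumption): 10 "proteins",
# 12 spectra each, 200 flattened spectral pixels per spectrum.
rng = np.random.default_rng(0)
n_proteins, n_spectra, n_pixels = 10, 12, 200
groups = np.repeat(np.arange(n_proteins), n_spectra)  # protein ID of each spectrum
y = groups % 2                                        # toy two-class label per protein
X = rng.normal(size=(groups.size, n_pixels))
X[:, :5] += 2.0 * y[:, None]                          # make a few pixels informative

# Two transformers followed by a final predictor, as in the text.
pipeline = Pipeline([
    ("scale", StandardScaler()),                    # zero mean, unit variance
    ("select", SelectKBest(score_func=f_classif)),  # ANOVA-F feature selection
    ("predict", SVC(kernel="rbf")),
])

# Hypothetical tuning grid; the values actually searched are not given in the text.
param_grid = {
    "select__k": [5, 20, 50],
    "predict__C": [0.1, 1.0, 10.0],
    "predict__gamma": ["scale", 0.01],
}

# 5-fold group CV: all spectra of one protein stay in the same fold.
search = GridSearchCV(pipeline, param_grid, cv=GroupKFold(n_splits=5), scoring="accuracy")
search.fit(X, y, groups=groups)
best_k = search.best_params_["select__k"]
```

Because the scaler and the feature selector sit inside the pipeline, they are refitted on each inner training fold, so no information from the held-out fold reaches the tuning step.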
Four different final predictors were tested: Support Vector Machine (SVM) with a radial basis function (RBF) kernel, k-Nearest Neighbours (kNN), Decision Tree (DT), and Random Forest (RF). The training:testing split of the outer-loop was varied, where the performance of three randomly generated 80:20 (training:testing) splits and a Leave-One-Out Cross Validation (LOO-CV) were examined. For the LOO-CV, the total protein library was separated into the 35 spectral groups and for each outer-loop iteration a different spectral group was used as the test set. As ANOVA-F feature selection with the SVM predictor was found to deliver the best performance for classification (see Results), these were taken forward to subsequent tasks.
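The two outer-loop schemes can be illustrated with scikit-learn's group-aware splitters. The sketch below assumes that `GroupShuffleSplit` (for the 80:20 splits) and `LeaveOneGroupOut` (for the LOO-CV) are reasonable stand-ins for the splitting described; the data, the small tuning grid, and the helper `nested_scores` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import (GridSearchCV, GroupKFold,
                                     GroupShuffleSplit, LeaveOneGroupOut)
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data (assumption): 20 "proteins" with 6 spectra each.
rng = np.random.default_rng(1)
groups = np.repeat(np.arange(20), 6)
y = groups % 2
X = rng.normal(size=(groups.size, 30))
X[:, :5] += 3.0 * y[:, None]

def nested_scores(outer_cv):
    """Tune on each outer training set with an inner group CV, then score on the outer test set."""
    scores = []
    for train_idx, test_idx in outer_cv.split(X, y, groups):
        model = GridSearchCV(
            make_pipeline(StandardScaler(), SVC(kernel="rbf")),
            {"svc__C": [0.1, 1.0, 10.0]},   # hypothetical grid
            cv=GroupKFold(n_splits=3),      # inner loop sees training proteins only
        )
        model.fit(X[train_idx], y[train_idx], groups=groups[train_idx])
        scores.append(model.score(X[test_idx], y[test_idx]))
    return np.array(scores)

# Three random 80:20 splits by protein, and leave-one-protein-out.
scores_80_20 = nested_scores(GroupShuffleSplit(n_splits=3, test_size=0.2, random_state=0))
scores_loo = nested_scores(LeaveOneGroupOut())
```

With 35 proteins, the leave-one-group-out loop would run 35 times, each time holding out every spectrum of a single protein.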
It is well established that an α-helix structure gives rise to an amide I response that features two contributions; an intense A-symmetry mode centred at ∼1650 cm−1 and a much weaker, rarely-resolved, E-symmetry mode near ∼1640 cm−1.9,29,47 Similarly, the inter-peptide coupling in β-sheets leads to the presence of an intense v⊥ mode near 1630–1640 cm−1 and a weaker v‖ mode near 1670–1680 cm−1.14,15,17,47,48 As a result, the 2D-IR response of a β-sheet manifests as a ‘z-shaped’ pattern where the v⊥ and v‖ diagonal peaks are linked by off-diagonal features arising from the coupling between them. The v⊥ mode is known to shift to lower frequencies and increase in amplitude with increasing length of the β-sheet.14,17 Based on this information, it is possible to account qualitatively for some of the results in Fig. 2. The v = 0 → 1 transition of the amide I band of myoglobin has one main feature centred at 1660 cm−1 (Fig. 2(a)) that derives from its almost entirely α-helical structure, accounting for the expected upshift of ∼10 cm−1 in H2O-based solvents relative to the more normally used D2O.49 In the case of catalase, an elongation of the amide I band along the spectrum diagonal and more apparent off-diagonal features (e.g. pump, probe = 1705, 1655 cm−1, orange arrow in Fig. 2(b)) are observed due to the greater β-sheet content.50 Both IgG and β-lactoglobulin have structures that are dominated by β-sheets, and the ‘z-shaped’ pattern with significant v⊥ mode amplitude (pump = probe = 1649 and 1638 cm−1, respectively, blue arrows, Fig. 2(c and d)) is apparent in both cases. However, despite both proteins having similar proportions of α-helix and β-sheet (Fig. 2(c, d) and Table S1),51,52 the 2D-IR responses differ in the intensities of the 1660 cm−1 diagonal contributions (purple arrows) and the ratio of the amplitude of this feature to that of the v⊥ peak (blue arrows). 
This situation exemplifies both the sensitivity of the 2D-IR amide I response and the challenge faced by structure quantification methods as a given secondary structure content does not necessarily produce one distinct spectral pattern. This is due to the influence of other factors such as chain length, the strength of the vibrational coupling, and the global environment of the structural feature including tertiary structures.15,53 We therefore explore whether ML-based approaches offer a route towards unravelling this nuanced spectral fingerprint.
The proteins in the library were assigned to one of three classes, ‘α-enriched’, ‘β-enriched’ and ‘mixed structure’ (Fig. 3(a)) according to the definitions:
(1) α-Enriched: α − β ≥ 0.2,
(2) β-Enriched: α − β ≤ −0.2,
(3) Mixed structure: all other proteins.
where α and β represent the fraction of α-helix and β-sheet respectively. This resulted in eight proteins (1518 spectra) in the ‘α-enriched’ class, 16 (3168) in the ‘mixed’ class, and 11 (2046) in the ‘β-enriched’ class.
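As an illustration, the class assignment can be written as a short function. The sketch reads the β-enriched criterion as the mirror image of the α-enriched one, i.e. α − β ≤ −0.2, which is an assumption made here so that the three classes do not overlap. The function name and the example fractions are hypothetical and do not correspond to proteins in the library.

```python
def structural_class(alpha: float, beta: float, threshold: float = 0.2) -> str:
    """Assign a structural class from fractional alpha-helix and beta-sheet content."""
    diff = alpha - beta
    if diff >= threshold:
        return "alpha-enriched"
    if diff <= -threshold:      # assumed mirror-image criterion for beta-enriched
        return "beta-enriched"
    return "mixed structure"

# Hypothetical (alpha, beta) fractions, for illustration only.
examples = {"mostly helical": (0.70, 0.05), "mostly sheet": (0.10, 0.50), "balanced": (0.30, 0.25)}
classes = {name: structural_class(a, b) for name, (a, b) in examples.items()}
```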
The aim was to train an ML model to assign the correct structural class to an unseen (test) protein or group of proteins based on their 2D-IR spectra. A variety of feature selection methods and final predictor combinations were assessed. The best performance across three randomly generated 80:20 (training:testing) splits was delivered by an SVC model combined with the ANOVA-F (AF) feature selection. This combination performed well according to a number of metrics (Fig. 3(b) and Table S2), not least showing strong predictive power with a testing accuracy of 89% and a κ value of 0.80. The F1 scores across the three classes were also generally well balanced, with the α-enriched class giving the smallest score (0.80 vs. 0.90 for the other classes, Table S2). A breakdown of the individual performances for the three test sets using the AF-SVC model is given in the SI (Fig. S3 and Table S3).
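For readers less familiar with these metrics, the sketch below computes accuracy, Cohen's κ and per-class F1 scores for a small set of made-up labels using scikit-learn. The labels are purely illustrative and are not taken from the study; κ corrects the observed agreement for the agreement expected by chance from the class frequencies.

```python
import numpy as np
from sklearn.metrics import accuracy_score, cohen_kappa_score, f1_score

# Made-up labels (0 = alpha-enriched, 1 = mixed, 2 = beta-enriched), for illustration only.
y_true = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 2])
y_pred = np.array([0, 0, 1, 1, 1, 1, 2, 2, 2, 1])

accuracy = accuracy_score(y_true, y_pred)
kappa = cohen_kappa_score(y_true, y_pred)
f1_per_class = f1_score(y_true, y_pred, average=None)

# Kappa by hand: (observed agreement - chance agreement) / (1 - chance agreement).
p_true = np.bincount(y_true, minlength=3) / y_true.size
p_pred = np.bincount(y_pred, minlength=3) / y_pred.size
p_chance = float(np.sum(p_true * p_pred))
kappa_by_hand = (accuracy - p_chance) / (1.0 - p_chance)
```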
To analyse the performance of the AF-SVC model further, an outer-loop Leave-One-Out analysis (LOO) was implemented in which the spectra of each protein in turn were used as a test set and the remaining library as the training set. This is useful in giving a better estimate of model performance, especially when the data set is relatively small.54 The results (Fig. 3(c)) show that the average testing accuracy across the 35 proteins was 79.3%, with the model achieving ≥95% accuracy for 23 proteins.
Fig. 4 shows the 50 features with the highest F-values from an ANOVA-F test on the total protein library (coloured squares) overlaid with the 2D-IR spectrum of one example from each protein class (myoglobin: α-enriched; catalase: mixed and β-lactoglobulin: β-enriched). It should be noted that, in the reported implementation of the AF-SVC model, an ANOVA-F test was performed on the given training set, not the total library. As this is a supervised feature selection method, this means that the features with the highest F-values will vary for each unique training set. However, inspection of the results of an ANOVA-F test across the training sets used and the total protein library showed that, while some of the specific selected features varied, the features lay consistently in this same region of the spectrum.
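A ranking of this kind can be sketched by running an ANOVA-F test on the flattened spectra and mapping the highest-scoring features back to pump and probe pixel indices. The example below uses synthetic spectra and assumes that each spectrum was flattened row by row from a (pump, probe) grid; the grid size and the location of the class-dependent region are hypothetical.

```python
import numpy as np
from sklearn.feature_selection import f_classif

# Synthetic 2D "spectra" (assumption): 3 classes, 40 spectra each, on a 20 x 30 pixel grid.
rng = np.random.default_rng(2)
n_pump, n_probe = 20, 30
labels = np.repeat([0, 1, 2], 40)
spectra = rng.normal(size=(labels.size, n_pump, n_probe))
spectra[:, 5:8, 10:14] += 1.5 * labels[:, None, None]  # amplitude changes with class here

# Flatten each spectrum into one feature vector and score every pixel.
X = spectra.reshape(labels.size, -1)
f_values, p_values = f_classif(X, labels)

# Take the 10 highest F-values and recover their (pump, probe) pixel positions.
top = np.argsort(f_values)[::-1][:10]
pump_idx, probe_idx = np.unravel_index(top, (n_pump, n_probe))
```

Plotting `pump_idx` against `probe_idx` over a spectrum would give an overlay analogous to the coloured squares described for Fig. 4.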
The selected features track the change in position and shape of the nodal line between the negative v = 0 → 1 and positive v = 1 → 2 peaks across the three classes. The appearance of a high frequency cross peak and the edge of the v⊥ β-sheet mode are also used as spectroscopic markers. Across the different protein classes, the selected features exhibit a change in amplitude from positive (myoglobin, α-enriched), through to near zero (catalase, mixed), and to negative (β-lactoglobulin, β-enriched) (Fig. 5). This change in the position and shape of the nodal plane can be explained by the complex overlap of positive and negative features that occurs when several structurally unique amide I contributions are present. This is especially true when there are cross peaks that lie close enough in frequency and have a large enough bandwidth to interfere with the v = 0 → 1 and v = 1 → 2 bands as this can lead to further convolution of the response. From this, we conclude that the model is learning to distinguish between the classes successfully based on the spectral amplitude in the regions known to be associated with the presence of β-sheet structures.17
In this case, the input data frame consisted of the diagonal slices through the 2D-IR spectra (35 data pixels) annotated in the same way as for the full 2D-IR model described above (Fig. S4). PCA and ANOVA-F feature selection were applied alongside SVC, kNN, DT and RF predictors. PCA-SVC was found to perform best, reaching a promising 85.8% accuracy though with a lower κ value of 0.66 across the three randomly generated 80:20 splits (Fig. S5). Examining the results of the LOO analysis (Fig. 3(c)) shows that, although a number of proteins were classified 100% correctly, there is a greater spread of misclassifications than when using the full 2D-IR plot.
To account for the negative correlation between the proportions of α-helix and β-sheet structures in the protein library (Fig. 3(a) and Table S1), a regression chain was used.56 Here, the training set is passed first into an ML model which performs a prediction of β-sheet content, from which the β-sheet predictions and training set are then run through a second ML model which generates a prediction of α-helix content. This specific order was chosen as the selected features for ML analysis cover more regions known to be associated with the presence of β-sheet structures, though we found little difference in performance when the chain order was reversed.
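A regression chain of this kind is available in scikit-learn as `RegressorChain`, and the sketch below shows the β-then-α ordering on synthetic data. Whether the authors used this class is an assumption; note also that, with its default settings, `RegressorChain` trains the second model on the true values of the first target rather than on predictions (setting its `cv` parameter switches to cross-validated predictions). The data and SVR settings are hypothetical.

```python
import numpy as np
from sklearn.multioutput import RegressorChain
from sklearn.svm import SVR

# Synthetic features and negatively correlated targets (assumption).
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 10))
beta = 0.3 + 0.1 * np.tanh(X[:, 0])
alpha = 0.7 - beta + 0.05 * np.tanh(X[:, 1])
Y = np.column_stack([alpha, beta])          # column 0 = alpha-helix, column 1 = beta-sheet

# order=[1, 0]: predict beta-sheet first, then feed that into the alpha-helix model.
chain = RegressorChain(SVR(kernel="rbf", C=1.0, epsilon=0.01), order=[1, 0])
chain.fit(X[:150], Y[:150])

Y_pred = chain.predict(X[150:])             # columns come back in the original order
rmse = np.sqrt(np.mean((Y_pred - Y[150:]) ** 2, axis=0))
```

Reversing the chain only requires `order=[0, 1]`, which makes it straightforward to compare the two orderings.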
As the ANOVA-F-feature selection test only measures the statistical significance of the features of the spectrum against one target variable, the implementation had to be adapted for use with a regression chain. Two independent F-tests were performed, one using the proportions of α-helix as a target variable, and the other using the proportions of β-sheet. The highest scoring features from both tests were then combined into one list and the duplicates removed to give a final list of features that were used as input for both ML steps. This also meant that the optimal number of features had to be identified manually outside of the inner-loop CV. For this, the number of features selected from the separate F-tests for combination into a final list were varied, and each final list passed into a pipeline without a feature selection transformer for hyperparameter tuning. The final list that gave the smallest training RMSE was selected to produce a tuned model for assessment against the test set of the outer-loop. 40 features from both independent F-tests were consistently selected, giving 47 features in total. These are represented in Fig. S6.
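The combination of two independent F-tests into a single feature list can be sketched as follows. The example assumes scikit-learn's `f_regression` as the univariate F-test for a continuous target, which may differ from the authors' implementation; the helper name `combined_top_features` and the synthetic data are hypothetical.

```python
import numpy as np
from sklearn.feature_selection import f_regression

def combined_top_features(X, y_alpha, y_beta, k):
    """Union of the k highest-scoring features from two independent F-tests."""
    f_alpha, _ = f_regression(X, y_alpha)
    f_beta, _ = f_regression(X, y_beta)
    top_alpha = np.argsort(f_alpha)[::-1][:k]
    top_beta = np.argsort(f_beta)[::-1][:k]
    return np.union1d(top_alpha, top_beta)   # sorted indices, duplicates removed

# Synthetic data (assumption): two negatively correlated targets sharing informative features.
rng = np.random.default_rng(5)
X = rng.normal(size=(120, 60))
y_beta = X[:, :5].sum(axis=1) + 0.1 * rng.normal(size=120)
y_alpha = -y_beta + X[:, 5] + 0.1 * rng.normal(size=120)

selected = combined_top_features(X, y_alpha, y_beta, k=10)
X_selected = X[:, selected]                  # input for both steps of the chain
```

When the two targets are strongly anti-correlated, most of the top features coincide, which is consistent with 40 features from each test giving only 47 unique features in the text.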
First, the performance of the AF-SVR approach was assessed using an outer-loop LOO analysis where, after training, the model was tested against a given protein spectral group. The results of this are shown in Fig. 6 where each black square represents the average predicted secondary structure proportion over all of the spectra in the protein spectral group for each protein ((a) and (b) show α-helix and β-sheet results respectively). Error bars show the standard deviation of the predicted values. A perfect prediction is shown in green. An RMSE in prediction of ≤7% for both α-helix and β-sheet content shows that the AF-SVR model is predicting the secondary structure distribution well, while the linearity also displays good agreement across the range of structures included in the library. When RNase A and DT diaphorase, two of the six proteins for which the model produced some of the lowest testing accuracies in the structural classification task, were removed, these RMSE values reduced to 6.2% and 5.6% for α-helix and β-sheet, respectively.
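The per-protein averaging and the RMSE can be sketched with NumPy as below. The sketch assumes that the RMSE is taken over the per-protein mean predictions; the text does not state whether it was computed this way or over individual spectra. The function name and the numbers are illustrative only.

```python
import numpy as np

def per_protein_summary(groups, y_true, y_pred):
    """Mean and standard deviation of predictions per protein, plus RMSE of the means."""
    ids = np.unique(groups)
    mean_pred = np.array([y_pred[groups == g].mean() for g in ids])
    sd_pred = np.array([y_pred[groups == g].std() for g in ids])    # error bars
    truth = np.array([y_true[groups == g][0] for g in ids])         # one label per protein
    rmse = float(np.sqrt(np.mean((mean_pred - truth) ** 2)))
    return ids, mean_pred, sd_pred, rmse

# Made-up example: two proteins, two spectra each, fractional alpha-helix content.
groups = np.array([0, 0, 1, 1])
y_true = np.array([0.5, 0.5, 0.2, 0.2])
y_pred = np.array([0.4, 0.6, 0.3, 0.3])
ids, mean_pred, sd_pred, rmse = per_protein_summary(groups, y_true, y_pred)
```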
Despite this demonstration of good predictive power for the SVR approach, there is still error between the predicted and actual secondary structure proportions. It is clearly plausible that a larger protein library that covers a more diverse structural space would lead to improved predictions. However, it must also be considered that, since the model is examining solution-phase spectra labelled with DSSP calculated crystallographic secondary structure proportions, some error might arise from any discrepancies between the crystal structure and the dynamic structure that exists in solution. Alternative sources of structural information may therefore also need to be considered for the training process.
The length of an α-helix is known to alter a protein's IR response. Longer helices typically shift the A-mode to lower frequencies, whilst the degeneracy of the E-mode is lost in shorter helices.47,57 Consequently, helices shorter than around six residues can generate unusual responses, with a number of bands distributed throughout the amide I region.58 Therefore, subtle differences in the 2D-IR spectra of proteins containing long or short α-helices would be expected.
Taking this into consideration, we attempted to further classify the proteins in the protein library using ML according to the length of the α-helices they contain. The proteins were separated into two classes; ‘short helices’ if they contained no α-helices longer than 15 residues, and ‘long helices’ if they contained α-helices longer than 15 residues. The cut-off value of 15 residues was selected to give an even distribution of proteins between the two classes (16 ‘short helices’ and 19 ‘long helices’). This classification also ensured the separation of proteins containing extremely long helices (e.g. glycogen phosphorylase b contains a 30-residue long α-helix, and human serum albumin a 33-residue long α-helix).
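The labelling rule can be expressed compactly. In the sketch below, the function name is hypothetical, and the helix lengths other than the 30- and 33-residue examples quoted above are invented for illustration.

```python
def helix_length_class(helix_lengths, cutoff=15):
    """Label a protein by whether any of its alpha-helices exceeds the cutoff length (in residues)."""
    if any(length > cutoff for length in helix_lengths):
        return "long helices"
    return "short helices"

# The 30- and 33-residue helices are quoted in the text; all other lengths are made up.
label_with_30 = helix_length_class([9, 14, 30])
label_with_33 = helix_length_class([12, 33])
label_short = helix_length_class([4, 12, 15])
```

A helix of exactly 15 residues falls in the ‘short helices’ class, since only helices longer than the cut-off count as long.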
An ANOVA-F test was again used as the feature selection method, and assessed in combination with SVC, kNN and DT across three randomly generated 80:20 (training:testing) splits. All three of these models produced good testing accuracies but poor κ values (72% and 0.413 for SVC, 70% and 0.370 for kNN, and 74% and 0.489 for DT), and so an ensemble approach, Adaptive Boosting (AdaBoost), was considered. AdaBoost is a common boosting algorithm that assembles a collection of weak learners, usually single-level decision trees, into one larger classifier.59,60 At each iteration, the training examples are re-weighted such that each subsequent weak learner focuses on correcting the misclassified samples from the previous weak learner. These ensemble methods can better handle outliers and limit the chance of overfitting. On average across the three test sets, an AF-AdaBoost pipeline delivered an 83% testing accuracy, with a more confident κ value of 0.623. These predictions were made using the same region of the amide I response as the previous classification and regression models (see Fig. S9), which is a region that would be altered by the described moving A-mode and splitting E-mode with helix length.
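An AF-AdaBoost pipeline of the kind described can be sketched with scikit-learn's `AdaBoostClassifier`, whose default weak learner is a single-level decision tree. The data, the number of selected features, the number of estimators, and the fixed hold-out of four proteins are all assumptions made for the illustration.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.metrics import accuracy_score, cohen_kappa_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): 20 "proteins", 10 spectra each, two helix-length classes.
rng = np.random.default_rng(4)
groups = np.repeat(np.arange(20), 10)
y = groups % 2
X = rng.normal(size=(groups.size, 100))
X[:, 40:45] += 2.0 * y[:, None]

# Hold out all spectra of four proteins (two per class) as an illustrative test set.
test_mask = groups < 4

pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(f_classif, k=10)),                       # ANOVA-F selection
    ("boost", AdaBoostClassifier(n_estimators=50, random_state=0)), # boosted decision stumps
])
pipeline.fit(X[~test_mask], y[~test_mask])

y_pred = pipeline.predict(X[test_mask])
accuracy = accuracy_score(y[test_mask], y_pred)
kappa = cohen_kappa_score(y[test_mask], y_pred)
```

Reporting κ alongside accuracy, as in the text, guards against a classifier that scores well simply by favouring the larger class.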
Predicting the number of α-helices in a protein was also attempted. Using an independent AF-SVR model, an RMSE across 6 randomly selected proteins of 8.6% was obtained. This reduced to 4.8% when a regression chain was used that predicted α-helix proportion first, which then fed into a prediction of the number of α-helices. The prediction was therefore improved by taking into account the weak positive correlation between the total proportion and number of α-helices in a protein structure. It would be important to also consider the total number of residues in a protein in this context (a small protein could have a large proportion of α-helix but not necessarily a large number of helices) but it was not possible to predict this well using the protein library 2D-IR spectra alone.
The β-sheet properties of a protein were also expanded upon through a prediction of the proportions of parallel and antiparallel β-sheet. Again, using independent AF-SVR models, an RMSE across 6 randomly selected proteins of 4.7% was obtained for antiparallel sheets. A much larger error of 8.8% was obtained for parallel sheets which, when considering that the proportions of parallel sheet in the library only vary between 0 and 13.1%, is comparatively very poor. By examining the features of the 2D-IR spectrum selected during the F-test for parallel sheets (Fig. S10(a)), the reason for this performance becomes clear. It is the high probe frequency region just outside of the amide I response that is selected, where any model trained on these features would likely overfit to the noise and incorrectly use that as a marker of parallel sheets. This is possibly a consequence of the low variance in the proportions of parallel sheet across the library (0 to 13.1%), where 14 proteins (almost half of the dataset) have 0% parallel sheet. In contrast, the F-test for antiparallel sheets selects the familiar nodal region (see Fig. S10(b)). When the SVR model for parallel sheets is made to also consider these antiparallel sheet selected features, the error reduces to 4.2%. This emphasises the necessity of domain knowledge for ML applications, especially when operating at the limit of what our protein library can achieve.
Overall, these predictions of other structural properties further confirm the potential of 2D-IR-ML methods for protein structural analysis. Whilst it is clear that more work, and more spectral data on a wide range of proteins, is necessary to improve predictive capabilities, this provides a proof-of-concept for the ability of ML analysis to retrieve the dense structural information from 2D-IR spectroscopy.
It is constructive to compare the results of this study to the state of the art. An error in prediction of ≤7% for both α-helix and β-sheet obtained here compares favourably with the only other report of using experimental 2D-IR spectra for ML-based quantitative secondary structure prediction, which used singular value decomposition (SVD) on a library of 16 proteins measured in D2O.31 This study assumed that a total protein 2D-IR spectrum can be made by the linear addition of contributions from pure α-helix, β-sheet and ‘unassigned structures’. While the approach was successful, producing RMSE values of 12.5 and 9.2% for α-helix and β-sheet, respectively,31 the method reported here produces more accurate results. Given that the new data is obtained in the more physiologically relevant H2O, where data collection is more challenging, this indicates that the ML approach is benefitting from the greater amount of experimental data that we are able to include here.
The performance metrics of our model also compare favourably to those from Circular Dichroism (CD) spectra. For an example SVM model, RMS errors of 5.7 and 6.9% for α-helix and β-sheet, respectively, were obtained.3 In the CD study, the model was trained on a larger database of 72 proteins (SP175 reference set), of which there is a match in 21 proteins with this library, but only in six PDB structures.61 So, whilst the performance of the model reported here is numerically poorer than the example CD trained model, it emphasises the prospect of ML protein analysis in tandem with a 2D-IR spectral library, as performance is still good even when trained on a dataset with comparatively lower structural diversity. As such, this provides a vindication for the further development of 2D-IR ML models but also reinforces the importance of library size. The demonstrated ability of 2D-IR to go beyond α-helix proportion to identify size and number of helix units and predict proportions of parallel and antiparallel β-sheets also shows considerable potential for future development beyond basic secondary structure elements.
It is particularly interesting that the outcome of our ANOVA-F feature selection process led to the ML models exploiting the off-diagonal region of the 2D-IR spectrum, rather than the more traditional spectrum diagonal. Indeed, a direct comparison here using the diagonal slice has shown that, while the latter contains sufficient information to make a good prediction, using the full spectrum leads to a more interpretable, and for most metrics, better outcome. This conclusion has also been reached in previous studies employing simulated 2D data and machine learning models, where it was found that the off-diagonal region of the spectrum was important for improved model performance.41,42 Analogous observations were also made recently when applying 2D-IR-ML to examine the spectra of biofluids.25 As well as validating the use of ML, this provides an important point of distinction with methods such as CD, which use a linear, one-dimensional spectrum to achieve the same quantification. The ability to unravel complex spectral contributions via a second spectral dimension and probe the inter-peptide interactions occurring within the structure provide a basis for a more detailed quantitative analysis of 2D-IR spectra than is currently possible.
As with all ML-based approaches, it will be important to keep adding to the protein spectral library in order to enhance its predictive ability. This will bring challenges in terms of transferring data between instruments and protein conditions, though with sufficient data and careful standardisation, these hurdles should be surmountable. There is also considerable scope to combine simulated and experimental data to expand models, improve simulations, and allow access to prediction of further structural properties, leading towards a detailed understanding of the structure-spectrum relationship.
Whilst the work here represents another step towards potential 2D-IR structural analysis tools, there are practical factors that must be considered. First is the sensitivity of 2D-IR, which can reduce the concentration range over which the technique can be applied. Using current technology, the detection limit for a protein in H2O is ∼5 mg mL−1, or around 70 µM for a protein like Human Serum Albumin (HSA). This is comparable to, and in many cases lower than, the concentrations used for NMR or to prepare samples for crystallography and cryo-EM. The value of 2D-IR is enhanced by the fact that the proteins are both unlabelled and fully solvated, which is not the case for many other methods. At higher concentrations, protein aggregation can be an issue, but we find that for the test protein BSA at three different concentrations, there is no indication of aggregation (Fig. S11). More broadly, we also note that when aggregation occurs it is clearly identifiable by a characteristic signature near 1620 cm−1.62,63 Separate to aggregation, the ability of 2D-IR to detect the formation of protein multimers in solution is an important question that will need to be addressed. Overall, while there will inevitably be proteins that cannot be studied in solution due to a lack of solubility, this issue is likely to be no more prohibitive than, for example, proteins that do not crystallise well. A multi-method approach will always be vital to build a picture of a protein's structure and dynamics.
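The conversion between the mass and molar detection limits quoted above can be checked with a short calculation. The molar mass of about 66.5 kDa used for HSA is an assumption introduced here and is not stated in the text; with it, 5 mg mL−1 corresponds to roughly 75 µM, consistent with the quoted figure of around 70 µM.

```python
def mg_per_ml_to_micromolar(conc_mg_per_ml: float, molar_mass_g_per_mol: float) -> float:
    """Convert a mass concentration in mg/mL to a molar concentration in micromolar."""
    # mg/mL is numerically equal to g/L; dividing by g/mol gives mol/L.
    molar = conc_mg_per_ml / molar_mass_g_per_mol
    return molar * 1e6

# Assumed molar mass for HSA of about 66.5 kDa (not given in the text).
hsa_micromolar = mg_per_ml_to_micromolar(5.0, 66_500)
```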
A second issue relates to the instrumentation. At present, 2D-IR is a somewhat niche technique, relying on ultrafast lasers and specialist laboratories. However, if one considers that since the first measurements in 1998,7 laser systems have advanced to robust turn-key sources and that one-box spectrometers have become commercially available, then the direction of progress towards an accessible solution becomes clear. By comparison, 28 years elapsed between the characterisation of nuclear magnetic moments and the first commercial NMR spectrometer, and a further 17 years before Fourier Transform technology emerged.64 Similar timescales were also required for techniques like cryo-EM to reach their full capabilities. It is therefore reasonable to assume that the accessibility and applications of 2D-IR will only continue to progress.
The success of these hybrid approaches to protein structure prediction using experimental data is perhaps to be expected given the achievements of AI-driven tools such as AlphaFold and related platforms. These show that, given sufficient information, the link between primary structure and 3D conformation can be discerned. Our approach is similar, but with the focus on unravelling the spectrum–structure relationship. It is however of note that our ML methods do this successfully by using a very different portion of the spectrum to that which most previous 2D-IR studies have focused on. We thus believe that this study marks an encouraging step along the road to implementation of 2D-IR in applications ranging from quality control and regulation, to structure-based drug design involving dynamic protein structures.
This journal is © The Royal Society of Chemistry 2026