Pau Rocabert-Oriols, Camilla Lo Conte, Núria López and Javier Heras-Domingo†*
Institute of Chemical Research of Catalonia (ICIQ-CERCA), Avinguda dels Països Catalans, 16, Tarragona, 43007, Spain
First published on 11th November 2025
Identifying molecular structures from vibrational spectra is central to chemical analysis but remains challenging due to spectral ambiguity and the limitations of single-modality methods. While deep learning has advanced various spectroscopic characterization techniques, leveraging the complementary nature of infrared (IR) and Raman spectroscopies remains largely underexplored. We introduce VibraCLIP, a contrastive learning framework that embeds molecular graphs, IR and Raman spectra into a shared latent space. A lightweight fine-tuning protocol ensures generalization from theoretical to experimental datasets. VibraCLIP enables accurate, scalable, and data-efficient molecular identification, linking vibrational spectroscopy with structural interpretation. This tri-modal design captures rich structure–spectra relationships, achieving Top-1 retrieval accuracy of 81.7% and reaching 98.9% Top-25 accuracy with molecular mass integration. By integrating complementary vibrational spectroscopic signals with molecular representations, VibraCLIP provides a practical framework for automated spectral analysis, with potential applications in fields such as synthesis monitoring, drug development, and astrochemical detection.
Emerging machine learning approaches offer a transformative path forward by enabling the interpretation of complex spectroscopic data and laying the foundation for automated and scalable analysis,8 for instance, in NMR spectroscopy.5 By bridging the gap between qualitative signals and quantitative insights, these tools open new possibilities to understand and design molecular systems with unprecedented speed and precision.9,10 In practice, however, human-led experimental identification relies on integrating information from at least two independent techniques. In AI, combining complementary data streams is known as multi-modal learning, but its application to molecular characterization remains a challenge.
Vibrational spectroscopies are often interpreted with Density Functional Theory (DFT), which balances accuracy and cost.11,12 Systematic deviations remain: harmonic frequencies are overestimated, requiring empirical scaling.13 Intensities depend on the functional and basis set; B3LYP generally gives reliable IR intensities, though Raman intensities are more sensitive to the basis set.14–16 Moreover, temperature, intermolecular interactions, and solvation broaden or shift experimental spectra beyond idealized DFT predictions. To address such challenges, recent advancements in AI-driven spectra-to-molecule mapping broadly follow two main strategies: (i) predicting spectral outputs directly from molecular structures, and (ii) interpreting spectroscopic data to infer chemical structures, enabling inverse design. Graph Neural Networks (GNNs) have been developed to predict IR spectra from molecular graphs,17–20 while Convolutional Neural Networks (CNNs) have been applied to classify IR spectra by functional groups.21,22 Other models, such as support vector machines (SVMs), random forests (RF), multilayer perceptrons (MLPs), and deep reinforcement learning, have been used to identify functional groups from IR spectra.23–26 SMILES-based representations using the Transformer architecture27 have emerged as an alternative for inferring molecular structures directly from IR spectra.28 Existing approaches focus on individual modalities, such as molecular graphs or single spectral techniques.20,29–31 Unlike CNNs, RNNs, GNNs, or random forests, which rely on supervised prediction of labels or properties, contrastive learning operates more like a spectroscopic fingerprinting process: it aligns heterogeneous data (Graph, IR, Raman) in a shared latent space by maximizing the similarity of true pairs and minimizing that of mismatches. Just as a chemist identifies a compound by matching experimental spectra against reference patterns, contrastive learning enables unsupervised cross-modal representation learning that is essential for molecular elucidation.
Vibrational spectroscopies like infrared (IR) and Raman are widely used to identify functional groups but remain underutilized when it comes to combining their complementary strengths.32 Integrating both offers a robust foundation for multi-modal analysis, as both spectra originate from the same underlying physics,6 the molecular vibrations expressed as normal modes (Fig. 1). Current methods struggle to unify these modalities, and incorporating spectral data with molecular graphs introduces further challenges in aligning disparate data types within a shared latent space. Overcoming these limitations is essential for advancing multi-modal characterization and extracting deeper molecular insights.
Multi-modal models based on contrastive learning, such as the CLIP architecture,33 have emerged as powerful tools to bridge diverse data modalities. These models are particularly well-suited for characterization techniques, where relationships between different streams of data, such as molecular structures and spectra, must be learned to allow molecular identification. While current applications of contrastive learning focus on dual-modal relationships (e.g., an image paired with its caption), they leave room for further exploration in more complex multi-modal tasks. For instance, MolCLR34 leverages self-supervised pre-training on molecular graphs to encode chemically meaningful similarities, improving property prediction tasks with limited labeled data. The CReSS system applies contrastive learning to directly connect 13C NMR spectra with molecular structures, enabling high-recall cross-modal retrieval (i.e., molecular assignment) in large molecular libraries and supporting molecular scaffold determination.35,36 Similarly, the CMSSP framework establishes a shared representation between tandem mass spectrometry (MS/MS) spectra and molecular structures, improving metabolite identification.37 However, these approaches used molecular Morgan fingerprints38,39 together with graph embeddings to better anchor the MS spectrum–molecule pair, though combining both may introduce redundant information, underscoring the need for careful feature selection. More recently, the MARASON implementation introduces neural graph matching to retrieval-augmented molecular machine learning, significantly improving mass spectrum simulation accuracy over existing methods.40 Expanding beyond molecular systems, MultiMat introduces a self-supervised multi-modality framework for materials science, leveraging diverse streams of bulk material properties to enhance property prediction, material discovery, and scientific interpretability.41
Despite this rapid progress, existing models remain restricted to dual-modal formulations (e.g., structure–spectrum alignment). Our benchmarking of recent methods (see SI, S-1) highlights their strengths and trade-offs. For example, Chemprop-IR17 reaches high Spectral Information Similarity (SIS) scores (0.969 on theoretical and 0.864 on experimental data), while Graphormer-IR19 scales to large datasets but with 139 M trainable parameters. Contrastive approaches like CReSS35 (13C-NMR + SMILES, Top-10 = 91.6%) or CMSSP37 (MS/MS + Graph, Top-10 = 76.3%) demonstrate the potential of cross-modal retrieval but remain constrained by their pairing nature. Among these, the closest approach to our work is SMEN,42 which aligns molecular graphs with IR spectra. While effective (Top-1 = 94.1%, Top-10 = 99.8% on the QM9 dataset43), SMEN is limited to two modalities and requires 24 M parameters, more than twice the size of our approach.
VibraCLIP advances this frontier by introducing a tri-modal framework for vibrational spectroscopy that jointly aligns molecular graphs, IR, and Raman spectra in a unified latent space. By exploiting the complementarity of IR and Raman signals, it enables richer and more comprehensive molecular characterization than dual-modal systems currently allow. As demonstrated in the SI (Section S-2), a non-learned baseline further underscores the need for explicit alignment to reliably recover molecular structures from different data streams (i.e., molecular structure and vibrational data). With a maximum of 11 M trainable parameters, VibraCLIP effectively captures complex structure–spectra relationships, enabling molecular elucidation from vibrational data and bridging characterization techniques with molecular interpretation. Its adaptable and scalable design unifies and leverages these complementary modalities, accelerating structural analysis, facilitating knowledge transfer across modalities, and establishing it as a powerful tool for advancing characterization across diverse scientific domains.
Fig. 2 Overview of VibraCLIP. (A) VibraCLIP pre-training. The SMILES representation is converted into a graph representation (G), and the vibrational spectra are pre-processed by interpolation and normalization between 0 and 1. Each modality is fed to VibraCLIP simultaneously in batches of 128 systems; the graph encoder is based on the DimeNet++ architecture,46,47 while the spectral encoders and projection heads are fully connected neural networks (FCNN). The contrastive loss is used to maximize the agreement between the projected embedding vectors from the molecular graph (Gn), IR (In), and Raman (Rn), building a shared latent space. (B) The retrieval strategy: the IR and Raman spectra of an unknown molecule are fed to the spectral encoders, and the cross-modality similarity score is calculated across the database, providing the Top-K most aligned embedding vectors, which contain the most likely molecules.
All the employed spectra are standardized and curated following the procedure described in the Vibrational Spectra Pre-processing section. For pre-training, we use the QM9S20 dataset, which contains 130 000 optimized organic molecules with synthetic (DFT) spectroscopic data; see the Datasets section for details. To generalize, we fine-tuned (realigned) the model on an external PubChem dataset,44 which contains 5500 molecules with synthetic spectra generated following the same strategy as QM9S,20,44 expanding the chemical space and molecular size range encountered by the pre-trained model. This addition introduces a minimal realignment of the latent space to accommodate unseen molecules. Finally, as is well known in the community, experimental and computed IR and Raman spectra differ in peak position and width; therefore, both experimental realignment and validation are crucial for being predictive under realistic conditions. To this end, 320 gas-phase molecular spectra were used from the NIST Webbook45 (IR) and from standard libraries (Raman). The experimental dataset features chemically diverse, real-world compounds with richer spectral complexity, listed in the SI (Section S-5).
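For illustration, the sketch below shows one way the interpolation and 0–1 normalization step summarized in Fig. 2 could be implemented; the grid bounds, resolution, and function name are assumptions, not the authors' exact pre-processing code.

```python
# Minimal sketch of spectra pre-processing: interpolation onto a common wavenumber grid
# followed by min-max normalization to [0, 1]. Grid range and resolution are assumptions.
import numpy as np

def preprocess_spectrum(wavenumbers, intensities, grid=np.linspace(500.0, 4000.0, 3501)):
    """Interpolate a raw spectrum onto a fixed grid and normalize it to [0, 1]."""
    interpolated = np.interp(grid, wavenumbers, intensities)   # common x-axis
    span = interpolated.max() - interpolated.min()
    return (interpolated - interpolated.min()) / span if span > 0 else interpolated

# Example: a toy IR spectrum with a few arbitrary bands.
raw_x = np.array([600.0, 1100.0, 1700.0, 2900.0, 3500.0])
raw_y = np.array([0.1, 0.8, 1.5, 0.4, 0.2])
ir_vector = preprocess_spectrum(raw_x, raw_y)
print(ir_vector.shape)  # fixed-length input for the spectral encoder
```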
The graph encoder extracts structural patterns from the molecular graph and, in our case, is based on the DimeNet++ architecture46,47 (see Methods section). DimeNet++ is a graph neural network dedicated to learning geometric patterns in molecular structures, containing both angular and distance-based features, and thus closely resembling Z-matrix molecular representations. DimeNet++ produces a continuous vector-space representation of molecular graphs that, in our case, is further enhanced by concatenating only the standardized molecular mass of the molecule (without isotopic considerations) before the projection heads, see Fig. 2. The molecular mass provides additional chemical context crucial for distinguishing between similar structures, particularly enhancing the quality of the embeddings in downstream tasks.
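A minimal sketch of this graph branch is given below, assuming the DimeNetPlusPlus implementation available in PyTorch Geometric; the hyperparameters and the `std_mass` handling are illustrative, not the published configuration.

```python
# Sketch of the graph branch: a DimeNet++ encoder whose pooled output is concatenated
# with the standardized molecular mass before being passed to the projection head.
import torch
from torch_geometric.nn import DimeNetPlusPlus

class GraphEncoder(torch.nn.Module):
    def __init__(self, emb_dim: int = 256):
        super().__init__()
        self.gnn = DimeNetPlusPlus(
            hidden_channels=128, out_channels=emb_dim, num_blocks=4,
            int_emb_size=64, basis_emb_size=8, out_emb_channels=256,
            num_spherical=7, num_radial=6,
        )

    def forward(self, z, pos, batch, std_mass):
        graph_emb = self.gnn(z, pos, batch)                   # (num_graphs, emb_dim)
        # Append the standardized molecular mass as a global anchoring feature.
        return torch.cat([graph_emb, std_mass.view(-1, 1)], dim=-1)
```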
Similarly, spectral encoders are needed. Here, the applied architecture is a multi-layer fully connected neural network (FCNN) designed to transform the IR or Raman spectral data into spectral embeddings; see the Methods section for details. The encoder first employs an input layer matching the dimensionality of the given spectra, followed by a sequence of two hidden layers with progressively decreasing dimensions, ensuring a smooth downsampling in the feature space. The network produces a fixed-length feature vector using a final linear layer.
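The following sketch illustrates such a spectral encoder: an input layer matching the spectrum length, two hidden layers of decreasing width with batch normalization and a non-linearity, and a final linear layer producing a fixed-length embedding. The layer widths and the ReLU activation are assumptions.

```python
# Sketch of a spectral encoder (FCNN): input layer matching the spectrum length,
# two shrinking hidden layers with batch normalization, and a final linear layer.
import torch.nn as nn

class SpectralEncoder(nn.Module):
    def __init__(self, spectrum_dim: int = 3501, emb_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(spectrum_dim, 1024), nn.BatchNorm1d(1024), nn.ReLU(),
            nn.Linear(1024, 512), nn.BatchNorm1d(512), nn.ReLU(),
            nn.Linear(512, emb_dim),            # fixed-length spectral embedding
        )

    def forward(self, spectrum):                # spectrum: (batch, spectrum_dim)
        return self.net(spectrum)
```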
Next, the projection heads are designed to map embeddings from the modality-specific encoders into a shared representation space. Each head starts with a linear layer reducing the input dimension to the projection dimension, followed by a GELU activation function48 to introduce non-linearity. Two more layers follow: one that refines the projection and an optional normalization layer that ensures consistent, well-regularized representations.
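A possible projection head along these lines is sketched below; the exact layer sizes and the use of LayerNorm for the optional normalization are assumptions.

```python
# Sketch of a projection head: linear down-projection, GELU non-linearity,
# a refinement layer, and optional normalization of the shared-space embedding.
import torch.nn as nn

class ProjectionHead(nn.Module):
    def __init__(self, input_dim: int, proj_dim: int = 128, use_norm: bool = True):
        super().__init__()
        self.proj = nn.Linear(input_dim, proj_dim)     # reduce to projection dimension
        self.act = nn.GELU()                           # non-linearity
        self.refine = nn.Linear(proj_dim, proj_dim)    # refines the projection
        self.norm = nn.LayerNorm(proj_dim) if use_norm else nn.Identity()

    def forward(self, x):
        return self.norm(self.refine(self.act(self.proj(x))))
```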
The multi-modal training strategy incorporates separate learning rates and optimization schedules for each component to facilitate efficient learning across multiple data types, see Methods for details. The learning ability is then controlled by the contrastive loss functions. In the original CLIP framework,33 two distinct modalities, denoted as A (image) and B (caption), are paired via the cosine similarity metric, see Methods for details. In our adaptation, A = G denotes the molecular graph representation, while the infrared (IR) or Raman spectra correspond to B. Then we can evaluate loss functions, L, for different pairs L(G, IR), L(G, Raman) and L(IR, Raman). The CLIP loss is independently applied between the graph representation and each spectroscopic modality, unifying all three in a single model. The overall loss is the sum of individual CLIP losses involving the graph G (see Contrastive loss functions section).
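A compact sketch of how the symmetric CLIP loss and its tri-modal combination could look is given below; the temperature value and the `all_pairs` switch are assumptions consistent with the graph-anchored and all-pairs configurations discussed in the text.

```python
# Sketch of the symmetric contrastive (CLIP-style) loss and its tri-modal combination.
import torch
import torch.nn.functional as F

def clip_loss(emb_a, emb_b, temperature: float = 0.07):
    """Symmetric InfoNCE loss between two batches of projected embeddings."""
    a = F.normalize(emb_a, dim=-1)
    b = F.normalize(emb_b, dim=-1)
    logits = a @ b.t() / temperature                     # pairwise cosine similarities
    targets = torch.arange(a.size(0), device=a.device)   # matched pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def vibraclip_loss(g, ir, raman, all_pairs: bool = True):
    loss = clip_loss(g, ir) + clip_loss(g, raman)        # pairs involving the graph G
    if all_pairs:                                        # optional IR-Raman alignment term
        loss = loss + clip_loss(ir, raman)
    return loss
```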
To evaluate the retrieval accuracy, we implemented a dedicated PyTorch Lightning callback,49–51 executed exclusively on the test dataset in two scenarios: the single and dual spectral modalities. The cosine similarity then scores and ranks all candidate graph embeddings, creating a sorted list of most likely molecules (i.e., Top-K matches) either for single or dual spectra data streams (see Retrieval accuracy section). In the retrieval phase, IR and Raman spectra serve as queries to generate embeddings, which are compared to candidate molecular structures. Candidates are ranked by cosine similarity (eqn (2)), quantifying spectral–structural alignment and prioritizing the most chemically consistent matches. As shown in Fig. 3, retrieval accuracy plots highlight VibraCLIP's effectiveness in matching spectra to molecular structures. Furthermore, the model was also realigned (fine-tuned) using experimental data, demonstrating the effectiveness of the minimal realignment strategy in adapting to real-world spectra.
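A minimal retrieval sketch under these assumptions (tensor shapes and names are illustrative): the query spectral embedding is scored against all candidate graph embeddings by cosine similarity and the Top-K candidates are returned.

```python
# Sketch of the retrieval step: rank candidate molecular graphs against a spectral query
# by cosine similarity and keep the Top-K highest-scoring candidates.
import torch
import torch.nn.functional as F

def retrieve_top_k(query_spec_emb, candidate_graph_embs, k: int = 25):
    query = F.normalize(query_spec_emb, dim=-1)             # (proj_dim,)
    candidates = F.normalize(candidate_graph_embs, dim=-1)  # (n_candidates, proj_dim)
    scores = candidates @ query                             # cosine similarity per candidate
    top = torch.topk(scores, k)
    return top.indices, top.values                          # most likely molecules first
```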
Fig. 3 VibraCLIP retrieval performance. (A) Performance comparison of VibraCLIP using different contrastive loss strategies without anchoring features. (B) Performance comparison of VibraCLIP using different contrastive loss strategies including the standardized molecular mass as anchoring feature. (C) Latent space realignment on the PubChem dataset44 with the standardized molecular mass as anchoring feature. (D) Latent space realignment on the experimental dataset with the standardized molecular mass as anchoring feature. Training epochs (maximum of 200 with early stopping): (A) IR only 153, IR + Raman 115, IR + Raman (all-pairs) 95; (B) IR only 107, IR + Raman (all-pairs) 112; (C and D) realignment with 15 epochs.
The retrieval accuracies highlight the substantial gains achieved by incorporating the Raman spectra and aligning vibrational modalities within the contrastive learning loss function. Without the standardized molecular mass as an anchoring feature, adding Raman spectra increases the Top-1 accuracy from 12.4% (IR only) to 55.1%, and further to 62.9% when explicitly aligning IR and Raman embeddings. As shown in Fig. 3A, this improvement underscores the complementary role of Raman spectroscopy in refining molecular identification. For Top-25, performance improves from 63.6% (IR only) to 94.0% and 94.3% with Raman and full IR-Raman alignment, respectively, demonstrating the value of combining vibrational modalities in a unified latent space. Learned alignment is essential, as the non-learning baseline yielded near-random retrieval across modalities (<0.30% Top-K, Section S-2). VibraCLIP surpasses this lower bound by orders of magnitude, confirming that its performance stems directly from the contrastive training objective.
Building on this, the inclusion of the standardized molecular mass as an anchoring feature (i.e., similar to adding the total mass from an MS experiment) results in notable improvements across all retrieval thresholds. With mass included, Top-1 accuracy for IR-only models rises from 12.4% to 24.2%, and with the fully aligned IR-Raman loss function, from 62.9% to 81.7%, a remarkable 18.8% absolute gain. As shown in Fig. 3B, the Top-25 accuracy increases to 98.9%, confirming the effectiveness of mass as a chemically grounded global descriptor. This anchoring strategy improves consistency in the latent space, especially in distinguishing structurally similar candidates.
The model was further fine-tuned (realigned) on an external dataset of randomly selected organic molecules from PubChem44 (Fig. 3C) and on an experimental dataset (Fig. 3D). These evaluations demonstrate the model's adaptability and generalization to previously unseen chemical and spectral distributions. This realignment was deliberately minimal, updating only the final layer of the projection heads over 15 epochs, while leaving the remaining 8.7 million of the model's 11 million parameters frozen. This lightweight adjustment enabled smooth realignment to new data domains with minimal computational cost.
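The sketch below illustrates one way to realize this lightweight realignment, assuming projection heads structured like the earlier `ProjectionHead` sketch; the attribute names (`graph_head`, `ir_head`, `raman_head`) and the optimizer choice are assumptions.

```python
# Sketch of the lightweight realignment: freeze all pre-trained weights and update only
# the final linear layer of each projection head for a small number of epochs.
import torch

def prepare_for_realignment(model, lr: float = 1e-4):
    for param in model.parameters():
        param.requires_grad = False                      # keep the pre-trained weights fixed
    trainable = []
    for head in (model.graph_head, model.ir_head, model.raman_head):
        for param in head.refine.parameters():           # final linear layer of the head
            param.requires_grad = True
            trainable.append(param)
    return torch.optim.Adam(trainable, lr=lr)
```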
On the PubChem dataset, realignment yielded strong results: Top-1 accuracy rose to 67.6%, and Top-25 accuracy reached 98.2% when the IR-Raman embedding alignment is considered in the loss function. While this is a moderate drop from the QM9S20 benchmark (98.9%), the performance reflects the increased chemical diversity of PubChem44 and confirms that minimal fine-tuning effectively mitigates dataset shift. Similarly, experimental validation using IR and Raman spectra demonstrated the robustness of the approach, with Top-25 accuracy reaching 100% and Top-1 performance at 34.4% in the fully aligned, mass-anchored setup. These findings confirm that VibraCLIP's realignment strategy is not only data-efficient but also transferable across theoretical and real-world domains.
The necessity of learned alignment is further highlighted by comparison with the non-learning baseline (SI, Section S-2), where molecular fingerprints and spectra projected into a shared space yielded near-random retrieval (<0.30%). Much like comparing an experimental IR spectrum against an uncalibrated reference library, meaningful matches cannot be recovered without alignment.
The model shows particularly strong performance in Top-K retrieval scenarios. While Top-1 accuracy improves substantially with IR–Raman alignment (from 12.4% to 62.9%) and further to 81.7% with mass anchoring, the real value lies in the Top-25 accuracy of 98.9%. Importantly, this framework is not designed to retrieve a single exact match, but rather to guide molecular identification by narrowing the search space to a small pool of highly similar candidates. In contexts such as drug discovery, high-throughput screening, or the identification of unknown chemical species in extraterrestrial environments,3,4 this level of precision is both meaningful and practically valuable. This highlights that, even when the exact structure is not ranked first, VibraCLIP consistently retrieves chemically similar candidates sharing key scaffolds. As shown in Fig. 4, correct molecules often appear within the top-ranked set, and retrieved structures tend to be structurally coherent, offering actionable insight in practical settings.
To probe the robustness of VibraCLIP under incomplete data scenarios, we performed a missing-data analysis. In the dual-modality setting (Graph, IR with mass anchoring), IR spectra were progressively removed. Interestingly, performance remained stable up to 10–15% missing data, but dropped sharply at 50%, highlighting the model's reliance on complete spectral information in this configuration. In contrast, in the three-modality case (Graph, IR, Raman with the all-pairs loss function), the accuracy decline at 50% missing data from one of the spectral modalities was far less severe. This suggests that the model continues to learn effectively from the Graph, IR, and the remaining Raman spectra, underscoring the complementarity of the two vibrational modalities. When spectra were missing, the spectral encoder was frozen to prevent weight updates, ensuring stable optimization of the remaining modalities (SI, Section S-6).
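As an illustration of this idea, the helper below restricts a pairwise contrastive term to batch entries where both modalities are present; the masking logic is an assumption rather than the authors' exact implementation, and in the described setup the encoder of the missing modality would additionally be frozen.

```python
# Sketch of masking a pairwise contrastive term when one spectral modality is missing
# for part of the batch. `available_mask` is a boolean tensor over batch entries.
import torch

def masked_pair_loss(clip_loss_fn, emb_a, emb_b, available_mask):
    """Apply a pairwise CLIP loss only to entries where both modalities exist."""
    if available_mask.sum() < 2:                  # need at least two samples to contrast
        return torch.zeros((), device=emb_a.device)
    return clip_loss_fn(emb_a[available_mask], emb_b[available_mask])
```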
VibraCLIP also supports realignment to new data distributions through lightweight fine-tuning. Using only the final projection layers, we adapted the model to a PubChem44 derived dataset and to an internally curated experimental vibrational dataset, confirming transferability across synthetic and real-world domains. Nevertheless, significant limitations remain, largely due to the scarcity of multi-modal, machine learning-ready spectroscopic datasets. The core pre-training relies on QM9S,20 restricted to small organics and element counts of C: 9, N: 7, O: 5, F: 6, leaving larger and chemically richer structures underrepresented. PubChem extends this chemical space (C: 21, N: 6, O: 8, F: 0), and the experimental dataset (C: 30, N: 6, O: 6, F: 6), while limited in size, provides valuable diversity for benchmarking with real-world data, and highlights the opportunity for broader dataset development.
Lastly, while DimeNet++46,47 provides a strong basis for graph encoding, future VibraCLIP iterations can adopt more expressive GNNs to capture complex molecular features. Its modular design enables expansion to modalities such as NMR,35,36 UV-Vis, and mass spectrometry,37 opening new paths to AI-driven molecular identification.
In summary, VibraCLIP is a scalable, efficient, and generalizable framework for spectral interpretation. By embedding molecular and spectral information in a unified latent space, it provides groundwork for next-generation tools in molecular discovery, structural elucidation, and AI-augmented spectroscopy.
The QM9S dataset20 contains 130 000 organic molecules with re-optimized geometries. It includes diverse molecular properties, from scalar values (e.g., energies, partial charges) to high-order tensors (e.g., Hessian matrices, quadrupole and octupole moments, and polarizabilities). Spectral data, including IR, Raman, and UV-Vis spectra, were computed via frequency analysis and time-dependent DFT at the B3LYP/def-TZVP level of theory using Gaussian16.52 The inclusion of IR and Raman spectra in QM9S enabled the development of VibraCLIP, a model designed for multi-modal alignment and spectroscopic representation learning.
VibraCLIP was fine-tuned on 5500 molecules from the PubChem-derived subset of QM9S,20,44 expanding the chemical space and molecular size range encountered by the pre-trained model. This subset includes SMILES representations, 3D coordinates, and Hessian matrices. IR and Raman spectra were inferred using the DetaNet model,20 trained and validated on QM9S, which accurately predicts these spectral features. This fine-tuning process improved VibraCLIP's ability to generalize to a broader range of molecular and spectroscopic data.
Experimental realignment and validation were performed using IR spectra from the NIST Webbook45 and Raman spectra from the OMNIC software's standard library for the same molecules. The resulting dataset includes 320 examples, each combining molecular structure, IR, and Raman spectra. Unlike computational benchmarks such as QM9S20 or PubChem,44 which remain biased toward small, synthetically accessible molecules and lack the chemical richness of real systems, our experimental set features diverse compounds with more complex spectra. Although limited in size, it provides a crucial first step toward validating VibraCLIP in real-world scenarios, underscoring both its generalization ability and the need for larger, experimentally grounded benchmarks.
Since VibraCLIP is a multi-modal model, the processed IR and Raman spectra are also incorporated within the graph object, following a common strategy in PyTorch Geometric51 for multi-modal data integration. This approach enables seamless access to multiple data types within the model and supports efficient multi-modal alignment and representation learning. Further implementation details are available in the SI (S-3).
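A minimal sketch of this packaging step, assuming the standard `torch_geometric.data.Data` container; the field names (`ir`, `raman`, `std_mass`) are illustrative.

```python
# Sketch of packing both spectra into the same PyTorch Geometric Data object as the
# molecular graph, so every modality travels together through the DataLoader.
import torch
from torch_geometric.data import Data

data = Data(
    z=torch.tensor([6, 8, 1, 1]),           # atomic numbers (e.g., formaldehyde)
    pos=torch.randn(4, 3),                  # 3D coordinates
    ir=torch.rand(1, 3501),                 # pre-processed IR spectrum
    raman=torch.rand(1, 3501),              # pre-processed Raman spectrum
    std_mass=torch.tensor([0.12]),          # standardized molecular mass
)
```

Storing each spectrum with a leading batch dimension of 1 lets the default PyG collation stack spectra into a (batch, length) tensor alongside the batched graphs.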
We further enhanced the molecular graph representation by concatenating the standardized molecular mass to the molecular vector generated by DimeNet++. This addition improved the model's overall performance, as the molecular mass provides additional chemical context that proved to be valuable for distinguishing between similar structures. Importantly, this addition occurs before passing the graph embeddings to the projection head, allowing the projection model to fully leverage this information. We believe that the projection head benefits from the inclusion of the standardized molecular mass by refining the embeddings in a way that better captures molecular distinctions relevant to the downstream tasks.
Each hidden layer consists of a linear transformation followed by an activation function, which introduces non-linearity to enhance the network's expressiveness. Batch normalization is then applied after each linear layer to stabilize training by normalizing the activations. Finally, the network outputs a fixed-length feature vector through an additional linear layer, which serves as the final layer. This spectral encoder produces embeddings that capture meaningful spectral information, enabling effective interaction with other modalities for alignment within the VibraCLIP model.
$\mathcal{L}_{A\rightarrow B} = -\dfrac{1}{N}\sum_{i=1}^{N}\log\dfrac{\exp\left(\mathrm{sim}(a_i, b_i)/\tau\right)}{\sum_{j=1}^{N}\exp\left(\mathrm{sim}(a_i, b_j)/\tau\right)}$  (1)
$\mathrm{sim}(a_i, b_j) = \dfrac{a_i\cdot b_j}{\lVert a_i\rVert\,\lVert b_j\rVert}$  (2)
Therefore, the symmetric loss can be represented as:
$\mathcal{L}(A, B) = \tfrac{1}{2}\left(\mathcal{L}_{A\rightarrow B} + \mathcal{L}_{B\rightarrow A}\right)$  (3)
It is worth noting that the CLIP model was originally presented in the context of image-caption pairs, where A represents an image modality and B a text modality.
$\mathcal{L}_{\mathrm{total}} = \mathcal{L}(G, \mathrm{IR}) + \mathcal{L}(G, \mathrm{Raman})$  (4)
$\mathcal{L}_{\mathrm{total}}^{\mathrm{all\text{-}pairs}} = \mathcal{L}(G, \mathrm{IR}) + \mathcal{L}(G, \mathrm{Raman}) + \mathcal{L}(\mathrm{IR}, \mathrm{Raman})$  (5)
The cosine similarity (eqn (2)) scores between spectral and graph embeddings are then calculated. For each target spectrum, the model ranks all candidate graph embeddings based on similarity, creating a sorted list of likely matches. These similarity scores are stored and exported as a pickle file for further analysis of the retrieval accuracy and Top-K matches.
To obtain a combined similarity measure, the geometric mean (GM) of the three pairwise similarity scores (Graph-IR, Graph-Raman, IR-Raman) is calculated, providing a comprehensive metric for alignment across three modalities. This combined similarity helps identify the molecular graph that aligns with both spectra most closely. Specifically, a low score in any one pair will drastically reduce the overall geometric mean, reflecting the joint alignment among all three modalities. The results are saved as a pickle file, enabling analysis of retrieval accuracy in multi-modal alignment within the VibraCLIP framework.
$S_{\mathrm{GM}} = \left(s_{G,\mathrm{IR}}\; s_{G,\mathrm{Raman}}\; s_{\mathrm{IR},\mathrm{Raman}}\right)^{1/3}$  (6)
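A small sketch of this combined score is given below, assuming the pairwise cosine similarities are first rescaled to [0, 1] so that the geometric mean properly penalizes any weak pairwise alignment; the rescaling is our assumption.

```python
# Sketch of the combined similarity in eqn (6): the geometric mean of the three
# pairwise cosine similarities, after mapping each score from [-1, 1] to [0, 1].
import torch

def combined_similarity(sim_graph_ir, sim_graph_raman, sim_ir_raman):
    scores = torch.stack([sim_graph_ir, sim_graph_raman, sim_ir_raman])
    scores = (scores + 1.0) / 2.0                  # rescale cosine range to [0, 1]
    return scores.prod(dim=0).pow(1.0 / 3.0)       # geometric mean across modalities
```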
These enhancements in similarity scoring establish a robust and interpretable multi-modal alignment, ensuring that VibraCLIP captures the intricate relationships between molecular structures and vibrational spectra. Employing the geometric mean across three modalities, our approach maximizes retrieval precision while mitigating discrepancies from individual spectral contributions.
Supplementary information: Benchmarks with recent models, a non-learning baseline, model implementation details, hyperparameter optimization, retrieval accuracies from experimental data, and a missing-data analysis. These provide additional insights into the performance and interpretability of the proposed approach. See DOI: https://doi.org/10.1039/d5dd00269a.
Footnote
† Present address: Departament de Química Inorgànica i Orgànica (Secció de Química Inorgànica), Institut de Química Teòrica i Computacional (IQTC), Universitat de Barcelona, C/Martí i Franquès, 1, Barcelona, 08028, Spain. E-mail: javier.heras@ub.edu
This journal is © The Royal Society of Chemistry 2025