Open Access Article
Salvatore Sorrentino†*abc, Alessandro Gussoni d, Francesco Calcagno ef, Gioele Pasotti a, Davide Avagliano g, Ivan Rivalta ef, Marco Garavelli e and Dario Polli *ah
a Department of Physics, Politecnico di Milano, Piazza Leonardo da Vinci, 32, 20133 Milan, Italy. E-mail: salvatore.sorrentino@polimi.it; dario.polli@polimi.it
b Laser Biomedical Research Center, G. R. Harrison Spectroscopy Laboratory, Massachusetts Institute of Technology, Cambridge, MA, USA
c Cutaneous Biology Research Center, Massachusetts General Hospital, Harvard Medical School, Boston, MA, USA
d Citizen Scientist, Italy
e Department of Industrial Chemistry “Toso Montanari”, Università degli Studi di Bologna, Via Piero Gobetti, 85, I-40129 Bologna, Italy
f Center for Chemical Catalysis – C3, Alma Mater Studiorum University of Bologna, via Piero Gobetti 85, 40129 Bologna, Italy
g Chimie ParisTech, PSL University, CNRS, Institute of Chemistry for Life and Health Sciences (iCLeHS UMR 8060), 75005 Paris, France
h CNR-Institute for Photonics and Nanotechnologies (CNR-IFN), Milan, Italy
First published on 25th November 2025
Raman spectroscopy is a powerful technique for probing molecular vibrations, yet the computational prediction of Raman spectra remains challenging due to the high cost of quantum chemical methods and the complexity of structure–spectrum relationships. Here, we introduce Mol2Raman, a deep-learning framework that predicts spontaneous Raman spectra directly from SMILES representations of molecules. The model leverages Graph Isomorphism Networks with edge features (GINE) to encode molecular topology and bond characteristics, enabling accurate prediction of both peak positions and intensities across diverse chemical structures. Trained on a novel dataset of over 31 000 molecules with state-of-the-art Density Functional Theory (DFT)-calculated Raman spectra, Mol2Raman outperforms both fingerprint-based similarity models and Chemprop-based neural networks. It achieves high fidelity in reproducing spectral features, including for molecules with low structural similarity to the training set and for enantiomeric inversion. The model offers fast inference times (22 ms per molecule), making it suitable for high-throughput molecular screening. We further deploy Mol2Raman as an open-access web application, enabling real-time predictions without specialized hardware. This work establishes a scalable, accurate, and interpretable platform for Raman spectral prediction, opening new opportunities in molecular design, materials discovery, and spectroscopic diagnostics.
Computational methods, particularly Density Functional Theory (DFT), provide an alternative for calculating Raman-active vibrational frequencies.7,8 While highly accurate, DFT-based approaches scale poorly with molecular size due to their high computational cost, making them impractical for high-throughput applications.9,10 Similarly, universal machine learning models, such as the Artificial Intelligence-Quantum Mechanical (AIQM) approach, have been successfully applied to infrared (IR) spectrum prediction, offering accuracy comparable to DFT with significantly reduced inference times.11,12 However, these models still face scalability challenges in the context of large-scale screening, where sub-second predictions are often required. Fast and accurate computational predictions of spectroscopic properties are particularly crucial in fields such as molecular identification and molecular design, where Raman-active molecules are used in applications such as biomedical imaging, environmental sensing, and anti-counterfeiting technologies.13,14 Other predictive models, such as heuristic fingerprint-based approaches, often fail to capture the nuanced relationships between molecular structure and vibrational properties.15 Descriptor-based methods, such as polarizability tensor predictions, introduce propagation errors and transferability limitations, leading to misalignment between predicted and actual Raman peak localization.16–19 In addition, recent advances in machine learning for vibrational spectroscopy have also demonstrated the significant potential of neural network architectures in predicting vibrational density of states and phonon band structures,20,21 though these methods have not been specifically optimized for the unique challenges of Raman spectroscopy.
To address these limitations, we introduce Mol2Raman, a Graph Neural Network (GNN)-based framework that directly predicts spontaneous Raman spectra from SMILES (Simplified Molecular Input Line Entry System) molecular representations.22–24 GNNs have demonstrated significant impact across various scientific fields, particularly in drug discovery, where they have enabled accurate prediction of molecular interactions, drug-target binding affinities, and de novo molecular design, accelerating pharmaceutical research.25–27 Unlike descriptor-based approaches, Mol2Raman learns from graph-based molecular representations, where atoms and bonds are modeled as nodes and edges, aligning naturally with the underlying physics of molecular vibrations.28–30 By leveraging Graph Isomorphism Networks with Edge (GINE) convolutions,31 Mol2Raman captures both local atomic interactions and global molecular structure, improving predictive accuracy over conventional ML models.30 In addition, we integrate traditional chemical descriptors such as Daylight and Morgan fingerprints into the GINE molecular representation.32,33 This combination provides a more comprehensive molecular description, improving the model's ability to predict Raman spectral properties, such as differences in enantiomeric Raman modes. With this hybrid approach, Mol2Raman benefits from both detailed molecular graph learning and the chemical insights encoded in established descriptors. One of the key contributions of this work is the development of a large-scale, high-fidelity dataset comprising over 30 000 organic molecules taken from the QM9 database,34 each paired with a DFT-calculated Raman spectrum obtained with the ORCA software, which is free for academic use,8,35 and used to train the Mol2Raman model. This study focuses on training the model on high-quality DFT-generated spectra, given the challenges of assembling sufficiently large and standardized experimental Raman spectral datasets from online sources.36 The resulting model can also serve as a foundation model for possible future fine-tuning with experimental data.
Additionally, we introduce a custom loss function specifically designed for Raman spectrum prediction. This function balances global spectral similarity with precise peak-position constraints. Unlike conventional loss functions that prioritize only overall spectral shape, our hybrid approach enhances Raman-active peak localization, leading to significantly improved predictive accuracy.
While not a perfect match to DFT calculations, our model provides highly accurate predictions with an inference time of less than a second per molecule, compared to the hours required for DFT calculations. This balance of speed and accuracy makes Mol2Raman particularly valuable for preliminary molecular design studies, where researchers need to rapidly screen thousands of candidate molecules to identify those with desirable spectroscopic properties. By enabling efficient pre-selection, Mol2Raman significantly accelerates high-throughput screening workflows. Once an initial shortlist is identified, more precise DFT calculations or experimental measurements can be performed to refine the selection and obtain final optimized molecular candidates.
Finally, we developed a free web-based platform for Mol2Raman, enabling real-time Raman spectrum predictions directly from SMILES input. By combining a novel dataset, an advanced GNN architecture, and an accessible web application, Mol2Raman provides a fast, scalable, and high-accuracy alternative to first-principles Raman spectrum calculations. To the best of our knowledge, this work represents the first deep-learning framework specifically designed for Raman spectral predictions,37 with broad implications for materials discovery, molecular design, and high-throughput chemical screening.
The dataset used in this work comprises 31 776 molecules. These molecules are extracted from the QM9 dataset provided by Ramakrishnan et al.,34 which contains molecules composed of the elements C, H, O, N, and F with up to nine heavy atoms. QM9 comprises 134 000 molecules and provides a chemical space with diverse molecular and stoichiometric properties.
The molecular geometries in the QM9 dataset were optimized using DFT at the B3LYP/6-31G(2df,p) level of theory,38,39 which balances computational efficiency with predictive accuracy for organic compounds.34 The dataset contains a wide range of quantum chemical properties, such as dipole moments, isotropic polarizabilities, frontier orbital eigenvalues, harmonic vibrational frequencies, and thermodynamic properties like atomization energies, enthalpies, and free energies at 298.15 K.
Starting from these optimised geometries in Cartesian coordinates, we calculate the Raman spectral activities using ORCA software8 and a high-performance computing (HPC) cluster provided by CINECA,40 using the BP86/def2-SVP level of theory as suggested in ORCA.41,42 This calculation allows us to obtain molecular Raman spectral information in the region from 500 cm−1 to 3500 cm−1. We retain only molecules whose numerical frequency calculations are completed without any error or missing displaced geometry, and discard all incomplete runs. The final curated dataset, after excluding 2 782 molecules that report SCF/NUMCALC failures in finite-difference frequency calculations, contains 31 776 molecules used for training, validation, and testing of the model. No additional filtering of DFT activities is applied. The ORCA software package provides advanced tools for calculating vibrational spectra using DFT,43 including Raman activities and IR spectra, which are essential for understanding molecular interactions with light. Raman activity is a distinct concept from Raman intensity. The former is an intrinsic molecular property derived from quantum chemical calculations, while the latter depends on experimental conditions, such as the wavelength of incident light and the temperature of the system.44 Raman activities are determined using the derivatives of the molecular polarizability tensor with respect to the vibrational normal coordinates as given by:43
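The expression itself did not survive in this text. For orientation only, the Raman activity of a normal mode $i$ is commonly written in the textbook form below; that the original equation uses exactly this form and notation is an assumption. Here $\alpha'_{ab} = \partial \alpha_{ab} / \partial Q_i$ denotes the derivative of a polarizability tensor component with respect to the normal coordinate $Q_i$.

```latex
S_i = 45\,\bar{\alpha}_i'^{\,2} + 7\,\gamma_i'^{\,2},
\qquad
\bar{\alpha}_i' = \tfrac{1}{3}\left(\alpha'_{xx} + \alpha'_{yy} + \alpha'_{zz}\right),
\\[4pt]
\gamma_i'^{\,2} = \tfrac{1}{2}\left[
  (\alpha'_{xx} - \alpha'_{yy})^2 + (\alpha'_{yy} - \alpha'_{zz})^2 + (\alpha'_{zz} - \alpha'_{xx})^2
  + 6\left(\alpha'^{\,2}_{xy} + \alpha'^{\,2}_{yz} + \alpha'^{\,2}_{zx}\right)
\right]
```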
To encode the chemical information of each molecule, we use its SMILES representation as input to the neural network. SMILES is a widely used notation to represent the structure of chemical molecules in a compact and machine-processable format,22–24 encoding the molecular structure as a linear string of ASCII characters, where atoms are represented by their atomic symbols, and bonds are described (implicitly or explicitly) by specific characters. The simplicity and expressiveness of SMILES made it one of the standards for molecular representation in cheminformatics and computational chemistry. However, it should be noted that while SMILES encodes connectivity, it does not inherently capture all the three-dimensional geometric information of molecules, which may be necessary for certain property predictions.45
Given the distinct physical mechanisms governing Raman modes in these spectral regions, we divided the global spectrum (500–3500 cm−1) into two subregions: the fingerprint region between 500 cm−1 and 2100 cm−1 and the C–H stretching region between 1900 cm−1 and 3500 cm−1. To ensure a smooth transition between spectral subregions, we introduce an overlapping window (1900–2100 cm−1), facilitating seamless integration of predictions across both regions.
Moreover, because DFT calculations yield discrete Raman-active vibrational frequencies, they lack natural peak broadening effects,8 resulting in sparse spectra with numerous zero-intensity points. This sparsity introduces challenges during neural network training by adding unnecessary complexity. To address this, we employ a two-stage max pooling strategy,49 commonly used in computer vision, to downsample spectra. Max pooling selects the maximum value within a defined window, reducing the dimensionality of the data while retaining important features. First, we apply max pooling with a resolution of 2 cm−1 to the raw DFT-calculated Raman activities. The resulting 800-point spectra serve as the reference spectra: they are treated as the “true” Raman spectra of each molecule and are used for evaluation throughout the paper. Second, for the training phase of the Mol2Raman model, we further apply max pooling with a 6 cm−1 resolution to reduce the 800-point spectra to 267 points. This coarser representation simplifies the learning task by reducing sparsity and dimensionality, while retaining the key vibrational features. The resulting training spectra retain spectral integrity, with Raman-active frequencies still well-aligned and a maximum deviation of only 3 cm−1 from the reference.
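The second pooling stage can be illustrated with a minimal sketch. It assumes non-overlapping windows of three 2 cm−1 bins with the last, partial window kept, which is one way an 800-point spectrum becomes 267 points; the function name and the toy spectrum are hypothetical and not taken from the Mol2Raman code.

```python
def max_pool(values, window):
    """Non-overlapping max pooling; a final partial window is kept."""
    return [max(values[i:i + window]) for i in range(0, len(values), window)]


# Hypothetical sparse reference spectrum: 800 points at 2 cm^-1 spacing.
reference = [0.0] * 800
reference[100] = 5.0  # stick at 500 + 2 * 100 = 700 cm^-1
reference[401] = 2.5  # stick at 500 + 2 * 401 = 1302 cm^-1

# Three 2 cm^-1 bins per window give 6 cm^-1 resolution.
training = max_pool(reference, 3)
print(len(training))
```

Because each pooled bin keeps the maximum of its window, every stick survives with its original height and only its position is coarsened.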
In addition to these graph-based local descriptors, we also employ two other sets of features to enhance the input descriptive power of molecules, the Daylight fingerprint32 and the Morgan fingerprint.33 Briefly, the Daylight fingerprint encodes the global molecular structures into a binary vector where each bit represents the presence or absence of a specific substructure or molecular pattern. The fingerprint is generated by systematically decomposing the molecule into all possible linear substructures up to a certain length. Each substructure is then converted into a numerical representation that activates one or more specific positions within a fixed-length binary vector, encoding the presence of that feature, in a process called hashing. This method is particularly effective in similarity searching, where the Tanimoto similarity is often used to compare fingerprints.51 Alongside this, the Morgan fingerprint is an advanced and widely used fingerprinting method, particularly in modern machine learning applications. It is based on the Extended Connectivity Fingerprints algorithm.33 Unlike the Daylight fingerprint, which relies on linear paths, the Morgan fingerprint captures local circular environments around each atom, making it more effective at encoding molecular topology and chemical context. The algorithm iteratively expands around each atom up to a specified radius, generating unique identifiers (hash codes) for the substructures at each step. These identifiers are then mapped to a fixed-length binary or integer vector.
The combination of Daylight and Morgan fingerprints as input to a neural network allows the model to learn from the complementary strengths of linear path-based substructure detection and circular neighbourhood encoding. These complementary descriptors encode long-range molecular interactions and structural patterns, refining the ability of the model to capture both local and global spectroscopic variations.
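The Tanimoto similarity mentioned above for comparing fingerprints is straightforward to state in code. The sketch below represents each binary fingerprint by the set of its active bit positions; the bit indices are made up for illustration, and returning 0.0 for two empty fingerprints is a convention chosen here.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(union)


# Hypothetical on-bit positions of two hashed fingerprints.
fp_a = {1, 5, 9, 42}
fp_b = {5, 9, 77}
print(tanimoto(fp_a, fp_b))  # 2 shared bits out of 5 distinct bits
```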
As illustrated in Fig. 1A, the network follows a graph-based deep learning architecture. The input SMILES undergoes transformation through the molecular featurizer, converting the molecular structure into a graph representation that captures atomic and bond-level information. This featurization enables the model to extract meaningful structural features directly from the molecular graph.
At its core, the network consists of four GINEConv layers,31 which allow the atomic features to be influenced by neighboring atoms up to four hops away. This hierarchical aggregation captures complex interatomic relationships and molecular topologies that are essential for predicting vibrational frequencies. Each GINEConv layer incorporates a linear transformation of the node features, followed by batch normalization to stabilize learning and improve convergence,52 and a ReLU activation function is applied after each linear transformation to introduce non-linearity.53 These steps ensure robust feature extraction, preserving both local atomic environments and global molecular connectivity.54
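To make the message-passing step concrete, the sketch below implements the aggregation part of a GINE update in plain Python: each node keeps (1 + eps) times its own features and adds ReLU(neighbor features + edge features) for every incoming edge. The trainable linear transformation, batch normalization and final ReLU described above are deliberately left out, and the toy graph and feature values are hypothetical.

```python
def relu(vector):
    return [max(0.0, x) for x in vector]


def gine_aggregate(node_feats, edges, edge_feats, eps=0.0):
    """Aggregation step of a GINE layer, without the trainable network.

    edges is a list of directed (source, target) pairs; edge_feats holds one
    vector per edge with the same dimension as the node features.
    """
    dim = len(node_feats[0])
    out = [[(1.0 + eps) * x for x in feats] for feats in node_feats]
    for (src, dst), e in zip(edges, edge_feats):
        message = relu([node_feats[src][k] + e[k] for k in range(dim)])
        for k in range(dim):
            out[dst][k] += message[k]
    return out


# Toy chain 0-1-2 with both edge directions listed explicitly.
nodes = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
edges = [(0, 1), (1, 0), (1, 2), (2, 1)]
edge_feats = [[0.5, -2.0]] * 4
updated = gine_aggregate(nodes, edges, edge_feats)
```

Stacking four such steps lets information from atoms up to four bonds away reach each node, which is the receptive field described in the text.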
Following the GINEConv layers, a global sum pooling operation aggregates node-embedding information across the entire molecular graph.54 This step allows the network to produce a fixed-size embedding, irrespective of the number of atoms in the molecule. The pooled representation is then passed through two fully-connected linear layers. The first linear layer expands the dimensionality of the pooled features by a factor of 4 to enhance expressiveness before making the final predictions. The second linear layer maps these features to the final output, the predicted number of peaks in the spectrum, which is obtained via a Softplus function.55 Between the two layers, dropout with a rate of 0.25 is applied.
The output from the aggregation layers is fed into three fully connected layers, with two dropout layers (dropout rate of 0.4) between them. The last layer is a Softplus layer whose dimension equals that of the training spectra (267 points, see Dataset Preprocessing); it generates the network output, namely the Raman spectrum. The final output of the network is obtained after a Monte Carlo dropout step with 10 rounds of predictions.56 Fixed hyperparameters are selected on the validation set to balance model complexity and computational efficiency, following a 20-trial random search over the architecture's layers, dropout rate, and loss weights, and targeted manual adjustments informed by validation curves.
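The Monte Carlo dropout step can be sketched as follows. The text states only that 10 rounds of predictions are used; averaging the rounds, as done here, is the usual choice and is an assumption, as are the toy forward pass and all names.

```python
import random


def dropout(values, rate, rng):
    """Inverted dropout: zero each value with probability `rate`, rescale the rest."""
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]


def mc_dropout_predict(forward, inputs, rounds=10, seed=0):
    """Average `rounds` stochastic forward passes with dropout left active."""
    rng = random.Random(seed)
    samples = [forward(inputs, rng) for _ in range(rounds)]
    return [sum(s[i] for s in samples) / rounds for i in range(len(samples[0]))]


def toy_forward(inputs, rng):
    # Stand-in for the trained network: dropout followed by a fixed scaling.
    hidden = dropout(inputs, 0.4, rng)
    return [2.0 * h for h in hidden]


mean_spectrum = mc_dropout_predict(toy_forward, [1.0, 0.5, 0.0, 2.0])
```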
As with the previous network, this architecture is applied to both the fingerprint and C–H regions, yielding two vectors of size 267 that represent the predicted Raman activities for the two spectral regions. Moreover, these 267-point vectors include an overlap of 33 points between the fingerprint and C–H regions, corresponding to the 1900–2100 cm−1 overlap discussed above.
129 molecules include oxygen, 19 416 contain nitrogen, and 484 feature fluorine. More details on the molecular properties of the dataset are reported in SI Table 1. The dataset is randomly split into 80% (25 440 molecules) for training, 10% (3 168 molecules) for validation, and 10% (3 168 molecules) for testing. SI Fig. S1 shows the distribution of training and test molecules in the space of the first two Principal Components (PC1 and PC2) calculated over the Morgan fingerprint representation of each SMILES, which confirms that the test molecules are randomly sampled from the entire dataset population. The training and validation datasets are used to iteratively refine model parameters, while the independent test set is reserved for final performance evaluation.
The two networks discussed above are trained using different loss functions. The network predicting the number of Raman-active frequencies is optimized using the Root Mean Squared Error (RMSE), whereas the network predicting Raman intensities is trained with a custom peak-weighted RMSE loss function. This loss function enhances the model's ability to correctly identify Raman peaks by assigning different weights to true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN).57 The function is formally defined as:
![]() | (1) |
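Since the exact form of Eq. (1) is not available in this text, the following is only a hypothetical illustration of the general idea described above: each spectral bin is classified as TP, TN, FP or FN, and its squared error is scaled by a class-dependent weight before taking the root mean. The weight values, the threshold and the function name are invented for the example and should not be read as the authors' loss.

```python
import math


def peak_weighted_rmse(pred, true, weights, threshold=0.0):
    """RMSE with a per-bin weight chosen by the bin's TP/TN/FP/FN class."""
    total = 0.0
    for p, t in zip(pred, true):
        pred_peak, true_peak = p > threshold, t > threshold
        if pred_peak and true_peak:
            key = "TP"
        elif pred_peak:
            key = "FP"
        elif true_peak:
            key = "FN"
        else:
            key = "TN"
        total += weights[key] * (p - t) ** 2
    return math.sqrt(total / len(true))


# Hypothetical weights that penalise missed peaks most strongly.
weights = {"TP": 1.0, "TN": 1.0, "FP": 2.0, "FN": 4.0}
uniform = {"TP": 1.0, "TN": 1.0, "FP": 1.0, "FN": 1.0}
pred = [0.0, 1.0, 0.5, 0.0]
true = [0.0, 1.0, 0.0, 2.0]
loss = peak_weighted_rmse(pred, true, weights)
plain = peak_weighted_rmse(pred, true, uniform)
```

With uniform weights the function reduces to the ordinary RMSE, which makes the effect of the peak weighting easy to inspect.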
Model parameters are optimized using Stochastic Gradient Descent (SGD) with momentum.58 The learning rate is set to an initial fixed value and gradually reduced throughout training to ensure stable convergence. Momentum is included to accelerate updates in the correct direction while mitigating oscillations, and weight decay is applied to prevent overfitting by penalizing excessively large parameter values. Training is conducted for a maximum of 1500 epochs under an early-stopping framework, where every two epochs the validation loss is computed, and the model parameters corresponding to the lowest loss value are retained.59 The computational time for the prediction of Raman spectra is on average 22 milliseconds per molecule, representing a substantial improvement over DFT calculations, which require hours. More details on the hardware and computational time for training are reported in SI Note 1.
The number of trainable parameters differs significantly between the two models due to architectural variations. The network predicting the number of Raman-active frequencies consists of 1 530 193 parameters per spectral region, while the network predicting Raman activity has 77 091 419 parameters per spectral region. The larger parameter count in the latter model results from the inclusion of both global molecular features and the predicted number of Raman-active frequencies before the fully connected network. This additional information increases the dimensionality of the first layer, leading to a more complex multi-layer architecture.
In detail, the initial post-processing step involves upscaling the 267-dimensional vectors (obtained from max pooling) corresponding to the fingerprint and C–H predictions back to 800-dimensional vectors. This is accomplished through linear interpolation, ensuring a continuous and coherent spectral representation across the entire frequency range.
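A minimal sketch of this upscaling step is given below. It assumes that the first and last points of the coarse and fine grids coincide, which the text does not specify; the helper name and the toy input are hypothetical.

```python
def upsample_linear(values, n_out):
    """Linearly interpolate `values` onto `n_out` evenly spaced points."""
    n_in = len(values)
    out = []
    for k in range(n_out):
        pos = k * (n_in - 1) / (n_out - 1)  # fractional index on the coarse grid
        lo = int(pos)
        hi = min(lo + 1, n_in - 1)
        frac = pos - lo
        out.append(values[lo] * (1.0 - frac) + values[hi] * frac)
    return out


# Hypothetical 267-point prediction upscaled to the 800-point reference grid.
coarse = [float(i % 5) for i in range(267)]
fine = upsample_linear(coarse, 800)
```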
After the upscaling process, we filter the predicted Raman activities using the predicted number of Raman-active modes, with peak prominence as the ranking criterion.60 In this context, prominence measures how much a Raman peak rises above its neighboring valleys, so that only peaks distinctly higher than their immediate surroundings are retained and minor fluctuations or noise are filtered out. Specifically, using the predicted number of Raman peaks for the fingerprint and C–H regions, we retain in the prediction of Raman activities only the points corresponding to the most significant peaks. To achieve this, the predicted peaks are first ranked in descending order of prominence. We then select only the top peaks, matching the number predicted by the first network, so that the most prominent spectral features are preserved while less significant fluctuations are discarded. Prominence is computed with the signal.peak_prominences function of the SciPy library; it is used solely for ranking predicted peaks and does not alter the peak definition. This approach refines the predicted spectrum by focusing on the most relevant Raman signals.
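The ranking-and-selection logic can be sketched without external dependencies. The prominence helper below is a simplified pure-Python stand-in for the SciPy routine named in the text, and zeroing every point that is not a retained peak is an assumption about how the filtered spectrum is represented.

```python
def find_peaks(y):
    """Indices of strict local maxima."""
    return [i for i in range(1, len(y) - 1) if y[i - 1] < y[i] > y[i + 1]]


def prominence(y, i):
    """Peak height above the higher of the two lowest valleys around it."""
    left_min, j = y[i], i - 1
    while j >= 0 and y[j] <= y[i]:  # walk left until a higher point or the edge
        left_min = min(left_min, y[j])
        j -= 1
    right_min, j = y[i], i + 1
    while j < len(y) and y[j] <= y[i]:  # same on the right
        right_min = min(right_min, y[j])
        j += 1
    return y[i] - max(left_min, right_min)


def keep_top_peaks(y, n_peaks):
    """Keep the n_peaks most prominent peaks and zero everything else."""
    ranked = sorted(find_peaks(y), key=lambda i: prominence(y, i), reverse=True)
    keep = set(ranked[:n_peaks])
    return [v if i in keep else 0.0 for i, v in enumerate(y)]


# Toy prediction with three local maxima; the count network predicted two peaks.
predicted = [0.0, 1.0, 0.2, 3.0, 0.1, 0.5, 0.0]
filtered = keep_top_peaks(predicted, 2)
```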
We then concatenate the predictions from the two distinct spectral regions, performing an averaging operation within the overlapping window between 1900 and 2100 cm−1. This averaging step ensures a smooth and continuous transition between the two spectral domains, facilitated by the few Raman activities that generally occur in this region.46 The final spectrum thus represents the complete Raman profile (from 500 cm−1 to 3500 cm−1 Raman shift with a step of 2 cm−1), which is essential for downstream analyses and comparison with experimental spectra.
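The concatenation can be sketched as below. With a 2 cm−1 step, the 1900–2100 cm−1 window corresponds to 100 points shared by the two 800-point vectors, so the merged spectrum has 1500 points; this point count is derived from the stated grid rather than quoted from the source, and the function name is hypothetical.

```python
def merge_regions(fingerprint, ch, overlap):
    """Concatenate two regional spectra, averaging the shared overlap window."""
    head = fingerprint[:-overlap]
    middle = [(a + b) / 2.0 for a, b in zip(fingerprint[-overlap:], ch[:overlap])]
    tail = ch[overlap:]
    return head + middle + tail


# Constant toy spectra make the averaged window easy to see.
fingerprint = [1.0] * 800  # 500-2100 cm^-1
ch = [3.0] * 800           # 1900-3500 cm^-1
full = merge_regions(fingerprint, ch, overlap=100)
```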
Subsequently, since the model output consists of discrete peak predictions, we apply a convolution with a Lorentzian function defined by a full width at half maximum (FWHM) of 10 cm−1.61 This specific FWHM value ensures that each predicted peak is broadened to realistically reproduce the natural line shape typically observed in experimental Raman spectra.62 This convolution process broadens the predicted peaks, generating a continuous and smooth spectrum that more accurately resembles an experimentally measured Raman spectrum.
After that, we normalize the resulting spectrum by dividing it by the sum of its vector representation, such that the sum of the entire spectrum equals 1. These smoothing and normalization processes facilitate a continuous and coherent representation across the entire spectrum of the Raman predictions and are important for accurate performance evaluation and comparisons with other models.
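The broadening and normalization steps can be sketched together. The Lorentzian below is scaled to unit height at its center and truncated at the spectrum edges; both choices are assumptions, since the text fixes only the FWHM of 10 cm−1 and the unit-sum normalization.

```python
def lorentzian_broaden(sticks, step=2.0, fwhm=10.0):
    """Replace each stick by a Lorentzian line of the given FWHM (in cm^-1)."""
    gamma = fwhm / 2.0  # half width at half maximum
    n = len(sticks)
    out = [0.0] * n
    for j, height in enumerate(sticks):
        if height == 0.0:
            continue
        for i in range(n):
            dx = (i - j) * step
            out[i] += height * gamma ** 2 / (dx ** 2 + gamma ** 2)
    return out


def normalize(spectrum):
    """Scale the spectrum so that its points sum to 1."""
    total = sum(spectrum)
    return [v / total for v in spectrum] if total > 0 else list(spectrum)


# Toy stick spectrum on a 2 cm^-1 grid.
sticks = [0.0] * 200
sticks[50] = 1.0
sticks[120] = 0.4
smooth = normalize(lorentzian_broaden(sticks))
```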
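Given the true peak positions $\nu_i$ and the set of predicted peak positions $\{\hat{\nu}_1, \hat{\nu}_2, \ldots, \hat{\nu}_n\}$, the set of True Positives (TP) is defined as:

$$\mathrm{TP} = \left\{ \hat{\nu}_j \;\middle|\; \exists\, \nu_i \ \text{such that}\ |\hat{\nu}_j - \nu_i| \le \delta \ \text{and}\ |\hat{\nu}_k - \nu_i| > \delta \ \ \forall\, k < j \right\} \tag{2}$$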
This formulation ensures that each true peak is matched only once within the defined tolerance, preventing multiple assignments and maintaining evaluation integrity. We adopt three tolerance values—δ = 10, 15, and 20 cm−1—to benchmark model flexibility and accuracy.
Based on this framework, we compute the F1 score, defined as the harmonic mean of precision (the fraction of predicted peaks that are correct) and recall (the fraction of true peaks that were identified):
$$F_1 = 2\,\frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{3}$$
In the same way as the F1 score with tolerance, a Precision with Tolerance and a Recall with Tolerance can be defined.
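The tolerance-based metrics can be sketched as follows, reading Eq. (2) literally: a predicted peak counts as a true positive if some true peak lies within δ of it and no earlier predicted peak already lies within δ of that same true peak. Processing predicted peaks in ascending order of Raman shift is an assumption, and the peak lists are hypothetical.

```python
def true_positives(predicted, true, delta):
    """Predicted peaks that are the first match of some true peak within delta."""
    ordered = sorted(predicted)
    hits = []
    for j, p in enumerate(ordered):
        for t in true:
            first_match = all(abs(q - t) > delta for q in ordered[:j])
            if abs(p - t) <= delta and first_match:
                hits.append(p)
                break
    return hits


def scores_with_tolerance(predicted, true, delta=15.0):
    """Return (precision, recall, F1) under a peak-position tolerance."""
    tp = len(true_positives(predicted, true, delta))
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(true) if true else 0.0
    denom = precision + recall
    f1 = 2.0 * precision * recall / denom if denom > 0 else 0.0
    return precision, recall, f1


# Hypothetical peak positions in cm^-1.
predicted = [1000.0, 1008.0, 1450.0, 2950.0]
true = [1005.0, 1460.0, 1600.0]
precision, recall, f1 = scores_with_tolerance(predicted, true, delta=15.0)
```

In this toy case the peak at 1008 cm−1 is not counted, because the true peak at 1005 cm−1 is already matched by the prediction at 1000 cm−1.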
Fig. 2A and B illustrate the model's predictive performance by comparing predicted and actual Raman-active peaks on the test set of 3 168 molecules. The density of points in these scatterplots highlights the frequency distribution of prediction outcomes.
To quantitatively assess model performance, standard regression metrics, including the coefficient of determination (R2), root mean squared error (RMSE) and accuracy, are computed. Here, accuracy is defined as the ratio between the number of molecules for which the predicted number of Raman-active modes exactly matches the DFT-calculated value, and the total number of molecules. The results, summarized in Table 1, confirm the model's strong predictive capability.
| Metric | Fingerprint region | C–H region |
|---|---|---|
| R2 | 0.937 | 0.844 |
| RMSE | 1.282 | 1.089 |
| Accuracy | 0.349 | 0.458 |
The high R2 values (0.937 for the fingerprint region and 0.844 for the C–H region) indicate that the model successfully captures the correlation between molecular structure and Raman activity. Notably, the observed accuracy values for fingerprint (0.349) and C–H (0.458) regions are meaningful considering the difficulty of predicting exact counts across a wide range of possible peak values in the two regions, as shown in SI Fig. S2 and S3. Additionally, the low RMSE values suggest minimal deviation from actual peak counts, ensuring reliable performance. The few mispredictions observed are primarily limited to differences of one or two peaks, demonstrating the robustness of the model. An analysis of different performances for the prediction of the number of Raman-active frequencies in the C–H and fingerprint regions is reported in SI Note 2 and Fig. S2, S3.
This predictive accuracy is critical for the overall Mol2Raman pipeline. First, the estimated number of Raman-active peaks guides the subsequent network in predicting Raman activities, helping it focus on the correct number of peaks as an initial molecular feature. Second, it enhances the post-processing step by refining the final spectral output, ensuring that only the most chemically relevant peaks are retained. This filtering improves the interpretability of the predicted spectra and their alignment with experimental data. The strong performance of this model establishes a reliable foundation for subsequent stages in the Mol2Raman framework.
The network predicting Raman activities is evaluated on the same test set of 3 168 molecules. This model is responsible for generating Raman spectra, which are subsequently refined through the post-processing pipeline. The primary metric used to assess the model performance is the F1 score with tolerance. This choice is motivated by the central role of peak positions in defining Raman spectral specificity,3 ensuring both selectivity and completeness in spectral peak identification, a key requirement in molecular vibrational analysis.
Results in terms of F1 score, precision and recall for different tolerance levels are shown in Table 2. These outcomes highlight the robustness of the Mol2Raman model in predicting Raman activities in both the fingerprint and C–H stretching regions, with F1 scores of 0.631 and 0.680, respectively, at a tolerance window of 15 cm−1. This performance is consistent with standard experimental practice in Raman spectroscopy: experimental Raman peaks often shift by up to 10–15 cm−1 due to thermal broadening, matrix effects, and instrumental resolution, so a ±15 cm−1 tolerance window provides a chemically realistic and spectroscopically grounded evaluation criterion.67,68
| Metric (tolerance) | Fingerprint region | C–H region |
|---|---|---|
| F1 score (10 cm−1) | 0.551 | 0.617 |
| F1 score (15 cm−1) | 0.631 | 0.680 |
| F1 score (20 cm−1) | 0.705 | 0.739 |
| Precision (10 cm−1) | 0.549 | 0.614 |
| Precision (15 cm−1) | 0.629 | 0.677 |
| Precision (20 cm−1) | 0.703 | 0.736 |
| Recall (10 cm−1) | 0.553 | 0.624 |
| Recall (15 cm−1) | 0.634 | 0.688 |
| Recall (20 cm−1) | 0.708 | 0.748 |
To better evaluate the model, we also calculate the spectral similarity between predicted and calculated Raman spectra using Spectral Information Similarity (SIS) and Cosine Similarity,63,64 as shown in Table 3. These metrics assess the overall spectral shape and intensity distribution, complementing the peak-based evaluation. For this purpose, both the predicted and calculated spectra are convolved with Lorentzian functions to simulate natural peak broadening, as discussed previously.
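Both similarity measures can be sketched on unit-sum spectra. Cosine similarity is standard; for SIS the sketch assumes the form 1 / (1 + SID), where SID is the symmetric Kullback-Leibler (spectral information) divergence, and it omits the broadening step described above. The small `eps` floor that avoids taking the logarithm of zero is an implementation choice made here.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a > 0 and norm_b > 0 else 0.0


def spectral_information_similarity(a, b, eps=1e-12):
    """SIS taken as 1 / (1 + SID), with SID the symmetric KL divergence (assumed form)."""
    sum_a, sum_b = sum(a), sum(b)
    p = [max(x / sum_a, eps) for x in a]
    q = [max(y / sum_b, eps) for y in b]
    sid = sum(pi * math.log(pi / qi) + qi * math.log(qi / pi) for pi, qi in zip(p, q))
    return 1.0 / (1.0 + sid)


# Hypothetical broadened spectra on a common grid.
reference = [0.1, 0.6, 0.2, 0.1]
prediction = [0.2, 0.5, 0.2, 0.1]
cos = cosine_similarity(reference, prediction)
sis = spectral_information_similarity(reference, prediction)
```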
| Metric | Fingerprint region | C–H region |
|---|---|---|
| SIS | 0.604 | 0.698 |
| Cosine similarity | 0.689 | 0.737 |
The results in Table 3 reveal differences in model performance between the two spectral regions. Higher average SIS and cosine similarity scores in the C–H region (0.698 and 0.737, respectively) indicate superior performance in predicting C–H stretching vibrations compared to the fingerprint region (0.604 and 0.689). This discrepancy is likely due to the simpler vibrational modes in the C–H region, which are predominantly influenced by localized molecular bonds. On the other hand, the fingerprint region reflects the complex global 3D molecular geometry, which is more challenging to model from SMILES representations.69,70
After evaluating the two separate spectral windows, we also assess the full spectral range between 500 and 3500 cm−1, obtained by concatenating the two regions as discussed previously.
As shown in Table 4, the model demonstrates consistent performance across the entire Raman spectrum, in terms of both distribution mean and median. The model achieves a mean F1 score of 0.642, with corresponding precision and recall values of 0.640 and 0.645. Additionally, the F1 score distribution across the entire test dataset, shown in Fig. 3A, further highlights the model's reliability and generalizability, with the majority of predictions achieving high F1 scores, reflecting consistent performance across diverse molecular structures.
| Metric | Full spectrum mean (15 cm−1 tolerance) | Full spectrum median (15 cm−1 tolerance) |
|---|---|---|
| F1 score | 0.642 | 0.656 |
| Precision | 0.640 | 0.651 |
| Recall | 0.645 | 0.658 |
Furthermore, the SIS score of 0.669 and cosine similarity of 0.735 (Table 5) reflect the ability of Mol2Raman to integrate predictions from both spectral regions into a coherent full-spectrum representation that reproduces the DFT-calculated spectra to a high degree. The combined analysis benefits from the complementary information in the fingerprint and C–H regions, enhancing the overall spectral prediction. In SI Table 2 we also show that combining the predicted number of Raman-active modes with Daylight and Morgan fingerprints yields the best performance across metrics, outperforming all other combinations of these three global molecular descriptors.
| Metric | Full spectrum |
|---|---|
| SIS | 0.669 |
| Cosine similarity | 0.735 |
Moreover, Fig. 3B shows the distribution of F1 scores at 15 cm−1 tolerance for different subsets of the test dataset, namely molecules with at least one oxygen atom (2 703 molecules), at least one nitrogen atom (1 940 molecules) and at least one fluorine atom (44 molecules); the corresponding results are provided in Table 6. Molecules with at least one fluorine atom are strongly underrepresented in both the training and test datasets (SI Table 1); nevertheless, they still reach a mean F1 score of 0.481. In contrast, molecules with at least one nitrogen atom (mean F1 score of 0.622) or one oxygen atom (mean F1 score of 0.641) more closely resemble the global F1 score distribution of Fig. 3A, as expected given their larger representation in the training dataset.
| Typology | Mean | Median | St. dev. |
|---|---|---|---|
| Atomic species | | | |
| Oxygen | 0.641 | 0.655 | 0.093 |
| Nitrogen | 0.622 | 0.633 | 0.096 |
| Fluorine | 0.481 | 0.488 | 0.108 |
| Chirality | | | |
| Chiral | 0.667 | 0.675 | 0.074 |
| Not chiral | 0.584 | 0.596 | 0.103 |
We further examine the distribution of F1 scores for chiral and achiral molecules in the test dataset, as presented in Fig. 3C and summarized in Table 6. Notably, the model exhibits stronger predictive performance on chiral molecules than on achiral ones. This trend is partially attributed to the underrepresentation of achiral compounds in the training set (see the SI), but also to the fact that chirality is explicitly included as a molecular input feature. Moreover, the model captures the influence of enantiomeric inversion on Raman spectra, reproducing the differences between the Raman modes of enantiomers, as illustrated in SI Fig. S9 and S10. We mainly attribute this result to the comprehensive set of descriptors used to represent molecules, which combines local features through GINE layers with global features through Morgan and Daylight fingerprints. This combination allows the model to learn the subtle structural relations that produce spectral differences under enantiomeric inversion.
Additional analyses of model performance concerning other molecular properties are provided in SI Fig. S4–S10.
To qualitatively assess the predictive performance of the model, Fig. 4 compares the DFT-calculated and Mol2Raman-predicted spectra for four representative molecules at the 80th, 60th, 40th and 20th percentiles of the distribution of F1 scores with 15 cm−1 tolerance. As shown in Fig. 4, the predicted spectra at the 80th and 60th percentiles closely match the DFT-calculated spectra, demonstrating excellent agreement in both the fingerprint and C–H stretching regions. Even at the 40th and 20th percentiles, where performance slightly drops, the model still predicts most of the significant peaks, with only a few peaks missed or shifted by more than the 15 cm−1 tolerance. This consistency across performance levels highlights the robustness and reliability of the Mol2Raman model in capturing both prominent and subtle spectral characteristics, showing strong predictive capability for both peak localization and spectral shape. Additional comparisons between DFT-calculated and Mol2Raman-predicted spectra for these percentile thresholds are shown in SI Fig. S11–S14.
To probe how Mol2Raman arrives at its predictions, we performed peak-conditioned Integrated Gradients (IG) on the graph inputs and visualized atom- and bond-level relevance maps.71 For a target peak j (chosen as the largest peak in either the fingerprint or the C–H region), we integrated the gradient of the scalar output ŷj from a zero baseline to the true input; atom scores are obtained by summing absolute attributions over node-feature channels, and bond scores by averaging the scores of their two incident atoms. Scores are normalised per molecule and reported on a scale of 0 to 1. Fig. 5 illustrates this analysis for the shown molecule: panel (A) reports the Mol2Raman prediction and the DFT-calculated spectrum, whereas panels (B) and (C) display the relative attributions from the peak–conditioned IG analysis, respectively, for the C–H and the fingerprint regions. As shown in Fig. 5B, in the C–H stretching region the most intense peak is captured almost entirely by the local C–H environment: attributions concentrate on carbon atoms and their adjacent C–H/C–C bonds, consistent with CH2/CH3 stretching. In contrast, Fig. 5C shows attributions distributed across the entire molecular scaffold, indicating that the model relies on global, molecule-wide patterns typical of fingerprint vibrations. Full methods and an additional example are provided in SI Note 3 and Fig. S15.
Fig. 5 Peak-conditioned integrated gradient attributions for the most intense C–H and fingerprint peaks. (A) Mol2Raman- and DFT-calculated spectra for the molecule shown in (B) (C–H region) and (C) (fingerprint region); atom and bond colors denote relative attribution (scaled to [0, 1]) from the peak-conditioned IG analysis. The localized attributions on C–H bonds in (B) corroborate the well-known prominence of C–H vibrations for Raman activity near 3000 cm−1,72 whereas the broadly elevated attribution across the molecule in (C) supports a delocalized, concerted vibration consistent with fingerprint-region modes.66
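The peak-conditioned attribution procedure described above can be sketched on a toy problem. Everything here is an assumption made for illustration: `toy_peak_output` is a hypothetical stand-in for the scalar model output of the target peak (it is not the Mol2Raman network), gradients are taken by finite differences rather than automatic differentiation, the path integral is approximated by a midpoint Riemann sum, and scores are normalized by dividing by the largest atom score, which is one possible reading of "normalised per molecule".

```python
from typing import Callable, List, Sequence, Tuple

Matrix = List[List[float]]  # one row per atom, one column per node-feature channel


def toy_peak_output(x: Matrix) -> float:
    """Hypothetical stand-in for the scalar output of one target peak."""
    return sum((a + 1) * v * v + 0.5 * v for a, row in enumerate(x) for v in row)


def numerical_gradient(f: Callable[[Matrix], float], x: Matrix, eps: float = 1e-5) -> Matrix:
    """Central finite-difference gradient of f at x (a real model would use autograd)."""
    point = [list(row) for row in x]  # work on a copy so x is never modified
    grad = [[0.0] * len(row) for row in point]
    for a, row in enumerate(point):
        for c, value in enumerate(row):
            row[c] = value + eps
            upper = f(point)
            row[c] = value - eps
            lower = f(point)
            row[c] = value  # restore the original entry
            grad[a][c] = (upper - lower) / (2 * eps)
    return grad


def integrated_gradients(f: Callable[[Matrix], float], x: Matrix, steps: int = 32) -> Matrix:
    """Integrated Gradients from an all-zero baseline via a midpoint Riemann sum."""
    avg_grad = [[0.0] * len(row) for row in x]
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [[alpha * v for v in row] for row in x]  # straight path from 0 to x
        grad = numerical_gradient(f, point)
        for a, row in enumerate(grad):
            for c, g in enumerate(row):
                avg_grad[a][c] += g / steps
    # Attribution = (input - baseline) * path-averaged gradient; the baseline is zero.
    return [[v * g for v, g in zip(x_row, g_row)] for x_row, g_row in zip(x, avg_grad)]


def atom_and_bond_scores(attr: Matrix, bonds: Sequence[Tuple[int, int]]) -> Tuple[List[float], List[float]]:
    """Atom score: sum of |attribution| over channels; bond score: mean of its two atoms."""
    atom = [sum(abs(v) for v in row) for row in attr]
    top = max(atom) or 1.0  # avoid division by zero when every attribution vanishes
    atom = [s / top for s in atom]  # scale to [0, 1] within the molecule
    bond = [(atom[i] + atom[j]) / 2 for i, j in bonds]
    return atom, bond


features = [[1.0, 0.0], [0.5, 2.0], [0.0, 0.0]]  # 3 atoms x 2 channels, illustrative
bonds = [(0, 1), (1, 2)]
attributions = integrated_gradients(toy_peak_output, features)
atom_scores, bond_scores = atom_and_bond_scores(attributions, bonds)
print(atom_scores, bond_scores)
```

A useful sanity check for any IG implementation is the completeness property: the attributions should sum to the difference between the output at the input and the output at the baseline.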
As shown in Fig. 6A and Table 7, the Mol2Raman model significantly outperforms the Tanimoto-based benchmark. The mean F1 score for Mol2Raman is 0.642, which is 81% higher than the mean F1 score of 0.355 obtained with the weighted Tanimoto model. A Mann–Whitney test also shows that the F1-score distribution of Mol2Raman is significantly shifted towards larger values than that of the Tanimoto model, with a p-value lower than 10−8 (ref. 74). This first comparison is motivated by the intuitive idea that the Raman spectrum of a molecule could be reasonably approximated by the average of the spectra of its most similar molecules. However, this analysis shows that this intuition leads to unreliable predictions. Raman modes do not simply follow chemical similarity but arise from more subtle and complex structure–spectrum relationships that Mol2Raman captures better than this benchmark model.75 The results in Table 7 show that the advantage of Mol2Raman over the Tanimoto-based benchmark also holds across all the other tolerance windows, and that the difference in F1 score increases as the tolerance window widens.
| Metric | F1 tol. 10 cm−1 | F1 tol. 15 cm−1 | F1 tol. 20 cm−1 |
|---|---|---|---|
| Mol2Raman | | | |
| Mean | 0.565 | 0.642 | 0.713 |
| Median | 0.576 | 0.656 | 0.727 |
| St. dev. | 0.094 | 0.092 | 0.088 |
| Tanimoto benchmark | | | |
| Mean | 0.353 | 0.355 | 0.356 |
| Median | 0.353 | 0.355 | 0.356 |
| St. dev. | 0.057 | 0.056 | 0.056 |
| Chemprop benchmark | | | |
| Mean | 0.347 | 0.391 | 0.428 |
| Median | 0.346 | 0.386 | 0.426 |
| St. dev. | 0.076 | 0.079 | 0.085 |
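The Mann–Whitney comparison of two F1-score distributions can be sketched as follows. This is a simplified one-sided version using the normal approximation without tie or continuity corrections, so its p-values will differ slightly from those of a statistics package such as `scipy.stats.mannwhitneyu`. The function name and the score lists are hypothetical and are not the scores behind Table 7.

```python
import math
from typing import Sequence, Tuple


def mann_whitney_greater(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """U statistic of x and one-sided p-value for 'x tends to be larger than y'."""
    n1, n2 = len(x), len(y)
    # U counts the (x, y) pairs in which the x value is larger; ties count one half.
    u = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in x for b in y)
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)  # no tie correction (assumption)
    z = (u - mean_u) / sd_u
    p_value = 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail normal probability
    return u, p_value


# Illustrative per-molecule F1 scores only.
model_scores = [0.62, 0.70, 0.66, 0.58, 0.71, 0.64]
benchmark_scores = [0.35, 0.40, 0.33, 0.37, 0.36, 0.60]
u_stat, p = mann_whitney_greater(model_scores, benchmark_scores)
print(u_stat, p)
```

The pairwise loop is quadratic in the sample size, which is acceptable for a sketch; rank-based implementations are preferable for test sets with thousands of molecules.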
To further examine the generalization capabilities of the model on structurally novel compounds, we evaluated both models on a subset of the test dataset consisting of 425 structurally diverse molecules with a Tanimoto similarity of less than 0.6 to every molecule in the training dataset. Fig. 6B illustrates that Mol2Raman again consistently outperforms the benchmark on these compounds, achieving a mean F1 score of 0.568 compared to 0.309 for the benchmark (Table 8). In relative terms, the performance gap between Mol2Raman and the Tanimoto-based model widens further on the low-similarity dataset. This result underlines the inherent limitation of relying solely on molecular similarity for spectral prediction. In contrast, Mol2Raman effectively captures complex molecular interactions, making it a more robust and scalable solution for Raman spectra prediction.
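The selection of the low-similarity subset can be sketched as a filter on the maximum Tanimoto similarity to the training set. The sketch assumes that each fingerprint is available as a set of on-bit indices; which fingerprint type was used for this filter is not stated in this section, and the molecule names and bit sets below are hypothetical.

```python
from typing import Dict, List, Set


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto (Jaccard) similarity between two fingerprints given as on-bit sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def low_similarity_subset(
    test_fps: Dict[str, Set[int]],
    train_fps: List[Set[int]],
    threshold: float = 0.6,
) -> List[str]:
    """Keep test molecules whose highest similarity to any training molecule is below threshold."""
    keep = []
    for name, fp in test_fps.items():
        max_sim = max((tanimoto(fp, t) for t in train_fps), default=0.0)
        if max_sim < threshold:
            keep.append(name)
    return keep


# Illustrative fingerprints only.
train = [{1, 2, 3, 4}, {2, 3, 5}]
test = {"mol_a": {1, 2, 3, 4, 6}, "mol_b": {2, 7, 8, 9}}
print(low_similarity_subset(test, train))
```

In practice the bit sets would come from a cheminformatics toolkit, and the nearest-neighbour search would be vectorized for a training set of this size.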
The comparison with Chemprop is performed using the F1 score with a tolerance of 15 cm−1, evaluating both models on the entire test set and on the subset of molecules with a Tanimoto similarity lower than 0.6, as done for the Tanimoto benchmark. As shown in Fig. 6C, the Mol2Raman model outperforms Chemprop across the full test dataset. The mean F1 score for Mol2Raman is 0.642, compared to Chemprop's 0.391, a 64% improvement. The F1 score for the Chemprop models is calculated using a peak prominence threshold of 0.05, so that peaks originating from noise are excluded from the calculation. A Mann–Whitney U test confirms that this difference is statistically significant (p < 10−6), and the F1 score distribution of Mol2Raman is skewed towards higher values, indicating its superior ability to identify Raman-active frequencies. This better performance can be explained by the specialized architecture of Mol2Raman, designed to handle the spectral sparsity and peak-specific information inherent in Raman spectra. Table 7 reports a comprehensive comparison of Mol2Raman and Chemprop for the different F1 score tolerances, showing that Mol2Raman outperforms Chemprop for each tolerance window considered.
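The role of the prominence threshold can be illustrated with a small peak-picking sketch. It assumes a spectrum given as a list of intensities on a uniform grid, scaled so that a threshold of 0.05 is meaningful, and it ignores flat-topped peaks. The source does not state which peak-finding tool was used; `scipy.signal.find_peaks` with its `prominence` argument offers a comparable filter. The function names and the example spectrum are hypothetical.

```python
from typing import List, Sequence


def prominence(y: Sequence[float], i: int) -> float:
    """Height of peak i above the higher of the two lowest points that bound it."""
    left_min = right_min = y[i]
    j = i
    # Walk left until a sample higher than the peak (or the edge) is reached.
    while j > 0 and y[j - 1] <= y[i]:
        j -= 1
        left_min = min(left_min, y[j])
    k = i
    # Walk right in the same way.
    while k < len(y) - 1 and y[k + 1] <= y[i]:
        k += 1
        right_min = min(right_min, y[k])
    return y[i] - max(left_min, right_min)


def pick_peaks(y: Sequence[float], min_prominence: float = 0.05) -> List[int]:
    """Indices of strict local maxima whose prominence reaches the threshold."""
    return [
        i
        for i in range(1, len(y) - 1)
        if y[i - 1] < y[i] > y[i + 1] and prominence(y, i) >= min_prominence
    ]


# Illustrative spectrum: two real peaks plus small noise ripples.
spectrum = [0.0, 0.02, 0.01, 0.03, 0.9, 0.1, 0.12, 0.1, 0.5, 0.0]
print(pick_peaks(spectrum))
```

With the threshold set to zero, the two small ripples would also be reported as peaks, which would inflate the number of predicted peaks and lower the precision.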
The comparison on the low-similarity dataset is reported in Fig. 6D and Table 8. Fig. 6D demonstrates that Mol2Raman maintains its advantage on unseen molecular structures, with a mean F1 score of 0.568 compared to Chemprop's 0.412. This 38% improvement underlines the robustness of Mol2Raman in generalizing to novel chemical spaces. Interestingly, the Chemprop model performs slightly better on the low-similarity subset (Tanimoto similarity <0.6) than on the full test dataset. One possible explanation for this counterintuitive result is that Chemprop overfits to molecular structures prevalent in the training dataset, which limits its ability to generalize effectively within structurally similar groups in the full test dataset. The more structurally diverse molecules in the filtered dataset could reduce this bias, enabling the model to generalize better. However, even with this modest improvement, Mol2Raman consistently outperforms Chemprop across both datasets, highlighting its superior capability to model Raman spectra and generalize across varying chemical spaces.
| Metric | F1 tol. 10 cm−1 | F1 tol. 15 cm−1 | F1 tol. 20 cm−1 |
|---|---|---|---|
| Mol2Raman | | | |
| Mean | 0.491 | 0.568 | 0.639 |
| Median | 0.500 | 0.576 | 0.655 |
| St. dev. | 0.107 | 0.107 | 0.105 |
| Tanimoto benchmark | | | |
| Mean | 0.306 | 0.309 | 0.311 |
| Median | 0.311 | 0.314 | 0.316 |
| St. dev. | 0.050 | 0.050 | 0.049 |
| Chemprop benchmark | | | |
| Mean | 0.353 | 0.412 | 0.468 |
| Median | 0.356 | 0.415 | 0.467 |
| St. dev. | 0.082 | 0.077 | 0.079 |
The observed performance gap between Mol2Raman and Chemprop can be explained by the differing design philosophies of the two models. Chemprop was originally optimized for IR spectra, which are inherently denser and smoother than Raman spectra. In contrast, Mol2Raman was specifically engineered to model the sparse and peak-oriented nature of Raman spectra, effectively capturing both peak localization and relative intensities. Mol2Raman's GINE layers, combined with traditional chemical descriptors, allow for a richer encoding of molecular properties, enabling more accurate peak prediction. In contrast, Chemprop's standard MPNN architecture struggles to capture the nuanced spectral features inherent in Raman spectroscopy, particularly in the fingerprint region where structural complexity is more pronounced.
A key advantage of this web-based platform is its integration of visualization tools that allow users to explore and interpret the predicted spectra effectively. This enhances the interpretability of computational predictions, facilitating their adoption in experimental workflows. The model is optimized to provide fast and accurate spectral predictions, making it suitable for both small-scale academic research and high-throughput industrial applications.79 This democratization of access is particularly relevant for researchers working in disciplines such as drug discovery, materials science, and process chemistry, where rapid molecular analysis is essential for decision-making.1,25
The development of this web-based interface exemplifies how cutting-edge machine learning models can be translated into practical, accessible tools that enhance scientific discovery. Beyond its immediate application, this platform lays the foundation for future expansions, including the integration of additional spectroscopic techniques and more sophisticated analytical capabilities. By providing an intuitive and interactive framework, Mol2Raman not only demonstrates the feasibility of deep learning-based Raman spectrum prediction but also highlights the broader potential of artificial intelligence in transforming computational chemistry into a more accessible and impactful discipline.
Trained on a dataset of over 31 000 molecules with DFT-calculated Raman spectra, Mol2Raman represents a significant step forward in integrating deep learning approaches into molecular spectroscopy.
The performance of Mol2Raman is rigorously evaluated against two benchmarks: a Chemprop-based model adapted for Raman spectral prediction and a Tanimoto similarity-based model. Across all F1 score tolerance windows (10, 15, and 20 cm−1), Mol2Raman consistently outperforms both benchmarks. Notably, the model achieves a mean F1 score of 0.642 with a 15 cm−1 tolerance, substantially surpassing the Chemprop model (0.391) and the Tanimoto benchmark (0.355). This consistent outperformance is observed not only on the full test set but also on the more challenging subset of structurally novel molecules (Tanimoto similarity <0.6), highlighting the model's robust generalization capabilities.
One of the most impressive aspects of Mol2Raman is its ability to generalize beyond structurally similar compounds. The model's superior performance on the low-similarity subset emphasizes its capacity to capture complex structure–spectrum relationships that cannot be adequately modeled by simpler similarity-based approaches. This capability is particularly important for real-world applications, where new or previously unseen molecules are frequently encountered.
Despite its large number of parameters (157 million), comprising the two models and the two spectral windows for each model, Mol2Raman achieves a fast inference time (22 ms per molecule). This efficiency enables large-scale molecular screening in drug discovery, materials optimization, or spectroscopic probe design, where rapid pre-selection of candidates based on predicted vibrational fingerprints is essential, as shown by Stokes et al.25 Our model goes exactly in this direction, facilitating and speeding up this first screening. Starting from the resulting shortlist, more detailed DFT calculations or experimental measurements can then be performed to identify the final optimal molecules. This efficiency stems from the model's architecture, which effectively uses its higher capacity to learn and generalize complex molecular features without introducing significant computational overhead.
The development of a user-friendly web application further extends the impact of this work. By deploying Mol2Raman through an accessible web interface, it is possible to easily generate high-quality Raman spectra predictions without the need for specialized hardware or advanced computational skills, enhancing the model's practical utility.
While Mol2Raman presents a significant advancement in Raman spectrum prediction, some limitations remain. The current model is trained exclusively on DFT-calculated spectra, which, while accurate, may not fully capture experimental complexities such as instrumental noise and environmental effects. However, this DFT-based training may also be seen as a strong pretraining stage that can be adapted to experimental data through transfer learning. Fine-tuning on curated experimental Raman spectra should help close the gap between simulation and measurement by accounting for baseline backgrounds, instrument response, and temperature effects, as also shown in Chemprop-IR.76 In practice, this can be achieved by freezing the early GINE layers and updating only the final blocks and readout with a small learning rate, thereby preserving the information learned from DFT while aligning predictions to experimental variability. We therefore expect that incorporating experimental spectra into the training process could further improve the model's robustness and real-world applicability. Additionally, although the model shows excellent performance in predicting peak positions and intensities, expanding its capabilities to predict other spectroscopic properties or extending its application to molecular datasets composed of more atomic species could further enhance its utility. Exploring hybrid models that combine data-driven approaches with physical constraints could also offer new avenues for improving spectral predictions.
In summary, Mol2Raman represents a significant advancement in applying machine learning to spectroscopic analysis. By combining an innovative GNN-based architecture with strong predictive performance and computational efficiency, the model provides a practical and scalable solution for molecular spectroscopy. The additional deployment as a web application further broadens its accessibility, enabling seamless integration into research and industrial workflows. Beyond its immediate applications, this work lays the foundation for future developments aimed at expanding the model's capabilities. Incorporating experimental spectra into training could enhance robustness, while extending its framework to other spectroscopic techniques would further increase its utility. With continued open-source refinement of the model, Mol2Raman has the potential to accelerate discoveries across materials science, pharmaceuticals, and chemical engineering, contributing to molecular design and diagnostics in an increasingly data-driven era.
776 molecules performed using the ORCA software, including the Raman and IR active modes, and the actual dataset used to train the model. The code for the architecture and training of Mol2Raman can be found at https://doi.org/10.5281/zenodo.17654620 for the version used in this work, and at https://github.com/salvasorrentino/Mol2Raman.git for any future updates. In addition, the code for the webapp is available at https://doi.org/10.5281/zenodo.17654690 and at https://github.com/vibralab/mol2raman_webapp.git.
Supplementary information is available. See DOI: https://doi.org/10.1039/d5dd00210a.
Footnote
† Current address: Laser Biomedical Research Center, G. R. Harrison Spectroscopy Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts, 02139, USA; E-mail: ssorrent@mit.edu.
This journal is © The Royal Society of Chemistry 2025