Open Access Article
Martin Priessner,*a Richard J. Lewis,b Magnus J. Johansson,a Jonathan M. Goodman,c Jon Paul Janetd and Anna Tomberg*a
a Medicinal Chemistry, Research and Early Development, Cardiovascular, Renal and Metabolism (CVRM), BioPharmaceuticals R&D, AstraZeneca, Pepparedsleden 1, 43183 Mölndal, Sweden. E-mail: martin.priessner@gmail.com; anna.tomberg@astrazeneca.com
b Department of Medicinal Chemistry, Research & Early Development, Respiratory & Immunology, BioPharmaceuticals R&D, AstraZeneca, Pepparedsleden 1, 43183 Mölndal, Sweden
c Centre for Molecular Informatics, Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge CB2 1EW, UK
d Molecular AI, Discovery Sciences, R&D, AstraZeneca, Pepparedsleden 1, 43183 Mölndal, Sweden
First published on 11th February 2026
We introduce a novel workflow integrating reasoning-capable language models with specialized chemical analysis tools to enhance molecular structure determination using nuclear magnetic resonance spectroscopy. Generally, structure elucidation involves generating candidate molecular structures, comparing their predicted spectral features to experimental data, and identifying the best-fitting structure. Our workflow systematically generates diverse molecular candidates through chemical synthesis predictions, regioisomer exploration, and direct spectral-based methods. The language model bridges the gap between quantitative data and chemical insight by evaluating candidates through a reasoning process that analyzes spectral evidence, explains discrepancies, and assesses overall structural plausibility, moving beyond simple numerical error. This LLM-driven reasoning stage proved crucial, increasing correct top-ranked structure identification accuracy by 26.4%. Simulated spectral data with introduced noise artifacts and solvent peaks further highlighted the robustness of our method, showing accuracy improvements by 35.3%. The language model's confidence scores effectively correlated with prediction accuracy, facilitating efficient triage of results. While currently focused on HSQC data, this framework offers a flexible foundation for next-generation structure elucidation tools combining chemical expertise with advanced reasoning capabilities.
Computer-Assisted Structure Elucidation (CASE) systems have emerged as valuable tools to accelerate this process by generating candidate structures based on spectral constraints and ranking them according to predicted versus experimental spectral deviations.5–10 However, these systems often struggle with interpretability, especially in cases where multiple candidate structures align similarly well with spectral data.11 The lack of clarity regarding how individual spectral features contribute to structural assignments makes it difficult for chemists to critically evaluate and refine these automated proposals, limiting the practical utility of CASE in ambiguous or complex structural elucidation scenarios. Rather than replacing these established tools, our work explores how reasoning-capable LLMs can provide a complementary analysis layer applicable to candidates from any source — whether generated by commercial CASE systems or specialized ML models.
Recent years have witnessed the rise of Large Language Models (LLMs) such as GPT-4,12 Claude13 and Gemini,14,15 which have demonstrated remarkable capabilities across various domains, including chemistry. These models have evolved to become increasingly multimodal,16,17 capable of analyzing text, images, and structured data simultaneously, making them particularly promising for spectroscopic analysis.
In the field of chemistry, LLMs have demonstrated notable results across various tasks, from property prediction to reaction planning.18,19 Two distinct approaches have emerged for applying LLMs to chemical problems. The first focuses on domain-specific fine-tuning, exemplified by ChemLLM,20 which achieves GPT-4-comparable performance across essential chemistry tasks through specialized training. The second approach employs multi-agent systems with foundation models, as demonstrated by ChemCrow,21 which coordinates multiple specialized LLM instances and leverages tool-calling capabilities to interface with chemical software for complex tasks like synthesis planning and reaction prediction. The success of these multi-agent systems in integrating different types of chemical information suggests a promising direction for structure elucidation, where various spectroscopic data must be analyzed in concert.
Particularly relevant to structure elucidation is the recent advancement in LLMs' reasoning capabilities. Advanced prompting techniques, such as Chain of Thought (CoT) reasoning22 and step-by-step problem-solving approaches, have enabled LLMs to tackle complex tasks by breaking them down into manageable steps. This capability has been further enhanced in recent models like OpenAI's o1/o3, Google's Gemini-Thinking, and DeepSeek's open-source R1 model,23 which use reinforcement learning to improve their reasoning processes. These models can systematically evaluate problems, consider alternative approaches, and even backtrack when necessary, closely mirroring the analytical process of expert chemists in structure elucidation. While these reasoning-focused models have not yet been extensively applied to chemistry tasks, their ability to provide clear reasoning paths and step-by-step analysis makes them particularly promising for structure determination, where understanding the logic behind structural assignments is crucial.
In our workflow, we specifically harness these reasoning capabilities to perform the critical evaluation step. The LLM is tasked not just with processing data, but with interpreting spectral patterns, identifying inconsistencies, and constructing a logical argument for or against each candidate structure. Our method needs a guess structure and the corresponding spectra (1H, 13C, HSQC, COSY and MS) as input. The structure elucidation process consists of three main stages: candidate generation, spectral analysis, and LLM-driven reasoning.
For candidate generation, we implemented three complementary approaches. First, Chemformer24–26 generates synthetic analogues via retrosynthesis and forward reaction predictions, grounding structural predictions in synthetic feasibility. Second, Mol2Mol27–30 systematically explores regioisomers and analogues, emphasizing structural diversity. Finally, MultiModalSpectralTransformer (MMST)31 directly derives candidate structures from spectral patterns, dynamically adapting to the chemical context through on-the-fly fine-tuning.
For the spectral analysis stage, we extend our previous work32 on atom-specific HSQC peak matching by quantitatively assessing carbon–hydrogen connectivity predictions of the generated analogue molecules against experimental spectra. This approach provides precise structural validation at the atomic level by explicitly identifying structural mismatches for individual carbon–hydrogen bonds. For example, when a predicted methylene group (CH2) shows a simulated HSQC peak at δH 2.7 ppm, δC 32 ppm, but the experimental spectrum reveals a significantly different chemical shift environment (δH 3.5 ppm, δC 45 ppm), this indicates a local structural error: perhaps the carbon is adjacent to an electronegative atom rather than being in a purely aliphatic environment. Such atom-level analysis complements global similarity measures with essential local structural information that can pinpoint specific regions where candidate structures deviate from reality. HSQC peak matching evaluates the similarity of two spectra by calculating the error between their peaks. The resulting error values are used to rank candidate structures by how well they fit the experimental data.
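The peak-matching idea can be illustrated with a minimal sketch. The text does not specify the matching algorithm or the error definition, so everything here is an assumption made for illustration: the function name `hsqc_matching_error`, the greedy nearest-peak assignment, and the `c_scale` weighting of the 13C axis. The second candidate reuses the methylene example from the paragraph above.

```python
import math


def hsqc_matching_error(predicted, experimental, c_scale=0.1):
    """Greedy nearest-peak matching between two HSQC peak lists.

    Each peak is a (delta_H, delta_C) tuple in ppm. The 13C axis is
    multiplied by `c_scale` (an assumed weighting) so that both axes
    contribute on a comparable scale.
    Returns (mean_error, per_peak_errors).
    """
    remaining = list(experimental)
    errors = []
    for h_pred, c_pred in predicted:
        if not remaining:
            break  # more predicted peaks than experimental peaks
        # Scaled distance to every experimental peak not yet matched.
        dists = [math.hypot(h_pred - h, (c_pred - c) * c_scale)
                 for h, c in remaining]
        best = min(range(len(dists)), key=dists.__getitem__)
        errors.append(dists[best])
        remaining.pop(best)
    mean_error = sum(errors) / len(errors) if errors else float("inf")
    return mean_error, errors


# Illustrative peak lists (not taken from the study's data set).
experimental = [(3.5, 45.0), (7.2, 128.0)]
candidates = {
    "plausible": [(3.4, 44.0), (7.3, 127.5)],
    "wrong_CH2": [(2.7, 32.0), (7.3, 127.5)],  # methylene example from the text
}
ranking = sorted(candidates,
                 key=lambda name: hsqc_matching_error(candidates[name], experimental)[0])
print(ranking)
```

The per-peak errors returned alongside the mean are what make the analysis atom-specific: the first entry for `wrong_CH2` is large while the second is small, which localizes the mismatch to the methylene position.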
For the final LLM-driven analysis stage, we employ a two-step process. Initially, Claude 3.5 Sonnet's multimodal capabilities are used to evaluate molecular images alongside spectral data, performing preliminary assessments of structural consistency. Subsequently, DeepSeek-R1 applies chemical reasoning to systematically assess all accumulated evidence. It moves beyond the initial HSQC ranking by interpreting the significance of specific spectral features, weighing evidence for and against each structure, generating confidence scores grounded in this analysis, identifying regions of uncertainty, and providing detailed, step-by-step explanations. The workflow, illustrating these stages and their interconnections, is presented in Fig. 1.
To evaluate these approaches, we used different inputs depending on the method. For Mol2Mol, we used only the target molecule structure as input. For the Chemformer approach, the target molecule was retrosynthetically broken down, and the resulting synthetic pathways were then used to produce candidate structures. For MMST, we utilized both simulated NMR spectral data and initial structural information. For this initial comparison, performance was based solely on whether the correct structure was present within the generated pool of candidates (see Methodology for the specific number of considered molecules for each method).
Our evaluation revealed distinct performance patterns that highlight the complementary nature of these approaches across different experimental scenarios (Fig. 2). This evaluation was conducted on our test set of 34 diverse organic molecules, described in the methodology section, for which we had both the correct structures and complete experimental data. Each method demonstrated unique strengths and limitations that proved valuable in different contexts of the structure elucidation process.
With correct initial structures as input, Chemformer achieved perfect performance (100.0%), successfully recovering all original molecules through its synthesis prediction pathway. The MMST-driven generation likewise reached 100.0% accuracy without requiring structural biasing, demonstrating its ability to derive correct structures directly from spectral features. In contrast, the Mol2Mol approach generated no correct molecules (0.0%) in this scenario, which aligns with its design principles—Mol2Mol intentionally uses the target structure as its starting point for generating structural variations and is not configured to reproduce its input structure.
Performance dynamics shifted dramatically when incorrect regioisomeric structures were provided as starting points. Chemformer completely failed to recover correct structures (0.0%), revealing its inability to reconstruct correct connectivity patterns from incorrect starting hypotheses. In contrast, the MMST-driven approach maintained nearly identical performance (97.1%), demonstrating robustness to input quality. However, we note that MMST is not fully de novo—it benefits from approximate structural context to guide its on-the-fly fine-tuning cycle and is therefore best suited for structure verification and regioisomer discrimination rather than true de novo elucidation from spectral data alone. The Mol2Mol approach demonstrated modest recovery capability (11.8%), successfully transforming some incorrect structures into correct ones through its systematic structural modification methodology.
These results, obtained from simulated NMR data, highlight the complementary capabilities that justify our multi-pronged approach to structure generation. Chemformer excels when provided with reliable starting structural information but fails with incorrect initial hypotheses; MMST delivers consistent performance regardless of starting conditions, serving as a robust backbone for spectral-based structure prediction; and Mol2Mol, despite its relatively weak performance in direct structure recovery tasks, still adds value by systematically expanding the chemical space with diverse regioisomers and structural variants that maintain molecular weight constraints but explore alternative connectivity patterns. Importantly, these methods are designed for complementary failure modes: MMST is probabilistic and not guaranteed to succeed in all cases, while Chemformer deterministically generates synthetically feasible products given correct chemistry. This distribution of strengths across methods enables our pipeline to address a broader range of structure elucidation challenges than any single approach could manage independently. To capitalize on these diverse generation strategies, the subsequent steps of our workflow operate on a unified candidate pool, created by aggregating the unique molecules generated by Chemformer, Mol2Mol, and MMST. This combined set of structures is then subjected to the ranking and LLM-driven analysis detailed below.
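The aggregation into a unified candidate pool can be sketched as follows. This is a minimal illustration, not the authors' implementation: `build_candidate_pool` is a hypothetical helper, and the default `canonicalize` (whitespace stripping) is only a placeholder. A real pipeline would need a cheminformatics toolkit to produce canonical SMILES so that different string representations of the same molecule collapse to one entry.

```python
def build_candidate_pool(sources, canonicalize=str.strip):
    """Merge candidates from several generators into one de-duplicated pool.

    `sources` maps a generator name to a list of SMILES strings.
    `canonicalize` is a placeholder hook (assumption); swap in a real
    canonical-SMILES function in practice.
    Returns a dict: canonical SMILES -> list of generators that proposed it,
    in first-seen order.
    """
    pool = {}
    for generator, smiles_list in sources.items():
        for smiles in smiles_list:
            key = canonicalize(smiles)
            origins = pool.setdefault(key, [])
            if generator not in origins:
                origins.append(generator)
    return pool


# Toy inputs for illustration only.
sources = {
    "Chemformer": ["CCO", "CCN"],
    "Mol2Mol": ["CCN", "COC"],
    "MMST": ["CCO ", "COC", "CCC"],
}
pool = build_candidate_pool(sources)
for smiles, origins in pool.items():
    print(smiles, origins)
```

Keeping the list of originating generators per molecule is a design choice of this sketch; it makes it easy to see afterwards which method contributed the correct structure.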
Our experimental design explored five key conditions to test the robustness of both approaches: simulated data with target structures (Sim Target), simulated data with analogue structures (Sim Analogue) where attachment points of functional groups were systematically modified, simulated data with target structures and deliberately introduced spectral artifacts (Sim Target + Noise) including both a DMSO solvent peak (δH 2.52 ppm, δC 39.57 ppm) and an additional noise peak (δH 3.25 ppm, δC 103.42 ppm), experimental data with target structures (Exp Target), and experimental data with analogue structures (Exp Analogue). This design allowed us to systematically evaluate performance across increasing levels of real-world complexity.
A critical metric for practical structure elucidation is whether the top-ranked candidate is the correct structure, as this determines the system's ability to autonomously identify molecules without expert intervention. Fig. 3 illustrates how effectively DeepSeek-R1 converts HSQC top-5 predictions into accurate top-1 rankings, a key capability for automated structure determination.
Under ideal conditions with simulated NMR data and target structures (Sim Target), the baseline HSQC peak matching demonstrated strong performance with a top-1 accuracy of 85.3% (29/34 molecules). Applying DeepSeek-R1's reasoning-based analysis to re-rank the top HSQC candidates further improved this to 94.1% (32/34 molecules), showing modest but meaningful gains even in this favorable scenario. With simulated data and analogue initial structures (Sim Analogue), re-ranking gave no further gain: both HSQC matching and DeepSeek-R1 reached 85.3% top-1 accuracy.
The most significant performance differentials emerged when analyzing imperfect spectral data, a crucial test for real-world applicability. For simulated data containing additional noise peaks (Sim Target + Noise), the baseline HSQC peak matching approach's top-1 accuracy declined dramatically to 50.0% (17/34 molecules), while DeepSeek-R1 maintained robust performance with 85.3% accuracy (29/34 molecules). This improvement of 35.3 percentage points shows that the LLM's chemical reasoning copes with spurious peaks that mislead purely numerical matching. The advantages of LLM enhancement were also pronounced with experimental NMR data (Exp Target), where the baseline HSQC peak matching approach achieved only 41.2% top-1 accuracy (14/34 molecules), while DeepSeek-R1 substantially improved this to 67.6% (23/34 molecules). This increase of 26.4 percentage points highlights the LLM's effectiveness in navigating the variability and complexity of real-world spectral data.
In the most challenging scenario—experimental data with analogue initial structures (Exp Analogue)—baseline HSQC peak matching achieved a modest 20.6% top-1 accuracy (7/34 molecules). Even here, the DeepSeek-R1 approach provided improvement, reaching 23.5% (8/34 molecules) accuracy in this particularly difficult context.
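The reported top-1 accuracies and gains follow directly from the molecule counts, as the short sketch below reproduces. The counts are those quoted in the text; the Sim Analogue counts are inferred (29 is the only count out of 34 that rounds to 85.3%), and the helper names are illustrative.

```python
N_MOLECULES = 34

# (baseline HSQC top-1 count, DeepSeek-R1 top-1 count) per condition
TOP1_COUNTS = {
    "Sim Target": (29, 32),
    "Sim Analogue": (29, 29),  # inferred from the reported 85.3%
    "Sim Target + Noise": (17, 29),
    "Exp Target": (14, 23),
    "Exp Analogue": (7, 8),
}


def percent(count, total=N_MOLECULES):
    """Top-1 accuracy in percent, rounded to one decimal as in the text."""
    return round(100 * count / total, 1)


def summarize(counts=TOP1_COUNTS):
    """Return condition -> (baseline %, LLM %, gain in percentage points)."""
    table = {}
    for condition, (baseline, llm) in counts.items():
        base_pct, llm_pct = percent(baseline), percent(llm)
        table[condition] = (base_pct, llm_pct, round(llm_pct - base_pct, 1))
    return table


summary = summarize()
for condition, (base_pct, llm_pct, gain) in summary.items():
    print(f"{condition:<20} {base_pct:5.1f} -> {llm_pct:5.1f}  (+{gain:.1f} points)")
```

The gains are differences between the rounded percentages, which is how the 26.4-point figure for Exp Target arises (67.6 minus 41.2); computed from the raw counts, 9/34 is about 26.5 points.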
Notably, we observed comparable performance across all tested LLM systems, including commercial models such as Claude 3.5 Sonnet, Claude 3.7 Sonnet-Thinking, Gemini 2.0 Flash-Thinking, Kimi 1.5 and o3-mini, as detailed in Fig. S1–S5. The performance consistency across different LLM architectures suggests that structural reasoning capabilities are well-distributed among current state-of-the-art language models. We selected DeepSeek-R1 as our primary model for detailed analysis due to its open-source nature, which offers greater flexibility for future domain-specific fine-tuning. We anticipate that targeted fine-tuning with chemistry-specific data or reinforcement learning could further enhance performance, particularly for the most challenging experimental scenarios.
These results demonstrate that the LLM-enhanced approach adds the most value in precisely the scenarios that challenge traditional methods: when dealing with noisy spectral data, experimental artifacts, or imperfect initial hypotheses. The ability to recover correct structures despite these challenges represents a significant advancement for automated structure elucidation workflows in realistic settings. Furthermore, our reproducibility analysis on a representative subset (N = 6 replicates) confirmed that while the system is generally robust, the non-deterministic nature of the LLM can lead to variations in candidate ranking across independent runs (see Methodology (Reproducibility and stochasticity analysis) and SI Table S1). The results indicate that while the model exhibits high reasoning stability in the majority of cases (with 60% showing zero variance), ranking oscillations occur in scenarios with ambiguous spectral data. This confirms that the workflow is not strictly deterministic and that prediction outcomes can be sensitive to token generation probabilities.
To isolate the impact of multimodal integration, we performed an ablation study (N = 6) removing the visual analysis provided by Claude 3.5 Sonnet. While visual context was critical for solving complex cases (improving rank in 40%), it reduced performance in one instance by inducing ‘over-rationalization,’ where the model prioritized qualitative structural narratives over superior quantitative error scores (see SI Table S2). This specific failure mode—where the AI struggles to appropriately weigh conflicting quantitative and qualitative evidence—contrasts with expert chemical intuition. It underscores that while reasoning models can automate data synthesis, a human-in-the-loop remains essential to adjudicate cases where narrative plausibility diverges from hard spectral metrics.
For each molecule, DeepSeek-R1 assigned confidence scores (0–1) to candidate structures based on spectral evidence analysis and then ranked them from highest to lowest confidence. Fig. 4 shows the distribution of these confidence scores for correct versus incorrect structure predictions at each ranking position.
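The re-ranking step can be sketched as a two-stage sort. This is an illustrative outline only: `rerank_top_k` is a hypothetical helper, `score_fn` stands in for the LLM call (no real API is shown), and the candidate names, errors and confidences below are invented toy values.

```python
def rerank_top_k(candidates, score_fn, k=5):
    """Shortlist the k best candidates by HSQC error, then sort by confidence.

    `candidates` is a list of dicts with keys 'smiles' and 'hsqc_error'.
    `score_fn` is a placeholder for the LLM evaluation (assumption): it takes
    one candidate dict and returns a confidence between 0 and 1.
    """
    shortlist = sorted(candidates, key=lambda c: c["hsqc_error"])[:k]
    scored = [dict(c, confidence=score_fn(c)) for c in shortlist]
    return sorted(scored, key=lambda c: c["confidence"], reverse=True)


# Toy data for illustration only.
toy_candidates = [
    {"smiles": "cand_A", "hsqc_error": 3.2},
    {"smiles": "cand_B", "hsqc_error": 3.7},
    {"smiles": "cand_C", "hsqc_error": 4.1},
    {"smiles": "cand_D", "hsqc_error": 4.3},
    {"smiles": "cand_E", "hsqc_error": 4.5},
    {"smiles": "cand_F", "hsqc_error": 9.9},
]
toy_confidence = {"cand_A": 0.75, "cand_B": 0.85, "cand_C": 0.40,
                  "cand_D": 0.30, "cand_E": 0.20, "cand_F": 0.99}

ranked = rerank_top_k(toy_candidates, lambda c: toy_confidence[c["smiles"]])
print([(c["smiles"], c["confidence"]) for c in ranked])
```

In this toy run `cand_F` never reaches the scoring stage because it falls outside the HSQC top 5, which mirrors the limitation that re-ranking can only recover a correct structure already present in the shortlist.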
Our analysis of the top-1 candidates revealed that correct structures received a higher mean confidence score (0.92) compared to incorrect ones (0.87). While this difference in mean confidence is relatively small and the distributions for correct and incorrect predictions show substantial overlap (as shown in Fig. 4), we do observe a meaningful pattern in how confidence scores correlate with the overall ranking. The separation between correct and incorrect structures generally decreased progressively through positions 2–5, with position 5 containing only a single correct structure among 188 candidates (0.5%). Notably, confidence scores showed a clear stepwise decrease from position 1 (median ∼0.92) to position 5 (median ∼0.20), indicating that DeepSeek-R1's confidence values correlate with ranking position. While the overlap in confidence distributions for correct and incorrect structures at position 1 limits the ability to use confidence scores alone as a reliable differentiator, the overall trend suggests that higher confidence scores are generally associated with better candidates. A combination of confidence thresholds and other metrics would likely be needed to effectively automate the verification process and identify cases requiring expert review.
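One possible way to combine a confidence threshold with a second metric for triage is sketched below. The thresholds (`min_confidence`, `min_margin`) and the choice of the gap between the top two confidences as the second metric are assumptions made for illustration; they were not evaluated in this work.

```python
def triage(confidences, min_confidence=0.9, min_margin=0.1):
    """Decide whether a ranked result can be accepted without expert review.

    `confidences` lists the confidence scores of the ranked candidates,
    highest first. Both thresholds are hypothetical values.
    """
    if not confidences:
        return "expert review"
    top = confidences[0]
    # Margin to the runner-up; with a single candidate the margin is the score itself.
    margin = top - confidences[1] if len(confidences) > 1 else top
    if top >= min_confidence and margin >= min_margin:
        return "auto-accept"
    return "expert review"


print(triage([0.92, 0.60, 0.30]))  # confident and well separated
print(triage([0.92, 0.87]))        # confident but nearly tied
print(triage([0.85, 0.20]))        # well separated but below threshold
```

Given the overlap between the confidence distributions of correct and incorrect top-1 predictions, any such thresholds would need to be calibrated on held-out data before use.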
While baseline HSQC peak matching ranked candidate 5 (the correct structure) last due to the highest numerical error (4.547), DeepSeek-R1 employed a reasoning process that looked beyond this single metric. Instead of being solely driven by the error score, the LLM interpreted the spectral evidence in the context of chemical structure. As its internal analysis reveals, it identified and prioritized diagnostic signals consistent with key structural motifs:
“Candidate 5 has the highest HSQC error (4.547), but… despite the higher error, it aligns well with morpholine groups. The SMILES includes a morpholine ring (C1CCOCC1), and the HSQC peaks (2.56, 60.69; 3.64, 65.99) match morpholine's CH2 groups… Comparing all candidates, candidate 5 has the correct substituents (morpholine vs. oxazolidine or oxazinan in others). The seven-membered ring in candidates 2 and 3 introduces strain and shifts that don't fit, whereas candidate 5's six-membered rings (tetrahydroindole and morpholine) have expected shifts… The error might be higher due to a specific atom's deviation but overall structural fit is better.”
Here, the model reasoned that the presence of the correct morpholine substituent, confirmed by specific HSQC peaks, constituted strong positive evidence. It actively compared this key feature against the incorrect substituents in other candidates. Furthermore, the LLM applied chemical knowledge regarding ring strain to discount the plausibility of candidates with seven-membered rings, noting their predicted shifts would not fit the data as well as the six-membered rings in candidate 5. Critically, it weighed the conflicting evidence—the high numerical error versus the strong diagnostic spectral matches and structural plausibility—and hypothesized a justification for the discrepancy, suggesting the error might be localized rather than indicative of an overall poor fit.
This interpretive analysis, prioritizing diagnostic spectral features and chemical plausibility over a potentially misleading numerical score, led DeepSeek-R1 to correctly identify candidate 5 and assign it a high confidence score (0.85), showcasing the power of its reasoning approach in navigating noisy data.
DeepSeek-R1, however, employed chemical reasoning to interpret the underlying spectral evidence rather than relying solely on the overall HSQC error. Its analysis involved scrutinizing the details of the peak matches for both candidates and weighing the significance of specific deviations versus overall structural consistency. The LLM's internal reasoning highlights this process:
“Starting with candidate 1: its HSQC error is 3.245, which is the lowest among all. The detailed analysis mentions that most peaks match well except for a significant deviation in atom 5. The problem there is a methylene group with an error of 0.504… candidate 3 has an HSQC error of 3.657, slightly higher than candidate 1 but ranked first in the overall analysis. The detailed analysis shows moderate deviations in some atoms but no extreme errors. The methoxy group aligns well, and the core structure fits the data… The final decision weighs structural compatibility over minor error differences. Since the HSQC error difference between candidates 1 and 3 is small, but the overall analysis strongly supports candidate 3's substituent positions, that's likely the right choice.”
This excerpt shows the LLM identifying the specific source of error in candidate 1 (a localized deviation in atom 5) while simultaneously recognizing the superior overall spectral consistency and alignment of key functional groups (like the methoxy group) in candidate 3. Crucially, the model reasoned that the small difference in total HSQC error was less significant than the better explanation of the overall spectral pattern provided by candidate 3's specific structural arrangement, particularly concerning substituent positions, which often yield diagnostic NMR signals.
By interpreting the spectral data through the lens of chemical structure and prioritizing overall evidence quality over a single numerical metric, DeepSeek-R1 correctly identified candidate 3. It assigned a higher confidence score of 0.85 to candidate 3, compared to 0.75 for candidate 1, reflecting its reasoned assessment that candidate 3 represented the chemically more plausible structure despite the marginally higher HSQC error.
Our case studies reveal crucial advantages of using LLM-enhanced structure elucidation methods. The model demonstrates an ability to look beyond simplistic numeric error metrics, recognizing when these can be misleading due to isolated deviations in specific atoms. DeepSeek-R1 applies chemical reasoning that mimics expert analysis, evaluating structural features based on their expected NMR characteristics across various molecular frameworks. Particularly impressive is how the model prioritizes diagnostic spectral features that strongly indicate specific structural elements, even when contradicted by overall error scores. The transparent reasoning process provides chemists with insight into the structural assignment logic, facilitating verification of the proposed structures. These findings demonstrate that LLM integration brings a sophistication to structure elucidation that traditional scoring systems lack, especially when dealing with complex scenarios involving spectral noise or subtle structural variations.
Second, our validation was limited to molecules of 180–420 Da, representative of typical small-molecule pharmaceutical compounds. Effectiveness for larger molecules (e.g., peptides, natural products >420 Da) with more complex spectral patterns remains untested and would likely require adapted approaches.
Third, we tested robustness using a controlled noise scenario (one DMSO solvent peak plus one artifact peak). Real-world spectra may contain multiple overlapping impurity signals, and systematic evaluation of performance degradation with increasing spectral complexity is needed.
Fourth, our current workflow focuses primarily on HSQC for ranking, with 1H/13C/COSY as supporting data. Extension to other 2D techniques (e.g., HMBC for long-range correlations) would require adapted prompts and analysis strategies, representing valuable future development.
Two detailed case studies illustrated how DeepSeek-R1 successfully navigated structural ambiguities beyond numeric error metrics alone, prioritizing chemically informed reasoning and diagnostic spectral features. This interpretable reasoning process, explicitly articulated by the LLM, facilitates greater trust and practical utility in real-world chemical settings.
Our flexible, reasoning-enhanced workflow thus represents a significant advancement over traditional structure elucidation methods, offering a foundation for next-generation tools where advanced AI reasoning actively interprets complex chemical data, augmenting traditional methods and expert analysis. Importantly, the LLM reasoning layer is modular and tool-agnostic: while demonstrated here with our candidate generation pipeline, the same approach could enhance outputs from commercial CASE systems such as ACD/Structure Elucidator or MestReNova. Future work could explore several promising directions: (1) extension to additional spectral modalities, particularly HMBC for long-range C–H correlations, enabling cross-modal consistency checking across 1H, HSQC, and HMBC data; (2) domain-specific fine-tuning on experimental spectra; (3) training open-source reasoning models with chemistry-specific reinforcement learning to improve spectral interpretation capabilities; and (4) systematic benchmarking of newer reasoning models as LLM capabilities continue to advance. These avenues could further push the boundaries of automated, interpretable, and accurate molecular structure elucidation.
The dataset encompasses organic molecules with molecular weights ranging from 180–420 Da, representing diverse structural features. The molecules contain various functional groups, heteroatoms (N, O, S, F, Cl, Br), fused and non-fused ring systems, and different degrees of unsaturation. This structural diversity provides a rigorous test set representative of challenges typically encountered in pharmaceutical and synthetic chemistry. Detailed molecular structures and corresponding spectral data are presented in Fig. S6.
These manually curated regioisomers served as challenging “starting guesses” in our experimental design, allowing us to test whether our methodology could successfully identify the correct molecular structure even when initialized with an incorrect but structurally related hypothesis. This aspect of our experimental design addresses real-world structure elucidation scenarios where initial structural proposals may contain inaccuracies in atomic connectivity. The full set of regioisomeric analogues is provided in Fig. S7.
(1) Data type: we utilized both simulated NMR data (generated using our SGNN model) and experimental NMR data acquired under standard laboratory conditions. This allowed us to evaluate the robustness of our approach when transitioning from idealized to real-world spectral data.
(2) Initial structure guess: two different approaches were employed for the initial structure hypotheses:
• Correct target molecule: using the actual structure as the initial guess.
• Regioisomeric analogue: using a regioisomer of the target molecule as the initial guess to simulate scenarios where the initial hypothesis contains structural inaccuracies.
(3) Data augmentation: for all simulated data, we introduced controlled noise to test system robustness. We augmented HSQC spectra with two specific peaks: a DMSO solvent peak (δH 2.52 ppm, δC 39.57 ppm) and a consistent artifact peak (δH 3.25 ppm, δC 103.42 ppm). While we placed this artifact at a fixed position across all spectra for experimental control and reproducibility, it represents the type of unpredictable spectral artifact commonly encountered in real-world NMR analysis. This standardized approach enabled us to systematically evaluate how both traditional and LLM-based methods handle well-defined spectral interference, simulating real-world experimental conditions where solvent signals and various artifacts are commonly encountered.
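The augmentation described in item (3) amounts to appending two fixed peaks to each simulated HSQC peak list. A minimal sketch, using the peak positions given above; the function and constant names are illustrative and the example spectrum is invented:

```python
# Peak positions as (delta_H, delta_C) in ppm, taken from the text.
DMSO_PEAK = (2.52, 39.57)
ARTIFACT_PEAK = (3.25, 103.42)


def add_noise_peaks(peaks, extra_peaks=(DMSO_PEAK, ARTIFACT_PEAK)):
    """Return a new HSQC peak list with the solvent and artifact peaks appended.

    The input list is left unchanged so the clean spectrum stays available
    for the noise-free condition.
    """
    return list(peaks) + list(extra_peaks)


clean = [(3.64, 65.99), (2.56, 60.69)]  # illustrative peaks only
noisy = add_noise_peaks(clean)
print(noisy)
```

Because the added peaks have no counterpart in any candidate's predicted spectrum, a purely numerical matcher has to absorb them as error, which is the effect the Sim Target + Noise condition is designed to probe.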
For our main analysis, we employed a molecular weight delta filter of Δ = 0.5 Da, constraining candidates to those with near-identical molecular weights. This focused approach allowed us to evaluate structure elucidation performance in scenarios where the chemical space is more precisely defined, as is often the case in targeted synthesis verification.
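The molecular weight filter is a simple window around the target mass. The sketch below assumes molecular weights have already been computed for each candidate; `filter_by_weight` is a hypothetical helper, and the small molecules in the example are chosen only because their weights are easy to check, not because they resemble the test set.

```python
def filter_by_weight(candidates, target_mw, delta=0.5):
    """Keep candidates whose molecular weight lies within `delta` Da of the target.

    `candidates` is a list of (smiles, molecular_weight) pairs.
    """
    return [(smiles, mw) for smiles, mw in candidates
            if abs(mw - target_mw) <= delta]


# Illustrative example: ethanol as target, with an isomer and a non-isomer.
candidates = [
    ("CCO", 46.07),   # ethanol
    ("COC", 46.07),   # dimethyl ether, same formula
    ("CCC", 44.10),   # propane, different formula
]
kept = filter_by_weight(candidates, target_mw=46.07)
print(kept)
```

With Δ = 0.5 Da the filter in effect retains isomers of the target, which matches the regioisomer-discrimination setting described above.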
For each experimental condition, we analyzed performance using both our baseline HSQC peak matching approach and the LLM-enhanced evaluation process. While we evaluated multiple LLMs (DeepSeek-R1, Claude 3.5 Sonnet, Claude 3.7 Sonnet-Thinking, Gemini 2.0 Flash-Thinking, o3-mini, and Kimi 1.5), our main text focuses primarily on results from DeepSeek-R1 due to its strong performance and open-source nature, which enables potential future fine-tuning for specialized chemical applications. Comprehensive results for all evaluated models are provided in the SI Section S1.
This experimental design allowed us to systematically assess the impact of each variable on elucidation accuracy and identify the conditions under which our LLM-augmented approach provides the greatest advantage over traditional methods, with particular emphasis on realistic scenarios involving experimental data and imperfect structural hypotheses.
For retrosynthesis prediction, we employed a Chemformer model24–26 configured to generate up to 20 retrosynthesis suggestions for the target molecule. The model analyzes the target structure (provided as a SMILES string) and proposes potential disconnections, yielding a diverse set of starting materials from just one synthetic step backwards. These starting materials are canonicalized and filtered for uniqueness to ensure a non-redundant set of precursors for the subsequent forward synthesis step.
In the forward synthesis phase, the same Chemformer model is repurposed to predict possible products that could be formed from each of the identified starting materials. For each starting material, the model generates up to 20 potential product molecules, expanding the search space to include synthetically accessible analogues that maintain chemical similarity to the original target. This approach grounds our structure elucidation in synthetic reality, prioritizing molecules that could plausibly be formed through established chemical transformations.
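The two-step expansion can be sketched as follows. The callables `retro`, `forward`, and `canonicalize` are hypothetical stand-ins for the Chemformer predictions and a SMILES canonicalizer; they are not the actual Chemformer API. De-duplicating the final products is an assumption made for this sketch.

```python
from typing import Callable, Iterable, List


def unique_in_order(items: Iterable[str]) -> List[str]:
    """Remove duplicates while preserving first-seen order."""
    seen = set()
    out = []
    for item in items:
        if item not in seen:
            seen.add(item)
            out.append(item)
    return out


def synthesis_candidates(target: str,
                         retro: Callable[[str, int], List[str]],
                         forward: Callable[[str, int], List[str]],
                         canonicalize: Callable[[str], str],
                         n_suggestions: int = 20) -> List[str]:
    """One retrosynthesis step backwards, then one forward step per unique precursor."""
    precursors = unique_in_order(
        canonicalize(s) for s in retro(target, n_suggestions))
    products: List[str] = []
    for precursor in precursors:
        products.extend(
            canonicalize(s) for s in forward(precursor, n_suggestions))
    return unique_in_order(products)


# Toy stand-ins so that the sketch runs without any model.
def toy_retro(smiles: str, n: int) -> List[str]:
    return ["A.B", "A.B ", "C.D"][:n]


def toy_forward(smiles: str, n: int) -> List[str]:
    return {"A.B": ["T", "T1"], "C.D": ["T", "T2"]}[smiles][:n]


result = synthesis_candidates("T", toy_retro, toy_forward, str.strip)
print(result)
```

In the toy run, the duplicated precursor is collapsed before the forward step, and the target "T" appears once in the product pool even though two precursors lead to it.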
The Mol2Mol model is configured with specific parameters to control the generation process: DELTA_WEIGHT = 0.5 to constrain molecular weight deviation from the target; TANIMOTO_FILTER = 0.2 to ensure a minimum level of structural similarity; NUM_GENERATIONS = 100 to specify the total number of analogues to generate and MAX_TRIALS = 500 to limit generation attempts.
This approach is particularly valuable for generating regioisomers and molecules with alternative functional group arrangements, complementing the synthesis-based approach by exploring regions of chemical space that are structurally plausible.
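The generation parameters and the two filters they imply can be summarized in a short sketch. It assumes that DELTA_WEIGHT is expressed in Da and that fingerprints are available as sets of on-bit indices; the class and function names are illustrative and do not come from the Mol2Mol code base.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class Mol2MolConfig:
    delta_weight: float = 0.5     # DELTA_WEIGHT (assumed to be in Da)
    tanimoto_filter: float = 0.2  # TANIMOTO_FILTER, minimum similarity
    num_generations: int = 100    # NUM_GENERATIONS
    max_trials: int = 500         # MAX_TRIALS


def tanimoto(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Tanimoto coefficient of two fingerprints given as sets of on-bits."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def passes_filters(cfg: Mol2MolConfig,
                   target_mw: float, target_fp: FrozenSet[int],
                   cand_mw: float, cand_fp: FrozenSet[int]) -> bool:
    """Accept an analogue only if it is close in weight and similar enough."""
    close_in_weight = abs(cand_mw - target_mw) <= cfg.delta_weight
    similar_enough = tanimoto(target_fp, cand_fp) >= cfg.tanimoto_filter
    return close_in_weight and similar_enough


cfg = Mol2MolConfig()
target_fp = frozenset({1, 2, 3})
analogue_fp = frozenset({2, 3, 4})
print(tanimoto(target_fp, analogue_fp))
print(passes_filters(cfg, 122.12, target_fp, 122.12, analogue_fp))
```

A low similarity threshold combined with a tight weight window favors regioisomers: candidates may differ substantially in connectivity but must keep nearly the same composition.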
• Base training: the model was first trained for five epochs to predict SMILES strings from the corresponding spectral data. This stage utilized a cross-entropy loss function for the SMILES tokens.
• Dropout training: following the base training, a second stage introduced a 50% spectral data dropout, where individual spectra were randomly omitted during training. This dropout was applied uniformly across all modalities, including HSQC. This forced the model to become more robust and less reliant on any single data modality, improving generalization to real-world scenarios where spectra may be incomplete.
This entire pre-training was performed on four Nvidia V100 GPUs using the AdamW optimizer and a ReduceLROnPlateau learning rate scheduler. The resulting pre-trained model serves as the starting point for the subsequent improvement cycle.
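The spectral dropout used in the second stage can be sketched as below, assuming each training sample is a dictionary mapping a modality name to its spectrum. Keeping at least one modality per sample is an assumption of this sketch, not a detail stated above, and the function name is illustrative.

```python
import random
from typing import Dict, List, Optional


def drop_modalities(sample: Dict[str, List[float]],
                    p_drop: float = 0.5,
                    rng: Optional[random.Random] = None) -> Dict[str, List[float]]:
    """Independently omit each spectrum with probability p_drop."""
    if rng is None:
        rng = random.Random()
    kept = {name: spectrum for name, spectrum in sample.items()
            if rng.random() >= p_drop}
    if not kept:
        # Assumption: never pass a completely empty input to the model.
        name = rng.choice(sorted(sample))
        kept[name] = sample[name]
    return kept


# Placeholder spectra; real inputs would be tokenized peak lists.
sample = {"1H": [1.0], "13C": [2.0], "COSY": [3.0], "HSQC": [4.0]}
print(drop_modalities(sample, rng=random.Random(0)))
```

Because the same probability is applied to every modality, HSQC is dropped as often as the other spectra, which matches the uniform dropout described above.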
The MMST workflow begins with an initial test using simulated NMR data of the provided target molecule. We assess the pre-trained model's prediction capability by comparing its output to the known target structure using the RDKit framework, which evaluates molecular graph isomorphism and produces a similarity score between 0 and 1. If this score exceeds our predefined threshold (IC_THRESHOLD = 0.5), the model is deemed sufficiently accurate for the current chemical space and proceeds directly to generating candidate structures using the experimental data.
However, if the score falls below 0.5, indicating that the pre-trained model struggles with this particular chemical class, an iterative improvement cycle is initiated before tackling the experimental data:
(1) Generation of similar molecules to the target using the Mol2Mol model with parameters optimized for fine-tuning data generation (MF_GENERATIONS = 200, MF_DELTA_WEIGHT = 100).
(2) Simulation of NMR spectra (1H, 13C, COSY, HSQC) for these generated molecules using the SGNN model.34
(3) Fine-tuning of the pre-trained MMST model on this simulated data (NUM_EPOCHS = 15, LEARNING_RATE = 0.0002).
(4) Deployment of the fine-tuned model to generate and sample new candidate structures (MULTINOM_RUNS = 30).
This cycle is repeated up to three times (IMPROVEMENT_CYCLES = 3), allowing the MMST model to progressively refine its predictions based on the specific spectral characteristics of the chemical space surrounding the target molecule. This comprehensive approach generates a diverse pool of candidate molecules that are subsequently ranked using HSQC peak matching and then subjected to in-depth LLM-enhanced analysis using DeepSeek-R1, as described in the following section.
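The control flow of this improvement cycle can be sketched as follows. The callables `evaluate` and `fine_tune_cycle` are hypothetical placeholders for the simulated-data test and for steps (1) to (4). Stopping early once the threshold is exceeded is an assumption suggested by the phrase "up to three times", and a score exactly equal to the threshold is treated here as not passing.

```python
from typing import Callable

IC_THRESHOLD = 0.5      # from the text
IMPROVEMENT_CYCLES = 3  # from the text


def improvement_loop(evaluate: Callable[[], float],
                     fine_tune_cycle: Callable[[], None],
                     threshold: float = IC_THRESHOLD,
                     max_cycles: int = IMPROVEMENT_CYCLES) -> int:
    """Fine-tune until the score exceeds the threshold or the cycle budget is spent.

    Returns the number of fine-tuning cycles that were run.
    """
    cycles = 0
    while cycles < max_cycles and evaluate() <= threshold:
        fine_tune_cycle()
        cycles += 1
    return cycles


# Toy run: the similarity score improves after each cycle.
scores = iter([0.2, 0.4, 0.7])
cycles_run = improvement_loop(lambda: next(scores), lambda: None)
print(cycles_run)
```

In the toy run, two cycles are performed before the third evaluation exceeds the threshold; a model that never improves would stop after three cycles.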
For each candidate molecule, we simulate NMR spectra using the SGNN model34 and perform quantitative HSQC peak matching against the experimental spectra.32 We prioritize HSQC data for ranking due to its high information content and reliability for structural discrimination.
Beyond overall HSQC matching scores, we calculate per-atom error metrics for each carbon–hydrogen bond, providing a detailed view of structural agreement or discrepancy at the atomic level. These granular error metrics prove particularly valuable for subsequent LLM-driven analysis, enabling focused assessment of specific structural features.
The ranked candidates from all three generation approaches are combined into a unified pool, with the top-ranked molecules (the top 5) selected for detailed evaluation in the subsequent LLM-enhanced analysis stage using DeepSeek-R1.
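A minimal sketch of this pooling step is given below. It assumes that each generation approach returns (canonical SMILES, HSQC error) pairs and that, when the same structure appears in several pools, the lowest error is retained; both are assumptions made for illustration.

```python
from typing import Dict, Iterable, List, Tuple

Scored = Tuple[str, float]  # (canonical SMILES, HSQC error score)


def merge_and_select(pools: Iterable[List[Scored]],
                     top_n: int = 5) -> List[Scored]:
    """Merge candidate pools, de-duplicate, and return the top_n lowest-error entries."""
    best: Dict[str, float] = {}
    for pool in pools:
        for smiles, error in pool:
            if smiles not in best or error < best[smiles]:
                best[smiles] = error
    return sorted(best.items(), key=lambda item: item[1])[:top_n]


# Toy pools with placeholder identifiers instead of real SMILES.
synthesis_pool = [("A", 3.2), ("B", 1.1)]
mol2mol_pool = [("B", 1.1), ("C", 0.7), ("D", 5.0)]
mmst_pool = [("E", 2.4), ("F", 9.9), ("A", 3.2)]
top5 = merge_and_select([synthesis_pool, mol2mol_pool, mmst_pool])
print(top5)
```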
Using these enriched structural representations, Claude 3.5 Sonnet performs a two-stage spectroscopic analysis. In the first stage, each candidate undergoes individual assessment where the LLM analyzes the molecule's visual representation and IUPAC name alongside experimental HSQC data and per-atom error metrics. Through a chain-of-thought process, the LLM evaluates structural-spectral alignment, considering peak patterns, functional group contributions to chemical shifts, and potential structural anomalies explaining spectral discrepancies.
The second stage involves comparative evaluation, where Claude 3.5 Sonnet examines all top five candidates simultaneously. This side-by-side comparison, mimicking the approach of expert spectroscopists, enables the LLM to identify distinguishing spectral features and structural elements that differentiate candidates. The resulting analyses provide chemically informed insights that highlight specific structural elements supporting or contradicting each candidate structure based on the spectral evidence.
(1) Data synthesis: for each top candidate molecule, our system aggregates comprehensive information including IUPAC name, molecular properties, HSQC error scores, and spectral analyses from earlier evaluation stages.
(2) Evidence analysis: using chemistry-specific chain-of-thought prompting, DeepSeek-R1 systematically evaluates each candidate by analyzing spectral data (particularly HSQC shift comparisons), structural features, and molecular properties. The model justifies its assessments with specific data references and assigns a confidence score (0–1) to each candidate.
(3) Structured output: analysis results are formatted as standardized JSON containing confidence scores, detailed reasoning, and notes on data quality or structural ambiguities. This structured approach enables rigorous evaluation of model performance in molecular structure determination.
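To make the structured output concrete, the sketch below parses and validates a response of this kind. The field names (`candidates`, `candidate_id`, `confidence`, `reasoning`, `notes`) and the example content are hypothetical; the exact schema used in our pipeline is not reproduced here.

```python
import json
from typing import Any, Dict, List

# Hypothetical example of a model response (illustrative content only).
EXAMPLE = """
{
  "candidates": [
    {"candidate_id": 2, "confidence": 0.31,
     "reasoning": "Predicted methoxy carbon shift deviates strongly.",
     "notes": ""},
    {"candidate_id": 1, "confidence": 0.82,
     "reasoning": "Aromatic CH shifts agree within the per-atom errors.",
     "notes": "One unassigned peak, possibly solvent."}
  ]
}
"""


def parse_llm_output(text: str) -> List[Dict[str, Any]]:
    """Parse the JSON, check confidence bounds, and sort by descending confidence."""
    candidates = json.loads(text)["candidates"]
    for candidate in candidates:
        confidence = candidate["confidence"]
        if not 0.0 <= confidence <= 1.0:
            raise ValueError(f"confidence out of range: {confidence}")
    return sorted(candidates, key=lambda c: c["confidence"], reverse=True)


ranked = parse_llm_output(EXAMPLE)
print([c["candidate_id"] for c in ranked])
```

Validating the confidence range at parse time keeps malformed model responses from silently entering the downstream accuracy statistics.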
Additionally, to contextualize DeepSeek-R1's effectiveness, supplementary evaluations were conducted using other reasoning-focused models (Claude 3.7 Sonnet-Thinking, Gemini 2.0 Flash-Thinking, o3-mini, Kimi 1.5, and standard Claude 3.5 Sonnet). These comparative benchmarks confirmed broadly similar reasoning capabilities among leading LLMs, reinforcing the robustness of our DeepSeek-R1-based primary evaluation approach. Comprehensive benchmarking details are provided in Fig. S1–S5.
For completeness, we also examined the distribution of correct structures across ranking positions (first through fifth) for both the baseline HSQC peak matching and LLM-enhanced approaches. These distributions, presented as histograms in Fig. S1–S5, provide insights into how each method shifts the ranking of correct structures, particularly in challenging scenarios with experimental data or added noise.
The comparative analysis between HSQC matching's top-5 accuracy and LLM-enhanced top-1 accuracy proved especially informative, revealing the LLM's ability to effectively re-rank candidates and elevate correct structures to the top position. This relationship demonstrates that while the LLM is constrained by the candidate pool generated through HSQC matching, it substantially improves prioritization within that pool.
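The relationship between the two metrics can be illustrated with a small sketch. It assumes that, for each test molecule, the 1-based rank of the correct structure is recorded, with `None` when the correct structure is absent from the candidate pool; the ranks below are invented for illustration.

```python
from typing import List, Optional


def top_k_accuracy(ranks: List[Optional[int]], k: int) -> float:
    """Fraction of molecules whose correct structure is ranked within the top k."""
    hits = sum(1 for rank in ranks if rank is not None and rank <= k)
    return hits / len(ranks)


# Invented ranks of the correct structure for five molecules.
hsqc_ranks = [1, 3, 2, None, 5]  # baseline HSQC peak matching
llm_ranks = [1, 1, 1, None, 2]   # after LLM re-ranking of the same top 5

hsqc_top1 = top_k_accuracy(hsqc_ranks, 1)
hsqc_top5 = top_k_accuracy(hsqc_ranks, 5)
llm_top1 = top_k_accuracy(llm_ranks, 1)
print(hsqc_top1, hsqc_top5, llm_top1)
```

Because the LLM only re-orders the five candidates it receives, its top-1 accuracy cannot exceed the top-5 accuracy of HSQC matching, which therefore acts as an upper bound.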
(1) For each candidate molecule, we matched simulated and experimental HSQC peaks using a nearest-neighbor double assignment algorithm with Hungarian distance optimization as described in our previous publication.32
(2) This algorithm optimally pairs each experimental peak with its closest corresponding simulated peak, minimizing the overall matching distance across all peak pairs.
(3) For each matched peak pair, we calculated the Euclidean distance between corresponding peaks in the 2D space defined by the carbon and proton chemical shifts.
(4) The overall HSQC error score for a candidate molecule was calculated as the sum of these distances across all matched peak pairs.
This approach ensures optimal peak matching even in cases with complex or overlapping signals. The resulting error score provides a quantitative measure of how well a candidate structure's predicted HSQC spectrum matches the experimental data, with lower scores indicating better matches. These scores were then used to rank candidate molecules for subsequent LLM-based analysis.
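The scoring in steps (1) to (4) can be sketched as below. For brevity, this sketch finds the minimum-cost assignment by exhaustive search rather than with the Hungarian algorithm; both return the same optimum, but exhaustive search is only practical for a handful of peaks. Using unscaled ppm distances and ignoring surplus peaks when the two lists differ in length are assumptions of this sketch.

```python
from itertools import permutations
from math import dist, inf
from typing import List, Tuple

Peak = Tuple[float, float]  # (delta_H in ppm, delta_C in ppm)


def hsqc_error(experimental: List[Peak], simulated: List[Peak]) -> float:
    """Sum of Euclidean distances under the minimum-cost one-to-one peak assignment."""
    # Assign every peak of the shorter list to a distinct peak of the longer list.
    shorter, longer = sorted((experimental, simulated), key=len)
    best = inf
    for assignment in permutations(longer, len(shorter)):
        total = sum(dist(a, b) for a, b in zip(shorter, assignment))
        best = min(best, total)
    return best


# Made-up shifts for illustration.
experimental = [(7.2, 128.0), (3.8, 55.0)]
simulated = [(3.9, 55.0), (7.2, 129.0)]
error = hsqc_error(experimental, simulated)
print(round(error, 3))
```

For realistic peak counts, a polynomial-time solver such as SciPy's `linear_sum_assignment` applied to the pairwise distance matrix would replace the loop over permutations.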
To evaluate the reliability of these confidence scores, we performed a correlation analysis between:
- LLM-assigned confidence scores for candidate structures.
- Whether the candidate was the correct structure (ground truth).
This analysis, visualized as a confidence–accuracy correlation plot, allowed us to assess whether the LLMs' self-reported confidence levels were reliable indicators of prediction accuracy. High correlation between confidence and accuracy suggests that the system can effectively self-evaluate the reliability of its predictions, a critical feature for practical applications where knowing when to trust automated results is essential.
The sample size for our confidence score analysis included all molecules across experimental conditions (n = 204), with slightly smaller samples for positions 4 and 5 (n = 198 and n = 188, respectively) as some molecules had fewer than 5 candidate structures generated.
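One simple way to quantify such a relationship is the Pearson correlation between the confidence score and a binary correctness label (the point-biserial correlation). The statistic is chosen here for illustration and is not necessarily the one underlying our plot; the data below are invented.

```python
from math import sqrt
from typing import List


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


# Invented example: confidence scores and whether each candidate was correct.
confidence = [0.9, 0.8, 0.3, 0.2]
is_correct = [1.0, 1.0, 0.0, 0.0]
r = pearson(confidence, is_correct)
print(round(r, 3))
```

A value close to 1 indicates that high confidence tends to coincide with correct assignments, which is the property needed for confidence-based triage.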
Supplementary information (SI) is available. See DOI: https://doi.org/10.1039/d5dd00359h.
This journal is © The Royal Society of Chemistry 2026