Open Access Article
Dagny Aurich
and
Emma L. Schymanski
*
Luxembourg Centre for Systems Biomedicine, University of Luxembourg, Belvaux, Luxembourg. E-mail: emma.schymanski@uni.lu
First published on 9th February 2026
Computer-Assisted Structure Elucidation (CASE) is a powerful yet underused approach in chemistry to determine molecular structures from experimental data without necessarily being restricted to the contents of chemical databases. This review provides a comprehensive overview of the current state of CASE, encompassing methodologies, computational techniques, applications, challenges, and future directions. The historical evolution of CASE tools is traced, highlighting key milestones and influential technologies. Moreover, the methodologies employed in CASE, including reduction and assembly methods, as well as hybrid approaches, are examined. Special attention is given to the integration of analytical data, such as NMR, MS, and IR, into CASE algorithms, along with computational techniques such as machine learning approaches. Through a series of case studies and real-world applications, the utility of CASE tools in drug discovery, natural products chemistry, environmental sciences, and metabolomics is illustrated. Despite advancements, challenges persist in handling complex molecular structures, improving algorithm accuracy, integrating heterogeneous data sources, benchmarking and reconciling diverse programming languages, alongside the mixture of open vs. closed source developments. Looking ahead, emerging trends and future directions in CASE are identified, including rapid developments with the adoption of deep learning and big data analytics. By providing insights into the current landscape of CASE, highlighting the challenges and proposing recommendations for future research, this review aims to stimulate further CASE innovation and collaboration.
Manual structure elucidation poses several challenges due to its labour-intensive and subjective nature. First, it is a time-consuming process that requires chemists to meticulously analyse experimental data, interpret spectroscopic signals, and construct plausible molecular structures. This can be particularly daunting for complex molecules or when dealing with large datasets. The number of possible structures per molecule varies depending on factors such as the presence of functional groups, stereochemistry (e.g., cis- and trans-isomers, chirality) or connectivity of atoms (e.g., straight-chain structures like n-pentane and branched structures like isopentane or dimethylpropane). Additionally, interpretation of spectroscopic data and assignment of chemical shifts can be subjective, leading to potential biases and inconsistencies between different chemists or laboratories. Variability in interpretation may result in discrepancies in the proposed structures. Moreover, structure elucidation often requires specialized knowledge and expertise in spectroscopy, organic chemistry, and computational methods. Analytical techniques such as NMR and MS generate complex data containing overlapping signals, noise, and artifacts, such that deciphering these spectra and extracting meaningful information to deduce structural features is challenging, particularly for molecules with diverse functional groups or unusual bonding patterns. Ambiguity and uncertainty arise in structure elucidation where multiple structural hypotheses are consistent with the experimental data. Resolving such ambiguities requires additional experiments or computational analyses, adding complexity to the elucidation process. Moreover, human errors, such as misinterpretation of spectral peaks, misassignment of chemical shifts, or oversight of structural constraints, can occur during manual structure elucidation, leading to inaccuracies or incorrect structural assignments.
Lastly, manual structure elucidation may not be scalable or suitable for high-throughput analysis, particularly in the context of large compound libraries or high-volume screening programs. Automation and computational methods offer advantages in terms of speed, throughput, and reproducibility.
Consequently, strengthening the role of computational CASE tools in automating and facilitating the structure elucidation process is essential in addressing these challenges. These tools leverage algorithms and machine learning techniques to analyse analytical data, generate structural hypotheses, and refine molecular models. CASE tools play a pivotal role in closing the loop between experimental data and structural interpretation. They can often automate tedious spectral analysis tasks like peak picking and signal assignment, reducing the time and effort needed for manual interpretation. Additionally, many CASE tools integrate diverse analytical techniques,7,8 enabling comprehensive analysis of complex molecular structures. Through iterative refinement and validation against experimental data, these approaches enhance the accuracy and reliability of structural assignments.
One of the primary challenges faced by CASE is still underutilization, particularly in contexts where users prioritize identifying known compounds over discovering truly novel compounds. Many chemists rely on databases and existing libraries of these known compounds to match experimental data, limiting the exploration of the vast space of unknowns. Moreover, the abundance of CASE tools available, with no clearly established approach to “lead the way”, presents a dilemma for users, as they are often overloaded with options and struggle to determine which tool best suits their needs. CASE tools also remain relatively unknown compared with other computational software, such that these options are rarely in the active awareness of researchers. Each tool may employ different methodologies, algorithms, and user interfaces, making it challenging to navigate the landscape of available options. Additionally, CASE tools are often underutilized because they sometimes underperform, further complicating their adoption. Another obstacle in CASE is the prevalence of proprietary software, which restricts access to source code and hinders further development and customization. Without open source tools, users cannot modify or extend the software to meet specific needs or integrate new features.
This review seeks to address this issue by providing an overview of the various tools and methodologies used in CASE, with a particular focus on the potential for CASE to support identification efforts in MS, where developments are not as mature as CASE for NMR. This review encompasses a structured exploration of the field of CASE, starting with a historical overview (chapter 2) tracing its evolution and highlighting key milestones. It then delves into the methodologies employed in CASE (chapter 3), including reduction methods, assembly methods, and hybrid approaches. Structural databases are covered in chapter 4, integration of analytical data in chapter 5, then real-world applications and case studies in chapter 6, illustrating the utility of CASE tools in various domains, from pharmaceuticals to environmental science. The potential for CASE to evolve in the coming years with the rapid evolution in ML and AI is covered in the closing chapter.
The development of software for predicting NMR spectra in the 1980s and 1990s revolutionized the field of structure elucidation. Programs like ACD/NMR Predictor11 and ChemDraw12 allowed chemists to simulate NMR spectra for proposed structures, aiding in structural verification and validation. Elyashberg et al. highlighted the “synergistic interaction between CASE, new NMR experiments, and continuously improving methods of computational chemistry”.13 Elyashberg himself contributed to this interaction through several tools, including MASS14 (1976), X-PERT15,16 (1997), StrucEluc17 (1999), and Fuzzy Structure Generation18 (2007). Other prominent researchers in the history of CASE tools include Munk, involved in ASSEMBLE19 (1981), COCOA20 (1988), Assemble 2.0 (ref. 21) (2000), and HOUDINI22 (2003); Steinbeck, who contributed to LUCY23 (1996), SENECA24 (2001), MAYGEN25 (2021), and SURGE26 (2022); Kerber with the MOLGEN27 suite (summarized in detail in 2014), and Faulon, involved in OMG,28 MOLSIG,29 and PMG30 during 2012 and 2013. Numerous additional tools have been developed, often building on older algorithms. These tools are summarized chronologically in Table 1 and in greater detail in SI Table S1, which details their basic principles, disadvantages, programming languages, successors (if any), and references. Information was collected from the respective method papers or selected review articles (e.g. Yirik and Steinbeck31). The purpose of Table S1 is to provide a structured historical and methodological overview of CASE tools. Due to the lack of standardized benchmarks, limited availability of performance metrics, and the prevalence of closed source or commercial systems, a quantitative comparison of accuracy or performance was not feasible and is therefore not included.
| Year | Name | Language | Successor | Comment |
|---|---|---|---|---|
| 1964 | CONGEN (DENDRAL)9 | LISP | CONGEN-II, GENOA | Rarely used |
| ∼1970 | CHEMICS(-F)32 | NA | | Not accessible |
| 1976 | MASS14 | FORTRAN | SMOG | Not accessible |
| 1981 | GENOA (DENDRAL)33 | LISP | | Rarely used |
| 1981 | ASSEMBLE19 | NA | Assemble 2.0 | Superseded |
| 1985 | ACCESS34 | NA | | Not accessible |
| 1986 | DARC-EPOIS35 | NA | | Not accessible |
| 1988 | COCOA20 | Pascal, FORTRAN | GEN, HOUDINI | Not accessible |
| 1990 | AEGIS7 | PROLOG | | Not accessible |
| 1990 | MOLGEN36 | C | MOLGEN 3.5, 4, 5 | Closed source |
| 1991 | LSD37 | PROLOG | | MacOS, Win |
| 1995 | GEN38 | Turbo Pascal | HOUDINI | Not accessible |
| 1996 | SMOG39 | C/C++ | | Open |
| 1996 | LUCY23 | NA | SENECA | Not accessible |
| 1997 | COCON40 | NA | | Online demo |
| 1997 | X-PERT15 | NA | | Not accessible |
| 1998 | MOLGEN 4.0 (ref. 41) | C | MOLGEN 5.0 | Closed |
| 1999 | StrucEluc (ACD Labs)42 | NA | | Closed |
| 2000 | Assemble 2.0 (ref. 21) | NA | | Win ‘95,97,NT |
| 2000 | ESESOC43 | NA | | Not accessible |
| 2001 | SENECA24 | Java | | Open, Unix/Win |
| 2003 | HOUDINI22 | Pascal, FORTRAN | | Not accessible |
| 2007 | Fuzzy structure generation18 | NA | | Concept |
| 2012 | OMG28 | Java, C | PMG | Open |
| 2013 | MolSig29 | C | | Open |
| 2013 | PMG30 | Java | | Open |
| 2014 | MOLGEN 5.0 (ref. 44) | C | | Online demo |
| 2017 | MassChemSite45 | NA | | Closed |
| 2017 | SMART46 | Python/Matlab | DeepSAT | Closed |
| 2021 | MAYGEN25 | Java | SURGE | Open |
| 2021 | Scharnica47 | NA | | Accessible |
| 2021 | MassGenie48 | Python (PyTorch) | | NA |
| 2022 | SURGE26 | C | | Accessible |
| 2022 | MSNovelist4 | Python | | Open |
| 2023 | Mass2SMILES49 | Python | | Open (preprint) |
| 2023 | DeepSAT50 | Python | | Open |
Major differences have emerged in the evolution of structure elucidation tools, such as assembly versus reduction or hybrid approaches, whether they work from a molecular formula or use experimental data, whether they are open or closed source, and whether structures are generated with or without relying on databases.
More recently, machine-learning approaches have begun to complement traditional CASE methodologies by learning structure-spectrum relationships directly from large datasets. Algorithms trained on vast databases of chemical structures and spectral data can now predict and interpret spectra with high accuracy. Examples for NMR include SMART46 (Small Molecule Accurate Recognition Technology), which applies deep learning to 2D NMR (HSQC) spectra for spectral embedding and dereplication (2017), and its successor DeepSAT50 (2023), which extends this concept toward data-driven spectral annotation and scaffold recognition. For MS, examples include MassChemSite45 (2017), using a custom database, or MassGenie48 (2021), which leverages PubChem.51 Other tools, such as Scharnica47 (2021), generate possible structures independently of databases. These methodological differences are explained in the next sections.
This relatively small PFAS (per- and polyfluoroalkyl substance) was chosen because the complexity of both the molecule and the reduction process increases with molecular size, making it impractical to demonstrate the method for larger molecules. Without a large number of constraints, the method generates a vast array of possible candidates. Generally, the main disadvantage of reduction methods is the massive size of the hypergraphs.31 For molecules with unknown structures, the size of the hyper structure can become extremely large, resulting in a corresponding increase in runtime. Hybrid methods combining reduction and assembly techniques have been developed to address this (see the section describing Hybrid approaches).
Fig. 2 presents a simplified version of the assembly approach for PFMOAA. The same relatively small PFAS structure was selected (as in Fig. 1) because a greater number of substructures and consequently different assembled isomers are possible with larger examples. Several examples of tools employing this approach are listed in Table S1, highlighting their specific methodologies as well as disadvantages.
HOUDINI,22 an improved version of GEN,38 further refines this hybrid approach. HOUDINI relies on two main data structures: a square matrix representing all bonds in a hyper structure and a substructure representation listing atom-centred fragments. During the structure generation process, HOUDINI22 maps these atom-centred fragments onto the hyper structure, enhancing the efficiency and accuracy of structure generation. Neither approach is available online, nor do they incorporate experimental data, instead relying solely on the molecular formula.
The MOLGEN family of structure generators are among the most time-efficient generators. MOLGEN53 addressed several shortcomings of DENDRAL and many other tools by offering more sophisticated and time-efficient algorithms, with various versions tailored for different data inputs. MOLGEN 3.5 remains one of the fastest generators based on mathematical graph theory using just the molecular formula as an input. MOLGEN 4 (ref. 41) and the related MOLGEN-MS54 and MOLGEN-QSPR55 focused less on speed and more on a flexible interface with advanced restrictions (good list, bad list structures and macroatoms that could be expanded later in generation). In 2007, MOLGEN 5 (ref. 44) was released, aiming to combine the efficiency and flexibility of previous versions through a new, albeit still closed source, approach. In practice, different MOLGEN versions were better suited to different applications.27
In parallel, hybrid approaches have also emerged that combine experimental spectral data with data-driven, machine-learning models rather than explicit graph-based assembly. Tools such as SMART46 and DeepSAT50 integrate 2D NMR data directly into learned chemical representations, enabling dereplication, similarity assessment, and partial structural annotation without enumerating full molecular graphs. These approaches are typically used alongside, rather than as replacements for, traditional CASE generators, providing complementary information that can guide or constrain subsequent structure elucidation workflows.
945 metabolites)66 for the human metabolome.
Spectral databases contain structures for which analytical data exists in sufficient amounts to be measured with the respective technique. Mass spectral libraries have grown impressively in recent years. NIST produce some of the largest mass spectral libraries, with the 2023 release of the NIST/EPA/NIH EI-MS library for electron impact MS data including 394k spectra of 347 100 compounds and the NIST Tandem Mass Spectral Library containing 2.4 million spectra of 51 501 compounds.67 The METLIN library now contains tandem mass spectra for over 960 000 compounds,68,69 although the compound list has not been made publicly available to assess the relevance of the compound coverage. One of the largest open mass spectral libraries is MassBank of North America (MoNA) with 2 080 139 spectra of 651 236 compounds (including some combinatorial libraries such as LipidBlast).70 The open NMR database NMRshiftDB/nmrshiftdb2 contains 271 816 structures, with 70 026 measured and 396 583 calculated spectra in Dec. 2025.71 PubChem collates spectral information (or the presence of spectral information) for 1 650 108 compounds,72 corresponding to 1 229 560 compounds with mass spectra (including the calculated LipidBlast), 659 362 with NMR spectra, 228 628 with IR spectra and 16 029 with UV spectra. Unfortunately, the majority of these entries are thumbnails or partial data (and thus unsuitable for CASE).
The challenge of matching analytical signals to a documented (or hypothetically possible) structure differs depending on the analytical technique used, with NMR generally yielding the richest source of structural information. For mass spectrometry, with generally sparser information, the number of possible structures for discrete formulae rapidly becomes unmanageable, even at relatively small masses (see Table 2). At larger masses (∼400–500 Da), even the number of possible formulae, let alone the number of structures, becomes difficult to manage when including small elements without isotopic patterns such as fluorine.73
| #Carbons | #Isomers | SDF file size | #Isomers in CDD |
|---|---|---|---|
| 2 | 9 | 6 kB | 9 |
| 3 | 29 | 22 kB | 27 |
| 4 | 116 | 108 kB | 38 |
| 5 | 506 | 561 kB | 35 |
| 6 | 2455 | 3176 kB | 34 |
| 7 | 12 783 | 18 939 kB | 40 |
| 8 | >70 000 | 117 146 kB | [>upload limit] |
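The combinatorial explosion in candidate formulae described above can be illustrated with a brute-force enumeration. The sketch below is purely illustrative and not part of any CASE tool: it searches CHNOF element counts whose monoisotopic mass falls within a ppm tolerance of a target neutral mass, here trifluoroacetic acid (C2HF3O2, ∼113.9929 Da). The function name, element limits, and tolerance defaults are all assumptions for this sketch.

```python
# Illustrative sketch (not from any CASE tool): brute-force enumeration of
# CHNOF molecular formulae whose monoisotopic mass lies within a ppm
# tolerance of a target neutral mass. Element limits are assumptions.
from itertools import product

# Monoisotopic masses (Da) of the most abundant isotopes.
MASSES = {"C": 12.0, "H": 1.0078250319, "N": 14.0030740052,
          "O": 15.9949146221, "F": 18.9984032}

def candidate_formulae(target_mass, tol_ppm=5.0, max_counts=None):
    """Return all element-count combinations within tol_ppm of target_mass."""
    max_counts = max_counts or {"C": 15, "H": 30, "N": 4, "O": 8, "F": 10}
    elements = list(MASSES)
    tol = target_mass * tol_ppm * 1e-6
    hits = []
    for counts in product(*(range(max_counts[e] + 1) for e in elements)):
        mass = sum(n * MASSES[e] for e, n in zip(elements, counts))
        if abs(mass - target_mass) <= tol:
            hits.append(dict(zip(elements, counts)))
    return hits

# Trifluoroacetic acid (C2HF3O2) has a neutral monoisotopic mass of
# ~113.9929 Da; its formula appears among the hits.
for formula in candidate_formulae(113.9929):
    print(formula)
```

Even with these modest element limits, the search already spans hundreds of thousands of combinations; adding elements or raising the limits toward the ∼400–500 Da range discussed above quickly makes exhaustive enumeration impractical, before any structures are even generated for each formula.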
MS and IR data integration, while less common due to the generally less detailed structural information available, follows similar principles. MS data provide molecular masses and fragmentation patterns that can confirm or refute structural hypotheses generated from NMR data. However, structure elucidation can also be performed directly from MS data. MOLGEN-MS, built on MOLGEN 4.0 and designed for low resolution electron impact mass spectra,27,54 generated structures from a molecular formula using spectral classifiers to determine possible structural features, expressed as a “good list” and “bad list” (substructures present/absent to a given probability threshold set by the user) that then constrained the generation. Coupling MOLGEN-MS with classifiers from the NIST database (which was much larger than the original training set) resulted in notable performance improvements,79 but was only applied in a handful of cases80 (discussed in more detail below). Since these efforts, structure elucidation with MS has developed significantly, but is typically coupled to databases of structures (PubChem, ChemSpider, HMDB or others) rather than de novo identification based on structure generation. While manually performed elucidation efforts generally outperformed automated methods in the early “Critical Assessment of Small Molecule Identification” (CASMI) contests (initiated in 2012),81 computational methods improved dramatically over the years of active contests and clearly outperformed manual attempts in later years.82 However, very few of these entries over the years used structure generation, due to its poor performance relative to database lookup. While directly training better-performing structure generation models on tandem mass spectrometry (MS2) spectra is likely still out of reach due to the limited availability of public training data, methods leveraging the latest advances in ML are under development.
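The “good list”/“bad list” constraint logic can be sketched in a few lines. This is a hedged toy illustration in the spirit of MOLGEN-MS, not the tool itself: substructures are abstract labels rather than molecular graphs, and the thresholds, names, and data below are invented.

```python
# Toy illustration of "good list"/"bad list" constraint filtering: a
# classifier assigns each substructure a probability of being present;
# above/below user-set thresholds it becomes required or forbidden.
# Substructures are abstract labels here, not real molecular graphs.

def build_lists(substructure_probs, present_thresh=0.9, absent_thresh=0.1):
    """Split classifier probabilities into a good list and a bad list."""
    good = {s for s, p in substructure_probs.items() if p >= present_thresh}
    bad = {s for s, p in substructure_probs.items() if p <= absent_thresh}
    return good, bad

def filter_candidates(candidates, good, bad):
    """Keep candidates containing every good-list and no bad-list substructure."""
    return [name for name, subs in candidates.items()
            if good <= subs and not (bad & subs)]

# Hypothetical classifier output and candidate structures (labels invented):
probs = {"CF2": 0.97, "COOH": 0.95, "phenyl": 0.05}
candidates = {"cand_A": {"CF2", "COOH"},
              "cand_B": {"CF2", "phenyl"},
              "cand_C": {"COOH"}}
good, bad = build_lists(probs)
print(filter_candidates(candidates, good, bad))  # -> ['cand_A']
```

In a real generator, such constraints prune the search during generation rather than filtering afterwards, which is what makes them effective against combinatorial explosion.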
MSNovelist4 builds on the success of CSI:FingerID83 and SIRIUS,84 which have performed well in CASMI by predicting a molecular fingerprint from MS2 data and matching it against compound databases.83 MSNovelist4 combines this fingerprint prediction83 with an encoder-decoder neural network, a Recurrent Neural Network (RNN) with Long Short-Term Memory (LSTM) architecture, to generate structures de novo solely from MS2 spectra. In evaluations, it predicted 25% of structures correctly at the first rank and retrieved 45% of structures overall, successfully reproducing 61% of correct database annotations without having seen the structures during training.4 This does not reach top CASMI performance levels, but is closer than might have been expected. A recent effort by Brogat-Motte et al. also shows some potential to interpolate novel structures without using a predefined finite candidate set,85 with the first plausible applications likely to be transformations of existing molecules, such as the “suspect library” from GNPS (discussed further below).86
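The fingerprint-based candidate ranking underlying such approaches can be sketched conceptually. The code below is an illustrative toy, not CSI:FingerID: fingerprints are modelled as sets of “on” bit positions, and all names and data are invented.

```python
# Conceptual sketch of fingerprint-based candidate ranking: a fingerprint
# predicted from MS2 data is compared against fingerprints computed for
# database candidates. Fingerprints are modelled as sets of "on" bits.

def tanimoto(fp_a, fp_b):
    """Jaccard/Tanimoto similarity between two fingerprints (sets of on-bits)."""
    if not fp_a and not fp_b:
        return 1.0
    return len(fp_a & fp_b) / len(fp_a | fp_b)

def rank_candidates(predicted_fp, candidate_fps):
    """Rank candidate names by Tanimoto similarity to the predicted fingerprint."""
    return sorted(candidate_fps,
                  key=lambda name: tanimoto(predicted_fp, candidate_fps[name]),
                  reverse=True)

predicted = {1, 4, 7, 9}
fps = {"cand_A": {1, 4, 7, 9, 12},   # similarity 4/5 = 0.8
       "cand_B": {1, 4},             # similarity 2/4 = 0.5
       "cand_C": {2, 3, 5}}          # similarity 0/7 = 0.0
print(rank_candidates(predicted, fps))  # -> ['cand_A', 'cand_B', 'cand_C']
```

Real systems use far richer scoring than plain Tanimoto similarity, but the ranking principle, comparing a spectrum-derived fingerprint against structure-derived fingerprints, is the same.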
IR spectra offer insights into functional groups present within the molecule, aiding in the construction of accurate structural models. Tools like Scharnica,47 AEGIS,7 ASSEMBLE21 or CHEMICS8,32 make use of IR data for structure elucidation. Advancements in spectral data processing have been pivotal in enhancing the performance of CASE tools. Improved algorithms for noise reduction, peak detection, and baseline correction ensure higher quality input data for structure elucidation. Spectral validation techniques have also evolved, providing robust mechanisms to verify the consistency and accuracy of predicted structures against experimental data. Cross-validation with multiple data types (NMR, MS, IR), e.g. shown by Scharnica,47 ensures that the proposed structures are not only mathematically plausible but also chemically and physically consistent.
An important consideration when performing CASE coupled with analytical data is the role of stereochemistry. As highlighted above, the number of structural isomers possible for given molecular formulae rapidly expands into unmanageable proportions; considering the number of stereoisomers possible for combinations of stereocentres in a molecule greatly expands this problem. For instance, ESESOC,43 which examines the 2D connection table to identify all stereocentres, removes all equivalent stereoisomers and then generates candidate structures, was noted to be a very time-consuming approach (Table S1). Many approaches work on such small numbers of atoms that the true combinatorial impact of stereochemistry in CASE is not yet sufficiently explored. Since MS experiments rarely yield stereochemistry information (only possible in very rare cases or with chiral chromatography), CASMI contests were often evaluated on the structural skeleton, by collapsing all candidates by the InChIKey first block (connectivity).82 Recent MS-based CASE developments (MassGenie,48 MASS2SMILES,49 see Table S1) go one step further, ignoring stereochemistry altogether by using canonical (or connectivity) SMILES.
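Collapsing candidates by the InChIKey first block, as done in the CASMI evaluations mentioned above, amounts to grouping keys by their first 14 characters (the connectivity hash). A minimal sketch follows; the keys below are placeholders, not InChIKeys of real compounds.

```python
# Sketch of collapsing candidate structures by the InChIKey first block
# (the 14-character connectivity hash), as used when stereochemistry
# cannot be determined from MS data. The keys below are placeholders.
from collections import defaultdict

def collapse_by_skeleton(inchikeys):
    """Group full InChIKeys by their first (connectivity) block."""
    groups = defaultdict(list)
    for key in inchikeys:
        groups[key.split("-")[0]].append(key)
    return dict(groups)

candidates = ["AAAAAAAAAAAAAA-UHFFFAOYSA-N",   # same skeleton, different
              "AAAAAAAAAAAAAA-REYJFBAGSA-N",   # stereo descriptors
              "BBBBBBBBBBBBBB-UHFFFAOYSA-N"]   # a second skeleton
print(len(collapse_by_skeleton(candidates)))   # -> 2 distinct skeletons
```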
In the pharmaceutical industry, the identification and structural elucidation of small molecule impurities and degradation products is a crucial aspect enforced by regulatory agencies worldwide. Liu et al. provide a comprehensive review of how CASE tools, particularly MS-based techniques, are employed to address this need.87 The review underscores the critical importance of structure elucidation in pharmaceutical development, noting that complete identification of impurities and degradation products often necessitates a combination of chromatographic, MS, and NMR techniques.87 For de novo structure elucidation of compounds the authors refer to the MS2LDA tool,88 which aids impurity identification by extracting common patterns (Mass2Motifs) from MS/MS spectra, which can indicate shared substructures between impurities and drug APIs (Active Pharmaceutical Ingredients). CASE tools that were originally developed for metabolomics and metabolite identification are now increasingly adopted in pharmaceutical settings to streamline this process.87
Metabolomics is a critical area where CASE tools have found application. Several studies have explored the utility of CASE tools in metabolomics, with articles by Dias et al.3 or de Jonge et al.89 providing comprehensive overviews of these efforts. These publications highlight the advancements and challenges in using computational tools for metabolite identification. The field has seen the development of deep learning tools such as MassGenie48 and Mass2SMILES49 (see Table S1), which were specifically built on metabolomics data. These tools represent significant advancements in the ability to predict molecular structures and functional groups directly from MS data, addressing one of the major bottlenecks in metabolomics: the identification of unknown metabolites. Furthermore, the open source methods OMG28 and its parallelized version PMG30 were designed with metabolomics applications in mind. Overall, these applications show that the synergistic interaction between human expertise and computational tools could provide a powerful approach to chemical structure elucidation.
Natural products have long served as a rich source of novel compounds with diverse biological activities, making them crucial targets for structure elucidation. Some CASE tools have been built and tested specifically on natural products, using primarily NMR data.78 Notable examples include COCON40,75 and CISOC-SES.90 In their 2018 review, Burns et al. highlighted the role of CASE tools in the structure elucidation of complex natural products, addressing several critical issues.91 Despite advancements, a concerning number of incorrect natural product structures are still reported in the literature. CASE programs can mitigate this risk by generating all possible structures consistent with the input data and ranking them by probability. These tools are effective in determining structures for complex natural products, although they may struggle with compounds containing very few protons.91 Different CASE programs were described, emphasizing their handling of longer-range correlation peaks. These programs either provide just planar skeletal structures or use stereospecific NMR data to determine 3D structures.91 The paper discusses additional forms of computer assistance in structure elucidation, including the growing use of theoretical DFT calculations to determine 3D structures and predict chemical shifts. Burns et al. concluded with suggestions for improving CASE programs and proposed a challenge match between current CASE program developers to further enhance their capabilities and accuracy.91
Environmental samples are generally too complex for NMR, leaving CASE via MS as the primary choice. To date, applications have been rather limited. This includes some approaches with low resolution GC-MS data based on MOLGEN-MS, which helped identify some unknowns in effect-directed analysis studies in Bitterfeld80 and elucidate a toxic transformation product (TP) of diclofenac,92 whose identity was confirmed via synthesis of a reference standard. Some efforts with high resolution data include the elucidation of several benzotriazole TPs,93 although final proof of many structures remained elusive due to the lack of reference standards. The advent of large open structure databases such as PubChem and ChemSpider in the early 2000s alongside the developments of high resolution mass spectrometry (HRMS) has seen the field shift focus to documented chemicals, since this “known” chemical space is already challenging enough to master at present,62 let alone the unknown. However, TPs, i.e. relatively slight modifications of documented structures, are likely the next domain within reach of CASE via MS and are the focus of several current developments such as BioTransformer 4.0 (ref. 94) and the Chemical Transformation Simulator.95 Molecular networking approaches based on GNPS96 have been used to generate so-called “suspect libraries” of spectra that are one node away from known spectral/structural associations, on the hypothesis that many of these may be from structurally-related compounds.86 Recent developments85 may help interpolate novel structures to enable CASE for these TPs, as mentioned above.
Despite progress, the accuracy of CASE algorithms remains a critical concern. Current algorithms can sometimes produce incorrect or incomplete structures, particularly when dealing with noisy or incomplete spectral data (a limitation applying to all tools dealing with experimental data, e.g. LSD37 or COCON40,75). Enhancing the reliability of these algorithms is crucial to ensure accurate structure elucidation. Another challenge is integrating data from different analytical techniques, such as NMR, MS, and IR, into a cohesive structure elucidation process. Each technique provides complementary information, but combining these data streams seamlessly remains difficult and, although certainly a worthy time investment, is in rather low demand, as it is quite rare that all three methods are available for the same question.
The performance of most CASE systems still limits their adoption. Overall, scientists are rarely satisfied with an honest, unbiased assessment of how many structural possibilities theoretically exist for a given CASE problem, with substituted long chains being particularly problematic due to the high number of branching/substitution possibilities. This “hard truth” of potential possibilities is compounded by the great difficulty and expense of confirming potential “unknown unknowns” (which would involve synthesis or isolation of sufficient amounts for detailed analysis – both often very difficult in reality), such that these confirmation efforts are only performed in very rare cases, and focus is often placed on problems that are easier to solve. Different CASE systems apply a range of methods to rank their candidates, which makes it difficult to compare results, while benchmarking is also challenging (see below). Although various systems have evolved in the last two decades to quantify confidence of identification (e.g. the Metabolomics Society Initiative,97 confidence levels for HRMS data,98 HRMS data coupled with PFAS99 or CCS100), an attempt by the Metabolomics Society to create a reporting system catering for the structural information and confidence applicable to both MS and NMR has so far failed to reach community consensus. Although Metz et al.101 recently proposed a probability approach, the probabilities would be so low for any CASE problem considering all possible structures that this is not yet feasible for de novo identification efforts.
The use of different programming languages across various CASE tools (e.g., LISP, FORTRAN, C, see Tables 1 and S1) presents a challenge in terms of interoperability, maintenance, and integration. This diversity complicates the ability to seamlessly combine or compare results from different software systems, hindering collaborative efforts and the development of standardized workflows. Additionally, many CASE tools face computational bottlenecks, such as long computing times and difficulties in handling overlapping substructures or duplicate fragments, let alone stereoisomers. These issues can slow down the elucidation process and reduce the efficiency of the tools, highlighting the need for more optimized and scalable algorithms.
CASE is hampered in many ways by the mix of open versus closed/commercial approaches and data. Closed source tools further exacerbate many challenges faced by CASE by limiting transparency and hindering collaborative development. The lack of access to the underlying algorithms and data processing methods in these tools prevents the wider scientific community from verifying results, contributing to improvements, or integrating these tools into broader workflows. For instance, the commercial license on the MOLGEN suite – one of the most efficient structure generators developed – prevented further developments following the retirement of Prof. Kerber and the distribution of the know-how away from the license holder (University of Bayreuth). While the push for open code in recent years has clearly impacted the CASE field, with many new developments now open (see Tables 1 and S1), the availability of sufficient open data to train and benchmark CASE methods is also becoming a bottleneck: few scientists are incentivised to measure, let alone contribute, the resulting data to open resources, while many of the largest collections remain licensed or closed (e.g., the NIST and METLIN libraries for MS). Since the recent trend in open source code availability has been driven strongly by funders and institutions, a similarly coordinated approach would likely be needed to incentivise large, multinational collections of measured data and contribute sufficient amounts of new data to openly available collections. Rigorous benchmarking exercises are also now generally beyond the reach of individual research groups, such that communities, networks or societies may need to consider how such efforts could be stimulated and supported.
Despite the reflections above, it is uncertain whether substantial improvements to CASE tools would lead to their widespread adoption across various scientific fields. Routine laboratories still face significant challenges in fully identifying compounds or structures already documented in databases (e.g., even non-target analysis of “known unknowns” is not yet standardized), making CASE tools for identifying true “unknown unknowns” more of a future prospect that is likely to remain underutilized in the near term.
While gathering the literature and information for this review and Table S1, several documentation and interoperability issues with many of the CASE tools became evident, particularly the older ones. Locating the original references and the exact year of publication was challenging, especially when they were published in different languages (e.g., CHEMICS in Japanese) or only available in print with no online access. Additionally, comparing the computational demands of these tools was difficult due to differences in the hardware and software capabilities of the machines used at the time. Since many of these tools only run on outdated operating systems, direct comparisons between old and modern tools are complicated. Thus, while CASE has been around for many years and is one of the reasons that cheminformatics exists as a discipline, it is in desperate need of modernization.
Emerging trends in CASE include the adoption of deep learning and big data analytics. Deep learning and other machine learning techniques are increasingly being integrated into CASE tools (as shown above). For instance, Python tools like MassGenie48 and MASS2SMILES49 leverage deep neural networks to predict molecular structures and substructures from spectral data, demonstrating the potential power of these technologies in CASE. In parallel, several recent studies have applied deep learning directly to NMR spectra: transformer- and neural network-based models such as NMRMind,102 SMART,46 DeepSAT50 and other preprinted architectures103,104 map experimental spectra to molecular structures or structural features, complementing existing MS-based approaches. Both MS- and NMR-based tools, however, would become more reliable for specific structural challenges and for larger, more complex molecules with additional training data. The integration of larger and more comprehensive spectral databases can allow for better matching and validation of experimental spectra against known compounds. As these spectral databases evolve, the challenge of curation and maintenance becomes critical. Most publicly available MS spectral databases are still too small to effectively train models. Additionally, a significant amount of research remains closed, particularly in industry, which hinders the growth of these resources via collaborative efforts. As discussed above, concerted community efforts will be needed to address this lack of data.
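A common first step in feeding mass spectra to neural networks such as those cited is converting a variable-length peak list into a fixed-length input vector. The sketch below shows one generic such encoding (uniform m/z binning with base-peak normalisation); the exact encodings used by MassGenie, MASS2SMILES and related tools differ per tool, and all names here are illustrative.

```python
def bin_spectrum(peaks, max_mz=500.0, bin_width=1.0):
    """Convert a peak list [(m/z, intensity), ...] into a fixed-length
    vector by summing intensities into uniform m/z bins, then normalising
    to the base peak (most intense bin). Fixed-length vectors like this
    are one common way to present mass spectra to neural networks."""
    n_bins = int(max_mz / bin_width)
    vec = [0.0] * n_bins
    for mz, intensity in peaks:
        idx = int(mz / bin_width)
        if 0 <= idx < n_bins:
            vec[idx] += intensity
    top = max(vec) or 1.0  # avoid division by zero for empty spectra
    return [v / top for v in vec]

# Toy spectrum with two fragment peaks:
vec = bin_spectrum([(77.04, 50.0), (105.03, 100.0)])
```

Transformer-based tools typically go further, tokenising peaks directly rather than binning, but the underlying requirement is the same: a consistent numerical representation that the model can map to structures or substructures.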
Future advancements in CASE are likely to come from interdisciplinary approaches that combine insights from chemistry, computer science, and bioinformatics. Collaborations across these fields could lead to the development of more sophisticated algorithms and software capable of addressing the current limitations and pushing the boundaries of what is possible in structure elucidation. Making CASE tools more user-friendly and accessible should be an ongoing goal, but requires incentives for all sides. Simplifying the interfaces and workflows of these tools can help non-experts use them effectively, broadening their adoption and impact. Additionally, open-source initiatives and collaborative platforms could facilitate wider access and community-driven improvements.
CASE tools have found applications in fields like pharmaceuticals, natural products chemistry, metabolomics and environmental science, though their use remains limited. Despite notable progress and promising trends such as deep learning, big data analytics, and improved user accessibility, significant challenges persist. Many applications are still only successful for small or carefully selected cases, and are not broadly applicable. Expanding the current offerings through advanced computational techniques, better data integration, and interdisciplinary collaboration could be key to broader adoption of CASE tools across various scientific and industrial domains. However, it is uncertain whether there is sufficient demand within the scientific community to drive these advancements. ML-based developments are improving rapidly, but depend on large amounts of novel data, while experimentalists have relatively few incentives to contribute valuable measurements to open resources to support ML developments. Benchmarking efforts suffer from the same lack of data. At this stage, CASE via NMR seems to be enjoying rapid development, and although several key new breakthroughs are now available for CASE via MS, their performance is not yet sufficient for routine use. As long as challenges in identifying known structures persist – which is still the case for MS experiments – fully automated new structure generation with CASE tools remains a future prospect.
This journal is © The Royal Society of Chemistry 2026