Kohulan Rajan,a Viktor Weißenborn,a Laurin Lederer,a Achim Zielesnyb and Christoph Steinbeck*a
aInstitute for Inorganic and Analytical Chemistry, Friedrich Schiller University Jena, Lessingstr. 8, 07743 Jena, Germany. E-mail: christoph.steinbeck@uni-jena.de
bInstitute for Bioinformatics and Chemoinformatics, Westphalian University of Applied Sciences, August-Schmidt-Ring 10, 45665 Recklinghausen, Germany
First published on 7th October 2025
The exponential growth of chemical literature necessitates the development of automated tools for extracting and curating molecular information from unstructured scientific publications into open-access chemical databases. Current optical chemical structure recognition (OCSR) and named entity recognition solutions operate in isolation, which limits their scalability for comprehensive literature curation. Here we present MARCUS (Molecular Annotation and Recognition for Curating Unravelled Structures), a tool designed for natural product literature curation that integrates COCONUT-aware schema mapping, CIP-based stereochemical validation, and human-in-the-loop structure refinement. This integrated web-based platform combines automated text annotation, multi-engine OCSR, and direct submission capabilities to the COCONUT database. MARCUS employs a fine-tuned GPT-4 model to extract chemical entities and utilises a Human-in-the-loop ensemble approach integrating DECIMER, MolNexTR, and MolScribe for structure recognition. The platform aims to streamline the data extraction workflow from PDF upload to database submission, significantly reducing curation time. MARCUS bridges the gap between unstructured chemical literature and machine-actionable databases, enabling FAIR data principles and facilitating AI-driven chemical discovery. Through open-source code, accessible models, and comprehensive documentation, the web application enhances accessibility and promotes community-driven development. This approach facilitates unrestricted use and encourages the collaborative advancement of automated chemical literature curation tools.
With the growing demand for machine learning-based tools and machine learning itself being recognised as the ‘fourth pillar’ of chemistry,3 high-quality, machine-readable data have become increasingly important, particularly as graph neural networks now lead molecular property prediction benchmarks when trained on large, reliable structure datasets.4 The FAIR principles likewise emphasise machine readability as a prerequisite for data reuse.5 Together, these trends create a pressing need to convert unstructured literature into structured, query-ready datasets capable of facilitating AI-driven discovery pipelines.
Text extraction has reached relative maturity, with transformer models achieving F1 scores of 86.7% for chemical entity recognition on full-text PubMed articles;6 models trained on the NLM-Chem corpus, for example, reach this level of performance.7 However, Optical Chemical Structure Recognition (OCSR) – the conversion of graphical depictions to a machine-readable format – remains challenging. Current tools such as DECIMER.ai struggle with heterogeneous drawing styles,8 and MolScribe reaches high accuracy on curated benchmarks but degrades on noisy images.9 A 2020 review catalogued typical failure modes, from bond-length variation to mislabelled atoms.10 A 2024 study comparing eight open-access OCSR tools reported F1 scores ranging from 34% to 93%, with no single tool consistently outperforming the others across all categories of chemical structure images.11
These performance disparities have motivated ensemble approaches, where image-aware routing strategies direct different structure types to optimal recognisers. Such arbitration already surpasses the exact-match accuracy of the best individual tool.11 Nevertheless, end-to-end extraction demands more than recognition: diagrams must be linked back to captions and body text, reaction components reconciled, and chemical validity enforced. Recent advances in multimodal language models demonstrate F1 values of around 88–96% for complex structure parsing,12 yet their raw outputs often violate valence rules or yield redundant tautomers.
Recent advances demonstrate the importance of contextual integration: ReactionDataExtractor 2.0 achieves ∼90% F1 scores by co-analysing image labels with surrounding text.13 Cross-media linkage methods enable alignment between visual panels and textual references,14 while multimodal language models such as RxnIM and MarkushGrapher achieve F1 scores of 88% and 96%, respectively, for parsing reaction images and Markush structures.15,16 In real-world applications, however, these tools fall short of their benchmark accuracy because raw recognition outputs often violate valence rules or produce redundant tautomers.
The work by Wang et al. (2025) introduced BioChemInsight, an automated pipeline for extracting chemical structures and bioactivity data from pharmaceutical literature and patents. While BioChemInsight focuses on structure–activity relationship (SAR) data extraction for drug discovery, MARCUS addresses distinct challenges in natural product structure curation. Unlike BioChemInsight's pharmaceutical focus, MARCUS is specifically designed for natural products research, featuring COCONUT-specific schema mapping, taxonomic metadata linking (organism and geographical information), and stereochemical validation with CIP annotations. Additionally, MARCUS provides human-in-the-loop validation through integrated Ketcher editing capabilities, ensuring curation quality for complex natural product structures that often contain intricate stereochemistry and diverse structural motifs.
To address these comprehensive workflow challenges and integrate the fragmented landscape of chemical information extraction tools, we present MARCUS (Molecular Annotation and Recognition for Curating Unravelled Structures), a containerised web platform that transforms unstructured text and images in raw PDFs of articles about natural products into machine-readable molecular records and allows for submission to the open-access database COCONUT, as illustrated in Fig. 1. MARCUS employs Docling for document conversion and a fine-tuned GPT-4 model to extract chemical entities from text while routing chemical structures through an ensemble of three OCSR engines – DECIMER, MolNexTR, and MolScribe – for optimal recognition accuracy. The platform features a Vue.js frontend with a FastAPI backend, supporting concurrent users through queue-based session management and Docker-based deployment across operating systems.
By bridging the gap between the dynamic growth of literature and the structured data requirements of modern AI-driven discovery, the platform enables automated chemical literature curation while maintaining human oversight for quality assurance. The platform is accessible through a user-friendly web interface at https://marcus.decimer.ai, designed for researchers with varying technical expertise. Through open-source code availability at https://github.com/Kohulan/MARCUS and comprehensive documentation, MARCUS promotes community-driven development and encourages collaborative advancement of automated chemical literature curation tools. While MARCUS currently supports submission only to the COCONUT database, it can easily be adapted for other open-access databases.
Articles were retrieved as PDF files from the American Chemical Society website and processed using the Docling17 optical character recognition library. Docling was selected for its lightweight nature, open-source availability, high performance, and local execution capability, ensuring a reproducible preprocessing pipeline. The Docling-generated JSON output was parsed to extract the title, abstract, and introduction sections, which were combined into a single paragraph. These sections were chosen as they typically contain the highest density of chemical entity mentions and novel compound information.
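The snippet below is a minimal sketch of this preprocessing step, assuming Docling's DocumentConverter API. The paper parses Docling's JSON output, whereas for brevity the sketch works from the markdown export; the file name and the section-filtering logic are illustrative only.

```python
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("acs_article.pdf")  # hypothetical input file

# Export the parsed document; section headings become markdown '#' lines.
markdown = result.document.export_to_markdown()

# Keep the title plus the abstract and introduction sections, the parts
# that typically carry the highest density of chemical entity mentions.
keep, current = [], None
for line in markdown.splitlines():
    if line.startswith("#"):
        heading = line.lstrip("# ").strip()
        if current is None:  # the first heading is taken as the title
            keep.append(heading)
        current = heading.lower()
    elif current in ("abstract", "introduction"):
        keep.append(line.strip())

paragraph = " ".join(chunk for chunk in keep if chunk)
```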
To train an entity classifier for extracting molecules and related information, the text files required manual annotation. The text annotation was done using Doccano,18 an open-source annotation platform selected for its collaborative features and ease of use. The 100 publications were divided into five groups of 20 papers to expedite the manual labelling process through parallel annotation by multiple team members.
Ten entity types were defined based on chemical literature requirements and alignment with the COCONUT19 database schema. These are listed in Table 1.
| Label | Explanation | Example |
|---|---|---|
| trivial_name | Common molecular names without IUPAC nomenclature | Caffeine |
| iupac_name | Systematic IUPAC chemical nomenclature | 1,3,7-Trimethylpurine-2,6-dione |
| iupac_like_name | Hybrid nomenclature combining IUPAC and trivial naming | 1,3,7-Trimethylxanthin |
| compound_class | Formal classification of compounds based on structure or function | Alkaloids |
| abbreviation | Shortened form of the molecular name | CAF |
| organism_or_species | Complete organisms or biological species | Coffea arabica |
| organism_part | The anatomical or functional component of an organism | Beans |
| geo_location | Specific geographical locations | Ethiopia |
| kingdom | High-level biological classifications | Plants |
| location | General physical or abstract places | Mountain |
| compound_group | Enumeration of molecules grouped into one word | Caffeine derivatives A-G |
After completing the annotation, all documents, including the labelled entity spans and types, were exported in JSON format. These JSON files were then used to fine-tune a large language model (LLM) for automated data extraction.
The fine-tuning used 2 epochs, a batch size of 1, and a learning rate multiplier of 2, totalling 210 194 tokens from our 100 annotated papers. This final parameter configuration was determined through multiple fine-tuning iterations with manual validation of entity extraction accuracy, optimising the balance between effective domain adaptation and overfitting prevention while minimising hallucination in chemical entity recognition.
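As a minimal sketch of how such a run can be launched with the parameters reported above, assuming the OpenAI Python client, the following is illustrative; the training file name and base-model identifier are placeholders, not the exact ones used here.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Upload the Doccano-derived training examples (JSONL chat format).
training_file = client.files.create(
    file=open("marcus_annotations.jsonl", "rb"),  # hypothetical file
    purpose="fine-tune",
)

# Create the fine-tuning job with 2 epochs, batch size 1 and a
# learning-rate multiplier of 2, as described in the text.
job = client.fine_tuning.jobs.create(
    model="gpt-4o-2024-08-06",  # placeholder base model
    training_file=training_file.id,
    hyperparameters={
        "n_epochs": 2,
        "batch_size": 1,
        "learning_rate_multiplier": 2,
    },
)
print(job.id, job.status)
```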
While commercial models offer strong performance, their closed-source nature poses limitations; an open-source model therefore represents the most suitable alternative for future work.
The fine-tuned model was integrated with Doccano's auto-annotation feature, enabling automated pre-labelling of text via a single click. This significantly accelerates the annotation process for additional documents, supporting iterative model improvement and rapid performance assessment.
The auto-annotation system operates through a Flask-based20 API server, which handles REST requests initiated by Doccano. When the ‘auto-annotate’ feature is triggered, the target text is sent via a REST call to the Flask server, which forwards it to the fine-tuned language model for inference. The model identifies and extracts relevant annotations, which are then mapped back to the original text to determine precise character offsets. These span data are transmitted back to Doccano for visual rendering on the user interface. A visual representation of the complete process is illustrated in Fig. 2.
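A minimal sketch of this bridge is shown below, assuming a Flask endpoint that Doccano calls; the route name, payload fields, and the call to the fine-tuned model are illustrative, not the exact MARCUS implementation.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def run_fine_tuned_model(text: str) -> list[dict]:
    """Placeholder for the call to the fine-tuned LLM; assumed to return
    entities as dicts with 'label' and 'mention' keys."""
    raise NotImplementedError

@app.route("/annotate", methods=["POST"])
def annotate():
    text = request.json["text"]
    spans = []
    for ent in run_fine_tuned_model(text):
        # Map each extracted mention back onto the original text to obtain
        # character offsets, since Doccano expects span-based annotations.
        start = text.find(ent["mention"])
        if start != -1:
            spans.append({
                "label": ent["label"],
                "start_offset": start,
                "end_offset": start + len(ent["mention"]),
            })
    return jsonify(spans)
```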
Text annotation processing requires 10–20 seconds per document, at a cost of under $0.04 per publication through the OpenAI API. The fine-tuned model was specifically optimised for natural products literature terminology, including organism names, bioactive compound classes, and natural product-specific nomenclature, which may limit performance on synthetic medicinal chemistry or other chemical domains.
MARCUS employs a human-in-the-loop multi-model ensemble approach that enables the cross-validation of extracted structures, where users manually select the best prediction after reviewing all model outputs. Unlike automated ensemble methods that combine predictions algorithmically, this approach prioritises transparency and expert oversight. Users can select individual models or run all three for comparison. A Tanimoto similarity matrix computed from ECFP fingerprints, together with Maximum Common Substructure (MCS) highlighting based on RDKit's maximum common substructure algorithm, facilitates analysis of prediction consistency across models. These analyses help users assess model agreement and select the most accurate result. Fig. 3 illustrates the complete process of structure detection and extraction, followed by OCSR and OCSR results comparison in MARCUS.
Fig. 3 Summary of structure detection and extraction, followed by OCSR and OCSR comparison.24
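The consistency checks described above can be sketched as follows with RDKit; the three SMILES strings are placeholders standing in for the DECIMER, MolNexTR, and MolScribe outputs for one segmented image.

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import rdFMCS
from rdkit.Chem.rdFingerprintGenerator import GetMorganGenerator

predictions = {
    "DECIMER":   "CN1C=NC2=C1C(=O)N(C)C(=O)N2C",  # placeholder SMILES
    "MolNexTR":  "CN1C=NC2=C1C(=O)N(C)C(=O)N2C",
    "MolScribe": "Cn1cnc2c1c(=O)n(C)c(=O)n2C",
}
mols = {k: Chem.MolFromSmiles(s) for k, s in predictions.items()}

# Pairwise Tanimoto similarity on ECFP-like Morgan fingerprints (radius 2).
generator = GetMorganGenerator(radius=2, fpSize=2048)
fps = {k: generator.GetFingerprint(m) for k, m in mols.items() if m is not None}
names = list(fps)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        sim = DataStructs.TanimotoSimilarity(fps[a], fps[b])
        print(f"{a} vs {b}: Tanimoto = {sim:.2f}")

# Maximum common substructure across all predictions, used for highlighting.
mcs = rdFMCS.FindMCS([m for m in mols.values() if m is not None])
print("MCS SMARTS:", mcs.smartsString)
```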
An initial evaluation was conducted on 15 randomly selected articles from the Journal of Natural Products to assess model performance. A detailed analysis is provided in the Results section.
In some cases, the predicted chemical structures may require manual adjustments before being downloaded or submitted to a database. To facilitate this, structure editing capabilities are provided through the integrated Ketcher 3.0 structure editor,28 enabling manual refinement of predicted structures before final submission. Ketcher's open-source, standalone design allows local operation without external dependencies.
The backend consists of six routers, including those for text extraction, LLM-based text annotation, DECIMER-based chemical structure detection and segmentation, optical chemical structure recognition (OCSR), chemical structure depiction, and similarity calculation. Detailed information about each component is explained below.
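As a minimal sketch of how such routers might be assembled in a FastAPI application, only the text-extraction router is shown below; the route names, prefixes, and handler body are illustrative rather than the actual MARCUS code.

```python
from fastapi import APIRouter, FastAPI, File, UploadFile

app = FastAPI(title="MARCUS backend")

# One APIRouter per backend concern; the other five routers follow the
# same pattern and are mounted under their own prefixes.
text_router = APIRouter(prefix="/text", tags=["text-extraction"])

@text_router.post("/extract")
async def extract_text(pdf: UploadFile = File(...)):
    # Placeholder: run Docling on the uploaded PDF and return the combined
    # title/abstract/introduction paragraph described earlier.
    content = await pdf.read()
    return {"filename": pdf.filename, "num_bytes": len(content)}

app.include_router(text_router)
```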
Additionally, to ensure fair access and limit the number of simultaneous users, a session management system has been implemented. The Session Router handles user sessions and concurrency control via WebSocket and HTTP endpoints, maintaining real-time session status and queue position information for up to three concurrent users. The system uses Python's asyncio framework38 with deque data structures for efficient queue management. Each session receives a unique UUID39 and maintains metadata including creation time, last activity, and current status. The number of users is currently restricted to three due to hardware limitations, but it can easily be increased given sufficient hardware capacity.
Real-time communication utilises WebSocket connections40 for immediate status updates, queue position notifications, and session state changes. The system includes automatic fallback to HTTP polling when WebSocket connections are unavailable. Background cleanup tasks are executed every 30 seconds to remove expired sessions and promote queued users to active status.
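A minimal sketch of this queue-based session handling, assuming asyncio and a deque as described above, is given below; the timeout value, function names, and metadata fields are illustrative.

```python
import asyncio
import time
import uuid
from collections import deque

MAX_ACTIVE = 3          # concurrent-user limit described in the text
SESSION_TIMEOUT = 600   # seconds; assumed expiry threshold

sessions: dict[str, dict] = {}
waiting: deque[str] = deque()

def create_session() -> str:
    """Register a new session with a unique UUID and basic metadata."""
    sid = str(uuid.uuid4())
    active = sum(1 for m in sessions.values() if m["status"] == "active")
    status = "active" if active < MAX_ACTIVE else "queued"
    sessions[sid] = {"created": time.time(), "last_activity": time.time(),
                     "status": status}
    if status == "queued":
        waiting.append(sid)
    return sid

async def cleanup_loop() -> None:
    """Every 30 seconds: drop expired sessions and promote queued users."""
    while True:
        now = time.time()
        for sid, meta in list(sessions.items()):
            if now - meta["last_activity"] > SESSION_TIMEOUT:
                sessions.pop(sid)
        active = sum(1 for m in sessions.values() if m["status"] == "active")
        while waiting and active < MAX_ACTIVE:
            sid = waiting.popleft()
            if sid in sessions:
                sessions[sid]["status"] = "active"
                active += 1
        await asyncio.sleep(30)
```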
The backend also includes a hierarchical file storage system within the uploads directory for file storage and data management, organising files into subdirectories for original PDF uploads (PDFs), extracted chemical structure images (segments), temporary processing images (chem_images), and cached annotation results (openai_results).
Each PDF upload generates a unique identifier, ensuring isolation between concurrent users. Extracted segments include complete metadata serialised to JSON files for full traceability.
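The per-upload storage layout might be sketched as follows with Python's pathlib; the directory names follow the description above, while the function names and metadata fields are illustrative.

```python
import json
import uuid
from pathlib import Path

UPLOADS = Path("uploads")

def register_upload(pdf_bytes: bytes) -> str:
    """Store an uploaded PDF under a unique identifier and prepare the
    subdirectories used later in the pipeline."""
    upload_id = str(uuid.uuid4())
    for sub in ("pdfs", "segments", "chem_images", "openai_results"):
        (UPLOADS / sub / upload_id).mkdir(parents=True, exist_ok=True)
    (UPLOADS / "pdfs" / upload_id / "document.pdf").write_bytes(pdf_bytes)
    return upload_id

def save_segment_metadata(upload_id: str, index: int, meta: dict) -> None:
    """Serialise segment metadata (page, coordinates, index) to JSON so an
    extracted structure can be traced back to its source page."""
    path = UPLOADS / "segments" / upload_id / f"segment_{index}.json"
    path.write_text(json.dumps(meta, indent=2))
```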
The frontend includes centralised state management built on Vuex with five specialised modules. A PDF module manages PDF file handling, upload status, and document metadata, with reactive updates for PDF viewer synchronisation. A text module handles extracted text content, processing status, and text manipulation operations, with support for extraction history. An annotations module stores entity recognition results, annotation statistics, and highlighted text spans, with category management and confidence scoring. A structure module coordinates chemical structure segments, OCSR results, and processing configurations; it also implements caching for processed structures and supports batch processing with segment selection management. Finally, a theme module controls visual themes, with automatic system preference detection, persistent user settings, and smooth transitions between light and dark modes.
WebSocket integration ensures persistent connections for real-time session status updates and processing progress notifications. The system includes automatic reconnection with exponential backoff and connection health monitoring to maintain performance. In cases where WebSocket connections are unavailable, HTTP polling provides a reliable fallback mechanism.
The application features a three-column responsive layout managed by the home page, comprising three main component groups. First, the PDF Components include a drag-and-drop PDF upload component with file validation and progress indicators, as well as a PDF viewer that supports embedded visualisation with zoom and page navigation controls. Second, the text processing components consist of a text panel for displaying extracted text with syntax highlighting, an annotated text viewer for rendering highlighted entities, and an annotation viewer that presents tabular annotation data with filtering options. Finally, the structure processing components include a segments panel offering grid-based visualisation with selection tools, a segment viewer for displaying individual segments alongside OCSR model outputs, a structure editor based on the Ketcher structure editor for modifying predicted structures, and a comparison component that enables side-by-side analysis of model predictions with similarity scoring.
The service layer abstracts API communication through specialised modules. The core API service uses Axios45 with automatic versioning, request/response interceptors, and error handling. Session-aware API client integration provides automatic session ID injection and session-specific error handling.
A total of 20 publications were randomly selected and downloaded from four major journals in natural products chemistry to ensure diversity in publication styles, chemical structure types, and drawing conventions. The selection included five articles each from Journal of Natural Products (ACS Publications), Phytochemistry (Elsevier), Molecules (MDPI), and Phytochemical Analysis (Wiley). Chemical structure segmentation was performed using DECIMER segmentation v1.5.0 on all 20 publications.
Each extracted structure image was processed through all three OCSR engines to generate SMILES predictions. Model performance was assessed through manual validation by comparing the regenerated molecular structures (from SMILES) with the original segmented images. A prediction was considered accurate only if the generated structure matched the original depiction exactly (1 : 1 match); otherwise, it was classified as incorrect and excluded from the count. Fig. 4 summarises the overall accuracy of the tools across all the publications. Structures deemed irrelevant for natural product research (such as 2D-NMR correlation diagrams or X-ray crystallographic representations) were excluded from accuracy calculations.
The evaluation revealed similar performance levels among the three OCSR models, with each demonstrating distinct strengths across different publication types. DECIMER achieved the highest average accuracy of 75%, followed by MolScribe at 73% and MolNexTR at 69% (see Fig. 4 for details). Performance varied significantly by journal type, with all models showing the lowest accuracy on JNP articles (51–63%) and the highest on Phytochemical Analysis articles (73–93%). Despite similar overall averages, each model demonstrated distinct strengths across different publication styles, supporting the human-in-the-loop ensemble approach implemented in MARCUS.
Contrary to expectations, across the four journal types, the lowest performance was observed on Journal of Natural Products (JNP) articles: DECIMER 63%, MolScribe 51%, and MolNexTR 52%. In contrast, the highest performance was observed on Phytochemical Analysis articles, where DECIMER reached 93%, MolScribe 87%, and MolNexTR 73%. MolNexTR showed the most consistent behaviour across journal types (52–73%), while MolScribe exhibited the greatest variability (51–87%); DECIMER spanned 63–93%. Despite the competitive overall performance, DECIMER showed a unique advantage in handling Markush structures by representing variable groups with ‘R’ notations rather than asterisks directly in SMILES, facilitating downstream processing for structure enumeration. DECIMER also includes a dedicated model specifically trained for hand-drawn chemical structure diagrams, which can be used when needed. Both MolScribe and MolNexTR generate MolFiles with atomic coordinate information, enabling structure depictions that match the original orientation found in the literature. This allows curators to visually compare predicted structures with the source images, aiding in validation and accuracy assessment.
The observed performance levels (69–75% average accuracy) were notably lower than reported benchmark accuracies for these models, reflecting the challenges of real-world literature processing compared to curated test datasets. Critical factors contributing to reduced performance include varying image quality across different publishers, diverse drawing styles and conventions, complex stereochemical representations in natural product structures, and the absence of standard molecular depictions.
Performance evaluation revealed that the three OCSR models possess complementary strengths, exhibit varying accuracy across different structure types, and display diverse error patterns. These findings support the implementation of all three OCSR engines in MARCUS, enabling users to cross-validate predictions and select the most appropriate results for their specific applications. The Human-in-the-loop ensemble approach provides both improved overall accuracy through model consensus and transparency in cases where models disagree, enabling users to make informed decisions about structure quality.
This evaluation directly influenced the development of comparison features in MARCUS, including Tanimoto similarity calculations and Maximum Common Substructure (MCS) analysis for ensemble validation. It also guided the implementation of confidence scoring mechanisms and user interface elements that highlight consensus or disagreement across models. The evaluation confirms that, despite the limitations of individual OCSR models across diverse literature sources, the human-in-the-loop ensemble approach in MARCUS improves extraction accuracy and provides users with clear tools to evaluate prediction reliability.
Using MARCUS, users can streamline their data extraction workflow significantly compared with manual curation. The integrated three-panel interface (Fig. 5) enables seamless navigation between PDF viewing, text annotation review, and structure analysis within a single browser session. Processing times vary by document complexity: typical natural products papers (10–15 pages) complete text extraction within 10 seconds, text annotation within 10–20 seconds, and structure segmentation within 60–90 seconds. The ensemble OCSR processing adds 2–3 minutes per structure set, depending on the number of detected chemical diagrams. Real-time progress indicators and status updates enhance the user experience, particularly during longer processing operations.
The exported data maintains complete traceability, with each structure segment linked to its original page coordinates and segment index, enabling users to verify predictions against source images. Batch selection functionality allows users to export subsets of structures, with selected data packages ranging from individual structures to complete document sets.
The automated submission workflow for COCONUT is integrated with the COCONUT database schema. The platform extracts and formats metadata, including compound names, organism information, and structural classifications, for database submission. DOI extraction and linking provide proper citation information for submitted entries.
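To illustrate the kind of record assembled for submission, a minimal sketch is shown below; the field names and values are placeholders drawn from the examples in Table 1 and do not reproduce the exact COCONUT submission schema.

```python
# Hypothetical submission payload; field names are illustrative only.
submission = {
    "name": "Caffeine",                                # trivial or IUPAC name
    "canonical_smiles": "CN1C=NC2=C1C(=O)N(C)C(=O)N2C",
    "organism": "Coffea arabica",                      # taxonomic metadata
    "organism_part": "Beans",
    "geo_location": "Ethiopia",
    "compound_class": "Alkaloids",
    "doi": "10.1000/example-doi",                      # placeholder DOI
}
```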
The integrated Ketcher editor enables structure refinement as needed before database submission. The submission interface maintains a one-molecule-per-submission approach, as required for proper tracking and provenance. Form auto-population reduces manual data entry requirements, although the extent varies depending on the completeness of the extracted metadata from individual publications.
The streamlined process eliminates many of the manual steps traditionally required in database curation workflows, though expert oversight remains necessary for quality assurance and validation of automated results.
The web interface, backed by FastAPI and containerised services, proved stable on modest hardware and enabled concurrent sessions with real-time feedback. Curators can inspect every prediction, adjust structures in the embedded editor, download the full annotation package as a single JSON file, or hand the record straight to COCONUT. Keeping all code, model checkpoints, and documentation openly available should make it easier for others to adapt the system to additional journals or target databases.
Several limitations remain. The accuracy of image-to-structure conversion still depends strongly on drawing style and image quality, and the current queue length is restricted by GPU capacity. Future work will focus on expanding training data, refining routing strategies for difficult images, and exploring lighter-weight inference back-ends so that more users can work simultaneously. We also plan to add automated consistency checks between the text-derived metadata and the decoded structures, as well as export options for other open-access repositories. The current system is optimised for natural products literature and may show reduced performance on medicinal chemistry or synthetic chemistry documents due to domain-specific training data.
In its present form, MARCUS cannot replace the expert curator, but it does take over many routine steps that previously consumed most of the curation time. We hope that the community will test the system, provide feedback, and help us turn MARCUS into a dependable companion for translating the growing body of natural products literature into FAIR, machine-actionable data.
This journal is © The Royal Society of Chemistry 2025