Unravelling the evolution of nickel-catalyzed C–O bond activation with data-driven strategies

Jingyuan Zhu a, Yizhou Wang a, Imanuel Rava a, Shihong Chen a, Zhiyan Zou a, Yong Huang *a, Zhenyang Lin *a and Haibin Su *ab
aDepartment of Chemistry, The Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong SAR, China. E-mail: yonghuang@ust.hk; chzlin@ust.hk; haibinsu@ust.hk
bIAS Center for AI for Scientific Discoveries, The Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong SAR, China

Received 26th June 2025, Accepted 13th October 2025

First published on 14th October 2025


Abstract

Since the 1970s, nickel has proven to be an exceptionally efficient catalyst for cross-coupling reactions, particularly in the activation of C–O bonds, which serves as an environmentally friendly alternative to organic halides. The relentless exploration by chemists of the synthetic methodologies and mechanisms of this field has progressively fostered the emergence of an increasingly mature yet intricate discipline. Despite its apparent complexity, the core patterns remain hidden within some significant works. The development of large language models (LLMs) has provided unprecedented opportunities to navigate this complex landscape and uncover hidden patterns. Here, we introduce GPT-NiCOBot, a modular platform that integrates LLMs with chemistry-specific tools to autonomously extract reactions and identify key patterns in reagents and catalysts from peer-reviewed papers. Moreover, by combining the core citation network with in-depth chemical knowledge, this platform constructs a more effective and comprehensive research assistance framework. This system demonstrates the potential of LLMs to accelerate research in nickel catalysis and suggests broader applications in other chemical subfields.


Introduction

Over recent decades, the construction of C–C bonds through transition metal-catalyzed cross-coupling reactions has attracted significant attention due to its broad applications in medicinal chemistry. C–O bonds are prevalent in both natural and industrial feedstocks, ranging from small molecules like phenol and acid derivatives to polymeric compounds such as lignans, DNAs and biomass.1 This abundance offers chemists a vast substrate pool for C–C bond forming reactions. In contrast to traditional methods using organohalides as electrophiles, C–O activation offers a more environmentally friendly and practical alternative. These reactions employ less toxic and more readily available materials, such as alcohols, ethers, and esters,2 which are often derived from biomass,3 with minimal preactivation.4 Moreover, C–O functionalization reactions reduce toxic chemical waste5 and enhance safety, environmental friendliness, and economic viability.

Traditionally, cross-coupling reactions involving halide electrophiles are catalyzed by palladium, which commonly operates in the 0 and +2 oxidation states. As an Earth-abundant and cost-effective alternative to palladium, nickel offers diverse catalytic pathways and unique open-shell reactivity, thanks to its wide range of accessible oxidation states from Ni(0) to Ni(IV).6 The smaller atomic size of nickel enhances its nucleophilicity, enabling efficient activation of sulfonates, esters and less reactive electrophiles.7 Research exploring nickel as a catalyst in C–O bond activation has intensified in recent decades. Hence, it has become a general platform to synthesize C–C bonds. Studies have explored a wide array of carbon–oxygen electrophiles, ranging from active sulfonates8 to ether insertion using nickel catalysis.9,10 With the initial proof-of-concept achieved in 1979,11 progress in this field has significantly accelerated throughout the 21st century. The surge in publications within this domain has resulted in a wealth of data, enhancing our comprehension of reaction mechanisms, ligand effects, and substrate scopes (Fig. 1a).


Fig. 1 (a) The vast amount of information gathered in Ni-catalyzed C–O bond activation reactions is scattered, making it challenging to conduct data mining and subsequent machine learning tasks. (b) General workflows of data-driven research in organic chemistry. (c) Introducing GPT-NiCOBot that can process multimodal information, analyse using advanced chemical tools, and iteratively refine outputs to support decision-making and planning.

The emergence of data-driven modeling has revolutionized the analysis of chemical data, aiding in reaction classification,12 product prediction,13 reactivity/yield estimation,14,15 and synthesis planning.16,17 A general workflow of data-driven methods used in organic chemistry is illustrated in Fig. 1b, and includes several key steps: dataset, representation, model and output. Traditionally, strategies relied on linear free energy relationships that connect a single parameter with the chemical reactivity of interest.18,19 Over time, multi-parameter methods have been introduced to correlate specific chemical inquiries with customized molecular descriptors. For example, the high relevance of the cone angle, buried volume, and electrostatic potential of ligands to predict the yield of nickel-catalyzed Suzuki reactions using a regression model was demonstrated.20 The emergence of advanced artificial intelligence (AI) and progress in high-throughput experiments has enabled the use of increasingly large datasets and more advanced algorithms to describe and predict the outcome of more complex catalyst systems. A neural network model utilized a comprehensive library of computed atomic, molecular, and vibrational descriptors as inputs to predict yields of Buchwald–Hartwig amination reactions.14 Structure-based fingerprints were also developed to predict the yield using a random forest model trained on the same dataset for the Buchwald–Hartwig reaction.21

Recently, OpenAI has demonstrated significant breakthroughs in the field of large language models (LLMs) through extreme scaling,22 successfully applying these models to chemical research23,24 and paving the way for data-driven techniques in organic chemistry. To enhance the cross-domain performance of LLMs, a viable approach is to build AI agents that utilize specialized external tools or plugins to improve their overall performance and applicability.25 The conceptual framework for an LLM-based agent includes three components: the brain, perception, and action.26 Serving as the controller, the brain module undertakes basic tasks such as memorizing, thinking, and decision-making. The perception module perceives and processes multimodal information from the external environment, while the action module uses tools to execute operations.

Inspired by the successful application of LLM agents in multimodal reasoning and action,27,28 we here present GPT-NiCOBot, an integrated platform that iteratively reflects on chemical tasks and refines responses (Fig. 1c). GPT-NiCOBot receives and processes various forms of input data (such as text, PDF files, and images). The brain module utilizes built-in databases, citation networks, and chemical patterns for in-depth data analysis/reasoning, and supports decision-making and planning. In the action phase, based on the prior analysis, it strategically employs appropriate chemical tools, a data retrieval unit (DRU), or research modes to acquire the necessary information to complete specific tasks. GPT-NiCOBot highlights the transformative potential of LLMs in chemistry research, serving both as an assistant to experts and as a gateway for non-experts through a user-friendly interface for accessing chemical knowledge.

Results and discussion

Designing the data retrieval unit

Large-scale extraction of chemical information from the literature presents a multifaceted challenge that extends beyond simple image or pattern recognition and demands contextual understanding. The complexity begins with parsing chemical structures, which are often depicted as intricate diagrams featuring author-specific notations for bonds, atoms, and functional groups. Reaction schemes frequently include pre-defined labels and annotations, necessitating the ability to distinguish text from graphical elements and to associate this text with the corresponding part of the diagram. This may also involve cross-referencing accompanying textual descriptions or tables. Additionally, chemical entity recognition is a crucial step that involves the precise categorization of chemical entities, such as reactants, products, ligands, and bases. GPT-NiCOBot extracted detailed reaction information from 148 papers on Ni-catalyzed C–C cross-coupling reactions via C–O activation. We developed a DRU to execute user-guided data processing (Fig. 2). The final output includes 13 key reaction parameters: electrophile, nucleophile, product, catalyst, catalyst loading, ligand, ligand loading, base, solvent, additive, temperature, time, and yield.
Fig. 2 Workflow of the data retrieval unit. The DRU employs LLMs and prompt engineering for efficient content mining, extracting homogeneous nickel catalysis reaction synthesis conditions from a diverse set of published research articles. This workflow includes three procedures: (1) location of the relevant page within the document based on user input, (2) extraction of reaction content from images, tables/figures/schemes, and text on the identified page, and (3) iteration of the data based on user feedback.

For the DRU, the critical steps involved in the pipeline include content mining from images, tables, and text. Leveraging the vision-language multimodal capabilities of GPT-4, we streamlined the content mining process for chemical data. First, an initial prompt is given to the DRU to determine whether the current page contains data of interest, such as tables or scheme names, based on the user-provided information. This step effectively filters out irrelevant pages, thereby saving tokens needed for further processing. Next, considering that GPT-4-Vision29 accepts input in image format, the DRU will digitize the relevant pages into images for subsequent image mining. At the same time, the original text format for GPT-4o30 will be retained to conduct table and text mining. For image mining, the DRU extracts detailed information on relevant chemical reactions, including reactants, products, nickel pre-catalysts, nickel pre-catalyst loading, ligands, ligand loading, bases, solvents, temperature, time, and yield, ensuring that this information is returned in a structured format. Given the limitations of GPT-4-Vision in recognizing molecular structures in images, the DRU only returns the labels of the molecules in the literature, without converting them into compound names or SMILES. Considering that tables or text might lack essential reaction information, data extracted from images will serve as prompts to assist in the processing of the tables and text. Then, the DRU integrates the extraction results from tables, text, and images and outputs a preliminary dataset. A name-to-SMILES converter tool is used to transform some of the molecules into SMILES representations, while returning the labels of molecules that cannot be converted to the user. During the user interaction phase, users can review the preliminary results and make suggestions for improvements based on their needs, such as supplementing data for specific entries or pointing out overlooked footnotes. 
Users can also upload screenshots of molecules that cannot be converted to SMILES, and the DRU will invoke the MolGrapher31 model for conversion. Through this interactive process and iterative data updates, the DRU continuously refines the information until the user is satisfied, and ultimately outputs structured data.
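
The integration step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the field names follow the 13 reaction parameters listed for the DRU output, while `merge_extractions`, the priority order (table, then text, then image), and all example values are hypothetical.

```python
from typing import Dict, List, Optional

# The 13 reaction parameters reported as the DRU output.
FIELDS: List[str] = [
    "electrophile", "nucleophile", "product", "catalyst", "catalyst_loading",
    "ligand", "ligand_loading", "base", "solvent", "additive",
    "temperature", "time", "yield",
]

Record = Dict[str, Optional[str]]


def merge_extractions(*sources: Record) -> Dict[str, str]:
    """Merge partial records; earlier sources win, missing fields become 'N/A'."""
    merged: Dict[str, str] = {}
    for field in FIELDS:
        value = "N/A"
        for source in sources:
            candidate = source.get(field)
            if candidate:  # skip None and empty strings
                value = candidate
                break
        merged[field] = value
    return merged


# Hypothetical partial outputs of the table, text, and image mining steps.
from_table = {"electrophile": "1a", "nucleophile": "2a", "product": "3aa", "yield": "92%"}
from_text = {"solvent": "toluene", "temperature": "80 °C", "time": "12 h"}
from_image = {"catalyst": "Ni(cod)2", "ligand": "PCy3", "base": "K3PO4", "solvent": "PhMe"}

record = merge_extractions(from_table, from_text, from_image)
print(record["solvent"], record["additive"])
```

Molecule labels such as `1a` are kept as labels here, mirroring the description that name or image conversion to SMILES is handled by separate tools.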

To evaluate the accuracy of content mining in the DRU, we conducted a comprehensive analysis of the entire result dataset. Specifically, we manually recorded the ground truth values for all the 13 reaction parameters across 1051 reactions, utilizing these data to assess the accuracy of the DRU outputs. Each reaction parameter was assigned to one of three labels: true positive (TP, where the DRU correctly identified the reaction parameter), false positive (FP, where the compound was incorrectly assigned to the wrong reaction parameter or irrelevant information was extracted), or false negative (FN, where the DRU failed to extract certain reaction parameters). Considering that the DRU is semi-automated, specific classification and processing are applied to evaluate its performance. Any results that required manual intervention are automatically categorized as FP or FN. Since the accuracy of converting molecular names or images into SMILES largely depends on the chosen tools or models, the conversion step is not included in the performance evaluation. Such an assessment method ensures a focused and accurate evaluation of the DRU's capabilities in content mining and reaction condition extraction, eliminating any potential interference due to the accuracy of external tools or models. The distribution of the TP labels for the 13 reaction parameters extracted across the 1051 reactions from all the 148 papers is shown in Fig. S2. It should be noted that not all reaction conditions require the reporting of all 13 reaction parameters; for example, most Kumada reactions and Negishi reactions did not involve an external base, and some reactions did not use additives. In such cases, the DRU assigns N/A to the corresponding reaction parameters. The precision, recall, and F1 scores (the harmonic mean of precision and recall) for each reaction parameter, as shown in Fig. S3, demonstrate high precision (>94%), recall (>86%), and F1 scores (>90%). 
Notably, the entire workflow requires minimal coding and relies on an efficient prompt engineering system. This approach relieves researchers of the arduous task of sifting through numerous papers within their field, enabling them to swiftly assimilate the latest domain knowledge.
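
For reference, the three metrics follow directly from the TP, FP, and FN labels. The sketch below uses the standard definitions; the function name and the counts in the example are hypothetical and are not taken from the dataset.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0  # harmonic mean
    return precision, recall, f1


# Hypothetical counts for a single reaction parameter.
p, r, f1 = precision_recall_f1(tp=90, fp=5, fn=10)
print(f"precision={p:.3f} recall={r:.3f} F1={f1:.3f}")
```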

Analyzing the historical evolution

Nickel-catalyzed C–O activation represents a significant advancement in cross-coupling reactions, with numerous research groups contributing to this area. To better understand how this field evolved, 28 influential and highly cited papers were selected from our dataset to construct a citation network (Fig. 3a). Within this network visualization, different types of C–C bonds formed through cross-coupling reactions are clustered. The network also delineates the reactivities of electrophiles and nucleophiles. Relatively inert electrophiles include those with leaving groups such as –OR′ (alkoxy, aryloxy, siloxy, and hydroxy groups) and –OCOR′ (carboxylate, carbamate, and carbonate groups). Comparatively reactive electrophiles are those with leaving groups such as –OSO2R′ (sulfonate groups). Furthermore, this citation network categorizes nucleophiles into distinct classes, including RNu–[Mg] (Grignard reagents), RNu–[Zn] (organozinc reagents), RNu–[B] (organoboron reagents), and RNu–[H] (pro-nucleophiles involved in heterolytic C–H bond cleavage). Through this citation network, the evolution of research in nickel-catalyzed C–O activation over time, including the targeted substrates and products, can be visually presented, providing insights into future advancements in this field.
Fig. 3 The evolution of the field of Ni-catalyzed C–O bond activation. (a) This citation network highlights key papers from 1979 to recent years and the evolution of electrophiles and nucleophiles in this field, reflecting the advancements in the scope of substrates and C–C bonds formed over time. The size of the shape represents the local citation number of the paper in our database. The similarity score is calculated based on the Dice similarity coefficient using the Morgan fingerprint of the electrophilic moiety. (b) Representative reactions around 2008.
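
The Dice similarity coefficient mentioned in the caption can be illustrated on bit-vector fingerprints represented as sets of "on" bit indices. This is a sketch only: the bit indices below are made up, and in practice they would come from a Morgan fingerprint computed with a cheminformatics toolkit such as RDKit. Returning 0.0 for two empty fingerprints is an assumed convention.

```python
from typing import Set


def dice_similarity(bits_a: Set[int], bits_b: Set[int]) -> float:
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) for two sets of on-bit indices."""
    total = len(bits_a) + len(bits_b)
    if total == 0:
        return 0.0  # assumed convention for two empty fingerprints
    return 2 * len(bits_a & bits_b) / total


# Hypothetical on-bit indices for two electrophilic moieties.
fp_a = {1, 5, 9, 42}
fp_b = {5, 9, 42, 77, 100}
score = dice_similarity(fp_a, fp_b)
print(round(score, 3))
```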

Prior to 2004, research largely focused on pairings of highly reactive partners, such as aryl ethers coupled with Grignard reagents or organoboron nucleophiles with sulfonate electrophiles (Fig. 3b(1) and (2)). From 2008 onward, several research groups expanded the scope to less activated C–O electrophiles (including carbamates, carbonates, and carboxylates) and milder nucleophiles such as organoboron and organozinc reagents, as well as pro-nucleophiles accessed via heterolytic C–H cleavage.32,33 In 2008, Chatani and co-workers reported Ni-catalyzed Suzuki–Miyaura couplings of aryl ethers (Fig. 3b(4)),9 while the Shi34 and Garg35 groups independently demonstrated couplings of aryl boron reagents with aryl carboxylates; Garg and co-workers also extended the scope to aryl carbamates and carbonates.36 Shi and co-workers further pioneered Ni-catalyzed Negishi couplings of aryl/alkenyl pivalates that year (Fig. 3b(5)).37 Subsequently, in 2012, Itami and co-workers disclosed C–H/C–O couplings of azoles with arenol derivatives, including carboxylates and triflates (Fig. 3b(6)).38

Among aryl/alkenyl electrophiles, intra-class similarity increases after 2008, consistent with methodological generalization beyond a specific substrate; this consolidation is reflected in clusters of related electrophiles in Fig. 3a. A further trend after 2008 is the emergence of C(sp3) electrophiles, notably benzylic substrates. For example, in 2011, the Jarvo group made a significant breakthrough with a stereospecific nickel-catalyzed C(sp3)–C(sp3) cross-coupling employing alkyl Grignard reagents (Fig. 3b(7)).39 The Jarvo group also demonstrated Suzuki–Miyaura cross-coupling reactions of benzylic esters, carbonates, and carbamates with aryl-boronic esters.40 Concurrently, the Watson group introduced a related strategy, employing benzylic pivalates and aryl boroxines to produce diarylalkanes and triarylmethanes.41

This citation network plays a crucial role in understanding the background of the field. Compared to directly summarizing and analyzing all the collected literature using LLMs, providing structured data that explicitly clarifies the citation relationships between papers enables the model to more accurately identify milestone work in the field. This approach avoids redundancy and highlights key advancements. The background module of GPT-NiCOBot is designed based on the concept of the citation network (Fig. S12). It includes the citation relationship between papers and constructs a dataset with concise citation content to describe the core contributions of the cited works, leveraging the strengths of language models in text comprehension and generation. To provide high-quality background information, we extracted key information from 148 papers, including titles, abstracts, and keywords related to electrophiles and nucleophiles. Additionally, with the assistance of GPT-4o, we conducted preliminary summaries of each paper, evaluating aspects such as the introduction of new reaction types, the range of applicable substrates, the mildness of reaction conditions, and the potential for application and practicality. This structured, citation-based approach optimizes resource utilization and significantly enhances the effectiveness and accuracy of background information in the research process.
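
The idea of a local citation count, that is, citations received only from papers inside the curated database, can be sketched with a plain edge list. The paper identifiers and edges below are hypothetical placeholders, and `local_citation_counts` is an illustrative helper rather than part of GPT-NiCOBot.

```python
from typing import Dict, Iterable, Set, Tuple

# Hypothetical (citing, cited) pairs; "external" lies outside the database.
citations = [
    ("P2008a", "P1979"), ("P2008b", "P1979"), ("P2011", "P2008a"),
    ("P2011", "P1979"), ("P2012", "P2008b"), ("P2012", "external"),
]
database = {"P1979", "P2008a", "P2008b", "P2011", "P2012"}


def local_citation_counts(
    edges: Iterable[Tuple[str, str]], papers: Set[str]
) -> Dict[str, int]:
    """Count citations where both the citing and the cited paper are in the database."""
    counts = {paper: 0 for paper in papers}
    for citing, cited in set(edges):  # set() drops duplicate edges
        if citing in papers and cited in papers and citing != cited:
            counts[cited] += 1
    return counts


counts = local_citation_counts(citations, database)
print(sorted(counts.items(), key=lambda item: -item[1]))
```

Passing such counts and edges to the model as structured data, rather than raw text, is the kind of explicit citation relationship the paragraph above describes.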

Pattern mining

Using data-driven analysis to uncover substrate distribution patterns and to understand the roles of ligands and bases in key steps such as oxidative addition and transmetallation can support decision-making in experimental design and strategy development within this field. To focus on synthetically meaningful transformations, we limited our analysis to reactions in the dataset with yields greater than 40%. Additionally, to ensure comprehensive coverage, catalytic systems employing “nickel catalyst + ligand” combinations and those using pre-synthesized, purified metal complexes are treated on an equal footing (e.g. PCy3 is counted as the operative ligand when Ni(PCy3)2Cl2 is used as the catalyst).
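
These two preprocessing rules can be sketched as below. The sketch assumes each reaction is a dictionary with `catalyst`, `ligand`, and numeric `yield` keys; the lookup table contains only the Ni(PCy3)2Cl2 example given above, and the helper names and example rows are hypothetical.

```python
from typing import Dict, List, Optional

# Pre-formed complex -> operative ligand (only the example from the text).
PRECATALYST_LIGANDS: Dict[str, str] = {"Ni(PCy3)2Cl2": "PCy3"}


def operative_ligand(catalyst: str, ligand: Optional[str]) -> str:
    """Use the added ligand if present, else the ligand carried by the complex."""
    if ligand and ligand != "N/A":
        return ligand
    return PRECATALYST_LIGANDS.get(catalyst, "N/A")


def select_reactions(reactions: List[dict], min_yield: float = 40.0) -> List[dict]:
    """Keep reactions with yield strictly above min_yield and normalize the ligand."""
    selected = []
    for rxn in reactions:
        if rxn["yield"] > min_yield:
            ligand = operative_ligand(rxn["catalyst"], rxn.get("ligand"))
            selected.append({**rxn, "ligand": ligand})
    return selected


# Hypothetical rows, not taken from the dataset.
reactions = [
    {"id": 1, "catalyst": "Ni(cod)2", "ligand": "PCy3", "yield": 85.0},
    {"id": 2, "catalyst": "Ni(PCy3)2Cl2", "ligand": "N/A", "yield": 62.0},
    {"id": 3, "catalyst": "Ni(cod)2", "ligand": "PCy3", "yield": 40.0},
]
kept = select_reactions(reactions)
print([(r["id"], r["ligand"]) for r in kept])
```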

We initially utilized Tree MAP (TMAP)42 to correlate electrophiles with various nucleophiles in experimentally validated combinations (Fig. 4a). Within the TMAP framework, sub-trees categorize the electrophiles into two distinct regions: unactivated C–O bonds (such as ethers and esters) and activated C–O bonds (including phosphates and sulfonates). The patterns exhibited by different types of electrophiles show a range of clustering behaviors. Csp2 electrophiles, primarily aryls and alkenyls, are the most frequently studied. In contrast, Csp3 electrophiles are less common and are mostly limited to benzyl and allyl derivatives. This can be attributed to the propensity of these benzyls and allyls to readily undergo oxidative addition by forming η3-complexes.43 Furthermore, the carbon hybridization of electrophiles exhibits preferences for certain leaving groups. Allyl and benzyl electrophiles are typically paired with unactivated leaving groups, such as ether and ester groups, whereas aryl electrophiles are more commonly associated with better leaving groups such as sulfonates and phosphates.


Fig. 4 Data analysis in the subject databases. (a) TMAP of reactions based on electrophiles and nucleophiles using 1024-dimensional features generated by RDKit (see Tables S1 and S2). (b) Distribution of ligand types based on donor atoms. (c) Quantitative analysis of the ligand TEP value and BDE of the C–O bond for RE–OCOR′ and RE–OR′ electrophiles (examples of in situ C–O activation are excluded). (d) t-SNE clustering of phosphine and NHC ligands with nucleophiles. (e) Statistics for monodentate and bidentate ligands used for distinct organoboron reagents. (f) Quantitative analysis of the nucleophilicity values of nucleophiles and base pKa values (borate anions are excluded).

Ligand selection is a critical aspect of transition metal catalysis, particularly during the oxidative addition and transmetallation steps. When reaction substrates are altered, it is essential to reoptimize ligands by tuning their steric and electronic properties. In this study, density functional theory (DFT) calculations were also employed to generate molecular descriptors for electrophiles, nucleophiles, and ligands, enabling subsequent structure–property relationship investigations. Notably, some ligands are underrepresented due to the scarcity of low/negative outcomes and the bespoke nature of certain scaffolds (for example, N-ligands or some structurally complicated and rare ligands). We compiled statistics on various types of ligands based on their donor atoms. Fig. 4b shows that phosphine ligands, particularly monodentate and bidentate phosphines, are the most frequently employed. With their strong binding affinity to metals and highly tunable sterics and electronics, phosphine ligands dominate in metal-catalyzed reactions.44 N-Heterocyclic carbenes (NHCs) are also observed, albeit less frequently, showcasing their distinct properties, such as strong σ-donating ability and steric bulkiness.45 Conversely, nitrogen ligands, commonly employed in Ni-catalyzed radical reactions,45 are seldom encountered in our database. This scarcity may be attributed to their relatively weak σ-donation characteristics and potential π-accepting abilities, which can lead to electron-deficient nickel complexes, potentially limiting their efficacy as ligands in the polar process of oxidative addition to C–O bonds.

To explore the relationship between the ligand and electrophile, clustering analysis was conducted using t-distributed stochastic neighbor embedding (t-SNE; Fig. S5). The results demonstrate that Csp2 sulfonates and phosphates can accommodate a wider range of ligands, whereas less reactive Csp2 esters and ethers exhibit a preference for electron-rich ligands (with Tolman's electronic parameters,46 TEP, of 2064 cm−1 or lower). For Csp3 electrophiles, such as benzyl and allyl groups, electron-poor ligands (with TEP greater than 2064 cm−1) are commonly employed. To better understand the ligand choice for ester/ether substrates, the calculated bond dissociation energy (BDE) of the electrophile C–O bond was correlated quantitatively with the TEP of the ligands (Fig. 4c). The correlation plot demonstrates a staircase-like distribution of data points. Relatively weak and medium C–O bonds can tolerate phosphine ligands with a wide range of TEP values (Fig. 4c, region I). In contrast, C–O bonds with higher BDEs are more associated with ligands with low TEP values (Fig. 4c, region II), which underscores the importance of C–O bond cleavage in the catalytic cycle. For instance, where Grignard reagents serve as the nucleophile, ligands with high TEPs are infrequently matched with less reactive electrophiles11,32 (Fig. 4c, region II). This may be attributed to the role of Mg as a Lewis acid, aiding in the activation of C–O bonds.3,47 The hybridization of the electrophile carbon also exhibited a different preference for TEPs. Allyl/benzyl electrophiles tend to cluster towards the left-hand side, while aryl/alkenyl electrophiles are primarily located on the right-hand side. This positioning implies that the π-electrons of aryl/alkenyl substrates play a role in affecting the oxidative addition, requiring more electron-rich ligands for efficient C–O bond activation.

Another consideration for ligand selection is the relationship between the ligand denticity requirement and reactivity of the nucleophiles. This is illustrated in Fig. 4d using t-SNE with DFT-based descriptors. Mono- and bidentate ligands showed different distributions across different nucleophile classes. In the category of highly reactive nucleophiles such as Grignard and organozinc reagents on the left, mono- and bidentate ligands appear at similar frequencies. This phenomenon can be attributed to the high reactivity of these nucleophiles, which require little help from the ligand in the rapid transmetallation process. Conversely, less reactive nucleophiles exhibit a propensity for a specific ligand denticity to facilitate smoother transmetallation. For RNu–[B] nucleophiles, the cluster of monodentate ligands is considerably larger than that of bidentate ligands. This preference likely arises from indirect transmetallation,48 where Lewis base activation of the organoboron reagents is needed. Monodentate phosphines are better suited to this process due to their ease of dissociation. A statistical analysis was performed to further investigate the choice of ligands for organoboron nucleophiles. Fig. 4e shows that monodentate ligands are the most prevalent choice overall, while bidentate ligands are only occasionally used for boronic acids or borate anions. Interestingly, bidentate phosphine ligands are predominantly used in the reactions involving pro-nucleophiles (RNu–[H]). For unactivated C–H bonds, the nickelation step is typically slow and is sometimes the rate-determining step.49 Bidentate ligands are favored in these cases because they stabilize the C–Ni intermediates.
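
A tally of ligand denticity per nucleophile class, of the kind summarized in Fig. 4e, can be sketched with a counter. The rows below are made-up placeholders chosen only to exercise the code; they are not values from the dataset, and `monodentate_share` is a hypothetical helper.

```python
from collections import Counter
from typing import Optional

# Hypothetical (nucleophile class, ligand denticity) pairs, one per reaction.
rows = [
    ("RNu-[B]", "monodentate"), ("RNu-[B]", "monodentate"),
    ("RNu-[B]", "monodentate"), ("RNu-[B]", "bidentate"),
    ("RNu-[Mg]", "monodentate"), ("RNu-[Mg]", "bidentate"),
    ("RNu-[H]", "bidentate"), ("RNu-[H]", "bidentate"),
]
counts = Counter(rows)


def monodentate_share(tally: Counter, nucleophile_class: str) -> Optional[float]:
    """Fraction of monodentate ligands for one class, or None if the class is absent."""
    mono = tally[(nucleophile_class, "monodentate")]
    bi = tally[(nucleophile_class, "bidentate")]
    total = mono + bi
    return mono / total if total else None


for cls in ("RNu-[B]", "RNu-[Mg]", "RNu-[H]"):
    print(cls, monodentate_share(counts, cls))
```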

The role of an external base in promoting transmetallation has been widely accepted.50,51 To further understand the synergy between the type of base and nucleophiles, we performed a quantitative analysis correlating base pKa values in water with the strength of the nucleophile, as shown in Fig. 4f. In this plot, Δq represents the relative reactivity of the nucleophile. These data are determined by comparing the charge of the nucleophilic carbon atom in the original nucleophile (RNu–[M]) with the charge of the carbon atom when bonded to a hydrogen atom (RNu–[H]). A more negative Δq reflects a stronger nucleophile. A staircase-like distribution is observed. For strong nucleophiles, such as RNu–[Mg] and RNu–[Zn] reagents, an external base is typically not needed, except in scenarios where the electrophiles contain reactive hydrogen like –OH.52,53 In such instances, a strong base may be introduced to convert the alcohol into its corresponding salt.53 For RNu–[B] reagents, weak or moderate non-nucleophilic bases are often used (CO32−, PO43−, and tBuO−). Reactions involving RNu–[H] reagents typically require the addition of an external base, except when nucleophiles with highly acidic C–H bonds are utilized.54 The selection of the base for RNu–[H] reagents is influenced by the mechanism of the deprotonation. For the direct deprotonation of a high-pKa C–H bond, strong organic bases can be employed.55 By contrast, for C–H activation processes, inorganic bases, such as CO32−, are often used to facilitate the concerted metalation-deprotonation step.51 Furthermore, in reactions involving olefin reagents (Heck couplings), amines are sometimes also employed to deprotonate Ni-hydrides in the catalytic cycle.56 The correlation of the base with different substrates underscores the nuanced and multifaceted role of the base.
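
The Δq descriptor can be written out as a one-line calculation. The sketch assumes Δq is the simple difference between the charge on the nucleophilic carbon in RNu–[M] and the charge on the same carbon in RNu–[H], which is consistent with a more negative Δq indicating a stronger nucleophile. The charge values are hypothetical placeholders, not computed results.

```python
def delta_q(q_carbon_in_rm: float, q_carbon_in_rh: float) -> float:
    """Assumed definition: charge on C in RNu-[M] minus charge on C in RNu-[H]."""
    return q_carbon_in_rm - q_carbon_in_rh


# Hypothetical atomic charges (in units of e) for the same carbon atom.
dq_strong = delta_q(q_carbon_in_rm=-0.60, q_carbon_in_rh=-0.20)  # e.g. an RNu-[Mg] reagent
dq_mild = delta_q(q_carbon_in_rm=-0.35, q_carbon_in_rh=-0.20)    # e.g. an RNu-[B] reagent

# The more negative value corresponds to the stronger nucleophile.
print(dq_strong < dq_mild)
```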

Using GPT-NiCOBot for interactive research

To improve data accessibility and aid in the interpretation of its complex content, we converted this dataset into an interactive and user-friendly dialogue system. Drawing inspiration from the ReAct57 and MRKL58 systems, this mode's workflow merges logical reasoning with tools pertinent to specific tasks. This integration transforms LLMs from merely overconfident information providers into dynamic reasoning engines that actively engage in task reflection. To augment the general chemistry knowledge base accessible through this module, we integrated a web search tool and several basic chemical tools, enhancing the breadth of information available for chemical research and analysis (see Fig. S8).

In the interactive research mode, the background and reaction recommendation modules are the two core components (Fig. 5), both operating on the retrieval augmented generation framework. This framework expands the LLM's data access by incorporating external data sources. Preliminary summaries of individual papers are compiled with the assistance of GPT-4o. Additional descriptors from our database are also imported. Meanwhile, to enable descriptor-based analysis for unknown substrates or ligands, this mode utilizes RDKit59 for initial calculations. Two transformer-based models are employed to predict the TEP and Boltzmann-weighted buried volume for monodentate phosphine ligands. For a certain number of target substrates or ligands, it is recommended to organize their descriptors into a structured table and upload it to GPT-NiCOBot. After automatic embedding processing, GPT-NiCOBot can analyze the data and provide responses based on the external database. In addition to the citation networks discussed above, a dataset containing concise citation content is also constructed. Using GraphRAG60 for data retrieval, the background module can effectively provide accurate answers supported by experimental data. The research module allows input in SMILES, IUPAC names, and even images of structures. It can qualitatively recommend reaction conditions (ligand, solvent, base, temperature, etc.) from a set of substrates.


Fig. 5 Overview of the interactive research mode. (a) The background module provides general organic chemistry knowledge and information. (b) The reaction recommendation module (human to machine) processes questions from users. (c) The reaction recommendation module (machine to human) poses questions to users, evaluates their answers, and provides guidance for further use.

The reaction recommendation module assesses substrate compatibility and proposes optimal reaction conditions. Based on subtle functional differences, this module is further divided into the human to machine mode and machine to human mode. The human to machine mode is designed for users who have clear needs and provides assistance through Q&A functionality. When a user wants to inquire about suitable reaction conditions for a specific electrophile, it gradually recommends nucleophiles and the corresponding conditions. This step-by-step interaction enables users to swiftly identify the optimal reaction and its parameters. By using reaction conditions as search prompts, the module retrieves the most pertinent reactions from the embedded dataset for the user. If the backend deems the inputs infeasible, it offers a list of suggested reactions, together with a rationale and references to relevant literature. The machine to human mode is led by GPT-NiCOBot and is primarily used to test a user's understanding of the specific field. Unlike the human to machine mode, this mode focuses more on evaluation. GPT-NiCOBot generates questions based on the reaction database and evaluates the user's answers by integrating the background module, chemical tools, and established chemical patterns. The system can determine whether a user's response is reasonable and whether it adheres to chemical principles. It then provides corrections and feedback to help users deepen their understanding of the field.
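
The retrieval step, in which reaction conditions serve as the search prompt, can be illustrated with a deliberately simple stand-in. The sketch below ranks entries by token overlap (Jaccard similarity) instead of the learned embeddings and GraphRAG retrieval used by the platform; the entries, identifiers, and function names are all hypothetical.

```python
from typing import Dict, List, Set


def tokenize(text: str) -> Set[str]:
    """Lower-case the text and split it into a set of tokens."""
    return set(text.lower().replace(",", " ").split())


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B| between two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def retrieve(query: str, entries: List[Dict[str, str]], top_k: int = 2) -> List[Dict[str, str]]:
    """Return the top_k entries whose conditions best overlap with the query."""
    q = tokenize(query)
    ranked = sorted(
        entries, key=lambda e: jaccard(q, tokenize(e["conditions"])), reverse=True
    )
    return ranked[:top_k]


# Hypothetical database entries.
entries = [
    {"id": "rxn-1", "conditions": "aryl pivalate, arylboronic acid, PCy3, K3PO4, toluene"},
    {"id": "rxn-2", "conditions": "aryl ether, Grignard reagent, PCy3, THF"},
    {"id": "rxn-3", "conditions": "benzylic carbamate, arylboronic ester, NHC, KOtBu, toluene"},
]
hits = retrieve("aryl pivalate PCy3 K3PO4 toluene", entries)
print([h["id"] for h in hits])
```

In a retrieval-augmented setup, the retrieved entries would then be passed to the LLM as context together with the user's question.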

Conclusions

In conclusion, the work described here represents a significant step forward in the application of LLMs to organic chemistry, particularly in the study of nickel-catalyzed C–O bond activation reactions. By releasing GPT-NiCOBot, we have highlighted the critical role of high-quality data and demonstrated a novel, data-driven approach to studying complex chemical reactions. The integration of prompt engineering, a chemistry-specific image recognition model, and an interactive prompt refinement strategy has significantly advanced the extraction and analysis of the relevant literature. This work has not only illuminated the historical development of the field through citation network analysis but has also revealed correlations and preferences among various chemical components through statistical analysis. By harnessing these chemical patterns, we have guided LLMs to provide more accurate answers. This enhancement improves access to the synthesis dataset, thereby facilitating a transition from data collection to interpretive dialogue. Although our findings are constrained by the limited size and quality of the selected datasets and tools, the scope for further development is vast. Integrating additional resources, such as quantum chemistry packages or models trained for specific tasks, into GPT-NiCOBot could greatly enhance its capabilities in areas such as novel catalyst design and reaction mechanism prediction. Moreover, although the scope of the evaluation tasks is currently narrow, future research and development could expand and diversify these tasks, ultimately pushing the boundaries of what such systems can achieve. This study serves as a proof of concept, showcasing the potential of the GPT series models and other LLMs for domain-specific data mining and analysis.

Author contributions

Conceptualization: H. S., Z. L. and Y. H.; data curation: I. R., Y. W., J. Z., Z. Z. and S. C.; software: J. Z. and I. R.; data analysis: I. R., Y. W., J. Z. and S. C.; writing—original draft preparation: J. Z. and Y. W.; writing—review and editing: H. S., Z. L., Y. H., J. Z., Y. W. and I. R.; supervision: H. S., Z. L. and Y. H. All authors have read and agreed to the published version of the manuscript.

Conflicts of interest

There are no conflicts to declare.

Data availability

The accession codes are available free of charge at https://github.com/hbsulab/NiCOBot. The citation network can be accessed at https://hbsulab.github.io/NiCOBot/citation-network.html.

Supplementary information (SI): data analysis and GPT-NiCOBot (.docx); collected data and calculated descriptors (.xlsx). See DOI: https://doi.org/10.1039/d5qo00947b.

Acknowledgements

This work is supported in part by the Research Grants Council of Hong Kong (16304022), HKUST grant (R9418), and the Society of Interdisciplinary Research (SOIRÉE) in Hong Kong.

References

  1. Z. J. Shi, Homogeneous Catalysis for Unreactive Bond Activation, John Wiley & Sons, 2014.
  2. D. G. Yu, B. J. Li and Z. J. Shi, Exploration of New C−O Electrophiles in Cross-Coupling Reactions, Acc. Chem. Res., 2010, 43, 1486–1495.
  3. Z. Qiu and C. J. Li, Transformations of Less-Activated Phenols and Phenol Derivatives via C–O Cleavage, Chem. Rev., 2020, 120, 10454–10515.
  4. C. M. So and F. Y. Kwong, Palladium-catalyzed cross-coupling reactions of aryl mesylates, Chem. Soc. Rev., 2011, 40, 4963–4972.
  5. J. Cornella, C. Zarate and R. Martin, Metal-catalyzed activation of ethers via C–O bond cleavage: a new strategy for molecular diversity, Chem. Soc. Rev., 2014, 43, 8081–8097.
  6. B. M. Rosen, K. W. Quasdorf, D. A. Wilson, N. Zhang, A.-M. Resmerita, N. K. Garg and V. Percec, Nickel-Catalyzed Cross-Couplings Involving Carbon−Oxygen Bonds, Chem. Rev., 2011, 111, 1346–1416.
  7. S. Z. Tasker, E. A. Standley and T. F. Jamison, Recent advances in homogeneous nickel catalysis, Nature, 2014, 509, 299–309.
  8. V. Percec, J. Y. Bae and D. H. Hill, Aryl Mesylates in Metal Catalyzed Homocoupling and Cross-Coupling Reactions. 2. Suzuki-Type Nickel-Catalyzed Cross-Coupling of Aryl Arenesulfonates and Aryl Mesylates with Arylboronic Acids, J. Org. Chem., 1995, 60, 1060–1065.
  9. M. Tobisu, T. Shimasaki and N. Chatani, Nickel-Catalyzed Cross-Coupling of Aryl Methyl Ethers with Aryl Boronic Esters, Angew. Chem., Int. Ed., 2008, 47, 4866–4869.
  10. L. Guo, X. Q. Liu, C. Baumann and M. Rueping, Nickel-Catalyzed Alkoxy–Alkyl Interconversion with Alkylborane Reagents through C−O Bond Activation of Aryl and Enol Ethers, Angew. Chem., Int. Ed., 2016, 55, 15415–15419.
  11. E. Wenkert, E. L. Michelotti and C. S. Swindell, Nickel-induced conversion of carbon-oxygen into carbon-carbon bonds. One-step transformations of enol ethers into olefins and aryl ethers into biaryls, J. Am. Chem. Soc., 1979, 101, 2246–2247.
  12. P. Schwaller, D. Probst, A. C. Vaucher, V. H. Nair, D. Kreutter, T. Laino and J.-L. Reymond, Mapping the space of chemical reactions using attention-based neural networks, Nat. Mach. Intell., 2021, 3, 144–152.
  13. P. Schwaller, T. Laino, T. Gaudin, P. Bolgar, C. A. Hunter, C. Bekas and A. A. Lee, Molecular Transformer: A Model for Uncertainty-Calibrated Chemical Reaction Prediction, ACS Cent. Sci., 2019, 5, 1572–1583.
  14. D. T. Ahneman, J. G. Estrada, S. Lin, S. D. Dreher and A. G. Doyle, Predicting reaction performance in C–N cross-coupling using machine learning, Science, 2018, 360, 186–190.
  15. A. F. Zahrt, J. J. Henle, B. T. Rose, Y. Wang, W. T. Darrow and S. E. Denmark, Prediction of higher-selectivity catalysts by computer-driven workflow and machine learning, Science, 2019, 363, eaau5631.
  16. B. Mikulak-Klucznik, P. Gołębiowska, A. A. Bayly, O. Popik, T. Klucznik, S. Szymkuć, E. P. Gajewska, P. Dittwald, O. Staszewska-Krajewska, W. Beker, T. Badowski, K. A. Scheidt, K. Molga, J. Mlynarski, M. Mrksich and B. A. Grzybowski, Computational planning of the synthesis of complex natural products, Nature, 2020, 588, 83–88.
  17. M. H. S. Segler, M. Preuss and M. P. Waller, Planning chemical syntheses with deep neural networks and symbolic AI, Nature, 2018, 555, 604–610.
  18. L. P. Hammett, The Effect of Structure upon the Reactions of Organic Compounds. Benzene Derivatives, J. Am. Chem. Soc., 1937, 59, 96–103.
  19. P. R. Wells, Linear Free Energy Relationships, Chem. Rev., 1963, 63, 171–219.
  20. K. Wu and A. G. Doyle, Parameterization of phosphine ligands demonstrates enhancement of nickel catalysis via remote steric effects, Nat. Chem., 2017, 9, 779–784.
  21. F. Sandfort, F. Strieth-Kalthoff, M. Kühnemund, C. Beecks and F. Glorius, A Structure-Based Platform for Predicting Chemical Reactivity, Chem, 2020, 6, 1379–1390.
  22. J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu and D. Amodei, Scaling Laws for Neural Language Models, arXiv, 2020, preprint, arXiv:2001.08361, DOI: 10.48550/arXiv.2001.08361.
  23. D. A. Boiko, R. MacKnight, B. Kline and G. Gomes, Autonomous chemical research with large language models, Nature, 2023, 624, 570–578.
  24. Z. Zheng, O. Zhang, C. Borgs, J. T. Chayes and O. M. Yaghi, ChatGPT Chemistry Assistant for Text Mining and the Prediction of MOF Synthesis, J. Am. Chem. Soc., 2023, 145, 18048–18062.
  25. T. Schick, J. Dwivedi-Yu, R. Dessi, R. Raileanu, M. Lomeli, E. Hambro, L. Zettlemoyer, N. Cancedda and T. Scialom, Toolformer: Language Models Can Teach Themselves to Use Tools, in Advances in Neural Information Processing Systems, 2023, vol. 36, pp. 68539–68551.
  26. Z. Xi, W. Chen, X. Guo, W. He, Y. Ding, B. Hong, M. Zhang, J. Wang, S. Jin, E. Zhou, R. Zheng, X. Fan, X. Wang, L. Xiong, Y. Zhou, W. Wang, C. Jiang, Y. Zou, X. Liu, Z. Yin, S. Dou, R. Weng, W. Cheng, Q. Zhang, W. Qin, Y. Zheng, X. Qiu, X. Huang and T. Gui, The Rise and Potential of Large Language Model Based Agents: A Survey, arXiv, 2023, preprint, arXiv:2309.07864, DOI: 10.48550/arXiv.2309.07864.
  27. Z. Yang, L. Li, J. Wang, K. Lin, E. Azarnasab, F. Ahmed, Z. Liu, C. Liu, M. Zeng and L. Wang, MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action, arXiv, 2023, preprint, arXiv:2303.11381, DOI: 10.48550/arXiv.2303.11381.
  28. Y. Shen, K. Song, X. Tan, D. Li, W. Lu and Y. Zhuang, HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face, in Advances in Neural Information Processing Systems, 2023, vol. 36, pp. 38154–38180.
  29. GPT-4 V(ision) system card, https://openai.com/index/gpt-4v-system-card/, (accessed 11 May 2025).
  30. Hello GPT-4o, https://openai.com/index/hello-gpt-4o/, (accessed 11 May 2025).
  31. L. Morin, M. Danelljan, M. I. Agea, A. Nassar, V. Weber, I. Meijer, P. Staar and F. Yu, MolGrapher: Graph-based Visual Recognition of Chemical Structures, arXiv, 2023, preprint, arXiv:2308.12234, DOI: 10.48550/arXiv.2308.12234.
  32. J. W. Dankwardt, Nickel-Catalyzed Cross-Coupling of Aryl Grignard Reagents with Aromatic Alkyl Ethers: An Efficient Synthesis of Unsymmetrical Biaryls, Angew. Chem., Int. Ed., 2004, 43, 2428–2432.
  33. Z.-Y. Tang and Q.-S. Hu, Room-Temperature Ni(0)-Catalyzed Cross-Coupling Reactions of Aryl Arenesulfonates with Arylboronic Acids, J. Am. Chem. Soc., 2004, 126, 3058–3059.
  34. B.-T. Guan, Y. Wang, B.-J. Li, D.-G. Yu and Z.-J. Shi, Biaryl Construction via Ni-Catalyzed C−O Activation of Phenolic Carboxylates, J. Am. Chem. Soc., 2008, 130, 14468–14470.
  35. K. W. Quasdorf, X. Tian and N. K. Garg, Cross-Coupling Reactions of Aryl Pivalates with Boronic Acids, J. Am. Chem. Soc., 2008, 130, 14422–14423.
  36. K. W. Quasdorf, M. Riener, K. V. Petrova and N. K. Garg, Suzuki−Miyaura Coupling of Aryl Carbamates, Carbonates, and Sulfamates, J. Am. Chem. Soc., 2009, 131, 17748–17749.
  37. B.-J. Li, Y.-Z. Li, X.-Y. Lu, J. Liu, B.-T. Guan and Z.-J. Shi, Cross-Coupling of Aryl/Alkenyl Pivalates with Organozinc Reagents through Nickel-Catalyzed C-O Bond Activation under Mild Reaction Conditions, Angew. Chem., Int. Ed., 2008, 47, 10124–10127.
  38. K. Muto, J. Yamaguchi and K. Itami, Nickel-Catalyzed C–H/C–O Coupling of Azoles with Phenol Derivatives, J. Am. Chem. Soc., 2012, 134, 169–172.
  39. B. L. H. Taylor, E. C. Swift, J. D. Waetzig and E. R. Jarvo, Stereospecific Nickel-Catalyzed Cross-Coupling Reactions of Alkyl Ethers: Enantioselective Synthesis of Diarylethanes, J. Am. Chem. Soc., 2011, 133, 389–391.
  40. M. R. Harris, L. E. Hanna, M. A. Greene, C. E. Moore and E. R. Jarvo, Retention or Inversion in Stereospecific Nickel-Catalyzed Cross-Coupling of Benzylic Carbamates with Arylboronic Esters: Control of Absolute Stereochemistry with an Achiral Catalyst, J. Am. Chem. Soc., 2013, 135, 3303–3306.
  41. A. R. Ehle, Q. Zhou and M. P. Watson, Nickel(0)-Catalyzed Heck Cross-Coupling via Activation of Aryl C–OPiv Bonds, Org. Lett., 2012, 14, 1202–1205.
  42. D. Probst and J.-L. Reymond, Visualization of very large high-dimensional data sets as minimum spanning trees, J. Cheminf., 2020, 12, 12.
  43. P.-P. Chen, E. L. Lucas, M. A. Greene, S.-Q. Zhang, E. J. Tollefson, L. W. Erickson, B. L. H. Taylor, E. R. Jarvo and X. Hong, A Unified Explanation for Chemoselectivity and Stereospecificity of Ni-Catalyzed Kumada and Cross-Electrophile Coupling Reactions of Benzylic Ethers: A Combined Computational and Experimental Study, J. Am. Chem. Soc., 2019, 141, 5835–5855.
  44. A. L. Clevenger, R. M. Stolley, J. Aderibigbe and J. Louie, Trends in the Usage of Bidentate Phosphines as Ligands in Nickel Catalysis, Chem. Rev., 2020, 120, 6124–6196.
  45. J. B. Diccianni and T. Diao, Mechanisms of Nickel-Catalyzed Cross-Coupling Reactions, Trends Chem., 2019, 1, 830–844.
  46. C. A. Tolman, Steric effects of phosphorus ligands in organometallic chemistry and homogeneous catalysis, Chem. Rev., 1977, 77, 313–348.
  47. H. Ogawa, H. Minami, T. Ozaki, S. Komagawa, C. Wang and M. Uchiyama, How and Why Does Ni0 Promote Smooth Etheric C-O Bond Cleavage and C-C Bond Formation? A Theoretical Study, Chem. – Eur. J., 2015, 21, 13904–13908.
  48. J. E. Borowski, S. H. Newman-Stonebraker and A. G. Doyle, Comparison of Monophosphine and Bisphosphine Precatalysts for Ni-Catalyzed Suzuki–Miyaura Cross-Coupling: Understanding the Role of the Ligation State in Catalysis, ACS Catal., 2023, 13, 7966–7977.
  49. K. Muto, J. Yamaguchi, A. Lei and K. Itami, Isolation, Structure, and Reactivity of an Arylnickel(II) Pivalate Complex in Catalytic C–H/C–O Biaryl Coupling, J. Am. Chem. Soc., 2013, 135, 16384–16387.
  50. M. C. Schwarzer, R. Konno, T. Hojo, A. Ohtsuki, K. Nakamura, A. Yasutome, H. Takahashi, T. Shimasaki, M. Tobisu, N. Chatani and S. Mori, Combined Theoretical and Experimental Studies of Nickel-Catalyzed Cross-Coupling of Methoxyarenes with Arylboronic Esters via C–O Bond Cleavage, J. Am. Chem. Soc., 2017, 139, 10347–10358.
  51. H. Xu, K. Muto, J. Yamaguchi, C. Zhao, K. Itami and D. G. Musaev, Key Mechanistic Features of Ni-Catalyzed C–H/C–O Biaryl Coupling of Azoles and Naphthalen-2-yl Pivalates, J. Am. Chem. Soc., 2014, 136, 14834–14844.
  52. D.-G. Yu, X. Wang, R.-Y. Zhu, S. Luo, X.-B. Zhang, B.-Q. Wang, L. Wang and Z.-J. Shi, Direct Arylation/Alkylation/Magnesiation of Benzyl Alcohols in the Presence of Grignard Reagents via Ni-, Fe-, or Co-Catalyzed sp3 C–O Bond Activation, J. Am. Chem. Soc., 2012, 134, 14638–14641.
  53. B. Yang and Z.-X. Wang, Nickel-Catalyzed Cross-Coupling of Allyl Alcohols with Aryl- or Alkenylzinc Reagents, J. Org. Chem., 2017, 82, 4542–4549.
  54. Y. Kita, R. D. Kavthe, H. Oda and K. Mashima, Asymmetric Allylic Alkylation of β-Ketoesters with Allylic Alcohols by a Nickel/Diphosphine Catalyst, Angew. Chem., 2016, 128, 1110–1113.
  55. S.-C. Sha, H. Jiang, J. Mao, A. Bellomo, S. A. Jeong and P. J. Walsh, Nickel-Catalyzed Allylic Alkylation with Diarylmethane Pronucleophiles: Reaction Development and Mechanistic Insights, Angew. Chem., 2016, 128, 1082–1086.
  56. S. Z. Tasker, A. C. Gutierrez and T. F. Jamison, Nickel-Catalyzed Mizoroki–Heck Reaction of Aryl Sulfonates and Chlorides with Electronically Unbiased Terminal Olefins: High Selectivity for Branched Products, Angew. Chem., Int. Ed., 2014, 53, 1858–1861.
  57. S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan and Y. Cao, ReAct: Synergizing Reasoning and Acting in Language Models, arXiv, 2023, preprint, arXiv:2210.03629, DOI: 10.48550/arXiv.2210.03629.
  58. E. Karpas, O. Abend, Y. Belinkov, B. Lenz, O. Lieber, N. Ratner, Y. Shoham, H. Bata, Y. Levine, K. Leyton-Brown, D. Muhlgay, N. Rozen, E. Schwartz, G. Shachaf, S. Shalev-Shwartz, A. Shashua and M. Tenenholtz, MRKL Systems: A modular, neuro-symbolic architecture that combines large language models, external knowledge sources and discrete reasoning, arXiv, 2022, preprint, arXiv:2205.00445, DOI: 10.48550/arXiv.2205.00445.
  59. RDKit, https://www.rdkit.org/, (accessed 14 August 2021).
  60. D. Edge, H. Trinh, N. Cheng, J. Bradley, A. Chao, A. Mody, S. Truitt and J. Larson, From Local to Global: A Graph RAG Approach to Query-Focused Summarization, arXiv, 2024, preprint, arXiv:2404.16130, DOI: 10.48550/arXiv.2404.16130.

Footnote

These authors made equal contributions to this article.

This journal is © the Partner Organisations 2026