Yaoyi Su‡a, Siyuan Yang‡ab, Yuanhan Liua, Aiting Kaiab, Linjiang Chen*cd and Ming Liu*ab
aDepartment of Chemistry, Zhejiang University, Hangzhou, Zhejiang 310058, China. E-mail: mingliu@zju.edu.cn
bZJU-Hangzhou Global Scientific and Technological Innovation Center, Zhejiang University, Hangzhou, Zhejiang 311200, China
cKey Laboratory of Precision and Intelligent Chemistry, University of Science and Technology of China, Hefei, Anhui 230026, China. E-mail: linjiangchen@ustc.edu.cn
dSchool of Chemistry, School of Computer Science, University of Birmingham, Birmingham B15 2TT, UK
First published on 19th December 2024
Porous organic cages (POCs) are an emerging subclass of porous materials that are drawing increasing attention for their structural tunability, modularity and processability, and research in this area is expanding rapidly. Nevertheless, obtaining sufficient information from the extensive literature on organic molecular cages is a time-consuming and labour-intensive process. This article presents a GPT-4-based literature-reading method that combines multi-label text classification with follow-up information extraction, allowing the potential of GPT-4 to be fully exploited to rapidly extract valid information from the literature. In the multi-label text classification step, the prompt-engineered GPT-4 labelled text with reasonable recall according to the type of information it contained, including authors, affiliations, synthetic procedures, surface areas, and the Cambridge Crystallographic Data Centre (CCDC) numbers of the corresponding cages. GPT-4 also demonstrated proficiency in information extraction, effectively transforming the labelled text into concise tabulated data. Furthermore, we built a chatbot on top of the resulting database, allowing quick and comprehensive searching across the entire database and responses to cage-related questions.
Large language models (LLMs) such as the Generative Pre-trained Transformer (GPT) generate responses based on patterns and statistical regularities learned during their pre-training phase.14 These models can interact dynamically, adapting to the context of a conversation to simulate human-like dialogue and communication. With hundreds of billions of parameters, GPT models have shown exceptional performance in various fields, including natural language processing (NLP),15,16 medical imaging analysis,17,18 and chemical and biological research,19,20 garnering widespread recognition for their capabilities.
Prompt engineering has become a crucial technique for optimizing and fine-tuning LLMs to perform specific tasks and achieve desired outcomes. It involves crafting high-quality prompts that guide LLMs to generate accurate results,21,22 which requires selecting the appropriate type of prompt, adjusting its size and structure, and sequencing prompts effectively according to the task requirements. Zheng et al. used prompt engineering to guide GPT-3.5 in extracting synthesis text from the literature on metal–organic frameworks (MOFs) with precision exceeding 90%.23 The same group subsequently used a prompt-learning strategy to facilitate MOF synthesis experiments through a symbiotic human–AI collaboration,24 and later applied a similar approach to guide the discovery and optimization of synthesis conditions for MOFs and covalent organic frameworks (COFs).25 In 2024, Lu et al. predicted the yield of catalytic ammonia reduction with up to 86% accuracy by incorporating pre-existing experimental data into the prompt design.26
In this study, we employed prompt engineering to guide GPT-4 in performing multi-label text classification, a task more complex than binary classification and a significant challenge for large language models. Literature paragraphs were labeled based on the information they contained, such as authors, cage names, synthetic procedures, surface area, and the CCDC number of the corresponding cages. These labeled paragraphs were then used as the input for GPT-4 to extract and tabulate information into the cage knowledge database. Each row in the database contains details such as the cage name, corresponding synthetic procedures, monomers and their synthesis procedures, cage stoichiometry, surface area, and CCDC number. The accuracy of GPT-4's multi-label classification and information extraction was assessed by comparing its results with manually curated data, which served as the ground truth. Ultimately, the cage knowledge database was used to develop a chatbot capable of reliably answering a variety of cage-related questions.
Fig. 1 The workflow of GPT-based information extraction from the literature.
In the first step, the articles were divided into text segments. Each text segment was assigned a categorical label by GPT-4 guided by prompt engineering techniques (ESI, Section S2†). Since topology is described in a well-defined and fixed format, Python code was employed to identify the corresponding character sequences, as this is more cost-effective than using GPT-4. In the second step, the selected text containing relevant information was further organized into tabulated data by both human experts and GPT-4. The verified answers were then compiled into a database, which was subsequently used for constructing chatbots.
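As an illustration of the rule-based topology step, a short regular-expression sketch is shown below. It assumes topologies written in the bracketed "[m + n]" form and is an illustrative example, not the exact script shipped with the released code.

```python
import re

# Cage topologies appear in a fixed "[m + n]" pattern, so a regular expression
# is cheaper and more reliable than an LLM call for this field.
TOPOLOGY_RE = re.compile(r"\[\s*(\d+)\s*\+\s*(\d+)\s*\]")

def find_topologies(text: str) -> list[str]:
    """Return normalised topology strings such as '[2+3]' found in a text segment."""
    return [f"[{m}+{n}]" for m, n in TOPOLOGY_RE.findall(text)]

print(find_topologies("the [2 + 3] imine cage CC3 and its [4 + 6] analogue"))
# expected output: ['[2+3]', '[4+6]']
```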
Category | Description | Required
--- | --- | ---
Comprehensive synthesis | Contained comprehensive experimental conditions of the chemical reaction: the reaction temperature, reaction time, reactants, products, solvents, and their amounts must all appear clearly | ✓
CCDC | Contained a CCDC number | ✓
Surface area | Contained information on the specific surface area of a compound | ✓
This paper's authors | Contained information about the authors of this paper | ✓
Affiliation | Contained information about the authors' organizations, cities, nationalities, etc. | ✓
Extra authors | Contained authors of other articles, e.g. in background descriptions |
Incomprehensive synthesis | Contained incomplete experimental conditions of the chemical reaction |
References | Contained references |
Others | Paragraphs falling outside all of the previously mentioned categories |
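Given the categories above, the prompt-based classification call can be sketched roughly as follows. This is a minimal illustration, not the authors' exact prompt (given in ESI, Section S2†); it assumes the OpenAI Python SDK (openai ≥ 1.0) and an OPENAI_API_KEY environment variable.

```python
# Minimal sketch of prompt-based multi-label classification with GPT-4,
# assuming the OpenAI Python SDK (openai>=1.0) and OPENAI_API_KEY in the
# environment. The prompts actually used in this work are given in ESI S2.
from openai import OpenAI

LABELS = [
    "comprehensive synthesis", "CCDC", "surface area",
    "this paper's authors", "affiliation", "extra authors",
    "incomprehensive synthesis", "references", "others",
]

SYSTEM_PROMPT = (
    "You classify paragraphs from papers on porous organic cages. "
    "Reply with every applicable label from this list, comma-separated: "
    + "; ".join(LABELS)
)

client = OpenAI()

def classify_segment(segment: str) -> list[str]:
    """Return the category labels GPT-4 assigns to one text segment."""
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,  # deterministic labelling
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": segment},
        ],
    )
    answer = response.choices[0].message.content.lower()
    return [label for label in LABELS if label.lower() in answer]
```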
Precision and recall were calculated as follows:

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}$$

where TP, FP and FN denote the numbers of true positives, false positives and false negatives for a given category. The F1 score is the harmonic mean of precision and recall:

$$F_1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
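For illustration, these per-category metrics can be computed with scikit-learn's multi-label utilities. The snippet below assumes binary indicator matrices for the ground-truth and predicted labels (toy values shown) and is a generic sketch, not the evaluation script used to produce Fig. 2.

```python
# Illustrative per-category precision/recall/F1 for a multi-label task, where
# y_true and y_pred are binary indicator arrays of shape (n_segments, n_labels).
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

CATEGORIES = ["comprehensive synthesis", "CCDC", "surface area",
              "this paper's authors", "affiliation"]

y_true = np.array([[1, 0, 0, 0, 0],   # toy ground-truth labels
                   [0, 1, 1, 0, 0],
                   [0, 0, 0, 1, 1]])
y_pred = np.array([[1, 0, 0, 0, 0],   # toy GPT-4 predictions
                   [0, 1, 0, 0, 0],
                   [0, 0, 0, 1, 0]])

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average=None, zero_division=0)

for name, p, r, f in zip(CATEGORIES, precision, recall, f1):
    print(f"{name:25s}  P={p:.2f}  R={r:.2f}  F1={f:.2f}")
```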
Fig. 2b shows the distribution of different text categories, revealing that most of the text in the original documents falls under the categories of references and other sections. The key information we needed—such as authors, affiliations, specific surface areas, CCDC numbers, and experimental procedures—constitutes less than 10% of the total text content. This indicates that the GPT-4 classification process significantly reduces the volume of text to be processed in the tabulation step, lowering the corresponding costs.
The recall and precision results are illustrated in Fig. 2c. The “comprehensive synthesis” category had the highest recall at 0.74, which can be attributed to distinctive markers in the text, such as frequently mentioned compound names and amounts. However, the precision for this category dropped significantly, down to 0.61. In the actual-versus-predicted category matrix (ESI, Fig. S1†), the primary error is the misidentification of segments that should have been labeled “incomprehensive synthesis” as “comprehensive synthesis”. This highlights that even with prompt engineering designed to differentiate between comprehensive and incomprehensive synthesis, some errors persist. Texts under the “this paper's authors”, “affiliation”, and “CCDC number” categories had similar recall values of 0.67, 0.67, and 0.66, respectively. However, the precision for “this paper's authors” was notably higher than for “affiliation” and “CCDC number”, showing a marked gap between precision and recall. Specific surface area information had both low recall and low precision, likely because, while it carries the identifier “m² g⁻¹”, it is often confused with similar terms such as “m² s⁻¹” and “m/z”. The recall and precision for surface area and CCDC numbers, which should be readily identifiable owing to their distinct identifiers, were found to be unsatisfactory. This outcome can be attributed to redundant text significantly interfering with the encoding and decoding process of GPT-4. Evidence for this conclusion appears in the actual-versus-predicted category matrix (ESI, Fig. S1†), which shows that the recall for CCDC is 66.44%, with 25.17% of the information misclassified as “others”. Similarly, the recall for surface area is 47.71%, while information classified as “others” accounts for 43.13%, a value comparable to the recall.
For the monomers, differences arose in the naming of the same compound (BPDDP), while other differences were primarily due to redundant text being extracted by GPT-4. Specifically, GPT-4 included titles along with the synthetic routes for cages and monomers, while manual work did not. Regarding surface area, GPT-4's response included more information than the manual response, providing additional comparisons of surface area between BPPOC and another compound, BTPOC. This suggests that GPT-4 has the potential to offer additional information, enhancing researchers' understanding of cage-related knowledge.
The results of the BERT score calculation, shown in Fig. 3b, indicate that the average similarity score across all information in the articles was 0.9155. In particular, with a score of 0.9357, the CCDC numbers showed the highest similarity. The similarity scores for specific surface area, synthetic routes of molecular cages, names and synthetic routes of monomers were also relatively high, each reaching a value of around 0.90. The lowest similarity score of 0.8405 was observed for the information related to the topology, mainly because a significant part of the relevant information could not be successfully extracted from the text and was therefore labelled as “None”.
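The similarity scoring itself can be reproduced in outline with the bert-score package; the sketch below assumes its default English backbone and illustrative example strings, and is not necessarily the exact configuration used for Fig. 3.

```python
# Minimal sketch of BERTScore similarity between GPT-4 extractions and the
# manually curated ground truth, using the bert-score package with its default
# English model; the exact settings behind Fig. 3 may differ.
from bert_score import score

gpt4_answers = ["BPPOC was obtained by [2+3] imine condensation of ..."]
manual_answers = ["BPPOC was synthesised via [2+3] imine condensation of ..."]

# score() returns per-pair precision, recall and F1 tensors.
P, R, F1 = score(gpt4_answers, manual_answers, lang="en", verbose=False)
print(f"mean BERTScore F1 = {F1.mean().item():.4f}")
```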
Based on the complexity of their synthesis, the articles studied were categorized into three classes: class I, articles reporting only a single POC; class II, articles reporting multiple POCs without transformation relationships between them, usually synthesized in parallel using the same reaction type but different building blocks; and class III, articles reporting multiple POCs with transformation relationships among them. The statistics show that, in general, the accuracy of information extraction gradually decreases as the complexity of the articles increases. However, for the extraction of topologies, class II articles showed considerably higher similarity than class I articles, contrary to the general trend.
The distribution of similarity was further analyzed using the molecular cage synthetic route as a representative example (Fig. 3c). The analysis shows that most similarities are above 0.8, with a significant proportion exceeding 0.9. However, a few samples had notably lower similarity. Upon reviewing these cases, we found that low similarity scores were mainly due to unsuccessful extractions, resulting in a single word “None” or very short answers. A typical example of this error was the vague description “Condensation of a pyridine system” replacing a comprehensive synthesis route. Fortunately, such instances are rare and do not significantly impact GPT-4's overall performance.
In the task of information extraction and tabulation, GPT-4 demonstrated strong capabilities in processing input text and extracting multiple categories of information simultaneously. This feature can significantly aid researchers by allowing for quicker reading and summarization of new papers. With GPT-4's assistance, researchers can save considerable time and effort in literature review, enabling them to focus more on tasks that require innovation and creativity.
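A tabulation call of this kind can be sketched as a second prompt that asks GPT-4 to return one structured record per cage. The field names below mirror the database columns described earlier but are illustrative; the exact prompt wording used in this work is given in the ESI.

```python
# Illustrative extraction-and-tabulation call: labelled paragraphs in, one
# structured record out. Field names mirror the database columns described in
# the text; the actual prompts are documented in the ESI.
import json
from openai import OpenAI

FIELDS = ["cage name", "cage synthesis", "monomers", "monomer synthesis",
          "topology", "surface area", "CCDC number"]

client = OpenAI()

def tabulate(labelled_paragraphs: str) -> dict:
    """Ask GPT-4 to condense labelled text into a single JSON record."""
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,
        messages=[
            {"role": "system",
             "content": ("Extract the following fields and reply with a JSON "
                         "object only, using null for missing values: "
                         + ", ".join(FIELDS))},
            {"role": "user", "content": labelled_paragraphs},
        ],
    )
    return json.loads(response.choices[0].message.content)
```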
These results offer substantial evidence that GPT-4 is capable of answering questions based on the information contained within the database. The types of questions it can address are diverse and not limited to a specific subset. GPT-4's responses are comprehensive enough to aid chemists in obtaining relevant information without the need to read the full text. Additionally, the system can meet practical demands in the field, such as providing detailed guidance for the synthesis of organic cages.
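A database-backed chatbot of this kind can be assembled with a simple retrieve-then-answer loop. The sketch below uses naive keyword retrieval over the tabulated records and is a deliberately simplified stand-in for the chatbot released in the Docker image, which may differ in both retrieval and prompting.

```python
# Simplified retrieve-then-answer chatbot over the cage knowledge database.
# Retrieval here is a naive keyword match over the tabulated records; the
# released chatbot (see the Docker image) may differ in retrieval and prompting.
import json
from openai import OpenAI

client = OpenAI()

def retrieve(question: str, records: list[dict], top_k: int = 3) -> list[dict]:
    """Rank database rows by how many question words they contain."""
    words = {w.lower() for w in question.split() if len(w) > 3}
    scored = sorted(
        records,
        key=lambda rec: -sum(w in json.dumps(rec).lower() for w in words),
    )
    return scored[:top_k]

def answer(question: str, records: list[dict]) -> str:
    """Answer a cage-related question from the most relevant database rows."""
    context = json.dumps(retrieve(question, records), indent=2)
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,
        messages=[
            {"role": "system",
             "content": "Answer cage-related questions using only this "
                        "database context:\n" + context},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content
```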
Approximately 64% of the cages in the database were formed via imine condensation, with 11.63% being reduced from these imine precursors. This indicates that imine chemistry currently dominates the synthesis of porous organic cages (POCs). Additionally, 12.68% of the cages were synthesized via ether bonds, while other linkages, such as amides and boronic esters, were also observed (Fig. 5a).
Fig. 5 Statistical analysis of the chemistries involved in synthesis (a), topology (b), CCDC structures (c) and surface areas (d) of cages.
In terms of topologies, the analysis shows that [2 + 3]-cages account for 42.89%, which is nearly half of all entries in the database. Additionally, [4 + 6]-cages and [8 + 12]-cages are relatively prevalent, comprising 19.75% and 8.49% of the total, respectively (Fig. 5b).
Surface area provides guidance for exploring cage porosity and identifying potential applications. The densities and accessible surface areas (ASAs) of 253 entries were calculated using the Zeo++ software package (Fig. 5c).31 The probe radius was set to 1.82 Å, the kinetic radius of a nitrogen molecule. The results revealed a negative correlation between density and ASA: lower densities, around 0.5 g cm⁻³, correspond to ASAs exceeding 3000 m² g⁻¹ (red circle, Table S2†). Non-accessible surface area (NASA) values are generally lower, with significant values observed only within the density range of 1.00–1.25 g cm⁻³ (green circle), reflecting the inherently low surface area of high-density crystal structures.
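These calculations can be scripted in batch around the Zeo++ network executable. The call below is a sketch assuming a typical Zeo++ 0.3 command line (-ha -sa <chan_radius> <probe_radius> <samples> <structure>) with a 1.82 Å nitrogen probe and a hypothetical cage_structures directory; the exact options used for Fig. 5c may differ.

```python
# Batch sketch of Zeo++ accessible-surface-area calculations with an N2-sized
# probe (1.82 Å). The command-line form assumes a typical Zeo++ "network"
# invocation; the exact options used for Fig. 5c may differ.
import subprocess
from pathlib import Path

PROBE_RADIUS = 1.82  # kinetic radius of N2, in angstroms
N_SAMPLES = 2000     # Monte Carlo samples per atom

for structure in sorted(Path("cage_structures").glob("*.cif")):
    # Zeo++ writes its results to a .sa file alongside the input structure.
    subprocess.run(
        ["network", "-ha",                                   # high-accuracy mode
         "-sa", str(PROBE_RADIUS), str(PROBE_RADIUS), str(N_SAMPLES),
         str(structure)],
        check=True,
    )
    print(f"{structure.name}: surface-area output -> {structure.stem}.sa")
```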
Analysis of the experimental surface area data revealed that approximately 60 POCs exhibit surface areas exceeding 500 m² g⁻¹, with 12 entries surpassing 1500 m² g⁻¹ (Fig. 5d). With the exception of a boronic ester-based cage, all high-surface-area cages were imine-based, suggesting that imine chemistry is currently one of the most promising routes to high surface areas.
The code and required Python modules for text classification, information tabulation and the directly runnable chatbot can be found at https://hub.docker.com/r/syy12137059/cage_gpt/tags.
Footnotes
† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d4dd00337c |
‡ Equal contribution. |
This journal is © The Royal Society of Chemistry 2025 |