Kangyong Ma*
College of Physics and Electronic Information Engineering, Zhejiang Normal University, Jinhua City 321000, China. E-mail: kangyongma@outlook.com; kangyongma@gmail.com
First published on 10th January 2025
This work utilizes collected and organized instructional data from the field of chemical science to fine-tune mainstream open-source large language models. To objectively evaluate the performance of the fine-tuned models, we have developed an automated scoring system specifically for the chemistry domain, ensuring the accuracy and reliability of the evaluation results. Building on this foundation, we have designed an innovative chemical intelligent assistant system. This system employs the fine-tuned Mistral NeMo model as one of its primary models and features a mechanism for flexibly invoking various advanced models. This design fully considers the rapid iteration characteristics of large language models, ensuring that the system can continuously leverage the latest and most powerful AI capabilities. A major highlight of this system is its deep integration of professional knowledge and requirements from the chemistry field. By incorporating specialized functions such as molecular visualization, SMILES string processing, and chemical literature retrieval, the system significantly enhances its practical value in chemical research and applications. More notably, through carefully designed mechanisms for knowledge accumulation, skill acquisition, performance evaluation, and group collaboration, the system can optimize its professional abilities and interaction quality to a certain extent.
Fine-tuning substantially improves the performance of LLMs in specific application scenarios, laying the foundation for LLMs to further promote scientific progress across fields.4,5 For example, research by Ouyang et al. (2022), Wei et al. (2021), and Sanh et al. (2021) demonstrates that fine-tuning language models on a specific set of tasks significantly enhances their ability to understand and execute instructions.6–8 This method not only reduces the reliance on large datasets but also improves the generalization capabilities of the models. Given the scale of LLMs, a common strategy, known as Parameter-Efficient Fine-Tuning (PEFT), tunes only a small subset of parameters while keeping the rest fixed.9 PEFT has also gained interest beyond NLP, particularly in the computer vision (CV) community, for fine-tuning large visual models such as Vision Transformers (ViTs), diffusion models, and vision-language models.4
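As a minimal sketch of how such parameter-efficient fine-tuning is typically set up with the Hugging Face peft library (the base model name and target modules below are illustrative assumptions, not the exact configuration used in this work):

```python
# Minimal LoRA sketch with the Hugging Face peft library (illustrative; the base model
# and target modules are assumptions, not the exact configuration used in this work).
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "mistralai/Mistral-7B-v0.1"  # assumed base checkpoint for illustration
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Only the low-rank adapter matrices are trained; the base weights stay frozen.
lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of all parameters
```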
However, fine-tuning large models still has drawbacks. It requires substantial computational resources and data, is prone to overfitting on small-scale datasets, cannot accurately reflect potential risks (e.g., “hallucinations”), which may introduce latent hazards, and cannot update the model's knowledge base in real time.10 The primary reason for these drawbacks is that both pre-trained and fine-tuned large models rely on parameter memory, forming a parameterized implicit knowledge base.11 Hybrid models that combine parametric memory with non-parametric (i.e., retrieval-based) memory can address some of these issues.12–14 The Retrieval-Augmented Generation (RAG) technique improves the accuracy and reliability of generation by integrating knowledge from external databases (non-parametric memory), especially for knowledge-intensive tasks. This approach also allows for continuous knowledge updates and the integration of domain-specific information. RAG synergizes the intrinsic knowledge of large language models with the extensive dynamic repositories of external databases.15
Furthermore, with the continuous development of LLMs, they are seen as potential sparks for Artificial General Intelligence (AGI), providing hope for the construction of general AI agents.16 Currently, AI agents are considered a crucial step towards achieving AGI, encompassing the potential for a wide range of intelligent activities.17–19 In many real-world tasks, the capabilities of agents can be enhanced by constructing multiple cooperative agents.20 Studies have shown that multi-agent systems help encourage divergent thinking (Liang et al., 2023),21 improve factuality and reasoning abilities (Du et al., 2023),22 and provide verification (Wu et al., 2023).23 These features have garnered widespread attention. General frameworks for constructing multi-agent LLM applications include AutoGen,20 crewAI,38 LangChain39 and others. Intelligent agents based on large language models (LLMs) are increasingly permeating various aspects of human production and daily life, and designing agents with self-evolution capabilities has therefore become a current research hotspot. For example, Li et al.24 proposed an evolutionary framework for agent evolution and arrangement called EvoluaryAgent. Qian et al.25 proposed a general strategy for inter-task agent self-evolution based on Investigation-Consolidation-Exploitation (ICE).
These artificial intelligence technologies will provide a new paradigm for scientific research and open new avenues for scientific innovation, thereby significantly accelerating the pace of scientific discoveries. The close collaboration between artificial intelligence technologies and scientists heralds the advent of a new era of scientific exploration and technological breakthroughs.26,27
In recent years, despite the rapid development of artificial intelligence technology, especially the emergence of large language models, its application in the field of chemistry has not yet been widely popularized. As an important productivity tool, artificial intelligence not only improves work efficiency but also provides a new paradigm for scientific research. For chemistry, a discipline with a long history, how to harness this advanced productivity tool to breathe new life into the field has become an important question facing the new generation of chemists. This research aims to address this challenge by developing a dedicated intelligent assistance system for the field of chemistry through the integration of cutting-edge AI technologies. Specifically, we first collected and organized a large amount of data from the field of chemical science to fine-tune mainstream open-source large language models. Secondly, we designed an evaluation system specifically for the chemistry field to assess the performance of the fine-tuned models and select the best-performing model among them. On this basis, we developed an AI assistant for the chemistry field. This system integrates a multi-agent architecture, retrieval-augmented generation (RAG) technology, online search functionality, and an interactive user interface. It not only provides an innovative platform for chemical research and education but also offers valuable research opportunities for exploring multi-agent collaboration in complex systems. By fusing traditional chemical knowledge with cutting-edge AI technology, this system is expected to promote innovative development in the field of chemistry and provide new ideas and tools for solving current scientific and engineering challenges. Fig. 1 illustrates the overall process of this study.
For example, Kevin Maik Jablonka et al.45 fine-tuned the large language model GPT-3 to perform various tasks in chemistry and materials science, including predicting the properties of molecules and materials as well as the outcomes of chemical reactions. Zikai Xie et al.46 demonstrated the effectiveness of fine-tuned GPT-3 in predicting electronic and functional properties of organic molecules. Shifa Zhong et al.47 developed quantitative structure–activity relationship (QSAR) models for water pollutant activity/properties by fine-tuning GPT-3 models. Seongmin Kim et al.48 evaluated the effectiveness of pre-trained and fine-tuned large language models (LLMs) in predicting the synthesizability of inorganic compounds and selecting synthetic precursors. Results showed that fine-tuned LLMs performed comparably, and sometimes superiorly, to recent custom machine learning models in these tasks, while requiring less user expertise, cost, and time to develop.
These research findings conclusively demonstrate that fine-tuning LLMs can significantly enhance their application breadth and effectiveness in the field of chemical sciences. This approach not only provides powerful tools for chemical research but also promises to accelerate innovation in chemical sciences, offering new ideas and methods for solving complex chemical problems. As technology continues to advance, we can anticipate that fine-tuned LLMs will play an increasingly important role in the field of chemical sciences, driving chemical research towards deeper and more precise directions.
For example, Bran et al.49 developed ChemCrow, an LLM-based chemical agent designed to complete chemistry tasks such as organic synthesis, drug discovery, and materials design. By integrating multiple expert-designed chemical tools and using GPT-4 as the LLM, they enhanced the performance of LLMs in the field of chemistry and demonstrated new capabilities. Daniil A. Boiko et al.50 reported Coscientist, a GPT-4-powered artificial intelligence system capable of autonomously designing, planning, and executing complex scientific experiments. Coscientist leverages large language models combined with tools such as internet searches, document retrieval, code execution, and experimental automation. Andrew D. McNaughton et al.51 introduced a system called CACTUS (Chemistry Agent Connecting Tool-Usage to Science), an intelligent agent based on large language models (LLMs) designed to enhance advanced reasoning and problem-solving capabilities in chemistry and molecular discovery by integrating cheminformatics tools.
These research findings demonstrate that AI agents, by expanding the functionality of large language models, enable their more extensive application in the field of chemistry.
For instance, Tong Xie et al.43 constructed a dataset by integrating resources from multiple scientific domains to support natural science research, especially in the fields of physics, chemistry, and materials science. From this and the other studies cited above, we collected and organized the chemistry-related datasets listed below for use in this work; specific details can be found in the ESI.†
Dataset | URL | Data format |
---|---|---|
ESOL43 | https://github.com/MasterAI-EAM/Darwin/blob/main/dataset/ESOL/ESOL.json | Json |
MoosaviCp43 | https://github.com/MasterAI-EAM/Darwin/blob/main/dataset/MoosaviCp/MoosaviCp.json | Json |
MoosaviDiversity43 | https://github.com/MasterAI-EAM/Darwin/blob/main/dataset/MoosaviDiversity/MoosaviDiversity.json | Json |
NagasawaOPV43 | https://github.com/MasterAI-EAM/Darwin/blob/main/dataset/NagasawaOPV/NagasawaOPV.json | Json |
Chembl43 | https://github.com/MasterAI-EAM/Darwin/blob/main/dataset/chembl/chembl.json | Json |
matbench_expt_gap43 | https://github.com/MasterAI-EAM/Darwin/blob/main/dataset/matbench_expt_gap/matbench_expt_gap.json | Json |
matbench_glass43 | https://github.com/MasterAI-EAM/Darwin/blob/main/dataset/matbench_glass/matbench_glass.json | Json |
matbench_is_metal43 | https://github.com/MasterAI-EAM/Darwin/blob/main/dataset/matbench_is_metal/matbench_is_metal.json | Json |
matbench_steels43 | https://github.com/MasterAI-EAM/Darwin/blob/main/dataset/matbench_steels/matbench_steels.json | Json |
Pei43 | https://github.com/MasterAI-EAM/Darwin/blob/main/dataset/Pei/pei.json | Json |
waterStability43 | https://github.com/MasterAI-EAM/Darwin/blob/main/dataset/waterStability/waterStability.json | Json |
description_guided_molecule_design44 | https://huggingface.co/datasets/zjunlp/Mol-Instructions/tree/main/data | Json |
forward_reaction_prediction44 | https://huggingface.co/datasets/zjunlp/Mol-Instructions/tree/main/data | Json |
molecular_description_generation44 | https://huggingface.co/datasets/zjunlp/Mol-Instructions/tree/main/data | Json |
reagent_prediction44 | https://huggingface.co/datasets/zjunlp/Mol-Instructions/tree/main/data | Json |
property_prediction44 | https://huggingface.co/datasets/zjunlp/Mol-Instructions/tree/main/data | Json |
Retrosynthesis44 | https://huggingface.co/datasets/zjunlp/Mol-Instructions/tree/main/data | Json |
Fig. 3 and 4 show, respectively, the distribution of output character lengths in the instruction dataset and the frequency of the 20 most commonly used instruction types in this work.
Fig. 3 illustrates the character count (output length) of the output text in the dataset, which exhibits a wide distribution range, covering both short and long texts. The distribution is concentrated in the 0 to 1000 character range. Short texts (texts with fewer characters) appear more frequently, and as the output length increases, the frequency decreases. Kernel Density Estimation (KDE), also known as Parzen's window,28 is one of the most renowned methods for estimating the underlying probability density function of a dataset. The KDE curve provides a smooth estimate of the distribution within this range, aiding in a more intuitive understanding of the text distribution pattern.
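As an illustrative sketch (assuming the instruction data are available as a list of records with an "output" field), the histogram and KDE curve of Fig. 3 can be reproduced with scipy's Gaussian KDE:

```python
# Sketch: histogram and Gaussian KDE (Parzen window) of output character lengths.
# The file name and record structure are assumptions for illustration.
import json
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

with open("instructions.json") as f:
    records = json.load(f)                      # assumed list of {"instruction", "output"} records
lengths = np.array([len(r["output"]) for r in records])

kde = gaussian_kde(lengths)                     # smooth density estimate of the length distribution
xs = np.linspace(0, lengths.max(), 500)

plt.hist(lengths, bins=100, density=True, alpha=0.5, label="histogram")
plt.plot(xs, kde(xs), label="KDE")
plt.xlabel("Output length (characters)")
plt.ylabel("Density")
plt.legend()
plt.savefig("output_length_distribution.png", dpi=200)
```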
The bar chart (Fig. 4) shows the frequency of the 20 most common instructions in the dataset for this study. Among these, “Provide a brief overview of this molecule” and “Provide a description of this molecule” appear significantly more often than other instructions, indicating their prominent role in the dataset. Nonetheless, other types of instructions also appear, demonstrating the diversity of instruction types within the dataset.
Parameter | Value | Description |
---|---|---|
Lora_alpha | 16 | LoRA alpha parameter |
Max_steps | 60 | Maximum training steps |
Learning_rate | 2 × 10⁻⁴ | Learning rate |
Weight_decay | 0.01 | Weight decay parameter |
Seed | 3407 | Random seed |
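The hyperparameters in the table map onto a standard supervised fine-tuning run. The sketch below shows one way they could be wired into a TRL SFTTrainer; the batch size, sequence length, and dataset field name are assumptions not listed in the table, and argument names may differ slightly across library versions.

```python
# Sketch: wiring the tabulated hyperparameters into a TRL SFTTrainer run.
from transformers import TrainingArguments
from trl import SFTTrainer

args = TrainingArguments(
    output_dir="outputs",
    per_device_train_batch_size=2,      # assumed
    gradient_accumulation_steps=4,      # assumed
    max_steps=60,                       # from the table
    learning_rate=2e-4,                 # from the table
    weight_decay=0.01,                  # from the table
    seed=3407,                          # from the table
    logging_steps=1,
)

trainer = SFTTrainer(
    model=model,                        # base or LoRA-wrapped model, assumed already loaded
    tokenizer=tokenizer,
    train_dataset=train_dataset,        # instruction dataset described above, assumed loaded
    dataset_text_field="text",          # assumed field name
    max_seq_length=2048,                # assumed
    args=args,
)
trainer.train()
```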
Fig. 5 presents the training loss curve during the training process of LLMs. In the initial phase of training, the loss value is relatively high because the model parameters have not yet been optimized, leading to a significant gap between the predicted results and the actual values. As the training progresses, the model gradually learns and continuously adjusts the parameters, making the predicted results increasingly closer to the actual values. Consequently, the error decreases, and the loss value gradually declines and tends to stabilize.
Different scoring criteria were designed for different questions. Additionally, the evaluator accounts for special cases in the chemical sciences, assigning higher weights to keywords such as ‘reaction’, ‘mechanism’, ‘synthesis’, and ‘catalyst’. It also recognizes specific chemical terms (e.g., ‘alkane’, ‘alkene’, and ‘alkyne’), converts between units when making numerical comparisons (such as kJ to kcal), and applies special processing for questions involving concepts such as the LUMO, the HOMO, and orbital energies: the sign (positive or negative) of the extracted answer is compared with that of the reference value, since LUMO and HOMO energies are typically negative and the correctness of the sign matters. For questions involving MOFs, it pays special attention to key concepts such as ‘linker’, ‘node’, and ‘topology’.
The system employs various methods to evaluate the quality of answers. For numerical problems, it calculates relative errors and assigns corresponding scores. It uses Levenshtein distance31 or simple word set intersections to compute the similarity between answers and standard solutions. BLEU scores32 and ROUGE scores33 are used to assess the quality of generated text and summaries, respectively. The Flesch34 Reading Ease Index is utilized to evaluate text readability. In addition to these methods, the system also incorporates evaluation criteria such as keyword relevance, coherence, conciseness, factual accuracy, and creativity. Fig. 7 presents the scoring criteria for various types of questions.
Through these detailed settings, the evaluator can better assess the model's understanding of concepts related to molecular orbital theory, rather than just simple numerical matching. This enables a comprehensive evaluation of AI models' performance in answering chemistry-related questions, covering multiple dimensions including accuracy, relevance, readability, and creativity. Fig. 8 illustrates the scoring process. (See the ESI† for details).
Based on this classification, we developed a highly customized scoring analysis system, implemented through the OptimizedModelEvaluator class. This system evaluated eight models: Llama3, Mistral, Phi-3, Gemma, Gemma2, Phi-3 Medium, Mistral NeMo, and Llama3.1. Specific scoring criteria and weights were designed for the three main question types: numeric, descriptive, and generate. For numeric questions, the system weights numeric_accuracy (60%), keyword_relevance (20%), and conciseness (20%). The descriptive type considers bleu_score (20%), rouge_scores (20%), keyword_relevance (20%), readability (20%), and coherence (20%). The generate type emphasizes creativity (40%), coherence (30%), and keyword_relevance (30%).
The system also introduced chemistry-specific keyword weights and terminology, assigning different weights to various chemical concepts. For example, reaction, mechanism, and synthesis each account for 0.5 points, while bond, electron, and orbital each account for 0.3 points. Additionally, the system pays special attention to key chemical terms such as alkane, alkene, alkyne, aromatic, nucleophile, and electrophile. To ensure the accuracy of numerical evaluations, the system integrated conversion factors between common units, for instance 1 kJ = 0.239006 kcal and 1 eV = 96.485 kJ mol⁻¹. This carefully designed configuration ensures that the scoring system can accurately capture the characteristics and challenges of different types of questions.
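The configuration described above can be summarized schematically as follows (values are taken from the text; the dictionary names are illustrative and not the exact identifiers used in OptimizedModelEvaluator):

```python
# Schematic scoring configuration (values from the text; names are illustrative).
QUESTION_TYPE_WEIGHTS = {
    "numeric":     {"numeric_accuracy": 0.6, "keyword_relevance": 0.2, "conciseness": 0.2},
    "descriptive": {"bleu_score": 0.2, "rouge_scores": 0.2, "keyword_relevance": 0.2,
                    "readability": 0.2, "coherence": 0.2},
    "generate":    {"creativity": 0.4, "coherence": 0.3, "keyword_relevance": 0.3},
}

KEYWORD_IMPORTANCE = {
    "reaction": 0.5, "mechanism": 0.5, "synthesis": 0.5,     # weights given in the text
    "bond": 0.3, "electron": 0.3, "orbital": 0.3,            # weights given in the text
    "alkane": 0.3, "alkene": 0.3, "alkyne": 0.3,             # terms from the text; weights assumed
    "aromatic": 0.3, "nucleophile": 0.3, "electrophile": 0.3,
}

UNIT_CONVERSIONS = {                                          # factors for normalising numeric answers
    ("kJ", "kcal"): 0.239006,
    ("eV", "kJ/mol"): 96.485,
}
```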
For numeric type questions, the system identifies and extracts values and units, supports the aforementioned unit conversions, and calculates accuracy scores based on relative errors. In terms of keyword relevance scoring, the system uses a predefined keyword_importance dictionary to assign weights to different keywords, while also considering specific chemical terminology.
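A simplified sketch of this numeric scoring path is shown below; the regular expression, unit handling, and score decay are illustrative assumptions rather than the exact implementation.

```python
# Simplified numeric-question scoring: extract a value/unit pair and score by relative error.
import re

NUMBER_UNIT = re.compile(r"(-?\d+(?:\.\d+)?(?:[eE]-?\d+)?)\s*(kcal/mol|kJ/mol|kJ|kcal|eV)?")

def extract_value(text: str):
    m = NUMBER_UNIT.search(text)
    if not m:
        return None, None
    return float(m.group(1)), m.group(2)

def numeric_accuracy(answer: str, reference: str) -> float:
    a_val, a_unit = extract_value(answer)
    r_val, r_unit = extract_value(reference)
    if a_val is None or r_val is None:
        return 0.0
    if a_unit == "kcal" and r_unit == "kJ":      # normalise the answer into the reference unit
        a_val /= 0.239006
    rel_err = abs(a_val - r_val) / max(abs(r_val), 1e-12)
    return max(0.0, 1.0 - rel_err)               # 1.0 for an exact match, decaying with error
```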
For descriptive and generate type questions, the system integrates various advanced natural language processing techniques. Text similarity scoring primarily uses Levenshtein distance, with the word set overlap rate as a fallback when unavailable. The system also applies BLEU and ROUGE algorithms to evaluate generated text quality, uses the textstat library to calculate readability, and assesses text coherence based on the word overlap rate between adjacent sentences.
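The sketch below illustrates these text-quality metrics using commonly available libraries (python-Levenshtein, NLTK, rouge-score, and textstat), following the methods named above; the exact parameters and fallback behaviour in the actual system may differ.

```python
# Illustrative text-quality metrics for descriptive/generate questions.
import textstat
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

try:
    import Levenshtein                            # python-Levenshtein, optional
except ImportError:
    Levenshtein = None

def similarity(answer: str, reference: str) -> float:
    if Levenshtein is not None:
        return Levenshtein.ratio(answer, reference)
    a, r = set(answer.lower().split()), set(reference.lower().split())
    return len(a & r) / max(len(a | r), 1)        # word-set overlap fallback

def bleu(answer: str, reference: str) -> float:
    return sentence_bleu([reference.split()], answer.split(),
                         smoothing_function=SmoothingFunction().method1)

def rouge_l(answer: str, reference: str) -> float:
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    return scorer.score(reference, answer)["rougeL"].fmeasure

def readability(answer: str) -> float:
    return textstat.flesch_reading_ease(answer)

def coherence(answer: str) -> float:
    sentences = [s.split() for s in answer.split(".") if s.strip()]
    if len(sentences) < 2:
        return 1.0
    overlaps = [len(set(a) & set(b)) / max(len(set(a) | set(b)), 1)
                for a, b in zip(sentences, sentences[1:])]
    return sum(overlaps) / len(overlaps)
```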
For domain-specific knowledge such as HOMO/LUMO energy levels or MOF structures, the system applies special scoring rules to accurately evaluate these highly specialized chemical concepts. We implemented a complex and refined set of rules in the scoring system. These rules not only consider numerical accuracy but also include unit consistency, relative energy relationships, structural composition, functional properties, and more.
For HOMO/LUMO energy levels, the system first evaluates numerical accuracy, allowing an error range of ±0.1 eV. We also consider unit consistency, prioritizing electron volts (eV) as the standard unit and slightly penalizing answers using non-standard units. Furthermore, the system checks if the relative positions of HOMO and LUMO levels are correct and rewards answers that correctly mention the energy gap.
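A minimal sketch of these HOMO/LUMO rules is given below; the ±0.1 eV tolerance and the sign check follow the description above, while the individual point values are illustrative assumptions.

```python
# Sketch of the HOMO/LUMO special-case rules (point values assumed).
def score_homo_lumo(answer_ev: float, reference_ev: float,
                    answer_unit: str = "eV", mentions_gap: bool = False) -> float:
    score = 0.0
    if abs(answer_ev - reference_ev) <= 0.1:      # numerical accuracy within ±0.1 eV
        score += 0.6
    if (answer_ev < 0) == (reference_ev < 0):     # HOMO/LUMO energies are typically negative
        score += 0.2
    if answer_unit != "eV":                        # slight penalty for non-standard units
        score -= 0.1
    if mentions_gap:                               # reward correctly mentioning the energy gap
        score += 0.1
    return max(0.0, min(1.0, score))
```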
When evaluating MOF structures, our rules are more comprehensive. The system checks if the answer correctly identifies the metal center, organic linkers, and their connectivity. We also assess descriptions of porosity and specific surface area, as well as the identification and explanation of the MOF's main functions. To encourage more in-depth answers, we provide extra points for mentioning synthesis methods and characterization techniques.
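Schematically, the MOF checks can be expressed as keyword-based rules such as the following (keyword lists and point values are illustrative assumptions):

```python
# Sketch of the MOF checks as keyword-based rules.
def score_mof_answer(answer: str) -> float:
    text = answer.lower()
    score = 0.0
    if any(k in text for k in ("metal center", "metal node", "node")):
        score += 0.25                              # metal centre identified
    if any(k in text for k in ("linker", "organic ligand")):
        score += 0.25                              # organic linkers and connectivity
    if "porosity" in text or "surface area" in text:
        score += 0.2                               # porosity / specific surface area
    if any(k in text for k in ("gas storage", "catalysis", "separation", "adsorption")):
        score += 0.2                               # main function identified
    if any(k in text for k in ("solvothermal", "xrd", "bet")):
        score += 0.1                               # bonus: synthesis or characterization details
    return min(1.0, score)
```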
These rules are implemented through the calculate_factual_accuracy method in the OptimizedModelEvaluator class. This method uses regular expressions to extract values and units and combines them with a chemistry knowledge base to provide reference values and expected ranges. The scoring system can dynamically adjust weights based on the depth and accuracy of the provided information.
By implementing these special rules, our scoring system can more accurately evaluate the model's performance in handling complex chemical concepts. This not only improves the accuracy and professionalism of the evaluation but also provides valuable feedback to model developers regarding the model's mastery of specific chemical domain knowledge. This approach allows us to gain a more comprehensive understanding of the capabilities and limitations of large language models in specialized chemical problems, providing important guidance for further improvement and application of these models. Creativity scoring combines uniqueness (degree of difference from standard answers) and coherence, mainly used to evaluate generate type questions.
This comprehensive scoring system is not just a simple word count or hard-coded decision tree, but a complex evaluation tool that integrates multiple techniques and domain knowledge. By preliminarily classifying questions and designing specific scoring criteria and weights for each category, we can more accurately evaluate the model's performance in different types of tasks. This approach enables us to comprehensively and deeply analyze the performance of large language models on complex and diverse chemical problems.
The model performance evaluation results in Fig. 9 show that Mistral NeMo demonstrated the strongest overall performance, with an average score of 4.39. The model stood out particularly in descriptive tasks (3.60), where all other models scored markedly lower, while maintaining strong performance in numeric tasks (4.25) and generative tasks (6.24). Mistral and Llama3 follow closely behind with very similar scores of 4.07 and 4.00, respectively. Phi-3 comes next at 3.84, showing balanced capabilities across question types.
Both Mistral and Llama3 performed notably better in generative tasks (Mistral: 6.36 and Llama3: 6.26) compared to descriptive ones (Mistral: 2.25 and Llama3: 2.28). Phi-3 showed particular strength in generative tasks (6.09) and comparative weakness in descriptive ones (2.61).
Notably, Gemma2-9B (3.70 points) shows significant improvement compared to its predecessor Gemma-7B (3.02 points). According to the technical reports,35,37 these gains can be attributed to several key architectural enhancements: First, Gemma2-9B adopts a deeper architecture with 42 transformer layers compared to Gemma-7B's 28 layers, along with an increased model dimension (d_model: 3584 vs. 3072). Second, it introduces novel components including interleaving local-global attentions (with a 4096-token local window and 8192-token global span) and the group-query attention (GQA) mechanism with num_groups = 2. Third, Gemma2 models employ knowledge distillation for training instead of traditional next-token prediction, learning from a larger teacher model on 8 trillion tokens. However, both models still face common challenges in keyword relevance, BLEU score, and ROUGE scores (<0.2), suggesting that while architectural and training advances boost overall capabilities, some fundamental limitations in text generation quality and precision remain.
The iteration from Mistral 7B to Mistral NeMo demonstrates significant architectural advances, scaling up from 7B to 12B parameters while introducing innovations such as the Tekken tokenizer for improved multilingual handling and expanding the context length to 128k tokens. These improvements enhance the model's capabilities across reasoning, instruction following, and multilingual tasks.52,53
We observe that Phi-3-medium (14B parameters), despite its larger capacity with 40 attention heads and 40 layers (embedding dimension 5120), shows more modest improvements on certain benchmarks compared to Phi-3-mini (3.8B parameters, 32 heads, 32 layers, and embedding dimension 3072). This suggests that our current data mixture, while effective for the smaller model architecture, may need further optimization to fully leverage the increased representational capacity of the 14B parameter scale.54
While Llama 3.1 8B incorporated multilingual capabilities and extended the context length to 128k tokens, it scored lower than Llama 3 8B, which utilized a comprehensive post-training approach combining supervised fine-tuning, rejection sampling, PPO, and DPO. This suggests that the diversity of fine-tuning strategies may play a more crucial role in model performance than expanded linguistic coverage and context length at the 8B parameter scale.36,55,56
Based on the comprehensive evaluation data, the analysis reveals a clear hierarchy in model performance, with Mistral NeMo leading at an average score of 4.39, followed by Mistral (4.07) and Llama3 (4.00). The models demonstrate distinct strengths across different question types, with generative tasks yielding the highest performance scores ranging from 6.2 to 6.4 for top performers. In numeric tasks, models showed moderate capability with scores between 4.02 and 4.25 for the leading models, while descriptive tasks proved most challenging with significantly lower scores, though Mistral NeMo maintained a notable advantage at 3.60 compared to others ranging from 1.97 to 3.01. Looking at specific evaluation criteria, most models exhibited strong creativity (above 0.77) and coherence, with Mistral NeMo particularly excelling in coherence at 0.962. However, all models struggled with keyword relevance, with scores varying across models but generally remaining low. The correlation analysis indicates that numeric accuracy operates largely independently from other metrics, while keyword relevance shows a moderate negative correlation with conciseness (−0.29). These findings suggest that while current models excel at creative and generative tasks, there remains significant room for improvement in precise information extraction and keyword relevance, particularly in descriptive tasks. The substantial variation in performance across different question types also indicates that optimal model selection should be task-dependent rather than assuming that one model will excel universally.
Research findings reveal the significant impact of model iterations on performance improvement, particularly evident in the evolution from Gemma-7B to Gemma2-9B35 and from Mistral-7B to Mistral-Nemo. However, the iteration from Llama3-8B to Llama3.1-8B failed to achieve the expected performance leap, possibly due to different iteration priorities.36 Notably, all tested models face common challenges, especially in keyword relevance and task scoring, highlighting the necessity of introducing additional technologies to address these shortcomings.
Nevertheless, the outstanding performance of these models in creative and generative tasks continues to demonstrate the inherent advantages of large language models in these domains. The test results indicate that fine-tuned large language models can meet researchers' needs to some extent, but still have many limitations, including the inability to update data in real-time, lack of online search capabilities, poor compatibility with specific domains, insufficient response accuracy, and limitations in decision-making for single large models.
Given these limitations exhibited by fine-tuned large language models, this study developed an artificial intelligence assistant for the chemical domain. This system cleverly integrates multi-agent architecture, Retrieval-Augmented Generation (RAG) technology, online search functionality, and a user-friendly interactive interface, aiming to comprehensively address the aforementioned shortcomings and provide researchers with a more intelligent, precise, and practical auxiliary tool.
AutoGen is an open-source framework for building LLM applications through multi-agent dialogue. In AutoGen, a conversable agent is an entity with a specific role that can send and receive messages to and from other conversable agents, such as starting or continuing a conversation. It maintains its internal context based on the messages sent and received and can be configured to have a range of functionalities, such as being supported by LLMs, tools, or human input. These agents can be implemented through AutoGen's built-in AssistantAgent (powered by GPT-4 for general problem-solving) and UserProxyAgent (configured to gather human input and execute tools).20
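A minimal two-agent example in the AutoGen style described above might look as follows; the model name, API key placeholder, and prompt are assumptions for illustration, not the system's actual configuration.

```python
# Minimal AutoGen sketch with an AssistantAgent and a UserProxyAgent.
import autogen

llm_config = {"config_list": [{"model": "gpt-4", "api_key": "YOUR_API_KEY"}]}

assistant = autogen.AssistantAgent(
    name="chemistry_assistant",
    llm_config=llm_config,
    system_message="You are an assistant for chemistry questions.",
)

user_proxy = autogen.UserProxyAgent(
    name="user_proxy",
    human_input_mode="NEVER",            # set to "ALWAYS" to gather human input interactively
    code_execution_config={"work_dir": "workspace", "use_docker": False},
)

# The user proxy starts the conversation; the assistant replies (and may return code
# that the proxy executes) until a termination condition is met.
user_proxy.initiate_chat(
    assistant,
    message="Suggest a one-step synthesis route from benzene to nitrobenzene.",
)
```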
(This research can utilize models that have undergone fine-tuning and comprehensive performance testing as the system's response model. All fine-tuned large language models have been uploaded to the Hugging Face platform, allowing researchers to flexibly invoke different models from KANGYONGMA/Chemistry based on specific application scenarios. Additionally, the system supports the use of original base models without fine-tuning to execute tasks, providing greater flexibility and diverse options for research).
The above examples demonstrate the capability of a Retrieval-Augmented Generation (RAG) based intelligent agent system in accurately answering questions about the water solubility of chemical compounds. The system precisely reported the room-temperature water solubility of two compounds: CCC(O)(CC)CC at 0.14125375446227545 mol L⁻¹, and a compound with a complex InChI representation at 2.0989398836235246 × 10⁻⁵ mol L⁻¹. This high-precision response highlights the advantages of RAG technology in cheminformatics applications, especially in tasks requiring precise numerical outputs, where it outperforms traditional fine-tuned large language model approaches. The application of RAG functionality enables the system to retrieve and provide accurate numerical information. Fig. 12 illustrates the Q&A test results based on the RAG.
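The RAG flow can be illustrated with a deliberately simplified sketch: retrieve the most relevant knowledge-base entry for a query and inject it into the prompt sent to the language model. The TF-IDF retriever and the two-entry knowledge base below are stand-ins for the system's actual embedding store and chemistry knowledge base.

```python
# Highly simplified RAG sketch: TF-IDF retrieval over a tiny knowledge base, with the
# retrieved entry injected into the prompt (entries and retriever are illustrative).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

knowledge_base = [
    "CCC(O)(CC)CC has a water solubility of 0.14125375446227545 mol/L at room temperature.",
    "Ethanol (CCO) is miscible with water in all proportions.",
]

def retrieve(query: str, top_k: int = 1) -> list[str]:
    vec = TfidfVectorizer().fit(knowledge_base + [query])
    sims = cosine_similarity(vec.transform([query]), vec.transform(knowledge_base))[0]
    ranked = sorted(zip(sims, knowledge_base), reverse=True)
    return [doc for _, doc in ranked[:top_k]]

query = "What is the water solubility of CCC(O)(CC)CC at room temperature?"
context = "\n".join(retrieve(query))
prompt = f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"
# `prompt` would then be sent to the selected chat model (e.g. the fine-tuned Mistral NeMo).
```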
The project not only integrates advanced online search functionality but is also equipped with an intelligent summarization system that significantly enhances information retrieval capabilities. The project employs a multi-layered processing architecture that intelligently merges and refines web search results with knowledge base data to present users with precise and concise information summaries. Notably, the search results go beyond simple text summaries by incorporating interactive design elements. Specifically, key content within the summaries includes corresponding hyperlinks, allowing users to trace back to original information sources with just a click. This design enables researchers to conveniently access primary sources and quickly verify the accuracy of search content.
The framework consists of two primary classes: ChemistryAgent and ChemistryLab. The ChemistryAgent class maintains a developing knowledge repository and growing skill set through the knowledge_base and skills attributes, working to expand its capabilities via the learn() and acquire_skill() methods. A preliminary performance tracking system has been implemented through history-based assessment, with the evaluate_performance() method beginning to analyze effectiveness based on user feedback.
The refinement process, managed by the improve() and refine_skills() methods, represents early efforts toward developing new capabilities and refining existing ones. The system makes initial attempts to identify potential areas for enhancement by examining interaction patterns and user responses. At the group level, the ChemistryLab class introduces basic knowledge sharing among agents and implements foundational assessment cycles.
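Structurally, the two classes and the methods named above can be sketched as follows; the method bodies are simplified illustrations of the described behaviour, not the source code.

```python
# Structural sketch of the ChemistryAgent / ChemistryLab classes described above.
class ChemistryAgent:
    def __init__(self, name: str):
        self.name = name
        self.knowledge_base: dict[str, str] = {}
        self.skills: set[str] = set()
        self.history: list[dict] = []            # interaction records with user feedback

    def learn(self, topic: str, content: str) -> None:
        self.knowledge_base[topic] = content

    def acquire_skill(self, skill: str) -> None:
        self.skills.add(skill)

    def evaluate_performance(self) -> float:
        scores = [h["feedback"] for h in self.history if "feedback" in h]
        return sum(scores) / len(scores) if scores else 0.0

    def improve(self) -> None:
        if self.evaluate_performance() < 0.7:    # illustrative threshold
            self.refine_skills()

    def refine_skills(self) -> None:
        # Placeholder: re-prioritise or extend skills based on weak interaction patterns.
        pass


class ChemistryLab:
    def __init__(self, agents: list[ChemistryAgent]):
        self.agents = agents

    def share_knowledge(self) -> None:
        pooled = {}
        for agent in self.agents:
            pooled.update(agent.knowledge_base)
        for agent in self.agents:                # basic knowledge sharing among agents
            agent.knowledge_base.update(pooled)

    def assessment_cycle(self) -> dict[str, float]:
        return {a.name: a.evaluate_performance() for a in self.agents}
```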
This architecture takes preliminary steps toward enabling incremental adjustments based on interactions and feedback, aiming to gradually enhance its domain expertise and interaction quality in chemistry-related discussions. While the current design creates a basic responsive framework that shows potential for adapting to user needs, it acknowledges substantial room for improvement across all aspects. The user feedback interface, shown in Fig. 15, provides initial support for ongoing refinement of the system's developing capabilities.
Through basic mechanisms including knowledge base expansion, skill development, and feedback incorporation, agents work toward building their understanding of chemical concepts and problem-solving approaches. This measured approach to capability enhancement represents early progress while acknowledging the significant work still needed to achieve more sophisticated and comprehensive functionality. The system remains in its nascent stages, with considerable opportunities for advancement in areas such as response accuracy, contextual understanding, and adaptive learning mechanisms.
The system's architecture leverages the capabilities of large language models through a flexible model-calling mechanism that can integrate different advanced models as they become available. The implementation incorporates specialized chemistry domain functions, including molecular visualization and SMILES string processing, to address specific requirements in chemical research. Fig. 17 illustrates the structure of the AI agents within the chemistry system. The agents' prompts can be found in the ESI.†
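The chemistry-specific functions can be illustrated with RDKit, a standard cheminformatics toolkit; the function names below are illustrative, and the actual system's rendering pipeline may differ.

```python
# Sketch of SMILES handling and molecular visualization with RDKit (illustrative helpers).
from rdkit import Chem
from rdkit.Chem import Draw, Descriptors, rdMolDescriptors

def visualize_smiles(smiles: str, path: str = "molecule.png") -> bool:
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:                              # invalid SMILES string
        return False
    Draw.MolToFile(mol, path, size=(400, 400))
    return True

def describe_smiles(smiles: str) -> dict:
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return {}
    return {
        "canonical_smiles": Chem.MolToSmiles(mol),
        "molecular_weight": Descriptors.MolWt(mol),
        "num_rings": rdMolDescriptors.CalcNumRings(mol),
    }

print(describe_smiles("CCC(O)(CC)CC"))           # the compound from the RAG example above
```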
The system's effectiveness stems from its integrated design combining language model capabilities with domain-specific chemical tools. Through structured knowledge organization, targeted skill implementation, performance monitoring, and coordinated agent interactions, it aims to provide reliable support for chemical research tasks. This modular approach allows for systematic updates and refinements as underlying technologies advance, helping to maintain consistent and efficient assistance for complex chemical problems.
Footnote
† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d4dd00398e
This journal is © The Royal Society of Chemistry 2025