Bing Yanᵃ, Angelica Chenᵇ and Kyunghyun Cho*ᵃᵇ
aDepartment of Computer Science, New York University, 60 5th Avenue, New York, NY 10011, USA. E-mail: kyunghyun.cho@nyu.edu
bCenter for Data Science, New York University, 60 5th Avenue, New York, NY 10011, USA
First published on 8th August 2025
Large language models (LLMs) have demonstrated remarkable capabilities in chemistry, yet their ability to capture intrinsic chemistry remains uncertain. Within any familiar, chemically equivalent representation family, rigorous chemical reasoning should be representation-invariant, yielding consistent predictions across these representations. Here, we introduce the first systematic benchmark to evaluate the consistency of LLMs across key chemistry tasks. We curated the benchmark using paired representations of SMILES strings and IUPAC names. We find that state-of-the-art general-purpose LLMs exhibit strikingly low consistency rates (≤1%). Even after finetuning on our dataset, the models still generate inconsistent predictions. To address this, we incorporate a sequence-level symmetric Kullback–Leibler (KL) divergence loss as a consistency regularizer. While this intervention improves surface-level consistency, it fails to enhance accuracy, suggesting that consistency and accuracy are orthogonal properties. These findings indicate that both consistency and accuracy must be considered to properly assess LLMs' capabilities in scientific reasoning.
In principle, rigorous chemical reasoning should be independent of how a molecule is represented. A knowledgeable chemist, or an AI model with true chemical understanding, should draw the same conclusions about a molecule whether given its 2D graph, SMILES string, or IUPAC name. In other words, the representation should not influence the reasoning process or the outcome. This expectation aligns with the broader principle of self-consistency in AI models, which requires responses to remain invariant under semantics-preserving transformations of the input.9
However, if a model's reasoning depends on the chosen representation, logically equivalent inputs may yield different outcomes. This issue has been documented in natural language processing, where LLMs often produce contradictory responses when the same question is phrased in different ways or when the context is reworded. For instance, GPT-3 and GPT-4 exhibit poor self-consistency on multi-step reasoning tasks, giving different answers to re-framed but logically equivalent queries.9
A similar phenomenon has been observed in computer vision: image classifiers can learn superficial cues, such as texture, rather than capturing the true shape of an object. As a result, a trivial change in surface pattern can lead to entirely different predictions for the same underlying object.10 These examples from language and vision highlight a broader failure mode: when reasoning hinges on how information is presented instead of its intrinsic meaning, the model's reliability is compromised.
Despite the growing use of LLMs in chemistry, their consistency across different molecular representations has not been systematically evaluated. To address this gap, we introduce a benchmark to assess whether LLMs exhibit representation-invariant reasoning. We curated a paired dataset of molecules with both SMILES and IUPAC representations, spanning multiple chemistry tasks, including forward reaction prediction, retrosynthesis, and molecular property prediction. By evaluating LLMs on each task using both input formats, we can compute a consistency rate—the percentage of cases where the model produces identical predictions for SMILES and IUPAC representations. Our results show that state-of-the-art general-purpose LLMs exhibit a low consistency rate (≤1%). Even after finetuning on our paired dataset, the models remain inconsistent, suggesting that they rely more on superficial text patterns than on the underlying chemistry.
Can this inconsistency be easily remedied? To explore this, we investigated whether a simple training intervention could enforce representation-invariant behavior. Specifically, we introduced a sequence-level symmetric Kullback–Leibler (KL) divergence loss as a consistency regularizer. This approach penalizes the model when its output distributions differ for the same molecule presented in different formats. While this regularization strategy led to mild improvements in consistency, the gains were limited – models still frequently produced diverging predictions depending on the input format. Furthermore, this intervention did not improve accuracy. The models became more likely to generate the same prediction for a given molecule, regardless of representation, but not necessarily the correct one. This suggests that consistency and accuracy are orthogonal properties, and that we must consider both to assess LLMs' capabilities in capturing intrinsic chemistry.
The persistence of inconsistency indicates a deeper, systematic issue in how LLMs learn chemistry that cannot be easily fixed with finetuning alone. Addressing this challenge will likely require fundamental advances. More broadly, our findings highlight a key requirement for AI-driven scientific reasoning: models should respect the domain's natural invariances to be reliable. By rigorously benchmarking this consistency gap, we take a step toward developing more trustworthy AI systems that reason based on substance rather than surface patterns.
LLMs predict the output distribution Pθ(y|x), where θ denotes model parameters. The input molecules can be encoded in different formats (e.g., SMILES, IUPAC names), leading to different output distributions, Pθ(y|xS) for SMILES and Qθ(y|xI) for IUPAC. We evaluate consistency by comparing these distributions to assess whether models capture the intrinsic chemistry underlying symbolic representations.
Consistency measures how often a model produces identical predictions for the same molecule when presented in different formats (SMILES vs. IUPAC). For forward reaction prediction and retrosynthesis: a prediction is considered consistent if the outputs match for both input representations. For binary property prediction: consistency is measured as the proportion of cases where the classification outcome remains the same. For numeric property prediction: consistency is measured using the mean squared error (MSE) between predictions from SMILES and IUPAC inputs.
To distinguish cross-representation alignment from chance-level agreement, we report adjusted consistency, defined as the observed consistency minus a random-consistency baseline. For forward reaction prediction, retrosynthesis, and binary property prediction, the baseline is the expected match rate between two independent random predictions. For numeric property prediction, we subtract the expected MSE between two random predictions. Unless otherwise noted, all reported consistency values are adjusted.
Accuracy evaluates how closely model predictions align with the ground truth. For forward reaction prediction and retrosynthesis: accuracy is the percentage of exact matches between the predicted and target outputs in each format. For binary property prediction: accuracy is the percentage of correct classifications. For numeric property prediction: accuracy is measured as the MSE between predicted and ground truth.
Formal definitions and equations for both metrics are provided in Appendix A.1.
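For concreteness, below is a minimal Python sketch of these metrics for the generation and classification tasks (our illustration, not the paper's released code); it assumes predictions and targets are canonicalized strings, and that the empirical label distribution p(y) for the chance baseline is estimated from the ground-truth targets.

```python
# Sketch of the consistency metrics, assuming canonicalized string predictions.
from collections import Counter

def consistency(preds_s, preds_i):
    """Fraction of entries whose predictions match across the two input formats."""
    return sum(a == b for a, b in zip(preds_s, preds_i)) / len(preds_s)

def random_consistency(targets):
    """Expected match rate of two independent draws from the label distribution."""
    n = len(targets)
    return sum((c / n) ** 2 for c in Counter(targets).values())

def adjusted_consistency(preds_s, preds_i, targets):
    """Observed consistency minus the chance-level baseline."""
    return consistency(preds_s, preds_i) - random_consistency(targets)

def false_consistency(preds_s, preds_i, targets):
    """Consistency restricted to entries that are wrong under both input formats."""
    wrong = [(a, b) for a, b, y in zip(preds_s, preds_i, targets)
             if a != y and b != y]
    return sum(a == b for a, b in wrong) / len(wrong) if wrong else 0.0
```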
We provided explicit instructions tailored to the input and output molecular representations. For instance, when both the input and output were in SMILES format, the instruction read: “Based on the SMILES strings of reactants and reagents, predict the SMILES string of the product. Please output the product directly.”
For forward reaction prediction and retrosynthesis, models were trained to generate either SMILES or IUPAC outputs with equal probability, indicated by a flag (“S” for SMILES, “I” for IUPAC). All models were optimized using cross-entropy loss.
We further examined the effect of model size by training four GPT-2 variants (124M, 355M, 774M, and 1.5B parameters). To estimate variability, we ran experiments with different random seeds. The training hyperparameters and implementation details are provided in Appendix B.1.1 and B.1.2.
We consider both directions of the KL divergence, DKL(P∥Q) and DKL(Q∥P):
$$D_{\mathrm{KL}}(P\|Q) = \mathbb{E}_{y\sim P_\theta(\cdot|x_S)}\!\left[\log\frac{P_\theta(y|x_S)}{Q_\theta(y|x_I)}\right],\qquad D_{\mathrm{KL}}(Q\|P) = \mathbb{E}_{y\sim Q_\theta(\cdot|x_I)}\!\left[\log\frac{Q_\theta(y|x_I)}{P_\theta(y|x_S)}\right] \quad (1)$$
However, the sequence-level KL divergence is computationally intractable. Therefore, we estimate it using a Monte Carlo sampling method. Details of the KL divergence loss are provided in Appendix C.1.
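A minimal PyTorch sketch of this regularizer follows. It is our illustration rather than the released implementation, and it assumes a Hugging Face-style causal LM (a model whose forward pass exposes `.logits` and whose `generate` returns the prompt followed by sampled tokens); `x_smiles` and `x_iupac` are placeholder names for the two tokenized views of the same example.

```python
# Minimal sketch of the sequence-level symmetric KL regularizer, assuming a
# Hugging Face-style causal LM (model(ids).logits, model.generate(...)).
import torch
import torch.nn.functional as F

def sequence_logprob(model, prompt_ids, seq_ids):
    """Sum of token log-probabilities of seq_ids as a continuation of prompt_ids."""
    ids = torch.cat([prompt_ids, seq_ids], dim=-1)
    logits = model(ids).logits[:, prompt_ids.size(-1) - 1 : -1, :]
    return (F.log_softmax(logits, dim=-1)
            .gather(-1, seq_ids.unsqueeze(-1)).squeeze(-1).sum(-1))

def symmetric_kl_loss(model, x_smiles, x_iupac, num_samples=4, max_new_tokens=64):
    """Monte Carlo estimate of D_KL(P||Q) + D_KL(Q||P) over sampled outputs."""
    loss = 0.0
    for src, other in ((x_smiles, x_iupac), (x_iupac, x_smiles)):
        for _ in range(num_samples):
            with torch.no_grad():  # sampled sequences are treated as fixed
                out = model.generate(src, do_sample=True,
                                     max_new_tokens=max_new_tokens)
            y = out[:, src.size(-1):]
            logp = sequence_logprob(model, src, y)    # view the sample came from
            logq = sequence_logprob(model, other, y)  # the other representation
            # Score-function surrogate: the detached weight carries the
            # REINFORCE term; -logq carries the direct gradient (Appendix C.1).
            loss = loss + ((logp - logq).detach() * logp - logq).mean()
    return loss / num_samples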
We also examined whether translation pretraining improves downstream performance. Specifically, we first trained a GPT-2 small model on a SMILES ↔ IUPAC translation dataset, then finetuned it on the forward reaction prediction task, with and without the addition of KL divergence loss.
The original “property prediction” and “chemical reaction” subsets use SMILES representations. We translated SMILES into IUPAC to construct one-to-one mapped input datasets. For each molecule, we first used PubChemPy,18 a Python wrapper for the PubChem PUG REST API, to retrieve its IUPAC name. If no IUPAC name was found, we used STOUT,19 an open-source model that translates SMILES into IUPAC. We validated the translations using py2opsin, a Python wrapper for OPSIN.20
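A hedged sketch of this pipeline is given below; the calls follow the public APIs of PubChemPy, STOUT, and py2opsin, but the paper's exact scripts may differ.

```python
# Hedged sketch of the SMILES -> IUPAC pipeline; calls follow the public APIs
# of PubChemPy, STOUT, and py2opsin, but the exact scripts may differ.
import pubchempy as pcp          # Python wrapper for the PubChem PUG REST API
from py2opsin import py2opsin    # Python wrapper for OPSIN

def smiles_to_iupac(smiles):
    # 1) Look the molecule up in PubChem first.
    hits = pcp.get_compounds(smiles, namespace="smiles")
    if hits and hits[0].iupac_name:
        return hits[0].iupac_name
    # 2) Fall back to the STOUT neural translator if PubChem has no name.
    from STOUT import translate_forward  # heavy optional dependency
    return translate_forward(smiles)

def name_is_valid(iupac_name):
    # 3) Validate: OPSIN must parse the name back into a structure.
    return bool(py2opsin(iupac_name, output_format="SMILES"))
```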
The training datasets for the forward reaction prediction and retrosynthesis both consist of 1M entries. For most models, we used an 80k subset for finetuning. To evaluate the impact of dataset size, we trained a GPT-2 model on the full dataset. We filtered the “name conversion” dataset by removing examples with more than one molecule. The statistics of all datasets are listed in Appendix Table 4.
First, across all models, the adjusted consistency scores ranged from 0% to 1%, revealing a poor alignment between SMILES and IUPAC representations. This result indicates that LLMs struggle to maintain consistent outputs when given different input representations.
Second, LLMs without instruction tuning achieved higher accuracy for IUPAC inputs. This discrepancy is likely due to the training data distribution, which tends to include more examples using IUPAC,21–23 providing the models with a familiarity advantage for this representation.
Third, models optimized for reasoning, such as o1-preview, demonstrated improved accuracy, but the increase in accuracy did not lead to a comparable increase in consistency. This observation suggests that accuracy and consistency are orthogonal metrics. We explore this orthogonality further in the discussion.
Finally, the instruction-tuned model, LlaSMolMistral, achieved significantly higher accuracy with SMILES inputs, reflecting the impact of its SMILES-specific training. However, this tuning did not improve accuracy with IUPAC inputs, indicating a limited generalization between the two representations. This result highlights a key limitation of current LLMs—they fail to develop an intrinsic understanding of the chemical equivalence between different molecular representations.
We evaluated three architectures – GPT-2, Mistral 7B,24 and CodeT5 small25 – on three tasks: forward reaction prediction, retrosynthesis, and property prediction. For GPT-2, we further varied the model size (small, medium (M), large (L), and extra-large (XL)) to examine the impact of scaling. Additionally, we compared performance using two training data sizes: 80k and 1M examples. To assess the effects of pretraining, we also trained a GPT-2 model from random initialization.
We evaluated performance using two metrics: consistency and accuracy. We report both overall consistency and false consistency (cases where SMILES and IUPAC inputs produce the same incorrect prediction); the latter is critical for disentangling consistency from accuracy. Accuracy was measured separately for SMILES and IUPAC inputs. The results are presented in Fig. 3a, b, Tables 1 and 2. To provide context for our results, we compare the performance of our models with state-of-the-art LLMs (Table 5). Our finetuned GPT-2 model achieves accuracy comparable to existing benchmarks.
Table 1 Consistency and accuracy for binary property prediction, without and with KL divergence loss. All values in % (higher is better).

Properties | Models | Consist | Adj. consist | Acc. (S) | Acc. (I) | Consist (w/KL) | Adj. consist (w/KL) | Acc. (S) (w/KL) | Acc. (I) (w/KL)
---|---|---|---|---|---|---|---|---|---
BBBP | GPT-2 | 83.6 ± 1.1 | 26.9 ± 1.1 | 83.6 ± 1.7 | 81.0 ± 2.1 | 91.5 ± 1.8 | 34.8 ± 1.8 | 86.2 ± 0.9 | 82.0 ± 1.1
 | Mistral | 85.2 ± 6.8 | 28.5 ± 6.8 | 68.3 ± 5.8 | 76.7 ± 1.3 | 90.5 ± 1.1 | 33.8 ± 1.1 | 84.1 ± 4.3 | 78.8 ± 5.3
 | CodeT5 | 85.7 ± 2.0 | 29.0 ± 2.0 | 85.7 ± 0.3 | 85.2 ± 2.9 | 88.9 ± 2.4 | 32.2 ± 2.4 | 86.2 ± 1.5 | 82.5 ± 0.3
ClinTox | GPT-2 | 95.4 ± 1.9 | 9.5 ± 1.9 | 93.1 ± 0.4 | 91.6 ± 1.5 | 96.2 ± 2.0 | 10.3 ± 2.0 | 93.1 ± 1.2 | 92.4 ± 0.0
 | Mistral | 100 ± 4.8 | 14.1 ± 4.8 | 92.4 ± 0.0 | 92.4 ± 4.0 | 99.2 ± 0.4 | 13.3 ± 0.4 | 92.4 ± 0.0 | 91.6 ± 0.4
 | CodeT5 | 87.0 ± 2.0 | 1.1 ± 2.0 | 89.3 ± 1.2 | 85.5 ± 3.1 | 94.7 ± 0.4 | 8.8 ± 0.4 | 91.6 ± 0.9 | 90.8 ± 1.2
HIV | GPT-2 | 97.3 ± 0.7 | 6.2 ± 0.7 | 95.3 ± 0.4 | 95.3 ± 0.3 | 98.3 ± 0.0 | 7.2 ± 0.0 | 96.3 ± 0.3 | 95.3 ± 0.2
 | Mistral | 99.7 ± 0.2 | 8.6 ± 0.2 | 95.7 ± 0.2 | 95.3 ± 0.0 | 99.7 ± 0.2 | 8.6 ± 0.2 | 95.3 ± 0.0 | 95.0 ± 0.2
 | CodeT5 | 96.7 ± 0.5 | 5.6 ± 0.5 | 96.0 ± 0.5 | 96.0 ± 0.2 | 97.3 ± 1.1 | 6.2 ± 1.1 | 95.7 ± 0.2 | 96.3 ± 0.2
SIDER | GPT-2 | 61.3 ± 1.2 | 6.2 ± 1.2 | 55.7 ± 1.2 | 62.0 ± 2.5 | 77.7 ± 3.8 | 22.6 ± 3.8 | 55.7 ± 0.3 | 65.7 ± 0.3
 | Mistral | 98.3 ± 0.8 | 43.2 ± 0.8 | 65.0 ± 3.5 | 66.0 ± 0.2 | 96.7 ± 1.3 | 41.6 ± 1.3 | 64.7 ± 3.6 | 63.3 ± 1.5
 | CodeT5 | 71.3 ± 4.3 | 16.2 ± 4.3 | 60.7 ± 2.8 | 60.7 ± 1.0 | 76.7 ± 5.9 | 21.6 ± 5.9 | 62.3 ± 1.3 | 61.7 ± 1.2
Table 2 Consistency and accuracy for numeric property prediction, without and with KL divergence loss. Values are MSE.

Properties | Models | Consist↓ | Adj. consist↑ | Acc. (S)↓ | Acc. (I)↓ | Consist↓ (w/KL) | Adj. consist↑ (w/KL) | Acc. (S)↓ (w/KL) | Acc. (I)↓ (w/KL)
---|---|---|---|---|---|---|---|---|---
ESOL | GPT-2 | 4.3 ± 0.5 | 5.1 ± 0.5 | 1.5 ± 0.1 | 3.3 ± 0.6 | 2.7 ± 0.3 | 6.7 ± 0.3 | 1.6 ± 0.3 | 3.1 ± 0.1
 | Mistral | 4.9 ± 0.5 | 4.5 ± 0.5 | 1.7 ± 0.8 | 4.5 ± 0.6 | 2.1 ± 0.2 | 7.3 ± 0.2 | 1.3 ± 0.3 | 2.9 ± 0.4
 | CodeT5 | 5.9 ± 0.5 | 3.5 ± 0.5 | 0.9 ± 0.2 | 5.4 ± 0.4 | 3.1 ± 0.7 | 6.3 ± 0.7 | 1.8 ± 0.3 | 3.6 ± 0.2
LIPO | GPT-2 | 1.1 ± 0.1 | 1.5 ± 0.1 | 1.2 ± 0.0 | 1.2 ± 0.0 | 0.7 ± 0.0 | 1.9 ± 0.0 | 1.0 ± 0.1 | 1.0 ± 0.0
 | Mistral | 0.9 ± 0.2 | 1.7 ± 0.2 | 1.5 ± 0.2 | 1.2 ± 0.0 | 0.5 ± 0.1 | 2.1 ± 0.1 | 1.2 ± 0.0 | 1.1 ± 0.0
 | CodeT5 | 1.0 ± 0.2 | 1.6 ± 0.2 | 1.0 ± 0.0 | 0.9 ± 0.1 | 1.0 ± 0.0 | 1.6 ± 0.0 | 1.1 ± 0.0 | 1.0 ± 0.1
For property prediction, however, the results vary across models and tasks. The mixed results indicate that while certain architectures, such as the encoder–decoder framework of CodeT5, may excel at capturing structural patterns, decoder-only models, such as GPT-2 and Mistral, may generalize better for less complex tasks.26
The results show that both KL regularization and translation pretraining enhance surface-level consistency across representations, but do not improve the model's intrinsic chemical reasoning.
(1) Complicated reactions: we group reactions that require a good understanding of chemistry and substantial manipulation of symbolic representations as “complicated reactions”. For instance, hydroquinone oxidation by cerium(IV) ammonium nitrate requires recognizing the hydroquinone structure and the oxidant. In addition, the product's SMILES string differs from the reactant's SMILES string in multiple positions (Scheme 1, entry 1). More than half of the reactions (24/46) fall into this category.
These reactions span five types: redox, coupling, cyclization, addition, and condensation. The distribution is shown in Fig. 4. Additional examples are listed in Schemes 1–6.
(2) Position inconsistency: the second-largest group consists of reactions whose predicted products are inconsistent in reaction sites or the positions of functional groups between SMILES and IUPAC inputs (Schemes 1 and 7).
(3) Reaction type inconsistency: SMILES and IUPAC inputs lead to predicted products from different reaction types (Schemes 1 and 8).
(4) Reaction step inconsistency: SMILES and IUPAC inputs result in predicted products involving different numbers of reaction steps (Schemes 1 and 9).
(5) Minor inconsistency: reactions with minor errors in either SMILES or IUPAC representations, such as mislabeling a nitrogen atom as carbon (Schemes 1 and 10).
The reverse transition – from consistent to inconsistent predictions – follows a similar pattern. Out of 300 reactions, 6 reactions became inconsistent with KL divergence loss: three complicated reactions and three position inconsistencies (Schemes 11 and 12).
For complicated reactions, models often make inconsistent and incorrect predictions without KL divergence loss. With KL divergence loss, the predictions become consistent but still incorrect. In contrast, for reactions where the model makes correct predictions in one representation but minor mistakes in the other, KL divergence loss aligns predictions and enables correct outputs for both representations.
The results suggest that KL divergence loss effectively addresses surface-level inconsistencies, but it falls short of achieving both accuracy and consistency. Advanced techniques will be required to capture the deeper intrinsic chemistry and achieve the ultimate goal of accurate and consistent predictions across representations.
We plotted consistency versus accuracy for models finetuned with and without KL divergence loss (Fig. 5). In both cases, there was minimal correlation between false consistency and accuracy, suggesting their orthogonality. Linear regression of the data yielded slopes of −0.29 and 0.08 for the results with and without KL divergence loss, respectively, further demonstrating that improvements in accuracy do not directly lead to better consistency. These findings highlight the need for strategies that enhance both metrics independently.
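As an illustration of this analysis (with dummy numbers, not the paper's data points), the slope can be obtained by an ordinary least-squares fit:

```python
# Illustrative OLS fit (dummy numbers, not the paper's data points): a slope
# near zero indicates accuracy gains do not translate into consistency gains.
import numpy as np

accuracy = np.array([0.45, 0.52, 0.58, 0.60, 0.63])
false_consistency = np.array([0.02, 0.01, 0.03, 0.02, 0.02])
slope, intercept = np.polyfit(accuracy, false_consistency, deg=1)
print(f"slope = {slope:.2f}")
```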
These findings underscore the limitations of current LLM architectures and the pressing need for more advanced models capable of scientific understanding and reasoning. In particular, we find it necessary for such an advanced model to readily incorporate prior knowledge of target domains, such as chemistry in this case, similarly to graph neural networks and other geometric deep learning approaches.27 Such advances are crucial for achieving both accurate and consistent predictions in chemistry tasks.
(1) Forward reaction prediction and retrosynthesis: for a given input format, the model is tested on generating outputs in either SMILES or IUPAC representation. For SMILES input ($x_S$), the model generates a SMILES output ($\hat{y}^{S}_{x_S}$) or an IUPAC output ($\hat{y}^{I}_{x_S}$); for IUPAC input ($x_I$), it generates a SMILES output ($\hat{y}^{S}_{x_I}$) or an IUPAC output ($\hat{y}^{I}_{x_I}$).
The outputs from different input representations “match” if identical:
$$\mathrm{match}_i = \mathbb{1}\!\left[\hat{y}^{S}_{x_S,i} = \hat{y}^{S}_{x_I,i}\right]\cdot\mathbb{1}\!\left[\hat{y}^{I}_{x_S,i} = \hat{y}^{I}_{x_I,i}\right] \quad (2)$$

$$\mathrm{Consist(overall)} = \frac{1}{N}\sum_{i=1}^{N}\mathrm{match}_i \times 100\% \quad (3)$$
We also compute the false consistency, defined as the consistency over entries whose predictions are incorrect for both SMILES and IUPAC inputs. For the M such entries:
$$\mathrm{Consist(false)} = \frac{1}{M}\sum_{i=1}^{M}\mathrm{match}_i \times 100\% \quad (4)$$
We compute adjusted consistency to measure consistency beyond chance. Let p(y) be the empirical label distribution. Then the expected chance-level consistency is:
$$\mathrm{Consist(rand)} = \sum_{y} p(y)^2 \times 100\% \quad (5)$$
The adjusted consistency is then:
$$\mathrm{Consist(adj)} = \mathrm{Consist(overall)} - \mathrm{Consist(rand)} \quad (6)$$
(2) Binary property prediction: the predictions are denoted as ŷxS and ŷxI for SMILES and IUPAC inputs, respectively. The consistency for a dataset with N entries is:
$$\mathrm{Consist(binary)} = \frac{1}{N}\sum_{i=1}^{N}\mathbb{1}\!\left[\hat{y}_{x_S,i} = \hat{y}_{x_I,i}\right] \times 100\% \quad (7)$$
The expected random agreement baseline is:
$$\mathrm{Consist(rand)} = p(0)^2 + p(1)^2 \quad (8)$$

$$\mathrm{Consist(adj)} = \mathrm{Consist(binary)} - \mathrm{Consist(rand)} \quad (9)$$
(3) Numeric property prediction: consistency is measured as the mean squared error (MSE) between the predictions from SMILES and IUPAC inputs:
$$\mathrm{Consist(numeric)} = \frac{1}{N}\sum_{i=1}^{N}\left(\hat{y}_{x_S,i} - \hat{y}_{x_I,i}\right)^2 \quad (10)$$
We define the random consistency baseline as:
$$\mathrm{Consist(rand)} = 2\,\mathrm{Var}(\hat{y}) \quad (11)$$

$$\mathrm{Consist(adj)} = \mathrm{Consist(rand)} - \mathrm{Consist(numeric)} \quad (12)$$
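A small sketch of eqs. (10)–(12) follows (our illustration); we assume Var(ŷ) in eq. (11) is estimated from the pooled predictions of both input formats.

```python
# Sketch of the numeric-task metrics; Var(y_hat) in eq. (11) is assumed to be
# estimated from the pooled predictions of both input formats.
import numpy as np

def numeric_consistency(pred_s, pred_i):
    """Eq. (10): MSE between SMILES- and IUPAC-input predictions (lower is better)."""
    pred_s, pred_i = np.asarray(pred_s), np.asarray(pred_i)
    return float(np.mean((pred_s - pred_i) ** 2))

def numeric_adjusted(pred_s, pred_i):
    """Eqs. (11)-(12): chance baseline minus observed MSE (higher is better)."""
    rand = 2.0 * float(np.var(np.concatenate([pred_s, pred_i])))
    return rand - numeric_consistency(pred_s, pred_i)
```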
(1) Forward reaction prediction and retrosynthesis: for SMILES input, accuracy is calculated as the percentage of exact matches between the predicted SMILES output ($\hat{y}^{S}_{x_S}$) and the target SMILES output ($y_S$); for IUPAC input, accuracy is calculated between the predicted IUPAC output ($\hat{y}^{I}_{x_I}$) and the target IUPAC output ($y_I$).

$$\mathrm{Acc}(S) = \frac{1}{N}\sum_{i=1}^{N}\mathbb{1}\!\left[\hat{y}^{S}_{x_S,i} = y_{S,i}\right]\times 100\%,\qquad \mathrm{Acc}(I) = \frac{1}{N}\sum_{i=1}^{N}\mathbb{1}\!\left[\hat{y}^{I}_{x_I,i} = y_{I,i}\right]\times 100\% \quad (13)$$
(2) Binary property prediction: accuracy is calculated as the percentage of predictions identical to the ground truth y.

$$\mathrm{Acc} = \frac{1}{N}\sum_{i=1}^{N}\mathbb{1}\!\left[\hat{y}_i = y_i\right]\times 100\% \quad (14)$$
(3) Numeric property prediction: accuracy is measured as the MSE between the predicted outputs and the ground truth values.
$$\mathrm{Acc(MSE)} = \frac{1}{N}\sum_{i=1}^{N}\left(\hat{y}_i - y_i\right)^2 \quad (15)$$
We train models using Nvidia A100 or H100 GPUs. We use one GPU for GPT-2 small, GPT-2 medium, GPT-2 large, and CodeT5 small models, and two GPUs for GPT-2 XL and Mistral 7B models.
Model | LR | BSZ | GPUs | Epochs | Time (h)
---|---|---|---|---|---
GPT-2 small | 1 × 10−4 | 32 | 1 | 20 | 2.28
GPT-2 medium | 1 × 10−4 | 16 | 1 | 20 | 6.24
GPT-2 large | 1 × 10−4 | 8 | 1 | 20 | 15.57
GPT-2 XL | 1 × 10−4 | 8 | 2 | 20 | 24.91
CodeT5 small | 1 × 10−4 | 32 | 1 | 20 | 2.57
Mistral 7B | 1 × 10−5 | 8 | 2 | 10 | 25.25
(1) Evaluation of state-of-the-art LLMs: we provide a simple instruction specifying the input and output representation in the query. The molecules are separated by a period (“.”). For example:
Input in SMILES: “Based on the SMILES strings of reactants and reagents, predict the SMILES string of the product. Please output the product directly.
〈SMILES〉 COc1ccc2c(c1)C(=O)c1ccccc1CC2.[BH4-].[OH-].[Na+].CCO 〈SMILES〉”
Target output in SMILES: “COc1ccc2c(c1)C(O)c1ccccc1CC2”
Input in IUPAC: “Based on the IUPAC names of reactants and reagents, predict the IUPAC name of the product. Please output the product directly.
〈IUPAC〉 5-methoxytricyclo[9.4.0.03,8]pentadeca-1(15),3(8),4,6,11,13-hexaen-2-one.boranuide.hydroxide.sodium(1+).ethanol 〈IUPAC〉”
Target output in IUPAC:
“5-methoxytricyclo[9.4.0.03,8]pentadeca-1(15),3(8),4,6,11,13-hexaen-2-ol”
(2) Finetuning of LLMs: we append a flag at the end of the input sequence to specify the output representation, “S” for SMILES and “I” for IUPAC. For example:
Input in SMILES expecting output in SMILES:
“COc1ccc2c(c1)C(=O)c1ccccc1CC2.[BH4-].[OH-].[Na+].CCO.S”
Target in SMILES: “COc1ccc2c(c1)C(O)c1ccccc1CC2”
Input in SMILES expecting output in IUPAC:
“COc1ccc2c(c1)C(=O)c1ccccc1CC2.[BH4-].[OH-].[Na+].CCO.I”
Target in IUPAC: “5-methoxytricyclo[9.4.0.03,8]pentadeca-1(15),3(8),4,6,11,13-hexaen-2-ol”
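A tiny helper illustrating this input convention (hypothetical function and field names; “.” follows the SMILES convention for separating species):

```python
# Hypothetical helper showing the flag convention; the trailing flag selects
# the output representation ("S" -> SMILES, "I" -> IUPAC).
def build_example(reactants: str, flag: str, target: str) -> dict:
    assert flag in ("S", "I")
    return {"input": f"{reactants}.{flag}", "target": target}

ex = build_example(
    "COc1ccc2c(c1)C(=O)c1ccccc1CC2.[BH4-].[OH-].[Na+].CCO", "S",
    "COc1ccc2c(c1)C(O)c1ccccc1CC2",
)
```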
The gradient of DKL(P∥Q) is (we simplify Pθ(y|xS) as Pθ(y), and Qθ(y|xI) as Qθ(y)):
$$\nabla_\theta D_{\mathrm{KL}}(P\|Q) = \nabla_\theta \sum_{y} P_\theta(y)\log\frac{P_\theta(y)}{Q_\theta(y)} = \sum_{y}\left[\log\frac{P_\theta(y)}{Q_\theta(y)}\,\nabla_\theta P_\theta(y) + P_\theta(y)\,\nabla_\theta\log P_\theta(y) - P_\theta(y)\,\nabla_\theta\log Q_\theta(y)\right] \quad (16)$$
Using the log-derivative trick ∇θPθ(y) = Pθ(y)∇θ log Pθ(y), and noting that Σy ∇θPθ(y) = ∇θ Σy Pθ(y) = 0:
$$\nabla_\theta D_{\mathrm{KL}}(P\|Q) = \mathbb{E}_{y\sim P_\theta}\!\left[\log\frac{P_\theta(y)}{Q_\theta(y)}\,\nabla_\theta \log P_\theta(y) - \nabla_\theta \log Q_\theta(y)\right] \quad (17)$$
Therefore, we can define a surrogate KL loss for DKL(P∥Q) whose gradient matches eq. (17), where sg(·) denotes the stop-gradient operator (no gradient flows through its argument):
$$\mathcal{L}_{\mathrm{KL}}(P\|Q) = \mathbb{E}_{y\sim P_\theta}\!\left[\mathrm{sg}\!\left(\log\frac{P_\theta(y)}{Q_\theta(y)}\right)\log P_\theta(y) - \log Q_\theta(y)\right] \quad (18)$$
However, this expectation is intractable to compute exactly, so we estimate it by Monte Carlo sampling: we draw M sequences {y₁, …, y_M} from Pθ(y) and score them under both Pθ and Qθ:

$$\mathcal{L}_{\mathrm{KL}}(P\|Q) \approx \frac{1}{M}\sum_{m=1}^{M}\left[\mathrm{sg}\!\left(\log\frac{P_\theta(y_m)}{Q_\theta(y_m)}\right)\log P_\theta(y_m) - \log Q_\theta(y_m)\right],\quad y_m \sim P_\theta(y) \quad (19)$$
Similarly, for the KL divergence of Qθ(y) from Pθ(y) (DKL(Q∥P)), we obtain the loss and its Monte Carlo estimate by sampling N sequences {y₁, …, y_N} from Qθ(y):

$$\mathcal{L}_{\mathrm{KL}}(Q\|P) \approx \frac{1}{N}\sum_{n=1}^{N}\left[\mathrm{sg}\!\left(\log\frac{Q_\theta(y_n)}{P_\theta(y_n)}\right)\log Q_\theta(y_n) - \log P_\theta(y_n)\right],\quad y_n \sim Q_\theta(y) \quad (20)$$
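As a sanity check of the estimator in eq. (19), the following toy script (our illustration, not from the paper) compares the surrogate's gradient against the exact KL gradient for small categorical distributions, realizing sg(·) with `.detach()`:

```python
# Toy check (our illustration): the Monte Carlo surrogate of eq. (19), with
# sg(.) realized as .detach(), recovers the exact gradient of D_KL(P||Q).
import torch

torch.manual_seed(0)
logits_p = torch.randn(5, requires_grad=True)
logits_q = torch.randn(5, requires_grad=True)

def exact_kl(lp, lq):
    p = torch.softmax(lp, -1)
    return (p * (torch.log_softmax(lp, -1) - torch.log_softmax(lq, -1))).sum()

exact_kl(logits_p, logits_q).backward()
g_exact = logits_p.grad.clone()
logits_p.grad = None

M = 500_000  # number of Monte Carlo samples
with torch.no_grad():
    y = torch.multinomial(torch.softmax(logits_p, -1), M, replacement=True)
logp = torch.log_softmax(logits_p, -1)[y]
logq = torch.log_softmax(logits_q, -1)[y]
((logp - logq).detach() * logp - logq).mean().backward()

print(torch.allclose(g_exact, logits_p.grad, atol=0.05))  # True, up to MC noise
```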
During training, we added a weight to the KL divergence loss. We screened values ranging from 0.001 to 10.0 and found that a weight of 1.0 gave the best consistency for all tasks and models.
Table 4 Dataset statistics (training-set sizes above 10k are rounded to the nearest thousand)

Task | #Train | #Valid | #Test
---|---|---|---
Forward prediction (full) | ≈963k | 1956 | 300
Forward prediction (subset) | ≈76k | 1956 | 300
Retrosynthesis (full) | ≈932k | 2004 | 300
Retrosynthesis (subset) | ≈76k | 2004 | 300
Property – BBBP | 1521 | 188 | 189
Property – ClinTox | 1063 | 127 | 131
Property – HIV | ≈32k | 4104 | 300
Property – SIDER | ≈21k | 2540 | 300
Property – ESOL | 888 | 111 | 112
Property – LIPO | 3358 | 385 | 300
SMILES ↔ IUPAC | ≈274k | 1397 | 300
Table 5 Accuracy comparison with state-of-the-art LLMs (% higher is better; RMSE lower is better)

Task | Ours (GPT-2) | Best (LlaSMolMistral) | Top 4 models averaged
---|---|---|---
Forward reaction prediction (%) | 57.7 | 63.3 | 53.9
Retrosynthesis (%) | 29.7 | 32.9 | 26.7
Property – BBBP (%) | 86.2 | 74.6 | 70.4
Property – ClinTox (%) | 93.1 | 93.1 | 92.9
Property – HIV (%) | 96.3 | 96.7 | 96.7
Property – SIDER (%) | 55.7 | 70.7 | 69.9
Property – ESOL (RMSE) | 1.150 | 1.036 | 2.215
Property – LIPO (RMSE) | 0.995 | 1.010 | 1.191
All values in % (higher is better); the right block uses the KL divergence loss.

Properties | Models | Consist | Adj. consist | Acc. (S) | Acc. (I) | Consist (w/KL) | Adj. consist (w/KL) | Acc. (S) (w/KL) | Acc. (I) (w/KL)
---|---|---|---|---|---|---|---|---|---
BBBP | GPT-2 | 83.1 ± 0.8 | 26.4 ± 0.8 | 81.5 ± 0.6 | 78.9 ± 1.3 | 92.1 ± 1.5 | 35.4 ± 1.5 | 82.5 ± 0.5 | 85.2 ± 1.4
ClinTox | GPT-2 | 99.2 ± 2.2 | 13.3 ± 2.2 | 92.4 ± 0.2 | 93.2 ± 1.3 | 100.0 ± 2.3 | 14.1 ± 2.3 | 92.4 ± 0.9 | 92.4 ± 0.1
HIV | GPT-2 | 97.7 ± 0.8 | 6.6 ± 0.8 | 94.3 ± 0.3 | 95.7 ± 0.3 | 99.3 ± 0.1 | 8.2 ± 0.1 | 94.7 ± 0.4 | 95.3 ± 0.1
SIDER | GPT-2 | 77.3 ± 1.5 | 22.2 ± 1.5 | 64.3 ± 1.1 | 57.7 ± 1.9 | 84.3 ± 3.2 | 29.2 ± 3.2 | 65.3 ± 0.5 | 62.0 ± 0.2
Values are MSE; the right block uses the KL divergence loss.

Properties | Model | Consist↓ | Adj. consist↑ | Acc. (S)↓ | Acc. (I)↓ | Consist↓ (w/KL) | Adj. consist↑ (w/KL) | Acc. (S)↓ (w/KL) | Acc. (I)↓ (w/KL)
---|---|---|---|---|---|---|---|---|---
ESOL | GPT-2 | 3.4 ± 0.1 | 6.0 ± 0.1 | 1.8 ± 0.1 | 2.8 ± 0.4 | 2.9 ± 0.3 | 6.5 ± 0.3 | 1.1 ± 0.1 | 3.6 ± 0.2
LIPO | GPT-2 | 1.6 ± 0.2 | 1.0 ± 0.2 | 1.3 ± 0.1 | 1.3 ± 0.1 | 0.7 ± 0.0 | 1.9 ± 0.0 | 1.4 ± 0.1 | 1.1 ± 0.1
Scheme 2 Complicated redox reactions that transition from inconsistent to consistent predictions after adding KL divergence loss.

Scheme 3 Complicated coupling reactions that transition from inconsistent to consistent predictions after adding KL divergence loss.

Scheme 4 Complicated cyclization reactions that transition from inconsistent to consistent predictions after adding KL divergence loss.

Scheme 5 Complicated addition reactions that transition from inconsistent to consistent predictions after adding KL divergence loss.

Scheme 6 Complicated condensation reactions that transition from inconsistent to consistent predictions after adding KL divergence loss.

Scheme 7 Position-inconsistent reactions that transition from inconsistent to consistent predictions after adding KL divergence loss.

Scheme 8 Reaction type-inconsistent reactions that transition from inconsistent to consistent predictions after adding KL divergence loss.

Scheme 9 Reaction step-inconsistent reactions that transition from inconsistent to consistent predictions after adding KL divergence loss.

Scheme 11 Complicated reactions that transition from consistent to inconsistent predictions after adding KL divergence loss.