Jiayun Pang*a and Ivan Vulić*b
aSchool of Science, Faculty of Engineering and Science, University of Greenwich, Medway Campus, Central Avenue, Chatham Maritime, ME4 3RL, UK. E-mail: j.pang@gre.ac.uk
bLanguage Technology Lab, University of Cambridge, 9 West Road, Cambridge CB3 9DA, UK. E-mail: iv250@cam.ac.uk
First published on 19th August 2024
Transformer-based encoder–decoder models have demonstrated impressive results in chemical reaction prediction tasks. However, these models typically rely on pretraining using tens of millions of unlabelled molecules, which can be time-consuming and GPU-intensive. One of the central questions we aim to answer in this work is: can FlanT5 and ByT5, encoder–decoder models pretrained solely on language data, be effectively specialised for organic reaction prediction through task-specific fine-tuning? We conduct a systematic empirical study of several key aspects of this process, including tokenisation, the impact of (SMILES-oriented) pretraining, fine-tuning sample efficiency, and decoding algorithms at inference. Our key findings indicate that, despite being pretrained only on language tasks, FlanT5 and ByT5 provide a solid foundation to fine-tune for reaction prediction, becoming ‘chemistry domain compatible’ in the process. This suggests that GPU-intensive and expensive pretraining on a large dataset of unlabelled molecules may be useful yet not essential for leveraging the power of language models for chemistry. All our models achieve comparable Top-1 and Top-5 accuracy, although some variation across models does exist. Notably, tokenisation and vocabulary trimming slightly affect final performance but can speed up training and inference; the most efficient greedy decoding strategy is very competitive, while only marginal gains can be achieved from more sophisticated decoding algorithms. In summary, we evaluate FlanT5 and ByT5 across several dimensions and benchmark their impact on organic reaction prediction, which may guide more effective use of these state-of-the-art language models for chemistry-related tasks in the future.
The original Transformer architecture, introduced in the seminal paper of Vaswani et al.,2 contains two main components: (1) the encoder and (2) the decoder. Subsequent NLP models build on one or both of these components. For example, the widely used BERT (Bidirectional Encoder Representations from Transformers) only has the encoder component, while OpenAI’s GPT models and other very recent Large Language Models (LLMs) such as the Llama family3 or Gemini4 only have the decoder component. The encoder–decoder framework is ideally suited for sequence-to-sequence (seq2seq) learning, often referred to as text-to-text processing in NLP. In this framework, the encoder captures the context of the input sequence and passes it to the decoder, which then generates the output sequence.
Several studies have adapted seq2seq models for chemical reaction prediction tasks. Notably, the pioneering Molecular Transformer model by Schwaller et al.5 and the recent T5Chem model6 have achieved impressive performance. T5Chem adapted Google’s T5 (“Text-To-Text Transfer Transformer”) NLP model7 to chemical data represented in the SMILES format. T5 closely aligns with the encoder–decoder structure used in the original Transformer model but introduces the “Text-to-Text” framework, which takes a text sequence (in a natural language such as English) as input and generates text as output. This allows the same model to handle a variety of tasks simultaneously. To perform a specific task, a task-specific prefix is added to the input sequence, tailoring the model’s output. T5Chem pretrained the T5 encoder–decoder architecture with 97m SMILES from PubChem molecules and then fine-tuned it on the USPTO_500_MT dataset, creating a multi-task reaction prediction model for five different types of reaction tasks. For example, it uses the task-specific prefix “Product:” for reaction product prediction, and the prefix “Reactants:” for single-step retrosynthesis. The advantage of multi-task learning is that multiple tasks can be learned simultaneously by leveraging similarities between them, offering improved data efficiency and fast learning without the need to predetermine a single specific prediction task.
T5Chem and a few other similar models8–10 have demonstrated the feasibility of applying a seq2seq framework to a variety of predictions in organic chemistry. However, several issues crucial to building more effective and accurate models remain unexplored. In the present study, we have trained multiple variants of two state-of-the-art seq2seq language models, namely the instruction-tuned FlanT5 (ref. 11) and the tokenisation-free byte-level ByT5 (ref. 12), for standard organic reaction prediction tasks. With these two model architectures, our aim is to conduct a systematic empirical study on the following aspects:
(1) Adapted and adequate input preprocessing and tokenisation that reaches beyond natural language towards molecular structure. Tokenisation is usually the first step in training an NLP model. It is the process of breaking a sequence into discrete elements, called ‘tokens’, which are then converted to vectors/embeddings for machine learning models. In NLP, most pretrained language models over the past few years rely on tokenisation performed at the sub-word level, as it handles frequent tokens effectively, can group sub-words, and has some ability to deal with unknown words. However, sub-word tokenisation is still limited in its ability to deal with variants in spelling (e.g. typos) and unknown characters (e.g. from other languages). Recent approaches, such as ByT5, a variant of the multilingual T5 model which disposes of subword-level tokenisation, have shown the viability of token-free models trained on characters in the form of their UTF-8 byte encodings. ByT5 uses a standard Transformer architecture but is ‘tokenisation-free’ as it does not rely on a learned vocabulary to map words or sub-word units to tokens.
In addition, NLP research has demonstrated that ByT5 is substantially more robust to noise in the data and performs better on tasks that are sensitive to spelling, grammar errors and ambiguous expressions, such as text on social media platforms. In terms of data noise, a chemical reaction dataset may be similar to text on social media. For example, USPTO (United States Patent Office), the largest open-source chemical reaction dataset, contains noise: in this context, data noise is defined as incomplete reaction entries with missing or incorrect reactants, reagents and products, and can be quite common in all chemical reaction datasets due to the nature of how chemists record reactions, i.e. focusing only on the main product and leaving unvaried reagents out when recording many similar reactions.13 Our work will thus implicitly assess whether ByT5’s advantage in dealing with noise in NLP data can be translated to better handling of noise in chemical reaction data.
(2) Training data efficiency, i.e., how much annotated data is required in fine-tuning to generalise to new sequences when working with chemical reaction datasets using seq2seq models.
(3) The use and impact of pretraining. T5-style language models are pretrained only on language data and/or language tasks and are therefore not “SMILES-aware”. T5Chem relies on self-supervised pretraining using 97m SMILES to learn the chemical space, which can be extremely GPU-intensive. Recently, language models pretrained on both language and chemical data have emerged. This cross-domain approach, adopted by models such as MolT5 (ref. 14) and nach0 (ref. 15), creates a shared representation space. We thus also assess whether such hybrid pretraining offers better initialisation points for task-specific fine-tuning for reaction prediction tasks.
(4) Various other important modelling aspects that can impact the final task performance, e.g., model size, and decoding algorithm at inference.
Fig. 1 Illustration of the key areas explored along the flow of pretraining, fine-tuning and inference in our work.
One of the central questions we aim to answer in this work is: can encoder–decoder models, originally pretrained only on language data and/or a variety of language tasks, be effectively specialised to organic reaction prediction tasks via task-specific fine-tuning? Will the model learn to encode and generate SMILES even though it was originally pretrained for encoding and generating natural language? To this end, our starting points are different flavours of the T5-style model family, all pretrained on language data:
• The original T5 model,7 pretrained on the CommonCrawl-based C4 corpus covering ∼356b word tokens, via the span-mask denoising objective;
• The FlanT5 model11,17 is an instruction-tuned language model that starts from the pretrained T5 model of the same size, and then ‘instruction-tunes’ it on supervised data of 1800+ NLP tasks (Flan stands for Finetuning Language models).‡ It typically exhibits better performance than the underlying T5 model across a range of NLP tasks. It can also be used in a standard text-to-text fashion with task-specific fine-tuning if an ‘empty’ instruction is provided to the model (i.e., only the input sequence without an additional task description is provided). This is how we use the model for single task-specific fine-tuning.
• The ByT5 model12 retains the same architecture, but disposes of standard subword-level tokenisation (see the next paragraph) and processes text as sequences of raw (UTF-8) bytes. Being originally designed to enhance multilingual NLP models, it was pretrained on the multilingual mC4 corpus spanning 101 diverse languages, again relying on the span-mask denoising objective.
All the models come in different sizes (in terms of model parameters), and due to high computational demands we mostly focus on benchmarking their Small and Base variants. The Small variant of T5 and FlanT5 comprises ∼60m parameters, while Base covers 220m parameters. ByT5 variants with the same label are not directly comparable to T5/FlanT5 as they contain a larger number of parameters: Small is 300m, and Base is 582m parameters.
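The quoted parameter counts can be verified directly from the public checkpoints. Below is a minimal sketch, assuming the transformers library and the google/flan-t5-small and google/byt5-small Hub checkpoints; the printed figures may differ slightly from the rounded numbers above.

```python
from transformers import AutoModelForSeq2SeqLM

# Count parameters of the Small variants; Base variants follow analogously.
for name in ["google/flan-t5-small", "google/byt5-small"]:
    model = AutoModelForSeq2SeqLM.from_pretrained(name)
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{name}: {n_params / 1e6:.1f}m parameters")
```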
Assuming the existence of task-specific training data for organic reaction prediction tasks, we also evaluate whether language-specific (and thus ‘chemical domain-incompatible’) pretraining is necessary at all, by also comparing to the same architectures of the same size which get randomly initialised and then fine-tuned for the task. We denote those variants of each model as random.
Furthermore, we also analyse whether continued pretraining in a self-supervised fashion on SMILES data offers any performance benefits before task-specific fine-tuning (denoted as cont). For the continued pretraining we rely on the standard masked language modelling objective adapted to SMILES. For each SMILES we sample 15% of its constituent tokens (where constituent tokens are based on the standard regular expression for SMILES from prior work,6 see also later) and then we do one of the following options: (a) replace the token with a special mask token (the ‘$’ character is used for the mask token) with the probability of 80%; or (b) replace the token with another random token from the same SMILES (10% probability); or (c) keep the token ‘as is’ (10% probability).
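A minimal sketch of this corruption scheme is shown below (illustrative only; the regular expression follows the commonly used SMILES tokenisation pattern from prior work,6 and the exact handling in our pipeline may differ in details).

```python
import random
import re

# Commonly used SMILES tokenisation pattern from prior seq2seq reaction work.
SMILES_REGEX = re.compile(
    r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|\/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)

def smiles_tokens(smiles: str) -> list:
    return SMILES_REGEX.findall(smiles)

def corrupt_smiles(smiles: str, mask_token: str = "$", mask_frac: float = 0.15) -> str:
    """Apply the 80/10/10 masked-language-modelling corruption over SMILES tokens."""
    tokens = smiles_tokens(smiles)
    original = list(tokens)
    n_masked = max(1, round(mask_frac * len(tokens)))
    for pos in random.sample(range(len(tokens)), n_masked):
        r = random.random()
        if r < 0.8:                 # 80%: replace with the mask token '$'
            tokens[pos] = mask_token
        elif r < 0.9:               # 10%: replace with a random token from the same SMILES
            tokens[pos] = random.choice(original)
        # remaining 10%: keep the token unchanged
    return "".join(tokens)

print(corrupt_smiles("ON=C1CCCc2ccccc21"))
```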
Finally, we test whether the ‘hybrid’ T5-style models of the same sizes pretrained in a multi-task fashion to handle both text and SMILES generation simultaneously, offer better initialisation points for single task-specific fine-tuning. For the latter, we select molT5 (ref. 14) and nach0 (ref. 15) (their Base variants spanning 220m parameters) as two representative recent models from this ‘SMILES-aware’ family of encoder–decoder models.
Further, the original vocabularies of FlanT5 (and T5) typically span 32 000 subwords, but most of these subwords correspond to natural language and can be safely discarded when processing SMILES, which has a much more restricted vocabulary. We thus trim the original vocabulary of FlanT5 by tokenising the large dataset of 116m SMILES sequences preprocessed via the +smiles strategy above, retaining only the subwords (and the corresponding embeddings) that occurred in those tokenised SMILES sequences.§ By doing this, we trimmed the vocabulary from the original 32k subwords to only 324 subwords, plus the three standard special tokens denoting padding, end-of-sequence and ‘unknown’. This trimming of the vocabulary and the corresponding embeddings has a double effect: (1) it speeds up training and inference as it effectively constrains the search space, and (2) it reduces model size from 60m to 44.4m parameters (FlanT5Small) or from 220m to 198.7m (FlanT5Base) without losing any modelling capability and expressiveness in organic reaction prediction tasks. We denote variants that rely on the vocabulary and embedding trimming step as +trim, and variants with the original vocabulary as +orig. Different model variants are then possible when this step is combined with input preprocessing strategies (+none+orig as the simplest variant without any interventions versus, e.g., +smi+orig or +smi+trim).
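The embedding side of this trimming step can be sketched as follows (illustrative only; smiles_corpus is a placeholder standing in for the preprocessed SMILES sequences, and a full implementation also rebuilds the underlying SentencePiece tokenizer and the LM head, which is omitted here).

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-small")

# Placeholder corpus; in practice this iterates over the preprocessed SMILES dataset.
smiles_corpus = ["ON=C1CCCc2ccccc21", "O=C1CCCc2ccccc21"]

# Collect the subword ids that actually occur in the SMILES corpus.
used_ids = set(tokenizer.all_special_ids)
for smiles in smiles_corpus:
    used_ids.update(tokenizer(smiles, add_special_tokens=False)["input_ids"])

# Keep only the embedding rows that are actually used.
keep = sorted(used_ids)
old_embeddings = model.get_input_embeddings().weight.data      # (vocab_size, d_model)
new_embeddings = torch.nn.Embedding(len(keep), old_embeddings.shape[1])
new_embeddings.weight.data = old_embeddings[keep].clone()
model.set_input_embeddings(new_embeddings)
model.config.vocab_size = len(keep)
# A full implementation also remaps input ids (old id -> new row index),
# resizes or re-ties the LM head, and rebuilds the SentencePiece tokenizer.
```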
Finally, given that SMILES comprise only special symbols, numbers, and alphabet letters, where common atoms involved in organic reactions are represented as single or double letters, namely Br, Cl, N, O, S, P, F, I, b, c, n, o, s, p (note: upper case letters refer to non-aromatic atoms and lower case letters refer to aromatic atoms in SMILES), all the characters in SMILES sequences are UTF-8 compatible and can be easily fed into a tokenisation-free byte-level model such as ByT5. ByT5 is ‘tokenisation-free’ as it does not rely on a learned vocabulary to map words or sub-word units to vocabulary items and simply operates on the vocabulary of 256 possible byte values. Therefore, in this work we also analyse the potential of such byte-level natural language-pretrained encoder–decoder models for organic reaction prediction tasks. We run ByT5 without any input preprocessing to keep the length of the sequences tractable, i.e., the only variant tested is +none+orig. In prior work in NLP, it was shown that ByT5 is significantly more robust to noise in the data and performs better on tasks that are sensitive to spelling, grammar errors and ambiguous expressions, such as text on social media platforms or speech transcribed to text.20 Our goal is to assess whether ByT5’s advantage related to dealing with noise in NLP data can be translated to better handling of noise in chemical reaction data such as that typically encountered in the USPTO datasets.
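As a brief illustration of the byte-level view (assuming the transformers library and the public google/byt5-small checkpoint; the exact special-token offset is an implementation detail of the ByT5 tokeniser):

```python
from transformers import AutoTokenizer

smiles = "ON=C1CCCc2ccccc21"

# SMILES characters are plain ASCII, so UTF-8 encoding is one byte per character.
print(list(smiles.encode("utf-8")))        # [79, 78, 61, 67, 49, ...]

# ByT5 operates directly on these byte values (shifted by a handful of reserved
# special-token ids) rather than on a learned subword vocabulary.
tokenizer = AutoTokenizer.from_pretrained("google/byt5-small")
print(tokenizer(smiles, add_special_tokens=False)["input_ids"])
```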
(1) Forward reaction prediction (FWD-S) with reactant–reagent separation¶ on the USPTO_MIT dataset. The full training set consists of 409 035 input–output pairs.
(2) Single-step retrosynthesis (RETRO) on USPTO_50k, where the full training set comprises 40 029 input–output pairs.
(3) Reagent prediction (REAG) on the USPTO_500_MT dataset. While USPTO_500_MT has been created primarily as a multi-task dataset,6 unless stated otherwise we conduct task-specific tuning for the single task using only its corresponding data, which comprises 116 360 input–output pairs.
We use the datasets and corresponding splits as provided by Lu et al.6 We evaluate all the models on the full test set of the RETRO task (5004 pairs), while we randomly sample 10 000 test instances from the full, larger test sets of FWD-S and REAG to speed up inference due to the large number of experiments.||
000 steps. Longer pretraining and a larger pretraining set – as used e.g. by T5Chem6 – might yield models better adapted to SMILES input and output, but that setup exceeds our computational resources. We run continued pretraining for the FlanT5Base+trim+smi variant and ByT5Base; the batch size is set to 64 for both models, the learning rate is set to 0.0001 with inverse square root decay, and the optimiser used is Adafactor.24
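A minimal sketch of this optimiser configuration is given below (the warm-up length is illustrative and not taken from our setup).

```python
import torch
from transformers import AutoModelForSeq2SeqLM
from transformers.optimization import Adafactor

model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")

# Adafactor with an externally supplied learning rate of 1e-4.
optimizer = Adafactor(
    model.parameters(),
    lr=1e-4,
    relative_step=False,   # disable Adafactor's internal step-dependent schedule
    scale_parameter=False,
    warmup_init=False,
)

warmup_steps = 1000        # illustrative warm-up length (assumption)

def inverse_sqrt(step: int) -> float:
    """Multiplier on the base lr: linear warm-up, then inverse square root decay."""
    step = max(step, 1)
    if step < warmup_steps:
        return step / warmup_steps
    return (warmup_steps / step) ** 0.5

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=inverse_sqrt)
```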
For FWD-S, we fine-tune for 100 000 steps (∼16 epochs), while we fine-tune for 100 epochs for RETRO (62.5k steps), and 50 epochs for REAG (∼90k steps).** The optimiser is Adafactor in all the experiments. For each model variant and task, we select the checkpoint based on performance on the corresponding development set: in most cases, it is the end checkpoint.
Table 1
| Model ↓ | FWD-S Acc@1 | FWD-S Acc@2 | FWD-S Acc@5 | RETRO Acc@1 | RETRO Acc@3 | RETRO Acc@5 | REAG Acc@1 | REAG Acc@3 | REAG Acc@5 |
|---|---|---|---|---|---|---|---|---|---|
| T5Small | 89.01 | 93.36 | 95.45 | 42.35 | 57.97 | 63.13 | 3.85 | 7.36 | 9.32 |
| T5Base | 89.28 | 93.36 | 95.47 | 42.59 | 58.31 | 62.89 | 20.33 | 29.72 | 33.90 |
| FlanT5Base | 89.83 | 93.73 | 95.73 | 44.86 | 61.45 | 66.55 | 23.27 | 31.86 | 35.82 |
| ByT5Small | 90.06 | 93.75 | 95.71 | 43.96 | 58.81 | 63.09 | 22.85 | 31.20 | 35.43 |
| ByT5Base | 90.10 | 93.90 | 96.07 | 44.74 | 60.25 | 64.89 | 24.18 | 32.27 | 36.18 |
| molT5Small14 | 88.98 | 93.23 | 95.60 | 42.63 | 59.09 | 63.53 | 20.89 | 27.81 | 31.39 |
| molT5Base14 | 89.90 | 93.68 | 95.75 | 42.71 | 58.45 | 63.77 | 25.0 | 32.87 | 36.82 |
| nach0Base15 | 87.33 | 92.12 | 94.72 | 41.33 | 57.35 | 62.59 | 25.0 | 33.54 | 37.26 |
Running the best-performing models on the full test set of FWD-S yields peak Acc@1/Acc@5 scores of 90.2/96.2, which surpasses the performance of Molecular Transformer5 (88.8/92.6) and is close to the T5Chem model (90.4/96.4), even though our models have not been pretrained on SMILES prior to task-specific fine-tuning. The advantage over Molecular Transformer also holds on the full test set of RETRO (Acc@1 of 45.0 vs. 43.5), but the gap to T5Chem is slightly larger (45.0/66.7 vs. 46.5/70.5). Additional results showing the minimal variation in relative trends between the subsampled 10k test set and the full set are available later in Table 4.
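For reference, Acc@K can be computed as an exact match of canonical SMILES within the top-K generated candidates; the sketch below (assuming RDKit is available) illustrates the metric rather than reproducing our exact evaluation code.

```python
from rdkit import Chem

def canonical(smiles):
    mol = Chem.MolFromSmiles(smiles)
    return Chem.MolToSmiles(mol) if mol is not None else None

def acc_at_k(predictions, references, k):
    """predictions[i] is the ranked list of candidate SMILES for test instance i."""
    hits = 0
    for candidates, reference in zip(predictions, references):
        target = canonical(reference)
        if target is not None and any(canonical(c) == target for c in candidates[:k]):
            hits += 1
    return hits / len(references)

# Example: a single instance where the correct product is ranked second.
print(acc_at_k([["CCO", "ON=C1CCCc2ccccc21"]], ["ON=C1CCCc2ccccc21"], k=2))  # 1.0
```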
Table 2
| Variant ↓ | FlanT5Small FWD-S Acc@1 | FlanT5Small FWD-S Acc@2 | FlanT5Small FWD-S Acc@5 | FlanT5Small RETRO Acc@1 | FlanT5Small RETRO Acc@2 | FlanT5Small RETRO Acc@5 | FlanT5Base FWD-S Acc@1 | FlanT5Base FWD-S Acc@3 | FlanT5Base FWD-S Acc@5 | FlanT5Base RETRO Acc@1 | FlanT5Base RETRO Acc@3 | FlanT5Base RETRO Acc@5 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| +orig+smi | 88.92 | 93.18 | 95.59 | 42.53 | 59.79 | 65.73 | 89.83 | 93.73 | 95.73 | 42.87 | 58.47 | 63.83 |
| +trim+smi | 89.32 | 93.29 | 95.66 | 44.68 | 60.71 | 66.13 | 89.91 | 93.72 | 95.70 | 44.86 | 61.45 | 66.55 |
| +orig+none | 88.70 | 93.22 | 95.56 | 43.07 | 58.05 | 63.21 | 89.84 | 93.51 | 95.58 | 42.03 | 57.71 | 62.05 |
| +trim+none | 0.03 | 0.04 | 0.06 | 0.16 | 0.22 | 0.22 | 0.04 | 0.05 | 0.07 | 0.16 | 0.2 | 0.24 |
| cont+trim+smi | — | — | — | — | — | — | 89.34 | 93.39 | 95.48 | 43.55 | 59.89 | 65.17 |
| +orig+simple | 88.95 | 93.31 | 95.52 | 43.86 | 60.71 | 65.67 | 89.25 | 93.31 | 95.52 | 42.61 | 58.83 | 64.39 |
| Random | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
Finally, as expected, the +trim+none variant yields extremely low scores: we report it for didactic purposes to emphasise how a mismatch between input preprocessing, vocabulary and tokenisation can have an extremely detrimental impact. This variant trims the vocabulary based on the +spaces preprocessing, while the actual input to the model does not undergo the same preprocessing step; the resulting mismatch forces the model to deal with (sub)sequences that map to ‘unknown’ tokens.
In order to verify whether further performance benefits can be reaped from an even larger encoder–decoder model, we fine-tune FlanT5Large (780m parameters) on the RETRO task. However, the increase in model size does not result in any increase in performance, with Acc@1/Acc@3/Acc@5 scores of 43.96/59.61/64.23. In fact, the scores even decrease slightly, which might be the result of overfitting to the RETRO training data.
We analysed a classic organic reaction: the generation of a ketoxime from hydroxylamine and a ketone in the presence of HCl and CH3OH. The SMILES representation of the reaction is as follows: NO.O=C1CCCc2ccccc21.CO.Cl>>ON=C1CCCc2ccccc21. This reaction involves first the nucleophilic attack of the nitrogen in hydroxylamine (NH2–OH) on the carbonyl carbon in the ketone, followed by two successive proton transfers from the nitrogen to the oxygen in C=O to allow for elimination of water, resulting in the formation of the oxime functional group (C=N–OH) (Fig. 3).
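The reaction SMILES can be parsed and canonicalised with RDKit to inspect the individual species (illustrative only):

```python
from rdkit import Chem

rxn_smiles = "NO.O=C1CCCc2ccccc21.CO.Cl>>ON=C1CCCc2ccccc21"
reactant_side, product_side = rxn_smiles.split(">>")

# Canonicalise each reactant/reagent (hydroxylamine, ketone, methanol, HCl).
for smi in reactant_side.split("."):
    print(smi, "->", Chem.MolToSmiles(Chem.MolFromSmiles(smi)))

# The product is the ketoxime.
print("product:", Chem.MolToSmiles(Chem.MolFromSmiles(product_side)))
```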
Fig. 4 displays the computed Shapley values for this reaction from the ByT5 and FlanT5 models. The figure is generated using a Seaborn heatmap and coloured using the ‘bwr’ colormap from matplotlib. The y-axis represents the sequence of tokens from the reactants and reagents, while the x-axis represents the sequence of tokens from the predicted product. Additionally, we visualised the impact of tokens in the reactants and reagents on the first few tokens in the product by projecting the Shapley values onto their 2D structures (Fig. 5). Fig. 5 was generated using the GetSimilarityMapFromWeights function in RDKit26 and coloured using the ‘bwr’ colormap from matplotlib.
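A minimal sketch of this projection step is shown below (illustrative only; atom_weights is a placeholder for per-atom attributions, the call follows the Draw2D-based API of recent RDKit versions, and the figures in this work used the matplotlib-based variant with the ‘bwr’ colormap).

```python
from rdkit import Chem
from rdkit.Chem import Draw
from rdkit.Chem.Draw import SimilarityMaps

mol = Chem.MolFromSmiles("O=C1CCCc2ccccc21")        # the ketone reactant

# One weight per atom (placeholder values, not real Shapley attributions).
atom_weights = [0.0] * mol.GetNumAtoms()
atom_weights[0] = 0.8                                # e.g. emphasise the carbonyl oxygen

drawer = Draw.MolDraw2DCairo(400, 400)
SimilarityMaps.GetSimilarityMapFromWeights(mol, atom_weights, draw2d=drawer)
drawer.FinishDrawing()
with open("ketone_attribution.png", "wb") as f:
    f.write(drawer.GetDrawingText())
```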
Fig. 5 Visualisation of the impact of tokens in the reactants and reagents on the first few tokens in the predicted product.
Both models highlight the hydroxylamine (N and O, token_0 and token_1 from ByT5; NO, token_0 in FlanT5) as having the most significant impact on the product, which aligns with the underlying reaction mechanism. Furthermore, both models correctly identified that the =N–OH in the product originates from the hydroxylamine, while the oxygen in the reactant ketone has a much weaker impact. This suggests that the models may have learned the correct reaction pathway, i.e. the oxygen in the ketone as the leaving group. In addition, atoms present in both the reactant and product (such as the aromatic ring and the cyclic aliphatic ketone) also exhibit noticeable impact. For the double bond in C1=N–OH, neighbouring carbons in the cyclic aliphatic ring have a strong impact. Although the exact reason for this is not clear, it is well known that nucleophilic additions to carbonyl groups, such as this one, are sensitive to steric hindrance from neighbouring atoms. It is likely that our models have learned correlations between key substructures of the molecule, as has been observed in T5Chem.
We provide two additional examples of SHAP analysis with more complex product structure and reaction conditions (e.g. reagents) in the SI.† FlanT5 seems slightly better at identifying key reagents involved in the reactions compared to ByT5. However, the impact of the reagents is not always associated with the reaction centre. Further work is needed to provide a more quantitative analysis of the SHAP method for chemical reaction prediction models.
We fine-tune the models on nested subsets covering 1/2 (204 518 training instances), 1/4, …, down to 1/128 (3196 instances) of the full FWD-S training set. Each smaller set is a subset of all the larger (sub)sets. The models in comparison are the Base versions of FlanT5 (+trim+smi), ByT5, and molT5: the summary of their performance across different FWD-S data sizes is provided in Fig. 6. The three models display very similar ‘performance trajectories’, where molT5 lags slightly behind the other two models for smaller data samples across both Acc@K metrics. Combined with the results already observed in Tables 1 and 2, these plots point to a more general conclusion that the actual training data size and quality is more instrumental to the final performance than the chosen encoder–decoder architecture. While there is some variation that stems from the choice of input preprocessing, tokenisation, and model size, the data size is still a key factor determining the ‘performance magnitude’. This conjecture calls for future research across two axes: (1) creation of larger and higher-quality training and evaluation datasets15 for single-task and multi-task training; (2) work on more sample-efficient learning via transfer learning.
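The nested property of the subsets (each smaller sample contained in all larger ones) can be obtained by shuffling once and then taking prefixes; a minimal sketch:

```python
import random

def nested_subsets(dataset, fractions=(1/2, 1/4, 1/8, 1/16, 1/32, 1/64, 1/128), seed=42):
    """Shuffle once, then take prefixes so every smaller subset is contained in the larger ones."""
    shuffled = list(dataset)
    random.Random(seed).shuffle(shuffled)
    return {f: shuffled[: int(len(shuffled) * f)] for f in fractions}

subsets = nested_subsets(range(409_035))
print({f: len(s) for f, s in subsets.items()})   # sizes of the nested subsets
```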
Fig. 7 Beam analysis experiments: (a) FWD-S; (b) RETRO; we vary beam size from 1 (greedy) to 10; the FlanT5Base variant is +trim+smi.
Additional experiments with top-K sampling, nucleus sampling and contrastive search, with hyper-parameters suggested by NLP research (e.g., temperature for top-K sampling, top_p for nucleus sampling), did not yield any noteworthy benefits over simple greedy search. Tuning the hyper-parameters of the efficient nucleus sampling on development set performance can yield small benefits at inference over greedy search without hurting inference efficiency, but the gains are typically slight, ranging between 0.1 and 0.4 Acc@1 points. In summary, a more focused study on the impact of the decoding strategy for SMILES generation is also warranted as part of future research.
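For illustration, the decoding strategies compared here map onto the standard generate API as follows (hyper-parameter values are illustrative, and a public checkpoint stands in for a fine-tuned model).

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# A public checkpoint stands in for a fine-tuned reaction prediction model here.
tokenizer = AutoTokenizer.from_pretrained("google/byt5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("google/byt5-small")

inputs = tokenizer("NO.O=C1CCCc2ccccc21.CO.Cl", return_tensors="pt")

# Greedy decoding (beam size 1, no sampling): the most efficient option.
greedy = model.generate(**inputs, max_new_tokens=144)

# Beam search returning the 5 best candidates (as used for Acc@5).
beams = model.generate(**inputs, max_new_tokens=144, num_beams=5, num_return_sequences=5)

# Top-K sampling, nucleus (top-p) sampling, and contrastive search.
top_k = model.generate(**inputs, max_new_tokens=144, do_sample=True, top_k=50, temperature=0.7)
nucleus = model.generate(**inputs, max_new_tokens=144, do_sample=True, top_p=0.9)
contrastive = model.generate(**inputs, max_new_tokens=144, penalty_alpha=0.6, top_k=4)

print(tokenizer.batch_decode(greedy, skip_special_tokens=True))
```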
We provide training time estimates for fine-tuning on the FWD-S task (400 000+ instances) in Table 3. The estimates were made in the following setup for all models: 100 000 training steps with a batch size of 16 and gradient accumulation of 4 (yielding an effective batch size of 64), and input and output sequence lengths set to 144. All the estimates are based on single runs on a single 24GiB NVIDIA RTX 4090 GPU. The time estimates indicate that the main experiments with the chosen model architectures and their corresponding sizes can typically be run on consumer-level GPUs. Another finding is that some speed-ups can be achieved by trimming the vocabulary and the corresponding embeddings (moving from +orig to +trim) without any performance degradation (see Table 2 again).
Table 3 Training time estimates (100 000 training steps) in the FWD-S task on a single 24GiB RTX 4090 GPU
| Model variant | Time |
|---|---|
| FlanT5Small+orig+smi | 5 h 20 min |
| FlanT5Small+trim+smi | 4 h 25 min |
| FlanT5Base+orig+smi | 14 h 50 min |
| FlanT5Base+trim+smi | 13 h 35 min |
| ByT5Small | 16 h 20 min |
| ByT5Base | 30 h 55 min |
Table 4
| Model ↓ | 10k: Acc@1 | 10k: time (batch 64) | 10k: time (batch 16) | 10k: time (batch 4) | Full: Acc@1 | Full: time (batch 64) | Full: time (batch 16) | Full: time (batch 4) |
|---|---|---|---|---|---|---|---|---|
| Greedy search | ||||||||
| FlanT5Small | 89.10 | 1′15′′ | 3′57′′ | 11′56′′ | 89.20 | 4′53′′ | 15′53′′ | 46′25′′ |
| FlanT5Base | 89.85 | 2′27′′ | 5′49′′ | 17′59′′ | 89.73 | 9′59′′ | 22′53′′ | 1h12′14′′ |
| ByT5Small | 89.91 | 2′33′′ | — | — | 90.01 | 10′23′′ | 14′29′′ | 28′08′′ |
| ByT5Base | 89.95 | 3′23′′ | 4′21′′ | 9′6′′ | 90.01 | 13′36′′ | 17′25′′ | 36′38′′ |
| Beam search (beam = 5) | ||||||||
| FlanT5Small | 89.32 | 6′48′′ | 8′36′′ | 18′22′′ | 89.36 | 27′23′′ | 35′55′′ | 1h12′48′′ |
| FlanT5Base | 89.91 | 12′59′′ | 13′41′′ | 27′4′′ | 89.86 | 52′32′′ | 54′1′′ | 1h44′19′′ |
| ByT5Small | 90.06 | 12′40′′ | 13′42′′ | 16′5′′ | 90.10 | 51′5′′ | 53′57′′ | 1h4′32′′ |
| ByT5Base | 90.10 | 14′20′′ | 15′23′′ | 18′39′′ | 90.20 | 58′18′′ | 1h4′4′′ | 1h14′9′′ |
Since we implemented customised inference batching, we acknowledge that further efficiency gains might be achieved through additional code optimisation and/or through the use of quantisation techniques;27 this goes beyond the scope of this work.
000 training steps on multi-task data of the USPTO_500_MT dataset:6 it again covers three tasks (forward reaction prediction, retrosynthesis, and reagent prediction), where the training data of each task constitutes 116 360 instances, for a total of 349 080 training instances for multi-task training. Each instance is marked by a specific prefix (Product:, Reactants:, Reagents:) which links it to the original task.
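A minimal sketch of this prefixing scheme is given below (the prefix strings follow the convention above; the field names and the exact separator are illustrative).

```python
# Prefix strings follow the USPTO_500_MT convention described above; the field
# names and whether a separator follows the prefix are illustrative assumptions.
TASK_PREFIX = {
    "forward": "Product:",
    "retrosynthesis": "Reactants:",
    "reagent": "Reagents:",
}

def to_multitask_instance(task, source_smiles, target_smiles):
    return {"input": TASK_PREFIX[task] + source_smiles, "target": target_smiles}

print(to_multitask_instance("forward", "NO.O=C1CCCc2ccccc21.CO.Cl", "ON=C1CCCc2ccccc21"))
```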
We fine-tune two models – FlanT5Base (+orig+none for simplicity) and ByT5Small – and run standard evaluation on the three tasks. Acc@1 scores of the multi-task models on the three tasks are 24.01 (REAG), 95.82 (FWD) and 71.77 (RETRO) for FlanT5Base, and 24.15, 96.78 and 72.24 for ByT5Small. This indicates that multi-task fine-tuning is also possible starting from language-pretrained model checkpoints. We note that Acc@1 scores from the multi-task models are higher on both FWD (95.82 and 96.78) and RETRO (71.77 and 72.24) compared to our single-task models presented in Tables 1 and 2 (∼90 for FWD and ∼44 for RETRO). The ∼70% RETRO accuracy from the multi-task models is above the typical 45–50% but is consistent with the findings of the recent paper of Lu and Zhang,6 which used a similar multi-task approach. Our accuracy range is also broadly in line with several recent literature sources that explored various GNN models and data augmentation strategies (top-1 accuracy of 60% to 70% for retrosynthesis).28–31 Possible reasons for our higher top-1 accuracy on the RETRO task are:
(1) Different datasets: for single task-specific experiments, we used the corresponding dedicated datasets (e.g., USPTO_MIT for FWD-S; USPTO_50k for RETRO). For multi-task fine-tuning we used the USPTO_500_MT dataset, created by Lu and Zhang exactly for multi-task fine-tuning and evaluation. This dataset contains five tasks: forward reaction prediction, single-step retrosynthesis, reagent prediction, reaction type prediction, and reaction yield prediction. According to Lu and Zhang, “the training, validation and testing sets are well separated to ensure no reaction overlapping across all tasks” to avoid a possible data leakage problem.6 In terms of size, USPTO_50k includes 50 016 reactions, which is much smaller than USPTO_500_MT with its 143 535 reactions. However, the higher accuracy may not be fully accounted for by dataset size alone, as it has been shown that a larger dataset such as USPTO_MIT (480k reactions) only brings a small improvement in RETRO top-1 accuracy.28,29
(2) Advantage of the multi-task approach: multi-task reaction prediction has only emerged in the past few years, demonstrating that forward reaction, retrosynthesis and reagent predictions are different yet closely related tasks and can therefore be leveraged together to build a more robust model.6,32,33 Qiao et al. showed that transformer-based multi-task models can improve top-1 accuracy by 7–10% using just a few thousand reactions, and that the extent of improvement can depend on training data size.32 With our much larger multi-task training data (USPTO_500_MT, 143 535 reactions), a higher improvement in accuracy is possible.
We then continue to fine-tune the multi-task ByT5Small for the single REAG task. This yields marginal gains in the REAG task (from 24.15 to 24.38 with greedy search), but, as expected, it also leads to catastrophic forgetting of the other two tasks (e.g., Acc@1 drops from 95.82 to 35.26) due to full-model specialisation. Motivated by these preliminary results, we plan to delve deeper into multi-task fine-tuning setups in future work, also coupled with recent advances in modular and parameter-efficient learning in NLP that by design avoid issues such as catastrophic forgetting and interference in multi-task setups.34–36
Footnotes
† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d4fd00104d
‡ Instruction-tuning is a specialised form of fine-tuning in which a model is fine-tuned using pairs of input–output instructions, enabling it to learn specific tasks guided by these instructions.
§ In practice, we did not tokenise the entire set of 116m SMILES sequences but stopped tokenisation once the set of ‘seen subwords’ had not been extended after seeing 500k new SMILES sequences.
¶ Preliminary experiments on the task version with mixed reactants and reagents6 yield very similar relative trends in results and comparisons; we thus omit them for brevity and to save computation.
|| We have empirically validated that the relative trends in results do not change due to the test set sampling. We also run a smaller selection of models on full test sets to enable direct comparison to prior methods that operated on the same datasets.
** We capped the maximum number of steps at 100k to keep training times manageable and the work computationally feasible given the large number of trained models. We note that FWD-S has by far the largest training set out of the three tasks and there is a possibility that the models on FWD-S are ‘under-trained’ and would benefit from additional fine-tuning.