EnzymeMap: curation, validation and data-driven prediction of enzymatic reactions

Enzymatic reactions are an ecofriendly, selective, and versatile addition, sometimes even alternative to organic reactions for the synthesis of chemical compounds such as pharmaceuticals or fine chemicals. To identify suitable reactions, computational models to predict the activity of enzymes on non-native substrates, to perform retrosynthetic pathway searches, or to predict the outcomes of reactions including regio- and stereoselectivity are becoming increasingly important. However, current approaches are substantially hindered by the limited amount of available data, especially if balanced and atom mapped reactions are needed and if the models feature machine learning components. We therefore constructed a high-quality dataset (EnzymeMap) by developing a large set of correction and validation algorithms for recorded reactions in the literature and showcase its significant positive impact on machine learning models of retrosynthesis, forward prediction, and regioselectivity prediction, outperforming previous approaches by a large margin. Our dataset allows for deep learning models of enzymatic reactions with unprecedented accuracy, and is freely available online.


Introduction
][3][4][5][6][7][8][9][10][11] Recently, enzymatic synthesis routes to chemicals such as 1,4-butanediol, 12 branched chain higher alcohols, 13 or complex natural products such as the investigational HIV treatment islatravir, 14 or the investigational antiviral agent molnupravir 15 were developed.9 Since enzymes can exhibit both substrate and reaction promiscuity, 20 i.e. can act on non-native substrates or catalyze new reactions, the toolbox of possible biocatalytic transformations extends well beyond reactions seen in nature.Low catalytic activity may be increased by directed evolution.[21][22][23][24][25] Furthermore, computer-aided de novo design of enzymes has been reported for selected transformations, 26 opening up the possibility to design an enzyme for a given task.Enzymatic reactions and cascades thus provide a promising eco-friendly alternative to conventional synthesis pathways for a large range of compounds.
However, compared to organic synthesis, where a huge variety of computational tools based on heuristics and machine learning aid the design of synthesis plans, [27][28][29][30][31][32][33][34][35][36][37] enzymatic synthesis tools are less common and oen limited in their applicability and accuracy.Existing tools for bioretrosynthesis planning, [38][39][40][41][42][43][44] enzyme selection, 45,46 and reaction rule extraction or scoring [47][48][49] are oen restricted by the available data, both concerning the amount and the quality of reported reactions in databases.Especially deep-learning-based approaches such as reinforcement learning models, 39 or transformer models 43 rely on large amounts of high-quality reaction data, which is easily obtainable for organic reactions, but not for enzymatic reactions.High-quality, curated datasets such as RHEA 50 usually only report one or a few reactions per enzyme class and are thus small, whereas larger, uncurated databases such as BRENDA 51 pose problems to resolve the provided substrate names, and may contain unbalanced or erroneous reactions.In addition, stereochemical aspects of enzymatic reactions are oen disregarded, due to missing or erroneous entries in many databases.This is a major limitation of current approaches, since enzymatic reactions are favored especially for products containing multiple stereocenters, which can be difficult to obtain via organic reactions.Furthermore, atom mappings of enzymatic reactions in current databases are oen not available, and can be tedious and error-prone to obtain, especially for biochemical transformations.Datasets of enzymatic reactions that include correct stereoinformation and are atom mapped, balanced, validated, diverse, and sufficiently large for datadriven deep learning applications are currently not available, severely hindering the development of new tools and models to design enzymatic synthesis pathways.In a sense, the current paradigm of data-driven research falls short in the eld of enzymatic synthesis planning due to the limited size and quality of the available data.We therefore propose to instead follow a research-driven data approach, where the data necessary for new developments is identied, created, and curated rst.
In this work, we obtain a broad, high-quality dataset of atom mapped and balanced enzymatic reactions including stereoinformation by developing a large toolbox of automated reaction curation and correction steps, which we apply to BRENDA entries on natural and non-natural substrate-product pairs.The obtained reactions comprise, to the best of our knowledge, the currently largest atom mapped database of biocatalytic transformations.The newly introduced Python code underlying this study furthermore bridges several current gaps concerning the handling of reactions, and tracking of atoms through reactions which current packages such as RDChiral 52 or RDKit 53 do not address yet.We then showcase how the size and quality of the obtained database, EnzymeMap, substantially improves deep learning models of enzymatic reactivity for a broad range of tasks and model architectures, leading to previously unseen accuracies, as well as performances on par with state-of-the-art organic synthesis planning tools.We make the full database and processing steps freely available.

Methods
In the following, we describe the construction of a validated, atom mapped database of balanced enzymatic reactions from BRENDA entries.We then describe the details of the models trained for retrosynthesis, forward predictions, and regioselectivity prediction.The code to reproduce all processing steps from a raw BRENDA entry to a validated, mapped reaction is available as easy-to-use Python package at https://github.com/hesther/enzymemap, along with the full dataset via Zenodo at https://zenodo.org/records/8254726(raw unmapped and processed mapped reactions).The EnzymeMap Python package amounts to nearly 4000 lines of code introducing currently missing functionality for mapping reactions via reaction rules, correcting wrong reactions, and standardizing atom mappings, amongst many other functions.We note that for some reactions or enzyme classes BRENDA includes additional (uncurated) information not included in EnzymeMap.If one is searching for more information on a particular reaction or enzyme class such as reaction conditions, we suggest the reader check the corresponding BRENDA entry and the original literature sources.

Data preparation
An overview of the data processing pipeline developed in this work is given in Fig. 1.The numbering of the individual steps corresponds to the subsection numbering in the following.In general, we denote molecules (substrates, products, cofactors) by SMILES strings, 54 and reactions by pairs of SMILES strings separated by "[".Enzymes were represented via their Enzyme Commission number (EC numbers/classes), as well as (if available) organism information and protein identiers.The EC number is a numerical classication scheme which groups enzymatic reactions and does not specify the specic enzyme, its sequence or origin organism.For a given reaction we recommend to query an enzyme database, or tools like BridgIt 46 to identify genes for enzymatic reactions if this information is not available.
2.1.1Processing of the BRENDA text le.BRENDA 2023-1 was downloaded from the internet (free of charge). 55Scripts to load BRENDA entries were taken from ref. 49, and several formatting xes, such as missing whitespaces or hyphens were added.Raw BRENDA reactions are given as text strings, with substrates and products specied by trivial names, as well as a tag to indicate reversibility.An example entry is shown in Fig. 2, where aer step 1 (extracting names and stoichiometry), the reaction text of the reservible reaction was obtained, where one molecule of acetyle and water react to one molecule of acetaldehyde.When querying BRENDA, we nd that this reaction is linked to acetylene hydratases in 12 different organisms (where for nine of them reversibility is indicated), 14 literature references and one uniprot entry.
2.1.2Resolving and standardizing molecules.Next, all trivial names present in the database were attempted to be resolved to valid SMILES strings.In Fig. 2, this amounts to the names .We followed six different resolving strategies, where we queried a BRENDA ligands download le whether the name is associated with an InChi or CHEBI key, and then resolved via InChi, 56 CHEBI, 57 and the trivial name.InChi keys were directly turned into SMILES strings via RDKit, 53 as well as sent as queries to Pubchem. 58CHEBI keys were resolved via Pubchem either via a direct name query, or a synonym query.Trivial names were resolved via Pubchem and Opsin. 59All returned results were standardized, and canonicalized using RDKit, and all unique entries kept.Therefore, for a single name, multiple SMILES strings may arise.We also tested tautomerization in RDKit but found several reactions where tautomerization was not helpful, so chose to remove it.In Fig. 2 the SMILES strings were obtained for , respectively, where all names were resolved to a single SMILES string.The individual SMILES strings were combined to yield reaction SMILES, where all possible combinations were taken into account if one or more names were associated with several SMILES strings.In Fig. 2 only a single reaction was obtained, namely .We note that step 2 can yield multiple SMILES string for a molecule name if records differ in the queried databases.In that case, all options were further considered, leading to multiple reaction SMILES for a single entry.Multiple entries were pruned later aer atom mapped SMILES strings were obtained, i.e. in step 6.
2.1.3Correction heuristics.Each reaction was checked for stoichiometry, as well as for common mistakes concerning missing or wrong cofactors, hydrogens, or hydrogenperoxide.Detected errors were corrected in an automated fashion, such as increasing the amount of already present molecules, or adding hydrogens.Furthermore, molecules occurring in racemic mixtures were combined into a single molecule.In Fig. 2, the reaction passed all tests, and was kept as is.
2.1.4Atom mapping reactions.All balanced reactions were then atom mapped by applying the publicly available reaction rules from ref. 60 (termed "Broadbelt rule set" in the following) and tracking each atom throughout rule application.Since the Broadbelt rule set does not contain rules for reactions only affecting stereochemistry (e.g.cis/trans isomerases or racemases), we mapped reactions with the same achiral reactants and products directly without rule application.All other reactions were mapped through rule application.We rst aimed to reproduce the recorded product via a single application of each rule.If multiple rules produced the correct product, leading to different atom mappings, only the rule changing the fewest number of atoms and bonds, as well as being most frequently applicable was used.To speed up rule application, we ordered the rules from ref. 60 based on their frequency of applicability to all BRENDA entries.In Fig. 2, this procedure yielded the mapped reaction .The mapped reactions were saved along with the rules that were used to produce them.
The mapped reactions were then corrected for stereochemistry if a chiral center not participating in the reaction changed.This is usually the case when an isomeric SMILES was retrieved only for either the reactant or product, if the correct information was missing in the trivial name, or for wrong entries in the molecular databases used for name retrieval.Fig. 3 depicts an example where the wrong stereoisomer was retrieved for the product, and thus corrected.In principle, we do not know whether the stereoinformation is correct in the reactant or product.We choose to keep the stereoisomer of the reactant (concerning the chiral center which is not in the reaction center), since many product entries in BRENDA are erroneous.We note that this offers no guarantee to obtain a correct reaction, but was benecial in a large number of cases upon manual examination.
We furthermore validated each mapped reaction, e.g.checked whether the products can be recreated by extracting an RDChiral 52 template and applying it to the reactants.During the course of the project, we corrected some errors in RDChiral, available at https://github.com/hesther/rdchiral.If the template could not recreate the reaction, the mapping was dropped.
Several reactions in BRENDA could not be mapped by a single rule application.To address this, we developed an algorithm to allow for reactions occurring at multiple sites.For example, if a reaction yields the oxidation of two hydroxyl moieties in a molecule, the respective oxidation rule needs to be applied twice.We exhaustively enumerated the outcomes of applying a rule up to two times (the maximum number of steps can be customized in our soware package).To speed up this enumeration, we used only the reaction rules already present in the EC class from step 4A (or the full set if no reaction rules were recorded yet).If multiple rule applications led to the desired outcome, the overall reaction was saved, as well as all the individual steps.As an example, Fig. 4 depicts the oxidation of 1,2-butanediol, which requires rule application (oxidation by Fig. 2 Processing of an exemplary reaction text from one substrateproduct pair of acetylene hydratase.Since the reaction is tagged as reversible, two mapped and validated reactions are obtained.
Fig. 3 Oxidation of 17b-hydroxyetiocholan as an example of a reaction where wrong stereoisomers were retrieved.For readability, atom maps were omitted, as well as NAD+ and NADH.The reactive center is marked in blue in the reactant molecules, and the corresponding wrong chiral center in the product is marked in red.
molecular oxygen) at two different sites.Since the order of the individual steps is not known, all four individual reaction steps are included in the database, and agged with a keyword to indicate the reaction was obtained from a multi-step reaction.For individual reactions, molecules not participating were omitted, here for example the second oxygen molecule at step 1.
If the obtained single-step reactions were the same due to symmetry, duplicates were dropped.Multi-step reactions as well as their corresponding single-step components were then corrected for stereoinformation similar to single step reactions.
2][63] First of all, the available tools feature an imperfect accuracy especially for biochemical transformations. 62Second, recording the reaction rule used to map a reaction allows for further processing steps, such as judging the quality of a mapping, as well as the quality of the reaction itself by comparing reaction rule counts across enzyme classes.We can furthermore correct wrong stereoinformation in the products.Finally, we can easily group reactions into classes by comparing their reaction rules.Mapping via known rules thus offers the possibility to construct a higher quality dataset compared to simply mapping a known dataset with conventional mapping tools.We demonstrate the benet of our mapping scheme for subsequent machine learning tasks later in this manuscript.However, we note that the coverage of reactions that can be mapped heavily depends on the chosen rule set.To the best of our knowledge, the Broadbelt rule set from ref. 60 is the most complete and validated set currently available, but even for this set a number of missing rules was identied in the course of the present study, and will be addressed in future work.
2.1.5Proposing reactions based on reaction rules.A signicant number of reactions in BRENDA are unbalanced, or could not be mapped via reaction rules because the entry itself was awed.Usually, the reactant was correctly extracted from literature, but the product was not entered correctly, oen because it is not explicitly mentioned in the original publication.
For example, the reaction , EC 1.1.1.103wrongly lists isopropylaldehyde as product, although the oxidation of isopropanol clearly should lead to propanone.The same mistake (a ketone falsely named as aldehyde, which introduces an additional carbon atom) occurred hundreds of times in BRENDA for many different reactions, implicating a systematic error in their reaction curation.To correct for unmapped or unbalanced reactions, we propose probable reaction outcomes to correct reactions based on the reaction rules occurring in the same EC class and the similarity of both reactants and products via Tanimoto similarities of Morgan ngerprints 64 as implemented in RDKit. 65These reactions are agged as 'suggested', and may be ltered by the user.Both reactant and product similarities were taken into account to identify similar reactions based on the reactants, and then choose the most probable product aer rule application, which oen leads to multiple possible products due to the generic nature of the employed rule set.
2.1.6Post-processing.As described above, a single BRENDA entry can lead to multiple valid mapped reactions if the trivial names were resolved to different SMILES strings.Although formally valid, not all reactions may necessarily be meaningful.For example, sugars were sometimes retrieved in their closed or opened form for the reactants and products respectively, as depicted in Fig. 5.We therefore counted the number of bond edits (and the number of involved atoms) and only kept the set of reactions with the minimal number of bond edits observed.
For all mapped reactions in an EC class, we then computed the frequency of appearance of its corresponding reaction rule, and saved its relative frequency to the column 'quality'.This column can be used to lter the database for high-quality reactions.Oen reactions with a low quality correspond to reactions that were balanced and atom mapped, but nevertheless erroneous.For example, within EC 1.1.1.1,the oxidation and reduction of hydroxy, oxy and carboxy groups occurred frequently, as depicted in Fig. 6, whereas a rule showing an oxidation plus methyl shi only occurred once, suggesting that the entry in BRENDA was possibly wrong.In fact, the depicted reaction was extracted from ref. 66 which states that the reactant 3-methylbutanol was oxidized to 3-methylbutanal (not 3-Fig.4 Oxidation of 1,2-butanediol as an example of a multi-step reaction requiring rule application at two sites.For readability, atom maps were omitted.The stoichiometry of the reaction was corrected (note the missing "2" in the original reaction text).The multi-step reaction can be split into four distinct single-step reactions, where either the C1 or C2 can be oxidized first.methylbutanone as listed in BRENDA).The entry thus corresponds to a wrong extraction of products in BRENDA, which we found to occur frequently.
We then standardized the obtained mappings so that the mapped reaction SMILES can be used directly to detect duplicates.
As a nal post-processing step, we added the backward reactions for reversible reactions.BRENDA labels some reactions as reversible, some as irreversible and some as unknown.For reversible reaction, we directly added the backward reaction to our dataset and labeled it 'reversed'.For reactions with unknown reversibility, we queried whether reactions with the reverse reaction template existed in the same EC class.If so, the reverse reaction was added to the dataset and labeled 'rever-sed_suggested'.Altogether, we thus arrive at approximately 63k reactions (235k if protein + organism information is taken into account), yielding the largest atom mapped database of enzymatic reactions to date.If the EC class is disregarded, this boils down to 48k non-duplicate reactions.
The keywords for reversible and presumably reversible are listed in Table 1, together with all keywords for single and multistep reactions, as well as suggested reactions based on the recorded reactants.

EnzymeMap database
The full EnzymeMap database is freely available via Zenodo at https://zenodo.org/records/8254726as a CSV le, with columns as displayed in Table 2.The benchmarks described in this study use version 2.0 of EnzymeMap.The columns contain information as obtained from BRENDA without any curation or verication.We note that many reactions occur both as natural and non-natural reactions, oen in different organisms, but sometimes also in the same organism.Since this information together with the original text of the reaction can be used to link an entry back to BRENDA, we chose to keep this information.Depending on the intended usage of the database, we recommend to rst lter for the reactions of interest, and then remove duplicates stemming from information not needed for a specic application.

Other datasets
In addition to BRENDA, this work also utilized reactions from KEGG and MetaCyc to compare the coverage and quality of atom mappings of the EnzymeMap database.
A list of KEGG enzymatic reactions given as trivial names and their EC numbers was downloaded from https:// www.genome.jp/kegg-bin/get_htext(accessed 2023-07-24) amounting to 20k reactions.The reactions were then passed through the EnzymeMap workow, leading to 10k entries where a full 4-digit EC number was specied and all names could be resolved (aer step 2).7.1k entries had at least one possible balanced reaction SMILES.Aer step 6, 8.0k reactions (including reversed and suggested reactions, thus more than the initial 7.1k mapped reactions) were obtained, out of which 6.6k reactions were unique disregarding the EC number.
MetaCyc version 27.0 was obtained from http:// www.biocyc.org/,Copyright SRI International 2022.The 18k atom mapped SMILES strings distributed with MetaCyc did not include cofactors and were partially not balanced, so that we cross-linked the distributed atom mapped SMILES with

Machine learning models
In the following, we describe the methodology and details of the employed machine learning models for retrosynthesis, forward prediction and regioselectivity prediction, which we train on the EnzymeMap database.To this end, we have selected tasks that depend on atom mapped reactions (as opposed to tasks and models taking only substrate information into account To assess the improvements EnzymeMap offers over other databases, we train a neural network model to rank relevant retrosynthetic templates to produce a given product, as also used in the open-source synthesis planning tool ASKCOS. 67The model therefore suggests reaction templates given an input molecule, where we use RDChiral 52 to produce chiral reaction templates and retrain the template-relevance model from ref. 31  with the default hyperparameters described therein.Since the model is trained as classication task (i.e. to identify the template that led to the given product), it is important that all templates extracted from a dataset are mutually exclusive.We thus used the code from ref. 75 to arrive at exclusivity-corrected templates.
The model itself takes product Morgan ngerprints of length 2048 and radius 2 as input, and uses a single hidden layer of 2048 neurons with RELU activation functions to map the ngerprint to its retrosynthetic template.A dropout rate of 0.2 was applied.The learning rate was set to 0.001, with early stopping if the validation error did not improve for three consecutive epochs.
Three different datasets were employed.RHEA, 50 exactly as provided in ref. 44, as well as MetAMDB 76 (which is based on BKMS-react 77 ) and EnzymeMap (this work, based on BRENDA 51 ), since these datasets contain balanced atom mapped reactions.MetAMDB consists of approximately 43 000 reactions with atom mappings generated by the Reaction Decoder Tool, 61 and thus constitutes the largest atom mapped reaction set prior to the current study, serving as a comparison to EnzymeMap regarding quantity.RHEA was chosen because it is highly curated and reliably, thus serving as a comparison regarding quality.Furthermore, it was utilized in a recent bioretrosynthesis study. 44For MetAMDB and EnzymeMap, we follow the RHEA preparation instructions from ref. 44 closely.Namely, we removed all products occurring more than 100 times in the dataset to get rid of common cofactors or other frequent molecules such as protons, and then deleted molecules from the reactants that had no matching atoms in the products.We then kept all reactions that had a single product molecule.Importantly, duplicates were removed for all reactions, which might occur where e.g.reactions differed only in their cofactors before cofactor removal.This creation of single- product reactions is necessary for the chosen model architecture, which can only take a single product as input.For Enzy-meMap, this led to 18k reactions and 2.8k templates.For MetAMDB, 14k reactions and 5k templates were obtained.For RHEA, 8k reactions and 4k templates were obtained.All datasets were randomly split to 80% training, 10% validation and 10% test sets.We used the entire dataset, including templates that occurred infrequently or even only once.This made our modelling task much more challenging compared to approaches that exclude templates with few reaction precedents.We furthermore report ablation studies with the same test set as EnzymeMap for MetAMDB, RHEA, and subsets of EnzymeMap, were all identical reactions to this common test set were removed and the remainder of reactions split into 80% training and 20% validation sets, respectively.In another ablation study, we remapped EnzymeMap (either raw from BRENDA, or aer the EnzymeMap workow) using the popular transformer-based atom mapping soware RXNMapper, 63 which is the current state-of-the-art mapper regarding versatility and accuracy.We used default settings as provided in the RXNMapper Python package. 78e report top-N accuracies for all models, i.e. the ratio of test datapoints where the correct template was identied in the top-N suggestions.Although this metric is associated with a few shortcomings concerning its translation to real-world performance for retrosynthesis tasks, 79 it allows us to compare our approach to performances for organic retrosynthesis, which is typically reported using top-N accuracies.
2.4.2Forward reaction and retrosynthetic pathway prediction models based a transformer model.To assess the importance of a large and diverse dataset for deep learning reaction models, we retrained a recently published enzymatic reaction prediction tool 43 within the IBM RXN toolbox, [68][69][70][71] based on a transformer architecture, and originally trained on the ECREACT dataset. 43Both the forward-reaction prediction, as well as the retrosynthesis prediction were retrained exactly as described in ref. 43 aer adding the respective reactions from EnzymeMap on top of ECREACT and deduplication.The original model was retrained, too, to ensure that the correct values were reproduced.In detail, the models 'EC3' were retrained, which utilize the rst three digit of the EC number in the reaction SMILES, for example for the EC number 1.1.3.2: The forward model predicts the product side, based upon the reactant side, .The backward model takes the product side as input and predicts the reactants, including the three-digit EC number.Each model consisted of a transformer encoder and decoder with 4 layers, a word vector and RNN size of 384, positional encoding turned on, 8 attention heads with global and self attention, adam as optimizer with b 1 = 0.9 and b 2 = 0.998, a learning rate of 2.0 with the noam decay method, and a dropout rate of 0.1.Batches of 6144 were used.Both models were trained on an OpenNMT 80 Version adapted for molecules, see ref. 43 for further details.
Since the models do not require atom mapped reactions, we use the full set of reactions aer step 2 in Fig. 1, amounting to 100k reactions, where only duplicates within an EC class (but not across EC classed) were removed, since the model makes use of the EC class.Aer tokenization at the third-digit EC class, and adding the ECREACT reactions, we arrive at 90k unique reactions (as opposed to 57k reactions with only ECREACT).In this benchmark, we do not showcase the capabilities of the EnzymeMap correction and curation pipeline (since the reactions are unmapped and uncorrected), but only the inuence of the size of the dataset and the number of reactions per EC class.The 100k reactions contain also unbalanced reactions, as well as those that could not be mapped in our processing pipeline, since we expect the transformer model to still be able to make use of them.Furthermore, if different options to resolve a trivial name to a SMILES string were found, the dataset contained all options, which explains the large size of 100k versus 63k reactions in EnzymeMap.
To further showcase the effect of data curation, correction and validation, we retrain both the forward and backward model with ECREACT with only processed EnzymeMap reactions (n = 63k) added.Aer tokenization, the dataset amounts to 84k reactions, so less than in the previous benchmark where the raw reactions were added, but at higher quality.
We report top-N accuracies for the forward prediction, backward prediction, and roundtrip task (backward prediction followed by single forward prediction).
2.4.3Regioselectivity models based on graphconvolutional neural networks of the transition state.Lastly, we showcase the importance of the underlying dataset when predicting the regioselectivity of enzymatic reactions.We again use the single-product versions of RHEA, MetAMDB, and EnzymeMap described above for the template-relevance model.For EnzymeMap, only reactions obtained via a direct, single rule application were used.For all reactions in the respective datasets, reaction templates were extracted via RDChiral. 52The templates were then applied to the reactants, and all reactions were kept that produced more than one possibility for the products.Thus, regioselective reactions were identied, i.e.where multiple sites could have reacted in theory.Recorded reactions were labeled "1" and all other reactions as "0".For example, if the application of a rule extracted from , produced and the following three lines were added to the dataset: where and are SMILES strings.For Enzy-meMap, this led to 21k reactions from 5.1k reactants.For MetAMDB, 15k reactions from 3.3k reactants were obtained.For RHEA, 7.6k reactions from 1.9k reactants were obtained.All datasets were split randomly into 80% training, 10% validation and 10% test data.
We note that the models did not make use of the EC class or protein information, and therefore only learn which reaction outcomes are more likely in a general sense, and not specic to a protein.Reactions with different outcomes based on the EC class or protein identity thus add noise to the dataset, so that even a perfect model cannot reach 100% accuracy.We did not include the EC class or protein sequence, since recent work identied major shortcomings in current approaches to encode the protein information in a meaningful way even for highly curated data. 81e then trained a classication model on the full training sets, as well as random subsamples of differing sizes.We chose the graph-convolutional neural network architecture Chemprop 72 with reaction support from ref. 73 (CGR-Chemprop) for our classication models, since this framework was recently demonstrated to learn high-accuracy reaction properties such as barrier heights, rates, and regioselectivities. 74CGR-Chemprop relies on transforming reaction SMILES to their corresponding condensed graph of reaction, an overlay between the reactant and product graphs, i.e. the articial, graph-based transition state of the reaction.We trained for 100 epochs and used 10-fold cross-validation for each prediction task since the performance is very sensitive to the data split.All other hyperparameters such as number and size of layers, or learning rates were kept at their default values (three rounds of message passing with hidden size of 300, mean aggregation over the graph, 2 feed-forward layers with a hiden size of 300).We report the at accuracy of the classication, i.e. the model's ability to discern between reactive and non-reactive data points.For example, if the classication model predicted for the test set the accuracy (percentage of correct predictions) would be two out of ve, i.e. 20%.We furthermore report the top-1 accuracy to identify the correct products given the reactants, i.e. the fraction of test data points where the reaction labeled "1" had a higher raw predicted value than all reactions labeled "0" originating from the same reactants and template.For the example above, if the raw scores from the model (prior to applying a threshold to create binary labels from the continuous predictions) were , then the model scored the correct products at rank 1 for , and at rank 2 for , leading to a top-1-accuracy of 50%.

Database analysis
Table 3 lists the number of mapped reactions that were obtained via direct mapping of the original entry via a single reaction rule application (step 4A in Fig. 1), via multiple rule applications (step 4B in Fig. 1), as well as via suggesting products based on the reactants of unmapped or unbalanced reactions (step 5 in Fig. 1).In total, 63 137 reactions with a unique mapped reaction within an EC class were obtained, not taking into account information on natural/non-natural substrates, organisms and proteins.Disregarding the EC class, 47 974   3.
unique reactions were obtained.When also discerning between natural/non-natural substrates, organisms and proteins, a total of 234 845 non-duplicate entries is obtained.The majority of the reactions in EnzymeMap (71%) stem from single step reactions from the original BRENDA entries.17% of non-duplicate entries (21% taking into account organisms and proteins) stem from BRENDA entries classied as "natural substrate/product pair".Fig. 7 depicts the number of mapped reactions per EC class, split into overall (outer circle), balanced (second circle) and mapped (inner circle) reactions, aer all pre-and postprocessing steps.We nd that no EC class exhibits a disproportionally large number of imbalanced reactions or failed mappings, and that EnzymeMap is mainly comprised of reactions in the EC classes 1 (oxidoreductases), 2 (transferases) and 3 (hydrolases) to nearly equal amounts, whereas the EC classes 4 (lyases), 5 (isomerases) and 6 (ligases) only make up about 10% of all reactions in total.Translocases (EC 7) were not considered for mapping, since they catalyze the movement of ions or molecules across membranes, so that the reactant and product molecules are usually the same.
We then compared the obtained reactions to KEGG and MetaCyc aer removing atom mappings, and found that of the nearly 48k unique reactions in EnzymeMap, only 0.5k are also found in MetaCyc (0.6k aer passing MetaCyc through the EnzymeMap workow, i.e. also adding reverse reactions), and 1.8k are also found in KEGG (5k aer passing KEGG through the EnzymeMap workow), where we compared the unmapped reaction SMILES strings.EnzymeMap thus provides access to several tens of thousands of reactions not covered by other databases.The workow furthermore is not limited to BRENDA.With minimal changes to the setup of the initial raw reactions, we were easily able to map reactions from KEGG and MetaCyc, leading to 8.0k and 4.6k mapped reactions, respectively.We distribute those alongside EnzymeMap, since a combination of all sources might be benecial for future data-driven prediction of enzymatic reactions.For MetaCyc, we were furthermore able to compare the atom mappings distributed with MetaCyc with atom mappings from the EnzymeMap workow.For reactions that led to a valid mapped reaction with our workow, we found same mappings for 70% of the reactions (accounting for equivalent mappings due to symmetry).For the reactions with different atom mappings, our mappings had an equal or less number of bond edits than the MetaCyc mappings in 92% of cases.Manual examination of random cases where EnzymeMap lead to better mapping results than the native MetaCyc mappings revealed a few hundred of errors in MetaCyc, with an example shown in Fig. 8, where ve atoms are wrongly mapped.We also found cases where atoms corresponding to different elements (for example phosphorus and carbon) were assigned the same map number.This highlights the need for better atom mapping routines for enzymatic reactions, or at least better post-processing steps to ag erroneous reactions, even for highly curated databases like MetaCyc.However, the Enzyme-Map workow is not without aws itself, due to missing rules in the employed Broadbelt rule set, leading to possibly wrong mappings or no mappings at all.For the latter case, we compared the number of MetaCyc reactions that could not be mapped via the EnzymeMap workow per 1-digit EC class, and found that all EC classes are about equally affected by failed mapping attempts.Manual inspection revealed many reactions in MetaCyc, for which no rule exists currently, including multistep reactions involving different reaction steps (e.g. a reaction followed by decarboxylation), reactions involving changes in rings, some transfer reactions, ring-forming reactions, and hydrolysis of peptide bonds, amongst others.We attempt to address missing rules in a future publication, and thus future versions of EnzymeMap.Since a new rule set only requires to update one le and re-run the workow, the incorporation of new rules is a trivial exercise.For the remainder of this study (and the current version of EnzymeMap), we stick with the Broadbelt rule set, since the creation and validation of a new rule set is well beyond the scope of the current study.on ngerprints, graph-convolutional neural networks on graphs, and transformer models on strings or graphs.Although these are different models trained on different datasets (i.e.not directly comparable), they showcase an important prerequisite for computer-aided retrosynthesis, namely that powerful one-step synthesis models are needed, which can subsequently be used in multi-step synthesis planning tools.Using EnzymeMap, for the rst time, an enzymatic retrosynthesis model is able to compete with the accuracy of organic retrosynthesis tools, which is remarkable given that the model is very simple, and was not optimized in any way.In comparison, models trained on Met-AMDB or RHEA feature a relatively low accuracy.A low top-N accuracy is especially problematic for designing multi-step synthesis pathways, where clever ranking algorithms are essential to navigate the combinatorially explosive number of template application possibilities at each reaction step.For instance, using the low top-1-accuracy of 0.18 for RHEA from Table 4, designing a pathway of three independent steps by taking the top-1 template at each step, only 0.18 3 = 0.006 = 0.6% of test products would yield the correct synthesis pathway.In practice, this would be even lower since we do not want to nd just any pathway consisting of three reactions that produce the product, but pathways starting from e.g.buyable materials, or having no unnecessary loops of protection and de-protection.In comparison, 9.7% of test products (0.46 3 = 0.097) would be recovered using the EnzymeMap model (taking only the top-1 recommendation), which is an about twenty-fold increase in success rate.We therefore anticipate EnzymeMap to also perform well with multi-step retrosynthesis models, although the training of such models via e.g.Monte Carlo Tree Search or Reinforcement Learning is beyond the scope of the current study.

Retrosynthesis based on neural-network
Table 4 furthermore lists several ablation studies to showcase the importance of the data curation, cleaning, and validation routines developed in this study.First, we explore whether mapping the resolved raw reaction SMILES from BRENDA via an alternative route to Fig. 1 affects the performance of retrosynthesis models.To this end, we map the rst reaction SMILES per BRENDA entry using the state-of-the-art transformerbased atom mapper RXNMapper. 63This leads to a large loss in performance yielding top-N accuracies close to those achieved by the MetAMDB model.We therefore conclude that the careful mapping, correction, and validation approach for enzymatic reactions developed in this work is essential for a good performance of reaction prediction models.Second, we take the cleaned and curated reactions from EnzymeMap and re-map them with RXNMapper.This also leads to a large loss in performance, although to a less extent, resulting in top-N accuracies below the performances typically achieved for organic reactions.We therefore conclude that also the quality of atom maps is essential for training reaction prediction models, and that simple, uncurated mapping impacts the performance negatively.This ablation study therefore showcases the importance of database quality.
We then retrain all models and test them on the EnzymeMap test set, aer removing overlap in the training and validation sets with the new test set.For RHEA and MetAMDB, this provides an even harder task trying to predict retrosynthesis steps of reactions for which no close analogue might be available in the training set, so that the low performances in Table 4 were somewhat expected.The performance loss is less for MetAMDB vs. the much smaller RHEA, indicating that Met-AMDB covers a wider range of reaction functionalities.
We then removed reactions from the EnzymeMap training and validation sets in a further set of ablation studies to benchmark the importance of database quantity.The performance degrades for every loss of information, starting with omitting multi-step reactions, reverse reactions, suggested reactions, and combinations thereof, showcasing the value and need for each of the processing steps in the EnzymeMap workow.The increased coverage of enzymatic reaction space by adding multi-step reactions, reverse reactions, or correcting entries via suggesting outcomes (instead of dropping the entry) is benecial to the performance of the trained retrosynthesis model, highlighting the need for larger reaction databases.
The ablation studies therefore highlight the importance of the developed data generation/curation approach both in terms of quality and quantity of the obtained reactions.Although our approach adds computational burden compared to out-of-thebox mapping strategies, the performance improvement in machine learning models warrants its use, and, for the rst time, enables accurate data-driven modeling of enzymatic synthesis planning.

Forward reaction prediction and retrosynthesis based on transformers
Next, we showcase the merits of EnzymeMap for large deeplearning models not dependent on atom mappings, i.e. benchmark only the effect of increased coverage, but not quality.The IBM RXN-for-Chemistry platform provisions forward and backward models for biocatalysed reaction predictions.Both models have been trained using a multi-task transfer learning approach on a transformer architecture.5][86][87] Including the enzyme commission numbers during training enables the backward model to not only predict the substrates based on a product, but also the class of the catalysing enzyme.While the model showed state-of-the-art performance, it was held back by the limited availability of enzyme-catalysed reactions.As ECREACT contained only 11 130 reactions extracted from BRENDA, retraining the models using the considerably larger non-atom mapped version of EnzymeMap in addition to ECREACT leads to a 59% increase in training set size (at an overall dataset size of 90 028), which results in an increase in the mean predictive accuracy of sub-classes in the forward model across all enzyme classes and an increase in the mean predictive accuracy of the retro model with the exception of transferases and ligases, see Table 5.As the new data introduced with the inclusion of EnzymeMap not only adds new samples for the regions of reaction space covered by ECREACT but also increases the covered space, the increase in dataset size does not always lead to an increase in predictive accuracy.Indeed, the addition of diverse data can even lead to a (insignicant) decrease in predictive accuracy as is the case for ligases in the backwards model.The above benchmarks showcase the effect of a sheer increase in dataset size, but not necessarily dataset quality.We therefore also retrained all models with only the processed and validated EnzymeMap reactions + ECREACT (with an overall dataset size of 83 470).Here, we nd even further improvements compared to the raw EnzymeMap dataset for nearly all enzyme classes in both the forward and reverse model.Thus, even for unmapped reactions, curating a high-quality dataset can have benecial effects to models trained on them, highlighting the need of both high-quality and high-quantity datasets for chemical deep learning.
Overall, EnzymeMap leads to a large performance increase in both the forward and retro model.The retrained models (on the previous version 1 of EnzymeMap) have been made available as the default biocatalysis models on the IBM RXN-for-Chemistry platform (https://rxn.res.ibm.com).

Regioselectivity prediction based on graph-convolutional neural network
Fig. 10 depicts the classication accuracy of CGR-Chemprop trained on regioselective reactions from EnzymeMap, Met-AMDB, and RHEA, as well as the fraction of data points in the test set where the true product was ranked highest (as opposed to false products obtained via template application to different sites in the reactants) in dependence on the number of training reactions.For the full training sets, Table 6 furthermore lists the achieved accuracy and top-1 accuracy.As evident in Table 6, EnzymeMap offers a large performance boost for regioselectivity predictions for the full datasets (right-most data points in Fig. 10), but also for random subsets of the training data.Interestingly, we nd a better top-1 accuracy for RHEA vs. MetAMDB when trained on the same number of datapoints (right panel in Fig. 10), which underpins the quality of RHEA which is heavily curated.Similarly, we conclude that ECREACT 43  EnzymeMap not only offers a benet in size and diversity (i.e.coverage of reactions), but also in its low inherent noise obtained via the extensive validation and curation efforts described in the Methods section.Through comparison of different training sizes, we conclude that both the quantity and quality of recorded reactions inuence the performance of reaction prediction models, giving EnzymeMap a large benet over other databases.With the full dataset utilized, we obtain a at accuracy of 87% to discern between reactive and nonreactive reaction instances, as well as a top-1 accuracy of 76% to identify the correct regioselective reaction outcome given a set of product options.Together with the high accuracies reported for retrosynthesis and forward prediction, this lays the groundwork of successful data-driven biocatalytic synthesis design.We anticipate further performance improvement upon inclusion of enzyme information to the model input, so that the model can learn different regioselectivies exhibited by different enzymes (which in this study only manifests as aleatoric, irreproducible error).

Conclusion
We have developed a database curation, validation, and mapping pipeline for enzymatic reactions, producing an extensive dataset of atom mapped, balanced, validated, and diverse reaction which include correct stereoinformation from BRENDA entries.We showcase that this new dataset, EnzymeMap, is sufficiently large for data-driven deep learning and offers signicant performance improvements over all previous databases for retrosynthesis, forward prediction and regioselectivity prediction tasks.For the rst time, we report prediction performances on par with organic retrosynthesis tools.This performance boost comes from both better coverage (larger number of total reactions, larger number of reactions per EC class), as well as better quality regarding the reactions itself and their atom mappings.We distribute the full pipeline as easy-to-use Python package, and demonstrated that the workow can be easily adapted to other databases such as MetaCyc or KEGG.We expect that the EnzymeMap database will spark numerous inventions in the eld of computer-aided enzymatic reaction prediction, especially for applications relying on mapped, balanced reactions such as the computeraided design of enzymatic cascades.

Fig. 1
Fig. 1 Schematic data processing pipeline to arrive at atom mapped, balanced reaction SMILES from raw BRENDA entries.Grey boxes represent the start and end points, purple parallelograms indicate input or output, pink boxes represent processes, and orange diamonds pose decision points in the pipeline.

Fig. 5
Fig. 5 Reduction of D-arabinose.The left reaction corresponds to a combination of an closed and open form, thus wrongly assuming that the reduction includes a ring-opening reaction.We therefore choose the reaction with less edits (right panel).

Fig. 6
Fig. 6 Quality of reaction based on rule count.
the unmapped MetaCyc reactions to add missing molecules and cofactors.Only entries with a full 4-digit EC class and not leading to an error upon loading into RDKit were retained, amounting to 7.0k reactions.The resulting reactions (balanced SMILES strings) were then passed to the EnzymeMap workow skipping steps 1-3.Aer step 6, 4.6k reactions (including reversed and suggested reactions) were obtained, out of which 4.5k reactions were unique disregarding the EC number.

Fig. 7
Fig. 7 Composition of BRENDA (without natural/non-natural, organism and protein information).(Outer ring): number of reactions per EC class.(Middle ring): fraction of balanced (colored) and unbalanced (gray) reactions.(Inner ring): fraction of reactions that were atom mapped (colored, multiple mappings per BRENDA entry were only counted once) or remain unmapped (gray).Since unbalanced reactions can be mapped via step 5 of our workflow, this number might be larger than the number of balanced reactions.Exact numbers in Table3.

Fig. 9
Fig. 9 depicts the top-N accuracy of neural networks trained on identifying templates which lead to the recorded precursors, i.e. predicting retrosynthetic single-step pathways.The gray area corresponds to top-N accuracies typically achieved by organic retrosynthesis models for the USPTO-50k dataset, with values taken from ref. 82, 83 and 75 including neural network models

Fig. 8
Fig. 8 Example of a wrong atom map in MetaCyc for the entry 'ACETYLGLUTKIN-RXN'.

Fig. 9
Fig.9Top-N accuracies for retrosynthesis models trained on different databases (RHEA, MetAMDB and EnzymeMap).The gray area corresponds to top-N accuracies typically achieved by organic retrosynthesis models on the USPTO-50k dataset.

Table 5
Top-N-accuracies of the forward, backward and roundtrip prediction task with the IBM Rxn-for-Chemistry transformer models with 3digit EC classes trained on different datasets.Training with ECREACT reproduces the results reported in ref. 43.Additionally adding all raw, unprocessed reactions increases the performances for most EC classes and tasks.When utilizing only the validated EnzymeMap reactions, the performance further increases Forward Backward Roundtrip N = 1 N = 3 N = 5 N = 10 N = 1 N = 3 N = 5 N = 10 N = 1 N = 3 N = 5 N = 10

Table 1
Meaning of the columns 'steps' and 'source' in the EnzymeMap database Likely reversible based on other reactions in EC class Source = suggested BRENDA entry could not be mapped or was unbalanced.Product was inferred based on rule frequency and product similarity Source = suggested reversed Reverse of a suggested reaction where the reversibility tag indicated a reversible reaction Source = suggested reversed_suggested Reverse of a suggested reaction where the reversibility tag was not specied.Likely reversible based on other reactions in EC class [72][73][74][71] an open-source model architecture representing the current state-of-the-art was chosen, namely (1) the single-step retrosynthesis model from ref.31which underlies the popular open-source synthesis planning tool ASKCOS,67(2) the transformer-based single-step forward and retrosynthesis model of the successful IBM RXN model,[68][69][70][71]and (3) the message-passing neural network Chemprop which recently was shown to produce high-quality reaction predictions.[72][73][74]2.4.1 Template-based retrosynthesis via neural networks.

Table 2
Format of the EnzymeMap database.All entries correspond to reactions that were balanced and mapped DescriptionIndex of the reaction in the raw le before mapping Atom mapped reaction SMILES Reaction SMILES where atom maps were removed Original reaction text from BRENDA.This can be used to trace back to the original BRENDA entry, which does not have any identier unfortunately SMARTS of Broadbelt rule used for mapping ID of the rule as distributed together with EnzymeMap Whether entry was obtained directly, via reversal or via suggesting products, see Table1Whether entry was obtained from single or multiple rule applications, see Table1Relative frequency of rule EC number Whether reaction was classied as naturally occurring Source organism One or multiple IDs of protein sequences Name of database for IDs in the protein_refs column

Table 3
Number of mapped and unmapped reactions without duplicates in EC class (and with protein/organism information in third column), where the mapping could either be inferred from the originally recorded database entry via one or multiple steps of reaction rule application, or was suggested from reaction rules in the same EC class

Table 4
Top-N accuracies for retrosynthesis models trained on different databases (RHEA, MetAMDB and EnzymeMap)