Xinchun Ran a, Yaoyukun Jiang a, Qianzhen Shao a and Zhongyue J. Yang *abcde
a Department of Chemistry, Vanderbilt University, Nashville, Tennessee 37235, USA. E-mail: zhongyue.yang@vanderbilt.edu; Tel: +1-343-9849
b Center for Structural Biology, Vanderbilt University, Nashville, Tennessee 37235, USA
c Vanderbilt Institute of Chemical Biology, Vanderbilt University, Nashville, Tennessee 37235, USA
d Data Science Institute, Vanderbilt University, Nashville, Tennessee 37235, USA
e Department of Chemical and Biomolecular Engineering, Vanderbilt University, Nashville, Tennessee 37235, USA
First published on 17th October 2023
Hydrolase-catalyzed kinetic resolution is a well-established biocatalytic process. However, the computational tools that predict favorable enzyme scaffolds for separating a racemic substrate mixture are underdeveloped. To address this challenge, we trained a deep learning framework, EnzyKR, to automate the selection of hydrolases for stereoselective biocatalysis. EnzyKR adopts a classifier–regressor architecture that first identifies the reactive binding conformer of a substrate–hydrolase complex, and then predicts its activation free energy. A structure-based encoding strategy was used to depict the chiral interactions between hydrolases and enantiomers. Unlike existing models trained on protein sequences and substrate SMILES strings, EnzyKR was trained using 204 substrate–hydrolase complexes, which were constructed by docking. EnzyKR was tested using a held-out dataset of 20 complexes on the task of predicting activation free energy, where it achieved a Pearson correlation coefficient (R) of 0.72, a Spearman rank correlation coefficient (Spearman R) of 0.72, and a mean absolute error (MAE) of 1.54 kcal mol−1. Furthermore, EnzyKR was tested on the task of predicting enantiomeric excess ratios for 28 hydrolytic kinetic resolution reactions catalyzed by fluoroacetate dehalogenase RPA1163, halohydrin HheC, A. mediolanus epoxide hydrolase, and P. fluorescens esterase. The performance of EnzyKR was compared against that of a recently developed kinetic predictor, DLKcat. EnzyKR correctly predicts the favored enantiomer and outperforms DLKcat in 18 out of 28 reactions, accounting for 64% of the test cases. These results demonstrate that EnzyKR is a new approach for predicting enantiomeric outcomes in hydrolase-catalyzed kinetic resolution reactions.
However, for a non-native substrate, identifying biocatalysts with high stereoselectivity for kinetic resolution can be challenging due to the unknown structure–function relationships.8 To address this, empirical and computational models have been developed to predict stereoselective outcomes of hydrolase-catalyzed kinetic resolution. In 1998, Kazlauskas et al.9 established a model that links the size or hydrophobicity of stereocenter substituents with enantioselectivity for ∼130 esters derived from secondary alcohols. In 2002, Tomić et al.10 used quantitative structure–activity relationship (QSAR) analysis to predict the enantioselectivity of Burkholderia cepacia lipase (BCL)-catalyzed acylation reactions involving thirteen racemic 3-(aryloxy)-1,2-propanediols. In recent years, machine learning has emerged as a powerful tool to predict stereoselective biocatalytic processes.11 For example, Cadet et al.12 developed a machine learning model to predict the impact of mutations on the enantioselectivity of epoxide hydrolase. The model was trained using 9 possible single point mutation variants and achieved an R² of 0.81 on a test set containing 28 mutants. Despite the significant advances in models that specialize in enantiomeric prediction for certain types of hydrolases, “generalist” models that can predict enantioselectivity across a broad spectrum of hydrolase scaffolds, mechanisms, and substrate types remain undeveloped.11
One promising strategy is to directly predict the kinetic parameters for an enzymatic reaction, because the apparent selectivity in kinetic resolution directly connects to the difference in hydrolytic rates between enantiomers. In recent years, predictive models for the enzyme turnover number (i.e., kcat) have been developed for metabolic engineering.11 For example, Heckmann et al.13 used elastic net regression, random forest, and deep neural network models to predict kcat values in Escherichia coli, achieving a cross-validated Pearson R2 value of 0.31 for kcat and 0.76 for kapp,max. Li et al.14 developed a deep learning model, DLKcat, to predict genome-scale kcat values for over 300 yeast species, achieving a Pearson R value of 0.94. However, one major pitfall in the existing models is the lack of chirality representation of the substrates. As such, these models likely fail in the task of enantiomeric prediction.
To address this limitation, here we developed a deep learning model, EnzyKR, to predict the enantiomeric outcome of hydrolase-catalyzed kinetic resolution reactions. EnzyKR adopts a classifier–regressor architecture to predict kcat values for hydrolase–substrate pairs. Distinct from existing kcat predictors, EnzyKR encodes the chirality information of substrates through geometric features, substrate dihedral angles and atomic distance maps extracted from hydrolase–substrate pairs. As the difference in kcat values between enantiomers informs stereoselectivity, EnzyKR can potentially be used to screen and select hydrolase scaffolds for stereoselective biocatalysis applications.
In the regressor component of the model, the input consists of embeddings from the classifier, which are concatenated with the substrate–enzyme distance information and the dihedral angles representing the substrate's chiral center, mirroring the encoding approach employed in the distance encoder of the classifier. To encode the embeddings, the regressor uses one cross-attention module with 8 attention heads and a dropout rate of 0.1. The attention module is followed by residual blocks that extract features with a dimension of 612 × 2718 from the cross-attention embeddings. The residual blocks consist of three 2D dilated convolution layers with a filter size of 11 and a padding size of 1, one 2D batch norm layer, and one ReLU layer. Subsequently, a two-layer fully connected neural network (i.e., a multilayer perceptron) is employed to regress the activation free energy (i.e., ΔG‡) from the extracted features.
The structural models for hydrolase–substrate complexes were constructed using RosettaLigand20 (ESI, Text S3†). Each substrate sdf file was obtained from the PubChem API by searching for its SMILES string. Conformational sampling was conducted for each substrate to generate 250 conformers using the BCL::Conf web interface.21 These conformers were used as input for docking into the active site of the corresponding hydrolase using RosettaLigand. The docked hydrolase–substrate complexes were divided into two categories based on the spatial proximity between the enzyme's catalytic residues (i.e., the catalytic triad) and the geometric center of the reacting functional group on the substrate. If the distances are all within 4.0 Å, the substrate–enzyme complex was classified as reactive. Otherwise, the complex was classified as unreactive. Each reactive complex was also visually inspected to ensure optimal positioning of the substrate in the active site. In total, we curated 224 reactive hydrolase–substrate complexes versus 448 unreactive ones. To examine the capability of EnzyKR to differentiate enantiomers, we curated an independent test set comprising the structure and experimentally characterized enantiomeric excess ratio (ee%) for 28 hydrolytic kinetic resolution reactions catalyzed by fluoroacetate dehalogenase RPA1163 (PDB ID: 5K3F),6 halohydrin HheC (PDB ID: 1PWX),22 A. mediolanus epoxide hydrolase (PDB ID: 4I19),23 and P. fluorescens esterase (PDB ID: 1AV4).24 The data for the ee% ratio were manually curated from the publications. For each of the 56 hydrolase–enantiomer complexes, we adopted the above-mentioned docking approach to build the structural model.
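The reactive/unreactive labelling rule described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes each catalytic residue is represented by a single atom coordinate, treats "within 4.0 Å" as less than or equal to 4.0 Å, and uses hypothetical coordinates and function names (`centroid`, `is_reactive`).

```python
import math

def centroid(points):
    """Geometric center of a list of (x, y, z) coordinates."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))

def is_reactive(catalytic_atoms, reacting_group_atoms, cutoff=4.0):
    """Label a docked pose as reactive when every catalytic atom lies
    within `cutoff` angstroms of the reacting group's geometric center.
    One representative atom per catalytic residue is an assumption."""
    center = centroid(reacting_group_atoms)
    return all(math.dist(atom, center) <= cutoff for atom in catalytic_atoms)

# Hypothetical coordinates in angstroms, for illustration only.
triad = [(1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (0.0, 0.0, 3.0)]
reacting_group = [(0.0, 0.0, 0.0), (1.0, 1.0, 0.0), (-1.0, -1.0, 0.0)]
far_pose = [(x + 10.0, y, z) for x, y, z in reacting_group]

print(is_reactive(triad, reacting_group))  # pose close to the triad
print(is_reactive(triad, far_pose))        # pose displaced by 10 angstroms
```

A geometric filter of this kind only pre-screens poses; the text notes that reactive complexes were additionally inspected visually.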
The regressor leverages a cross-attention module to encode a representation matrix that concatenates the embedding of the classifier, substrate dihedral angles, and the atomic distance maps. The representation matrix is fed into a one-layer residual block to extract features from the cross-attention embeddings. These features are then used to predict the ΔG‡ value of a hydrolase–substrate complex through a two-layer multilayer perceptron (MLP).
The EnzyKR architecture is distinct from existing deep learning kcat or ΔG‡ predictors in three aspects.11,13,14 First, EnzyKR explicitly encodes spatial interactions between the hydrolase and the substrate in the form of a substrate–enzyme atomic distance map and substrate dihedral angles for both the classifier and regressor, rather than relying on annotation or tensor concatenation to embed them.25 Second, EnzyKR uses a cross-attention block to extract important features from the hydrolase sequence, substrate isomeric SMILES strings, substrate dihedral angles and the enzyme–substrate atomic distance map. This allows the model to effectively identify the most relevant encoded features for downstream prediction tasks. Third, EnzyKR employs a GNN to encode the substrate's atomic connectivity, which is likely more effective than mere one-hot embedding. Notably, new encoding strategies for molecular structures have been developed that preserve chiral information, such as ChIRo26 and SELFIES.27 These methods serve as potential alternatives for the future development of EnzyKR.
$$\Delta G^{\ddagger} = -RT \ln\left(\frac{h\,k_{\mathrm{cat}}}{k_{\mathrm{B}}T}\right) \qquad (1)$$
Fig. 2 Statistics of the curated dataset used for developing EnzyKR. (a) Distribution of enzyme commission (EC) subtypes for the hydrolases used in this work. The specific hydrolase subtypes as well as their EC numbers (up to the second digit) are labeled on the right-hand side of the pie chart. (b) Distribution of activation free energy, ΔG‡, for a total of 224 hydrolase–substrate complexes, in which ΔG‡ values are converted from kcat using Eyring's equation shown in eqn (1). The bin size is 1.8 kcal mol−1.
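The conversion from kcat to ΔG‡ mentioned in the caption can be sketched with the standard Eyring relation, kcat = (kB T/h) exp(−ΔG‡/RT). The temperature (298.15 K) and a transmission coefficient of 1 are assumptions here, since the text does not state the values used; the function names are hypothetical.

```python
import math

R_KCAL = 1.987204259e-3   # gas constant, kcal mol^-1 K^-1
K_B = 1.380649e-23        # Boltzmann constant, J K^-1
H = 6.62607015e-34        # Planck constant, J s

def kcat_to_dg(kcat, temperature=298.15):
    """Activation free energy (kcal/mol) from kcat (s^-1) via Eyring's
    equation, assuming a transmission coefficient of 1."""
    return -R_KCAL * temperature * math.log(kcat * H / (K_B * temperature))

def dg_to_kcat(dg, temperature=298.15):
    """Inverse conversion: kcat (s^-1) from an activation free energy."""
    return (K_B * temperature / H) * math.exp(-dg / (R_KCAL * temperature))

# A turnover number of 1 s^-1 corresponds to roughly 17.5 kcal/mol at 298.15 K.
print(round(kcat_to_dg(1.0), 2))
```

Because ΔG‡ depends on the logarithm of kcat, a tenfold change in kcat shifts ΔG‡ by only about 1.4 kcal/mol at room temperature, which gives a sense of scale for the MAE values reported in the text.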
To evaluate the performance of EnzyKR's ΔG‡ regressor, we employed the Pearson correlation coefficient R, Spearman correlation coefficient R, and mean absolute error (MAE) as metrics (Fig. 3). Additional statistical metrics, such as mean square error and root mean square error, are reported in Table S1 of the ESI.† The parity plot for the training set (204 data points) shows a linear correlation with a Pearson R of 0.85, Spearman R of 0.79, and an MAE of 0.97 kcal mol−1. For the test set, the parity plot shows a Pearson R of 0.72, Spearman R of 0.72, and MAE of 1.54 kcal mol−1. In both training and test sets, EnzyKR exhibits similar values of Spearman R and Pearson R, indicating a balanced accuracy in predicting the ΔG‡ value and its ranking without overfitting. Further benchmarks show that the dataset splitting ratio used here (i.e., training set:test set = 204:20, roughly 90%:10%) is optimal – further decreasing the proportion of the training set reduces model performance (ESI, Table S2†).
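For readers who want to reproduce the three regression metrics on their own predictions, a minimal standard-library sketch follows. The ΔG‡ values below are made up for illustration and are not data from this work; Spearman R is computed as the Pearson R of average ranks.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def spearman_r(x, y):
    """Spearman rank correlation: Pearson R computed on the ranks."""
    return pearson_r(ranks(x), ranks(y))

def mae(x, y):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)

# Hypothetical experimental and predicted activation free energies (kcal/mol).
experimental = [14.0, 15.5, 16.2, 17.8, 19.1]
predicted = [14.6, 15.0, 17.1, 17.5, 18.4]
print(pearson_r(experimental, predicted))
print(spearman_r(experimental, predicted))
print(mae(experimental, predicted))
```

Reporting Pearson and Spearman together separates two questions: whether predicted values track experimental ones linearly, and whether the ordering of complexes is preserved.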
Compared to the training set, the drop of EnzyKR performance on the test set is likely due to the small data size. We thus tested the model performance by employing a pretrained large-scale sequence embedding, Evolutionary Scale Modeling-2 (ESM-2),28 to encode the input enzyme sequence (ESI, Table S1†). We expected that the ESM-2 model would improve the model accuracy by enriching the latent space with evolutionary and biophysical information. However, the results indicate no improvement of regressor accuracy compared with the original CNN encoder (Pearson R = 0.66, Spearman R = 0.67, and MAE = 1.95 kcal mol−1). Nor does the ESM-2 sequence encoder improve the classifier accuracy (AUC = 0.81, ESI, Fig. S1†). These results suggest that the prediction accuracy of EnzyKR on substrate binding poses or ΔG‡ values does not critically depend on the sequence encoder (ESI, Text S1†). Our hypothesis is that the accuracy likely relies on the capability of the deep learning model to describe enzyme–substrate interactions. The ESM-2 embedding, despite incorporating evolutionary and biophysical information learned from large numbers of sequences, does not explicitly describe enzyme–substrate interactions, and thereby fails to enhance the model performance. In support of this hypothesis, we observed a significant increase of errors in the regressor after excluding the atomic distance map of substrate–enzyme complexes (ESI, Table S1†). We should note that curating a high-quality structure-sequence-kinetics dataset is challenging. In our integrated structure-kinetics database IntEnzyDB,19 the total number of hydrolase–substrate pairs is only 355, from which the hydrolase mutants and unstructured substrates (e.g., cellulose) have been removed for the development of EnzyKR.
Furthermore, we compared the performance of EnzyKR against two predictors: DLKcat,14 a deep learning kcat predictor, and a compound–protein interaction (CPI) model25,29 that predicts the substrate–enzyme binding affinity Kd. Using the same hydrolase training set (204 data points) and test set (20 data points) curated for EnzyKR, we retrained DLKcat and CPI models based on the code reported in their original publications, and then evaluated their predictive performances. The results show that the retrained DLKcat model exhibits a Pearson R of 0.64, a Spearman R of 0.63, and an MAE of 1.7 kcal mol−1, and the CPI model exhibits a Pearson R of 0.63, a Spearman R of 0.65, and an MAE of 1.8 kcal mol−1 (ESI, Table S3†). In comparison, EnzyKR performs better in accuracy (especially for Spearman R) than DLKcat and the CPI model in predicting activation free energies. This is likely due to EnzyKR's incorporation of the atomic distance map of substrate–enzyme complexes, which enhances the efficiency of the model to learn structure information that is critical for predicting reaction kinetics.
We curated a test set comprising the structure and experimentally characterized enantiomeric excess ratio (ee%) for 28 racemic substrates that undergo hydrolase-catalyzed reactions (Fig. 4). Four types of enzymes are included: fluoroacetate dehalogenase RPA1163, halohydrin HheC, A. mediolanus epoxide hydrolase (AMEH), and P. fluorescens esterase (PFE). We defined a positive sign of the ee% value for a substrate whose S-configuration is more favored than its R-configuration; a negative sign if the opposite is true. To balance the ee% test set, we included 17 reactions with a positive sign of ee% and 11 with a negative sign. Among these reactions, 13 of them fall into the range of (50%, 100%), 4 into the range of [−50%, 50%], and 11 into the range of (−100%, −50%). The ee% test set biases toward a higher ee% value (either positive or negative) because these reactions are more stereoselective, thereby signifying a stronger relevance to synthesis.
Specifically, RPA1163 catalyzes the defluorination of (S)-2-fluoro-2-phenylacetic acid and its derivatives (i.e., 1a–i) with a high, positive ee% value (i.e., ≥95%). HheC catalyzes the ring-opening reaction of (R)-spiro-epoxyoxindoles and their derivatives (i.e., 4j–r) with a high, negative ee% value (i.e., ≤ −95%). AMEH catalyzes the hydrolysis of epoxide compounds (i.e., 7s–z) with a diverse range of ee% – racemic substrates 7t and 7v show an ee% value of <−99%; 7u, 7w, 7y, and 7z show a positive ee% value greater than 85%; 7s and 7x show a positive ee% value lower than 50%. PFE catalyzes the hydrolysis of the ester bond of (S)-1-phenyl-2-pentyl acetate. Both cases included in the test set (i.e., 10 and 14) involve a positive ee% value lower than 50%.
To predict the ee% value using EnzyKR, we first constructed the isomeric SMILES strings and structural files (i.e., .sdf file) for the substrate enantiomers, and then the hydrolase–substrate complexes. Taking the hydrolase–substrate complex, enzyme sequence, and substrate SMILES string as the input, EnzyKR predicts the ΔG‡ values for both R- and S- enantiomers, denoted as ΔG‡R and ΔG‡S, respectively. Finally, the predicted ΔG‡R and ΔG‡S values are plugged into eqn (2) to obtain ee% values, which range from −100% to 100%. A positive ee% value indicates the preference of the S-configuration in the reaction.
(2) |
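Since the body of eqn (2) is not recoverable from the text here, the sketch below shows one common way to turn two activation free energies into an ee% value that matches the stated sign convention (positive when the S-enantiomer reacts faster). It assumes ee% = 100 × (kS − kR)/(kS + kR) with each rate proportional to exp(−ΔG‡/RT), equal prefactors for both enantiomers, and a temperature of 298.15 K. The exact expression used by the authors may differ, and `ee_percent` is a hypothetical name.

```python
import math

R_KCAL = 1.987204259e-3  # gas constant, kcal mol^-1 K^-1

def ee_percent(dg_r, dg_s, temperature=298.15):
    """Assumed form: ee% = 100 * (k_S - k_R) / (k_S + k_R), with
    k proportional to exp(-dG/RT). Algebraically this equals
    100 * tanh((dG_R - dG_S) / (2RT)), which avoids overflow."""
    return 100.0 * math.tanh((dg_r - dg_s) / (2.0 * R_KCAL * temperature))

# Hypothetical activation free energies in kcal/mol.
print(ee_percent(15.0, 15.0))  # equal barriers give no selectivity
print(ee_percent(17.0, 15.0))  # lower barrier for S gives a positive ee%
print(ee_percent(15.0, 17.0))  # lower barrier for R gives a negative ee%
```

Under this assumed form the output is bounded between −100% and 100%, and a barrier difference of about 2 kcal/mol already gives an ee% above 90%, so errors of 1 to 2 kcal/mol in either predicted ΔG‡ can change the predicted ee% substantially.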
Fig. 5 shows the ee% values predicted by EnzyKR (red) and DLKcat (grey), along with the reference experimental values (black). EnzyKR correctly predicts the favored enantiomer and outperforms DLKcat in 18 out of 28 reactions (i.e., 1a–c, 1e–i, 4k–o, 7s–t, 7w–x, and 14), accounting for 64% of the test cases. Using EnzyKR, we observed 12 reactions whose predicted ee% value is within a 30% margin of error of the experimental value (i.e., 1a, 1c, 1e, 1f, 1g, 1h, 1i, 4j, 4k, 4o, 7t, and 14), but only 1 such reaction using DLKcat (i.e., 10). In 15 out of 28 test cases, DLKcat predicts trivial ee% values that fall within ±5% (i.e., 1g, 4k–n, 4p, and 7s–10, ESI, Table S4†). This is likely because DLKcat does not explicitly learn structural or chiral interactions between a hydrolase and its substrate enantiomer. Therefore, the ee% values predicted by DLKcat largely reflect random variation.
Fig. 5 The predicted enantiomeric excess (ee%) values of EnzyKR (red) and the baseline model DLKcat (grey) for 28 enantiomeric pairs in hydrolase-catalyzed kinetic resolution. The labels of the derivatives are consistent with those used in Fig. 4. The reference experimental ee% value is shown in black. |
Since the distribution of ee% values appears sparse and discrete, we classified the test set reactions into three categories: (1) strong preference for the R-configuration, ee% ∈ (−100%, −50%); (2) strong preference for the S-configuration, ee% ∈ (50%, 100%); and (3) moderate stereoselectivity, ee% ∈ [−50%, 50%]. We evaluated the prediction performance of EnzyKR using four statistical metrics of classification: accuracy, recall, precision, and F1-score. We compared the performance scores of EnzyKR to those of DLKcat (ESI, Table S5†). EnzyKR achieves an accuracy of 0.55, indicating that 55% of the reactions are predicted in the correct category of enantiomeric preference. In contrast, DLKcat achieves an accuracy of 0.21, which is significantly lower. EnzyKR achieves a recall of 0.58, indicating that 58% of the actual positive cases are correctly predicted. This also outperforms DLKcat, which shows a lower recall of 0.39. Both models exhibit a similar precision score (EnzyKR: 0.53 vs. DLKcat: 0.55), indicating a similar proportion of true positive predictions among all positive predictions. Finally, the F1-score, the harmonic mean of precision and recall, was employed to evaluate the “balanced” accuracy of both models. EnzyKR has an F1-score of 0.51, which is significantly higher than that of DLKcat (0.19). These results show that EnzyKR, which embeds the 3D structure and substrate–hydrolase interaction into the model (i.e., atomic distances and dihedrals), substantially outperforms DLKcat, in which no such information is effectively encoded. Notably, the classification accuracy of EnzyKR is about 6 times that of DLKcat (i.e., EnzyKR: 0.50 versus DLKcat: 0.08) if we focus only on the two categories of reactions that involve a strong preference for the R- or S-configuration (i.e., 24 reactions). Since these two categories of reactions are desired in synthesis, EnzyKR is more practically advantageous than DLKcat in guiding the identification of hydrolase scaffolds for resolving a racemic substrate mixture.
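The three-category evaluation can be sketched as below. The category boundaries follow the text; how precision, recall, and F1 were averaged over the three categories is not stated here, so unweighted (macro) averaging is an assumption, and the ee% values are made up for illustration.

```python
def ee_category(ee):
    """Map an ee% value to 'S' (> 50), 'R' (< -50), or 'moderate'."""
    if ee > 50.0:
        return "S"
    if ee < -50.0:
        return "R"
    return "moderate"

def classification_scores(true_ee, pred_ee, labels=("R", "S", "moderate")):
    """Accuracy plus macro-averaged precision, recall, and F1.
    Macro averaging is an assumption, not taken from the source."""
    y_true = [ee_category(v) for v in true_ee]
    y_pred = [ee_category(v) for v in pred_ee]
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return {"accuracy": accuracy, "precision": sum(precisions) / n,
            "recall": sum(recalls) / n, "f1": sum(f1s) / n}

# Hypothetical experimental and predicted ee% values.
experimental_ee = [97.0, 99.0, -99.0, -96.0, 30.0, 90.0]
predicted_ee = [80.0, 20.0, -70.0, 10.0, 5.0, 60.0]
scores = classification_scores(experimental_ee, predicted_ee)
print(scores)
```

Restricting the evaluation to the two strongly selective categories, as done in the text for the 24-reaction subset, would amount to filtering the input lists on the experimental category before calling the function.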
Finally, we would like to discuss several limitations and challenges that warrant future development of EnzyKR. First, the accuracy of the current version of EnzyKR is likely limited by the small size of the training set. Collecting more high-quality data (e.g., sequence, structure, selectivity, and kinetics) for enzyme-catalyzed hydrolytic kinetic resolutions, which can potentially enhance the model performance, remains a difficult task. Although extensive studies have been reported for hydrolase-catalyzed kinetic resolution,9,10 recycling these data from the literature for machine learning requires substantial effort in data cleaning and validation. Advances in large language models can potentially assist information extraction for biocatalytic data. Second, the current version of EnzyKR predicts an intrinsic trend of stereoselectivity, and cannot predict the impact of temperature, pH, and other conditions on kinetic resolution. How to effectively embed temperature effects into the model is an open question for our ongoing investigation. The solution to this problem likely depends not only on the improvement of data quality, but also on innovation in model architecture. Third, the current version of EnzyKR applies only to hydrolases, and will be extended to other classes of enzymes (e.g., oxidoreductases, transferases, etc.) in our future work. In particular, it remains to be investigated how to build a general encoder for representing substrate–enzyme complexes across different reactions and substrate types. New deep learning-based structural encoders, such as the Equivariant Graph Neural Network30 or E(n)-transformers, are promising strategies to further enhance the encoding of EnzyKR. In our future studies, we aim to address these challenges and further evolve EnzyKR into a generalizable model.31
EnzyKR was tested on a kinetic resolution task involving 28 hydrolytic reactions catalyzed by fluoroacetate dehalogenase RPA1163,6 halohydrin HheC, epoxide hydrolase AMEH and esterase PFE. EnzyKR correctly predicts the favored enantiomer and outperforms DLKcat in 18 out of 28 reactions, accounting for 64% of the test cases. To statistically assess its performance on the kinetic resolution dataset, we conducted a three-category classification based on experimental enantiomeric excess values: ee% ∈ (−100%, −50%), ee% ∈ (50%, 100%), and ee% ∈ [−50%, 50%], which indicate strong preference for the R-configuration, strong preference for the S-configuration, and moderate stereoselectivity, respectively. Remarkably, the classification accuracy of EnzyKR is about 2.5 times that of DLKcat in the whole dataset (i.e., 28 reactions, EnzyKR: 0.55 versus DLKcat: 0.21) and about 6 times in the two categories of reactions with strong preference for the R- or S-configuration (i.e., 24 reactions, EnzyKR: 0.50 versus DLKcat: 0.08). These results demonstrate the particular advantage of EnzyKR in guiding the identification of hydrolase scaffolds for resolving a racemic substrate mixture for stereoselective synthesis.
Footnote |
† Electronic supplementary information (ESI) available: The performance of the EnzyKR classifier; the benchmark of the EnzyKR classifier; the method used to obtain a substrate 3D structure; the method used to obtain the substrate–enzyme complexes; the benchmark results of EnzyKR features; the comparison between different splits of the dataset; the comparison between EnzyKR and other models; kinetic resolution predictions for various substrates; and the comparison of a kinetic resolution dataset with multiple enantiomeric excess splits (PDF). The csv file of kinetics curated from IntEnzyDB; the pdb dataset of the original docked structure complexes (ZIP). See DOI: https://doi.org/10.1039/d3sc02752j |
This journal is © The Royal Society of Chemistry 2023 |