Felix
Brandt
and
Christoph R.
Jacob
*
Technische Universität Braunschweig, Institute of Physical and Theoretical Chemistry, Gaußstraße 17, 38106 Braunschweig, Germany. E-mail: c.jacob@tu-braunschweig.de
First published on 25th July 2023
The construction of a suitable QM region is the most crucial step in setting up hybrid quantum mechanics/molecular mechanics (QM/MM) simulations for enzymatic reactions. The QM region should ideally include all important amino acids residues, while being as small as possible to save computational effort. Most available methods for systematic QM region construction are based either on the distance of single amino acids to the active site or on their electrostatic effect. Such approaches might miss non-electrostatic and long-range allosteric interactions. Here, we present a proof of concept study for the application of protein network analysis to tackle this problem. Specifically, we explore the use of the protein network centralities as descriptor for QM region construction. We find that protein network centralities, in particular the betweenness centrality, can be a useful descriptor for systematic QM region construction. We show that in combination with our previously developed point charge variation analysis, they can be used to identify important residues that are missed in purely electrostatic approaches.
For simulation scientists, setting up a QM/MM calculation requires many manual choices. Most important is the selection of the QM region.11–15 A suitable QM subsystem should ideally take into account all effects which have to be investigated in the quantum-mechanical calculation, while its size should be as small as possible to minimize the computational effort.
Several different approaches for the systematic construction of suitable QM regions have been reported in the past (for a recent review, see ref. 16). In the simplest case, only catalytic or active site residues are included. This can be extended by including further residues based on their distance to the active site. However, this approach on the one hand might miss important residues and on the other hand include residues which do not have a significant effect. Therefore, systematic methods have been developed to exclusively detect high-impact residues. Commonly, these consider the electrostatic effect of single amino acids on the active site. Examples of such approaches include charge deletion analysis (CDA),17,18 charge shift analysis (CSA),14 and Fukui shift analysis (FSA).19 Point charge variation analysis (PCVA),20 recently developed by us based on uncertainty quantification of the electrostatic effect of single amino acids on the QM/MM reaction energy, also falls into this category.
However, such approaches based solely on distance criteria or the electrostatic effect might miss residues that play an important steric role or take part in covalent interactions along allosteric pathways in enzymes. This is particularly relevant if the structure of the active site changes along the reaction path, which is not captured by a purely electrostatic analysis. Hix et al. used an analysis of evolutionary conserved residues in protein families to identify such residues and to systematically construct QM regions based on this information.21 However, this approach relies on the availability of protein sequence data for families of proteins that are related to a system of interest.
Here, we present a novel approach to detect residues with a high allosteric effect based on graph theory for proteins. Specifically, we propose and explore the use of the protein network centralities as descriptor for systematic QM region construction. This can be combined with one of the aforementioned methods based on electrostatic effects. In the present work, we will focus on the comparison and combination with our previously developed PCVA approach.20,22
Graph theory can be applied to proteins to investigate long-range interactions and dynamics. A common approach is to represent the amino acids as nodes connected by edges which are weighted by the distance or other more complex descriptors such as interaction energies or correlated motion. In the literature numerous examples can be found in which network approaches are used to describe protein dynamics,23–25 for the construction of coarse-graining models of proteins,26–28 or to partition proteins into fragments for quantum-chemical subsystem methods.29–32
Here, we follow an approach that has previously been employed for finding and characterizing allosteric pathways in proteins.33–35 A graph representation is constructed using the graph generation part of the WISP (Weighted Implementation of Suboptimal Paths) script developed by Wart et al.36 The Cα atoms are chosen as nodes, which are connected by edges, to which weights (corresponding to a distance in “network space” between the nodes) are assigned according to
wij = −log|Cij| | (1) |
![]() | (2) |
That is, nodes i and j are connected by a short edge (wij → 0) if the fluctuations Δr(t) = r(t) − 〈r(t)〉 of their Cα atoms along a molecular dynamics trajectory are highly correlated (Cij → +1) or anticorrelated (Cij → −1). Conversely, a long distance is assigned to edges (wij → ∞) between nodes if their fluctuations are uncorrelated. This graph is further simplified by removing edges between residues without any atom pair distance below the contact map cutoff pcutoff = 4.5 Å as described in ref. 36.
The resulting graph can be analyzed using network analysis tools. An important concept in network analysis is the so-called centrality. In this concept, ranks are assigned to nodes within a network graph according to their position and in consequence, the centrality acts as an indicator for the importance of a node (i.e., of a single amino acid) in the network.
Three different centrality approaches will be considered in this work. The degree centrality37,38 (d) is the simplest approach and corresponds to the number of connections of a specific node. A more complex concept is called betweenness centrality37,39 (b) which quantifies how often a node is crossed on the shortest pathway between two other nodes. The third concept, the eigenvector centrality35,37 (e), quantifies the influence of a certain node by assigning a relative score which takes into account that a connection to a high-scoring node contributes more to the score than a connection to nodes with a lower score.
We applied this approach to catechol O-methyltransferase (COMT) since it is a well-understood system regarding systematic QM region construction. The graph was constructed based on a classical MD trajectory (see Computational Details) using WISP36 as described above. Based on this graph, we calculated the d-, b-, and e-centralities for each single amino acid. These are visualized in Fig. 1, where a representation of the residue–active site center-of-mass distances is included for comparison.
![]() | ||
Fig. 1 Visual comparison of residue–active site center-of-mass distances and of the degree, betweenness, and eigenvector centralities for catechol O-methyltransferase (COMT). |
The simple d-centrality does not follow a clear trend regarding the correlation between residue position and assigned centrality value. The b- and e-centrality approaches assign low centrality values to surface residues, which matches the expected behavior since the corresponding nodes rarely appear in paths along the graph and are not able to form connections to other high-scoring nodes. Overall, the centrality value distribution for b-centrality reasonably matches the distance distribution, which still is an important measure in QM region construction. This indicates that betweenness is the most sensible approach to be used in the following.
Centrality rankings for each centrality descriptor are listed in Table S1 in the ESI† and can be compared to the indicator ranking from our single amino acid PCVA (saaPCVA) approach.20,22 The saaPCVA indicator combines the electrostatic effect of each amino acid on the QM/MM reaction energy with its inverse center-of-mass distance and was used in our previous work to systematically construct QM regions.
There are clearly remarkable differences between the three centrality rankings. To pick one example, VAL170 is ranked 14th and 18th in the e- and d-centrality, respectively, but was assigned to rank 83 in the b-centrality. Based on the comparison with the PCVA ranking, b-centrality seems to be most promising as it is able to detect a few residues also ranked very high by the PCVA scheme such as ALA66 or ASP140, and it additionally assigns high ranks to residues which are expected to be quite important and which have been previously detected by the CSA and FSA approaches of Kulik and co-workers14,19 such as GLU63 or TYR70.
Several residues such as MET75 or MET136 ranked very high by b-centrality can be found in the midfield of the PCVA ranking at rank 79 and 81, respectively. Additionally, their low scores in CSA and FSA (MET136 is not considered in CSA) indicate a small electrostatic effect. Nevertheless, both methionine side chains are positioned towards the active site and especially MET75 is proximal to SER71, which was considered to have an important effect on SAM.19 Since the network analysis approach aims at identifying additional residues which do not primarily apply a high electrostatic effect, but are crucial in allosteric interactions, we assume these amino acids to be interesting to be included in the QM region.
To further verify these observations, we constructed atom-economical 16-residue QM regions based on the three centrality rankings by including the corresponding highest-ranked residues and performed QM/MM calculations analogous to the previous assessment of the atom-economical PCVA-constructed QM regions.20,22 Since d-centrality often assigns the same value to more than one residue, in this case, the QM region contains 17 residues to cover all equally-ranked residues. The different QM regions are visualized at the bottom of Fig. 2, and their composition is listed in the ESI† in Table S3. Comparisons to previous PCVA results concerning QM/MM reaction energies and VDD ligand charges are shown in Fig. 2. Here, it emerges that the b-centrality performs much better than the d- or e-centrality. The reaction energy (Fig. 2B and D) using the b-centrality (about −13.5 kcal mol−1) is closer to the best estimate at −10.5 kcal mol−1 than for the d-centrality (−4 kcal mol−1) and the e-centrality (−7 kcal mol−1). Regarding VDD charges (Fig. 2A and C), the atom-economical QM region constructed based on the b-centrality is able to reproduce the best estimate charges for Mg2+ and CAT, similar to PCVA. In contrast, the QM regions based on the d- and e-centrality assign an about 0.2 higher charge for Mg2+ and CAT compared to the best estimate. Regarding the VDD charge of the SAM ligand, none of the methods is able to assign a charge near to the best estimate, as previously observed for PCVA.20 Nevertheless, using the b-centrality ranking pushes the charge from around zero to a positive charge of about 0.6, indicating the ability to detect SAM-impacting residues, even though the charge is overestimated here. Overall, the b-centrality seems to be the most promising centrality descriptor for QM region construction.
![]() | ||
Fig. 2 Ligand VDD charges and reaction energies from QM/MM calculations for COMT using QM regions of increasing size constructed using PCVA (upper part, adapted from ref. 20) and for atom-economical 16-residue QM regions constructed using PCVA as well as centrality-based descriptors (middle part). In the bottom part, the different QM regions are visualized without ligands. (A) Convergence of Mg2+ (blue), SAM (magenta) and CAT (yellow) VDD charges with increasing QM region size. (B) Convergence of the QM/MM reaction energy for the methyl transfer reaction in COMT with increasing QM region size. Best estimate results (corresponding to QM region 9′) are indicated by a solid horizontal line. (C) Ligand VDD charges for atom-economical 16-residue QM regions (indicated by dashed vertical line in A) constructed using different centrality-based descriptors. (D) QM/MM reaction energies for atom-economical 16-residue QM regions (indicated by dashed vertical line in B) constructed using different centrality-based descriptors. |
With these results in mind, we constructed two different atom-economical, 16-residue QM regions in a hybrid PCVA/b-centrality approach. One QM region, referred to as b-half, was constructed by including the highest-ranked eight residues from PCVA and the b-centrality approach, respectively. If a residue detected by one scheme is already ranked higher in the other scheme, it is replaced by the next lower-ranked residue in this scheme. The second QM region, referred to as b-norm, is constructed by normalizing the two descriptors with respect to each other. Since the differences in the assigned PCVA indicators and b-centralities, respectively, of the highest-ranked residues are quite large, the first 5 residues of each scheme are included analogously to the b-half scheme, and all other descriptors for rank 6 and lower are normalized with respect to the rank 5 value. The six remaining places in the 16-residue QM region are then awarded to the six residues with the highest normalized value of any of the two descriptors.
These two QM regions are again compared to the best estimate with respect to VDD ligand charges and the QM/MM reaction energy (see Fig. 2C and D). Regarding VDD charges, b-half and b-norm perform similar to PCVA for Mg2+ and CAT. In contrast to PCVA, both b-half and b-norm are able to predict a positive ligand charge for SAM, where b-norm still overestimates the charge while b-half underestimates it. Nevertheless, for b-half the SAM charge of about 0.1 is closer to the best estimate than for all PCVA or centrality-based approaches considered so far. Concerning the reaction energy, b-norm, delivers an energy of about −20 kcal mol−1 which is close to the PCVA but about 10 kcal mol−1 below the best estimate. Here, the b-half approach outperforms all other methods with −9.5 kcal mol−1, nearly perfectly reproducing the best estimate.
These results give first hints that protein network centralities, in particular the betweenness centrality, can be a useful descriptor for the systematic construction of QM regions. Besides residues that have a large electrostatic effect (and which are also detected by other schemes such as PCVA), the b-centrality ranking detects specific residues that have large non-electrostatic effects. As shown above, the inclusion of such residues in systematically constructed atom-economical QM regions can improve the agreement with our best estimate, both for ligand charges and for reaction energies.
To further ensure the applicability of the centrality-based and hybrid methods, we additionally applied the identical procedure to the triosephosphate isomerase (TIM), which was also subject to our previous investigations.22 These results are shown and discussed in Section S3 of the ESI† and confirm the general trends observed for COMT, even though the agreement for the QM/MM calculations using a 16-residue, atom-economical QM region with the best estimate is worse.
Overall, we have shown that protein network centralities, in particular the betweenness centrality, are a promising descriptor for the systematic construction of QM regions. The b-centrality allows one to identify residues that have important non-electrostatic or allosteric effects on the active site, and that are thus missed by descriptors that only consider the direct, electrostatic effect, such as our previously developed PCVA scheme. Therefore, we propose to augment QM regions constructed using PCVA with residues that have a high b-centrality, for which two simple schemes have been proposed.
In future work, we plan to further refine these hybrid schemes. To this end, it will be crucial to further rationalize the usefulness of protein network centralities as descriptor for QM region construction by systematically analyzing the electrostatic and non-electrostatic effects that contribute to the error in QM/MM reaction energies.
More generally, network analysis offers a promising framework for systematic QM region construction. Here, we employed the construction of graphs based on the correlated motion of the amino acid residues, which allows for the identification of allosteric pathways. Other schemes for the construction of graphs, e.g. based on electrostatic interactions, could be considered and subjected to the tools of network analysis to minimize the error introduced by QM/MM schemes. Work along these lines is currently in progress in our group.
The analysis of results was achieved with Python. Network analysis and centrality calculations were also performed in Python using the WISP script36 for correlation matrix and contact map construction and the networkx module47,48 for calculating d-, b- and e-centralities. Plots were generated with Matplotlib49,50 and structures were visualized using Visual Molecular Dynamics (VMD).51
QM regions using PCVA were constructed as shown in our previous work (using the standard variant).20,22 Residues included in the differently-sized QM regions for each system are listed in the ESI† (see Section S2) with the corresponding QM region charge and the number of atoms and link atoms.
All QM/MM calculations used the minimized and equilibrated initial structures (see above) obtained in our previous work,20,22 from which geometry optimizations were performed for reactants and products. All QM/MM calculations were performed using the Amsterdam Modeling Suite (AMS Version 2020.203),52 using PBE53 and a DZ and a TZP Slater-type orbital basis set54 for geometry optimizations and single point calculations, respectively. For the MM region the ForceField engine of AMS was used with the AMBER95 force field,55 which was extended by parameters for the substrates. All charges are calculated from the Voronoi deformation density (VDD)56 of the reactant structures only. Further details are given in the ESI† (see Section S1).
Footnote |
† Electronic supplementary information (ESI) available: Extended computational details, additional results for triosephosphate isomerase (TIM), supplementary tables showing descriptor rankings as well as the compositions of the different considered QM regions. See DOI: https://doi.org/10.1039/d3cp02713a |
This journal is © the Owner Societies 2023 |