Nian
Wu‡
*a,
Ruotian
Zhang‡
a,
Xingang
Peng
a,
Lincan
Fang
b,
Kai
Chen
c and
Joakim S.
Jestilä
b
aInstitute for Interdisciplinary Information Sciences, Tsinghua University, Beijing, China. E-mail: wunianwhu@gmail.com
bDepartment of Applied Physics, Aalto University, Espoo, Finland
cInstitute of Catalysis, Zhejiang University, Hanghzhou, China
First published on 5th February 2024
The identification of interaction between protein and ligand including binding positions and strength plays a critical role in drug discovery. Molecular docking and molecular dynamics (MD) techniques have been widely applied to predict binding positions and binding affinity. However, there are few works that describe the systematic exploration of the MD trajectory evolution in this context, potentially leaving out important information. To address the problem, we build a framework, Moira (molecular dynamics trajectory analysis), which enables automating the whole process ranging from docking, MD simulations and various analyses as well as visualizations. We utilized Moira to analyze 400 MD simulations in terms of their geometric features (root mean square deviation and protein–ligand interaction profiler) and energetics (molecular mechanics Poisson–Boltzmann surface area) for these trajectories. Finally, we demonstrate the performance of different analysis techniques in distinguishing native poses among four poses.
Among these methods, various scoring functions have been developed to estimate the binding affinity of given 3D structures.17,18 These scoring functions could be majorly classified into three types. i.e., (1) physics-based: fitting energies with a weighted sum of several energy terms based on force field parameters. (2) Empirical: counting the number of interactions that could be defined based on geometry criteria of bond length or angles. (3) Knowledge-based: analyzing intermolecular interactions between certain types of atoms or functional groups. However, the ability of these models to predict binding poses and affinities is still limited, which could predominantly be ascribed to three factors: molecular flexibility, binding score and computing time.19–22 In terms of the enormous chemical phase space, the geometrical optimization of ligands and proteins during the docking tends not to drastically change the initial structure due to the prohibitive cost. To some extent, molecular dynamics (MD) may be used to address the issue of flexibility via relaxing the initial structure based on Newton's laws.
Based on the MD trajectories, a series of analytical methods have been applied to evaluate the conformations in terms of residence time or binding free energy.23,24 For the former, the stability of a given conformation is normally indicated by the structural deviation over time. Parameters could range from global structure (whole complexes, proteins or ligand) to selected fragments or pairs (like atom pairs or residue pairs with hydrophobic interaction,25–27 hydrogen interaction,28 halogen bonds29). Root-mean square deviation (RMSD) is a typical characteristic capable of estimating the movement of each atom, Liu and coworkers30 found that about 94% of the native poses (experimental crystal structures) remain stable during the simulation, while only 56–62% of decoy poses performs comparable stability. The number and strength of interaction pairs28,31 could also be regarded as critical descriptors of stability. In addition, Sakano and coworkers24 discussed the impact of simulation time which revealed that MD simulations longer than 100 ns can deal with membrane proteins or highly-flexible proteins, whereas MD shorter than 5 ns is insufficient to accurately predict the stable structure of protein–protein. For the latter, MM/PBSA or MM/GBSA is one of the typical methods used to predict binding affinity by treating the solvent effect implicitly without atomistic detail. Numerous studies also proved the ability of MM/PBSA or MM/GBSA in distinguishing between the native structures from decoy poses through ranking the binding affinity of different ligands binding to one protein in specific cases.32,33
These previous papers have provided considerable insights for specific complex systems with few methods of analysis. However, studies performing comprehensive analysis based on relatively numerous trajectories is still lacking, and some corresponding knowledge is not necessarily transferable to new systems. Therefore, many issues associated with the information in trajectories remains elusive. For example, whether combining these analysis techniques outperforms the use of a single technique on the performance of distinguishing poses, whether the decoyed poses deviating from the native pose to various degrees share some similarity or distinctions, whether combining various analysis methods could contribute to the identification of native pose, as well as the impact of the simulation time of MD, whether the longer time means better performance.
To address these questions, we developed a framework named Moira (molecular dynamics trajectory analysis) by integrating molecular docking, molecular dynamics and various analysis methods with more details shown in the Methods. Moira not only enabled the automation of the whole process, but also allowed batch processing for a large number of systems. All complex samples were collected from the refined PDBbind repository due to their high quality in crystal structures. To make a trade off between computational cost and generalization for either proteins or ligands, we randomly selected 100 samples. For each complex, based on the extent of deviation of the ligand structure from those in the corresponding native structures, four conformations (c_native, c_2a, c_5a, c_10a, representing native structures, while ligand conformations are represented by the RMSD values, which are close to 2 Å, 5 Å, 10 Å relative to the native structures) were considered. To summarize, we calculated 400 MD trajectories for these 100 complexes with four initial conformations of each ligand. Subsequently, multiple perspectives analysis methods (geometric and energetic) involving RMSD, protein–ligand interaction profiler (PLIP), MM/PBSA were carried out to analyze these trajectories. Finally, we evaluated these techniques for their performance in distinguishing native conformations from four initial conformations.
Further, to compare the performance of various scoring strategies for ranking different poses, including both static and dynamic methods, we tested these on a randomly selected set of 100 complexes. The distribution of experimental binding affinities −logKd for these 100 complexes is illustrated in Fig. 1(b), ranging from 2 M to 11 M, where higher values indicate strong interactions between ligands and proteins. For each complex, we considered four poses (c_native, c_2a, c_5a, c_10a) as shown in Fig. 1(c), with increasing deviation from the native poses with RMSD near 0 Å, 2 Å, 5 Å, 10 Å. When ranking c_native or c_2a poses as the stablest structures out of the aforementioned four poses using the classifical scoring function in AutoDock Vina (Fig. 1(d)), the probability of failure could reach up to 30%, thus demonstrating the limited accuracy of AutoDock Vina on the evaluation of static poses.
There are significant conformational variations among the ligands with the four initial poses. For the complex with pdbid 4xtx, the RMSD values relative to the initial structures for the c_native and c_2a initial structures remain consistently low at less than 2 Å, with minimal fluctuations throughout the simulation. In contrast, the RMSD values for the c_5a and c_10a pose initial structures are not only noticeably larger, but also exhibit substantial fluctuations, particularly for the c_10a pose after 8 ns. In contrast, the RMSD rankings of four initial poses for the complex of pdbid 5fov display different patterns. The c_2a pose exhibits the highest RMSD value with the largest fluctuation, followed by the c_10a pose. The c_5a pose and the c_native pose consistently remain the lowest RMSD values over the whole simulation period. Overall, while there are no definitive universal trends regarding the timing of sharp changes and rankings during the RMSD evolution of the four poses for both complexes, the results suggest that higher RMSD values are typically associated with larger fluctuations. Conversely, the RMSD evolution of the protein (shown in Fig. S1 and S2, ESI†) demonstrates a similar pattern over time for the four poses, with minimal deviations in conformation. Given the high similarity and minimal changes in protein conformation over time across the four poses, we will focus our discussion on the results pertaining to the ligand in the following section of the RMSD analysis.
Firstly, we extracted 2500 snapshots during the 25 ns simulations at 100 ps intervals for each trajectory. This was done for all 100 complexes with the four initial poses, where the RMSD describes each respective snapshot in relation to the initial pose.
To provide a qualitative interpretation of the overall conformational changes in the 100 complexes, we depict the distributions of the averaged RMSD values from the last 5 ns of the simulations for the four initial poses in Fig. S3 (ESI†). The analysis revealed that the mean RMSD for the complexes was the lowest, less than 1 Å, with the c_native initial pose for the MD simulation. This was closely followed by pose c_2a, outperforming the mean values associated with c_5a and c_10a. Furthermore, the boxplots representing each initial pose show how lower mean values are accompanied by less scattered distributions. These results indicate that the c_native pose exhibits a lower RMSD compared to the other three initial poses across the majority of the complexes.
To further evaluate the structural fluctuations in 100 complexes quantitatively, a comparative analysis was done on the average RMSD values during the last 5 ns of the trajectories, referred to initial structures for four initial poses. In Fig. 2(e)–(g), each point represents a comparison between trajectories from the same ligand but with varying initial structures. For clarity, we grouped two pairs together, distinguished by different colors (dark turquoise and plum). The horizontal axis represents the RMSD values from the pose before underscore, while the vertical axis indicates the RMSD values from the pose after underscore. For example, in Fig. 2(e), the dark turquoise points “native_c_10a: 83/17” represent a comparison between the RMSD of trajectories from c_native poses (horizontal axis) and c_10a poses (vertical axis) as initial poses. The value “83/17” signifies that 83 complexes had a higher RMSD value when c_10a was the initial pose in comparison to the native initial pose. The findings imply that, for the majority of 100 complexes, the c_native poses exhibit greater stability than the c_10a poses. Similarly, the distribution of the plum colored points in Fig. 2(e) reveals that the majority of points lie above the diagonal line, indicating that the c_native pose is generally more stable than the c_5a pose, with approximately 90% of complexes following this trend. However, a few points (10%) show the opposite trend.
In Fig. 2(f), the c_2a poses exhibit a similar pattern as the c_native poses when compared to the c_5a and c_10a poses. This observation is further supported by the relatively even distribution near the diagonal line in Fig. 2(g). In contrast, the distribution of points representing the c_10a versus c_5a poses does not have any apparent bias (dark turquoise points in Fig. 2(g)).
Subsequently, we accounted for temporal factors by dividing the 2500 snapshots into 12 parts with 2 ns increments, and calculating the averaged RMSD for these intervals. Fig. 2(h) displays the RMSD evolution for all 100 complexes across four poses (native: blue, c_2a: sky blue, c_5a: salmon, c_10a: pink), where the full line indicates the average RMSD values, while the shaded section represents the corresponding standard deviation. The figure reveals slightly increasing mean RMSD values for all four poses over time, with the value at the starting point consistently higher than that at the ending points, despite minor fluctuations. Furthermore, similar to patterns observed in the boxplot analysis (Fig. S3, ESI†), the c_native initial poses consistently exhibit higher stability compared to c_2a, c_5a and c_10a during each period across the 100 complexes, having the lowest mean RMSD and standard deviation values. Additionally, the c_native and c_2a poses display more similar trends, while c_5a is seen to be closer to c_10a in terms of trends. The results illustrate that partitioning the MD simulations into given periods leads to similar tendencies in determining the stability of conformations across the four poses in general, despite sample specific deviations.
Algorithm 1 Counting the occurrence frequency of interaction pairs detected by PLIP |
---|
Input: S = snapshots I = interactions P = pairs |
Output: C = count |
Do PLIP for all snapshots, find all pairs and all interactions |
Define count = np.zeros((len(P),len(I)) |
Forj = 1, 2,…, len(S); j = 1, 2,…,len(P); k = 1, 2,…,len(I) |
If I[k] exists in P[j]; count[j,k] = 1 Elsecount[j, k] = 0 |
Return C = count/len(S) |
To analyze the dynamic evolution of the interaction patterns, PLIP was employed to assess the 2500 snapshots from each trajectory, and subsequently these were compared across the four poses. Unlike interactions in the static simulations, certain interaction patterns may appear or disappear during the process. Consequently, we calculated the occurrence probability of each interaction pattern over the 25 ns of simulation time. In this analysis, we considered the protein residue–ligand pair only. Note that the entire ligand was treated as a single unit, although the software may also be used to gain information on specific atom pairs. In cases where multiple atoms within a residue could interact with the ligand, the occurrence probability may exceed 1.0.
Furthermore, we examined the probability of interaction pair occurrence within the seven interaction groups for 4xtx (Fig. 3(b)) and 5fov (Fig. 3(d)) over the 25 ns of MD simulation time (blue, sky blue, salmon, and pink representing native, c_2a, c_5a, and c_10a, respectively). In the figures, the horizontal axis represents the residue numbers involved in ligand interactions. In Fig. S2 (ESI†), all seven interaction patterns, including hydrophobic interactions, hydrogen bonds, and water bridges (as depicted in the cartoon schematic Fig. 3(c)), exhibit noticeable frequencies compared to the other patterns. In the 4xtx case, the c_native pose demonstrates both a higher number of interactions, as well as their corresponding probabilities, surpassing the other three initial poses. This is particularly prominent for hydrogen bonds and salt bridges. Meanwhile, the trend is less pronounced for the 5fov complex.
To ensure a sufficiently large sample size, we examined the probabilities of all interaction pairs for the four initial poses across the 100 complexes. As the residue–atom pairs involved in interactions may completely be different, we opted to compare the probabilities among all types of interactions, independent of the residue–atom pairs. Furthermore, the interactions giving rise to the largest occurrence probabilities tend to significantly influence the overall interaction dynamics, representing the dominant interactions for the complex in question. Therefore, further analysis is mostly based on the interaction modes with the largest occurrence probability (top one), while the results based on top two and top five are provided in Fig. S7 and S8 (ESI†).
Fig. 4 illustrates the differences between the largest occurrence probabilities of specific interaction types in two poses (pose1 and pose2). A positive value indicates that the most stable interaction pair in pose1 has a higher probability than in pose2, while a negative value suggests the opposite.
As was the case for the number of interactions, hydrophobic interactions, hydrogen bonds, and water bridges are also seen to far exceed other types of interactions (pi–stacking, pi–cation interactions, salt bridges, halogen bonds). These three interaction patterns emerge as the most prominent driving forces in ligand–protein binding. However, due to the limited number of cases, the latter four interaction patterns cannot be reliably discussed based on the results depicted in Fig. S9 (ESI†).
As shown in Fig. 5(a), the stablest hydrophobic interaction pairs in the MD trajectory for the c_native case tend to have higher occurrence probabilities in comparison to the other three initial poses for the majority of the 100 complexes. Furthermore, the positive/negative value ratios increase in the following order: c_2a, c_5a, and c_10a. The highest ratio, with a value of 75:
20, is observed when comparing the c_native and the c_10a. Meanwhile, the ratio reduces to 59
:
36 for the c_5a and c_10a case.
Hydrogen bond interactions are dominant contributors to the electrostatic interactions. Fig. 5(b) exhibits similar patterns as hydrophobic interactions, albeit with some differences. The gap between the c_native pose and the c_2a pose narrows down to 48:
50, as well as the gap between the c_5a pose and the c_10a pose with a ratio of 54
:
44. Moreover, the gaps between the c_native pose (or c_2a pose) and the c_5a pose (or c_10a pose) are larger, particularly for the second stablest hydrogen interaction pair, as shown in Fig. S9 (ESI†).
The hydrogen bond network between the ligand and the protein facilitated by water molecules plays a significant role in the overall structural stability. In contrast to the aforementioned interaction patterns, the water bridge interactions (Fig. 5(c)) exhibit different patterns. There is no clear gap observed in any of the comparisons among the four poses, and no explicit trend indicating which initial pose is more likely to possess the water bridge interaction pair with the largest occurrence probability among complexes.
First, we started by calculating the MM/PBSA binding affinities for the four initial structures (Fig. 5(a)–(f), lime green points). Similar to the RMSD values, the MM/PBSA binding affinities for the native structure are significantly lower than those for the other three poses in most complexes, followed by the c_2a pose. However, no clear trend was observed for c_5a and c_10a. Therefore, MM/PBSA can often distinguish the native structure from the other poses in most complexes, despite the positive values indicating an imperfect match between the ligand and protein in static snapshots, which could mainly be ascribed to different force fields being used for the docking method and for MM/PBSA. Dll values turned out to be negative under the specific force fields over times during molecular dynamics. In comparison, the average MM/PBSA values from the dynamic trajectories (violet points in Fig. 5(a)–(f)) are substantially lower than those from the static snapshots, indicating that the structural relaxation during molecular dynamics provide more reasonable structures. The results from the dynamic analysis highlight the differences between the c_2a and c_5a poses, as well as those between the c_2a and c_10a, boosting the number of positive samples from 68 to 84 and 80, respectively. Moreover, the structural relaxation leads to a much narrower gap between c_native and c_2a poses, consistent with the drastic increase in Pearson coefficient from 0.054 to 0.83 as shown in Fig. 6l. In contrast, relaxation increases the difference between c_5a and c_10a poses from 49:
51 to 62
:
38. This difference could result from the fact that both c_5a and c_10a are metastable structures on the potential energy surface, which is insufficient to unambiguously determine the energetically favored pose. Additionally, it is reasonable to assume that two initial conformations differing remarkably in geometry will either evolve towards similar or different structures, thus leading to the energetic disparity. Next, we analyzed the interactions in terms of the decomposed energy (ELE, VDW, PBSUR, PBSOL, detailed explanations for these abbreviations is available in mm_pbsa.pl). Fig. 5(g)–(k) reveals that the dominant contributions for the differences come from ELE and VDW. Hence, the non-covalent interaction in vacuum can be inferred from ELE and VDW decomposed from the MM/PBSA binding affinities.
Subsequently, considering dynamic factors, we calculated the MM/PBSA binding affinities of 2500 snapshots from each of the 25 ns trajectories at 100 ps intervals. Fig. 5l shows the Pearson correlation coefficients for the binding affinities of different poses in the initial state and the averaged MM/PBSA binding affinities for the last 5 ns of the simulation, as well as the experimental values. The MM/PBSA binding affinity scores for the static poses generally exhibits a weaker correlation with the experimental values than the dynamic pose MM/PBSA binding affinities. For the static states, the highest Pearson correlation of 0.15 is observed with the c_2a, and similar for the corresponding dynamic poses, the highest Pearson correlation of 0.53 is achieved with the c_2a poses, closely followed by 0.45 from the c_native poses. These correlations far exceed the values obtained from the c_5a and c_10a poses, which are less than 0.25.
Furthermore, to investigate the evolution of protein and ligand flexibility over time, we divided the 2500 snapshots into 12 intervals of 2 ns each and calculated the averaged MM/PBSA scores during these periods. In similar fashion, we analyzed the trajectories of the 4xtx and 5fov complexes to observe the changes in MM/PBSA scores over time. From the 4xtx trajectories (Fig. 6(a)), the c_native (blue) and c_2a poses (sky blue) exhibit higher binding affinities compared to the c_5a (salmon) and c_10a poses (pink). In contrast, from the 5fov trajectories (Fig. 6(b)), the binding affinities in c_5a trajectories are stronger than those in the other three initial poses. These results indicate that even with an analysis employing dynamic trajectories, MM/PBSA does not always accurately distinguish the correct binding pose. Surprisingly, when considering all 100 complexes (Fig. 6(c)), the c_2a initial pose consistently outperform the other three initial poses, including the c_native pose in terms of Pearson correlation between MM/PBSA and experimental values (Fig. 6(d)). Although the difference between the c_native pose and c_2a pose is small, they both significantly outperform the c_5a and c_10a poses. Additionally, there is little fluctuation in the correlation values during the simulations. This is also reflected in the number of complexes where the averaged MM/PBSA binding affinity of pose1 is higher than that of pose2 among the four initial poses. These findings suggest that 25 ns of simulation time is not always sufficient to estimate binding affinities to experimental accuracy.
From a geometric point of view, we analyzed the RMSD relative to the initial structure of the molecular dynamics simulation, as well as the decomposed interactions using PLIP. The c_native poses consistently exhibited higher stability compared to c_5a and c_10a. The average RMSD values of all 100 complexes gradually increased over time for all initial poses, indicating a wider range of configurations deviating from the initial structures being visited, aligning with the canonical distribution.
Regarding the energetic aspect, the remarkably high Pearson coefficient (0.86) between c_native and c_2a poses in terms of MM/PBSA binding affinities indicates that initial structures deviating slightly from one another tend to exhibit similar molecular dynamics trajectories within the accessible simulation time. Conversely, the low correlation between the c_native and the c_5a and c_10a poses illustrate the difficulty of initial structures with significant deviations in overcoming high energy barriers caused by factors like torsion angles. Moreover, in terms of binding affinity correlations, the c_2a initial conformations exhibits a significant energetic advantage over the c_5a and c_10a poses, even slightly outperforming the c_native poses.
Furthermore, simulation time shows similar patterns in the rankings of the four poses in terms of both RMSD and MM/PBSA binding affinities, although longer simulation times did not necessarily improve the results. Overall, the combination of these techniques improves the accuracy of identification of native poses compared to relying solely on single static poses.
In conclusion, the Moira workflow provides a comprehensive paradigm for the exploration of ligand-protein interactions based on classical molecular dynamics, including both the simulations and their analysis. Thereby, the workflow facilitates convenient investigation of ligand–protein interactions on a general basis, important for the further advancement of drug design methodologies.
As shown in Fig. 1, docked protein–ligand structures were first generated using RDKit to provide the initial 10 ligand conformations. Then, by using Vina to make the docked structures while keeping the best 10 of each initial ligand conformation, it demonstrated that these conformations are at least reasonable as docked ligand structures. Secondly, taking the deviation from native structures into consideration, c_2a (usually regarded as the native structure with trivial deviation), c_5a (slight deviation), c_10a (drastic deviation) structures from molecular docking plus native structure were selected as the initial structures for molecular dynamics runs to generate the 25 ns trajectories, which were split into 2 fs intervals. Finally, the resulting trajectories were comprehensively analysed.
Thus, we randomly selected 100 compounds (shown in Table S1, ESI†) from the high-quality set in the pdbbind-CN database, which were subsequently filtered on the basis of the six aspects described above to avoid multiple chains, metal ions and non-standard amino acids in our complexes. We thereby simplified our analysis by reducing the influence from the aforementioned confounding factors.
For the receptor, the protein pdb file was pre-processed to remove solvent molecules, multiple chains and metal ions using PyMOL. For the ligand, it is known that the number of possible conformers is exponentially correlated to the number of rotatable bonds. To obtain multiple highly different conformers, we generated 3D conformations from SMILES.
Step 1: convert ligand mol2 file to SMILES with allBondsExplicit and allHsExplicit true as well as sanitize False based on RDKit package.
Step 2: employ AllChem.EmbedMultipleConfs module to generate 10 conformations with pruneRmsThresh of 1 and maxAttempts of 500 from SMILES, which will be taken as initial conformers in the format of pdb file for docking.
Step 3: convert the receptor and ligand pdb files to the pdbqt format required by Autodock Vina using the MGLTools program of the AutoDockTools package. All rotatable bonds are allowed to relax during docking, while receptor atoms were all kept rigid.
Step 4: define the docking parameters. The geometric center of the native ligand atoms was defined as the center grid of the docking cubic box, and the box grid was set as 60 Å × 60 Å × 60 Å with 0.375 Å per grid point. The 10 best scoring poses were retained each time. The maximum allowed score difference between the lowest and the highest scores was set to 10 during docking, and default values were used for other parameters.
In consideration of some sophisticated situations at different stages, we removed complexes which failed to generate conformers by RDKit, or those that failed during docking or RMSD calculation in subsequent steps. In the end, 13424 complexes passed successfully through all of the steps in the procedure, generating 10
44
819 conformers in total. The isoRMSD.py script was used to calculate the RMSDs between the docked and the native ligand poses without translation and rotation after aligning the protein. Due to the experimental uncertainty of the hydrogen positions, only heavy atoms are considered. The docking and native ligand RMSD values range from 0 to 20 Å, with a spread approximating a normal distribution.
For the simulations, there are four stages: (1) minimization: a total of 1000 steps of minimization, the box was initially minimized by 500 steepest descent steps, followed by 500 conjugate gradient steps. (2) Heating: the box was then gradually heated up to 300 K from 0 K and was relaxed within 50 ps by a simulation in the NVT ensemble (constant number of atoms, volume, and temperature). (3) Equilibrium: the system was further relaxed in an NPT ensemble (constant number of atoms, pressure, and temperature) simulation of 50 ps. For the first three stages above, weak restrains with 2.0 kcal mol−1 Å−2 was assigned to all heavy atoms of the complex for first three stages. (4) Production: the molecular dynamics simulation of 25.5 ns was carried out under NPT ensemble with all atoms relaxed. Due to the expensive computational demand, only one single MD runs were performed for each docking pose, despite the influence of the initial velocities assigned in the simulations.40,41
ΔGbind = ΔH − TΔS = ΔEMM + ΔGsol − TΔS |
ΔEMM = ΔEinternal + ΔEelectrostatic − ΔEvdw |
ΔGsol = ΔGPB + ΔGSA |
ΔGSA = γ × SASA + b |
ΔGbind = ΔGcomplex − ΔGprotein − ΔGligand |
Footnotes |
† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d3cp03492e |
‡ These authors contributed equally. |
This journal is © the Owner Societies 2024 |