Zhao Qin‡a, Qingyi Yubcd and Markus J. Buehler*a
aLaboratory for Atomistic and Molecular Mechanics (LAMM), Department of Civil and Environmental Engineering, Massachusetts Institute of Technology, 77 Massachusetts Ave., Cambridge, Massachusetts 02139, USA. E-mail: mbuehler@mit.edu; Tel: +1 617 452 2750
bDerek Bok Center for Teaching and Learning, Harvard University, 125 Mount Auburn Street, Cambridge, MA 02138, USA
cDepartment of Educational Psychology, Counseling and Special Education, State University of New York at Oneonta, 108 Ravine Pkwy, Oneonta, NY 13820, USA
dBarnes Center at The Arch, Syracuse University, 150 Sims Drive, Syracuse, NY 13244, USA
First published on 27th April 2020
Natural vibrations and resonances are intrinsic features of protein structures and enable one structure to be differentiated from another. These nanoscale features help in understanding the dynamics of a protein molecule and in identifying the effects of small sequence or other geometric alterations that may not cause significant visible structural changes, such as point mutations associated with disease or drug design. Although normal mode analysis provides a powerful way to accurately extract the natural frequencies of a protein, it requires several critical conditions to be met, including the availability of high-resolution structures, good chemical force fields and memory-intensive large-scale computing resources. Here, we study the natural frequencies of over 100000 known protein molecular structures from the Protein Data Bank and use this dataset to carefully investigate the correlation between their structural features and these natural frequencies by using a machine learning model, a feedforward neural network with four hidden layers, that predicts the natural frequencies in excellent agreement with full-atomistic normal mode calculations but is significantly more computationally efficient. Beyond this computational advance, we demonstrate that the model can predict the natural frequencies directly from only five structural features of a protein molecule as predictor variables: the largest and smallest diameters, and the ratios of amino acid residues in alpha-helix, beta-strand and 3–10 helix domains. These structural features can be obtained either experimentally or computationally and do not require a full-atomistic model of the protein of interest. This method is helpful for predicting the absorption and resonance functions of an unknown protein molecule without solving its full atomic structure.
The recent evolution of experimental techniques has made molecular-scale, high-temporal-resolution imaging feasible,13 making it increasingly possible to monitor the dynamical behavior of nanostructures. However, to date, the most advanced imaging techniques provide atomic and microsecond resolution, which can only capture dynamic behavior up to 10^6 Hz (equivalent to 0.00003 cm−1), five orders of magnitude below the frequency of the first vibrational mode of most protein structures.14 One way to obtain the vibrational modes of a protein through computational analysis is normal mode analysis (NMA),8,14 a computationally intensive process that requires the atomistic structure of the protein and a force field that defines all the interatomic interactions.
The key steps in this process include building the Hessian matrix, a 3N × 3N square matrix composed of the second-order partial derivatives of the potential energy function with respect to the atomic coordinates in all three dimensions, and computing the eigenvectors and eigenvalues of this matrix. For large protein structures, the time and memory requirements of NMA can be substantial. For example, a membrane protein with more than 1000 amino acids requires more than 3.6 GB of memory simply to load the Hessian matrix, making the computation even more difficult for larger protein structures. Moreover, the method can only be applied to proteins with a known high-resolution 3D structure, which must be obtained either from experiments (e.g. Nuclear Magnetic Resonance, X-ray Diffraction or Cryo-electron microscopy) or from a computational simulation of protein folding.15,16 Neither of these strategies is cost- or time-efficient, and both often require significant computational resources.
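To make the scaling concrete, the memory footprint of the Hessian and the eigen-decomposition step can be sketched as follows; this is a toy illustration with a random symmetric stand-in matrix, not the authors' CHARMM-based workflow, and the atom count is an assumed example.

```python
# Toy sketch of the NMA bottleneck: Hessian memory grows as (3N)^2 and the
# normal modes are eigenpairs of the (mass-weighted) Hessian.  The atom count
# and the random matrix below are purely illustrative.
import numpy as np
from scipy.linalg import eigh

n_atoms = 7000                     # assumed size (~1000 residues, heavy atoms only)
dof = 3 * n_atoms
print(f"Hessian storage: {dof**2 * 8 / 1e9:.1f} GB as float64")   # ~3.5 GB

# Eigen-decomposition demonstrated on a small symmetric stand-in matrix; for a
# real mass-weighted Hessian the frequencies follow from the square roots of
# the (non-negative) eigenvalues.
rng = np.random.default_rng(0)
H = rng.standard_normal((30, 30))
H = 0.5 * (H + H.T)
eigvals, eigvecs = eigh(H)
```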
Artificial intelligence (AI), enabled by machine learning (ML) techniques, has demonstrated its advantages in solving sophisticated scientific problems that involve multiple physics-based interactions that are hard to model directly, or non-polynomial problems that demand computational power far beyond what brute-force methods can supply.17 It provides a novel and feasible way of solving such problems by using efficient algorithms to search a high-dimensional parameter space for optimal solutions. In several recent materials-focused studies, such as data-driven modeling to optimize the mechanical properties of materials, it has played a unique and powerful role.18–21 Moreover, it has shown a breakthrough capability to optimize the multiscale and multiparadigm architecture of materials for high sensitivity to environmental factors, as needed in sensors, electronics and multi-purpose material applications.22
Here we use a dataset of the structural features and the first 64 normal modes of more than 100000 protein molecular structures to train and test a machine-learning model that is capable of predicting the frequency spectrum of any protein. This computational model allows a protein's frequency spectrum to be obtained directly, without knowing its high-resolution 3D structure or solving for its eigenvalues.
We hypothesize that the frequency spectrum, which is defined by the chemistry and the folded structure of a protein, provides a unique fingerprint that distinguishes one protein from another. Such a feature has long been exploited in infrared and Raman spectroscopy to identify small chemical groups. Combining these experiments with a library of protein frequency spectra may make it possible to identify a protein, but only if it has a high-resolution structure deposited in the Protein Data Bank. However, the Protein Data Bank includes only a tiny portion of proteins; most proteins are known only by their sequence and function (such as the 147413762 protein sequences given at https://www.uniprot.org25), not by a high-resolution structure.
Indeed, identifying the complex structure of a protein is difficult; it requires advanced tools such as Nuclear Magnetic Resonance, X-ray Diffraction or Cryo-electron microscopy, as well as protein crystal samples. Hence, ML techniques provide a great opportunity to obtain the frequency spectrum of a protein without experimentally solving its folded 3D structure. To achieve this, we develop an ML model and train it on a randomly selected portion (80%) of the NMA results, namely the first 64 natural frequencies of all the protein structures in the Protein Data Bank.23 This training set is combined with a table that summarizes 10 structural features of each protein molecule: the largest diameter (Dmax, measured as the largest distance between two infinite parallel sheets that clamp the molecule), the smallest diameter (Dmin, measured as the smallest distance between two infinite parallel sheets that clamp the molecule) and the ratios of all the secondary structures given by DSSP (pH: content of alpha-helix; pB: content of residues in isolated beta-bridges; pE: content of beta-strand; pG: content of 3–10 helix; pI: content of pi-helix; pT: content of turn; pS: content of bend; and p∼: content of unstructured parts).
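As a rough illustration of how the secondary-structure ratios could be collected from a deposited structure, a Biopython-based sketch is shown below; the file name is a placeholder, a local DSSP executable is required, and this is not necessarily the exact pipeline used to build the database.

```python
# Hedged sketch: per-residue DSSP codes -> secondary-structure ratios
# (pH, pB, pE, pG, pI, pT, pS, p~).  "protein.pdb" is a placeholder file.
from collections import Counter
from Bio.PDB import PDBParser
from Bio.PDB.DSSP import DSSP

structure = PDBParser(QUIET=True).get_structure("prot", "protein.pdb")
model = structure[0]
dssp = DSSP(model, "protein.pdb")               # calls the external DSSP program

codes = [dssp[key][2] for key in dssp.keys()]   # one-letter DSSP code per residue
counts = Counter(codes)
n_res = len(codes)
ratios = {c: counts.get(c, 0) / n_res for c in "HBEGITS-"}
print(ratios)                                   # '-' corresponds to p~ (unstructured)
```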
We train the ML model to predict the natural frequencies from these 10 structural features. For the sake of simplicity, we introduce a fitting function for the first 64 natural frequencies fi (f0…f63), given as
$f_{i,\mathrm{fit}} = b\, i^{M}$ (1)
with the standard error of the fit defined as

$\sigma_{\mathrm{SE}} = \sqrt{\dfrac{1}{64}\sum_{i=0}^{63}\left(f_{i,\mathrm{fit}} - f_i\right)^2}$ (2)
It has been shown that the 64 frequencies of each protein structure can be fitted by eqn (1) to obtain a [b, M] pair for that structure, and that for over 99% of the protein structures the frequencies are reproduced with a small standard error σSE < 1 cm−1.23
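A minimal sketch of this fitting step is given below, using a log-log linear regression to extract [b, M] for a single protein; the frequency values are made-up numbers and the mode index is assumed to start at 1 to avoid the singular point of the power law.

```python
# Fit eqn (1), f_i = b * i^M, and evaluate the standard error of eqn (2).
import numpy as np

f_nma = np.array([2.1, 2.9, 3.4, 3.9, 4.3, 4.8, 5.1, 5.5])  # hypothetical frequencies, cm^-1
i = np.arange(1, len(f_nma) + 1)

M, log_b = np.polyfit(np.log(i), np.log(f_nma), 1)           # slope = M, intercept = ln(b)
b = np.exp(log_b)

f_fit = b * i**M
sigma_se = np.sqrt(np.mean((f_fit - f_nma) ** 2))
print(f"b = {b:.2f} cm^-1, M = {M:.2f}, sigma_SE = {sigma_se:.3f} cm^-1")
```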
By using eqn (1), we train the ML model to predict the values of b and M instead of predicting each of the 64 natural frequencies individually. The ML model is implemented in TensorFlow in Python.26,27 We start by using the 10 structural features as predictor variables and the b and M values as the target variables. We use a Feedforward Neural Network (FNN) with the Rectified Linear Unit (ReLU) activation function to read in the training data (80% randomly selected data, ∼88000 records) with a batch size of 10000 records, and develop this data-driven model by minimizing the standard error between the predicted and measured [b, M] values using an Adam optimizer.28,29 This FNN model is realized via four hidden layers with 40, 20, 10 and 5 neurons, respectively, and a final output layer that evaluates the outcome, as shown in Fig. 1B. We use the remaining 20% of the data for validation. We run the training for 100000 epochs, each of which represents one complete presentation of the training data to the machine, to ensure that the optimization shows no further improvement. We tried increasing the depth to 10 layers and the width of each layer to 60 neurons and found that neither improves the standard errors for training and validation, so we keep this ML architecture for our study.
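The architecture described above can be written compactly with the Keras API; the sketch below mirrors the stated layer widths, activation, optimizer and batch size, while the learning rate and the mean-square-error loss are assumptions, since they are not specified in the text.

```python
# Sketch of the FNN: 10 features -> hidden layers of 40, 20, 10, 5 ReLU units -> [b, M].
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(10,)),                 # Dmax, Dmin and the 8 DSSP ratios
    tf.keras.layers.Dense(40, activation="relu"),
    tf.keras.layers.Dense(20, activation="relu"),
    tf.keras.layers.Dense(10, activation="relu"),
    tf.keras.layers.Dense(5, activation="relu"),
    tf.keras.layers.Dense(2),                    # output layer: predicted [b, M]
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3), loss="mse")

# X_train, y_train, X_val, y_val are placeholders for the 80%/20% split described above.
# model.fit(X_train, y_train, batch_size=10000, epochs=100000,
#           validation_data=(X_val, y_val))
```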
As a result of the validation, we obtain the predicted [b, M] of the protein structures within the validation dataset. Using these values, we compute each of the natural frequencies as fi_ML = b·i^M and compare them with the NMA results, with the comparison summarized in Fig. 2. The comparisons between the frequencies of the first 5, 10 and 20 modes are given in Fig. 2A, C and E, respectively, with each point (fi_NMA, fi_ML) plotted in the figure. The ML prediction and the NMA data show a very strong correlation, and most data points concentrate along the line fi_NMA = fi_ML, indicating overall good predictive performance. Similar to eqn (2), the difference between the two datasets is quantified by the standard error $\sigma = \sqrt{\frac{1}{n}\sum\left(f_{i,\mathrm{ML}} - f_{i,\mathrm{NMA}}\right)^2}$, where n is the total number of data points in the plot. We find that the standard error for the first 5, 10 and 20 modes is 0.836, 0.860 and 1.256 cm−1, respectively. The standard error increases when more modes are included because the natural frequency increases for higher modes. The prediction error σ = 1.256 cm−1 is still considered relatively low because the mean value and standard deviation of the natural frequency of the 20th mode over all protein structures is 6.96 ± 3.50 cm−1. The histograms of the predicted frequencies, given in Fig. 2B, D and F for the first 5, 10 and 20 modes, respectively, show a different trend: including more modes yields statistically better agreement between prediction and measurement, suggesting that the larger prediction error arises merely from the larger frequency values of the higher modes. Moreover, it is interesting that the distribution of the frequencies is always log-normal, suggesting that the relatively low frequencies (corresponding to the peak of the log-normal distribution) correspond to more normal modes and also account for more of the mechanical resonance and energy absorption, because of the equipartition principle,24 which states that each normal mode carries equal kinetic energy in free vibration.
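The comparison described above amounts to expanding each predicted [b, M] pair back into a frequency spectrum and computing the standard error over the first few modes, roughly as in the sketch below; the arrays are placeholders for the validation data and the mode index is again assumed to start at 1.

```python
# Expand predicted [b, M] into frequencies via eqn (1) and score against NMA.
import numpy as np

def expand(b, M, n_modes=20):
    i = np.arange(1, n_modes + 1)
    return b[:, None] * i[None, :] ** M[:, None]

# Placeholder data standing in for the validation set:
b_ml = 2.0 + np.random.rand(1000)
M_ml = 0.5 * np.random.rand(1000)
f_nma = expand(b_ml, M_ml) + 0.1 * np.random.randn(1000, 20)   # stand-in "reference"

f_ml = expand(b_ml, M_ml)
for k in (5, 10, 20):
    sigma = np.sqrt(np.mean((f_ml[:, :k] - f_nma[:, :k]) ** 2))
    print(f"sigma over the first {k} modes: {sigma:.3f} cm^-1")
```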
This result is important because it demonstrates that the ML model can directly predict the natural frequencies of a protein structure, without the need for a high-resolution 3D structure and NMA calculations or dynamic simulations. The ML model is validated against the testing data and can be applied to any protein molecule whose 3D structure is unknown but whose few structural features are known. For example, the minimum and maximum diameters can be measured by dynamic light scattering (DLS), and the secondary-structure ratios can be extracted from Fourier Transform Infrared (FTIR) spectroscopy; these experiments are much easier than solving the 3D structure of a protein. The model thus provides a convenient way of predicting the spectrum of natural frequencies of a protein. However, many of the eight secondary structures are much less clearly resolved by FTIR than they are in the full atomistic protein structures analyzed with the Dictionary of Protein Secondary Structure (DSSP),30 which is what we use for the predictor variables. For instance, helix and beta-sheet structures are usually better defined in FTIR than other coiled or random structures.31
It is useful to understand the contribution of each of the ten predictor variables and to find out which are more essential than others for prediction. Although the FNN provides a convenient way of predicting [b, M] from [Dmax, Dmin, pH, pB, pE, pT, pI, pG, pS, p∼], the relationships between each target variable and the predictor variables are highly nonlinear and difficult to quantify. We therefore take a linear model and compute the pairwise correlations among the twelve variables using all the data records, as shown in Fig. 3A. As expected, not all of the predictor variables correlate strongly with [b, M]: b correlates more with [Dmax, Dmin, pH] than with the others, while M correlates more with [Dmin, pE, pG]. Using the linear model,32 we increase the number of predictor variables nvar from 1 to 10, and for each nvar we test all possible subsets of predictor variables, i.e. C(10, nvar) tests. In total, we perform C(10,1) + C(10,2) + … + C(10,10) = 2^10 − 1 = 1023 tests. For each nvar, we select the best subsets for predicting the b (Fig. 3B and C) and M (Fig. 3D and E) values as those with the maximum coefficient of determination (rsq), which takes a value between 0 and 1 and measures the percentage of the response-variable variation that is explained by the linear model. Fig. 3B and D show that the contribution of a unit increment in nvar keeps decreasing for larger nvar in predicting b and M, respectively; in fact, for nvar > 5, further increments of nvar have only a small effect. Moreover, by running these 1023 tests, we obtain the best subsets for each nvar value, as shown in Fig. 3C and E for b and M, respectively. If only one predictor variable is available, Dmax and Dmin are the most predictive for b and M, respectively. With five predictor variables, the combination [Dmax, Dmin, pH, pE, pG] is the most predictive.
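The exhaustive subset search is straightforward to reproduce in outline: for each subset size, fit a linear model on every combination of features and keep the one with the highest coefficient of determination. The sketch below uses scikit-learn with stand-in data; it is not the authors' R implementation referenced in the ESI.

```python
# Best-subset selection over the 10 structural features (2^10 - 1 = 1023 fits).
from itertools import combinations
import numpy as np
from sklearn.linear_model import LinearRegression

features = ["Dmax", "Dmin", "pH", "pB", "pE", "pG", "pI", "pT", "pS", "p~"]
X = np.random.rand(500, 10)      # placeholder feature matrix
y = np.random.rand(500)          # placeholder target (b or M)

for n_var in range(1, 11):
    best_r2, best_subset = -np.inf, None
    for subset in combinations(range(10), n_var):
        cols = list(subset)
        r2 = LinearRegression().fit(X[:, cols], y).score(X[:, cols], y)
        if r2 > best_r2:
            best_r2, best_subset = r2, subset
    print(n_var, [features[k] for k in best_subset], round(best_r2, 3))
```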
We use the knowledge obtained from the linear model to reduce the number of predictor variables in our ML model. We test nvar from 1 to 10, selecting the types of predictor variables according to their importance as given in Fig. 3C. We use the selected predictor variables to train the ML model on the corresponding columns of the training data with the same learning architecture, and compute the mean-square error of the testing data during the optimization process, as given in Fig. 4A and B. It is shown that 2000 epochs are enough to yield an overall converged mean-square error, and that nvar > 5 does not yield a significant difference from nvar = 5, as predicted by the linear model. It is also shown that for nvar = 5 the ML-predicted [b, M] values agree well with the measured [b, M] values of the testing data, as shown in Fig. 4C and D, respectively. The high correlation coefficients (0.88 for b and 0.78 for M) and low standard errors (0.35 cm−1 for b and 0.084 for M) computed from Fig. 4C and D support the accuracy of the prediction. In addition to comparing the fitting parameters, we can expand the predicted [b, M] values back to fi_ML for each protein structure within the testing dataset and compute the standard error σ against fi_NMA, as given in Fig. 4E. It is shown that nvar = 5 has an advantage in reaching a small σ value for several different comparisons, including f1 alone, as well as f1 to f5 and f1 to f10.
For any protein sequence, recent progress in folding algorithms makes it possible to accurately predict the 3D atomic structure.33,34 Combined with classical force fields, such a structure yields a Hessian matrix that is often too big to solve for its eigenvalues by NMA. The method proposed here provides an efficient way to obtain the eigenvalues without NMA. It can thus be combined with the recent innovative method of solving for eigenvectors from eigenvalues35 to provide all the normal frequencies and normal modes of a protein. We can use this information to predict the dynamic behavior of a protein. For example, for any protein, we obtain the ith normal mode as a displacement vector {xj, yj, zj}i on each atom j, with normal frequency fi. By using the equipartition theorem, we take each of the first M modes of a protein structure at temperature T to carry an equal share of the kinetic energy, (3/2)NkBT/M, where N is the total number of atoms and kB is the Boltzmann constant. Because the normal modes are mutually orthogonal, the velocity vij contributed by mode i to atom j should satisfy
$\sum_{j=1}^{N} \frac{1}{2} m_j v_{ij}^2 = \frac{3 N k_B T}{2 M}$ (3)
Considering that the atomic velocity amplitude is related to the magnitude of the particle vibration as vij = 2πfidij, and that the magnitude of vibration of atom j in mode i is dij = Ai|{xj, yj, zj}i|, where {xj, yj, zj}i is the ith normal mode on atom j and Ai is the modal amplitude, substituting these relations into eqn (3) gives the magnitude of vibration as

$A_i = \dfrac{1}{2\pi f_i}\sqrt{\dfrac{3 N k_B T}{M \sum_{j=1}^{N} m_j \left(x_j^2 + y_j^2 + z_j^2\right)_i}}$ (4)
This equation can be used to calculate the thermal vibration of the protein structure without running molecular dynamics simulations, with the coordinates of each atom given as a function of time t by superposing the modes:

$\mathbf{r}_j(t) = \mathbf{r}_j^{0} + \sum_{i=1}^{M} A_i \{x_j, y_j, z_j\}_i \sin\left(2\pi f_i t + \varphi_i\right)$ (5)

where rj0 is the equilibrium coordinate of atom j and φi is the phase of mode i.
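Under these assumptions, the superposition can be sketched as follows; the eigenvector array, atomic masses and phases are placeholders, and the amplitude expression follows the reconstructed eqn (4) rather than a verified implementation.

```python
# Sketch: thermal vibration from predicted frequencies via eqns (3)-(5).
import numpy as np

kB = 1.380649e-23      # Boltzmann constant, J/K
c = 2.998e10           # speed of light in cm/s, converts cm^-1 to Hz
T = 300.0              # temperature, K

def mode_amplitude(f_cm, mode, masses, n_modes):
    """Amplitude A_i of eqn (4); f_cm in cm^-1, masses in kg, mode of shape (N, 3)."""
    N = len(masses)
    omega = 2.0 * np.pi * f_cm * c
    denom = n_modes * np.sum(masses * np.sum(mode**2, axis=1))
    return np.sqrt(3.0 * N * kB * T / denom) / omega        # metres

def positions(t, r0, freqs_cm, modes, masses, phases):
    """Superpose the first M modes as in eqn (5) to get coordinates at time t."""
    r = r0.copy()
    for f, mode, phi in zip(freqs_cm, modes, phases):
        A = mode_amplitude(f, mode, masses, len(freqs_cm))
        r = r + A * mode * np.sin(2.0 * np.pi * f * c * t + phi)
    return r
```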
For example, we focus on the protein structure with PDB ID 1FH1 and perform this superposition analysis by using the predicted normal frequencies and obtain its thermal vibration spectrum at 300 K as given in ESI Video 1.†
The selection of the initial structural features, including the diameters and secondary-structure ratios, is more intuitive than fully rational. The reasons are the availability of experimental measurements (DLS, FTIR, etc.) and their consistency with in silico measurements on atomic protein structures. This consistency forms the basis for training an ML model with computational data, which is the only feasible way of generating the massive number of records needed for training and validation. Using linear models, we have successfully reduced the number of predictor variables to five. It is possible that other structural features not included here could better predict the natural frequencies. Given the automatic generation of the database and the rational analysis based on ML and linear models, as long as such features can be extracted from high-resolution protein structures, this ML model can easily be updated to consider geometric features beyond the current ones.
The method proposed here can serve as a useful and computationally efficient approach to predict the absorption and resonance functions of a protein molecule of unknown structure and sequence. It can be useful for categorizing the dynamic function of unknown proteins according to their structural features. This method, combined with recently developed techniques that predict protein structure from sequence,15,16 could facilitate predicting the natural frequencies of all known protein sequences (such as those presented in UniProt,25 with 99.9% of them not included in the Protein Data Bank), which may be a useful resource for identifying a protein type according to its frequency spectrum.
Looking ahead, our new method could also be applied to predict the vibrational features of other nanoscale objects such as nanoparticles. It may also be used to extract material properties other than vibrational spectra and, as such, be broadly applied in a materials-by-design approach in which geometric features are identified to achieve certain material properties. Other applications may include material sonification, offering a rapid way to create audible representations of proteins in musical form.23
We use the Block Normal Mode (BNM) method8,37 in CHARMM for normal mode analysis on each protein structure. BNM projects the full atomic Hessian matrix onto a subspace spanned by the eigenvectors of blocks, where each block is defined by an amino acid. We save the first 70 modes with the lowest frequencies, ordered from lower to higher frequency, for each protein molecule.
The first 6 modes always have zero eigenvalue and zero frequency because they correspond to rigid-body translation and rotation of the molecule. The higher-order modes from mode 7 onwards, i.e. the last 64 of the 70 generated normal modes, describe molecular deformation. We therefore only use the frequency values of these last 64 modes, denoted in sequence as f0, f1, f2, … f63.
Footnotes
† Electronic supplementary information (ESI) available: File 1: database used for linear model: combine_name_freq2param_max_min_ss.dat; File 2: R code for linear model: R_code; File 3: database used for ML model: combine_freq2param_max_min_ss.dat; File 4: Python code for predicting b value: predict_f0.py; File 5: Python code for predicting m value: predict_power.py; Video 1: the dynamics of protein 1FH1 by using both superposition and MD simulations: Equipartition theorem.mp4. See DOI: 10.1039/c9ra04186a
‡ Present address: Department of Civil and Environmental Engineering, Syracuse University, Syracuse, NY 13244, United States of America.
This journal is © The Royal Society of Chemistry 2020 |