LMProtein: a protein language model based framework for protein structural property prediction
Abstract
Recent advances in machine learning and self-supervised deep language modeling have made it possible to accurately predict protein structural properties. Most existing models and pretraining methods leverage the evolutionary information in multiple sequence alignments (MSAs) and obtain very promising results in protein structural property prediction. However, MSA-based methods are computationally intensive and time-consuming, and they cannot be applied to proteins that lack sequence homologs. Here, we present LMProtein, a fast and accurate framework that predicts protein structural properties, such as secondary structure, backbone dihedral angles, fluorescence landscape and stability landscape, using only the protein primary sequence. By combining the unsupervised pretrained language model ESM-2 with a convolutional neural network, a long short-term memory (LSTM) network and a multilayer perceptron, LMProtein achieves better performance than recent MSA-based and single-sequence-based models. The accuracy of eight-state secondary structure (SS8) prediction is approximately 74%, the mean absolute errors of dihedral angle prediction are 19° for Phi and 29° for Psi, and Spearman's ρ between the experimental and predicted values is 0.69 for fluorescence and 0.79 for stability. We believe that our framework has broad potential for predicting protein structural characteristics, providing important opportunities to accelerate progress in protein engineering and drug target identification.
