Melting point prediction of organic molecules by deciphering the chemical structure into a natural language
Abstract
Establishing quantitative structure–property relationships for the rational design of small-molecule drugs at the early discovery stage is highly desirable. Using natural language processing (NLP), we propose a machine learning model that processes the line notation of small organic molecules to predict their melting points. Prediction accuracy improves when the model is trained on a combination of two different canonicalized SMILES forms of the same molecules. In marked contrast to previous fragment-based or descriptor-based models, the prediction accuracy of this NLP-based model does not decrease with increasing size, complexity, and structural flexibility of the molecules. By representing the chemical structure as a natural language, this NLP-based model offers a potential tool for quantitative structure–property prediction in drug discovery and development.
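To illustrate what treating a line notation as a natural language can mean in practice, the following minimal sketch splits SMILES strings into tokens and maps them to integer ids, the usual first step before a sequence model. It is not the tokenizer or model used in this work: the regular expression, the helper names (`tokenize`, `build_vocab`, `encode`), and the two aspirin strings are illustrative assumptions, and the two strings are simply two valid SMILES for the same molecule rather than the output of any particular canonicalization scheme.

```python
import re

# Hypothetical SMILES tokenizer (an assumption for illustration, not the
# paper's implementation). It covers bracket atoms, the organic subset,
# aromatic atoms, ring-closure labels, and common bond/branch symbols.
SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]"            # bracket atoms, e.g. [NH4+]
    r"|Br|Cl"                # two-letter halogens must precede single letters
    r"|[BCNOPSFI]"           # aliphatic organic-subset atoms
    r"|[bcnops]"             # aromatic atoms
    r"|%\d{2}"               # two-digit ring-closure labels
    r"|\d"                   # single-digit ring-closure labels
    r"|[=#\-+()/\\.:~@*$]"   # bonds, branches, and other symbols
)


def tokenize(smiles):
    """Split a SMILES string into tokens; reject unrecognized characters."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"could not tokenize SMILES: {smiles!r}")
    return tokens


def build_vocab(token_lists):
    """Assign an integer id to every distinct token, in order of appearance."""
    vocab = {"<pad>": 0, "<unk>": 1}
    for tokens in token_lists:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab


def encode(tokens, vocab):
    """Map tokens to ids, falling back to <unk> for unseen tokens."""
    return [vocab.get(tok, vocab["<unk>"]) for tok in tokens]


# Two valid SMILES strings for aspirin (illustrative; not tied to any
# specific canonicalization algorithm).
smiles_a = "CC(=O)Oc1ccccc1C(=O)O"
smiles_b = "O=C(O)c1ccccc1OC(C)=O"

tokens_a = tokenize(smiles_a)
tokens_b = tokenize(smiles_b)
vocab = build_vocab([tokens_a, tokens_b])
ids_a = encode(tokens_a, vocab)
ids_b = encode(tokens_b, vocab)

print(tokens_a)
print(tokens_b)
print(ids_a)
print(ids_b)
```

The two strings contain the same tokens in a different order, so a sequence model sees two distinct "sentences" describing one molecule. Under the assumptions above, this is one way to picture why training on more than one SMILES form of each molecule could expose the model to additional views of the same structure.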