InChINet: a self-supervised molecular representation learning framework leveraging SMILES and InChI
Abstract
Molecular representation learning is one of the fundamental challenges in artificial intelligence-driven drug discovery. It has attracted increasing attention because of its low cost and speed in applications such as molecular property prediction, drug molecule generation, and drug–drug interaction (DDI) prediction. Numerous models that integrate multi-modal representations have been proposed for molecular representation learning. However, existing methods have not yet considered the IUPAC International Chemical Identifier (InChI) as one of the multi-modal inputs. To address this gap, we propose InChINet, a self-supervised molecular representation learning framework pre-trained on 10 million unlabeled molecules. It leverages the mutual information between the simplified molecular-input line-entry system (SMILES) and InChI. In addition, we present token reordering and token masking for SMILES. Combined with SMILES enumeration, these three strategies introduce domain knowledge and improve the model's robustness to syntactic variations in SMILES representations. Benefiting from the introduction of InChI and these augmentation strategies, InChINet achieves impressive performance on a wide range of downstream tasks, including molecular property prediction, DDI prediction, clustering analysis, and zero-shot cross-lingual retrieval, and its components are further examined in an ablation study.
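To make the idea of token-level SMILES augmentation concrete, the following is a minimal sketch of tokenizing a SMILES string and masking a random subset of its tokens. It is an illustration under stated assumptions, not the InChINet implementation: the regular expression is a tokenization pattern commonly used for SMILES in the literature, and the names `tokenize_smiles`, `mask_tokens`, `MASK_TOKEN`, and the 15% default mask ratio are hypothetical choices made here. The abstract does not specify how InChINet selects tokens to mask or reorder.

```python
import random
import re

# A commonly used pattern for splitting SMILES into chemically meaningful
# tokens (bracket atoms, two-letter halogens, ring closures, bonds, ...).
# Assumption: the paper's own tokenizer may differ.
SMILES_TOKEN_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)

MASK_TOKEN = "[MASK]"  # hypothetical placeholder symbol


def tokenize_smiles(smiles):
    """Split a SMILES string into tokens; fail loudly on unknown characters."""
    tokens = SMILES_TOKEN_PATTERN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"Could not tokenize SMILES losslessly: {smiles!r}")
    return tokens


def mask_tokens(tokens, mask_ratio=0.15, seed=None):
    """Replace a random fraction of tokens with MASK_TOKEN.

    Assumption: uniform random masking. A domain-aware scheme could instead
    restrict masking to atom tokens or to functional groups.
    """
    if not tokens:
        return []
    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * mask_ratio))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    return [MASK_TOKEN if i in positions else tok for i, tok in enumerate(tokens)]


if __name__ == "__main__":
    aspirin = "CC(=O)Oc1ccccc1C(=O)O"
    tokens = tokenize_smiles(aspirin)
    print(tokens)
    print(mask_tokens(tokens, mask_ratio=0.15, seed=0))
```

SMILES enumeration, the third strategy named above, needs a cheminformatics toolkit to rewrite a molecule with a different atom ordering, so it is left out of this dependency-free sketch.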
