Learning continuous and data-driven molecular descriptors by translating equivalent chemical representations

Translation between semantically equivalent but syntactically different line notations of molecular structures compresses meaningful information into a continuous molecular descriptor.

by the size of the kernel. Since the same kernels are applied across all outputs of the previous layer, a convolutional architecture is especially well suited for recognizing patterns in the input independently of their exact position. Hence, CNNs are popular in image analysis tasks, but they have also been applied successfully to sequence-based data [1].
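To make the weight-sharing idea concrete, the following minimal NumPy sketch (illustrative only, not the implementation used in this work) slides one shared kernel along a sequence of embedded tokens; because the same weights are reused at every offset, the detector fires wherever its pattern occurs:

```python
import numpy as np

def conv1d(x, kernel, bias=0.0):
    """Slide one shared kernel along a sequence of feature vectors.

    x:      (seq_len, in_channels) input sequence, e.g. embedded SMILES tokens
    kernel: (width, in_channels)   the same weights are reused at every position
    """
    width = kernel.shape[0]
    out_len = x.shape[0] - width + 1
    out = np.empty(out_len)
    for t in range(out_len):
        # identical weights at every offset -> position-independent pattern detector
        out[t] = np.sum(x[t:t + width] * kernel) + bias
    return out

# Toy example: the kernel responds to the same local pattern wherever it occurs.
x = np.zeros((12, 4)); x[3] = 1.0; x[9] = 1.0   # same "token" at two positions
k = np.zeros((3, 4)); k[0] = 1.0                # detector for that token
print(conv1d(x, k))                             # peaks at t = 3 and t = 9
```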

Recurrent Neural Network
A more tailored architecture for sequence-based data is the Recurrent Neural Network (RNN). In contrast to FNNs and CNNs, where all information flows in one direction from the input layer through the hidden layers to the output layer (feed-forward neural networks), an RNN has an additional feedback loop (recurrence) through an internal memory state. An RNN processes a sequence step by step while updating its memory state concurrently. Hence, the activation of a neuron at step t depends not only on the input at step t but also on the state at step t − 1. By including a memory cell, an RNN is in theory able to model long-term dependencies in sequential data, such as keeping track of opening and closing brackets in the SMILES syntax. The concept of a neural network with a feedback loop can be simplified by unrolling the RNN over the whole input sequence (see Figure 1c).
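The recurrence can be made explicit with a small NumPy sketch of a simple (Elman-style) RNN; the weight matrices and sizes below are arbitrary placeholders for illustration, not the cells used in this work:

```python
import numpy as np

def rnn_unroll(xs, W_x, W_h, b, h0=None):
    """Unroll a simple (Elman) RNN over a sequence.

    At every step the new state depends on the current input AND the
    previous state, which is what lets the network carry information
    (e.g. an open SMILES bracket) across arbitrary distances.
    """
    h = np.zeros(W_h.shape[0]) if h0 is None else h0
    states = []
    for x_t in xs:                               # step-by-step processing
        h = np.tanh(W_x @ x_t + W_h @ h + b)     # h_t = f(x_t, h_{t-1})
        states.append(h)
    return states

rng = np.random.default_rng(0)
xs = rng.normal(size=(5, 8))                     # 5 time steps, 8-dim token embeddings
states = rnn_unroll(xs,
                    rng.normal(size=(16, 8)) * 0.1,
                    rng.normal(size=(16, 16)) * 0.1,
                    np.zeros(16))
```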

Baseline Molecular Descriptors
In this section we introduce the basic concepts of the different molecular descriptors we used as baselines.
Circular fingerprints such as the extended-connectivity fingerprints (ECFPs) were introduced for the purpose of building machine learning models for quantitative structure–activity relationship (QSAR) modelling. This class of fingerprints iterates over the non-hydrogen atoms of a molecular graph and encodes the neighbourhood of each atom up to a given radius using a linear hash function. The resulting set of neighbourhood hash codes for a dataset can then be folded into a fixed-length bit vector per compound, yielding the final fingerprint.
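For illustration, ECFP-style circular fingerprints can be computed with RDKit's Morgan fingerprint implementation (a standard open-source route, not necessarily the exact code used for the baselines here):

```python
from rdkit import Chem
from rdkit.Chem import AllChem

# RDKit's Morgan fingerprint is an ECFP-style circular fingerprint:
# atom neighbourhoods up to `radius` bonds are hashed and folded
# into a fixed-length bit vector.
mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")                      # aspirin
fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=2048)  # ~ECFP4
print(fp.GetNumOnBits(), "bits set out of", fp.GetNumBits())
```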

QSAR modelling
The classifier network consists of a stack of three fully-connected layers with 512, 128, and 9 neurons, respectively, mapping the latent space to the molecular property vector. The model was trained on translating between SMILES and canonical SMILES representations. Both sequences were tokenized as described in the method section and fed into the network.
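The following sketch expresses this classifier head in the modern tf.keras API; only the layer widths 512/128/9 come from the text, while the latent dimensionality and activations are assumptions (the original work was built against TensorFlow 1.4.1):

```python
import tensorflow as tf

latent_dim = 512  # assumed size of the translation model's latent space

# Three fully-connected layers mapping the latent code to 9 molecular properties.
classifier = tf.keras.Sequential([
    tf.keras.layers.Dense(512, activation="relu", input_shape=(latent_dim,)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(9),   # one output per predicted molecular property
])
```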
In order to make the model more robust to unseen data, input dropout was applied at the character level (15%), and noise sampled from a zero-centered normal distribution with a standard deviation of 0.05 was added to the input. We used the Adam optimizer [7] with a learning rate of 5 × 10⁻⁴, which was decreased by a factor of 0.9 every 50,000 steps. The batch size was set to 64.
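These hyper-parameters could be wired up as follows. This is a hedged sketch in the modern TensorFlow API, whereas the original model used TensorFlow 1.4.1; the helper name perturb and the choice to drop entire token embeddings are our interpretation of the character-level dropout described above:

```python
import tensorflow as tf

# Step-wise learning-rate decay: 5e-4, multiplied by 0.9 every 50,000 steps.
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=5e-4,
    decay_steps=50_000,
    decay_rate=0.9,
    staircase=True,
)
optimizer = tf.keras.optimizers.Adam(learning_rate=lr_schedule)

def perturb(embedded_tokens):
    """Character-level input dropout (15%) plus zero-mean Gaussian noise (sigma = 0.05).

    embedded_tokens: (batch, seq_len, emb_dim) embedded input sequence.
    """
    # Drop 15% of input characters by zeroing their whole embedding vector.
    mask = tf.cast(tf.random.uniform(tf.shape(embedded_tokens)[:2]) > 0.15,
                   embedded_tokens.dtype)[..., tf.newaxis]
    noise = tf.random.normal(tf.shape(embedded_tokens), stddev=0.05)
    return embedded_tokens * mask + noise
```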
To handle input sequences of different lengths we used a so-called bucketing approach: the sequences are sorted by length into different buckets (in our case 10), and each training step only feeds sequences from the same bucket (see the sketch below). All sequences were padded to the longest sequence in their bucket. We used the framework TensorFlow 1.4.1 [8] to build and execute our proposed model.

Running a QSAR experiment with our proposed descriptors is on the same timescale as with state-of-the-art molecular fingerprints, while being significantly faster than training a graph-convolution model. One way to make our encoder faster would be to replace the GRU cells with one-dimensional convolutional layers. In our experiments, however, these did not perform as well on the downstream QSAR validation sets.
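A minimal, framework-independent sketch of the bucketing scheme described above; it assumes sequences are Python lists of token ids, and splitting buckets by equal counts is our simplification:

```python
import random

def make_buckets(sequences, n_buckets=10, pad_token=0):
    """Sort sequences into length-based buckets and pad within each bucket."""
    seqs = sorted(sequences, key=len)
    size = -(-len(seqs) // n_buckets)          # ceil division
    buckets = [seqs[i:i + size] for i in range(0, len(seqs), size)]
    padded = []
    for bucket in buckets:
        max_len = max(len(s) for s in bucket)  # pad only to the bucket maximum
        padded.append([s + [pad_token] * (max_len - len(s)) for s in bucket])
    return padded

def batches(buckets, batch_size=64):
    """Yield batches that always come from a single bucket."""
    for bucket in buckets:
        random.shuffle(bucket)
        for i in range(0, len(bucket), batch_size):
            yield bucket[i:i + batch_size]
```

Padding only to the longest sequence within a bucket, rather than in the whole dataset, keeps the amount of wasted computation on padding tokens small.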