Neuraldecipher – reverse-engineering extended-connectivity fingerprints (ECFPs) to their molecular structures

Protecting molecular structures from disclosure to external parties is of great relevance for industrial and private associations, such as pharmaceutical companies. Within the framework of external collaborations, it is common to exchange datasets by encoding the molecular structures into descriptors. Molecular fingerprints such as the extended-connectivity fingerprints (ECFPs) are frequently used for such an exchange, because they typically perform well on quantitative structure–activity relationship tasks. ECFPs are often considered to be non-invertible due to the way they are computed. In this paper, we present a fast reverse-engineering method to deduce the molecular structure from revealed ECFPs. Our method includes the Neuraldecipher, a neural network model that predicts a compact vector representation of compounds from their ECFPs. We then use another pre-trained model to retrieve the molecular structure as a SMILES representation. We demonstrate that our method is able to reconstruct molecular structures to some extent, and that it improves when ECFPs with larger fingerprint sizes are revealed. For example, given ECFP count vectors of length 4096, our method correctly deduces up to 69% of molecular structures on a validation set (112K unique samples).

Architecture for each Neuraldecipher model. Each hidden layer is the composition of three operations: an affine linear transformation, batch normalization, and a ReLU activation. Each integer in the hidden-layers bracket indicates the number of neurons in that hidden layer. The output layer consists of 512 neurons and is activated with Tanh. The last column (elapsed time) states the average duration of one forward pass of 1M compounds through the network, averaged over 10 forward passes.
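The layer composition described in this caption can be sketched in a few lines. The following is a minimal pure-Python illustration, not the authors' implementation: the toy layer sizes, the random initialization, and the helper names (`affine`, `batch_norm`, `init_layer`, `forward`) are assumptions made for this example, and the batch normalization uses plain batch statistics without the learned scale and shift that a trained network would carry. Only the output width of 512 and the Tanh activation are taken from the caption.

```python
import math
import random

def affine(x, weights, bias):
    """Affine map on a batch: x is (batch, n_in), weights is (n_out, n_in)."""
    return [[sum(w * v for w, v in zip(row, sample)) + b
             for row, b in zip(weights, bias)]
            for sample in x]

def batch_norm(x, eps=1e-5):
    """Normalize every feature over the batch (assumption: no learned scale/shift)."""
    n = len(x)
    columns = list(zip(*x))
    means = [sum(col) / n for col in columns]
    variances = [sum((v - m) ** 2 for v in col) / n
                 for col, m in zip(columns, means)]
    return [[(v - m) / math.sqrt(var + eps)
             for v, m, var in zip(sample, means, variances)]
            for sample in x]

def relu(x):
    return [[max(0.0, v) for v in sample] for sample in x]

def tanh(x):
    return [[math.tanh(v) for v in sample] for sample in x]

def init_layer(n_in, n_out, rng):
    """Hypothetical random initialization, used only to make the sketch runnable."""
    weights = [[rng.gauss(0.0, 1.0 / math.sqrt(n_in)) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out

def forward(x, hidden_layers, output_layer):
    # Hidden layer = affine -> batch normalization -> ReLU, as in the caption.
    for weights, bias in hidden_layers:
        x = relu(batch_norm(affine(x, weights, bias)))
    # Output layer = affine -> Tanh, giving values in [-1, 1].
    weights, bias = output_layer
    return tanh(affine(x, weights, bias))

rng = random.Random(0)
sizes = [16, 8, 8]    # toy input and hidden sizes (assumption, not from the table)
OUTPUT_DIM = 512      # output width stated in the caption
hidden_layers = [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]
output_layer = init_layer(sizes[-1], OUTPUT_DIM, rng)

# A toy batch of four "count fingerprints" of length 16.
batch = [[float(rng.randint(0, 3)) for _ in range(sizes[0])] for _ in range(4)]
embedding = forward(batch, hidden_layers, output_layer)
```

Each row of `embedding` is a 512-dimensional vector bounded by the Tanh activation, which is the kind of compact representation the second, pre-trained model would then decode to SMILES.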

Degeneracy analysis for ECFP_6 settings
The number of non-unique ECFPs in the processed training dataset depends on the chosen bond diameter d. For ECFP_6, i.e. fingerprints generated with bond diameter d = 6, we computed the number of non-unique ECFP samples for the bit and count vectors at increasing fingerprint length k. The results are shown in Table 2. Given fixed bond [...]

Validity on reconstructed SMILES in all experiments

Analysis CDDD-space vs. ECFP-space
The ECFP_{6,4096}-count cluster-split model reports a reconstruction accuracy of 41.02% and a mean Tanimoto similarity of 72.58% on the validation dataset (112,332 samples). To illustrate the dependency between CDDD-space and ECFP-space for the deduced molecular structures, we computed the Euclidean distance and the cosine similarity between the predicted and true CDDD vectors from the validation set. The relationship of cosine similarity and Euclidean distance to Tanimoto similarity is shown in Figure 2. Since we formulated the reverse-engineering task as a machine learning problem of predicting a close sample, if not the correct sample, we aim during training to obtain a model f_θ that minimizes the empirical loss function on the training set. Since the empirical loss function contains the deviance d (see Equation (1)), the Euclidean distance is implicitly minimized as well.
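For reference, the three measures compared in this analysis can be written down directly. The sketch below is an illustration under stated assumptions: the function names are hypothetical, the vectors are toy values rather than data from the paper, and the Tanimoto similarity is given only for bit vectors (the variant used for count vectors is not specified in this section).

```python
import math

def euclidean_distance(u, v):
    """Euclidean distance between two real-valued vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero real-valued vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def tanimoto_bits(a, b):
    """Tanimoto similarity of two bit vectors: |A and B| / |A or B|."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 1.0

# Toy vectors standing in for a predicted and a true continuous representation.
predicted = [0.5, -0.5]
true = [0.5, 0.5]
distance = euclidean_distance(predicted, true)
cosine = cosine_similarity(predicted, true)

# Toy bit fingerprints standing in for the deduced and the true structure.
fp_deduced = [1, 1, 0, 1, 0]
fp_true = [1, 0, 0, 1, 1]
tanimoto = tanimoto_bits(fp_deduced, fp_true)
```

The first two measures operate on the continuous vectors the model predicts, while the Tanimoto similarity operates on fingerprints of the decoded structures, which is why the section relates one space to the other.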

Analysis of hash collision
The classical ECFP is an unfolded fingerprint with no pre-defined size; its length depends on the input molecular structure. Since the ECFP algorithm iteratively uses [...] Applying that, the folded bit/count fingerprint has length k. Assume we set k = 10, such that our bit/count fingerprints have a fixed length of 10. Since the unfolded fingerprint is [...] For our ECFP_6 configurations, we computed the unfolded ECFP_6 vectors for all compounds in our processed dataset, obtained the number of unique keys, and subtracted from these values the number of unique keys of the folded ECFP_6 vectors. Figure 3a shows the results for increasing size k. As the fingerprint size k increases, the counts for collision degrees c ≥ 1 decrease (in other words, the count for collision degree c = 0 increases).
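The collision degree described here can be illustrated with a toy example for k = 10. This is a sketch under assumptions, not the procedure used in the paper: folding is modelled as taking each key modulo k, the substructure keys and counts are made up, and the function names are hypothetical.

```python
def fold_counts(unfolded, k):
    """Fold an unfolded fingerprint {key: count} into a count vector of length k.

    Assumption: folding maps a key to position key % k.
    """
    folded = [0] * k
    for key, count in unfolded.items():
        folded[key % k] += count
    return folded

def fold_bits(unfolded, k):
    """Bit-vector variant: a position is set if any key folds onto it."""
    return [1 if count > 0 else 0 for count in fold_counts(unfolded, k)]

def collision_degree(unfolded, k):
    """Unique unfolded keys minus occupied positions after folding."""
    occupied = sum(fold_bits(unfolded, k))
    return len(unfolded) - occupied

# Made-up unfolded fingerprint of one compound: substructure key -> occurrence count.
unfolded = {1234567: 2, 98765: 1, 4242427: 1, 31: 3}

folded_10 = fold_counts(unfolded, 10)
degree_10 = collision_degree(unfolded, 10)       # keys 1234567 and 4242427 share position 7
degree_4096 = collision_degree(unfolded, 4096)   # all four keys land on distinct positions
```

With k = 10, two of the four keys fold onto the same position, so one unique key is lost and the collision degree is 1; with k = 4096 the same keys stay separate and the collision degree is 0, matching the trend reported for increasing k.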
Since our studies also include analyses of the Neuraldecipher performance for a fixed fingerprint length k = 4096 but varying bond diameter d ∈ {4, 6, 8, 10}, we also computed the collision degrees for each of these ECFP datasets. The results are shown in Figure 3b. Figure 3a illustrates that the count for collision degree c = 0, i.e. no information loss due to the folding operation, is highest for the ECFP_6 that was folded into length 32768, followed [...] When fixing the fingerprint length to k = 4096 and increasing the bond diameter d, we observe that the information loss also increases (see the increasing mean collision degree µ_c for increasing bond diameter d in Figure 3b). Since the unfolded ECFP_{d'} with higher bond diameter d' > d is a superset of the unfolded ECFP_d, the number of unique keys for ECFP_{d'} is at least the number of unique keys for ECFP_d.
Since the two ECFPs are folded onto the same fixed length of k = 4096, it is natural that the ECFP with the higher bond diameter suffers from more information loss. This information loss shows up as higher counts for collision degrees of at least 1, i.e. counts for c ≥ 1.
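The superset argument can be checked on a small example. The sketch below again assumes that folding maps a key to key % k and uses made-up keys; `information_loss` is a hypothetical helper name. Every additional key either occupies a new position (loss unchanged) or collides with an occupied one (loss increases by one), so the loss for the larger key set can never be smaller.

```python
def information_loss(keys, k):
    """Unique keys minus unique folded positions (assumption: folding is key % k)."""
    unique = set(keys)
    return len(unique) - len({key % k for key in unique})

K = 4096

# Made-up unfolded keys for a smaller diameter d and a larger diameter d' > d.
keys_d = {3, 17, 4099, 8200}
keys_d_prime = keys_d | {4113, 12291, 50}   # superset: all keys of d plus new ones

loss_d = information_loss(keys_d, K)              # 4099 collides with 3
loss_d_prime = information_loss(keys_d_prime, K)  # 4113 and 12291 collide as well
```

In this toy case the loss grows from 1 to 3 when the key set is extended, while the fingerprint length stays fixed at 4096.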