One class classification as a practical approach for accelerating π–π co-crystal discovery

The implementation of machine learning models has brought major changes to decision making in materials design. One concern for data-driven approaches is the lack of negative data from unsuccessful synthetic attempts, which can produce inherently imbalanced datasets. We propose one-class classification, a methodology that learns from a single well-defined class without counter-examples, as an effective tool for tackling these limitations in materials design problems. An extensive study of different one-class classification algorithms is performed to identify the most appropriate workflow for guiding the discovery of emerging materials belonging to a relatively small class, namely the weakly bound polyaromatic hydrocarbon co-crystals. The two-step approach presented in this study first trains the model using all the known molecular combinations that form this class of co-crystals extracted from the Cambridge Structural Database (1722 molecular combinations), and then scores possible yet unknown pairs from the ZINC15 database (21 736 possible molecular combinations). By focusing on the highest-ranking pairs, predicted to have a higher probability of forming co-crystals, materials discovery can be accelerated: the vast molecular space is reduced and the synthetic efforts of chemists are directed. Furthermore, interpretability techniques are used to seek a more detailed understanding of the molecular properties causing co-crystallization. The applicability of the current methodology is demonstrated with the discovery of two novel co-crystals, namely pyrene-6H-benzo[c]chromen-6-one (1) and pyrene-9,10-dicyanoanthracene (2).


Supplementary Material
1. Generating the datasets

Cocrystal extraction from Cambridge Structural Database (CSD)
Starting from eight representative polyaromatic hydrocarbons (PAHs), we extracted all the co-crystals that include as a co-former either these molecules or structurally similar ones. The structural similarity was measured with the Tanimoto similarity (> 0.35). Multi-component crystal structures containing solvent molecules were removed, keeping only those with benzene-like solvents, as these might hold information about π-π interactions. The solvent list implemented was the default CCDC most-common-solvents list. For the extraction of the PAH co-crystals, the Python API functionality of the CCDC was employed. The two main search functions used are the similarity and the substructure search. 1 The similarity search is based on the comparison of molecular fingerprints and works as follows: given a query molecule, in our case provided as a SMILES string, a 2D structure-based search is performed to determine molecular components that are similar to the input. For each separate molecule in a crystal structure a molecular fingerprint of 2040 bits is generated, using all atom and bond paths of up to ten atoms in a molecule. 1 That search reveals not only single molecules but also combinations of molecules, potentially because of the large fingerprint space used.
The similarity search function of the CSD Python API was applied to the starting PAHs, using the standard CSD fingerprint similarity search with a Tanimoto similarity threshold of > 0.35. The extracted structures were then filtered by removing duplicate structures (polymorphs): several polymorphs exist for some co-crystals, but as our machine learning workflow is based on two-dimensional descriptors, we only considered the two different types of molecules that exist in a structure and not the packing. The InChI of each molecule was used for the filtering, as InChIs are unique identifiers, whereas two different SMILES strings might represent the same molecule. After removing the duplicates, the extracted molecules were split into categories based on the number of times the molecules in the pair appear. In that way we can measure the molecular stoichiometry. For the category including only single components, the substructure search was further applied to detect any potential combinations that were not found by the similarity search. The same filters were applied as for the similarity search.
A substructure search was implemented to search for structures containing a required component, which in our case meant co-crystals containing at least one of the starting PAHs or any molecule similar to them, as found from the similarity search.
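The Tanimoto criterion applied throughout these searches can be sketched independently of the CCDC API with plain set arithmetic; the bit positions below are hypothetical stand-ins for real fingerprint bits:

```python
def tanimoto(fp_a: set, fp_b: set) -> float:
    """Tanimoto coefficient of two fingerprints given as sets of 'on' bit indices."""
    if not (fp_a or fp_b):
        return 0.0
    return len(fp_a & fp_b) / len(fp_a | fp_b)

# Hypothetical on-bit positions for a query PAH and a retrieved structure.
query_fp = {3, 17, 42, 101, 256}
hit_fp = {3, 17, 42, 300}
similarity = tanimoto(query_fp, hit_fp)
print(similarity, similarity > 0.35)  # 0.5 True: the hit passes the threshold
```

The real 2040-bit CSD fingerprints encode atom and bond paths, but the threshold test itself reduces to exactly this ratio of shared to total on-bits.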
After obtaining the final co-crystal dataset, the structures that include common solvents were removed, except those containing benzene-like solvents, which might hold important information about π-π interactions.
The percentage of the extracted PAH co-crystals connected by π-π stacking out of the whole co-crystal dataset was measured after calculating the number of existing co-crystals in the CSD. The whole CSD was searched for structures containing two different molecules using the same search settings as for the extraction of the PAH co-crystals: we identified 13,817 co-crystals, including co-crystals containing benzene-like solvents (solvates), meaning that the 1,722 PAH co-crystals connected by π-π stacking constitute 12% of the total. Figure S1. Flow diagram for the PAH co-crystal extraction. The search starts with 8 representative PAHs and the CCDC Python API is employed to extract all the co-crystals that are formed from these 8 molecules or from molecules similar to them on the basis of molecular fingerprints (ECFP4 > 0.35 Tanimoto similarity). The extracted dataset was further filtered to remove co-crystals containing molecules with acidic parts.

Designing the unlabelled (ZINC15) Dataset
A search of the ZINC15 database for molecules similar to the eight initial molecules of Table 1 on the basis of molecular fingerprints with a Tanimoto similarity threshold of > 0.35, which are purchasable and do not contain incompatible functional groups, afforded a library of 210 candidate molecules. All the possible order-invariant pairwise combinations of these candidates compose the unlabelled dataset. The similarity search in ZINC15 is based on 512-bit ECFP4 fingerprints, 2 meaning that the atomic environment considered around each atom extends four bonds, with a fingerprint size of 512 bits. It is well documented that different libraries present significant structural variation, and thus the ECFP features can take quite different values. 3 The small overlap between the ZINC and CSD databases can be explained in this way, especially considering how the CSD performs its similarity search.
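Building the order-invariant pair list can be sketched with the standard library; the SMILES below are illustrative placeholders for the 210 candidates, and the exclusion of self-pairs is an assumption of this sketch:

```python
from itertools import combinations

# Four placeholder candidates standing in for the 210 ZINC15 molecules.
candidates = ["c1ccc2ccccc2c1", "c1ccccc1", "c1ccc2cc3ccccc3cc2c1", "N#Cc1ccccc1C#N"]

# combinations() yields each unordered pair exactly once, so (A, B) == (B, A):
# this is what makes the pair list order-invariant.
pairs = list(combinations(candidates, 2))
print(len(pairs))  # C(4, 2) = 6
```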

Filtering with Pipeline Pilot
The filtering for incompatible functional groups in both the labelled and unlabelled dataset was performed using Pipeline Pilot 4 with the following workflow.

One class classification Review
Distribution based. Methods in this category are inspired by statistical modelling. They fit a standard distribution model (e.g., a normal distribution) and flag as outliers the instances that deviate from the model, whereas inliers are those that follow the distribution. 6 Typical examples are autoencoders and Gaussian mixture models.
Density based. These methods assume that normal data points occur within a dense neighbourhood. The local outlier factor (LOF) approach is one of the best-known algorithms in this category; normal points get low LOF values as they belong to a locally dense neighbourhood. The density of a neighbourhood is estimated using the distance to the k nearest neighbours, with k being the minimum number of neighbours used to define the local neighbourhood. 7

Distance based. Among other distance-based methodologies, the k-nearest-neighbour algorithm ranks each point on the basis of its distance to its k-th nearest neighbour. 2,7 The lower the distance, the closer the point is to the normal data.
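A minimal scikit-learn sketch of the LOF idea on synthetic data (the descriptor vectors are random stand-ins; `novelty=True` fits on normal data only and allows scoring unseen points, matching the one-class setting):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
# Random stand-ins for the descriptor vectors of known (normal) pairs.
normal = rng.normal(loc=0.0, scale=0.5, size=(200, 4))

# novelty=True: the fitted model can predict on points not seen in training.
lof = LocalOutlierFactor(n_neighbors=20, novelty=True).fit(normal)

inlier = np.zeros((1, 4))          # inside the dense neighbourhood
outlier = np.full((1, 4), 8.0)     # far from every training point
print(lof.predict(inlier), lof.predict(outlier))  # [1] [-1]
```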
Clustering based. The Clustering-Based Local Outlier Factor (CBLOF) is an algorithm developed to consider both the size of the clusters and the distance between a point and its closest cluster. Each datapoint is then assigned a score/outlier factor based on these considerations. 9

Support Vector Machine. The one-class support vector machine algorithm (OCSVM) is an extension of the well-known support vector machine technique. The planar approach of OCSVM finds a linear boundary that maximally separates all the data points from the origin, whereas the spherical approach draws a spherical boundary in feature space around the data (the hypersphere) and the algorithm tries to minimize the volume of the hypersphere. 10

Histogram-based. For each single feature, a univariate histogram is constructed where the height of the bins gives an estimate of the density. The score of each point is then calculated by combining all the histograms using the corresponding heights of the bins where the point is located. 11

Forest-based. Whilst most of the aforementioned models essentially profile the normal labelled data, this model focuses on isolating anomalous instances. The isolation forest algorithm recursively partitions the data on a randomly selected feature at a random split value between its minimum and maximum. The number of recursive partitions required to isolate an instance, represented as a tree structure, is equivalent to the path length from the root node to the terminating node. Instances with short path lengths are regarded as anomalies, with the anomaly score computed as the mean anomaly score over the trees in the forest. 12

Ensemble-based. The ensemble technique fits a number of base detectors to different subsets of the features of the dataset, and outliers are identified based on the probability of each point being an anomaly. A representative model of this category is the feature bagging algorithm. 13

Deep One Class. In contrast to traditional approaches, which make use of heuristics or statistical methods, deep learning approaches stack multiple processing layers one above another, with each layer providing higher-order interactions among the features. Deep learning approaches specifically designed for one-class classification are not yet very widespread. The majority of the existing models involve neural networks trained to perform tasks other than one-class classification, which are then adapted for use in one-class problems. Deep networks designed for one class (anomaly detection) adopt the objective function of a traditional one-class approach but are trained deeper, i.e., using more layers and in higher dimensions, to fit the appropriate function to the normal data. Deep learning models can also easily handle more complex molecular representations as inputs, e.g., SMILES strings or 3D molecular configurations. 14 All the aforementioned algorithms were tested for solving the co-crystal prediction problem. It should be highlighted that there is no single 'best' method for one-class classification tasks. The appropriateness of each algorithm is highly associated with the problem to be solved and the available dataset.
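As a concrete sketch of the forest-based family reviewed above, an isolation forest fitted on synthetic normal data (the data are random stand-ins for descriptor vectors):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Random stand-ins for descriptor vectors of the labelled (normal) pairs.
normal = rng.normal(size=(300, 4))
forest = IsolationForest(n_estimators=100, random_state=0).fit(normal)

# score_samples is higher for points that need long paths to isolate (more normal)
# and lower for points that are isolated after only a few random splits.
queries = np.vstack([np.zeros((1, 4)), np.full((1, 4), 10.0)])
scores = forest.score_samples(queries)
print(scores[0] > scores[1])  # True: the origin is scored as more normal
```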

Feature Engineering
Feature engineering was based on correlations between the Dragon descriptors of the conformers of the known dataset. The most important descriptors were found to be the following: Figure S4. A demonstration of the effect the implemented one-class classification/anomaly detection algorithms have on the initial dataset when projected in two dimensions. Principal Component Analysis (PCA) was employed as the dimensionality reduction technique. The explained variance is 67.61% for the first principal component and 4.95% for the second. All the dimensions are implemented (3700). The outliers found each time belong either to the labelled dataset as noise or to the unlabelled dataset as the outlying part. Figure S5. Pairwise correlations of the scores of the known (yellow) and unknown (green) data using standard one-class classification algorithms. Each algorithm uses a different scoring function to assign scores to the molecular combinations, in all cases giving higher scores to the known data (training set), whereas only a certain part of the unknown combinations (test set) receives high scores and can be regarded as inliers. Figure S6. Heat maps with the scoring of each algorithm on the unlabelled dataset. Figure S7. Illustration of the ensemble learning technique used for combining the scores of each of the standard models. Figure S8. Labelled/unlabelled score distributions and test scoring matrix using the ensemble of one-class classification algorithms.

(Figure S8 residue: scoring-matrix labels HBOS, kNN, GMM, and the final score for the known dataset as the average.)

Deep One Class (SetTransformer-DeepSVDD)
The neural network architecture is adapted from Ruff et al., 15 namely DeepSVDD. The convolutional autoencoder used in the DeepSVDD network was replaced with an attention-based autoencoder that is permutation invariant, namely the SetTransformer. The SetTransformer architecture was adapted from Lee et al. 16 and was used to learn the representation of the molecular pair such that the pairs are perceived as order-invariant vectors. The SetTransformer encoder includes two stacked SABs (Set Attention Blocks) and one PMA (Pooling by Multihead Attention) layer, followed by two linear decoder layers. The first part of the encoder acts independently on each element of the vector (SAB), and in the second part the encoded features are aggregated to produce the desired output. The decoder part is used only for the initialization of the weights and is not further employed in the training. The loss function of DeepSVDD corresponds to the minimization of the volume of a hypersphere that encloses the normal data. In our case we regarded as normal data all the known co-crystals extracted from the CCDC. The hyperparameters of the network (number of epochs and learning rate) were selected based on k-fold cross-validation such that, after minimizing the volume of the hypersphere significantly, the majority of the datapoints of the hold-out test set are found inside the hypersphere. Figure S9. Neural network architecture. Figure S10. a) Labelled/unlabelled score distributions and test scoring matrix using the ensemble of one-class classification algorithms. b) Training loss after 60 epochs implementing the DeepSVDD network.
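Stripped of the network itself, the DeepSVDD objective scores a point by its squared distance to the hypersphere centre c in the learned feature space. A minimal numpy sketch of just the scoring step, with random vectors standing in for SetTransformer embeddings (in the real model, c is fixed from an initial forward pass of the pretrained network rather than taken as the mean):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical learned embeddings of the known co-crystal pairs.
embeddings = rng.normal(size=(500, 32))

# Stand-in centre of the hypersphere (illustration only: the mean embedding).
c = embeddings.mean(axis=0)

def svdd_score(z):
    """Squared distance to the centre: small inside the hypersphere, large outside."""
    return float(np.sum((z - c) ** 2))

# A soft radius such that 95% of the normal training data fall inside.
train_radius = np.quantile([svdd_score(z) for z in embeddings], 0.95)
far_point = np.full(32, 5.0)
print(svdd_score(far_point) > train_radius)  # True: flagged as an outlier
```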
The reproducibility of the model was checked by performing the pretraining and training steps 30 times with varying seeds, keeping 10% as a validation set each time. The mean Pearson correlation of the predicted scores was 0.96 with a mean standard deviation of 0.0017. This indicates high reproducibility of the results; to be able to acquire the same results each time, the seed was set to 0. Table S3. Short description and hyperparameter settings of the models used for one-class classification. Bayesian optimization was implemented to determine the best performance of each single model.
(Table S3, partially recoverable entries:)
- Calculates the distance between points and the closest cluster; alpha = 0.9, the ratio of the number of samples in large clusters to the number of samples in small clusters.
- Number of epochs (a single pass through all the training data); lr = 1e-5, the learning rate.
Figure S11. Illustration of the input to the algorithms.

As there are no negative data (known outliers) for performing the typical evaluation steps, which would be based on measuring the AUC or the APR, we measure the performance of both the neural network and the traditional classifiers based on the true positive rate (TPR %), i.e., the percentage of correctly classified positive data. In more detail, the labelled dataset was split into five folds using k-fold cross-validation, with four folds used for training and one fold (the hold-out data) for validation. As we assume that 95% of the labelled data are normal with 5% noise, a threshold is set as the score above which 95% of the labelled data is scored. Both the k-fold cross-validation and the response of the models to an increasing amount of training data are investigated to decide on the best model. Figure S12. Box plots showing the accuracy of the models after k-fold cross-validation using five folds. Figure S13. Learning curves (TPR %) and standard deviation (%) of all the implemented models.
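The thresholding step above can be sketched as follows (the scores are synthetic; the 5th-percentile cut corresponds to the assumed 5% noise in the labelled data):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic model scores for the labelled training folds and a hold-out fold.
train_scores = rng.normal(loc=1.0, scale=0.2, size=400)
holdout_scores = rng.normal(loc=1.0, scale=0.2, size=100)

# Assume 5% of the labelled data is noise: the threshold is the 5th percentile,
# so 95% of the training scores lie above it.
threshold = np.quantile(train_scores, 0.05)

# TPR%: fraction of hold-out (positive) data scored above the threshold.
tpr = 100.0 * np.mean(holdout_scores >= threshold)
print(f"TPR = {tpr:.1f}%")
```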

Visualizing the predicted pairs
The analysis of the outcomes of the two workflows, i.e., the standard approaches and the deep one class, is performed as follows:
i) The top-scored pairs are shown and compared with the closest (score-wise) structure of the labelled dataset (training set).
ii) The most popular co-formers are identified by counting how many times each co-former is found in high-score pairs (upper quartile).
iii) The predicted pairs are separated into lists based on the following criteria: a) no constraints; b) pairs after removing solvents; c) pairs including one of the initial PAHs; d) pairs without solvents and heteroatoms; e) pairs including heteroatoms; f) pairs including 1,6 dicyanoanthracene, the most similar (Tanimoto similarity) molecule to TCNQ (well known for its electronic properties); g) pyrene co-crystals.
Figure S14. Examples of top-scored combinations and the most similar (score-wise) CSD entry for both workflows followed. On the first row we can observe the existing CSD molecular pairs, whereas on the second row is the closest-in-score predicted pair among those in ZINC15. Some of these trends can be seen in one of the high-score pairs shown in Figure 6. It is easily observed that in this pair one co-former has both a high molecular weight and a distinctive branching index, whereas the pairing molecule lacks both. This is in good accordance with the trends seen in molecular weight and branching index between pairs. Obviously, the high score is not based only on these two descriptors, as a wide range of descriptors is taken into consideration by the deep learning model.
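Step ii) above can be sketched with an upper-quartile cut and a counter (the co-former names and scores are invented for illustration):

```python
from collections import Counter
import numpy as np

# Hypothetical (co-former A, co-former B, score) triples for predicted pairs.
scored_pairs = [("pyrene", "coumarin", 0.91), ("pyrene", "anthracene", 0.88),
                ("naphthalene", "coumarin", 0.40), ("pyrene", "fluoranthene", 0.86),
                ("anthracene", "coumarin", 0.35), ("naphthalene", "fluoranthene", 0.30)]

scores = np.array([s for _, _, s in scored_pairs])
cutoff = np.quantile(scores, 0.75)  # upper quartile of the score distribution

counts = Counter()
for a, b, s in scored_pairs:
    if s >= cutoff:
        counts.update([a, b])
print(counts.most_common(1))  # pyrene dominates the high-score pairs here
```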

Predicting Molecular Stoichiometry
For the prediction of the molecular stoichiometry on the labelled dataset, the XGBoost classifier was implemented. The hyperparameters were optimized using the hyperopt library. Figure S19. Evaluation metrics for the ratio prediction. a) Training/test accuracy with the XGBoost classifier on the prediction of ratios using the initial bidirectional dataset (left) and after using the latent representation (right). The highest accuracy achieved on the test set was 77%, whilst overfitting on the training data. When the latent dimension of the deep network was used as the input, the training was more effective, achieving more than 90% accuracy and no overfitting. It can be postulated that, by using an attention-based encoder to capture the relation between the molecular pairs, a better representation is achieved. b) Classification reports on the validation set using the bidirectional dataset (left) and the latent representation (right).
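A sketch of the ratio-prediction step on synthetic data; scikit-learn's GradientBoostingClassifier stands in for XGBoost here, and the features and binary stoichiometry labels are invented:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Synthetic pair descriptors; the label mimics a binary stoichiometry class
# (e.g. 1:1 vs 1:2) via an invented separable rule.
X = rng.normal(size=(400, 6))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
print(round(clf.score(X_te, y_te), 2))
```

In the actual workflow the input would be either the concatenated bidirectional descriptor vectors or the latent representation of the deep network, with hyperopt tuning the hyperparameters.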

Interpretability
SHAP (SHapley Additive exPlanations) was implemented as a model-interpretation framework to provide chemical insights into the predictions. Rationalizing model decisions would assign priority to meaningful experimental attempts and help chemists to choose the next molecular pair worth testing. SHAP is a model-independent method, meaning that it does not consider the model's feature weights but measures the influence each feature change has on the final decision of the model. In other words, by calculating Shapley values, the contribution of each feature of each combination to the final score is estimated.
The overall SHAP formula is shown in equation (1):

g(z') = φ_0 + Σ_{i=1}^{M} φ_i z'_i    (1)

where g is the explanation model, z' ∈ {0,1}^M is the simplified binary feature vector, M is the number of features and φ_i is the attribution of feature i. Hence, the importance of a given feature i is defined by equation (2):

φ_i = Σ_{S ⊆ F\{i}} [ |S|! (|F| − |S| − 1)! / |F|! ] [ f(S ∪ {i}) − f(S) ]    (2)

where F is the set of all features and f(S) is the model output using only the features in S. It follows that Shapley values represent a unique way to divide a model's output among feature contributions satisfying three axioms: local accuracy (or additivity), consistency (or symmetry), and missingness (or null effect).
Using the SHAP approach, the identification and prioritization of the features that determine the ranking of the pairs is enabled. In that way we can extract the connection between molecular properties and co-crystallisation. In addition to model accuracy, the interpretability of the predictions adds value to any machine learning model.
High negative Shapley values drive the model towards outliers, whereas high positive values support the decision for inlierness. Initially, we tried to get the whole picture of the model and an indication of the important features that dominate the training set. The GradientExplainer method was used on the whole bidirectional dataset so that the position of the molecule does not matter. The summary plot with the features' contributions in descending order of importance is shown in Figure S20. Red and blue indicate high and low feature values respectively, with high positive red values enhancing the decision of inlierness and high negative red values driving towards the decision of anomaly. The Shapley value reflects the contribution the feature makes to the final score of the molecular pair.
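Equation (2) can be verified directly on a toy model by enumerating all coalitions; the three-feature scoring function below is hypothetical, chosen so that the attributions are easy to check by hand:

```python
from itertools import combinations
from math import factorial

def model(present: frozenset) -> float:
    """Value of a feature coalition: an invented additive model with one interaction."""
    v = 0.0
    if 0 in present: v += 2.0
    if 1 in present: v += 1.0
    if 0 in present and 2 in present: v += 0.5  # interaction term
    return v

features = [0, 1, 2]
n = len(features)
phi = {}
for i in features:
    rest = [j for j in features if j != i]
    total = 0.0
    for r in range(n):
        for S in combinations(rest, r):
            S = frozenset(S)
            # Shapley weight |S|!(|F|-|S|-1)!/|F|! from equation (2).
            weight = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
            total += weight * (model(S | {i}) - model(S))
    phi[i] = total

print(phi)  # the interaction's 0.5 is split evenly between features 0 and 2
```

Note that the attributions sum to the full model output (local accuracy), which is what makes the summary plots decomposable per pair.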
Model interpretation inherently depends on the interpretability of the implemented descriptors or features.
Herein, we used all the available Dragon descriptors as we wanted to compare with previous statistical analysis on the CSD based on similar descriptors and also to gain a physical meaning for the PAHs co-crystals to enable an experimental chemist to understand dominating patterns and prioritize the experimental work.

Interpreting the labelled dataset.
A density scatter plot of SHAP values was employed to illustrate the feature importance. Here, the impact each feature has on the model scoring for each individual pair in the labelled dataset can be seen. Features are sorted by the sum of the SHAP value magnitudes across all samples. In the plot the density of the datapoints can also be observed in the lumpier areas, as the distribution of the datapoints is shown on the y axis. Figure S20. Summary plot of Shapley values for the global interpretation of the bidirectional dataset. The Shapley value of each feature represents its contribution towards the model's output. Red positive values drive towards higher scores, boosting inlierness, whereas red negative values tend to decrease the final score, and hence support outlierness. Blue values are indicative of low feature values. As a physical meaning is hard to extract for many of the features calculated by the Dragon software, the correlations between the most significant descriptors and the more general ones were calculated.

Interpreting the pyrene co-crystals subset.
The advantage of using the Shapley analysis is that we can zoom in to a subset of interest and gain some knowledge about the important features in particular molecular pairs. In this work, as pyrene was identified as a popular co-former and was used for the experimental screening, the pyrene co-crystal family of materials was further investigated to extract the feature importance and understand which properties dominate in the existing pyrene pairs. As pyrene is a fixed molecule, it occupied the first position in the molecular vector and the pairing molecules always came second. As we are interested in the contribution of the pairing molecule, only the features related to it are shown in the summary plot below. For those molecules where shape matters, it has more impact than the existence of heteroatoms. It can be postulated that, in cases where the pairing molecule has no heteroatoms, the shape will play an important role in the formation of the pyrene co-crystal. We could therefore expect that molecules with these groups at the certain topological distances and with high scores (as the score is the outcome of the consideration of all the known features) are good candidates for forming co-crystals with pyrene.

Descriptor distributions
As the main principle of machine learning is to uncover some underlying structure in the data, we visualize the distribution of the labelled dataset, used as the training set, and the corresponding distribution of the unlabelled dataset, to compare the general patterns. The investigated properties are separated into the following categories: i) general descriptors, ii) shape descriptors, iii) polarity descriptors, iv) size descriptors and v) electronic descriptors.

Experimental Realization (Pareto Optimization)
As our target is to design functional materials, some further important parameters are taken into consideration for the selection of the co-formers to be experimentally tested. These parameters refer to common factors that a synthetic chemist would use as a guideline for experimental design: quick availability, novelty and possible electronic properties. The decision making was driven by a commonly used criterion for determining solutions to multi-objective optimization problems, Pareto optimality. 18 The co-formers to be tested are the Pareto-optimal points with respect to a high score and the similarity to the TCNQ molecule, which is well known for its interesting electronic properties as a co-crystal co-former. 18 A point is regarded as Pareto optimal when there is no other point for which the desired objectives are improved simultaneously, i.e., when both the score and the structural similarity to TCNQ are maximized.
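The Pareto-optimality criterion can be sketched as follows (the candidate names and objective values are invented for illustration; both objectives, model score and Tanimoto similarity to TCNQ, are to be maximized):

```python
# Hypothetical (model score, Tanimoto similarity to TCNQ) per candidate co-former.
candidates = {"A": (0.95, 0.20), "B": (0.80, 0.60), "C": (0.70, 0.55),
              "D": (0.60, 0.90), "E": (0.50, 0.40)}

def dominated(p, others):
    """p is dominated if some other point is at least as good in both objectives
    (and differs from p), so p cannot be Pareto optimal."""
    return any(q[0] >= p[0] and q[1] >= p[1] and q != p for q in others)

pareto = sorted(k for k, p in candidates.items()
                if not dominated(p, candidates.values()))
print(pareto)  # ['A', 'B', 'D']
```

C is dominated by B (worse in both objectives) and E by B as well, leaving the familiar trade-off front between score and similarity.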

Pyrene-based co-crystals
The Cambridge Structural Database (CSD, 2019 release) was investigated in the search for known pyrene-based co-crystals. The graph of the PYRENE entry, including hydrogen atoms, was used as the starting query in the ConQuest software. Applying the filters 3D coordinates determined, not polymeric, no ions and only organics to the results leads to the list reported in Table S16.

UMAP projection of the co-crystals space
UMAP (Uniform Manifold Approximation and Projection for Dimension Reduction) 82 was implemented for a low-dimensional encoding of the labelled dataset. Each point on the UMAP visualization is coloured according to the difference of the molecular descriptors. All the descriptors are normalized to [0,1] to be comparable. The implemented UMAP settings were selected based on the best distance preservation between the high-dimensional space and the two-dimensional embeddings. The distance preservation was measured by calculating the Pearson correlation coefficient between the distance matrix in the full dimensionality and the distance matrix after the dimensionality reduction. The most effective settings were as follows (n_neighbours = 80, min_dist = 0.1, Euclidean distance metric), resulting in a Pearson correlation coefficient of 0.748. Figure S34. UMAP 2D projection showing the distribution of selected molecular descriptors across the co-crystal space. It can be observed that not all the descriptors show similar trends across the molecular-pair map. Figure S35. UMAP 2D visualization of the overall co-crystal dataset (inset) and zoomed view of the highlighted cluster. 1 and 2 are represented by a red square and triangle respectively. The closest neighbours to 1, as calculated by the Euclidean distance of the descriptors, are visualized with smaller squares, and the closest neighbours to 2 with smaller triangles. The light green and grey colours stand for molecular pairs containing pyrene and those without pyrene, respectively. Interestingly, the majority of the pyrene co-crystals belong to the same cluster, formed by molecules with similar characteristics. It was observed that even though 1 and 2 are quite similar feature-wise to known pyrene co-crystals, the crystal packing that both of them adopt (i.e., the γ motif) is rare and more complex.
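The distance-preservation metric can be sketched with PCA standing in for UMAP (the data are random; only the metric, a Pearson correlation between the condensed pairwise-distance vectors before and after reduction, mirrors the procedure in the text):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic high-dimensional matrix standing in for the pair descriptor vectors.
X = rng.normal(size=(150, 50))

def pairwise_dists(M):
    """Condensed (upper-triangle) Euclidean distance vector of the rows of M."""
    D = np.linalg.norm(M[:, None, :] - M[None, :, :], axis=-1)
    return D[np.triu_indices(len(M), k=1)]

# PCA stands in for UMAP here; with UMAP one would sweep n_neighbours/min_dist
# and keep the setting with the highest correlation.
X2 = PCA(n_components=2, random_state=0).fit_transform(X)
r = np.corrcoef(pairwise_dists(X), pairwise_dists(X2))[0, 1]
print(round(r, 3))
```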

Comparison with known structures
The synthesized co-crystals 1 and 2 were compared to the known co-crystals constituting the labelled dataset. The comparison was performed using all the available molecular descriptors acquired from the Dragon software. 83 As before, each molecular pair is represented by the concatenation of the molecular descriptors of each molecule in the pair. The distance between 1 or 2 and the known CSD structures is calculated by measuring the Euclidean distance from the vectors of the two new structures to the vectors of the labelled dataset.
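The nearest-neighbour comparison can be sketched as follows (the descriptor vectors are synthetic; the new pair is a perturbed copy of a known entry so that the ranking is verifiable):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical descriptor vectors for the labelled CSD pairs and a new co-crystal.
labelled = rng.normal(size=(100, 8))
new_pair = labelled[7] + 0.01 * rng.normal(size=8)  # close to known entry 7

# Euclidean distance from the new structure to every labelled vector.
dists = np.linalg.norm(labelled - new_pair, axis=1)
closest = np.argsort(dists)[:5]  # indices of the five most similar CSD entries
print(closest[0])  # 7: the entry we perturbed
```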

d(x, y) = sqrt( Σ_{i=1}^{n} (x_i − y_i)² )    (3)

Figure S36. Euclidean distance of 1 and 2 to the closest known co-crystals (blue bars) of the labelled dataset. The red bar represents a more distant co-crystal for comparison purposes. Table S18. List of the significant structural motifs and of the crystal packing coefficients (Ck) of 2 and of the most similar co-crystals in the CSD database in terms of Euclidean distance.