Deep generative design of porous organic cages via a variational autoencoder

Porous organic cages (POCs) are a class of porous molecular materials characterised by their tunable, intrinsic porosity; this functional property makes them candidates for applications including guest storage and separation. POCs are typically formed via dynamic covalent chemistry reactions from multifunctionalised molecular precursors, and their potential chemical space is enormous because they can be formed by combining two relatively small organic molecules, each drawn from an already enormous chemical space. However, identifying suitable molecular precursors for POC formation is challenging, as POCs often lack shape persistence (the cage collapses upon solvent removal with loss of its cavity), thus losing a key functional property (porosity). Generative machine learning models have potential for the targeted computational design of large functional molecular systems such as POCs. Here, we present a deep-learning-enabled generative model, Cage-VAE, for the targeted generation of shape-persistent POCs. We demonstrate the capacity of Cage-VAE to propose novel, shape-persistent POCs via integration with multiple efficient sampling methods, including Bayesian optimisation and spherical linear interpolation.


Table S3: The number of POCs assembled by different reactions in the supervised and unsupervised datasets.

Data Augmentation Strategy
A two-step combinatorial method was developed to achieve BB2 augmentation. In the original dataset, BB2s exhibit high symmetry at a molecular level. This feature enables a convenient decomposition of a BB2 skeleton (with the reactive end functional groups stripped from the BB2 backbone) into a precursor core and two linkers, one on each side, that are axially symmetrical with respect to the core. In the first step of data augmentation, randomly chosen core and linker moieties (a pair of identical linkers) were coupled to form a new precursor skeleton. The boundary between a core and a linker in the precursor skeleton is not strictly defined and is taken only for the convenience of deconstruction. In some cases, the core or linker moiety is absent, which gives more flexibility in exploring augmentation possibilities. Therefore, precursor skeletons can be constructed in one of the following modes: li + cr + li, cr alone, or li + li. Subsequently, the entire BB2 can be constructed in mode (1) R + li + cr + li + R, (2) R + cr + R, or (3) R + li + li + R, where li, cr and R denote the linker, core and reactive end functional group, respectively. Examples of BB2s constructed by the three modes are shown in Fig. S4. In total, the 79 most common cores (excluding an empty element for the core-absent case) and 35 linkers (excluding an empty element for the linker-absent case) were selected as candidates for the random combination (shown in Fig. S5 and Fig. S6).
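The first-step random combination can be pictured as sampling a core and a linker pair and joining them according to one of the three modes above. The following is a minimal sketch under simplifying assumptions: the fragment names, the short lists and the build_bb2_skeleton helper are hypothetical, and plain string labels stand in for the actual fragments, which in practice would be joined at marked attachment points with a cheminformatics toolkit.

```python
import random

# Hypothetical stand-ins for the 79 cores and 35 linkers (Figs. S5 and S6);
# the empty string represents the core-absent / linker-absent cases.
CORES = ["cr_benzene", "cr_biphenyl", ""]
LINKERS = ["li_ethynyl", "li_phenylene", ""]
END_GROUP = "R_aldehyde"  # reactive end functional group

def build_bb2_skeleton(core, linker):
    """Assemble a full BB2 from a core and a pair of identical linkers."""
    if core and linker:   # mode (1): R + li + cr + li + R
        skeleton = [linker, core, linker]
    elif core:            # mode (2): R + cr + R
        skeleton = [core]
    elif linker:          # mode (3): R + li + li + R
        skeleton = [linker, linker]
    else:
        raise ValueError("core and linker cannot both be absent")
    return [END_GROUP, *skeleton, END_GROUP]

# Random combination: sample until at least one of core/linker is present.
core, linker = "", ""
while not (core or linker):
    core, linker = random.choice(CORES), random.choice(LINKERS)
print(build_bb2_skeleton(core, linker))
```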
Besides the random combination, random functionalisation was performed as the second-step data augmentation method, to further increase the number of available BB2s for virtual cage assembly. The random functionalisation was applied only to the linker moieties in the BB2 skeletons, to preserve the symmetry of the resulting POCs. Only one of the 20 functional groups was introduced per linker in each functionalisation. The random functionalisation was designed to traverse all the linker pairs shown in Fig. S7. Subsequently, the cores and the generated linkers were combined to boost the number of BB2 skeletons, using the same first-step random combination procedure described above. As a result of the two-step data augmentation, the number of POCs increased to around 1.2 million, and these POCs were curated into a dataset referred to as the "augmented dataset". A size comparison of the original and augmented datasets can be found in Table S1 and Table S4.
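As a rough illustration of the second step, the sketch below decorates each linker with one functional group at a time and feeds the enlarged linker pool back into the first-step combination; the labels, the four-element stand-in for the 20 functional groups and the functionalise helper are assumptions for illustration only, not the authors' code.

```python
import itertools

# Hypothetical labels; in practice these would be the 20 functional groups and
# the linker fragments of Fig. S7, handled with a cheminformatics toolkit.
FUNCTIONAL_GROUPS = ["F", "Cl", "OMe", "NO2"]   # stand-ins for the 20 groups
LINKERS = ["li_ethynyl", "li_phenylene"]

def functionalise(linker, group):
    """Attach a single functional group to a linker (label-level sketch only)."""
    return f"{linker}({group})"

# Enumerate linker/functional-group combinations; the same functionalised linker
# is later used on both sides of the core, so the BB2 symmetry is preserved.
augmented_linkers = LINKERS + [
    functionalise(li, fg) for li, fg in itertools.product(LINKERS, FUNCTIONAL_GROUPS)
]
print(len(augmented_linkers), augmented_linkers[:4])
```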

Figure S4: Schematic plot of the two-step data augmentation approach. a) Random combination of sub-precursors as the first step. The three modes of BB2 construction are illustrated, from left to right: (1) R + li + cr + li + R, (2) R + cr + R and (3) R + li + li + R, where li, cr and R denote the linker, core and reactive end group, respectively. The region of the skeletons is specifically marked, and circles in the reactive end functional groups denote the connecting points. b) Random functionalisation of the linker moieties in BB2 skeletons as the second step. The symmetry of the BB2 skeletons is preserved after the augmentation.

Figure S5: 79 core moieties for BB2 random combination (excluding an empty element in the core-absent case). The orange circles denote the connections between cores and linkers for mode (1) R + li + cr + li + R, or between cores and reactive end functional groups for mode (2) R + cr + R.

Variational Autoencoder
A variational autoencoder, introduced by Kingma and Welling,S1 is a generative model enabling the mapping of data X to a continuous latent variable z. The objective of this continuous latent-variable model is to find a model distribution q(z) over the latent variable z that approximates the true posterior p(z|X). The encoder and decoder, usually modelled by deep neural networks, learn the approximate posterior q_φ(z|X) and the likelihood distribution p_θ(X|z), respectively. The parameters φ and θ of the respective neural networks are learnt by maximising the evidence lower bound (ELBO):

$$\mathrm{ELBO}(\theta, \phi; X) = \mathbb{E}_{q_\phi(z|X)}\left[\log p_\theta(X|z)\right] - D_{KL}\left(q_\phi(z|X)\,\|\,p(z)\right) = -\left(L_{recon} + L_{KL}\right)$$

where L_recon is the reconstruction loss and the L_KL term represents the Kullback-Leibler divergence between the prior distribution p(z) and the learnt posterior q_φ(z|X). In practice, the prior is normally assumed to be a multivariate Gaussian distribution.
As L_KL can alternatively be understood as a regularisation term,S2,S3 a hyperparameter β is introduced to balance the extent of regularisation during training: β = 1 corresponds to the original VAE proposed by Kingma and Welling,S1 whereas β = 0 degenerates the VAE into a standard autoencoder (AE) that only reconstructs the data and fails to explicitly encourage the formation of broad and even distributions over the latent space.
By adjusting β to a reasonable value in (0, 1), the reconstruction quality of POCs can be improved while maintaining the generalisability of the VAE to unseen data.S4,S5

The VAE aims to generate the multi-component cage representation formed by BB1s, BB2s and reactions, written as X = {x_bb1, x_bb2, x_rxt}, where x_bb2 is a sequential molecular representation of predefined maximum length n and the remaining components are categorical. The three components of the cage representation are assumed to be conditionally independent given a common latent variable z, and valid POC data require the presence of all three components. Each component was encoded and decoded by learnt deep neural networks with parameters φ = {φ_1, φ_2, φ_3} and θ = {θ_1, θ_2, θ_3}, respectively. Therefore, the L_recon term can be split into two components, L_GRU and L_MLP, for the reconstruction of {x_bb2} and {x_bb1, x_rxt}.

The distribution of POCs over the latent variable z can be organised by the property of the POCs by adding a predictor q_φ(y|z)S6 to enforce a property-oriented constraint, where the predictor takes data from the supervised dataset. z in the supervised domain, z ∈ S_supervised, is obtained by mapping supervised data X ∈ S_supervised into the latent space via the encoder q_φ(z|X).

Therefore, the multi-component loss function, with the magnitude of each term adjusted by its scheduler (Fig. S9), can be written as

$$L = \lambda_{GRU}\,L_{GRU} + \lambda_{MLP}\,L_{MLP} + \lambda_{prop}\,L_{prop} + \lambda_{KL}\,L_{KL}$$

where each λ denotes the scheduled weight of the corresponding loss term.
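As a concrete illustration of how these four terms could be combined, the sketch below computes a weighted sum of sequence, category, property and KL losses for one batch; the tensor shapes, argument names and the fixed weights standing in for the schedulers of Fig. S9 are assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def cage_vae_loss(seq_logits, seq_targets,      # BB2 tokens:        (B, T, V) / (B, T)
                  bb1_logits, bb1_targets,      # BB1 category:      (B, C1)   / (B,)
                  rxt_logits, rxt_targets,      # reaction category: (B, C2)   / (B,)
                  prop_logits, prop_targets,    # shape-persistence: (B,)      / (B,)
                  mu, logvar,                   # Gaussian posterior parameters: (B, D)
                  w_gru=1.0, w_mlp=1.0, w_prop=1.0, w_kl=0.1):
    # Sequence reconstruction loss for the BB2 SMILES tokens (L_GRU).
    l_gru = F.cross_entropy(seq_logits.reshape(-1, seq_logits.size(-1)),
                            seq_targets.reshape(-1))
    # Category reconstruction loss for BB1 and reaction (L_MLP).
    l_mlp = (F.cross_entropy(bb1_logits, bb1_targets)
             + F.cross_entropy(rxt_logits, rxt_targets))
    # Property prediction loss for supervised samples (L_prop).
    l_prop = F.binary_cross_entropy_with_logits(prop_logits, prop_targets.float())
    # KL divergence between q_phi(z|X) = N(mu, diag(exp(logvar))) and N(0, I) (L_KL).
    l_kl = -0.5 * torch.mean(torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1))
    return w_gru * l_gru + w_mlp * l_mlp + w_prop * l_prop + w_kl * l_kl

# Dummy batch to show the call signature; dimensions are illustrative only.
B, T, V, C1, C2, D = 4, 16, 40, 10, 3, 32
loss = cage_vae_loss(torch.randn(B, T, V), torch.randint(V, (B, T)),
                     torch.randn(B, C1), torch.randint(C1, (B,)),
                     torch.randn(B, C2), torch.randint(C2, (B,)),
                     torch.randn(B), torch.randint(2, (B,)),
                     torch.randn(B, D), torch.randn(B, D))
print(loss.item())
```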

Auto-Regressive Model
For the generation of the sequence x_bb2 = {x_1, x_2, x_3, ..., x_n}, an auto-regressive encoder-decoder pair, both based on gated recurrent units (GRUs), is used,S7 with the difference that the encoder employs a bi-directional architecture while the decoder is uni-directional. By incorporating an auto-regressive decoding process into the VAE architecture, the generation of a token at time step t is based on both the local context, i.e., all tokens generated in previous time steps x_<t, and global features, i.e., the latent variable z:

$$p_\theta(x_{bb2} \mid z) = \prod_{t=1}^{n} p_\theta(x_t \mid x_{<t}, z)$$

For the generation of the rest of the cage representation, multi-layer perceptrons are applied in the encoder-decoder pairs processing the category-represented components of the POC, {x_bb1, x_rxt}, whose decoding process depends solely on the latent variable z.
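For concreteness, a minimal sketch of such an auto-regressive decoder is shown below; the layer sizes, the use of nn.GRUCell, the way z initialises the hidden state and is concatenated to every input, and the teacher-forced interface are assumptions rather than the exact Cage-VAE architecture.

```python
import torch
import torch.nn as nn

class AutoRegressiveDecoder(nn.Module):
    """Sketch of a one-directional GRU decoder conditioned on the latent variable z."""

    def __init__(self, vocab_size=40, embed_dim=64, latent_dim=32, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.init_h = nn.Linear(latent_dim, hidden_dim)        # z initialises the hidden state
        self.cell = nn.GRUCell(embed_dim + latent_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, prev_tokens, z):
        """prev_tokens (B, T): token preceding each position (teacher forcing)."""
        h = torch.tanh(self.init_h(z))
        logits = []
        for t in range(prev_tokens.size(1)):
            # Each step sees the previous token (local context) and, via concatenation, z.
            x_t = torch.cat([self.embed(prev_tokens[:, t]), z], dim=-1)
            h = self.cell(x_t, h)
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)   # (B, T, vocab_size)

decoder = AutoRegressiveDecoder()
z = torch.randn(4, 32)
prev_tokens = torch.randint(40, (4, 16))
print(decoder(prev_tokens, z).shape)        # torch.Size([4, 16, 40])
```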

Analysis of the Generated POC Distribution
To identify how the generated POCs compare with the training datasets, we plotted the generated molecules from Section 3.1 in the latent space depicted in Fig. 5(a). As shown in Fig. S10, for each principal component dimension we used kernel density estimation (KDE) to assess the distribution of the data points. The generated POCs have a similar distribution to the training set in both reduced dimensions, yet the distribution of the generated POCs has a lower variance. This indicates that the generated POCs reflect the information in the training set, which includes both original and augmented POCs. We also compared the distribution of the generated POCs to the original set and the combined set (original + augmented), respectively. To quantify the difference between the generated samples and the POCs in the training set, we computed the KL divergence between the latent representation of the generated POC samples and the latent representations of the original and combined datasets. By comparison (shown in Table S8), the approximated KL divergence between the generated POCs and the combined dataset is significantly smaller than that between the generated POCs and the original dataset. Therefore, the generated POCs resemble the combined dataset more closely than the original dataset.
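One simple way to approximate such a KL divergence from samples is shown below; this sketch uses a Gaussian KDE evaluated on a shared grid for a single principal component and is only an illustrative estimator under that assumption, not necessarily the estimator cited in the text.

```python
import numpy as np
from scipy.stats import gaussian_kde, entropy

def kde_kl_divergence(samples_p, samples_q, n_grid=512):
    """Estimate KL(P || Q) from 1-D samples of P (e.g. generated POCs) and Q (a training set)."""
    grid = np.linspace(min(samples_p.min(), samples_q.min()),
                       max(samples_p.max(), samples_q.max()), n_grid)
    p = gaussian_kde(samples_p)(grid)   # KDE density of the generated samples
    q = gaussian_kde(samples_q)(grid)   # KDE density of the reference dataset
    return entropy(p, q)                # normalises p, q and returns sum(p * log(p / q))

# Synthetic stand-ins for one principal component of the latent representations.
rng = np.random.default_rng(0)
generated_pc1 = rng.normal(0.0, 0.8, 2000)
combined_pc1 = rng.normal(0.0, 1.0, 2000)
print(kde_kl_divergence(generated_pc1, combined_pc1))
```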
Table S8: Estimated KL divergence between the distribution of the latent representation of the generated POCs and the distributions of the original and combined sets, respectively.

Filter
The filter is a flexible module that can be cascaded behind any generation method and uses a set of simple evaluation criteria to validate generated molecules at minimal computational cost. The filter has a layered structure that evaluates the quality of generated POCs in the order of validity, novelty, precursor validity and symmetry, as described in Section 5.1.
Only a generated POC candidate that passes all layers of evaluation is considered a valid output; otherwise, the candidate is discarded and the generation method is relaunched. The schematic diagram of the filter module is shown in Fig. S13.
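A minimal sketch of this cascaded behaviour is given below; the individual check functions are hypothetical placeholders for the validity, novelty, precursor-validity and symmetry tests of Section 5.1, and generate_valid_poc simply relaunches a supplied sampling function until a candidate passes every layer.

```python
# Hypothetical placeholder checks; the real tests are described in Section 5.1.
def is_valid(candidate) -> bool:               # representation/chemical validity
    return candidate is not None

def has_valid_precursors(candidate) -> bool:   # precursor validity
    return True

def is_symmetric(candidate) -> bool:           # symmetry of the assembled cage
    return True

def passes_filter(candidate, training_set) -> bool:
    """Return True only if the candidate passes every filter layer, in order."""
    layers = [
        is_valid,
        lambda c: c not in training_set,       # novelty against the training data
        has_valid_precursors,
        is_symmetric,
    ]
    return all(layer(candidate) for layer in layers)

def generate_valid_poc(sample_fn, training_set, max_attempts=100):
    """Relaunch the generation method until a candidate passes all filter layers."""
    for _ in range(max_attempts):
        candidate = sample_fn()
        if passes_filter(candidate, training_set):
            return candidate
    raise RuntimeError("no valid POC generated within the attempt budget")
```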
Figure S13: The schematic diagram of the filter module.

Figure S6: 35 linker moieties for BB2 random combination (excluding an empty element in the linker-absent case). The orange circles denote the connections between cores and linkers for mode (1) R + li + cr + li + R, or between two linkers for mode (3) R + li + li + R. The green circles denote the connections between linkers and reactive end functional groups for modes (1) R + li + cr + li + R and (3) R + li + li + R, or between cores and reactive end functional groups for mode (2) R + cr + R.

Figure S8: The logarithmic loss curves of all loss components on the fixed test set.

Figure S9: The magnitude of all schedulers during the training process. a) Constant scheduler for the sequence reconstruction loss term L_GRU; b) and c) linear schedulers for the category reconstruction loss L_MLP and the property loss term L_prop; and d) cyclic scheduler for the KL divergence term L_KL.

Figure S10: Kernel density estimate of the latent representation of generated samples and samples of the training data for a) principal component 1 and b) principal component 2 of the PCA.

Table S1: The number and percentage of POCs labelled by shape-persistence in the original dataset.

Table S2: The number of porous organic cages in the supervised and unsupervised datasets and the training and test set splits.

Table S4: The number of categories of BB1 and BB2 skeletons and the minimum and maximum lengths of skeletons represented as SMILES strings in the original dataset.

Table S6: The evaluations of test set performance at the last epoch of the training process.