Data-driven discovery of molecular photoswitches with multioutput Gaussian processes

Photoswitchable molecules display two or more isomeric forms that may be accessed using light. Separating the electronic absorption bands of these isomers is key to selectively addressing a specific isomer and achieving high photostationary states, whilst overall red-shifting the absorption bands serves to limit material damage due to UV exposure and increases penetration depth in photopharmacological applications. Engineering these properties into a system through synthetic design, however, remains a challenge. Here, we present a data-driven discovery pipeline for molecular photoswitches underpinned by dataset curation and multitask learning with Gaussian processes. In the prediction of electronic transition wavelengths, we demonstrate that a multioutput Gaussian process (MOGP) trained using labels from four photoswitch transition wavelengths yields the strongest predictive performance relative to single-task models, as well as operationally outperforming time-dependent density functional theory (TD-DFT) in terms of the wall-clock time for prediction. We validate our proposed approach experimentally by screening a library of commercially available photoswitchable molecules. Through this screen, we identified several motifs that displayed separated electronic absorption bands of their isomers, exhibited red-shifted absorptions, and are suited for information transfer and photopharmacological applications. Our curated dataset, code and all models are made available at https://github.com/Ryan-Rhys/The-Photoswitch-Dataset.


A Sources of Experimental Data
Properties were collated from a wide range of photoswitch literature. An emphasis was placed on collating compounds with a spectrum of functional groups attached to the core photoswitch scaffold. In addition, this dataset is unique in that it is composed of the latest generations of azoheteroarenes and cyclic azobenzenes, which possess far superior photoswitch properties to analogous, unmodified azobenzenes. See Figure 1 for an overview of these novel azophotoswitches and a summary of their properties.

B Dataset Visualisations
The choice of molecular representation is known to be a key factor in the performance of machine learning algorithms on molecules. [23][24][25] Commonly used representations such as fingerprint and fragment-based descriptors are high-dimensional and, as such, it can be challenging to interpret the inductive bias introduced by the representation. In order to visualise the high-dimensional representation space of the Photoswitch Dataset, we project the data matrix to two dimensions using the UMAP algorithm. [26] We compare the manifolds located under the Morgan fingerprint representation and a fragment-based representation computed using RDKit. [27] We generate 512-bit Morgan fingerprints with a bond radius of 2, setting the number of nearest neighbours parameter in the UMAP algorithm to a value of 50. The resulting visualisation was produced using the ASAP package (available at https://github.com/BingqingCheng/ASAP) and is shown in Figure 2. The structure of the manifold located under the Morgan fingerprint representation identifies meaningful subgroups of azophotoswitches when compared to the fragment-based representation. In order to demonstrate that this finding is due to the representation and not the dimensionality reduction algorithm, we include the manifolds identified by kernel PCA (k-PCA) using a cosine kernel. Both algorithms identify the same manifold structure in the Morgan fingerprint representation.
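For concreteness, the snippet below is a minimal sketch of the fingerprint-projection step, assuming a list of SMILES strings `smiles_list`; the published figures were generated with the ASAP package rather than with this code.

```python
# Illustrative sketch only: 512-bit Morgan fingerprints (bond radius 2)
# projected to 2D with UMAP (n_neighbors=50), mirroring the settings above.
import numpy as np
import umap
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def morgan_matrix(smiles_list, n_bits=512, radius=2):
    """Stack Morgan fingerprints into an (n_molecules, n_bits) data matrix."""
    rows = []
    for smi in smiles_list:
        fp = AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(smi),
                                                   radius, nBits=n_bits)
        arr = np.zeros((n_bits,))
        DataStructs.ConvertToNumpyArray(fp, arr)
        rows.append(arr)
    return np.vstack(rows)

X = morgan_matrix(smiles_list)                           # smiles_list: dataset SMILES
embedding = umap.UMAP(n_neighbors=50).fit_transform(X)   # 2D coordinates to plot
```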

C.1 Property Prediction
For representations, we use 2048-bit Morgan fingerprints with a bond radius of 3, implemented in RDKit. [27] We use 85-dimensional fragment features computed using the RDKit descriptors module. We use the Dscribe library [28] to compute SOAP descriptors. We evaluate performance on 20 random train/test splits in a ratio of 80/20 using the root mean square error (RMSE), mean absolute error (MAE) and coefficient of determination (R^2) as performance metrics, reporting the mean and standard error for each metric (Table 1). We evaluate the following models: Random Forest (RF), Gaussian Processes (GP), Attentive Neural Processes (ANP), [29] Graph Convolutional Networks (GCN), [30] Graph Attention Networks (GAT), [31] Directed Message-Passing Neural Networks (DMPNN), [32] and the following representations: Morgan fingerprints, [33] RDKit fragments, [27] SOAP, [34] the simplified molecular-input line-entry system (SMILES), [35] and self-referencing embedded strings (SELFIES). [36] In addition, we introduce a hybrid representation, fragprints, formed by concatenating the fragment and fingerprint vectors. For the purpose of the benchmark, hyperparameter selection for GP-based approaches is performed by optimising the marginal likelihood on the train set, whereas for other methods cross-validation is performed using the Hyperopt-Sklearn library [37] for Sklearn models such as RF and 1000 randomly sampled configurations for the other models.
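As an illustration of how the three descriptor sets fit together, the sketch below builds the fingerprint, fragment and fragprint vectors for a single molecule. It assumes the 85 fragment features are the `fr_*` descriptors exposed through RDKit's descriptor list, which matches the stated dimensionality; the function name is illustrative rather than part of the released code.

```python
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem, Descriptors

def featurise(smiles, n_bits=2048, radius=3):
    """Return (fingerprint, fragment, fragprint) vectors for one SMILES string."""
    mol = Chem.MolFromSmiles(smiles)
    # 2048-bit Morgan fingerprint with bond radius 3
    bv = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
    fp = np.zeros((n_bits,))
    DataStructs.ConvertToNumpyArray(bv, fp)
    # fragment (functional-group count) descriptors, names beginning with "fr_"
    frag = np.array([fn(mol) for name, fn in Descriptors.descList
                     if name.startswith("fr_")])
    # "fragprints": concatenation of the two feature vectors
    return fp, frag, np.concatenate([fp, frag])
```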
RF is trained using scikit-learn [38] with 1000 estimators and a maximum depth of 300.
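The benchmark protocol above (20 random 80/20 splits with RMSE, MAE and R^2) can be reproduced for the RF baseline along the following lines; `X` and `y` denote precomputed features and transition wavelengths and are assumptions of this sketch, not names from the released code.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Evaluate the RF baseline (1000 trees, maximum depth 300) over 20 random
# 80/20 train/test splits and report the mean of each metric.
rmses, maes, r2s = [], [], []
for seed in range(20):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=seed)
    rf = RandomForestRegressor(n_estimators=1000, max_depth=300, random_state=seed)
    rf.fit(X_tr, y_tr)
    pred = rf.predict(X_te)
    rmses.append(np.sqrt(mean_squared_error(y_te, pred)))
    maes.append(mean_absolute_error(y_te, pred))
    r2s.append(r2_score(y_te, pred))

print(f"RMSE {np.mean(rmses):.2f}  MAE {np.mean(maes):.2f}  R2 {np.mean(r2s):.2f}")
```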
We implement a GP in GPflow [39] using a Tanimoto kernel [40,41] for fingerprint, fragment and fragprint representations, and the subset string kernel of [42] (following the exact experimental setup in Moss and Griffiths [41]) for the character-based SMILES and SELFIES representations.
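A minimal GPflow sketch of the Tanimoto-kernel GP is given below. The kernel variance is the only kernel hyperparameter; the data are assumed to be standardised float64 arrays of shape (N, D) and (N, 1), so the default zero mean function stands in for the empirical-mean setting described in the next paragraph. The full implementation is available in the released repository.

```python
import gpflow
import tensorflow as tf

class Tanimoto(gpflow.kernels.Kernel):
    """Tanimoto (MinMax) kernel: k(x, x') = s^2 <x, x'> / (|x|^2 + |x'|^2 - <x, x'>)."""
    def __init__(self, active_dims=None):
        super().__init__(active_dims=active_dims)
        self.variance = gpflow.Parameter(1.0, transform=gpflow.utilities.positive())

    def K(self, X, X2=None):
        if X2 is None:
            X2 = X
        cross = tf.matmul(X, X2, transpose_b=True)                     # <x, x'>
        sq_x = tf.reduce_sum(tf.square(X), axis=-1, keepdims=True)     # |x|^2
        sq_x2 = tf.reduce_sum(tf.square(X2), axis=-1, keepdims=True)   # |x'|^2
        return self.variance * cross / (sq_x + tf.transpose(sq_x2) - cross)

    def K_diag(self, X):
        # k(x, x) equals the kernel variance for any non-zero bit/count vector
        return self.variance * tf.ones(tf.shape(X)[:-1], dtype=X.dtype)

# GP regression on fragprint features (X_train, y_train assumed precomputed)
model = gpflow.models.GPR(data=(X_train, y_train), kernel=Tanimoto())
gpflow.optimizers.Scipy().minimize(model.training_loss, model.trainable_variables)
```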
Additionally, we train a multioutput Gaussian process (MOGP) based on the intrinsic model of coregionalisation [43] in order to leverage information in the multitask setting (a minimal sketch of this construction is given after the graph model descriptions below). For all GP models, we set the mean function to be the empirical mean of the data and treat the kernel variance and likelihood variance as hyperparameters, optimising their values under the marginal likelihood. For the attentive neural process we use 2 hidden layers of dimension 32 for each of the decoder, the latent encoder and the deterministic encoder, 8-dimensional latent variables r and z, and run 500 iterations of the Adam optimiser [44] with a learning rate of 0.001. For the ANP we perform principal components regression by reducing the representation dimension to 50. We implement GCNs and GATs using the DGL-LifeSci library. [45] Node features include one-hot representations of atom type, atom degree, the number of implicit hydrogen atoms attached to each atom, the total number of hydrogen atoms per atom, atom hybridisation, the formal charge and the number of radical electrons on the atom. Edge features contain one-hot encodings of bond type and Booleans indicating the stereogenic configuration of the bond and whether the bond is conjugated or in a ring.
For the GCN we use two hidden layers with 32 hidden units and ReLU activations, applying BatchNorm [46] to both layers. For the GAT we use two hidden layers with 32 units each, 4 attention heads, an alpha value of 0.2 in both layers and ELU activations. We use a single DMPNN model trained for 50 epochs, with additional normalised 2D RDKit features. All remaining parameters were set to the default values in Yang et al. [32] . We do not benchmark SchNet [47] because it is designed for the prediction of molecular energies and atomic forces.
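As noted above, the MOGP uses the intrinsic model of coregionalisation, i.e. a kernel of the form k((x, i), (x', j)) = k_data(x, x') · B[i, j] with B = WW^T + diag(κ). The sketch below follows GPflow's standard coregionalisation pattern with a rank-1 B for illustration; the augmented arrays `X_aug`/`y_aug` (features and targets with an appended task-index column) are assumptions of this sketch, and the Tanimoto class is the one defined earlier.

```python
import numpy as np
import gpflow

# Intrinsic model of coregionalisation: inputs carry a final task-index column
# (0-3 for the four transition-wavelength tasks) that selects rows of B.
n_tasks, n_features = 4, X_train.shape[1]

X_aug = np.hstack([X_train, task_idx])   # task_idx: (N, 1) column of task labels
y_aug = np.hstack([y_train, task_idx])   # SwitchedLikelihood reads the index too

data_kernel = Tanimoto(active_dims=list(range(n_features)))
coreg = gpflow.kernels.Coregion(output_dim=n_tasks, rank=1,
                                active_dims=[n_features])
likelihood = gpflow.likelihoods.SwitchedLikelihood(
    [gpflow.likelihoods.Gaussian() for _ in range(n_tasks)])

mogp = gpflow.models.VGP(data=(X_aug, y_aug),
                         kernel=data_kernel * coreg,
                         likelihood=likelihood)
gpflow.optimizers.Scipy().minimize(mogp.training_loss, mogp.trainable_variables)
```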
All experiments were performed on the CPU of a MacBook Pro using a 2.3 GHz 8-Core Intel Core i9 processor.
We apply standardisation (subtract the mean and divide by the standard deviation) to the property values in all experiments. The results of the aforementioned models and representations are given in Table 1. Additional results, including message-passing neural networks (MPNN), [48] a black-box alpha-divergence minimisation Bayesian neural network (BNN), [49] and an LSTM with augmented SMILES (SMILES-X), [50] are presented in Table 2.
We note that featurisations using standard molecular descriptors are more than competitive with neural representations for this dataset. The best-performing representation/model pair on the most data-rich E isomer π−π* task was the MOGP*-Tanimoto kernel paired with our own hybrid descriptor set, "fragprints". Importantly, there is weak evidence that the MOGP* is able to leverage multitask learning in learning correlations between the transition wavelengths of the isomers, a modelling feature that may be particularly useful in the low-data regimes characteristic of experimental datasets. A Wilcoxon signed-rank test [51] is carried out in order to determine whether the performance differential between the GP/fragprints combination and the MOGP*/fragprints combination is statistically significant. In this instance, the MOGP* is provided with auxiliary task labels for test molecules where available (i.e. labels for tasks that are not being predicted). The null hypothesis is that there is no significant difference arising from multitask learning. In the case of the E isomer π−π* transition the resultant p-value is 0.33, meaning that we cannot reject the null hypothesis at the 95% confidence level. In the case of the Z isomer π−π* transition the resultant p-value is 0.06, so we again cannot reject the null hypothesis at the 95% confidence level, although in this latter case the conclusion depends on the confidence threshold specified. As such, we conclude that only weak evidence is available to support the benefits of multitask learning over single-task learning.
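The significance test reported above can be reproduced with SciPy; the arrays of paired errors for the two models (here `gp_errors` and `mogp_errors`) and the per-split pairing granularity are assumptions of this sketch.

```python
from scipy.stats import wilcoxon

# Paired Wilcoxon signed-rank test on matched error values from the two models
# (e.g. one error per random train/test split for GP/fragprints vs MOGP*/fragprints).
statistic, p_value = wilcoxon(gp_errors, mogp_errors)
print(f"p = {p_value:.3f}; reject H0 at the 5% level: {p_value < 0.05}")
```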
Table 1: Test set performance in predicting the transition wavelengths of the E and Z isomers. Best-performing models are highlighted in bold. MOGP* denotes a multioutput GP such that auxiliary task labels (i.e. not the task being predicted) for test molecules are provided to the model where available.

In this section we present, in Table 2, results for additional models on the property prediction benchmark for which extensive hyperparameter tuning was not undertaken. The black-box alpha-divergence minimisation Bayesian neural network is implemented in the Theano library [52] and is based on the implementation of [49]. The network has 2 hidden layers of size 25 with ReLU activations. The alpha parameter is set to 0.5, the prior variance for the variational distribution q is set to 1 and 100 samples are taken to approximate the expectation over the variational distribution. For all tasks the network is trained using 8 iterations of the Adam optimiser [44] with a batch size of 32 and a learning rate of 0.05. The MPNN is trained for 100 epochs in the case of the E isomer π−π* task and 200 epochs in the case of the other tasks, with a learning rate of 0.001 and a batch size of 32. The model architecture was taken to be the library default with the same node and edge features used for the GCN and GAT models in the main paper. The SMILES-X implementation remains the same as that of the paper [50] save for the difference that the network is trained for 40 epochs without Bayesian optimisation over model architectures. In the case of SMILES-X, 3 random train/test splits are used instead of 20 for the Z isomer tasks, whereas 2 splits are used for the E isomer n−π* task. For the E isomer π−π* prediction task results are missing due to insufficient RAM on the machine used to run the experiments.

C.2 Prediction Error as a Guide to Representation Selection
On the E isomer π−π* transition wavelength prediction task, we note occasionally marked discrepancies in the predictions made under the Morgan fingerprint and fragment representations. We show one such discrepancy in Figure 3. The resultant analysis motivated the expansion of the molecular feature set to include both representations as "fragprints". The molecule on which the prediction is being made is located at the apex of the triangle, with the proximal training molecule at the base. Fragment descriptors identify another di-substituted nitro-azobenzene as the most similar molecule contained in the train set. By contrast, Morgan fingerprints identify a molecule in possession of a similar substitution pattern to the test case, but with different functionalisation. On this particular test instance it is the identity of the functional groups, rather than the substitution pattern, which dictates the wavelength properties, and hence fragment descriptors achieve a much lower error. As such, although fingerprints offer better overall performance, fragments are clearly informative features for certain test cases.
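A nearest-neighbour lookup of the kind used to generate Figure 3 can be sketched as follows for the fingerprint representation (the fragment case is analogous, using a distance over the fragment vectors instead); the function and its default arguments are illustrative rather than the code used for the figure.

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def nearest_by_fingerprint(test_smiles, train_smiles, radius=3, n_bits=2048):
    """Return the training molecule with the highest Tanimoto similarity to the
    test molecule under the Morgan fingerprint representation."""
    to_fp = lambda s: AllChem.GetMorganFingerprintAsBitVect(
        Chem.MolFromSmiles(s), radius, nBits=n_bits)
    test_fp = to_fp(test_smiles)
    sims = [DataStructs.TanimotoSimilarity(test_fp, to_fp(s)) for s in train_smiles]
    best = max(range(len(sims)), key=sims.__getitem__)
    return train_smiles[best], sims[best]
```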

C.3 Impact of Dataset Choice
In this section we evaluate the generalisation performance of a model trained on the E isomer π−π* values of a large dataset of 6142 out-of-domain molecules (including non-azoarene photoswitches) from Beard et al. [53] with experimentally-determined labels. We train a Random Forest regressor (due to scalability issues with the MOGP on 6000+ data points) implemented in the scikit-learn library, with 1000 estimators and a maximum depth of 300, on the fragprint representation of the molecules. In Table 3 we present results for the case where the train set consists of the large dataset of 6142 molecules and the test set consists of the entire photoswitch dataset. We also present the results on the original E isomer π−π* transition wavelength prediction task where the train set of each random 80/20 train/test split is augmented with the molecules from the large dataset. The results indicate that the data for out-of-domain molecules provides no benefit for the prediction task and, when amalgamated, even degrades performance relative to training on in-domain data only. Based on these results, we highlight the importance of designing synthetic molecular machine learning benchmarks with a real-world application in mind and of involving synthetic chemists in the curation process. Through targeted data collation on a narrow and well-defined region of chemical space, where the molecules are in-domain relative to the task, it becomes possible to mitigate generalisation error.

C.4 Human Performance Benchmark
Below in Table 4 we include the full results breakdown of the human performance comparison study.

C.5 Confidence-Error Curves
An advantage of Bayesian models for the real-world prediction task is the ability to produce calibrated uncertainty estimates. If correlated with prediction error, a model's uncertainty may act as an additional decision-making criterion for the selection of candidates for lab synthesis. In order to investigate the benefits afforded by uncertainty estimates, we produce confidence-error curves using the GP-Tanimoto model in conjunction with the fingerprints representation. The confidence-error curves for the RMSE and MAE metrics are shown in Figure 4 and Figure 5 respectively. The x-axis, confidence percentile, may be obtained simply by ranking each model prediction of the test set in terms of the predictive variance at the location of that test input. As an example, molecules that lie in the 80th confidence percentile will be the 20% of test set molecules with the lowest model uncertainty. We then measure the prediction error at each confidence percentile across 200 random train/test splits to see whether the model's confidence is correlated with the prediction error. We observe that across all tasks the GP-Tanimoto model's uncertainty estimates are positively correlated with prediction error, offering a proof of concept that model uncertainty can be incorporated into the decision process for candidate selection.
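A per-split confidence-error curve can be computed along the following lines; the decile grid and the variable names are assumptions of this sketch, and the reported curves average the resulting errors over 200 random splits.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

def confidence_error_curve(y_true, y_pred, y_std, percentiles=range(0, 100, 10)):
    """Rank test predictions by predictive standard deviation and report the MAE
    over the most-confident molecules retained at each confidence percentile."""
    order = np.argsort(y_std)                  # lowest predictive uncertainty first
    curve = []
    for p in percentiles:
        keep = order[: int(np.ceil(len(order) * (100 - p) / 100))]
        curve.append(mean_absolute_error(y_true[keep], y_pred[keep]))
    return list(percentiles), curve
```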

C.6 TD-DFT Benchmark
Below, in Figure 6 and Figure 7 we include further plots analysing the performance of the methods on the TD-DFT performance comparison benchmark. These plots motivated the use of the Lasso-correction to the TD-DFT predictions.
D.1 Density Functional Theory
Density Functional Theory (DFT) is a modelling method used to elucidate the electronic structure (typically the ground state) of many-body systems. [54] The theory has been used with great success across physics, chemistry, biology and materials science. [55] DFT is considered to be an ab initio, or first-principles, method because it relies directly upon the postulates of quantum mechanics and the only inputs to the calculations are physical constants. [56] A concrete example of an application of DFT towards an electronic structure investigation is in simulating a relaxation of atoms in a crystalline solid to calculate the change in lattice parameters and the forces on each atom upon the introduction of defects or vacancies into the system. [57] Since its inception in 1964/5, Kohn-Sham DFT (KS-DFT) has been one of the most popular electronic structure methods. [55] KS-DFT relies on the Hohenberg-Kohn theorems [58] and the use of a trial electron density (an initial guess) within a self-consistency scheme. In practice, a computational loop takes a trial density, solves the Kohn-Sham equations, and obtains the single-electron wavefunctions corresponding to the trial density; next, by taking these single-electron wavefunctions and using a result of quantum mechanics, a calculated electron density can be computed. If this calculated density is consistent (within a set tolerance) with the trial density, then the theoretical ground-state density has been found.
If the two densities are not consistent, the calculated density is taken as the new trial density, and the loop is repeated until the tolerance is met. Given a suitable exchange-correlation functional, the accuracy of DFT calculations can be very high, but it may also fluctuate significantly with the choice of functional, pseudopotential, basis set and cutoff energy, [59] which are not always straightforward to optimise. A machine learning analogy would be the performance of a specific model on a given dataset depending greatly on its hyperparameters, with out-of-the-box implementations rarely giving satisfactory results without a significant amount of tuning.
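The self-consistency scheme described above can be summarised schematically as follows. This is not a working DFT code: the callables `solve_kohn_sham` and `density_from_orbitals` stand in for the heavy numerical machinery, and the linear density-mixing step is a common practical addition rather than part of the description above.

```python
import numpy as np

def scf_loop(initial_density, solve_kohn_sham, density_from_orbitals,
             tol=1e-6, max_iter=100, mixing=0.3):
    """Schematic Kohn-Sham self-consistency loop: solve the Kohn-Sham equations
    for a trial density, rebuild the density from the resulting single-electron
    wavefunctions, and iterate until the densities agree within tolerance."""
    density = initial_density
    for _ in range(max_iter):
        orbitals = solve_kohn_sham(density)          # single-electron wavefunctions
        new_density = density_from_orbitals(orbitals)
        if np.max(np.abs(new_density - density)) < tol:
            return new_density                       # self-consistent ground state
        # simple linear mixing of old and new densities to aid convergence
        density = (1.0 - mixing) * density + mixing * new_density
    raise RuntimeError("SCF loop did not converge")
```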

D.2 Time-Dependent Density Functional Theory
Time-dependent Density Functional Theory (TD-DFT) is based on a time-dependent cognate of the Hohenberg-Kohn theorems: the Runge-Gross (RG) theorem. [60] This theorem shows that a unique mapping exists between the time-dependent electron density and the time-dependent external potential. This allows for a simplification, permitting a computational time-dependent Kohn-Sham system to be constructed [61] analogous to the computational system used in KS-DFT.
In conjunction with linear response theory, [62] TD-DFT has excelled at calculating electromagnetic spectra, e.g. absorption spectra, of medium-sized and large molecules. [63,64] It has become popular in these fields due to its ease of use relative to other methods as well as its high accuracy. A relevant application of this methodology is to compute the π−π*/n−π* electronic transition wavelengths of conjugated molecular systems, such as the photoswitch molecules in the Photoswitch Dataset.

E Further Screening Details
The SMILES for all experimentally-measured molecules are given in Table 5. Reagents and solvents were obtained from commercial sources (MolPort) and used as supplied.

E.2 Photoswitching
Samples were irradiated using a custom-built irradiation setup at 365 nm. The PSS and the "predicted pure Z" spectra were determined using UV-vis spectroscopy, following the procedure reported by Fischer [65]. In Figure 8, for each of the 6 candidates satisfying both performance criteria, we give some indication of the novelty of the discovered photoswitch candidates by providing the 3 closest molecules by Tanimoto similarity from the photoswitch dataset.

G Datasheets for Datasets
Following the protocol outlined in [66] we provide a datasheet for our dataset of key transition wavelengths for azophotoswitches that we release as part of this submission:

G.1 Motivation
For what purpose was the dataset created? Was there a specific task in mind? Was there a specific gap that needed to be filled? Please provide a description.
The primary objective of this dataset is to assimilate a list of experimentally determined electronic transition wavelengths describing the key π−π* and n−π* electronic transitions of photoswitchable molecules. At present, there is no dataset that addresses the needs of synthetic chemists, who would ideally like to know the aforementioned properties of a photoswitch prior to synthesising it, so as to minimise the amount of synthetic effort required. In an imagined and grossly simplified workflow, a synthetic chemist may be tasked with red-shifting the π−π* transition wavelength of an initial hit molecule for a given biological application. Faced with countless possible modifications that could be made to the initial hit, it is likely that significant effort would be expended synthesising molecules that do not achieve the desired objective. In an ideal world, said chemist would utilise TD-DFT to predict the spectral properties of each and every molecule prior to synthesis; however, as described in the attached manuscript, this is a slow process. Thus, the central aim of this dataset is to act as a catalyst for developing machine learning models that accurately predict the spectral properties of azophotoswitches in an efficient manner and with accuracy comparable to state-of-the-art TD-DFT methods. The main manuscript describes how this aim has been achieved to a certain degree.
Who created this dataset (e.g., which team, research group) and on behalf of which entity (e.g., company, institution, organization)?
The dataset was manually curated in 2020 by Dr. Aditya Raymond Thawani whilst pursuing a PhD at the Department of Chemistry, Imperial College London.
Who funded the creation of the dataset? If there is an associated grant, please provide the name of the grantor and the grant name and number.
No financial support was necessary or received in pursuit of the current work.

G.2 Composition
What do the instances that comprise the dataset represent (e.g., documents, photos, people, countries)? Are there multiple types of instances (e.g., movies, users, and ratings; people and interactions between them; nodes and edges)? Please provide a description. How many instances are there in total (of each type, if appropriate)?
Throughout this discussion the reader is reminded that each azophotoswitch molecule has two isomeric forms, i.e. E and Z. The dataset assimilates experimentally determined characteristics of a series of azophotoswitches. The dataset includes molecular properties for 405 photoswitch molecules. All molecular structures are denoted according to the simplified molecular-input line-entry system (SMILES). We collate the following properties for the molecules: the π−π* and n−π* transition wavelengths of the E and Z isomers, together with thermal isomerisation half-lives where reported.

Is any information missing from individual instances? If so, please provide a description, explaining why this information is missing (e.g., because it was unavailable). This does not include intentionally removed information, but might include, e.g., redacted text.
No data has been intentionally excluded. The absence of any data is indicative of the fact that said data was not experimentally determined for the given molecule.
Are there recommended data splits (e.g., training, development/validation, testing)? If so, please provide a description of these splits, explaining the rationale behind them.
Scaffold splits and chronological splits are not as important in the case of experimental synthetic chemistry. We are not attempting to produce a model that achieves predictive accuracy for useful experimental properties across a large region of chemical space, nor would we entertain this notion as being chemically realistic.
Are there any errors, sources of noise, or redundancies in the dataset? If so, please provide a description.
The irradiation solvent in which an experimental measurement was taken may act as a source of noise in our model formulation. Given that the transition wavelengths of the molecules in the dataset were measured under different irradiation solvents, we may either treat the choice of irradiation solvent as an additional categorical input feature to the model or we can absorb the choice of irradiation solvent into the measurement noise. We elected to take the latter modelling approach due its practicality in light of missing data on irradiation solvents for some molecules in the dataset. Explicitly modelling the irradiation solvent choice as an input feature would either require excluding many data points, limiting the size of the training set, or it would require imputation of the unobserved solvent variable. Imputation methods include substituting the most frequent solvent in the dataset as the "true" irradiation solvent which would not be formally correct and may bias the model, or training a separate model to predict the missing solvent categories which would be highly challenging if not impossible due to the lack of training data. As such, we adopted the most consistent formulation of treating the choice of irradiation solvent as a latent (unobserved) variable that contributes to the transition wavelength value but as a source of measurement noise rather than as a component of the representation of the photoswitch molecules.
Is the dataset self-contained, or does it link to or otherwise rely on external resources (e.g., websites, tweets, other datasets)? If it links to or relies on external resources, a) are there guarantees that they will exist, and remain constant, over time; b) are there official archival versions of the complete dataset (i.e., including the external resources as they existed at the time the dataset was created); c) are there any restrictions (e.g., licenses, fees) associated with any of the external resources that might apply to a future user? Please provide descriptions of all external resources and any restrictions associated with them, as well as links or other access points, as appropriate.
All data has been extracted from internationally recognised and renowned chemical journals. The dataset itself does not rely on access to or link to these journals or associated papers. References for the papers from which said data has been extracted are provided.

G.3 Collection
How was the data associated with each instance acquired? Was the data directly observable (e.g., raw text, movie ratings), reported by subjects (e.g., survey responses), or indirectly inferred/derived from other data (e.g., part-of-speech tags, model-based guesses for age or language)? If data was reported by subjects or indirectly inferred/derived from other data, was the data validated/verified? If so, please describe how.
The data was observed as raw text in the aforementioned journal articles and manually transcribed from said journal articles to the stated dataset.
Has an analysis of the potential impact of the dataset and its use on data subjects (e.g., a data protection impact analysis)been conducted? If so, please provide a description of this analysis, including the outcomes, as well as a link or other access point to any supporting documentation.
Not applicable.

G.5 Uses
Has the dataset been used for any tasks already? If so, please provide a description.
The dataset was used to predict electronic transitions of a range of synthetically accessible azophotoswitches to gauge their properties and decide which compounds to synthesise. A candidate molecule was identified with promising spectral properties, synthesised and experimentally characterised. The experimental characterisation validated the initial prediction. Further work is ongoing to deploy this compound in a photopharmacological application.
Is there a repository that links to any or all papers or systems that use the dataset? If so, please provide a link or other access point.
The repository is hosted at https://github.com/Ryan-Rhys/The-Photoswitch-Dataset

What (other) tasks could the dataset be used for?
The dataset may be used to predict the thermal isomerisation properties of azophotoswitches given that thermal half-life data is tabulated within said dataset.
Is there anything about the composition of the dataset or the way it was collected and preprocessed/cleaned/labeled that might impact future uses? For example, is there anything that a future user might need to know to avoid uses that could result in unfair treatment of individuals or groups (e.g., stereotyping, quality of service issues) or other undesirable harms (e.g., financial harms, legal risks) If so, please provide a description. Is there anything a future user could do to mitigate these undesirable harms?
No risks involved or possibility of any harm: the data is contained within publicly available journal articles.
Are there tasks for which the dataset should not be used? If so, please provide a description.
There are no associated issues with using this dataset for other tasks. However, clearly when dealing with the chemical synthesis of any molecules indicated in this dataset due care should be shown along with a full safety and risk assessment.

G.6 Distribution
Will the dataset be distributed to third parties outside of the entity (e.g., company, institution, organization) on behalf of which the dataset was created? If so, please provide a description.
The dataset is publicly available on the internet.
How will the dataset be distributed (e.g., tarball on website, API, GitHub)? Does the dataset have a digital object identifier (DOI)?
The dataset is hosted on GitHub. The DOI is https://zenodo.org/badge/latestdoi/232307189

When will the dataset be distributed?
The dataset was first released in June 2020.
Will the dataset be distributed under a copyright or other intellectual property (IP) license, and/or under applicable terms of use (ToU)? If so, please describe this license and/or ToU, and provide a link or other access point to, or otherwise reproduce, any relevant licensing terms or ToU, as well as any fees associated with these restrictions.
The copyright of the collated data belongs to the authors of the original publications unless otherwise stated. There is no license, but there is a request to cite the corresponding paper if the dataset is used.

G.7 Maintenance
Who is supporting/hosting/maintaining the dataset?
The dataset is hosted on GitHub and maintained by Aditya Raymond Thawani and Ryan-Rhys Griffiths.
How can the owner/curator/manager of the dataset be contacted (e.g., email address)?
art12@ic.ac.uk and rrg27@cam.ac.uk

Will the dataset be updated (e.g., to correct labeling errors, add new instances, delete instances)? If so, please describe how often, by whom, and how updates will be communicated to users (e.g., mailing list, GitHub)?
All relevant updates to the dataset will be publicised via GitHub and major revisions via Twitter.
If others want to extend/augment/build on/contribute to the dataset, is there a mechanism for them to do so? If so, please provide a description. Will these contributions be validated/verified? If so, please describe how. If not, why not? Is there a process for communicating/distributing these contributions to other users? If so, please provide a description.
Interested parties should contact the original authors about incorporating updates.

G.9 Author Statement
We state, as authors, that we bear all responsibility in case of violation of rights, etc., and we confirm the data license.

G.10 Hosting, Licensing and Maintenance Plan
The dataset is hosted on GitHub under an MIT licence and will be maintained by Aditya Raymond Thawani and Ryan-Rhys Griffiths.