Ryan-Rhys
Griffiths‡
*a,
Jake L.
Greenfield‡
bc,
Aditya R.
Thawani‡
b,
Arian R.
Jamasb
d,
Henry B.
Moss
e,
Anthony
Bourached
f,
Penelope
Jones
a,
William
McCorkindale
a,
Alexander A.
Aldrick
a,
Matthew J.
Fuchter
b and
Alpha A.
Lee
*a
aThe Cavendish Laboratory, Department of Physics, University of Cambridge, Cambridge CB3 0HE, UK. E-mail: rrg27@cam.ac.uk; aal44@cam.ac.uk
bMolecular Sciences Research Hub, Department of Chemistry, Imperial College London, London W12 0BZ, UK
cCenter for Nanosystems Chemistry (CNC), Institut für Organische Chemie, Universität Würzburg, Würzburg 97074, Germany
dThe Computer Laboratory, University of Cambridge, Cambridge CB3 0FD, UK
eSecondmind.ai, Cambridge CB2 1LA, UK
fThe Institute of Neurology, Department of Neurology, University College London, London WC1N 3BG, UK
First published on 10th November 2022
Photoswitchable molecules display two or more isomeric forms that may be accessed using light. Separating the electronic absorption bands of these isomers is key to selectively addressing a specific isomer and achieving high photostationary states whilst overall red-shifting the absorption bands serves to limit material damage due to UV-exposure and increases penetration depth in photopharmacological applications. Engineering these properties into a system through synthetic design however, remains a challenge. Here, we present a data-driven discovery pipeline for molecular photoswitches underpinned by dataset curation and multitask learning with Gaussian processes. In the prediction of electronic transition wavelengths, we demonstrate that a multioutput Gaussian process (MOGP) trained using labels from four photoswitch transition wavelengths yields the strongest predictive performance relative to single-task models as well as operationally outperforming time-dependent density functional theory (TD-DFT) in terms of the wall-clock time for prediction. We validate our proposed approach experimentally by screening a library of commercially available photoswitchable molecules. Through this screen, we identified several motifs that displayed separated electronic absorption bands of their isomers, exhibited red-shifted absorptions, and are suited for information transfer and photopharmacological applications. Our curated dataset, code, as well as all models are made available at https://github.com/Ryan-Rhys/The-Photoswitch-Dataset.
Fig. 1 Photoswitchable molecules undergo reversible structural changes between multiple states upon irradiation with light. |
Azobenzene-based photoswitches switch about their NN bond giving rise to two isomeric forms, cis–trans or E–Z isomers. These photoswitches are commonly employed in applications seeking to exploit the significant change in structure, dipole moment, or conductivity of their isomeric forms.20,21 Recently, azoheteroarenes, where one or more of the phenyl rings of azobenzene are replaced by heteroarene rings, have emerged as a promising subclass of the azobenzene photoswitch.18 Azoheteroarenes demonstrate an expansive structural–property tunability of their photophysical properties. These properties include the degree of photoswitching induced by a specified wavelength, quantified by the photostationary state (PSS), and the thermal half-life of the metastable photogenerated state.
Factors that can determine an azoarene's usefulness in a particular application include the thermal half-life of the metastable isomer, quantum yields of photoswitching and the steady-state distribution of a given isomer at a particular irradiation wavelength (PSS). The ideal thermal half-life is dependent on the targeted application, for example, information transfer requires photoswitches with short thermal half-lives11 whilst energy storage applications benefit from photoswitches with long thermal half-lives.10 Achieving a high PSS and separated electronic absorption bands of the isomers is generally desirable, however, as these properties determine the addressability of each isomeric form. Through chemical design, the π–π* and n–π* bands of the E and Z isomers can be tuned to ensure minimal spectral overlap for a given irradiation wavelength, maximising the composition of a specific isomer at said PSS. Moreover, red-shifting the absorption spectra away from the UV region is also beneficial; use of low energy light reduces photo-induced degradation of materials, and also increases the penetration depth in tissue.5 Taken together, azoarene photoswitches have been harnessed in a myriad of applications including photopharmacology,22 organocatalysis,23 molecular solar thermal energy storage,24,25 data storage, real-time information transfer,26 MRI contrast agents,27 and chemical sensing.28
To date, structural features that dictate the photophysical properties of these systems are typically post-rationalised following the synthesis and characterisation of a novel structure19,29–31 or predicted using quantum chemical calculations such as density functional theory (DFT) and time-dependent density functional theory (TD-DFT).19,31 Both of these approaches are limited by the time it takes to perform the synthesis or the calculation in silico, although it should be noted that high-throughput DFT approaches may have potential to mitigate the wall-clock time to some extent in the future.32–34 In light of this, human intuition remains the guide for candidate selection in many photoswitch chemistry laboratories. Advances in molecular machine learning however, have taken great strides in recent years in areas such as molecule generation,35–42 chemical reaction prediction,43–46 and molecular property prediction.47–54 In particular, machine learning property prediction has the potential to cut the attrition rate in the discovery of novel and impactful molecules by virtue of its short inference time. A rapid, accessible, and accurate machine learning prediction of a photoswitch's properties prior to synthesis would allow promising structures to be prioritised, facilitating photoswitch discovery as well as revealing new structure–property relationships.
Recently work by Lopez and co-workers55 employed machine learning to accelerate a quantum chemistry screening workflow for photoswitches. The screening library in this case is generated from 29 known azoarenes and their derivatives yielding a virtual library of 255991 azoarenes in total. The authors observed that screening using active search tripled the discovery rate of photoswitches compared to random search according to a binary labelling system which assigns a positive label to a molecule possessing a λmax > 450 nm and a negative label otherwise. The approach highlights the potential for active learning and Bayesian optimisation methodology to accelerate DFT-based screening. Nonetheless, to our knowledge, the application of machine learning to predict experimental photophysical properties, and the prospective experimental validation of machine learning predictions, remain key open questions.
In this paper we present an experimentally validated framework for molecular photoswitch discovery based on curating a large dataset of experimental photophysical data, and multitask learning using multioutput Gaussian processes. This framework was designed with the goals of: (i) performing faster prediction relative to TD-DFT and directly trained on experimental data; (ii) obtaining improved accuracy relative to human experts; (iii) operationalising model predictions in the context of laboratory synthesis.
To achieve these goals, a dataset of the electronic absorption properties of 405 photoswitches in their E and Z isomeric forms was curated, a full description of the dataset and collated properties is provided in Section 2. Following an extensive benchmark study, we identified an appropriate machine learning model and molecular representation for prediction, as detailed in Section 3. A key feature of this model is that it is performant in the small data regime as photoswitch properties (data labels) obtained via laboratory measurement are expensive to collect in both cost and time. Our model uses a multioutput Gaussian processes (MOGPs) approach due to its ability to operate in the multitask learning setting, amalgamating information obtained from molecules with multiple labels. In Section 4 we show that the MOGP model trained on the curated dataset obtains comparable predictive accuracy to TD-DFT (at the CAM-B3LYP level of theory) and only suffers slight degradations in accuracy relative to TD-DFT methods with data-driven linear corrections whilst maintaining inference time on the order of seconds. A further benchmark against a cohort of human experts as well as a study on how the predictive performance varies as a function of the dataset used for model training is provided in the ESI.† In Section 6 we use our approach to screen a set of commercially available azoarenes, and identify several motifs that display separated electronic absorption bands of their isomers, exhibit red-shifted absorptions, and are suited for information transfer and photopharmacological applications.
The dataset includes properties for 405 photoswitches. The molecular structures of these switches are denoted according to the simplified molecular input line entry system (SMILES).56 A full list of references for the data sources is provided in Section A of the ESI.† The following properties were collated from the literature, where available. (i) The rate of thermal isomerisation (units = s−1), which is a measure of the thermal stability of the metastable isomer in solution. This corresponds to the Z isomer for non-cyclic azophotoswitches and the E isomer for cyclic azophotoswitches. (ii) The PSS of the stated isomer at the given photoirradiation wavelength. These values are typically obtained by continuous irradiation of the photoswitch in solution until a steady state distribution of the E and Z isomers is obtained. The reported PSS values correspond to solution-phase measurements performed in the stated solvents. (iii) The irradiation wavelength, reported in nanometers. This corresponds to the specific wavelength of light used to irradiate samples from E–Z or Z–E such that a PSS is obtained, in the stated solvent. (iv) The experimental transition wavelengths, reported in nanometers. These values correspond to the wavelength at which the π–π*/n–π* electronic transition has a maximum for the stated isomer. This data was collated from solution-phase experiments in the solvent stated. (v) DFT-Computed Transition Wavelengths, reported in nanometers. These values were obtained using solvent continuum TD-DFT methods and correspond to the predicted π–π*/n–π* electronic transition maximum for the stated isomer. (vi) The extinction coefficient (in units of M−1 cm−1), corresponding to how strongly a molecular species absorbs light, in the stated solvent. (vii) The theoretically-computed Wiberg Index57 (through the analysis of the SCF density calculated at the PBE0/6-31G** level of theory30), which is a measure of the bond order of the NN bond in an azo-based photoswitch, giving an indication of the ‘strength’ of the azo bond.
Using the data collated in this dataset, we focus on using our model to predict the four experimentally-determined transition wavelengths below. We focus on these four properties as they are the core determinants of quantitative, bidirectional photoswitching.58 These include, the π–π* transition wavelength of the E isomer (data labels for 392 molecules exist in our dataset). The n–π* transition wavelength of the E isomer (data labels for 141 molecules exist in our dataset). The π–π* transition wavelength of the Z isomer (data labels for 93 molecules exist in our dataset). Finally, the n–π* transition wavelength of the Z isomer (data labels for 123 molecules exist in our dataset). We would like to emphasise that other photophysical or thermal properties could also be investigated using machine learning approaches, notably the thermal half-life of the metastable state. However, there are fewer reports of experimentally-derived thermal half-lives significantly reducing the data that we can train our machine learning models on; these other properties will be investigated in future studies.
An individual box is computed using the mean values of the MAE for the four models for the representation indicated by the associated colour and shows the range in addition to the upper and lower quartiles of the error distribution. The plot indicates that fragprints are the best representation on the E isomer π–π* prediction task and RDKit fragments alone are disfavoured across all tasks.
In terms of the choice of representation we evaluate three commonly-used descriptors: RDKit fragment features,61 ECFP fingerprints62 as well as a hybrid ‘fragprints’ representation formed by concatenating the Morgan fingerprint and fragment feature vectors. The performance of the RDKit fragment, ECFP fingerprint and fragprint representations on the wavelength prediction tasks is visualised in Fig. 2 where aggregation is performed over the RF, GP, MOGP and ANP models. This analysis motivated our use of the fragprints representation in conjunction with the MOGP to take forward to the TD-DFT comparison and experimental screening. We now briefly describe Gaussian processes and in particular the multioutput Gaussian process with Tanimoto kernel that we employ for prediction.
When using GPs for molecular property regression tasks we seek to perform Bayesian inference over a latent function f that represents the mapping between the inputs {x1, …, xN} and their property values {f(x1), …, f(xN)}. In practice we receive the inputs together with potentially noise-corrupted observations of their property values {y1, …, yN}. The mean function m(x) is typically set to zero following standardisation of the data. The kernel function k(x, x′) computes the similarity between molecules x and x′. In all our experiments we use bit/count vectors to represent molecules and hence we choose the Tanimoto kernel66 defined as
(1) |
(2) |
The joint prior in eqn (2) may be conditioned on the observations through which enforces that the joint prior agree with the observed target values y. The predictive distribution is given as with the predictive mean at test locations X* being ∗ = K(X∗, X)[K(X, X) + σ2yI] − 1y and the predictive uncertainty being cov(f∗) = K(X∗, X∗) − K(X∗, X)[K(X, X) + σ2yI]−1K(X, X∗). The predictive mean is the quantity used for prediction while the predictive uncertainty can inform us as to the model's prediction confidence. The GP hyperparameters are learned through the optimisation of the marginal likelihood where N is the number of observations and the subscript notation on the kernel matrix Kθ(X, X′) is chosen to indicate the dependence on the set of hyperparameters θ. The two terms in the expression for the marginal likelihood represent the Occam factor68 in their preference for selecting models of intermediate capacity. In practical applications, GPs have been primarily employed for their high quality uncertainty estimates across applications
(3) |
To construct a multioutput GP we compute a new kernel function k(x, x′)·B[i, j] where B is a positive semidefinite P × P matrix, where the (i, j)th entry of the matrix B multiplies the covariance of the i-th function at x and the j-th function at x′. Such a multioutput GP is termed the intrinsic model of coregionalisation (ICM).74 Inference proceeds in the same manner as for vanilla GPs, substituting the new expression for the kernel into the equations for the predictive mean and variance. Positive semi-definiteness of B may be guaranteed through parametrising the Cholesky decomposition LLT where L is a lower triangular matrix and the parameters may be learned alongside the kernel hyperparameters through maximising the marginal likelihood in eqn (3) substituting the appropriate kernel. While it has been widely cited that GPs scale poorly to large datasets due to the O(N3) cost of training, where N is the number of datapoints,63 recent advances have seen GPs scale to millions of data points using multi GPU parallelisation.75 Nonetheless, on CPU hardware scaling GPs to datasets on the order of 10000 data points can prove challenging. For the applications we consider however, we are unlikely to be fortunate enough to encounter datasets of relevant experimental measurements on the order of tens of thousands of data points and so CPU hardware is sufficient for this study.
In Table 1 we present the performance comparison against 99 molecules and 114 molecules for CAM-B3LYP and PBE0 respectively both using the 6-31G** basis set taken from the results of a benchmark quantum chemistry study80 to which the reader is referred for all information pertaining to the details of the calculations.§ We elect to include an additional 15 molecules in the test set for PBE0. These additional molecules are not featured in the study by Jacquemin et al.80 but are reported in ref. 30 using the same basis set. It should also be noted that the data presented in Jacquemin et al.80 contains measurements for the same molecules under different solvents. In our work we absorb solvent effects into the noise. Specifically, we do not treat the solvent as part of the molecular representation. As such, for duplicated molecules we choose a single solvent measurement at random. We report the mean absolute error (MAE) and additionally the mean signed error (MSE) in order to assess systematic deviations in predictive performance for the TD-DFT methods. For the MOGP model, we perform leave-one-out validation, testing on a single molecule and training on the others in addition to the experimentally-determined property values for molecules acquired from synthesis journal papers. We then average the prediction errors and report the standard error.
Method | Accuracy metric (nm) | CPU runtime (↓) | ||
---|---|---|---|---|
MAE (↓) | MSE | |||
PBE0 benchmark | ||||
MOGP | 15.5 ± 1.3 | 0.0 ± 2.0 | <1 minute | |
PBE0 | Uncorrected | 26.0 ± 1.8 | − 19.1 ± 2.5 | |
Linear correction | 12.4 ± 1.3 | − 1.2 ± 1.8 | ca. 228 days | |
CAM-B3LYP benchmark | ||||
MOGP | 15.3 ± 1.4 | − 0.2 ± 2.1 | <1 minute | |
CAM-B3LYP | Uncorrected | 16.5 ± 1.6 | 6.7 ± 2.2 | |
Linear correction | 10.7 ± 1.2 | 0.0 ± 1.6 | ca. 396 days |
The MOGP model outperforms PBE0 by a large margin and provides comparable performance to CAM-B3LYP. In terms of runtime, there is no contest. The MSE values for the TD-DFT methods however indicate that there is systematic deviation in the TD-DFT predictions. This motivates the addition of a data-driven correction to the TD-DFT predictions. As such, we train a Lasso model with an L1 multiplier of 0.1 on the prediction errors of the TD-DFT methods and apply this correction when evaluating the TD-DFT methods on the heldout set in leave-one-out validation. We choose to use Lasso as empirically it outperforms linear regression in fitting the errors due to inducing sparsity in the high-dimensional fragprint feature vectors. We show the Spearman rank-order correlation coefficients of all methods and the error distributions in the ESI.† There, it is observed that an improvement is obtained in the correlation between TD-DFT predictions on applying the linear correction. Furthermore, the error distribution becomes more symmetric on applying the correction.
In all instances, those polled have received formal training in the fundamentals of UV-vis spectroscopy. We note that one of the limitations of this study is that the human chemists are not provided with the full dataset of 405 photoswitch molecules in advance of making their predictions. Our goal in constructing the study in this fashion was to enable a comparison of the benefits of dataset curation, together with a machine learning model to internalise the information contained in the data, against the experience acquired over a photoswitch chemist's research career. Analysing the MAE across all humans per molecule Fig. 3, we note that the humans perform worse than the MOGP model in all instances. In going from molecule A to E, the number of point changes on the molecule increases steadily, thus, increasing the difficulty of prediction. Noticeably, the human performance is approximately five-fold worse on molecule E (three point changes) relative to molecule A (one point change). This highlights the fact that in instances of multiple functional group modifications, human experts are unable to reliably predict the impact on the E isomer π–π* transition wavelength. The full results breakdown is provided in the ESI.†
Fig. 3 A performance comparison between human experts (orange) and the MOGP-fragprints model (blue). MAEs are computed on a per molecule basis across all human participants. |
1. A π–π* maximum in the range of 450–600 nm for the E isomer.
2. A separation greater than 40 nm between the π–π* of the E isomer and the π–π* of the Z isomer.
The first criterion was chosen so as to limit UV-included damage to materials and improve tissue penetration depths and the second criterion was chosen by analogy to azopyrazole photoswitches reported previously29 where the specified level of band separation provided complete bidirectional photoswitching; this degree of energetic separation between the π–π* bands of the isomers enables one isomer to be selectively addressed using light emitting diodes (LEDs), which are commonly used for their low power consumption but often express broad emission profiles relative to laser diodes, see ESI.†
Fig. 4 The chemical structures of the 11 commercially available azo-based photoswitches that were predicted to meet the criteria. |
We compare the model predictions against the experimentally-determined values in Table 2 The MOGP MAE on the E isomer π–π* wavelength prediction task was 22.7 nm and 21.6 nm on the Z isomer π–π* wavelength prediction task, comparable for the E isomer π–π* and slightly higher for the Z isomer π–π* relative to the benchmark study in Section 3, reflecting the challenge of achieving strong generalisation performance when extrapolating to large regions of chemical space. The first criterion, is a requirement on the absolute rather than the relative value of the π–π* transition wavelengths and so the experimental values may be subject to shifts depending on the solvent. Molecules can display solvatochromism in that the dielectric of the solvent, as well as hydrogen-bonding interactions, can influence the electronic transitions giving rise to hypsochromic or bathochromic shifts in the absorption spectra. This can manifest as changes in the position, intensity and shape of the UV-vis absorption spectrum. As such, the 450 nm criterion could be considered a rough guide and candidates that are just short of the threshold may fulfill the criterion in a different solvent. Nonetheless, given that the MOGP model is trained on just a few hundred data points and is asked to extrapolate to several thousand structures, the accuracy is promising with the advent of further experimental data. In terms of satisfying the pre-specified criteria, 7 of the 11 molecules possessed an E isomer π–π* wavelength greater than 450 nm, 10 of the 11 molecules possessed a separation between the E and Z isomer π–π* wavelengths of greater than 40 nm and 6 of the 11 molecules satisfied both criteria. Compound 7 did not photoswitch under irradiation.
Switch | Model | Experimental | ||||
---|---|---|---|---|---|---|
E π–π* | Z π–π* | E π–π* | Z π–π* | Z PSS (%) | ca. t 1/2 (s) | |
1 | 456 | 368 | 446 | 355 | 90 (405 nm) | <5 |
2 | 459 | 377 | 441 | 356 | 96 (405 nm) | <1 |
3 | 457 | 377 | 399 | 331 | 66 (405 nm) | <10 |
4 | 463 | 373 | 445 | 357 | 94 (405 nm) | <1 |
5 | 471 | 381 | 450 | 370 | 68 (450 nm) | <1 |
6 | 460 | 368 | 451 | 360 | 92 (405 nm) | <30 |
7 | 467 | 369 | 534 | n/a | n/a | n/a |
8 | 450 | 359 | 465 | 376 | 87 (405 nm) | <10 |
9 | 453 | 369 | 468 | 399 | 60 (450 nm) | <10 |
10 | 453 | 363 | 471 | 398 | 15 (450 nm) | <1 |
11 | 453 | 360 | 452 | 379 | 88 (405 nm) | <1 |
The comparison between the ML-predicted electronic absorption bands and the experimental data shown in Table 2 clearly highlights the strength and utility of our model in identifying photoswitchable molecules with red-shifted and energetically separated π–π* transitions. However, it should be highlighted that several of the switches display low PSS compositions of the metastable isomer at the irradiation wavelengths used; these low PSS values of the Z isomer are attributed to a degree of overlap of broad electronic transitions of the isomeric forms. We envisage that the composition of the Z isomer at the PSS can be increased by expanding our compiled dataset to consider the full-width-at-half-max (FWHM) of the electronic absorption bands. Moreover, the thermal half-lives of the switches shown in Table 2 are short, less than 1 min. This rapid thermal relaxation is to be expected for the push–pull type photoswitches the ML model predicted. Despite showing some possible applications for information transfer, we hope to include a consideration of the thermal half-life properties in future work. This will undoubtedly improve the suitability of the predicted switches for a given application. Taken together, we anticipate that the ML model detailed here will be of use for synthetic chemists working to design photoswitchable molecules with red-shifted absorption bands and hope to incorporate additional photophysical and photochemical considerations in the future.
Fig. 5 The experimental UV-vis absorption spectrum of switches 1–11 measured at 25 μM in DMSO and shown as the molar extinction coefficient (M−1 cm−1). Different irradiation wavelengths were employed in order to predict the “pure” Z spectra using the procedure detailed by Fischer.81 The chemical structures of these switches are shown in Fig. 4 above. |
Footnotes |
† Electronic supplementary information (ESI) available. See https://doi.org/10.1039/d2sc04306h |
‡ Equal contribution. |
§ The TD-DFT CPU runtime estimates are taken from ref. 79 and hence represent a ballpark figure that is liable to decrease with advances in high performance computing. |
This journal is © The Royal Society of Chemistry 2022 |