Combining machine learning and high-throughput experimentation to discover photocatalytically active organic molecules

Light-absorbing organic molecules are useful components in photocatalysts, but it is difficult to formulate reliable structure–property design rules. More than 100 million unique chemical compounds are documented in the PubChem database, and a significant sub-set of these are π-conjugated, light-absorbing molecules that might in principle act as photocatalysts. Nature has used natural selection to evolve photosynthetic assemblies; by contrast, our ability to navigate the enormous potential search space of organic photocatalysts in the laboratory is limited. Here, we integrate experiment, computation, and machine learning to address this challenge. A library of 572 aromatic organic molecules was assembled with diverse compositions and structures, selected on the basis of availability in our laboratory, rather than more sophisticated criteria. This training library was then assessed experimentally for sacrificial photocatalytic hydrogen evolution using a high-throughput, automated method. Quantum chemical calculations and machine learning were used to visualise, interpret, and ultimately to predict the photocatalytic activities of these molecules, covering a much broader chemical space than for previous polymer photocatalyst libraries. By applying unsupervised learning to the molecular structures, we identified structural features that were common in molecules with high catalytic activity. Further analysis using calculated molecular descriptors within a suite of supervised classification algorithms revealed that light absorption, exciton electron affinity, electron affinity, exciton binding energy, and singlet–triplet energy gap had correlations with the photocatalytic performance. These trained predictive models can be used in future studies as filters to deprioritise or discard would-be low-activity candidate molecules from experiments, and to prioritize more favourable candidates. As a demonstration, we used virtual in silico experiments to show that it was possible to halve the experimental cost of finding 50% of the most active photocatalysts by using the machine learning model as an experimental advisor. We further showed that the ML advisor trained on the 572-molecule library could be used to make predictions for an unseen set of 96 molecules, achieving equivalent predictive accuracies to those in the initial training set. This marks a step toward the machine-learning assisted discovery of molecular organic photocatalysts and the approach might also be applied to problems beyond photocatalytic hydrogen evolution, such as CO2 reduction and photoredox chemistry.


General methods
Small organic molecules, other reagents, and solvents were purchased from commercial suppliers (Manchester Organics, Sigma Aldrich, Alfa Aesar, Fluorochem, Ark Pharm, Apollo, Combi-Blocks, TCI Europe, Carbosynth, TRC Canada, etc.) and used with no further purification. PCN 1 and CTF 2 were prepared using previous methods. Water for the hydrogen evolution experiments was purified using an ELGA LabWater system with a Purelab Option S filtration and ion exchange column (ρ = 15 MΩ cm) without pH level adjustment.

Photocatalytic tests
High-Throughput Hydrogen Evolution Experiments. Agilent Technologies vials (10 mL) were charged with 5.0 ± 0.1 mg of small molecules and transferred to a Chemspeed Accelerator SWING robot for liquid transfer. Degassed jars with triethylamine, methanol, and a stock solution of H 2 PtCl 6 were loaded into the automated liquid handling platform. The system was then closed and purged for 4 h with nitrogen. The automated liquid handling platform then dispensed the liquids as specified, which were degassed aqueous H 2 PtCl 6 solution (1.7 mL, 3wt % Pt to small molecules), triethylamine (1.7 mL), and methanol (1.7 mL). The pH of the solution was around 11.5. The vials were then capped using the capper/crimper tool under inert conditions. Once capped, the samples were taken out, shaken briefly, and transferred to an ultrasonic bath to disperse the photocatalysts. An Oriel Solar Simulator 94123A with an output of 1.0 sun was then used to illuminate the vials on a Stuart roller bar SRT9 for the time specified (classification IEC 60904-9 2007 spectral match A, uniformity classification A, temporal stability A, 1600 W xenon light source, 12 × 12 in.2 output beam, air mass 1.5 G filter, 350−1000 nm). After photocatalysis, the gaseous products of the samples were measured on an Agilent gas connected to a headspace sampler (HS) and Shimadzu GC-HS. No hydrogen evolution was observed for mixtures of water/triethylamine/methanol or water/triethylamine/methanol/H 2 PtCl 6 under the identical conditions.

Structure generation
To generate atomistic structures for the molecules, the simplified molecular-input line-entry system (SMILES) representation of each molecule was converted to a 3D structure using Open Babel. 3 The gen3d operation was used, which starts with 250 steps of steepest-descent geometry optimization with the MMFF94 forcefield, followed by 200 iterations of a Weighted Rotor conformational search, before a final 250-step conjugated-gradient geometry optimization. The resulting 3D structure was subjected to a further conformer search, generating 50 conformers. The lowest-energy conformer from the search was finally geometry-optimized at the B3LYP/6-31G* level of theory; this structure was then used as the starting geometry for the molecule in all following computational and machine-learning studies.
IP, EA, EA* and IP* are standard reduction potentials of half-reactions for free electrons/holes and excitons and were calculated using (TD-)DFT (see Methods section for details). The exciton binding energy, E eb , was calculated as E eb = IP -EA* or E eb = IP* -EA. It is clear that IP and IP* are related to EA* and EA, respectively, through E eb . Therefore, only EA, EA* and E eb were included in the descriptor vectors for machine learning.
S r , Δσ, H CT , and ΔD are descriptors from quantitative characterization of hole and electron distributions in real space, performed for the first singlet (S 1 ) state on the optimized, ground-state geometry, using the Multiwfn software. Briefly: S r index quantifies the overlap between the hole distribution (ρ hole (r)) and the electron distribution (ρ electron (r)). S r varies between 0 (no overlap) and 1 (complete overlap); the larger the value is, the greater the extent of overlap is.
Δσ index is the difference between σ electron and σ hole , given by where σ electron and σ hole are a measure of the sparsity of ρ electron (r) and ρ hole (r), respectively. Δσ index can be positive or negative for different molecules.
H CT is the average of σ electron and σ hole in the charge-transfer (CT) direction, given by where H = (σ electron + σ hole )/2 and u CT is the unit vector along the CT direction.
ΔD is the difference in dipole moment between the excited-state and the ground-state of the molecule-i.e., ΔD = D excited-state -D ground-state -a measure of the extent of charge redistribution between the two states.
ΔE S1→S0 is the energy difference between the S 1 state and the S 0 state. ΔE S1→S0 is also referred to as the calculated optical gap in this study.
ΔE S1→T1 is the energy difference between the S 1 state and the first triplet (T 1 ) state. The smaller the ΔE S1→T1 value, the larger the spin-orbital coupling, and ultimately the more probable and faster the intersystem crossing.
E sol is an approximation of the solvation energy, simply given by where E solvated and E gas are the total energy of the molecule in solvation (water, PCM/SMD) and gas phase, respectively. The molecular geometry was relaxed in each state.
E b is the binding energy between two molecules of the same identity-intended as an indicator of the molecule's propensity for aggregating-which is given by where E dimer and E monomer are the total energy of the dimer and that of an isolated molecule, respectively. Stable binding conformations of a particular dimer were searched by a grid-based approach, together with the Amber forcefield, as implemented in the Autodock software. 4 The most stable dimer from the Autodock search was then geometry-optimized using the GFN-xTB semiempirical tight binding method, 5 with implicit water solvation. The isolated molecule in water solvation was geometry-optimized using the same computational settings. ΔE S1→S0 and ΔE S1→T1 in eV.     Figure S9. 2D UMAP embedding of the chemical space of the photocatalyst library, as defined by Morgan fingerprints, together with Tanimoto index as the similarity measure, in (a) or defined by the (TD)-DFT-calculated molecular descriptors, together with Euclidean distance as the similarity measure, in (b). Symbol size denotes the experimental hydrogen evolution rate; symbol colour denotes the k-means cluster as shown in Figure 2a in the main text, where the chemical space of the library is defined by the SOAP descriptors, together with a REMatch kernel as the similarity measure. Figure S10. Distribution of the molecular photocatalyst library (572 molecules) as a function of the number of rotatable bonds within individual molecules: the number of molecules (a) or the hydrogen evolution rate (HER; b) is plotted against the number of rotatable bonds. For the ten molecules highlighted in red in (b), the effect of different conformers representing the same molecule on the resultant SOAP descriptors for the molecule was investigated. For each of these molecules, a conformer search was performed to screen all torsion angles that were not in a ring or with terminal hydrogen atoms. A Boltzmann jump search method was used, together with filtering the generated conformers by imposing the minimum variation in the root-mean-square of all sampled torsion angles being larger than 15°. After filtering, the five lowest-energy conformers were selected for representing the molecule for calculations of SOAP descriptors. The COMPASSIII force field was used. All conformer searches were performed in BIOVIA Materials Studio 2020. All the conformers thus generated were then geometry-optimized using B3LYP/6-3G*. Different 2D UMAP embeddings using the different conformers of the ten molecules are shown in Figure S11. Figure S11. Aligned UMAP plots for the 572-molecule library, with the ten highlighted molecules (in blue) represented by the different conformers of them (see Figure S10) in the different plots; all the other molecules (in grey) were represented by the same molecular geometries across the different UMAP plots. The molecular geometries used throughout this work are shown in the UMAP plot in the top-left panel, while each of the other five panels shows a UMAP plot using a different local-minimum conformer for the ten highlighted molecules. This comparison shows that the 2D UMAP embeddings of the SOAP space of the 572 molecules are not markedly sensitive to the choice of molecular conformation for the ten highlighted molecules that are among the molecules in the library having a large number of rotatable bonds.                Table S1. b Morgan fingerprints with a radius = 2, generated by RDKit (http://www.rdkit.org); similarity measure: Tanimoto index. c SOAP descriptors with r = 6.0 Å, n = 8, l = 6, generated by DScribe (https://singroup.github.io/dscribe); similarity measure: regularized entropy match (REMatch) kernel. The similarity matrix for the 572 molecules used here is the same as the one used for Figure Figure S28. Virtual experiments comparing an adaptive ML approach and a random sampling approach: (a) Molecules were encoded by molecular descriptors (MD) and trained with MLP models; (b) molecules were encoded by SOAP descriptors and trained with KNN models. Active samples: HER > 1.07 µmol h -1 ; high-activity samples: HER > 12.5 µmol h -1 . 200 in silico experiments, each with a different random starting point, were carried out for each case to obtain the average results shown. Each batch comprised 16 samples (molecules) -note that the experiments shown in the main text (Fig. 4) uses a batch size of 48, rather than 16. The performance increase attained for the smaller batch size of 16 is marginal.       Figure S35. Confusion matrix for binary and ternary classifiers based on k-nearest neighbours and SOAP descriptors.