Open Access Article
Connor Forster and Carolin Müller*
FAU Erlangen-Nürnberg, Computer Chemistry Center, Nägelsbachstraße 25, 91052, Erlangen. E-mail: carolin.cpc.mueller@fau.de
First published on 6th November 2025
Accurate prediction of electronic absorption spectra is essential for the rational design of photofunctional molecules. While ab initio quantum chemical methods provide reliable results, their high computational cost often precludes their application in high-throughput or resource-constrained screening workflows. Data-driven alternatives can offer improved efficiency but typically require large, high-quality datasets and may lack interpretability. In this work, we present a low-cost, interpretable approach for predicting absorption maxima (λmax) based on digitized and extended empirical rules originally proposed by R. B. Woodward, M. Fieser, L. Fieser and H. Kuhn. These rule sets estimate ππ* transition energies through additive contributions from base chromophores and position-dependent contributions of certain structural features and substituents. Our implementation enables direct prediction of λmax from SMILES input for three representative compound classes: (i) α, β-unsaturated carbonyl compounds, for which we introduce a refined rule set, (ii) dienes and polyenes, and (iii) 3,4,6-substituted coumarin derivatives. For the latter, we define an entirely new set of empirical rules based on literature data. The resulting workflow offers a computationally efficient and chemically interpretable alternative for early-stage molecular screening and design, bridging historical empirical knowledge with modern cheminformatics.
One common approach for predicting absorption properties involves ab initio quantum chemical simulations, which give rise to vertical excitation energies, oscillator strengths, and the character of electronic transitions and thus enable a direct assignment of experimental absorption bands.1,2 Among these, time-dependent density functional theory (TD-DFT) has emerged as the workhorse, particularly in the context of data-driven photochemistry, due to its favorable balance between accuracy and computational cost.3–5 This is reflected in the widespread use of TD-DFT for generating datasets of UV-vis absorption properties.6–14
Despite their success in generating high-quality datasets, ab initio methods remain computationally demanding, particularly for large-scale screening or rapid exploration of chemical space during early-stage molecular design. This limitation has motivated the development of fast alternatives, such as machine learning models, which leverage structural representations, such as SMILES, molecular fingerprints or graph-based matrix descriptors, to predict λmax from structure-property relationships.10–19 While these models can achieve high accuracy at low computational cost, their black-box nature limits interpretability, reducing their utility for rational design where understanding the influence of specific substituents or electronic effects is crucial.19 Recent studies have employed Shapley additive explanations (SHAP) to identify molecular descriptors governing absorption and emission properties.20–22 These analyses revealed that features such as the number of aliphatic heterocycles, the fraction of sp3-hybridized carbons, the presence of primary amine groups, and fingerprints of specific structural fragments contribute significantly to the optical properties and improve model transparency. Nevertheless, these descriptor-level insights do not explicitly relate the identified features to their spatial or electronic context within the molecule, which limits their relevance for rational molecular design.
Empirical rules offer a solution that retains the advantages of low computational costs and fast predictions while providing distinct chemical insights. Among the earliest and most influential of these are the additive rules developed by R. B. Woodward, M. Fieser, L. Fieser, and H. Kuhn in the mid-20th century.23–29 These rules relate specific structural features, such as the number of conjugated double bonds and the nature of substituents, to shifts in λmax (cf. Section 2). Formulated for dienes, α, β-unsaturated carbonyls, and linear polyenes with more than four conjugated double bonds, the Woodward–Fieser (WF)23–26 and Fieser–Kuhn (FK)27,28 rules have historically offered chemists a simple and interpretable heuristic framework for estimating λmax of low-energy ππ* absorption bands of conjugated organic chromophores.
These additive rules represent an early example of empirical modeling, grounded in well-curated experimental data and systematic analysis – a principle that underlies many modern cheminformatics and machine learning approaches. Despite their interpretability and demonstrated predictive utility, the WF and FK rules remain largely absent from contemporary computational workflows, being primarily applied in educational contexts where predictions are performed manually using tabulated values from textbooks.28,30,31 Notably, to the best of our knowledge, they have not been integrated as features, priors, or constraints in data-driven models, although conversely, a few studies have suggested that their data-driven approaches would have the potential to inform the development of rules for calculating λmax based on substructures.18,19 Of particular note in this context is the approach taken by Joung et al.,19 who draw inspiration from the WF framework to develop an interpretable deep learning model that can predict a range of optical properties, including λmax, emission maxima, quantum yields and excited state lifetimes. Their model quantitatively reproduced classical substituent increments, for example, predicting contributions of ethyl (+5 nm), methoxy (+4 nm), and ethylamine (+70 nm) in cyclohexane, closely matching the original WF values of +5, +6, and +60 nm for diene systems (see Table 1 in Section 2).19,26 While the model captures the electronic effects of substituents, it accounts for the position of substituents indirectly. For example, to quantify the effect of the cyano group in 3-hydroxy-7-cyano-coumarin, a reference molecule (3-hydroxy-coumarin) is required to isolate the substituent contribution. In contrast, the WF framework incorporates these effects systematically through position and type dependent increments. 
Thus, although this approach illustrates the enduring value of chemically interpretable additive models, it does not explicitly extend, refine or digitize the WF rules themselves.
| Enones ([#6]=[#6]–[#6]=[#8]) | | | | | | | | |
|---|---|---|---|---|---|---|---|---|
| Compound category | Base value (nm) | Conjugated features | Increment (nm) | Substituent | α | β | γ | >γ |
| α, β-Unsaturated aldehyde | 210 (218) | Conjugated double bond | +30 | –Alkyl | +10 (+11) | +12 (+19) | +18 | +18 |
| α, β-Unsaturated ketone | 215 (212) | Exocyclic double bond | +5 | –Cl | +15 (+28) | +12 (+22) | +12 | +12 |
| α, β-Unsaturated acid | 195 (196) | Homoannular cyclodiene | +39 | –Br | +25 (+38) | +30 (+33) | +0 | +0 |
| α, β-Unsaturated ester | 195 | | | –OH | +35 (+38) | +30 (+14) | +50 | +0 |
| Cyclohexenone | 215 (206) | | | –O-alkyl | +35 (+29) | +30 (+22) | +17 | +31 |
| Cyclopentenone | 202 (191) | | | –O-acyl | +12 | +12 | +12 | +12 |
| Dienes ([#6]=[#6]–[#6]=[#6]) | | | | | |
|---|---|---|---|---|---|
| Compound category | Base value (nm) | Conjugated features | Increment (nm) | Substituent | Increment (nm) |
| Acyclic diene | 217 | Additional double bond | +30 | –Alkyl | +5 |
| Homoannular cyclic diene | 253 | Exocyclic double bond | +5 | –Cl/–Br | +10 |
| Heteroannular cyclic diene | 214 | | | –O-alkyl | +6 |
| | | | | –N(alkyl)2 | +60 |
| | | | | –O-phen | +18 |
| | | | | –S-alkyl | +30 |
We attribute this limitation to the lack of programmatic, high-throughput implementations of the WF and FK rules: manual lookup and structural interpretation impede their use in automated workflows and large-scale screening. To address this, we introduce ChromoPredict, a Python package that encodes the WF and FK rules for direct estimation of λmax from SMILES inputs (see Section 3.1). By formalizing these empirical rules digitally, ChromoPredict preserves their inherent interpretability, providing transparent insights into how specific substituents and structural motifs modulate absorption maxima.
Herein, we systematically evaluate and refine the empirical rules using a curated computational dataset of 720 α, β-unsaturated carbonyl compounds, including aldehydes, ketones, carboxylic acids, cyclopentenones, and cyclohexenones, and experimental datasets of additional 28 enones and 36 coumarins, with the individual molecules in both datasets bearing methyl, methoxy, hydroxy, chloro, or bromo substituents (see Section 3.2). WF predictions for α, β-unsaturated carbonyl compounds generated with ChromoPredict are benchmarked against TD-DFT reference calculations (see Section 3.2.1), and the refined rules are further compared to random forest models trained on molecular fingerprints (see Section 3.2.2). This analysis delineates the predictive strengths and limitations of additive rules and highlights opportunities for hybrid approaches that integrate mechanistic insight with data-driven modeling. Finally, we extend the WF rules to 3-, 4-, or 6-substituted coumarin derivatives, illustrating the flexibility and scalability of ChromoPredict (see Section 3.2.3).
The earliest systematic work was carried out by Robert B. Woodward in the 1940s, who focused on α, β-unsaturated carbonyl compounds, including acyclic enals, ketones, acids, esters, as well as cyclic enones such as cyclopentenone and cyclohexenone.23,25,29 Based on the analysis of numerous UV-vis spectra, Woodward identified clusters of λmax corresponding to the degree and position of substitution: α- or β-mono-substituted (225 ± 5 nm), α, β- or β, β′-di-substituted (239 ± 5 nm), and α, β, β′-tri-substituted (254 ± 5 nm) systems.23 Subsequent refinements classified di- and tri-substituted molecules according to the presence of exocyclic bonds.25 In parallel, Woodward extended his analysis to normal conjugated dienes, defining base values for symmetric dienes (e.g., butadiene: 217 nm) and introducing additive increments of +5 nm for each substituent or exocyclic double bond. The λmax of asymmetric dienes was then estimated as the average of the corresponding symmetric systems.24
Building on Woodward's foundation, Louis and Mary Fieser introduced a systematic increment scheme to predict λmax for α, β-unsaturated carbonyls and conjugated dienes.26 Their approach assigned base values to core chromophores (see left column in Table 1) and increment values for substituents, distinguishing contributions according to type (e.g., alkyl, chloro, bromo, hydroxy, alkoxy, acyloxy) of substituent and for α, β-unsaturated carbonyl compounds also on the position of substituents (α, β, γ, or higher).26 Additional increments accounted for extended conjugation: linear double bonds (+30 nm), homoannular cyclodienes (+39 nm), and exocyclic double bonds (+5 nm).25,26 An overview of base values and increments of the WF rules is summarized in Table 1, with representative structures and calculation examples illustrated in Fig. 1, S2 and S3.
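The additive scheme described above can be sketched in a few lines of Python. This is a minimal illustration, not the ChromoPredict implementation: the dictionaries transcribe a subset of the original (non-parenthesized) enone values from Table 1, and the function name `wf_lambda_max` and the position labels are hypothetical.

```python
# Original Woodward-Fieser enone values in nm, transcribed from Table 1.
# Names and data layout are illustrative assumptions, not ChromoPredict's API.
BASE = {
    "aldehyde": 210, "ketone": 215, "acid": 195, "ester": 195,
    "cyclohexenone": 215, "cyclopentenone": 202,
}
FEATURES = {
    "conjugated_double_bond": 30,
    "exocyclic_double_bond": 5,
    "homoannular_cyclodiene": 39,
}
SUBSTITUENTS = {  # increment by position; "higher" means beyond gamma
    "alkyl": {"alpha": 10, "beta": 12, "gamma": 18, "higher": 18},
    "Cl": {"alpha": 15, "beta": 12, "gamma": 12, "higher": 12},
    "OH": {"alpha": 35, "beta": 30, "gamma": 50, "higher": 0},
}


def wf_lambda_max(base, substituents=(), features=()):
    """Sum the base value, structural-feature and substituent increments."""
    total = BASE[base]
    total += sum(FEATURES[name] for name in features)
    total += sum(SUBSTITUENTS[kind][pos] for kind, pos in substituents)
    return total


# Acyclic ketone with one alpha-alkyl and one beta-alkyl group: 215 + 10 + 12
acyclic = wf_lambda_max("ketone", [("alkyl", "alpha"), ("alkyl", "beta")])

# Cyclohexenone with two beta-alkyl groups and one exocyclic double bond:
# 215 + 12 + 12 + 5
cyclic = wf_lambda_max(
    "cyclohexenone",
    [("alkyl", "beta"), ("alkyl", "beta")],
    ["exocyclic_double_bond"],
)
print(acyclic, cyclic)
```

Because every term is a table lookup, each predicted value can be traced back to the individual base value and increments that produced it.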
As the study of molecular systems expanded, particularly in the context of natural pigments like β-carotene, the limitations of the original WF rules became evident. These rules, while effective for small chromophores, were not suited for extended polyene systems containing five or more conjugated double bonds. To address this gap, Louis Fieser and Harold Kuhn developed the Fieser–Kuhn (FK) rules specifically for linear polyenes with extended conjugation.27,28
Unlike the earlier chromophore-specific formulations, the Fieser–Kuhn rules adopt a parametric approach, allowing λmax to be estimated based on structural features that scale with conjugation length. The empirical model predicts λmax as
$$\lambda_{\max} = 114 + 5m + n\,(48.0 - 1.7\,n) - 16.5\,R_{\mathrm{endo}} - 10\,R_{\mathrm{exo}}$$
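As a worked illustration, the sketch below evaluates the Fieser–Kuhn expression in its standard textbook form, in which the conjugation term reads n(48.0 − 1.7n). In the usual reading, m counts the alkyl substituents or ring residues on the conjugated system, n the conjugated double bonds, and Rendo and Rexo the rings with endocyclic and exocyclic double bonds, respectively. The function name and the parameter counts used for β-carotene and lycopene are common textbook choices assumed here; they are not taken from this article.

```python
def fieser_kuhn_lambda_max(m, n, r_endo=0, r_exo=0):
    """Fieser-Kuhn estimate of lambda_max (nm) for a linear polyene.

    m      -- alkyl substituents / ring residues on the conjugated system
    n      -- number of conjugated double bonds
    r_endo -- rings with endocyclic double bonds in the conjugated system
    r_exo  -- rings with exocyclic double bonds
    """
    return 114 + 5 * m + n * (48.0 - 1.7 * n) - 16.5 * r_endo - 10 * r_exo


# Assumed textbook counts: beta-carotene (m=10, n=11, two endocyclic rings)
beta_carotene = fieser_kuhn_lambda_max(m=10, n=11, r_endo=2)
# Assumed textbook counts: lycopene (m=8, n=11, no rings)
lycopene = fieser_kuhn_lambda_max(m=8, n=11)

print(round(beta_carotene, 1))  # about 453.3 nm
print(round(lycopene, 1))       # about 476.3 nm
```

The quadratic term in n makes the predicted red shift per added double bond shrink as the chain grows, which is the main difference from the fixed +30 nm increment of the WF rules.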
C=C double bonds: [#6]=[#6]–[#6]=[#6]–[#6]=[#6]–[#6]=[#6]). The base chromophore is automatically detected from the input SMILES. Alternative formulations (e.g., the original WF rules, the extended rules by Kang and co-workers,17,33,34 or the refined rules introduced herein) can be explicitly requested through the chromlib parameter. For example, cp.predict(smiles="C=CC(=O)C", chromlib="woodward_extended") predicts the λmax of but-3-en-2-one (E01, Table S2) with the Kang extension.34
Stepwise example calculations illustrating base value assignment (Step 2), structural features (Step 3), and substituent increments (Step 4) are shown for representative α, β-unsaturated ketones and conjugated dienes in Fig. 1 and S2–S4.
Only a few studies have partially addressed this gap.33–36 Kang and co-workers33–35 proposed extended WF rules for enones, expressing λmax as a function of the number of substituents and exocyclic double bonds. Their study, however, was limited to 17 enones bearing only alkyl and O-acyl substituents, which contribute similar increments in the original formulation – explaining the observed linear relationship. As a result, the derived expression is applicable only within an even narrower chemical space. In another effort by Wathelet et al.,36 TD-DFT calculations were performed for 213 systematically generated α, β-unsaturated aldehydes, ketones, and acids with bromo, chloro, hydroxy, alkoxy, or methyl substituents. Although substitution patterns included mono- (α or β), di- (α, β or β, β′), and tri- (α, β, β′), only uniform substitution was considered. This precluded analysis of mixed substitution effects, such as the interplay of resonance and inductive contributions from methoxy and chloro groups.
Here, we extend these efforts by combining the 213 uniformly substituted molecules reported by Wathelet et al.36 with 435 compounds bearing mixed substitution patterns and 72 cyclic enones. From these, we assembled a comprehensive dataset of SMILES strings and corresponding ππ* absorption maxima obtained at the TD-DFT level of theory (Section 4.1), explicitly considering cis/trans isomers where applicable. This dataset was used to evaluate and refine the original WF rules for predicting λmax.
For each molecule, the base chromophore type, α- and β-substituents, and stereochemistry were extracted. Two approaches were analyzed: one in which stereochemical information was explicitly encoded in the base chromophore definition (e.g., cis-aldehyde), and another in which stereochemistry was not considered in defining the base structure. The corresponding refined increment values for the base chromophores and α/β-substituents are reported in Table 1 (parentheses) and summarized in Fig. S9 for the stereochemistry-explicit approach. Across all 720 reference compounds, the refined rules reduced the mean absolute error (MAE) to 8 nm, compared to 13 nm for the original WF formulation (see Fig. S6).
Fig. 2 shows violin plots of the prediction accuracy for the two refined schemes: following the original WF framework (purple) and with explicit stereochemical encoding (green), separated by compound class (α, β-unsaturated aldehydes, acids, ketones, as well as cyclopentenones and cyclohexenones). As evident, explicit inclusion of double-bond stereochemistry – omitted in the original WF rules – does not substantially improve accuracy, with both approaches yielding comparable MAEs (Fig. S6–S8). For example, the MAE across all trans-configured compounds is 7 nm following both approaches (Fig. S7).
This finding can be rationalized by the original rules' implicit treatment of stereochemistry. In Woodward's formulation, the β-substituent is defined as the group with the larger increment on the same side as the carbonyl group.23,25,26 Consequently, depending on the nature of the α-substituent, the resulting base values already represent a statistical mixture of cis and trans isomers. Any mismatch between the assumed and actual stereochemistry is therefore random rather than systematic. As such, explicitly encoding stereochemistry does not improve predictive performance, as this variability is already embedded in the empirical design of the original rules. Our global optimization including stereochemistry further shows that the base values of the isomers are nearly identical (e.g., 214 nm for trans-enones, 211 nm for cis-enones, and 215 nm otherwise), yielding an average of 213 nm that aligns closely with the stereochemistry-independent refined enone base value of 212 nm (cf. values in Table 1a and Fig. S9).
In summary, the digital implementation of the WF rules in cp enabled a systematic and efficient refinement, yielding improved predictive accuracy across the explored chemical space (cf. Fig. 3a). All subsequent analyses are therefore based on these refined WF increments (Section 3.2.2). The refined rules have been integrated into cp and can be accessed via the chromlib = 'woodward_refine' option in the cp.predict function (cf. Section 3.1).
To benchmark whether machine learning (ML) models trained on molecular fingerprints can reproduce the classic scenario described by Woodward and Fieser – where a chromophore embedded in a larger molecular framework retains local control over the absorption – we compared the refined WF rules with random forest (RF) regression models. RF was chosen based on previous studies demonstrating its effectiveness for predicting λmax from structural descriptors.17,20 For training and testing, we used 288 enones from TD-B3LYP calculations (see Fig. 3a), comprising mono-, di-, and tri-substituted acyclic enones, cyclopentenones, and cyclohexenones (Sections 3.2.1 and 4.1). The RF models were trained on 80% of this dataset using four distinct encodings: topological torsion fingerprints (TTFP, 2048 bits), feature Morgan fingerprints (FMFP, radius 2, 1024 bits), MACCS keys, and rooted fingerprints (RFP). The latter restricts TTFP generation to α, β-unsaturated carbonyls and their substituents, thereby paralleling the scope of the WF rules. To probe generalizability, we curated an inference set of 26 enones (E01–E26, see Fig. S10) from experimental data (see values in Table S2).35–38 These compounds feature fused rings and bulky substituents but no additional conjugated double bonds relative to the enone core and thus display λmax in the same range as the training and test data (see Fig. 3b).
On the enone test set, all RF models achieved mean absolute errors (MAEs) between 9 and 12 nm, comparable to the refined WF rules (MAE: 8 nm, see Fig. S11a–d). On the inference set, however, performance diverged: the RF models yielded MAEs of 16 nm (RFP and MACCS), 10 nm (TTFP), and 8 nm (FMFP), whereas the WF rules maintained a substantially lower MAE of 5 nm (see Fig. S11e–h). Thus, among purely fingerprint-based models, FMFP proved most robust (see Fig. 3c).
To assess whether explicit rule-based descriptors enhance fingerprint models, we augmented FMFP with WF-derived features (increments for exocyclic double bonds, α- and β-substituents, and the total number of substituents). This hybrid model reduced the MAE to 8 nm on the test set, but showed slightly decreased accuracy on the inference set (11 nm, see Fig. 3d and S12). Closer inspection revealed that for molecules E04, E06, E20, and E26, the hybrid model outperformed the models trained on FMFP alone. These systems bear bulky substituents (e.g., isopropyl or spiropyran groups) that do not contribute to the chromophore absorption, suggesting that fingerprint encodings of these groups introduced spurious correlations leading to underestimation of λmax. Notably, RF models trained exclusively on WF-derived features achieved superior accuracy for half of the inference molecules compared to the hybrid models (see Fig. 3d and S12), including accurate predictions (within ±5 nm) for bulky systems (E04, E06, E26) and α, β, β′-substituted cases (E18, E21 and E22). The latter underscores the importance of substituent-counting features, consistent with the extended WF rule formulations by Kang and co-workers.33–35
To contextualize the efficiency and accuracy of the WF predictions, we compared them against ab initio results. Structures of compounds E01–E26 were optimized at either the B3LYP39,40 or xTB41 level, followed by linear-response TD-B3LYP simulations to determine the λmax of the π–π* transition. These calculations required in total approximately 700–3000 CPU hours, highlighting the substantial computational cost compared to the near-instantaneous predictions of ChromoPredict (≈0 CPU hours). Despite their simplicity, the refined WF rules yielded mean absolute errors (MAEs) within the historical uncertainty reported by Woodward and Fieser (±5 nm).23,26 Random forest models trained on 230 data points achieved comparable accuracy, while incorporating WF-derived features further improved performance for systems with bulky or non-conjugated substituents. In contrast, TD-DFT-predicted λmax values exhibited broader error distributions, with MAEs of roughly 15 nm (see Fig. S14). These results underscore that explicit chromophore-based descriptors remain both computationally efficient and chemically interpretable, maintaining robustness in off-domain regimes and providing a valuable complement to data-driven and ab initio approaches.
The unsubstituted coumarin core (C01) displays the characteristic enone ππ* absorption maximum (λmax,1) at approximately 311 nm (generally between 300 and 330 nm) and a stronger benzenoid ππ* absorption band between 250 and 300 nm (λmax,2).42,43 This suggests that the WF-based predictions can be used to estimate λmax,1, but deviations due to ring conjugation and substitution patterns required a detailed analysis of the experimental trends:42,43 Substituents in the benzene ring (5-, 7-, or 8-position) with positive mesomeric effects (+M) generally induce bathochromic shifts of λmax,1, with the effect being most pronounced at the 7-position due to extended conjugation in the para-position relative to the enone. Substituents at the 6-position shift λmax,1 bathochromically regardless of electronic character, without substantially affecting λmax,2. Substituents at the α- and β-positions (3- and 4-positions) affect λmax,1 depending on their electronic properties: substituents that withdraw electron density from the carbonyl carbon through mesomeric or inductive mechanisms (−M or −I effect) induce bathochromic shifts, whereas electron-donating groups with a +M or +I effect lead to hypsochromic shifts due to destabilization of the π*-acceptor orbital. Steric interactions at positions 4 and 5 further modulate both bands, often resulting in hypsochromic shifts.
Guided by these trends, we focused on coumarins substituted exclusively at the 3-, 4-, and 6-positions, corresponding to the α-, β-, and higher substituent sites influencing the enone chromophore, while excluding substitutions at positions 5, 7, and 8, which introduce steric or extended conjugation effects not captured by standard WF increments. Applying these criteria, we constructed a dataset of 36 mono-, di-, and tri-substituted coumarins, combining experimentally reported λmax,1 values from the literature43–55 with corresponding TD-B3LYP predictions of their absorption maxima (see Tables S4, S5, Fig. S13, S14 and Section 4.2).
Using the experimental λmax,1 values of coumarins C01–C26 (Table S3 and Fig. S10), we refined the WF increments with the unsubstituted coumarin (C01) as the reference chromophore, defining the base structure (SMARTS: [#6]1=[#6][#6]=[#6]2[#6](=[#6]1)[#6]=[#6][#6]([#8]2)=[#8]) and base value (312 nm). To capture both hypsochromic and bathochromic shifts relative to C01, positive and negative contributions were allowed during global optimization, which was performed for substituents at the α (3-), β (4-), and higher (6-) positions, considering chloro, bromo, hydroxy, methoxy, and methyl groups (15 parameters in total). The refined increments are summarized in Table S5.
Fig. 4 displays the correlation between WF-predicted and experimental λmax,1 values, with an analogous pairplot and violin plots including the TD-B3LYP predicted values provided in Fig. S17. For the fitted compounds C01–C26, the mean absolute error (MAE) is 4 nm (purple filled circles in Fig. 4), while application to the ten unseen coumarins C27–C36 yields an MAE of 5 nm (purple triangles). This analysis demonstrates that the refined increments capture the dominant electronic effects of substituents on the coumarin chromophore. The largest deviations arise for hydroxy- and alkoxy-substituted coumarins at the 6-position (see structures C16, C31, C35 in Fig. 4), which are underrepresented in the training set (three hydroxy- and five alkoxy-substituted analogues). Notably, TD-DFT predictions also overestimate the bathochromic shifts induced by hydroxy or alkoxy groups at the 6-position (Tables S3 and S4).
In direct comparison with TD-DFT (cf. Fig. S17), the refined WF rules reflect important substituent effects with greater accuracy. For example, the rules assign increments of −1 and +3 nm for a hydroxy group in the α (3-) and β (4-) positions, respectively. This is consistent with the stronger destabilization of the carbonyl π* orbital in 3-hydroxycoumarin (C13, λexp: 310 nm) compared to 4-hydroxycoumarin (C12, λexp: 317 nm).43 In contrast, TD-B3LYP predicts maxima at 304 nm (C13) and 287 nm (C12), thereby inverting the experimental trend. A similar inversion occurs for methoxy substitution: TD-B3LYP predicts λmax values of 303 nm (C18) and 284 nm (C17) for the 3- and 4-methoxy derivatives, respectively. Thus, TD-B3LYP overestimates the +M effect of the methoxy and hydroxy groups on the enone moiety at the α-position and underestimates it at the β-position. By contrast, both WF and TD-DFT reproduce the experimental λmax trends for the structurally more complex coumarins C27–C36 (Table S4).
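The hydroxycoumarin comparison can be reproduced with a short calculation. The sketch below uses only numbers quoted in the text (base value 312 nm, hydroxy increments of −1 and +3 nm, and the experimental and TD-B3LYP maxima for C13 and C12). The variable and function names are illustrative, and because both compounds belong to the fitted set C01–C26, the small WF errors are expected.

```python
COUMARIN_BASE = 312  # nm, refined base value of unsubstituted coumarin (C01)
OH_INCREMENT = {"alpha": -1, "beta": 3}  # nm, hydroxy at the 3- and 4-position

# (label, position, experimental, TD-B3LYP), all in nm, as quoted in the text
cases = [
    ("C13 (3-hydroxycoumarin)", "alpha", 310, 304),
    ("C12 (4-hydroxycoumarin)", "beta", 317, 287),
]


def wf_coumarin(position):
    """Additive WF-style estimate for a mono-hydroxy coumarin."""
    return COUMARIN_BASE + OH_INCREMENT[position]


wf_values = [wf_coumarin(pos) for _, pos, _, _ in cases]
exp_values = [exp for _, _, exp, _ in cases]
td_values = [td for _, _, _, td in cases]

# Mean absolute errors against experiment for this two-compound subset
wf_mae = sum(abs(w - e) for w, e in zip(wf_values, exp_values)) / len(cases)
td_mae = sum(abs(t - e) for t, e in zip(td_values, exp_values)) / len(cases)

# Trend check: experiment places the 3-hydroxy maximum below the 4-hydroxy one
wf_keeps_trend = wf_values[0] < wf_values[1]
td_keeps_trend = td_values[0] < td_values[1]

for (label, _, exp, td), wf in zip(cases, wf_values):
    print(f"{label}: WF {wf} nm, exp. {exp} nm, TD-B3LYP {td} nm")
print(wf_mae, td_mae, wf_keeps_trend, td_keeps_trend)
```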
Overall, the error distribution of the TD-B3LYP predictions for C01–C36 is broader than that of the WF estimates, as reflected in the violin plots in Fig. S17 and an MAE of 9 nm. This is comparable to the MAE of the WF estimates (5 nm), which were fitted to C01–C26 and are therefore expected to show a smaller error. This demonstrates that the refined WF scheme reliably predicts λmax,1 for simple 3-, 4-, and 6-substituted coumarins. Moreover, it reproduces all experimental trends covered by C01–C36, including 3-/4-methoxy and hydroxy derivatives, for which TD-DFT gives incorrect estimates of the absorption energies.
A total of 720 molecules were generated: 216 per acyclic compound class (ketone, aldehyde, and carboxylic acid), and 36 per cyclic enone (see Fig. S5). For the acyclic compounds, all possible permutations of mono- (α or β), di- (α, β), and tri-substitution (α, β, β′) were systematically constructed. Where applicable, both cis/trans stereoisomers were explicitly included, specifically for β-, α, β-, and β, β′-substituted derivatives. For the cyclic enones, due to the presence of only one accessible β-position, mono- and di-substitution patterns were generated, covering α-, β-, and α, β-substitution. All possible combinations of the selected substituents across the available positions were enumerated and represented as isomeric SMILES, ensuring that E/Z-isomerism was captured.
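One reading of this enumeration that reproduces the reported counts is sketched below. It is an assumption rather than the authors' generation script: each position is allowed to carry hydrogen or one of the five substituents named earlier (methyl, methoxy, hydroxy, chloro, bromo), the two β-sites of acyclic compounds are treated as distinguishable (which is how cis/trans pairs enter), and the unsubstituted parent is counted. The substituent labels are placeholders, not SMILES fragments.

```python
from itertools import product

# Hydrogen plus the five substituents considered in the dataset (labels only)
SUBSTITUENTS = ["H", "CH3", "OCH3", "OH", "Cl", "Br"]
ACYCLIC = ["ketone", "aldehyde", "carboxylic acid"]
CYCLIC = ["cyclopentenone", "cyclohexenone"]


def enumerate_patterns():
    """Return (class, alpha, beta, beta_prime) tuples for all combinations."""
    patterns = []
    for cls in ACYCLIC:
        # alpha, beta and beta' are filled independently: 6**3 = 216 per class
        for alpha, beta, beta_prime in product(SUBSTITUENTS, repeat=3):
            patterns.append((cls, alpha, beta, beta_prime))
    for cls in CYCLIC:
        # only one accessible beta-position in the ring: 6**2 = 36 per class
        for alpha, beta in product(SUBSTITUENTS, repeat=2):
            patterns.append((cls, alpha, beta, None))
    return patterns


patterns = enumerate_patterns()
per_class = {cls: sum(1 for p in patterns if p[0] == cls)
             for cls in ACYCLIC + CYCLIC}
print(per_class)
print(len(patterns))
```

Under these assumptions the totals are 3 × 216 + 2 × 36 = 720, matching the dataset size stated above.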
For the resulting SMILES, 3D molecular geometries were generated using the Experimental-Torsion basic Knowledge Distance Geometry (ETKDG) method56 as implemented in RDKit, followed by geometry optimization using density functional theory (DFT) at the B3LYP39,40/def2-TZVP57/D4 (ref. 58 and 59) level of theory. Subsequently, time-dependent DFT (TD-DFT) calculations were performed to simulate the 10 lowest singlet excited states. The S1 and S2 states were assigned as nπ* and ππ* states, respectively. In accordance with the scope of the Woodward rules, only the vertical excitation energies of the S0 → S2 (ππ*) transitions were considered for further analysis. Nonetheless, information on the first five excitations and their oscillator strengths is available in the dataset provided with the ChromoPredict code on GitHub.32 All (TD-)DFT simulations were performed using the VeloxChem software.60
SMILES strings were converted to 3D structures using the ETKDG algorithm.56 Geometry optimizations were carried out at the B3LYP/def2-TZVP/D4 (ref. 40 and 57–59) level of theory. Vertical excitation energies and oscillator strengths of the lowest 10 singlet states were obtained from TD-DFT calculations at the same level of theory. In all calculations, solvent effects (ethanol, ε = 24.852) were modeled using the conductor-like polarizable continuum model (CPCM).61 Analyses focused on the lowest-energy absorption maximum with predominant ππ* character (S0 → S2 transitions).
Each enone molecule was encoded using three categorical descriptors: the base chromophore, the α-substituent, and the β-substituent. For coumarins, an additional descriptor was included to account for substitution at the 6-position, denoted by the categorical variable higher. To incorporate stereochemical influences, which are absent from the original formulation, we extended the base chromophore category to distinguish between aldehydes, ketones, and carboxylic acids, each further subdivided into cis, trans, or non-stereospecific configurations. Initial estimates for these base chromophore values were adopted from classical Woodward–Fieser increments (e.g., 210 nm for aldehydes), assigning equivalent starting values to all stereoisomeric variants.
All categorical features were one-hot encoded to preserve the independence of the base and substituent values and construct a design matrix X, with corresponding target values ŷ representing vertical excitation energies derived either from TD-DFT calculations (for enones) or from experimental absorption maxima (for coumarins). The model assumes a linear additive form, ŷ = X·x, where x is the vector of unknown coefficients representing the contributions of each feature (X). Parameter estimation was formulated as a minimization problem over the sum of squared residuals:
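Written out in the notation of the linear model above, a sum-of-squared-residuals objective consistent with this description would take the following form. This is a reconstruction under the assumption of an unweighted least-squares loss, with ŷ_i denoting the target value of molecule i and X_ij the one-hot entry for feature j:

```latex
\min_{x}\; L(x) \;=\; \sum_{i}\Bigl(\hat{y}_i - \sum_{j} X_{ij}\,x_j\Bigr)^{2}
```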
To retain the interpretability of rule-based additive values, the coefficient vector x was constrained to integer values through rounding within the loss function. Global optimization was carried out using the dual annealing algorithm as implemented in SciPy, which is well-suited for navigating non-convex parameter landscapes.
Parameter bounds were informed by chemical considerations. In enones, all substituents are known to induce bathochromic shifts, whereas in coumarins, both bathochromic and hypsochromic shifts can occur, depending on substituent type and position. Accordingly, for enones, base chromophore values were restricted to the range between 150 and 300 nm, hydrogen substituents were fixed to zero, and remaining substituent increments were allowed to vary between 1 and 70 nm. For the coumarins, the base chromophore value was fixed to 312 nm, hydrogen substituents were fixed to zero, and remaining substituent increments were allowed to vary between −70 and 70 nm.
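The fitting setup can be illustrated with a dependency-free sketch. It builds a one-hot design matrix, evaluates the squared-residual loss with coefficients rounded to integers, and applies the enone bounds stated above (150–300 nm for base values, 1–70 nm for substituents, hydrogen fixed to zero by giving it no column). To stay self-contained, it replaces dual annealing with a simple integer coordinate search, which is a swapped-in stand-in and not the optimizer used in this work. The toy data, starting values, and all function names are hypothetical.

```python
# Toy targets (nm) built from the original WF ketone values in Table 1;
# descriptors are (base chromophore, alpha-substituent, beta-substituent).
samples = [
    (("ketone", "H", "H"), 215),
    (("ketone", "alkyl", "H"), 225),
    (("ketone", "H", "alkyl"), 227),
    (("ketone", "alkyl", "alkyl"), 237),
]


def build_features(data):
    """Collect one-hot columns; hydrogen is fixed to zero and gets none."""
    names = []
    for (base, alpha, beta), _ in data:
        for name in (("base", base), ("alpha", alpha), ("beta", beta)):
            if name[1] != "H" and name not in names:
                names.append(name)
    return names


def design_row(descriptor, names):
    base, alpha, beta = descriptor
    active = {("base", base), ("alpha", alpha), ("beta", beta)}
    return [1 if name in active else 0 for name in names]


def loss(x, X, y):
    """Sum of squared residuals with coefficients rounded to integers."""
    xi = [round(v) for v in x]
    return sum((yi - sum(r * c for r, c in zip(row, xi))) ** 2
               for row, yi in zip(X, y))


def coordinate_search(X, y, x0, bounds, sweeps=50):
    """Stand-in optimizer: scan each coefficient over its integer range."""
    x = list(x0)
    for _ in range(sweeps):
        before = loss(x, X, y)
        for j, (lo, hi) in enumerate(bounds):
            x[j] = min(range(lo, hi + 1),
                       key=lambda v: loss(x[:j] + [v] + x[j + 1:], X, y))
        if loss(x, X, y) == before:  # no further improvement
            break
    return x


names = build_features(samples)
X = [design_row(d, names) for d, _ in samples]
y = [target for _, target in samples]
bounds = [(150, 300) if kind == "base" else (1, 70) for kind, _ in names]
x0 = [210 if kind == "base" else 1 for kind, _ in names]  # hypothetical start

fit = coordinate_search(X, y, x0, bounds)
print(dict(zip(names, fit)), loss(fit, X, y))
```

A coordinate search can stall in a coordinate-wise minimum on harder problems, which is one reason a global method such as dual annealing is the more robust choice for the full parameter set.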
Supplementary information is available. See DOI: https://doi.org/10.1039/d5dd00382b.
| This journal is © The Royal Society of Chemistry 2026 |