Open Access Article
Xiaoyu Fana,
Lin Guo
a,
Ruizhen Jiaa,
Yang Tiana,
Zhihao Yanga,
Weihao Lib and
Boxue Tian
*a
aMOE Key Laboratory of Bioinformatics, State Key Laboratory of Molecular Oncology, Beijing Frontier Research Center for Biological Structure, School of Pharmaceutical Sciences, Tsinghua University, Beijing, 100084, China. E-mail: boxuetian@mail.tsinghua.edu.cn
bDepartment of Statistics and Data Science, Tsinghua University, Beijing, 100084, China
First published on 20th June 2026
Artificial Intelligence (AI)-aided drug discovery is an active research field, yet AI models often exhibit poor accuracy in regression tasks for molecular property prediction, and perform catastrophically poorly for out-of-distribution (OOD) molecules. Here, we present MolRuleLoss, a substructure-substitution-rule-informed framework that improves the accuracy and generalizability of multiple molecular property regression models (MPRMs) such as GEM and UniMol for diverse molecular property prediction tasks. MolRuleLoss incorporates partial derivative constraints for substructure substitution rules (SSRs) into an MPRM's loss function. When using GEM models for predicting lipophilicity, water solubility, and solvation-free energy (using lipophilicity, ESOL, and freeSolv datasets from MoleculeNet), the root mean squared error (RMSE) values with and without MolRuleLoss were 0.587 vs. 0.660, 0.777 vs. 0.798, and 1.252 vs. 1.877, respectively, representing 2.6–33.3% performance improvements. We show that both the number and the quality of SSRs contribute to the magnitude of prediction accuracy gains obtained upon adding MolRuleLoss to an MPRM. MolRuleLoss yielded relative improvements on activity-cliff molecules and on a melting-point property-OOD task; however, the absolute prediction error in the OOD setting remained high, indicating that property-OOD generalization remains an open challenge. In a controlled molecular-weight extrapolation experiment, MolRuleLoss reduced the test RMSE of a GEM model from 29.507 to 0.007; because molecular weight is an exact linear function of atomic composition, this result should not be extrapolated to noisy experimental properties. We also provide a formal demonstration that the upper bound of the variation for property change of SSRs are positively correlated with an MPRM's error. Together, we show that using the MolRuleLoss framework as a bolt-on boosts the prediction accuracy and generalizability of multiple MPRMs, supporting diverse applications in areas like cheminformatics and AI-aided drug discovery.
Classification models, a subtype of MPPMs, can predict a compound's status in terms of various categories (e.g., whether a compound is toxic or non-toxic; whether it can pass the blood–brain barrier). Molecular Property Regression Models (MPRMs), which are another subtype of MPPM that often use Simplified Molecular Input Line Entry System (SMILES) string,16 molecular graph,17,18 and 3D coordinates19,20 as input, can predict a continuous numerical value (e.g., the solubility of a compound; the lipophilicity of a compound). While data scarcity is common in ADMET predictions, MPRMs have achieved high accuracy when using pre-training with large-scale datasets of unlabeled compounds followed by fine-tuning using datasets specific to desired target properties. For example, the GEM model uses both bond length and angle information for pretraining and the Uni-Mol model incorporates atomic distance information for pretraining.21,22
Given that MPRMs are now integral throughout chemometrics, the ability to improve their prediction performance would be greatly welcomed. However, DL-based MPRMs suffer from overfitting on small datasets.23 This issue manifests in two scenarios: for in-distribution (ID) data and for out-of-distribution (OOD) data.24–26 The ID overfitting scenario often relates to structurally similar molecules having large variance in properties, a situation termed “activity cliffs”.27,28 Common solutions for addressing activity cliffs include self-supervised pretraining,29,30 data augmentation,31,32 and adding domain knowledge.33,34 The OOD overfitting scenario can be summarized as poor predictive accuracy of DL-based MPRMs on OOD data vs. ID data.35,36 This performance gap is evident across diverse DL architectures and for various molecular property prediction tasks.37–39
Despite the strong promise of MPRMs, in practice it is still common for chemists to harness their practical domain knowledge when optimizing ADMET properties in a given real-world project. A typical human domain knowledge is heuristics that predict a potential property change upon substituting one substructure with another. For example, to enhance solubility, a chemist might substitute a hydrophobic group like a phenyl with a piperazine; that is, by assuming that the Δsolubility for phenyl → piperazine will predictably increase. Analogous to these heuristics, the concept of substructure substitution rules (SSRs) in chemoinformatics is formalized through defining specific transforms (or reaction rules) that dictate how molecular substructures can be interchanged while maintaining or altering desired properties. We speculated that the prediction performance of MPRMs could potentially be improved by incorporating SSRs.
In this study, we introduce MolRuleLoss, a substructure-substitution-rule- informed framework designed to enhance the accuracy and generalizability of MPRMs by incorporating constraints for SSRs into an MPRM's loss function. Recent molecular representation learning methods also incorporate inter-molecular substructure variation, such as MolCLR40 and MolMCL.41 In contrast to these representation-level contrastive frameworks, MolRuleLoss encodes MMPA-derived quantitative rules as derivative constraints directly within the supervised objective, offering a complementary formulation. MolRuleLoss boosts the accuracy of state-of-the-art (SOTA) MPRMs (GEM and UniMol) for diverse molecular property prediction tasks, including predicting lipophilicity, water solubility and solvation-free energy. We show that both the number and quality of rules contribute to the magnitude of prediction accuracy gains obtained upon adding MolRuleLoss to an MPRM. Moreover, MolRuleLoss improves the generalizability of MPRMs for “activity cliff” and OOD molecules. In an extreme scenario for molecular weight prediction addressing the OOD problem, MolRuleLoss reduces the root mean squared error (RMSE) of a GEM model from 29.507 to 0.007. Finally, upon considering the prediction performance gain data across the multiple tasks we examined in this study, we discovered an informative and practically useful trend—that model error increases along with increases in the STDmax value (i.e., property change variation)—which we formalized as our rule-variation error bound observation with a mathematical proof. Our study shows that adding the MolRuleLoss framework enables more accurate MPRM predictions without changing the original MPRM's architecture, like a bolt-on to support diverse MPRMs for a wide range of molecular property prediction applications in cheminformatics.
MolRuleLoss comprises an MSE term and an SSR term for quantifying property changes for the replacement of one molecular substructure with another (Fig. 1a). Lambda (λ) weighs the SSR term relative to the MSE term in the loss function (eqn (1)). Notably, our experience ultimately showed that λ should be adjusted according to the specific desired prediction task. The default value of λ in MolRuleLoss is 0.3. When implementing an SSR in the loss function, we relate the output value of a DL model, y, to the change in the number of xi and xj substructures (see eqn (3)–(5) in Methods). The n–D vector saves the number of substructures for each molecule, which is converted into SSRs automatically within the MolRuleLoss framework (Fig. 1a).
| Loss = LossMSE + λ × LossSSR | (1) |
![]() | ||
Fig. 1 Overview of the MolRuleLoss framework. Our MolRuleLoss framework leverages a set of substructure substitution rules (SSRs) derived from the matched molecular pair analysis (MMPA) method, seeking to enhance the accuracy of MPRMs for multiple prediction tasks. (a) MolRuleLoss comprises an MSE term and an SSR term for quantifying the property changes of SSRs. λ controls the relative importance of SSRs for MolRuleLoss's MSE term (eqn (1)). When implementing an SSR in the loss function of an MPRM, we relate the output value of a DL model, y, to the change in the number of xi and xj substructures (see eqn (3)–(5) in Methods). Extracted SSRs will be converted to rules automatically within the MolRuleLoss framework. (b) Examples of SSRs. The specific SSRs from the MMPA method are based on pairwise comparison of two molecules in the training dataset; these two molecules are identical except for a single substructure substitution. The structural change from A to B induces a property change (ΔP = P[R-A] − P[R-B]), like solubility or lipophilicity (log P). | ||
The specific SSRs we implemented in MolRuleLoss are based on pairwise comparison of two molecules in the training dataset (that is, excluded from the test set, seeking to prevent data leaking); these two molecules are identical except for a single substructure substitution. For instance, with a common molecular scaffold “R”, one molecule might have “A” attached (R-A) and another “B” (R-B) (Fig. 1b). The structural change from A to B induces a property change (ΔP = P[R-A] − P[R-B]), like solubility or lipophilicity (log
P), which yields a quantifiable rule for the substitution of substructure A with substructure B that corresponds to a property change (ΔP).
Considering that the property changes caused by substructure transformations vary across different molecules, we introduce an adaptive unit. For each rule, a learnable parameter vector with the same dimensionality as the molecular features is added. The dot product between the molecular feature vector and this learnable parameter vector serves as the adaptive unit, which is incorporated into the SSR term in the loss function. Mathematically, such SSRs can be modeled using partial derivatives, where the derivative of the molecular property with respect to the count of substructure A, minus the derivative with respect to substructure B, equals ΔP. Statistical insights, such as the average property change and its variability (defined by the STD) can be calculated (Fig. 1a). MolRuleLoss is a framework that can be applied as a bolt-on to diverse MPRMs.
P value. We divided this dataset into training, validation, and testing sets at an 8
:
1
:
1 ratio using the same scaffold split method employed for the development of previously reported MPRMs.49,50 We used GEM (Geometry-Enhanced Molecular Representation Learning) as our base model; GEM uses a purpose-built geometry-based graph neural network architecture and a set of dedicated geometry-level self-supervised learning strategies to capture molecular geometry knowledge.17 Note that the results we report are the average over five random seeds. For the test set, the RMSE of the GEM model with MolRuleLoss was 0.587, compared to 0.660 without it, indicating that MolRuleLoss improves the predictive accuracy of the GEM model (Fig. 2a).
We next trained UniMol and GEM models for predicting lipophilicity, water solubility and solvation-free energy (using the lipophilicity, ESOL, and freeSolv datasets from MoleculeNet), all with and without MolRuleLoss (Fig. 2b and c). For the GEM model on the three tasks, the RMSE values with and without MolRuleLoss were 0.587 ± 0.014 vs. 0.660 ± 0.010, 0.777 ± 0.018 vs. 0.798 ± 0.020, and 1.252 ± 0.130 vs. 1.877 ± 0.150, representing 2.6% to 33.3% performance improvements (P < 0.001 for the lipophilicity and freeSolv tasks; P < 0.05 for the ESOL task, Fig. 2b). Note that Uni-Mol is a pretrained model with the SE(3) transformer architecture that achieved SOTA performance in multiple MPPM contests.19 For the three tasks with the UniMol model, the RMSE values with and without MolRuleLoss were 0.555 vs. 0.603, 0.773 vs. 0.788, and 1.293 vs. 1.480 (P < 0.001 for all three tasks, 1.9% to 12.6% performance improvements; Fig. 2c). These results show that MolRuleLoss boosts the accuracy of multiple MPRMs for diverse molecular property prediction tasks. Importantly, using a randomly initialized GIN backbone without pretraining, MolRuleLoss reduced the lipophilicity test RMSE from 0.721 ± 0.016 to 0.686 ± 0.005 (paired t-test P = 0.015), indicating that the observed improvement is not contingent on GEM pretraining (Table S1). It bears emphasis that, given that UniMol and GEM are highly optimized MPPMs with recognized SOTA performance, the increases gained by adding MolRuleLoss would be expected to manifest as substantive improvements for the set of downstream applications (for example in ADMET optimizations) that use the property prediction values as their foundation.
Given that not all molecules in the testing set possess substructure transformations defined by our SSR set, we expect that the number and quality of rules (defined by STDmax, with a low value indicating high quality) will affect the accuracy of MPRMs with MolRuleLoss. To this end, we evaluated the influence of STDmax (at λ = 0.3) with GEM as the base model and using the Lipo dataset. The RMSE values were 0.607 and 0.734 for STDmax = 0.2 and STDmax = 0.6 (Fig. 2e). When STDmax is small, fewer rules are obtained; when STDmax is large, low quality rules are obtained. These results show that both the number and quality of rules affect the accuracy of MPRMs. The adaptive unit in MolRuleLoss introduces a single shared learnable matrix
(≈30 × 32 ≈ 103 parameters, ∼1% of the GEM backbone); a parameter-matched PM-GEM control yields nearly identical RMSE to GEM (0.658 ± 0.012 vs. 0.660 ± 0.010), indicating that the observed improvement arises from the rule-based constraint rather than additional model capacity (Table S2).
We first trained GEM models without MolRuleLoss at different data volumes, including 0.5 K, 5 K, 50 K, and 100 K (Fig. 3c). The RMSE values for GEM without MolRuleLoss on the 175 K test set were, respectively, 51.372, 46.700, 39.705, and 38.291. These results show that expanding training data dimensionality boosts the accuracy. We then trained GEM models with MolRuleLoss on the 175 K test set, and the RMSE values for GEM with MolRuleLoss were 49.082, 43.290, 38.748, and 37.422 (λ = 0.3, STDmax = 0.27) (Fig. 3c). These results show that MolRuleLoss consistently boosts the accuracy of GEM, even upon massively expanding the dimensionality of training data.
Considering that the SSR set derived from a small training dataset is relatively sparse (the number of SSRs for 0.5 K is 44, while the number of SSRs for 100 K is 175), we speculated that adding more SSRs from another source, e.g., from another dataset for the same property, might improve the prediction accuracy of MPRMs. This is conceptually analogous to “transfer learning”.52 To test this, we transferred 175 SSRs obtained with MolRuleLoss upon training with the MP100 K dataset to an MP0.5 K dataset. For the MP0.5 K dataset, the GEM model without MolRuleLoss yielded an RMSE of 51.372. When 44 SSRs extracted from MP0.5 K were applied through MolRuleLoss, the RMSE improved by 4.4% to 49.082. Using 175 MP100 K SSRs as constraints led to an even greater improvement, reducing the RMSE to 46.948, an 8.6% improvement over the GEM without MolRuleLoss (Fig. 3e). These results indicate that expanding the number of rules can improve the accuracy of MPRMs with MolRuleLoss. Notably, while data in drug discovery is generally expensive and time-consuming to obtain, our results show that using rules can apparently help address the data scarcity issue in ADMET optimization efforts.
log
P = 1.0, an order of magnitude difference). We then put one molecule from each pair into the training set and the other into the test set; the final test set comprised 290 molecules. The RMSE values of the GEM model with and without MolRuleLoss were 0.627 ± 0.018 and 0.674 ± 0.015 (Fig. 4c). These results demonstrate that MolRuleLoss improves the generalizability of an MPRM for the AC scenario.
Existing MPRMs frequently exhibit overfitting on OOD test sets for properties like permeability and binding affinity.53,54 Building on MolRuleLoss's demonstrated performance boost based on extracted SSRs, we investigated whether the SSRs learned and selected for ID training data are generalizable for OOD prediction tasks. We used the aforementioned MP prediction task, specifically using molecules with an MP lower than 231 °C (representing the top 10% highest MP molecules) for training and using molecules with an MP greater than 231 °C for the test set (Fig. 4d). To minimize the influence of SSR errors, we specifically investigated whether incorporating SSRs with low STD values (STDmax = 0.27) via MolRuleLoss enhances MPRM accuracy. The RMSE values for the GEM model with and without MolRuleLoss on the OOD test set were 93.189 and 98.959 (Fig. 4e). Note that although the model with MolRuleLoss did outperform the model without it, the performances of both were catastrophically bad for the OOD molecules. These results support that low-STDmax SSRs derived from a narrow training dataset enable an MPRM to “extrapolate” to a broader data scope.
As our low-STDmax rule set is neither complete nor deterministic (STDmax = 0), we sought to design a task in which the rules are complete and deterministic, to quantitatively explore potential error reduction. We here employed a molecular weight (MW) prediction, which is calculated deterministically from a molecule's SMILES string by summing atomic masses, making it an inherently error-free molecular property. MW prediction is distinct from the aforementioned lipophilicity, solubility and MP properties, which are determined using experimental measurements, and thus inevitably contain errors. We constructed an MW dataset comprising 2700 molecules (2160 for training and validation) by sampling 5 molecules per integer MW from 160 to 700 dalton55 (an MW range for drug-like molecules) from the ChEMBL dataset, with entries ranging from 160–600 dalton as the training dataset, but ranging from 600 dalton to 700 dalton for the test set (i.e., an OOD test set).
The RMSE value of the GEM model without MolRuleLoss was 29.507 ± 3.254, which is catastrophically high. The 66 SSRs derived from our MW dataset represents the difference between the atomic masses of elements (delta is an exact value, STDmax = 0.000), which are deterministic. We found that the RMSE of the GEM model with MolRuleLoss dramatically decreased, from 29.507 ± 3.254 to 0.007 ± 0.001, and the R2 improved from −0.050 to 1.000 (Fig. 5a and b). We also trained GEM models with and without MolRuleLoss at different data ratios, showing that much less data are required when using MolRuleLoss than without it (Fig. 5c). Additionally, when artificial noise is added to the atomic mass, which increases STDmax, we observe that the RMSE values for the model with MolRuleLoss increase nearly linearly with the noise, suggesting that STDmax defines the upper performance limit of our model (Fig. 5d). These results based on GEM models support that the challenges AI models currently face in handling OOD data can be addressed by acquiring SSRs from ID data and leveraging them for extrapolation to OOD contexts.
In theory, a test set of synthesizable organic compounds could be obtained via applying SSRs to the molecules from the training dataset (Fig. 6). Our rule-variation error bound observation establishes a theoretical foundation for the MolRuleLoss framework, formally justifying the relevance of SSRs and offering guidance for future efforts to leverage such rules to improve the performance of MPRMs. Briefly, this trend emphasizes the positive utility of adding SSRs under bounded-error conditions; moreover, there is a direct, consistent link between rule variation (STDmax) and model performance, offering a strong justification for using SSRs to improve model prediction accuracy. Specifically, a high STDmax value for SSRs of a given property indicates that an MPRM prediction for this property would have large errors56 (e.g., as observed with binding affinity).
In a melting point prediction task, we show that both the dimensionality and quality of rules each contribute to the magnitude of prediction accuracy gains obtained upon adding MolRuleLoss to an MPRM. Our results support that MolRuleLoss improves the generalizability of an MPRM for “activity cliff” and OOD molecules. In an OOD evaluation for MW prediction, we observed a dramatic reduction in RMSE, suggesting that MPRMs with MolRuleLoss can achieve accurate predictions when deterministic rules are included. We find that the STDmax values of the SSRs are positively correlated with the MPRM's error, and we provide a mathematical proof for this rule-variation error bound observation (Text S1). Under the assumptions of model accuracy |P(c) − F(c)| ≤ e and model stability |F(r(c)) − F(c)| ≤ e, the triangle inequality gives |P(r(c)) − P(c)| ≤ 3e. The constant 3 is tight in the worst case but likely conservative in practice; distributional refinements are left for future investigation. Ultimately, MolRuleLoss offers a framework that can be applied as a bolt-on to diverse MPRMs, supporting a wide range of applications including data scarcity scenarios in ADMET optimizations.
In principle, the MolRuleLoss approach for boosting prediction performance should be applicable for other molecular properties, such as permeability and binding affinity. Domain-knowledge integration methods operate at distinct stages of molecular learning pipelines. MolRuleLoss complements data augmentation, multitask learning, and contrastive-pretraining approaches by acting directly at the supervised objective level, and further improves activity-cliff prediction when combined with MolCLR and MolMCL (Table S3). While advances such as AlphaFold3 (ref. 57) now enable prediction of protein-ligand complex structures, drug discovery efforts still require reliable ranking of candidate small molecules by binding affinity.58,59 Data augmentation strategies, e.g., increasing data volume and adding synthetic data, have been shown to offer improvements in accuracy for MPRMs.60,61 However, data augmentation strategies do not necessarily improve the generalizability involving property-OOD molecules.35,37–39 Rule constraints seem likely to improve the generalizability of MPRMs for property predictions of OOD molecules in permeability and binding affinity tasks. Beyond GEM and UniMol, on the lipophilicity activity-cliff benchmark, integrating MolRuleLoss into MolCLR (GIN backbone) and MolMCL (GPS backbone) further reduces the 5-seed test RMSE in both cases (MolCLR: 0.649 ± 0.032 → 0.629 ± 0.041; MolMCL: 0.664 ± 0.016 → 0.645 ± 0.033), demonstrating that substructure-aware contrastive pretraining and SSR-informed loss constraints provide complementary, rather than redundant, inductive biases (Table S4). We anticipate that MolRuleLoss will likely improve generalizability when added onto other MPRMs such as VisNet and attentive FP.20,62
Our work highlights several directions for enhancing the accuracy of MPRMs through rule-based partial derivative constraints. We have so far focused on SSRs derived from the MMPA method, and the number of rules is limited when the training dataset size is small. The effectiveness of MolRuleLoss depends on the availability and reliability of SSRs. On the melting-point task, transferring 175 SSRs from MP_100 K to MP_0.5 K improved RMSE by 8.6%, compared with 4.4% using only the 44 SSRs extracted in-domain. Future work may explore the generation or learning of additional SSR-like constraints from larger datasets or model-assisted substructure searches. Future extensions could incorporate alternative rule sources. Notably, while we have currently focused solely on pairwise substructure substitution, the MolRuleLoss framework should accommodate other rule formats, for example those involving substitution of more than one substructure. Beyond the domain of molecular property prediction, MolRuleLoss may be applied to other regression tasks (e.g., oral bioavailability). Finally, MolRuleLoss may be applicable to molecular entities including proteins or nucleic acids. To achieve this expanded application scope, one could define substructures of proteins and use AI models to learn potential SSRs from a supervised model.
P, n = 4200), ESOL solubility (log
S, n = 1150), and FreeSolv hydration free energies (ΔG_solv, n = 650). Molecular structures were standardized using RDKit (v2022.09)63 through a multi-step pipeline that included charge neutralization, salt removal, and aromaticity normalization to Kekulé form. To rigorously evaluate out-of-distribution (OOD) generalization, we implemented two complementary splitting strategies. For scaffold-based OOD evaluation, molecules were clustered by Bemis-Murcko scaffolds using Butina clustering with a Tanimoto similarity cutoff of 0.7 based on Morgan fingerprints (radius 2), with the 10% least populated clusters reserved for testing to maximize structural novelty. For property-based OOD evaluation, molecules were sorted by target property values, with the top and bottom 10% extremes forming the test set to simulate prediction challenges for property outliers. To specifically test extreme extrapolation capabilities, we constructed a dedicated molecular weight prediction dataset from ChEMBL v30,64 spanning 160–700 Da with uniform sampling, where the test set exclusively contained molecules >600 Da (n = 300) and inputs were restricted to counts of 12 common atoms (H, C, N, O, F, P, S, Cl, Br, I, B, Si).
532 records. For compounds with multiple valid measurements, the value closest to the median was retained to preserve data integrity and patent traceability. Molecules failing descriptor calculation were excluded. Experimental accuracy was estimated at σ = 35 °C based on 18
058 duplicate measurements, accounting for polymorphism and extraction variances. For the drug-like space (50–250 °C, covering >90% of compounds), error refined to σ = 32 °C. The final curated dataset provides a diverse MP collection for predictive modeling. Overlap with established benchmarks was removed to ensure test set independence.
![]() | (2) |
This generated a symmetric similarity matrix, subsequently converted to a distance matrix via D = 1 − Tc.
Molecules were ranked by their number of neighbors (compounds within a similarity threshold). The molecule with the most unassigned neighbors was selected as a centroid, and all its neighbors were assigned to the new cluster. These members were then excluded from future centroid selection or cluster membership. Remaining unassigned molecules were classified as singletons. This algorithm efficiently produces clusters with high internal structural consistency.
P, log
S, ΔG_solv). The core innovation of PINNs lies in their composite loss function, which simultaneously optimizes both data fidelity and physical consistency.
Two state-of-the-art architectures served as baselines for molecular representation learning. The Geometry-Enhanced Molecular GNN (GEM)17 processes 2D molecular graphs through an 8-layer Graph Isomorphism Network (GIN) with 32-dimensional hidden states, using atom features that encode atomic number, degree, hybridization, implicit valence, and formal charge. Pretrained via self-supervised bond length and angle prediction tasks, GEM employs a two-layer MLP (32 → 32 → 1) with dropout (p = 0.1) for property regression. In contrast, UniMol (Universal Molecular Transformer)19 operates on 3D conformers generated using RDKit's ETKDG method with MMFF94 optimization. This SE(3)-equivariant Transformer architecture uses 512-dimensional embeddings with 8 attention heads and rotary positional embeddings, pretrained on 209 million molecular conformations from QM9 and GEOM datasets. For property prediction, UniMol reduces dimensions through a 512 → 128 → 1 MLP with LayerNorm. Both models were implemented in PyTorch with mixed-precision training on NVIDIA A100 GPUs.
log
P = −0.82 ± 0.21) and –OH → –NH2 (Δ
log
S = +0.47 ± 0.33). The MolRuleLoss function combines conventional prediction error with a rule consistency term.
For a rule r = (x_i → x_j), let Δ_r denote the mean property shift obtained from matched molecular pairs, and let h_m represent the molecular graph embedding of molecule m. The adaptive rule term is defined as
| Δradaptive(m) = Δr + hmTθxi→xj | (3) |
is a shared learnable parameter matrix indexed over the substructure vocabulary (|V| is the substructure vocabulary size, d is the embedding dimension). The rule residual for molecule m is then expressed as
![]() | (4) |
The SSR objective is computed as
![]() | (5) |
| Loss = LossMSE + λ·LossSSR | (1) |
Recognizing that the impact of identical substructural substitutions on molecular properties can be significantly modulated by their distinct chemical environments, we introduce an adaptive rule refinement algorithm. This approach addresses the critical limitation of conventional matched molecular pair (MMP) analysis, wherein static transformation rules fail to account for context-dependent effects on property changes.
Our methodology dynamically adjusts rule-based predictions by integrating learned molecular representations with the specific contextual environment of each molecule. The implementation proceeds as follows: for a given substructural transformation rule with an initial predicted property change value Δ, we generate a high-dimensional molecular representation vector h using a pre-trained molecular graph neural network (GNN). This representation comprehensively encodes the structural and electronic features of the entire molecular scaffold surrounding the substitution site.
Simultaneously, we initialize a learnable parameter vector θ of identical dimension to the molecular representation. The adaptive adjustment term is computed as the dot product between the molecular representation vector and the parameter vector:
| δ = h·θ | (6) |
This term is then added to the base rule value to yield the context-sensitive prediction:
| ΔAdaptive = Δ + δ | (7) |
During the model training phase, both the molecular representation network (if fine-tuned) and the adaptive parameter vectors are optimized jointly using backpropagation to minimize the difference between predicted and true property changes. This learning process enables the model to capture systematic variations in rule applicability across different chemical environments.
Upon convergence, the learned parameter vectors become fixed and are utilized in all subsequent predictions. This approach effectively transforms static, context-agnostic rules into dynamic, adaptive rules that respond to the specific chemical environment of each molecular transformation, thereby significantly enhancing prediction accuracy for complex molecular systems.
log
P > 1.0) and molecular weight extrapolation (train: 160–600 dalton, test: 600–700 dalton). MolRuleLoss-specific hyperparameters (λ and STD_max) were optimized using a standard Optuna TPE search on the train/validation split while holding out the test set for final evaluation; the released script src/search_lipo_gin.py provides a reusable automated search pipeline for new datasets. Full reproducibility is ensured through publicly available code, datasets, and configuration files that document all preprocessing steps, model implementations, and hyperparameter choices.
| m(l)u→v = MSG(l)(h(l−1)u, h(l−1)v, euv) | (8) |
| M(l)v = AGGREGATE(l)({m(l)u→v:u ∈ N(v)}) | (9) |
We employ sum aggregation followed by a multilayer perceptron (MLP), similar to the Graph Isomorphism Network (GIN) approach:
![]() | (10) |
Update: the node representation is updated through a COMBINE function:
| h(l)v = COMBINE(l)(h(l−1)v, M(l)v) = MLP(l)(a(l)v) | (11) |
The model consists of 8 GNN layers with hidden dimension 32, providing sufficient depth for capturing complex molecular patterns. Layer normalization applied before each message passing step. Graph size normalization to address variable graph sizes. Residual connections to facilitate gradient flow in deep architectures.
For graph-level prediction tasks, we generate a global graph representation through a readout function that aggregates all node representations from the final GNN layer:
| hG = READOUT({h(L)v:v ∈ V}) | (12) |
We employ global mean pooling as the readout function, resulting in a 32-dimensional graph embedding.
To enhance predictive performance, we integrate domain-specific molecular substructure features with the learned graph representation. Let fsubstructure denote the m-dimensional vector encoding the counts of various molecular substructures. The final integrated representation is obtained by concatenation:
| hfinal = Concat(hG, fsubstructure) | (13) |
This combined representation is then processed through a prediction head consisting of a 3-layer MLP with hidden size of 128:
| ŷ = MLPprediction(hfinal) | (14) |
Compounds in SMILES format were processed using the mmpdb fragmentation algorithm, which cuts 1–3 non-ring bonds to decompose structures into a constant core fragment and variable R-group fragments. This captures all possible matched molecular pairs (MMPs). Fragmented representations were indexed to identify compound pairs sharing identical constant cores but differing in one variable fragment. This generated a database of structural transformations, with each transformation cataloged alongside its experimental property measurements. Each transformation was characterized by its rule environment using two representations: (1) SMARTS patterns (RDKit circular fingerprints, radius 2) for machine-readable environment definitions; (2) Pseudo-SMILES strings for human-interpretable representations.
This enables differentiation of transformations across distinct chemical contexts. For each transformation–environment combination, we computed mean property change, standard deviation, standard error, and fraction of pairs exceeding property change thresholds. Transformations with ≥10 matched pairs and well-defined distributions (low standard error) were retained as predictive rules.
Code availability: the code used for this paper can be found in the Github repository at https://github.com/fanxiaoyu0/MolRuleLoss.
Supplementary information (SI): theoretical proof of substructure substitution rule error bounds (Text S1), a comprehensive out-of-distribution analysis for molecular property prediction (Text S2), three supporting figures illustrating dataset distribution and model generalization, and four benchmark tables comparing multiple graph learning models with and without MolRuleLoss on lipophilicity and activity-cliff tasks. See DOI: https://doi.org/10.1039/d6sc01813k.
| This journal is © The Royal Society of Chemistry 2026 |