Open Access Article
Peter Fichtelmann a and Julia Westermayr *ab
aWilhelm-Ostwald Institute of Physical and Theoretical Chemistry, Leipzig University, Linnéstraße 2, Leipzig 04103, Germany
bCenter for Scalable Data Analytics and Artificial Intelligence Dresden/Leipzig, Humboldtstraße 25, Leipzig 04103, Germany. E-mail: julia.westermayr@uni-leipzig.de
First published on 14th May 2026
Predicting olfactory perception directly from molecular structure is central to product design in a wide range of industries, such as perfumery, food and beverage, and health care. Among olfactory attributes, odor strength is a key factor in shaping odor perception, but its modeling has been impeded by scarce and fragmented intensity data. In this work, we introduce an ordinal odor strength data set of more than 2300 molecules by integrating two different public sources, mapping structures to odorless, low, medium, and high categories. Across several molecular encodings and supervised learning algorithms we compared different prediction strategies. Dimensionality reduction and SHAP analysis identified molecular shape, size and polarity as primary drivers, consistent with mass-transport constraints on volatility, sorption, and receptor access. This scalable ordinal framework enables reliable odor-strength estimation for novel molecules and provides a foundation for in silico fragrance design.
Fragrance perception rests on the fact that the chemical space of odorous compounds is inherently restricted, with mass transport largely determining whether a molecule is odorous or not.2 To be perceived, molecules must be sufficiently volatile to evaporate, travel through the nose, and reach the olfactory epithelium. Yet they must also possess the right balance of polarity and hydrophobicity to traverse the mucous layer and interact with olfactory receptors to trigger an olfactory receptor neuron. This process is illustrated in Fig. 1.
While simple rules delineate which molecules can be perceived as odorous at all,2 no comparable general rules with reliable predictive performance exist for more complex qualities of smell, such as odor similarity, intensity, or character, the latter often being described with words like fruity, floral, or rose. Addressing these high-dimensional structure-perception relationships requires data-driven approaches, and has so far motivated the application of machine learning to predict odor similarity3 or character.4–15 For example, Lee et al.4 developed a principal odor map based on a message-passing neural network. The model was trained on odor character descriptions and generalized to odor thresholds and odor similarity. Sisson et al.15 extended this approach to binary perfumery blends. As these studies show, the predominant focus in this direction is on the character of the odor rather than its strength, even though odor intensity is a decisive factor in the perception of odor. This imbalance is also reflected in the available data: data sets containing descriptive language for thousands of odorants exist, such as Good Scents16 or Leffingwell,17 whereas only a few studies recorded odor intensity-perception data, covering fewer than 600 different substances in total.18–24
Given the scarcity of intensity data, current state-of-the-art approaches are simple models that approximate the psychophysical intensity curve of an individual odorant. Examples are linear (odor values),25 exponential (Stevens' law),26 or parabolic (e.g. Hill's model)27 approaches. However, such models face several limitations, such as limited accuracy at high odorant concentrations because receptor saturation is not modeled (linear and exponential) and a reliance on highly variable odor detection thresholds (linear and exponential).28 Moreover, detection thresholds do not necessarily reflect perceived intensity.29 Recent work has begun to address these issues by predicting parabolic psychophysical curve parameters and extending these predictions to mixtures, using a set of 62 distinct molecules.30
To overcome the scarcity of data and allow for machine learning training on odor strength, we introduce the first odor strength data set containing over 2300 molecules that allows for generalization across a variety of odor components. To this end, data from two different sources, namely the Good Scents Company16 and PubChem,31 were curated and combined. Using these data, we further investigated the capacity of different descriptors and regressors to predict the odor strength. To our knowledge, these are the first machine learning-based models to predict odor strength categories as an estimate of the odor intensity from molecular structures. An overview of the process is illustrated in Fig. 2.
The total data set contains 2393 molecules. The odor strength distribution for each category, including representative example molecules, is illustrated in Fig. 3a. Each block in the figure corresponds to 50 data instances. The exact counts are provided in the supplementary information (SI) in Table S1. Orange blocks represent data from Good Scents and green blocks indicate data from PubChem. Combining the two sources was intended to balance the categories for learning, as Good Scents contains mainly medium and high odor strength molecules while PubChem contains mainly low odor strength or odorless compounds. The datasets showed high inter-annotator agreement. We calculated two established chance-corrected metrics: (1) a quadratically weighted Cohen's kappa33 of 0.83 on the intersection of the Good Scents and PubChem datasets and (2) a Krippendorff's alpha34 of 0.81 on the entire datasets. The latter metric considers the label probability across the entire datasets and not only across the intersection. Both metrics range from 0 to 1, where 1 is perfect agreement. Consequently, the risk of major subjective inconsistencies is low and we consider our keyword-mapping approach sufficiently reliable. Despite combining sources to improve balance, medium intensity remains the majority class and low intensity accounts for 12% of the total, which reflects the underlying availability of annotated strength labels rather than curation bias. A more balanced odor strength distribution is expected to increase the robustness of the trained model's performance. Additional descriptor repositories (such as those from Leffingwell17 or Thiboud35) were not merged because they provide odor character and performance notes but little to no odor strength descriptions, making harmonization across ordinal categories infeasible without unverifiable assumptions.
Similarly, psychophysical intensity data sets were excluded because their ratings are explicit functions of concentration, solvent, and panel protocol; mixing them without a shared concentration scale or covariate model would confound structure-perception relationships and inject systematic bias into ordinal labels.18–24 This conservative choice defines a single-label, concentration-agnostic ordinal task anchored in molecular structure, while deferring integration of concentration- and solvent-explicit studies to future work where dilution and solvents can be modeled as covariates informed by established psychophysical laws of odor intensity.
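The quadratically weighted Cohen's kappa used for the agreement check can be sketched in a few lines of plain Python. This is a minimal illustration and not the authors' code: the integer encoding (0 = odorless, 1 = low, 2 = medium, 3 = high), the function name, and the toy label lists are assumptions made here.

```python
from collections import Counter


def quadratic_weighted_kappa(labels_a, labels_b, n_categories=4):
    """Cohen's kappa with quadratic weights for ordinal labels 0..n_categories-1.

    Assumed encoding (not from the source): 0 odorless, 1 low, 2 medium, 3 high.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    joint = Counter(zip(labels_a, labels_b))  # observed label pairs
    marg_a = Counter(labels_a)                # label counts of source A
    marg_b = Counter(labels_b)                # label counts of source B
    max_dist = (n_categories - 1) ** 2

    observed = 0.0  # weighted observed disagreement
    expected = 0.0  # weighted disagreement expected by chance
    for i in range(n_categories):
        for j in range(n_categories):
            weight = (i - j) ** 2 / max_dist
            observed += weight * joint[(i, j)] / n
            expected += weight * marg_a[i] * marg_b[j] / (n * n)
    if expected == 0:
        raise ValueError("kappa is undefined when both sources use a single label")
    return 1.0 - observed / expected


# Toy labels for four molecules shared by two hypothetical sources.
source_a = [0, 1, 2, 3]
source_b = [0, 1, 2, 2]
kappa = quadratic_weighted_kappa(source_a, source_b)
```

Because the weights grow with the squared distance between categories, confusing "high" with "medium" is penalized far less than confusing "high" with "odorless", which suits ordinal labels.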
Fig. 3 Data set representations. (a) Amount of data for each odor strength. Each square corresponds to about 50 data instances. For each odor strength category, an example molecule is shown. The table with values is shown in the SI in Table S1. (b) 2D PCA of the RDKit descriptors of our curated data set colored by their odor strength and the odorous background data set consisting of 52 457 molecules (grey) obtained from a downsample of the GDB-17 database32 with a predicted odor probability of 50% or more according to the best-performing model from Mayhew et al.2 Glucagon is not shown for better visibility. A PCA including Glucagon is provided in Fig. S2a in the SI.
Potentially one of the most critical concerns in most odor literature data is the purity of fragrance chemicals. Impurities can alter the odor even in very low concentrations.36 A recent study by Mayhew et al.2 using gas chromatography-olfactometry reported that 22% of the supposedly odorous molecules investigated were, in fact, odorless. We compared the 70 molecules shared between the GC-analyzed substance samples of that study and the compounds of our dataset. The counts of the label pairs are shown in Fig. S1 in the SI. Only 2 of the 57 molecules labelled as odorous by Good Scents/PubChem were odorless according to Mayhew et al.2 This corresponds to 3.5% misclassifications, compared to the reported 22% odorless rate.
To characterize how curated molecules populate an odorous chemical space, the data set was embedded with principal component analysis (PCA)37,38 on RDKit descriptors,39 with projections shown in Fig. 3b. RDKit descriptors, in this case, comprised 217 structural, physicochemical, and topological parameters of molecules, such as molecular weight, octanol–water partition coefficient (log P), or the number of heteroatoms. PCA maps this correlated descriptor space to orthogonal principal components ranked by explained variance, enabling faithful visualization on a reduced set of axes for inspection. For context, an odorous background of 52 457 compounds was constructed by downsampling the GDB-17 database and retaining molecules with predicted odorous probability larger than 50% according to the best-performing model of Mayhew et al.2 Comparable qualitative structure is recovered when (i) excluding the odorous background, (ii) substituting UMAP (Uniform Manifold Approximation and Projection for Dimension Reduction),40 a non-linear dimensionality reduction method, for PCA, and (iii) swapping RDKit descriptors for circular fingerprints (Morgan bit- and count-based), with results of all alternatives reported in the SI in Fig. S2–S4. As can be seen from Fig. 3b, some odorless and low-strength entries fall outside the background space, consistent with mass-transport constraints that bound olfactory space. Coverage is broad but not uniform; PubChem- and Good Scents-derived entries concentrate in specific areas, reflecting the data set's deliberate emphasis on perfumery-relevant chemotypes. The PCA and UMAP representations visualized by data source are shown in the SI in Fig. S4.
Another salient pattern is that odor strength categories do not resolve into distinct clusters but instead overlap substantially in the descriptor space. This makes conventional clustering algorithms, which do not use labels but group data solely by input features, unsuitable for separating molecules by their odor strength. We verified this assumption by evaluating several clustering algorithms, including K-means,41 Gaussian mixture models,42 density-based spatial clustering of applications with noise (DBSCAN),43 spectral,44,45 and agglomerative46 clustering. No groups corresponding to odor strength were formed. Additional information can be found in the SI in Section S1.3, including evaluation metrics (adjusted Rand index, normalized and adjusted mutual information) in Table S2 and the best clustering result in Fig. S7.
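To make the clustering evaluation concrete, the adjusted Rand index mentioned above can be computed from pair counts alone. The following is a minimal standard-library sketch with a hypothetical function name and toy labels; it illustrates the metric and does not reproduce the analysis in Section S1.3.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Chance-corrected agreement between two partitions of the same samples."""
    n = len(labels_true)
    if n != len(labels_pred) or n < 2:
        raise ValueError("need two label lists of equal length >= 2")
    # Pairs of samples that share a cell in the contingency table.
    same_in_both = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    same_in_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    same_in_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())

    expected = same_in_true * same_in_pred / comb(n, 2)
    maximum = 0.5 * (same_in_true + same_in_pred)
    if maximum == expected:
        return 1.0  # degenerate case: both partitions are trivial and identical
    return (same_in_both - expected) / (maximum - expected)


# Toy example: odor strength labels versus cluster assignments.
strength = [0, 0, 1, 1]
clusters = [0, 0, 1, 2]
ari = adjusted_rand_index(strength, clusters)
```

A value near 0 means the clusters match the odor strength labels no better than chance, while 1 means the two partitions are identical up to a renaming of cluster ids.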
To analyze which features are most important for data separation, the feature importance of the principal components was analyzed. The corresponding PCA loadings indicate that the first principal component is primarily influenced by features that describe molecular weight, size, and connectivity. The second principal component reflects contributions from heteroatoms and polarity. The exact contributions of the top 15 features to the different principal components can be found in the SI in Tables S3 and S4. These features are major factors governing mass transport of molecules from odor source to olfactory receptors. This observation aligns with findings of Mayhew et al.,2 who showed that mass transport is key for determining whether molecules are odorous or not. Consequently, the chemical space occupied by odorous molecules becomes progressively narrower as odor strength increases, which reflects the tighter constraints imposed by volatility and polarity. Further visualization plots underlying this claim can be found in the SI in Fig. S5 and S6.
To predict odor strength based on molecular representations, two distinct modeling strategies were tested: First, we used a “direct approach”. For this, we defined four classes, i.e., odorless, low, medium, and high odor strength, followed by training a model on all of these classes. Second, we separated odorless molecules from odorous substances. This approach then requires two predictions, one that decides whether a molecule smells or not and the second should categorize the strength of odor (low, medium, high). Both strategies were evaluated to test complementary hypotheses about the structure-to-perception pathway: a single-task ordinal learner might best capture global trade-offs across all classes, while a hierarchical, two-step pipeline could exploit different molecular features for mass transport first and for receptor interaction second.
Five regression algorithms were tested, in particular classical logistic regression,48 random forest,49 extreme gradient boosting (XGBoost),50 multi-layer perceptrons (MLP),51,52 and consistent rank logits (CORAL),53 an MLP architecture designed for ordinal regression. Each algorithm was tested with seven widely used molecular encoding strategies to represent a molecule. This breadth allowed us to assess the predictive performance in fundamentally different feature spaces. We calculated structural, topological, and physicochemical descriptors (RDKit descriptors39) and employed classical fingerprints that highlight subtle structural differences. These included the circular substructure Morgan fingerprint,54,55 the predefined substructure MACCS keys fingerprint,56 and several topological fingerprints that capture sequences of atom connectivity for longer-range relationships, such as the RDKit fingerprint,39 the topological torsion fingerprint,57 and the atom pair fingerprint.58 Beyond these classical approaches, we evaluated more recent representation learning techniques. ChemBERTa-2,59 a language model pretrained on 77 million SMILES strings, provided data-driven embeddings. Additionally, graph-based encodings using message passing neural networks have recently demonstrated promising results in predicting odor character4 and similarity.3 Consequently, we applied ChemProp,60,61 a framework for message passing neural networks. Recognizing the potential of pretraining to improve downstream model performance,62 we further tested CheMeleon,63 a foundational ChemProp model pretrained on Mordred descriptors of 77 million molecules. All model hyperparameters were optimized using 10 times repeated 10-fold cross validation. The high number of repetitions reduced the impact of noise in the data. The hyperparameters that were optimized are provided in the SI in section S2 (Tables S5–S14) for the different models used.
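As an illustration of the repeated cross-validation scheme, the sketch below generates train and validation indices for repeated k-fold splitting. It assumes plain shuffled folds without stratification, which the text does not specify; the function name and the seed handling are likewise hypothetical.

```python
import random


def repeated_kfold_indices(n_samples, n_splits=10, n_repeats=10, seed=0):
    """Yield (repeat, fold, train_indices, validation_indices) tuples.

    Assumption: unstratified folds, reshuffled once per repeat.
    """
    if not 2 <= n_splits <= n_samples:
        raise ValueError("n_splits must be between 2 and n_samples")
    rng = random.Random(seed)
    for repeat in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        # Every n_splits-th index goes into the same fold, so fold sizes differ by at most one.
        folds = [indices[k::n_splits] for k in range(n_splits)]
        for k, validation in enumerate(folds):
            train = [i for j, fold in enumerate(folds) if j != k for i in fold]
            yield repeat, k, train, validation


# Example: 10 x 10-fold splitting yields 100 train/validation pairs.
splits = list(repeated_kfold_indices(n_samples=200))
```

Averaging a hyperparameter setting's score over all 100 validation folds reduces the influence of any single lucky or unlucky split.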
As the evaluation metric, we chose the macro-averaged mean squared error (macro MSE), defined as the average of the MSEs computed for each odor strength category. The equation is provided in the Computational Details section in eqn (1). The macro MSE penalizes larger errors more heavily while equally weighting all odor strength categories, regardless of how many samples they contain. In contrast, the common micro MSE is calculated globally across all classes and is dominated by the majority classes. We report further metrics (micro MSE, F1 macro, F1 micro/accuracy and receiver operating characteristic area under the curve (ROC AUC, measuring a model's ability to distinguish between classes across all thresholds)) on the test set for the direct approach in Table S17 and for the indirect approach in Table S18 in the SI. The predictions of the indirect model steps were combined as follows: (1) the first model predicts whether the compounds are odorous or not, (2) the second model predicts the odor strength of only those compounds that were predicted as odorous in the previous step. Error metrics are computed based on the final results.
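The combination rule for the indirect approach can be sketched as follows. The two toy models are arbitrary placeholders with no chemical meaning, and the ordinal encoding (0 = odorless, 1 = low, 2 = medium, 3 = high) is an assumption made for this illustration.

```python
ODORLESS = 0  # assumed encoding: 0 odorless, 1 low, 2 medium, 3 high


def predict_indirect(samples, is_odorous, predict_strength):
    """Two-step prediction: odorous/odorless first, strength only for odorous samples."""
    final = []
    for sample in samples:
        if is_odorous(sample):
            # Step 2 runs only for samples that passed step 1.
            final.append(predict_strength(sample))
        else:
            final.append(ODORLESS)
    return final


# Hypothetical stand-ins for the two trained models.
def toy_is_odorous(sample):
    return sample["odorous_score"] > 0.5


def toy_strength(sample):
    return sample["strength_guess"]


samples = [
    {"odorous_score": 0.9, "strength_guess": 3},
    {"odorous_score": 0.1, "strength_guess": 2},  # step 2 is never consulted here
]
final_labels = predict_indirect(samples, toy_is_odorous, toy_strength)
```

The second sample shows how errors propagate: if the first model wrongly calls a molecule odorless, the second model never gets the chance to correct it.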
The results of the indirect approach steps are provided in the SI in Fig. S7–S11 and are generally less accurate. The difference between the direct and indirect approach in macro MSE for each model and descriptor combination is plotted in Fig. S11. We performed 5 × 2 cross-validation paired t-tests between the direct and indirect approach for all descriptor–predictor combinations to test the significance of the performance differences. The t- and p-values are provided in Table S15. The direct approach outperformed the indirect one in 19 out of 30 descriptor–predictor combinations (95% confidence level). Given the reduced potential for error propagation, lower computational resource consumption during both training and prediction, and generally higher performance across models and descriptors, the remainder of this study focuses on the direct approach.
We conducted additional 5 × 2 cross-validation paired t-tests between the four direct models with the lowest macro MSE in hyperparameter optimization. The results are shown in Table S16 in the SI. No significant differences were observed between these models (95% confidence level). Due to the better interpretability of RDKit descriptors, we focus the rest of the study on the models based on them. The RDKit descriptor MLP, random forest and XGBoost models were combined into an ensemble, and predictions were calculated as the average of the individual predictions. The normalized confusion matrix of this direct ensemble model is shown in Fig. 4b. Notably, the highest accuracy was observed for the medium odor strength class, which was most prevalent in both the training and test set. In contrast, the least frequent low odor strength class showed the worst performance. Overall, most misclassifications occurred between adjacent odor strength categories. Our ensemble correctly classified 72% of the test instances within their respective categories. MSE, F1, accuracy, and ROC AUC including class-specific scores are reported in Table S19 in the SI.
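One way to realize such an averaging ensemble is sketched below. It assumes that each model emits a numeric prediction on the ordinal scale 0 to 3 and that the average is mapped to the nearest category; the text states only that predictions were averaged, so both choices and all names are illustrative.

```python
from math import floor


def ensemble_average(per_model_predictions):
    """Average numeric predictions across models, sample by sample."""
    n_models = len(per_model_predictions)
    if n_models == 0:
        raise ValueError("need predictions from at least one model")
    return [sum(values) / n_models for values in zip(*per_model_predictions)]


def to_category(value, n_categories=4):
    """Map a numeric prediction to the nearest category index (assumption: round half up)."""
    nearest = floor(value + 0.5)
    return min(max(nearest, 0), n_categories - 1)


# Toy predictions of three hypothetical models for two molecules.
mlp = [2.0, 0.2]
forest = [3.0, 0.4]
boosting = [2.5, 0.0]
averaged = ensemble_average([mlp, forest, boosting])
categories = [to_category(v) for v in averaged]
```

Averaging on the ordinal scale keeps disagreements between models from jumping across several categories, which matches the observation that most errors fall between adjacent classes.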
Fig. 4 Model performance for the direct prediction approach. (a) Macro averaged mean squared error (MSE) across odor strength categories in the test set for all combinations of molecule descriptors (bottom) and predictors (left). MLP is multi-layer-perceptron and FP fingerprint. (b) Confusion matrix normed by the number of test samples for the direct ensemble model on the test set, averaged over 10 random-seeded training runs. (c) Area-normed violin plots of the direct ensemble model predictions for novel molecules from Keller et al.20 compared with their experimentally rated odor intensities (from 0 to 100; 13–108 ratings per molecule) at 10⁻³ dilution. (d) Global SHAP (SHapley Additive exPlanations)47 feature importance of the most influential feature groups of the direct ensemble model. The RDKit descriptor features were grouped using agglomerative clustering based on their feature value correlation (threshold: 0.75 maximizing the silhouette score (Fig. S14 in SI)). The absolute SHAP values within each group were summed.
Finally, we assessed the model performance on another literature-derived test set that contains experimental odor intensity ratings.20 The molecules were entirely distinct from the data set derived and used above, with no close structural similarities as evaluated by the Tanimoto similarity of the bit-based Morgan fingerprints. Fig. 4c displays the results using a violin plot that shows the mean of the rated intensity and its distribution among the intensity ratings (white line and black box) in addition to the distribution (shape, color) of the predictions at 10⁻³ dilution. As can be seen, predicted odor strengths correlate well with the rated odor intensities (ranging from 0 to 100). Despite considerable variability among individual ratings, our model provides a reasonable approximation of perceived odor strength. This trend is further confirmed when combining dilutions ranging from 10⁻³ to 10⁻⁷, which span a concentration difference of up to four orders of magnitude (10⁴). The predictions of odor strength of the direct ensemble model correlate unexpectedly well even without explicit modeling of odorant concentration, as shown in Fig. S12a in the SI. However, our model is limited to lower concentrations. The ensemble model's performance degrades at a dilution level of 10⁻¹ (Fig. S12b).
In agreement with the loadings of the first two principal components of the described PCA (Fig. 3b), features related to molecular polarity, such as the number of hydrogen acceptors or donors and the number of heteroatoms, as well as features describing molecular weight and shape, such as molecular weight or Chi descriptors,66 exhibit the highest importance in the direct ensemble model's predictions. Additionally, nitrogen-related polarity, the presence of alcohol groups and Morgan fingerprint density, which measures the number of non-identical subgroups, contributed substantially to the model predictions. Notably, the cumulative impact of the remaining feature groups was markedly higher, reflecting the complexity of the model's decision process. Furthermore, the feature importance per odor strength is shown in the SI in Fig. S20. No clear trends indicating that specific properties are more relevant for higher or lower odor strengths could be observed. Representative examples of local feature group contributions for four molecules representing each of the odor strength classes are shown in the SI in Fig. S21. These examples are in line with results found globally and illustrate that polarity, molecular weight and shape substantially influence odor strength.
One of the main challenges in the prediction of odor strengths remains the labeling of molecules, which is highly subjective and requires evaluations of many individuals for robust results. However, odor perception is highly variable between individuals.19 Moreover, the discrete classification itself neglects continuous variations in odor intensity within an odor strength category. To further advance odor intensity modeling, there is a need for comprehensive odor intensity data covering a wider range of molecules and mixtures, measured at multiple concentrations and with well-characterized impurities. Such data would support models that better capture the empirically known shape of the monotonic, sigmoidal relationship between the logarithm of odorant concentration and perceived odor intensity, whose slopes vary for different odorants.67 While odorousness2 and odor strength can be reliably predicted from structure-related transport properties, incorporating biological information, such as receptor responses, may yield even more accurate and mechanistically grounded models. Overall, this work provides a step towards data-driven in silico fragrance design.
Only molecules with valid canonicalized SMILES were retained. Duplicated SMILES entries from PubChem were removed. A total of 332 SMILES containing dots were identified. Dots in SMILES are not part of a molecule's covalent structure, but indicate separate disconnected fragments, e.g. ions or isomers. The SMILES with dots and the ambiguous entries of ‘carob bean absolute’ and ‘galbanum resinoid’ were excluded. A total of 2393 data points were collected with 1678 entries from Good Scents and 715 entries from PubChem.
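The filtering steps described above can be illustrated with a short sketch. It assumes the SMILES strings have already been canonicalized by a cheminformatics toolkit, a step not reproduced here; the function name and the example strings are hypothetical.

```python
def filter_smiles(smiles_list):
    """Drop multi-fragment SMILES (containing '.') and exact duplicates, keeping order.

    Assumption: inputs are already canonical, so equal molecules have equal strings.
    """
    seen = set()
    kept = []
    for smiles in smiles_list:
        if "." in smiles:
            continue  # disconnected fragments, e.g. a salt written as ion pairs
        if smiles in seen:
            continue  # duplicate entry
        seen.add(smiles)
        kept.append(smiles)
    return kept


# Toy input: ethanol twice, sodium chloride as two ions, acetic acid.
raw = ["CCO", "CCO", "[Na+].[Cl-]", "CC(=O)O"]
cleaned = filter_smiles(raw)
```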
In addition, another independent hold-out test set was generated using data from Keller et al.,20 downloaded via Pyrfume.72 Molecules with a Tanimoto similarity of 80% or more between their bit-based Morgan fingerprints (radius = 3, nBits = 2048) and those of at least one molecule in the training set were removed to ensure that the test set contained only novel compounds sufficiently dissimilar to the training compounds. Four different dilution levels of compounds were available. Each molecule was rated between 13 and 108 times by independent lay panelists, providing a mean and standard deviation of rated intensities for each molecule per dilution.
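The similarity filter can be sketched with fingerprints represented as sets of on-bit indices. In practice the bit sets would come from a toolkit that computes Morgan fingerprints; the toy bit sets, molecule names, and function names below are assumptions for illustration only.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto similarity of two fingerprints given as collections of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def keep_novel(test_fingerprints, train_fingerprints, threshold=0.8):
    """Keep test molecules whose similarity to every training molecule is below threshold."""
    return [
        name
        for name, bits in test_fingerprints.items()
        if all(tanimoto(bits, train_bits) < threshold for train_bits in train_fingerprints)
    ]


# Toy on-bit sets standing in for 2048-bit Morgan fingerprints.
train = [{1, 2, 3, 4, 5}, {10, 11, 12}]
test = {
    "near_duplicate": {1, 2, 3, 4},  # similarity 4/5 = 0.8 to the first training entry
    "novel": {20, 21, 22},           # shares no bits with the training set
}
novel_names = keep_novel(test, train)
```

A molecule sitting exactly at the threshold is removed, matching the "80% or more" rule stated above.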
We performed 5 × 2-cross validation paired t-tests. We adapted the corresponding code from the Python package mlxtend.73
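The macro MSE computed across the odor strength categories was used as the evaluation metric, following the equation:

$$\mathrm{macro\ MSE} = \frac{1}{K}\sum_{k=1}^{K}\frac{1}{N_k}\sum_{i:\,y_i = k}\left(y_i - \hat{y}_i\right)^2 \qquad (1)$$

where K is the number of odor strength categories, N_k is the number of samples in category k, and y_i and ŷ_i are the true and predicted odor strength of sample i.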
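The difference between the macro and micro MSE can be seen in a small worked example. The sketch assumes the ordinal encoding 0 = odorless, 1 = low, 2 = medium, 3 = high and averages only over categories present in the true labels; function names and toy data are hypothetical.

```python
from collections import defaultdict


def micro_mse(y_true, y_pred):
    """Mean squared error over all samples, regardless of category."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_mse(y_true, y_pred):
    """Average of the per-category MSEs, so every category has equal weight."""
    squared_errors = defaultdict(list)
    for t, p in zip(y_true, y_pred):
        squared_errors[t].append((t - p) ** 2)
    per_category = [sum(errors) / len(errors) for errors in squared_errors.values()]
    return sum(per_category) / len(per_category)


# Four medium molecules predicted correctly, one low molecule predicted as high.
y_true = [2, 2, 2, 2, 1, 0]
y_pred = [2, 2, 2, 2, 3, 0]
micro = micro_mse(y_true, y_pred)  # 4 / 6: the single error is diluted by the majority class
macro = macro_mse(y_true, y_pred)  # (0 + 4 + 0) / 3: the minority-class error counts fully
```

The single mistake on the rare "low" category doubles in weight under the macro average, which is why this metric is less dominated by the majority class.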
Supplementary information (SI): further plots and tables about the curated dataset, hyperparameter ranges, model performance, model validation and SHAP feature importance analysis. See DOI: https://doi.org/10.1039/d6ra01805j.
This journal is © The Royal Society of Chemistry 2026