Open Access Article
Luis Eduardo Ramirez Cardenas*a, Rachid Ouaretb, Vincent Gerbaudb, Ivonne Rodriguez Donisa and Sophie Thiebaud-Roux*a
aUniv. Toulouse, Toulouse INP, INRAE, Laboratoire de Chimie Agro-Industrielle (LCA), Toulouse, France. E-mail: luiseduardo.ramirezcardenas@toulouse-inp.fr
bUniv. Toulouse, CNRS, Toulouse INP, Laboratoire de Génie Chimique (LGC), Toulouse, France
First published on 19th February 2026
Within the context of a transition towards greener and safer solvents, we describe a framework facilitating solvent screening. Traditional approaches rely on experimental solubility data or computational methods such as COSMO-RS. In parallel, similarity maps can help to explore alternative molecules similar to working solvents. For developing solvent maps, Principal Component Analysis (PCA) offers limited applicability when dealing with complex molecular descriptors such as the σ-potential derived from COSMO-RS theory. In this study, we propose the application of Functional Principal Component Analysis (FPCA) as a more suitable dimensionality reduction technique for solvent mapping, leveraging the functional nature of σ-potentials. A database of 1588 solvents was analyzed, extending previously reported datasets with the inclusion of industrially relevant and sustainable candidates. FPCA enables a two-dimensional representation of the solvent space with minimal information loss (0.5%), directly associating the principal components with electron donor and acceptor characteristics. In this space, solvent clustering naturally emerges, facilitating the identification of structurally and functionally similar solvents. Three case studies are presented to illustrate the practical implications of the approach. Overall, this methodology provides a suitable framework for solvent substitution, whether as a preliminary screening step or as part of computer-aided solvent design tools, contributing to more sustainable chemical practices.
Many other techniques exist for finding alternative solvents, including (1) comparing solvents on the basis of their required properties or descriptor similarity, (2) experimental trial and error, and (3) Computer-Aided Molecular Design (CAMD). Solvent selection typically involves solving a multi-criteria optimization problem, where factors related to key functionality in the process (solubility, phase transition temperatures) must be balanced with other practical factors such as chemical stability, transport properties (viscosity, density, surface tension), energy-related properties (phase change enthalpy, specific heat), and economic factors like cost and availability. In addition to these functional and practical properties, environmental, health and safety (EHS) criteria are becoming increasingly important in the selection process. These factors have previously been used to map similarity between solvents, coupled with dimensionality reduction and clustering techniques,8,15–18 to decrease the correlation between them and to outline the underlying behaviors that emerge in high-dimensional spaces.
In this study we apply Functional Principal Component Analysis (FPCA)19,20 to create an effective mapping, suited to the functional nature of the curve-like descriptors obtained with the Conductor-like Screening Model for Real Solvents (COSMO-RS).21 Our study encompasses 1588 molecules (liquid at room temperature), expanding on the list established by Moity et al.18 by incorporating additional liquid compounds from the COSMOtherm database, green solvents suggested by Clark et al.,22 and new alternative solvents selected based on in-house data. Finally, the effectiveness of this method as a tool for supporting solvent substitution with the required solubilizing properties is assessed using three case studies from different application domains. This methodology could be integrated as a preliminary step in computer-aided solvent design tools to identify groups of closely similar solvents before the application of molecular design.
Efforts to transition directly from the chemical structure and chemical group interactions to molecular properties go back to the 1940s, when one of the earliest semi-empirical methods, the group contribution (GC) method, was developed.28 Since then, advances in property prediction methodologies have been driven by the need to move beyond traditional trial-and-error approaches for solvent substitution towards solvent selection tools. GC methods are the most commonly used semi-empirical models as Quantitative Structure–Property Relationships (QSPR) in CAMD because they offer a straightforward way to estimate pure compound properties based on the contributions of individual structural chemical groups. Hukkerikar et al.29 updated and improved the parameters of 18 properties of the GC+ models, combining both group-contribution and atom connectivity index methods and using a large experimental dataset. Within this set of models, certain approaches enable the estimation of both the Hildebrand solubility parameter and the Hansen solubility parameters. The prediction of environmental, health, and safety (EHS) properties is commonly based on Quantitative Structure–Activity Relationships (QSAR). Molecular descriptors, commonly including physicochemical characteristics, structural features, and electronic attributes, are correlated with the EHS properties using experimental data.30 Software platforms like VEGA-QSAR31 exemplify this approach by bundling over 90 QSAR models, enabling in silico prediction of toxicological, ecotoxicological, environmental and physicochemical properties without additional empirical data. However, a major limitation of both GC and QSAR methods is their dependence on the quality and diversity of experimental datasets, which often lack sufficient chemical diversity, leaving many parameters unavailable and thus limiting the design space in CAMD applications.
Fully predictive theoretical methods have advanced rapidly since the early 21st century, offering a means to compare solvents without relying on experimental data. Early approaches relied on ab initio and density functional theory (DFT) calculations to describe the electronic structure of molecules, providing fundamental insights into solute–solvent interactions. However, the direct application of these methods to bulk liquids was limited by their high computational cost and the complexity of capturing collective solvent effects. To overcome these challenges, hybrid models such as continuum solvation frameworks were introduced, representing the solvent as a dielectric medium surrounding the solute. Building on this foundation, more sophisticated approaches like the Conductor-like Screening Model (COSMO)32 and its extension, COSMO-RS,21 incorporated statistical thermodynamics to connect quantum chemical data with macroscopic properties, enabling accurate predictions of solubility, activity coefficients, and partitioning behaviour. These advances have established quantum chemistry–based models as powerful tools for rational solvent design and the exploration of environmentally benign alternatives.
In the context of solvent substitution, several molecular descriptors are often taken into consideration, which hampers the search for alternative molecules because of computational cost and information overlap. Dimensionality reduction techniques such as PCA are frequently combined with clustering techniques to reduce redundant information, indicate similarity and guide the selection or design of substitute solvents. Chastrette15 pioneered a solvent classification system using a multi-parametric statistical approach considering 83 substances. This approach is based on six experimental physical properties (the Kirkwood function (K), molecular refraction (MR), molecular dipole moment (µ), the δ parameter of Hildebrand, refractive index (n), boiling point (bp)) along with the predicted HOMO and LUMO energies. PCA was employed to reduce the original eight-dimensional space to a three-dimensional space. Since the first decade of the 21st century, several data-driven tools have been developed integrating experimental data with PCA for solvent selection and substitution. Launched in 2009 as part of a European industry-academic collaboration, the SOLVSAFE tool applied PCA to a dataset of 347 molecules encompassing 11 chemical families and utilizing 52 structural descriptors. Integrating PCA results with predicted toxicity and ecotoxicity profiles allowed the identification of safer solvent alternatives.33 Similarly, both AstraZeneca8 and Syngenta34 applied PCA to create simplified “maps” of solvent space, enabling visual representation and comparison of solvents through multivariate statistical analysis of their properties. AstraZeneca's tool8 assessed 272 solvents, representing a wide range of chemical types, based on seven normalized EHS criteria, while the Syngenta tool34 allowed users to select parameters for identifying potential solvents from 209 molecules. Katritzky et al.16 developed QSPR models to predict 127 polarity scales using 168 theoretical descriptors.
These descriptors reflect key intermolecular interactions involved in solvation, including cavity formation, electrostatic polarization, dispersion forces, and hydrogen bonding.
Descriptors derived from COSMO-RS, called σ-moments, were first proposed by Klamt and Eckert.35 This approach transforms the σ-potential into a Taylor-series expansion with respect to the surface charge density σ. Each coefficient obtained from this expansion carries distinct physical information: M0 represents the total molecular area, M1 the negative total charge, M2 the electrostatic energy, and M3 the skewness or asymmetry of the profile.36 This methodology allows for an expansion up to 6 components, although standard applications truncate the series at 3 or 4 components. Hydrogen bonding behaviour is treated separately, based on specific thresholds that define donor and acceptor regions.37 This approach has already been tested for property prediction by several authors.35,36,38
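As an illustration of the moments listed above, the sketch below assumes the commonly used definition in which the i-th σ-moment is the i-th moment of the σ-profile, M_i = Σ p(σ_k) σ_k^i. The function name, the grid and the Gaussian toy profile are hypothetical and only serve to show the calculation; they are not data from this study.

```python
import numpy as np

def sigma_moments(sigma, profile, max_order=3):
    """Return [M0, ..., M_max_order] with M_i = sum_k p(sigma_k) * sigma_k**i."""
    sigma = np.asarray(sigma, dtype=float)
    profile = np.asarray(profile, dtype=float)
    return np.array([np.sum(profile * sigma**i) for i in range(max_order + 1)])

# Synthetic, symmetric sigma-profile (surface area per bin, arbitrary units).
# This is an assumed toy curve, not a COSMO calculation.
sigma_grid = np.linspace(-3.0, 3.0, 61)               # e nm^-2
toy_profile = np.exp(-0.5 * (sigma_grid / 0.4) ** 2)

moments = sigma_moments(sigma_grid, toy_profile)
# moments[0]: total area, moments[1]: net screening charge,
# moments[2]: width of the profile, moments[3]: asymmetry of the profile
```

For a symmetric profile such as this one, the odd moments vanish, which matches the reading of M1 as a net charge and M3 as a skewness.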
A novel solvent classification approach was proposed by Durand et al.,17 based solely on molecular structure and COSMO-RS.21 Expanding on earlier work by Chastrette et al.,15 the method aimed for a solvent classification from theoretical descriptors by analysing the σ-potential curves of 153 compounds using PCA. Solvents were then classified into ten families using k-means. Following the 3D representation, Moity et al.18 mapped 138 sustainable solvents across the ten solvent classes, based on the principle that solvents with similar σ-potential curves are theoretically capable of dissolving the same solute in the absence of ionic interactions. The study highlighted the potential of this method as a tool for guiding solvent replacement and the design of new solvents with the desired solubilizing capabilities. However, PCA, the dimensionality reduction technique used for the σ-potential curves, is not well suited to this type of data. The resulting 3D representation is difficult to interpret, as solvent classes often overlap and are poorly distinguished along the third dimension, which carries little information compared with the first two dimensions. These limitations hinder the effective identification of suitable replacement solvents.
Additionally, FPCA is a powerful statistical technique for extracting dominant patterns from functional data such as the σ-potential curves generated by COSMO-RS. An overview of each technique is presented below.
While COSMO provides an initial estimate of solvation free energy by modeling the solute in a conductor-like medium, COSMO-RS refines these predictions by incorporating a statistical thermodynamic framework that accounts for real solvent effects and molecular interactions. COSMO-RS integrates the quantum chemical COSMO calculations of individual molecules with a statistical thermodynamic treatment of their pairwise interactions. Fig. 1 shows the surface charge distribution of dimethylformamide (DMF) from COSMO calculations, subsequently converted into a ‘σ-profile’ (Fig. 2), a histogram of surface polarization charge densities across the molecule. The geometrical information from the ab initio computations is dropped, since only segment–segment interactions matter from a thermodynamic point of view.
The σ-potential describes the interaction energy between a surface segment of charge density σ and the molecule. The σ-potential curve encodes the electrostatic information of a molecule, effectively representing the affinity of the solvent for a surface charge σ. Because the calculations required for the σ-potential are purely ab initio, no experimental data are required. Another advantage is that all molecules share the same reference state of a conducting medium, which can later be scaled to more accurately represent the dielectric properties of the real environment surrounding the molecule. This makes the σ-potential a suitable candidate as a universal molecular descriptor. However, it does not capture all molecular information. Combinatorial contributions arising from the molecule's relative volume and area are considered in COSMO-RS as separate contributions during the calculation of thermodynamic phase equilibria.
Fig. 2 illustrates the σ-profiles and σ-potentials of four different molecules. Apolar solvents like hexane are characterized by centered σ-profiles describing large neutral charge surfaces, and by U-shaped σ-potentials where minimal interaction energy is found only for neutral σ (−1 e nm−2 ≤ σ ≤ 1 e nm−2). Polar amphiprotic solvents are also fairly symmetrical, but with many segments in the most polar zones of the σ-profile, and with m-shaped σ-potentials where the interaction energies reach their lowest points at the extremes of σ, indicating a very low interaction capacity with apolar solvents and solutes. Aprotic polar solvents are asymmetrical in both descriptors: their σ-profiles have many segments with high positive σ values, and their σ-potentials are S-shaped curves with the best interactions occurring for negative σ values (σ ≤ −1 e nm−2). The behavior of the solvent can therefore be inferred from the shape of the σ-potential. Further treatment of the σ-potential with dimensionality reduction techniques can improve the interpretation of this descriptor.
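To make the link between a σ-profile and its σ-potential concrete, the sketch below solves a simplified, dimensionless form of the COSMO-RS self-consistency condition, u(σ) = −ln Σ p(σ′) exp[u(σ′) − E(σ, σ′)], by damped fixed-point iteration. It is a sketch under strong assumptions: only a quadratic misfit energy E = ½c(σ + σ′)² is kept, hydrogen bonding is ignored, the effective contact area and RT are absorbed into reduced units, and the value of c and the toy profile are arbitrary. It is not the COSMOtherm implementation.

```python
import numpy as np

def sigma_potential(sigma, profile, misfit=1.0, tol=1e-10, max_iter=1000):
    """Reduced sigma-potential u from a sigma-profile (misfit-only toy model).

    Solves u_i = -ln sum_j p_j * exp(u_j - E_ij) with
    E_ij = 0.5 * misfit * (sigma_i + sigma_j)**2 by damped iteration.
    """
    sigma = np.asarray(sigma, dtype=float)
    p = np.asarray(profile, dtype=float)
    p = p / p.sum()                                    # normalised sigma-profile
    energy = 0.5 * misfit * (sigma[:, None] + sigma[None, :]) ** 2
    u = np.zeros_like(sigma)
    for _ in range(max_iter):
        u_new = -np.log(np.sum(p[None, :] * np.exp(u[None, :] - energy), axis=1))
        u_next = 0.5 * (u + u_new)                     # damping stabilises the iteration
        converged = np.max(np.abs(u_next - u)) < tol
        u = u_next
        if converged:
            break
    return u

# Assumed toy profile: a narrow peak around sigma = 0, mimicking an apolar molecule.
sigma_grid = np.linspace(-3.0, 3.0, 61)                # e nm^-2
apolar_profile = np.exp(-0.5 * (sigma_grid / 0.3) ** 2)
u_apolar = sigma_potential(sigma_grid, apolar_profile)
```

With this narrow, centered profile the resulting curve is U-shaped with its minimum at neutral σ, which is the qualitative behavior expected for an apolar solvent.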
PCA is a statistical technique commonly used to reduce the dimensionality of complex datasets containing correlated variables, while preserving as much variability as possible. It works by transforming the original set of variables into a smaller set of uncorrelated variables, known as principal components (PCs), which are linear combinations of the original variables. The first principal component captures the maximum variance in the data, with each subsequent component accounting for the largest remaining variance, while being orthogonal to the preceding components.
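The transformation described above can be sketched with a singular value decomposition of the centred data matrix. The example uses synthetic data (two latent factors mixed into six correlated variables) and hypothetical names; it illustrates classical PCA in general, not the exact processing applied to the solvent database.

```python
import numpy as np

def pca(data, n_components):
    """Classical PCA of a (samples x variables) matrix via SVD of the centred data."""
    data = np.asarray(data, dtype=float)
    centred = data - data.mean(axis=0)
    _, singular_values, vt = np.linalg.svd(centred, full_matrices=False)
    variance = singular_values**2 / (data.shape[0] - 1)   # variance along each PC
    ratio = variance / variance.sum()                     # explained variance ratio
    loadings = vt[:n_components]                          # orthonormal directions
    scores = centred @ loadings.T                         # coordinates in PC space
    return scores, loadings, ratio[:n_components]

# Synthetic example: six correlated variables driven by two latent factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 6))
toy_data = latent @ mixing + 0.01 * rng.normal(size=(200, 6))

scores, loadings, explained = pca(toy_data, n_components=2)
information_loss = 1.0 - explained.sum()   # share of variance not kept by two PCs
```

The quantity `information_loss` corresponds to the "information loss" reported for a truncated representation: one minus the cumulative explained variance of the retained components.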
Although PCA is well suited for classical molecular descriptors expressed as discrete variables, the situation considered by Durand et al.17 departs from this standard framework. Their analysis relies on the σ-potential, which is inherently functional rather than discrete. Treating such a curve as a long vector of sampled values allows PCA to be applied formally, but it overlooks the structural nature of the data. In practice, this creates several methodological limitations. Functional observations contain strongly correlated values across the domain, which artificially inflates dimensionality and exposes PCA to issues such as sparsity, unstable loadings, and overfitting. Moreover, the discretization required to convert a function into a vector introduces an arbitrary dependence on the sampling resolution; different grids may lead to different principal components and, consequently, to different interpretations. These limitations highlight that conventional PCA is not theoretically aligned with the continuous nature of σ-potentials, and motivate the use of an approach that explicitly accounts for the functional structure of the data.
After applying PCA, results similar to those found by Durand et al.17 were obtained. Fig. 3 shows that 98.6% of the variance is captured by the first five principal components in their results; in our case, 98.8% of the variance was represented with five components. The first two principal components obtained by PCA were tied to the electron donor and acceptor character of the molecule, accounting for the majority of the variance in the data, approximately 75% (Fig. 3). The third PC was attributed to the lipophilicity of the molecules. However, as the lipophilic character of a molecule is primarily determined by the central region of the σ-potential curve, where variation is minimal, the differentiation captured by this principal component has a limited effect. This can be seen in the clustering of solvents in the classification proposed by Durand et al.,17 where several families overlap because of the third dimension. The remaining components were not tied to any other meaningful physical property.
Fig. 3 Information on the Principal Components obtained with PCA by Durand et al.17
A key difference separating classical Principal Component Analysis (PCA) from its functional counterpart (FPCA) lies in both their mathematical foundations and the type of data they are meant to handle. Classical PCA works with finite-dimensional vectors, where each observation is a fixed set of discrete measurements. Its objective is to find linear combinations of these variables that capture the largest possible share of the total variance. This method inherently treats the data as a collection of isolated points, with no assumed relationship between one measurement and the next, such as continuity or smoothness (that is, the absence of abrupt, irregular, or jagged changes in the underlying process generating the data).
FPCA, on the other hand, originates from functional data analysis. In this case, each observation is treated as a smooth function defined over a continuous domain, such as time, wavelength, or, as in our case, the surface charge density σ over which the σ-potential is defined. Instead of computing a covariance matrix for discrete variables, FPCA estimates the covariance structure across the entire functional space and performs a spectral decomposition. The result is a set of smooth, orthonormal eigenfunctions that describe the main modes of variation in the data. Unlike PCA loadings, which assign weights to separate data points, FPCA eigenfunctions represent coherent functional patterns such as overall shifts, changes in shape, or smooth deformations across the domain.
This distinction becomes crucial when analysing data that are inherently continuous. Take the example of σ-potentials: these are not simply lists of independent descriptors, but smooth curves representing the distribution of molecular surface charge density across the σ-scale. Their physical interpretation relies on the shape and smoothness of the σ-potential, i.e., adjacent values reflect gradual shifts in local polarity, and the overall curve conveys meaningful information about a molecule's electrostatic character. Applying classical PCA to a discretized version of a σ-potential ignores this functional nature, treating neighbouring σ-values as statistically independent and thereby overlooking the continuity that gives the profile its chemical relevance. In such cases, apparent “patterns” may arise as artifacts of the discretization grid rather than as genuine physicochemical features. For example, between σ = 2.999 and σ = 3 e nm−2 the potential varies only very slightly, so the two sampled values are far from independent.
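One common way to approximate this spectral decomposition for curves sampled on a shared grid is to weight the covariance matrix with quadrature weights, so that inner products approximate integrals over the domain. The NumPy sketch below follows that discretization idea with trapezoidal weights; the function names and the synthetic curves are hypothetical, and the routine is not the scikit-fda implementation used in this work.

```python
import numpy as np

def trapezoid_weights(grid):
    """Trapezoidal quadrature weights for a strictly increasing grid."""
    grid = np.asarray(grid, dtype=float)
    weights = np.empty_like(grid)
    weights[1:-1] = 0.5 * (grid[2:] - grid[:-2])
    weights[0] = 0.5 * (grid[1] - grid[0])
    weights[-1] = 0.5 * (grid[-1] - grid[-2])
    return weights

def fpca_grid(curves, grid, n_components=2):
    """Discretized FPCA: eigenfunctions of the sample covariance operator."""
    curves = np.asarray(curves, dtype=float)
    weights = trapezoid_weights(grid)
    root_w = np.sqrt(weights)

    mean_curve = curves.mean(axis=0)
    centred = curves - mean_curve
    covariance = centred.T @ centred / (curves.shape[0] - 1)

    # Symmetrised eigenproblem W^(1/2) C W^(1/2) v = lambda v.
    sym = root_w[:, None] * covariance * root_w[None, :]
    eigvals, eigvecs = np.linalg.eigh(sym)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    # Back-transform so that the eigenfunctions are orthonormal in L2.
    eigenfunctions = (eigvecs[:, :n_components] / root_w[:, None]).T
    # Scores are the integrals of (curve - mean) * eigenfunction.
    scores = (centred * weights) @ eigenfunctions.T
    explained = eigvals[:n_components] / eigvals.sum()
    return mean_curve, eigenfunctions, scores, explained

# Synthetic curves built from two smooth modes (linear and quadratic in sigma).
rng = np.random.default_rng(1)
sigma_grid = np.linspace(-3.0, 3.0, 61)
a = rng.normal(scale=2.0, size=50)
b = rng.normal(scale=0.5, size=50)
toy_curves = 1.0 + a[:, None] * sigma_grid + b[:, None] * sigma_grid**2

mean_curve, eigenfunctions, scores, explained = fpca_grid(toy_curves, sigma_grid)
```

Because the weights follow the grid spacing, refining or coarsening the grid changes the quadrature accuracy but not the definition of the eigenfunctions, which is the property that distinguishes this treatment from PCA on raw sampled vectors.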
FPCA is designed to respect the functional form of σ-potentials. In fact, FPCA is an extension of PCA to situations where data consist of functions, not vectors. It explicitly models smoothness, reduces noise from discretization or measurement, and produces a low-dimensional representation grounded in chemically interpretable variations. For example, the leading functional principal components might correspond to a systematic shift in polarity towards more positive or negative σ-regions, a broadening or narrowing of the central charge distribution, or a change in the balance between polar and non-polar surface regions. These modes align with established physicochemical principles and offer a more intuitive framework for characterizing solvent diversity.
Therefore, when the aim is to reduce dimensionality while retaining the intrinsic continuous structure of σ-potentials, FPCA provides a theoretically sound and scientifically coherent alternative to classical PCA. It bridges statistical methodology with the physical reality of molecular surface properties, making it not merely a technical choice, but a conceptually appropriate one for this type of data.
FPCA calculations were performed in Python on the standardized values of the database matrix of σ-potentials using scikit-fda,46 a package offering support for Functional Data Analysis (FDA). The shape of the functional components as well as the coordinates of each solvent in the new FPCA space were retrieved for further analysis.
FPCA represents the principal components as the positive and negative effect on the average curve of the dataset. Fig. 5 illustrates the effect of each principal component (PC). The first component depends strongly on the electronic charge density (ECD) on electron-donating sites (δ−), which can be related to the hydrogen-bond accepting capacity. Because most of the solvents in our database are aprotic, ranging from low to high polarity, the largest share of the variance lies in the first PC associated with the ECD. The second PC reflects the electron-deficient sites (δ+) and the positive charge density (PCD), and can therefore be associated with the Brønsted and Lewis acidity of the solvents.
Fig. 5 Mean σ-potential (black), positive (red), and negative (blue) effect of the Principal Components obtained with FPCA.
Considering only the first two PCs results in a minimal information loss of 0.5% (Fig. 6). Applying PCA to the same dataset results in 98.4% of the variance being captured by the first five principal components, which is consistent with the findings reported by Durand et al.17 The share of the first component increases, suggesting that our database contains more solvents with high ECD, yet the meaning of the PCs remains the same.
Fig. 6 Variance captured by the principal components obtained with FPCA, compared against PCA, for our organic solvent database.
After the application of FPCA, molecules can be represented in a simple two-dimensional plane, facilitating interpretation (Fig. 7). As solvents are located further to the left their ECD increases, whereas moving to the right it gradually decreases until it disappears completely. On the other hand, the acidity increases downwards and diminishes upwards. Consequently, groups of solvents can be identified based on their relative positions in the FPCA plane. Apolar solvents, like hexane, toluene and benzene, are found in the upper right corner, with the lowest ability to either give or accept electrons. Aprotic polar solvents, such as dioxane, pyridine and DMF, are located in the upper section of the plane, with an increasing ECD as they move closer to the left. Protic compounds capable of donating a proton to form a hydrogen bond are placed in the lower section of the plane, with an increasing acid character towards the lower zones of the y-axis, and can be divided into two groups: hydrogen bond acceptors, such as alcohols, water, and formamide, and poor acceptors, such as phenol and trifluoroacetic acid.
Fig. 7 Solvents of our database in the FPCA space. Solvents have been separated by the clusters assigned by Moity et al. for comparison. The area within the red rectangle is visualized in Fig. 7.
The use of only two components to fully represent the initial data is an advantage over previous studies. Chastrette et al.15 used eight molecular descriptors for a database of 83 solvents. The highly correlated descriptors could be represented by three principal components with an information loss of 18%. Katritzky et al.49 further expanded this approach, initially considering 40 descriptors for 40 solvents, and later 100 descriptors for 774 solvents,16 with all descriptors calculated with QSPR, resulting in a representation of the solvents with three principal components accounting for over 60% of the variation. Stairs et al.50 applied PCA on solvent spectra and equilibrium rates of reaction. The methodology advanced to a stage where refinement was necessary, with the main challenge being the sparsity of data in certain regions of the solvent spectrum. Diorazio et al.8 addressed this issue via the implementation of an interactive tool reducing the solvent representation from 17 descriptors to six PCs while capturing 87.9% of the variation. Durand et al.17 followed the same approach on the σ-potential. However, the decorrelation process in PCA is not sufficiently robust to fully decompose curves.50 The need for 5 PCs to capture 98.6% of the variation (Fig. 3) limits its usefulness for solvent visualization, clustering and the interpretation of the resulting PCs. The effects on the clusters proposed by Durand et al.17 and later reworked by Moity et al.18 are presented in Fig. 7. The clusters appear correctly positioned within the FPCA mapping: apolar solvents reside in the top-right corner, while aprotic and pair-donor bases are situated at the top, shifting leftward as polarity increases. Solvents characterized by dual positive and negative surface charge screening (amphiprotic) are correctly located in the center.
However, the amphiprotic and polar protic clusters show significant overlap, indicating that they share the same σ-potential shape and that distinguishing these zones into two separate families may be unnecessary. Asymmetric halogenated hydrocarbons span a wide region, as their σ-potentials vary from acidic to basic, a nuance that is visualized more effectively in the FPCA mapping than in previous models.
While there is also overlap among aprotic families, the displacement to the left in the FPCA map clearly distinguishes molecules with higher ECD more effectively than standard PCA. These limitations in the original classification stem from Durand et al.'s use of the third principal component to represent lipophilicity, which inadvertently results in a loss of differentiation regarding polarization.
Furthermore, applying clustering to the results of standard PCA leads to the misclassification of several solvents that exhibit behavior distinct from their assigned clusters. For example, decamethylcyclopentasiloxane, r,r-diisopropyleneglycol, and a series of esters near the aprotic dipolar family are incorrectly classified as apolar. Triethyleneglycol is misidentified as a weak electron-pair donor base, while aniline is grouped with asymmetric halogenated hydrocarbons. Additionally, two molecules from the amphiprotic family are superimposed onto other families: tetraethylene glycol is found near the electron-pair donor bases, and oleic acid is positioned as an organic acidic compound.
The positioning of these solvents within the FPCA mapping shows an improved representation of their polarization profile and highlights that FPCA is a better tool for dimensionality reduction than PCA, in addition to requiring only two dimensions to capture the information contained in the σ-potential.
Furthermore, the use of a more extensive database compared to Moity et al. (solvents in gray had not been used in Durand's study) illustrates the lack of sharply defined families. Instead, a continuum is observed, made of regions whose polarization behavior changes gradually.
The effects of lipophilicity, reflected by the amount of apolar segments in the σ-profile/surfaces, can be observed in the FPCA plane (Fig. 8) through the alcohols and the effect of the alkyl chain length on their position. As the alkyl chain length increases in higher alcohols, the polarization of the O–H bond, and consequently the partial charges (δ−) on oxygen and (δ+) on hydrogen, gradually decrease, resulting in lower acidity. This trend is attributed to the increasing electron-donating inductive effect (+I) of the increasingly longer alkyl chain in the molecule. It is worth noting that phenol does not appear in the region corresponding to alcohols (Fig. 8), as its higher acidity results from the delocalization of the oxygen electron pair into the phenyl ring through an electron-donating mesomeric effect. This broader electronic distribution throughout the molecule also explains the decrease in ECD compared to acyclic alcohols.
Fig. 8 Visualization of alcohols in the FPCA space and influence of the inductive effect of the alkyl chain on the partial charges.
The treatment of the data by PCA as vectors of discrete observations results in considering the σ-potential as a series of individual points, without accounting for the correlation along the whole curve. This can explain the observed information loss. FPCA avoids this issue by considering the σ-potential not as a set of discrete points but as a single functional object. This approach is, however, limited by how the σ-potential is constructed. Since the geometric information of the molecule is lost when the σ-surface is transformed into the σ-profile and σ-potential, steric effects are not fully captured. Molecules lacking accessible polar regions can display either the expected behavior or the opposite. For example, the tertiary amine tributylamine is known to behave as a weaker base than its primary and secondary homologues due to the steric hindrance imposed by its three butyl chains. Nonetheless, as these chains do not completely shield the region above the nitrogen atom, the molecule is represented as having a high electron charge density (ECD), although the nitrogen is not readily accessible to capture a proton (Fig. 9).
Since activity coefficients can be derived directly from the σ-potential, solvents with similar σ-potentials may serve as effective substitutes. Accordingly, the distance between points in the FPCA space provides a valuable methodology to identify alternatives, as solvents located closer together are more likely to share similar σ-potentials and solvent–solute affinities. This methodology represents a preliminary tool for identifying solvent substitutes. It also serves as an alternative to COSMO-RS relative solubility, particularly when solutes are too complex for the calculation of required COSMO surface input, a limitation that is not present in our methodology.
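The distance-based search described above can be sketched as a nearest-neighbour ranking in the two-dimensional score plane. The solvent labels and coordinates below are hypothetical placeholders rather than values from the database, and the plain Euclidean distance is an assumption; a weighting of the two axes could be preferred depending on the application.

```python
import numpy as np

def nearest_solvents(names, scores, reference, k=3):
    """Rank solvents by Euclidean distance to `reference` in the score plane."""
    scores = np.asarray(scores, dtype=float)
    ref_index = names.index(reference)
    distances = np.linalg.norm(scores - scores[ref_index], axis=1)
    # Sort by distance and drop the reference solvent itself.
    order = [int(i) for i in np.argsort(distances) if int(i) != ref_index]
    return [(names[i], float(distances[i])) for i in order[:k]]

# Hypothetical (PC1, PC2) coordinates, chosen only to illustrate the ranking.
names = ["solvent_A", "solvent_B", "solvent_C", "solvent_D"]
coords = [[0.0, 0.0], [0.1, 0.0], [2.0, 1.0], [0.0, -0.3]]

ranking = nearest_solvents(names, coords, "solvent_A", k=2)
```

The returned list gives candidate substitutes in order of increasing distance, which can then be filtered with EHS or process criteria.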
Experimental results revealed that solvents capable of dissolving nitrocellulose grouped in the same regions as the clusters described by Durand et al.17 The most effective solvents are aprotic, ranging from fairly to highly polar, and are located in the upper section of the FPCA space, like DMSO, THF and propylene carbonate (Fig. 11). In contrast, those in the apolar region have no effect on the solubilization of nitrocellulose. This behavior was attributed to their apolar surface area, which cannot effectively interact with the charged nitro groups present in nitrocellulose. Some protic solvents in the center of the FPCA plane also dissolved nitrocellulose well. Further analysis of σ-profiles revealed that polar solvents with a smaller apolar surface area, like methanol, 2-methoxyethanol, and diisopropylene glycol, interact more effectively with the polar nitro groups (–NO2) and hydroxyl sites present in nitrocellulose. On the other hand, the poorly performing solvents in the aprotic polar region are cases where, although polarized regions exist, they sit in symmetrical positions that cancel their dipole moments, as in the case of dioxane and diethyl sulfide. Aniline is another interesting case: although it presents electron-donating and electron-accepting regions, it tends to form molecular aggregates in the liquid state through hydrogen bonds,52 reducing its availability to solvate nitrocellulose chains.
Fig. 12 (a) Chemical structure of molecules composing Ester Gum. (b) σ-surface of abietic acid monoglycerol.
Hansen's solubility data47 for Ester Gum BL (Hercules Incorporated) in various solvents (complete list in SI) were used, with 80 of these solvents included in the present classification (Fig. 13). For the experiments, 0.5 g of polymer was placed in a laboratory glass tube with 5 mL of solvent. Hansen's qualitative solubility scoring was simplified into two categories: high (scores 1–2) and low (scores 3–6) solvent–polymer affinities.
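The two-category simplification of the scoring can be written as a small helper. This is only a sketch: the function name `affinity_class` and the example scores are hypothetical, and only the thresholds (1–2 high, 3–6 low) come from the text.

```python
def affinity_class(score):
    """Map a qualitative solubility score (1 to 6) to a binary affinity label."""
    if not 1 <= score <= 6:
        raise ValueError(f"score must be between 1 and 6, got {score}")
    # Thresholds from the text: scores 1-2 are 'high', scores 3-6 are 'low'.
    return "high" if score <= 2 else "low"

# Hypothetical scores, for illustration only (not data from the study).
hansen_scores = {"solvent_A": 1, "solvent_B": 4, "solvent_C": 2, "solvent_D": 6}
classes = {name: affinity_class(s) for name, s in hansen_scores.items()}
print(classes)
```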
Ester Gum is mostly apolar despite the presence of one or two –OH groups, which allow it to be soluble in polar and aprotic solvents (diethyl ether, pyridine). Some protic solvents (2-octanol, 2-ethyl-1-butanol) can also dissolve ester gum, provided they have a sufficient proportion of non-polar surface area. Conversely, short-chain alcohols (such as methanol, ethanol, tert-butanol, and butanol) are unable to dissolve it. In this case, the diversity of solvent regions covered by ester gum on the FPCA map (Fig. 13) reflects its versatility, which opens the possibility of applying additional filters (lower ecotoxicity, bio-based content, specific interaction with other ingredients, etc.) to narrow down the list of candidate solvents depending on the target application.
As shown in Fig. 14, triolein, which accounts for around 70% of the total fat content of rapeseed oil, has large low-polarized areas due to its three long alkenyl chains. Thus, aprotic apolar solvents such as cyclohexane, indicated on the FPCA map of Fig. 15, are good candidates for dissolving the oil because they interact with the long carbon chains of triolein. The three ester groups attract electron density through the mesomeric effect, generating electron-depleted zones around neighboring hydrogens. These zones act as anchoring points for aprotic polar solvents with mild to quite high polarity. Consequently, molecules located in the upper region of the FPCA space (Fig. 15) that have a sufficiently large apolar surface area to interact with the carbon chains, such as cyclohexanone, dioxolane, and 2-methyltetrahydrofuran, represent promising solvent candidates.
Greener solvents located near regions densely populated by “good toxic solvents” may offer promising alternatives in industrial applications to comply with international regulations, such as REACH, thereby reducing adverse effects on human health and the environment. Greener solvents, which often include bio-based, recyclable, eco-friendly and/or low-toxicity compounds, are increasingly incorporated into manufacturing processes, including pharmaceutical and agrochemical manufacturing, coating processes and polymer production. Their adoption not only supports sustainability goals but also enhances process safety, reduces waste and improves occupational health conditions. Alternative solvents can be found using this cartography and then selected according to the properties, beyond solubilization, required for the intended application.
Comparisons with earlier approaches that used PCA on multiple discrete descriptors confirm the superiority of FPCA in both interpretability and data fidelity, with 99.5% of the variance captured by only two components. By contrast, PCA applied to the same σ-potential dataset required four components to retain ∼96% of the variance and relied on a low-variance dimension as a major component for solvent differentiation, which led to overlapping clusters.
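To make the notion of "variance captured by two components" concrete, the following sketch runs a grid-based FPCA on synthetic curves. It is not the authors' procedure (which may rely on a basis expansion or a dedicated FPCA package): the function `fpca_grid`, the trapezoidal quadrature weighting, the σ range of ±0.03 e/Å² and the synthetic curves are all assumptions chosen for illustration.

```python
import numpy as np

def fpca_grid(curves, grid, n_components=2):
    """Grid-based FPCA: PCA with quadrature weights approximating L2 inner products.

    Returns component scores, eigenfunctions sampled on `grid`, and the
    fraction of total variance explained by each retained component.
    """
    curves = np.asarray(curves, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Trapezoidal weights so that sums over grid points approximate integrals.
    w = np.empty_like(grid)
    w[1:-1] = (grid[2:] - grid[:-2]) / 2.0
    w[0] = (grid[1] - grid[0]) / 2.0
    w[-1] = (grid[-1] - grid[-2]) / 2.0
    centered = curves - curves.mean(axis=0)
    u, s, vt = np.linalg.svd(centered * np.sqrt(w), full_matrices=False)
    explained = s**2 / np.sum(s**2)
    scores = u[:, :n_components] * s[:n_components]
    eigenfunctions = vt[:n_components] / np.sqrt(w)
    return scores, eigenfunctions, explained[:n_components]

# Synthetic curves built from two smooth shapes plus small noise
# (hypothetical stand-ins for sigma-potentials, not data from the study).
rng = np.random.default_rng(0)
grid = np.linspace(-0.03, 0.03, 61)           # assumed sigma range, e/A^2
basis = np.vstack([grid / 0.03, (grid / 0.03) ** 2])
coeffs = rng.normal(size=(50, 2))
curves = coeffs @ basis + 1e-3 * rng.normal(size=(50, 61))

scores, eigenfunctions, explained = fpca_grid(curves, grid)
print("variance captured by two components:", explained.sum())
```

With curves generated from two underlying shapes, two components recover nearly all of the variance by construction; real σ-potentials need not behave this way, which is what the reported 99.5% figure establishes for the actual dataset.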
The utility of the FPCA-derived solvent space was validated through three case studies: nitrocellulose, ester gum, and rapeseed oil. These examples demonstrate FPCA's predictive capacity for solvent efficacy, as the relative positioning of good and poor solvents consistently confirmed the explanatory power of the FPCA dimensions. For instance, solvents effective in dissolving nitrocellulose clustered in the zone of moderately high ECD and mild acidity, consistent with its polar nature. By contrast, the dissolution behaviour of ester gum and rapeseed oil, both more apolar in nature, was predominantly governed by solvent lipophilicity and auxiliary polarity features, as reflected by their distinct spatial distribution within the FPCA framework.
Because the FPCA space preserves the functional shape and nature of the σ-potential, solvents located in close proximity can be expected to behave similarly, even when not previously tested experimentally. This opens promising opportunities for identifying greener and safer alternatives to traditional solvents using distance metrics within this reduced two-dimensional space. The clustering of known “green” solvents in proximity to functional equivalents suggests that the FPCA space can serve as a preselection tool for candidates that align with regulatory (e.g., REACH, CHEM21) and sustainability goals. The addition of new solvents, either as pure compounds or as multicomponent mixtures, can be easily carried out once their associated σ-potentials have been obtained using COSMO-RS. These solvents can then be added to the database, and FPCA can be reapplied to obtain their corresponding coordinates in the 2D-mapping.
The direct relationship between activity coefficients and σ-potentials confirms that solvents with similar profiles may serve as substitutes. Our results demonstrate that the distance in FPCA space successfully captures this similarity, providing a method to identify alternatives based on proximity. This framework represents a novel preliminary tool for selecting alternative solvents. Crucially, it provides a solution for systems where COSMO-RS relative solubility cannot be applied due to the complexity of generating solute COSMO surfaces, as our methodology does not require these inputs.
Nonetheless, this approach has limitations. Although FPCA captures the shape of the chemical potential across the solvent surface, it does not explicitly encode molecular geometry, which can displace molecules from their expected behaviour, as in the case of tributylamine. The mapping obtained from FPCA also does not account for intermolecular interactions leading to the formation of dimers or aggregates. Moreover, the transformation from σ-surface to σ-profile and σ-potential entails information loss, omitting specific three-dimensional conformational and polarization details that may influence solvent behaviour in complex systems. In such cases, the respective contributions of conformer selection and specific hydrogen-bond coordination must be evaluated to ensure an optimal application of the methodology. These limitations highlight the need for future extensions that incorporate 3D geometric data to capture combinatorial contributions, or that explicitly consider the σ-surface and σ-profile, thereby transforming the plane into a three-dimensional representation.
The methodology also needs to be validated for predictive modelling, within QSPR models or ML approaches, to move from a qualitative description to a quantitative predictive capacity. Finally, σ-moment descriptors, which have attracted renewed interest,38 would be a suitable benchmark for our FPCA approach, although the σ-moments relate to both the σ-potential and the σ-profile while our PC1 and PC2 relate only to the σ-potential.
| This journal is © The Royal Society of Chemistry 2026 |