Sigma Profiles in Deep Learning: Towards a Universal Molecular Descriptor

This work showcases the remarkable ability of sigma profiles to function as molecular descriptors in deep learning. The sigma profiles of 1432 compounds are used to train convolutional neural networks that accurately correlate and predict a wide range of physicochemical properties. The architectures developed are then exploited to include temperature as an additional feature.


S1. The Sigma Profile
This section briefly examines the concept of the sigma profile (σ-profile) and summarizes the database developed by Mullins et al.,1 which was used throughout this work. Although some of the properties of the σ-profile are illustrated in this section, a complete and detailed description of the COSMO and COSMO-RS theories is beyond the scope of this work, and the interested reader is directed to seminal works in the field.2,3 The screened charges that constitute the σ-profile are calculated using quantum chemistry methods, in particular density functional theory (DFT). Briefly, a cavity representing the surface of a molecule is constructed and embedded in a continuum solvent with infinite permittivity (the COSMO solvation model). The geometry of the molecule is then optimized and, at each step, the screened charge of each surface area segment is computed. Because the molecule is optimized in a continuum solvent, rather than in vacuum, its σ-profile is a more faithful representation of its geometry, polarity, and potential for non-covalent interactions in condensed phases.

S1.1 Sigma Profiles as Molecular Descriptors
In the main text of this work, some claims were made regarding the advantages of σ-profiles as molecular descriptors in deep learning. These claims are now examined in more detail. To illustrate the type of chemical information encoded in σ-profiles, the σ-profiles of ethane, acetaldehyde, and ethanol, taken from the database of Mullins et al.,1 are depicted in Figure S1. Note that the three compounds form a series with increasing polarity but similar structural backbones. Also note that P(σ)·A values are plotted, i.e., σ-profiles weighted by the surface area of each molecule. The σ-profile of ethane, a fully apolar molecule, is located at σ values around 0 e/Å². Going from ethane to acetaldehyde, which represents the replacement of one proton by an oxygen, a new σ-profile region arises for σ values larger than 0.0082 e/Å². This is to be expected, since the oxygen adds a significant negatively charged surface area (or, in other words, a positive screened charge) to the molecule. Perhaps more interesting is the shift of the main apolar peak, which for ethane is located around a σ value of 0 e/Å² while for acetaldehyde it is located around -0.006 e/Å². This is the result of electron density asymmetry in the carbon-oxygen double bond; because oxygen is more electronegative than carbon, a dipole is formed due to electron transfer, with the carbon becoming slightly more positive than its ethane counterpart. Finally, a new peak in the positive region of the molecule (σ < -0.0082 e/Å²) arises for ethanol, as expected, while the apolar peak is located at a position intermediate between those of ethane and acetaldehyde. Note that the oxygen peak of ethanol is located at a larger σ value than that of acetaldehyde, indicating a more pronounced negative polarity, which is due to the pronounced electron density transfer across the O-H bond.
The previous paragraph illustrates some of the chemical properties encoded within the σ-profile, including subtle effects such as electron density asymmetries. Because of this, the σ-profile captures quite well the polarity of each local substructure of the molecule, which is the reason behind its high performance in describing non-covalent interactions. Having demonstrated the type of chemical information captured by the σ-profile, its usefulness with regard to a fixed input size for machine learning methodologies is now discussed. Figure S2 depicts the σ-profiles of decane, pentadecane, and icosane, taken from the database of Mullins et al.1 It shows that the main difference between these σ-profiles is the size of their apolar peak (the peak around a σ value of 0 e/Å²). This is quite important for machine learning methodologies. The size of most of the molecular descriptors mentioned in the main text (SMILES and SELFIES strings, fingerprints, graph representations, etc.) increases with the size of the molecule.5 Due to the inherent architecture of neural networks (and other machine learning methodologies), the size of their input must be fixed. Thus, if the size of the molecular descriptor increases with the size of the molecule, the neural network input must be sized according to the largest molecule available. However, if the σ-profile is used as a feature set in deep learning, the input size is no longer constrained by the structural size of the molecule, as this is captured by the value of the σ-profile at each σ value. In other words, because the screened charges of virtually all neutral molecules are contained in the range between σ = -0.025 e/Å² and σ = 0.025 e/Å², the size of a σ-profile input to a neural network depends only on the resolution chosen for the σ-profile (i.e., the number of points sampled from the σ-profile plot).
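The fixed-size property discussed above can be illustrated with a short sketch: regardless of how many surface segments a molecule has, binning their screened charge densities into a fixed σ-grid always yields a vector of the same length. The function below is purely illustrative (the segment charge densities and areas are made-up inputs, not data from the Mullins et al. database):

```python
import numpy as np

def sigma_profile(segment_sigmas, segment_areas,
                  sigma_min=-0.025, sigma_max=0.025, n_points=51):
    """Illustrative sketch: bin COSMO segment charge densities (e/A^2)
    into an area-weighted histogram on a fixed sigma-grid, so any
    molecule maps to a vector of the same length regardless of size."""
    centers = np.linspace(sigma_min, sigma_max, n_points)
    half = (centers[1] - centers[0]) / 2.0
    edges = np.concatenate(([centers[0] - half], centers + half))
    profile, _ = np.histogram(segment_sigmas, bins=edges,
                              weights=segment_areas)
    return profile  # fixed length n_points, in units of area

# Toy example: an apolar molecule has all its area near sigma = 0
sigmas = np.array([-0.002, 0.000, 0.001, 0.002])
areas = np.array([10.0, 25.0, 20.0, 8.0])
p = sigma_profile(sigmas, areas)
assert p.shape == (51,)
assert abs(float(p.sum()) - float(areas.sum())) < 1e-9  # area conserved
```

A molecule with twice as many segments would still produce a 51-entry vector; only the heights of the histogram change, not its length.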

S1.2 Sigma Profile Database
All σ-profiles used in this work were obtained from the database developed by Mullins et al.1 The data contained in this database were read using an in-house Python script and converted into a single CSV file, which is included in the GitHub repository associated with this work.
This Python script calculates the molar mass of each compound and validates its database entry by cross-checking with the Chemical Identifier Resolver (CIR) database, through its Python wrapper (CIRpy). 6 All changes to the original database were registered and are listed in the GitHub repository.
In particular, the following typos were identified and corrected:
- Index 78: CAS number changed from "4390-04-09 00:00:00" to "4390-04-9"
- Index 158: CAS number changed from "6094-06-2" to "6094-02-6"
- Index 162: CAS number changed from "7642-10-06 00:00:00" to "7642-10-6"
- Index 426: CAS number changed from "15798-64-8" to "4170-30-3"
- Index 1000: CAS number changed from "288761" to "2690-08-6"
- Index 1057: CAS number changed from "136911" to "2274-11-5"
- Index 1077: CAS number changed from "2148878" to "7783-06-4"
- Index 1080: CAS number changed from "2125683" to "7719-12-2"
- Index 1106: CAS number changed from "2125597" to "7719-09-7"
- Index 1240: CAS number changed from "2998-08-05 00:00:00" to "2998-08-5"
- Index 1259: CAS number changed from "4445-07-02 00:00:00" to "4445-07-2"
- Index 1692: CAS number changed from "91352" to "2150-02-9"

Each σ-profile extracted from the database developed by Mullins et al.1 is composed of 51 data points. These represent the σ-profile values at 51 different σ values, from σ = -0.025 e/Å² to σ = 0.025 e/Å², in intervals of 0.001 e/Å². All 51 σ-profile values were used as the input for the convolutional neural networks developed in this work. This is illustrated in Figure S3 for three assorted compounds (6-aminohexanol, chlorosulfonic acid, and methylamine). As is usual in the machine learning literature, the input data to a neural network should be normalized, as this enhances the convergence smoothness of the training of the network and allows for the use of advanced fitting techniques such as regularization. This means that the σ-profile data at each σ value must be normalized. However, these data do not follow a typical normal distribution, as illustrated in Figure S4. Figure S4 shows the distribution of the σ-profile at different σ values, revealing that the data follow log-normal probability distributions more closely than normal distributions. This means that the logarithm of the data, not the data itself, follows a normal distribution.
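As a side note, CAS registry numbers carry a built-in check digit (the digits, excluding the last one, are weighted 1, 2, 3, … from right to left, summed, and taken modulo 10), which offers a quick offline sanity check complementary to the CIRpy cross-check described above. The minimal validator below is a sketch, not the in-house script itself:

```python
def cas_checksum_ok(cas: str) -> bool:
    """Validate the check digit of a CAS registry number: weight the
    digits (excluding the final check digit) 1, 2, 3, ... from right
    to left, sum them, and compare the sum mod 10 to the check digit."""
    digits = cas.replace("-", "")
    if not digits.isdigit() or len(digits) < 5:
        return False
    body, check = digits[:-1], int(digits[-1])
    total = sum(w * int(d) for w, d in enumerate(reversed(body), start=1))
    return total % 10 == check

# The corrected entries pass the checksum, while the typos fail:
assert cas_checksum_ok("6094-02-6")       # corrected entry (index 158)
assert not cas_checksum_ok("6094-06-2")   # original typo (index 158)
assert cas_checksum_ok("7783-06-4")       # corrected entry (index 1077)
assert not cas_checksum_ok("91352")       # original typo (index 1692)
```

Note that the checksum catches transposed or truncated digits but not a valid CAS number assigned to the wrong compound, which is why the CIR cross-check remains necessary.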
Thus, the data are normalized in this work using log-standardization:

P′(σ) = ( ln[P(σ) + 10⁻³] − ⟨ln[P(σ) + 10⁻³]⟩ ) / s(ln[P(σ) + 10⁻³])    (S2)

where P′(σ) represents a normalized σ-profile series for a specific σ, whereas ⟨ln[P(σ) + 10⁻³]⟩ and s(ln[P(σ) + 10⁻³]) are the average and standard deviation of the ln[P(σ) + 10⁻³] data for a specific σ value. Note that the σ-profile can take the value of zero at any σ value; as such, a small buffer (10⁻³) is added to allow for the logarithmization of the data. The data depicted in Figure S4 are presented in Figure S5 after being normalized using Equation S2. As Figure S5 clearly shows, the normalization procedure represented by Equation S2 works well for σ-profile series at small σ values but not at larger σ values. Nevertheless, this normalization procedure is still more accurate than standardization alone and was, thus, employed throughout this work.
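A minimal sketch of this log-standardization, assuming the database column for one σ value is available as a NumPy array (the synthetic data below merely mimic the log-normal shape of Figure S4, including exact zeros):

```python
import numpy as np

EPS = 1e-3  # buffer so that P(sigma) = 0 can still be log-transformed

def log_standardize(p_sigma):
    """Equation S2 sketch: standardize ln[P(sigma) + 1e-3] for one
    sigma-value series across the whole database."""
    logged = np.log(p_sigma + EPS)
    return (logged - logged.mean()) / logged.std()

rng = np.random.default_rng(0)
# Synthetic stand-in for one sigma-value column of the database:
# roughly log-normal, with some exact zeros as in real sigma-profiles
column = np.concatenate([rng.lognormal(0.0, 1.0, 500), np.zeros(50)])
normalized = log_standardize(column)
assert abs(normalized.mean()) < 1e-9   # zero mean after normalization
assert abs(normalized.std() - 1.0) < 1e-9  # unit standard deviation
```

The mean and standard deviation computed here would be stored per σ value, so that new compounds can be normalized with the training-set statistics.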

S2. Physicochemical Properties
This section examines the physicochemical properties used in this work, particularly their distributions and normalization procedures. The properties studied in this work are:
- Molar mass
- Normal boiling temperature
- Vapor pressure at 25 °C
- Density (at 20 °C or temperature-dependent)
- Refractive index at a wavelength of 589 nm (at 20 °C or temperature-dependent)
- Aqueous solubility (at 25 °C or temperature-dependent)

Except for molar mass, which was calculated from the molecular structure of each compound, the data for all properties were taken, when available, from the CRC Handbook of Chemistry and Physics (Internet version).7 Note that density, refractive index, and aqueous solubility are available at different temperatures. In a first instance of this work, these three properties were selected at a specific temperature (20 °C or 25 °C). Then, for these three properties, the entire dataset was used, and temperature was explored as a feature in the neural networks developed in this work.

S2.1 No Temperature Input
The properties listed above were chosen to demonstrate the ability of σ-profiles to encode the information necessary to accurately fit and predict several different material properties. In particular, molar mass was chosen because i) it is available for all compounds included in the σ-profile database and ii) it shows that σ-profiles also encode molecular structure information that can be retrieved (this is not a trivial result, given how the position of different polarity peaks changes with the local structural environment of the molecule, as demonstrated in Figure S2). The sizes and data ranges of the datasets for each fixed-temperature property are reported in Table S1. The datasets are depicted as histograms in Figure S6, in a similar fashion to the analysis performed in the previous section for the σ-profile dataset. Besides summarizing the structure of the datasets available, Figure S6 sheds light on the best normalization procedure for each property. Because these properties are well described by normal distributions, with the exceptions of vapor pressure and aqueous solubility, they were normalized using standardization:

X′ = ( X − ⟨X⟩ ) / s(X)    (S3)

where X and X′ represent a generic property and its normalized version, respectively, while ⟨X⟩ and s(X) are the average and standard deviation of the dataset of X. Because vapor pressure and aqueous solubility are more accurately described by a log-normal probability distribution, they were normalized using log-standardization, akin to the procedure carried out for the σ-profile datasets but without the need for a non-zero buffer:

X′ = ( ln(X) − ⟨ln(X)⟩ ) / s(ln(X))    (S4)

where ⟨ln(X)⟩ and s(ln(X)) represent the average and standard deviation of the logarithmized dataset of X. The normalized data using these procedures (Equations S3 and S4) are depicted in Figure S7. Figure S7 shows that the normalization methods used on the datasets available are appropriate, as all datasets are well described by the standard normal distribution.
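The choice between standardization and log-standardization can be illustrated numerically: for data spanning several orders of magnitude, plain standardization leaves a strongly skewed distribution, while log-standardization largely removes the skew. A sketch with synthetic log-normal data standing in for, e.g., vapor pressure (the distribution parameters are arbitrary):

```python
import numpy as np

def standardize(x):
    """Equation S3: subtract the mean, divide by the standard deviation."""
    return (x - x.mean()) / x.std()

def log_standardize(x):
    """Equation S4: standardize ln(x); no buffer, property data are > 0."""
    lx = np.log(x)
    return (lx - lx.mean()) / lx.std()

def skewness(x):
    """Simple third-moment sample skewness."""
    z = (x - x.mean()) / x.std()
    return float((z ** 3).mean())

rng = np.random.default_rng(1)
# Stand-in for a property spanning orders of magnitude (e.g. vapor pressure)
prop = rng.lognormal(mean=0.0, sigma=2.0, size=2000)

# Log-standardization removes most of the asymmetry that plain
# standardization leaves behind
assert abs(skewness(log_standardize(prop))) < abs(skewness(standardize(prop)))
```

In practice one would inspect the histograms (as in Figure S6) or a skewness estimate like this one to decide which of the two normalizations to apply per property.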
It is particularly interesting to note the diversity of vapor pressure and aqueous solubility data, which was not as easy to see in Figure S6 given the fact that their values span across several orders of magnitude.

S2.2 Temperature as an Additional Feature
As mentioned in the previous section, some physicochemical properties (density, refractive index, and aqueous solubility) were also studied at all temperatures available in the CRC Handbook of Chemistry and Physics (Internet version). 7 The data are summarized in Table S2 and Figure S8, while the results of the normalization procedure are depicted in Figure S9. Not surprisingly, the data depicted in Figure S8 follow the same pattern as the data in Figure S6.

S3. Computational Details
This section provides details regarding the deep learning methodologies used throughout this work.
The Python code mentioned in this section can be accessed through the GitHub repository. The Keras module of the open source Python package TensorFlow (V. 2.6.0) 8 is the main machine learning engine used in this work.

S3.1 Neural Network Architecture
As mentioned in the main text, given the relatively small size of the datasets used in this work, a typical densely connected deep neural network could not be used, as this would lead to neural networks with an enormous number of fitting parameters. Thus, convolutional neural networks were used instead. Convolution is a technique useful for large inputs where data are somewhat ordered, such as temporal plots, where the values around a given time instance are highly correlated, or images, where pixels in each region of the image are also highly correlated. 9 The generic architecture of each convolutional neural network (CNN) developed is depicted in Figure 1 of the main text. Note that a different CNN was developed to fit and predict each different property.
This subsection provides details about their general architecture.
The main input of each CNN is a one-dimensional vector composed of 51 entries (the 51 σ-profile data points of each molecule). This input is passed through two sets of convolution (TensorFlow module Conv1D) and pooling layers (TensorFlow module AveragePooling1D or MaxPool1D). The number of filters, kernel size, and number of strides in each convolution layer are hyperparameters to be tuned, as discussed in Section S3.3. The pool size of the pooling layers is also set as a hyperparameter to be tuned, and the number of strides is set equal to the pool size. In both convolution and pooling layers, padding is allowed (TensorFlow keyword same). Thus, if the ratio between the input size of the layer and its kernel or pool size is not an integer, zeros are evenly added at the beginning and end of the input until this ratio becomes a whole number. After the two sets of convolution and pooling layers, the output is flattened and passed through two densely connected layers (TensorFlow module Dense). The number of nodes in each of these layers is a hyperparameter to be tuned.
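The effect of padding="same" on the layer sizes can be summarized in one formula: the output length is ceil(n_in / stride), independent of the kernel or pool size. A small sketch tracing the 51-point input through two pooling layers, using an illustrative pool size of 2 (the actual pool sizes are tuned hyperparameters):

```python
import math

def same_padding_output_size(n_in: int, stride: int) -> int:
    """With TensorFlow's padding='same', the output length of a Conv1D
    or pooling layer depends only on the input length and the stride:
    out = ceil(n_in / stride). The required zeros are split as evenly
    as possible between the two ends of the input."""
    return math.ceil(n_in / stride)

# Tracing the 51-point sigma-profile vector through two pooling layers
# with pool size (= stride) 2:
n = 51
n = same_padding_output_size(n, 2)   # after the 1st pooling layer
n = same_padding_output_size(n, 2)   # after the 2nd pooling layer
assert n == 13
```

Convolution layers with stride 1 and padding="same" preserve the input length, so in this illustrative configuration only the pooling layers shrink the 51-point vector before it is flattened.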
The activation function of all nodes, whether in a convolution or a dense layer, is set as a hyperparameter to be tuned. Because the work developed here concerns regression rather than classification, two types of activation functions were examined: ReLU and Swish. Swish is a more recent activation function that mitigates some of the well-known disadvantages of the ReLU activation function, outperforming it in most deep learning methodologies.10
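For reference, the two candidate activation functions can be written in a few lines of NumPy (Swish is shown with its default β = 1):

```python
import numpy as np

def relu(x):
    """ReLU: zero for negative inputs, identity for positive inputs."""
    return np.maximum(0.0, x)

def swish(x):
    """Swish: x * sigmoid(x). Unlike ReLU it is smooth and non-zero for
    negative inputs, which mitigates the 'dying ReLU' problem, and it
    approaches the identity for large positive inputs."""
    return x / (1.0 + np.exp(-x))

x = np.array([-2.0, -0.5, 0.0, 1.0, 3.0])
assert np.all(relu(x) >= 0.0)        # ReLU clips all negatives to zero
assert swish(-0.5) < 0.0             # Swish keeps a small negative response
```

The gradient of Swish is non-zero almost everywhere, whereas ReLU units whose pre-activation stays negative stop learning entirely.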

S3.2 Neural Network Fitting
To ensure an efficient fitting of the networks and to maximize their generalization capability, several techniques were used, namely stratified data splitting, regularization, and early stopping, as detailed below.
The dataset of each property was initially split into a fitting dataset (90%) and a testing dataset (10%). The testing dataset was never used in any of the tuning or fitting procedures described below; instead, it was employed to independently evaluate the final model obtained for each property. The fitting dataset of each property was further split into training (80%) and validation (20%) sets using stratified splitting based on the values of the labels. In each data split performed, the labels of the dataset were divided into equidistant bins. The number of bins was selected as the largest number that ensured each bin possessed at least five data points, with no restriction on the maximum number of data points per bin. Finally, the splitting was performed such that the training/validation ratio within each bin approximately matched the overall ratio (80/20). This procedure ensures that data lying at the edges of each dataset range are still well represented in the validation set.
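The stratified-splitting procedure can be sketched as follows. This is a sketch rather than the authors' actual script; the bin-count rule is interpreted here as the largest number of equidistant bins for which every bin still retains at least five labels:

```python
import numpy as np

def choose_n_bins(labels, min_per_bin=5, max_bins=50):
    """Largest number of equidistant bins for which every bin still
    holds at least `min_per_bin` labels (one reading of the rule)."""
    for n_bins in range(max_bins, 1, -1):
        counts, _ = np.histogram(labels, bins=n_bins)
        if counts.min() >= min_per_bin:
            return n_bins
    return 1

def stratified_split(labels, val_fraction=0.2, seed=0):
    """Split indices into training/validation per equidistant label bin,
    so the edges of the label range stay represented in validation."""
    rng = np.random.default_rng(seed)
    n_bins = choose_n_bins(labels)
    edges = np.linspace(labels.min(), labels.max(), n_bins + 1)
    bin_ids = np.clip(np.digitize(labels, edges[1:-1]), 0, n_bins - 1)
    train, val = [], []
    for b in range(n_bins):
        idx = np.flatnonzero(bin_ids == b)
        rng.shuffle(idx)
        n_val = max(1, round(val_fraction * len(idx)))
        val.extend(idx[:n_val])
        train.extend(idx[n_val:])
    return np.array(train), np.array(val)

labels = np.random.default_rng(2).normal(size=500)
tr, va = stratified_split(labels)
assert len(tr) + len(va) == 500        # every point is assigned exactly once
assert 0.15 < len(va) / 500 < 0.25     # overall ratio close to 80/20
```

Because every bin contributes at least one point to the validation set, rare extreme labels cannot end up exclusively in the training set.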
L2 regularization was employed on all layers that contain trainable parameters (convolution and dense layers) for both the weights and biases. The lambda parameter of the regularization was left as a hyperparameter to be tuned. The use of L2 regularization is a great asset to prevent overfitting of the neural networks and to maximize their generalization capability. In particular, L2 regularization, as opposed to L1 regularization, is better suited to feature sets that present a high degree of multicollinearity (local correlation), such as those used in this work (σ-profiles).
The Adam optimizer11 with a mean-squared-error loss function was used to fit all neural networks developed in this work. The learning rate and its exponential decay rates (β1 and β2) were left as hyperparameters to be tuned. Each neural network was fitted for an arbitrarily large number of epochs, and early stopping was used with a patience of 500 epochs (the number of epochs with no improvement after which training is stopped) to halt the fitting when the mean absolute error of the validation set stopped improving. Note that the batch size during fitting was also left as a hyperparameter to be tuned. Finally, the initial values of the biases were set to zero and the initial guesses for all weights of each CNN were obtained using the methodology developed by He et al.12 (TensorFlow keyword HeUniform), which is particularly efficient at initializing the weights of convolutional neural networks.
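The patience mechanism behind early stopping can be sketched in a few lines (a toy patience of 3 is used below in place of the 500 epochs employed in this work):

```python
class EarlyStopping:
    """Minimal sketch of patience-based early stopping: training halts
    after `patience` consecutive epochs without improvement of the
    monitored validation metric (mean absolute error here)."""
    def __init__(self, patience=500):
        self.patience = patience
        self.best = float("inf")
        self.wait = 0

    def should_stop(self, val_mae: float) -> bool:
        if val_mae < self.best:   # improvement: reset the counter
            self.best = val_mae
            self.wait = 0
        else:                     # no improvement: count this epoch
            self.wait += 1
        return self.wait >= self.patience

# Toy run: the validation MAE stops improving after epoch 2
stopper = EarlyStopping(patience=3)
history = [1.0, 0.8, 0.7, 0.7, 0.71, 0.72, 0.9]
stopped_at = next(i for i, mae in enumerate(history)
                  if stopper.should_stop(mae))
assert stopped_at == 5  # 3 epochs (indices 3, 4, 5) without improvement
```

Keras provides this behavior out of the box via its EarlyStopping callback; the sketch only makes the counting rule explicit.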

S3.3 Hyperparameter Optimization
The hyperparameters of the convolutional neural networks to be optimized, which were mentioned throughout the previous two sections, are listed in Table S3. The hyperparameter tuning was performed using the open-source Python package Sherpa (V. 1.0.6).13 Because of the tremendous number of possible hyperparameter combinations, which renders any attempt to test each combination computationally infeasible, a two-fold tuning strategy was adopted, first using Bayesian optimization and then a local search algorithm, as detailed below. Regardless of the tuning algorithm, the objective function was defined in terms of R², the coefficient of determination of the validation set, n_p, the total number of trainable parameters of the CNN (not to be confused with the hyperparameters), and n, the total number of data points in the training set. This objective function greatly penalizes the number of trainable parameters of the CNN, a technique employed here to prevent overfitting and maximize generalization capability.

As stated above, the first stage of the hyperparameter tuning is Bayesian optimization. During this step, the fitting hyperparameters were fixed (a learning rate of 10⁻³, β1 = 0.99, β2 = 0.999, λ = 10⁻³, and a batch size of 16) and the architecture hyperparameters were tuned. A schematic representation of this procedure is depicted in Figure S10. A local search algorithm was employed after the initial Bayesian optimization step. This algorithm takes a configuration seed (a hyperparameter set that has been pre-optimized) and makes small, stepwise changes to its parameters, trying to find a slightly different configuration that performs better. The algorithm was first applied to the architecture hyperparameter space using the best Bayesian optimization hyperparameter set as the configuration seed. Then, the results of this first iteration were used as the next configuration seed, and the fitting hyperparameters were tuned. This process was repeated back and forth five times, as illustrated in Figure S11. During this entire process, all random processes were fixed using seeds. Finally, once the best architecture and fitting hyperparameters were found, the random seed was allowed to change, and several attempts were made to fit the final network.
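The local-search stage can be sketched as follows; the hyperparameter names, step sizes, and toy objective below are illustrative placeholders, not the actual search space of Table S3:

```python
import random

def local_search(seed_config, objective, steps, n_iters=50, rng=None):
    """Sketch of the local-search stage: starting from a pre-optimized
    configuration seed, repeatedly try single-hyperparameter
    perturbations and keep any change that improves the objective
    (lower is better)."""
    rng = rng or random.Random(0)
    best = dict(seed_config)
    best_score = objective(best)
    for _ in range(n_iters):
        name = rng.choice(list(steps))       # pick one hyperparameter
        candidate = dict(best)
        candidate[name] = best[name] + rng.choice(steps[name])
        score = objective(candidate)
        if score < best_score:               # greedy acceptance
            best, best_score = candidate, score
    return best, best_score

# Toy objective with a known optimum at n_filters=8, pool_size=2
def toy_objective(cfg):
    return (cfg["n_filters"] - 8) ** 2 + (cfg["pool_size"] - 2) ** 2

seed_cfg = {"n_filters": 4, "pool_size": 4}
cfg, score = local_search(seed_cfg, toy_objective,
                          steps={"n_filters": [-1, 1], "pool_size": [-1, 1]})
assert score <= toy_objective(seed_cfg)  # never worse than the seed
```

In the work described above, the roles of `objective` and the step sets are played by the validation-based objective function and the hyperparameter grid of Table S3, and the seed alternates between the architecture and fitting hyperparameter sets.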

S4. Additional Results
This section presents the intermediate results of this work, including the results of the hyperparameter tuning and the performance of the convolutional neural networks. The results are divided between those relating to the temperature-independent datasets and those relating to the networks where temperature is used as a feature.

S4.1 No Temperature Input
The results of the Bayesian optimization (the first step in the hyperparameter optimization) for the temperature-independent datasets are reported in Figure S12. As explained in Section S3.3, the best result of the Bayesian optimization for each dataset was used as the initial configuration seed for the local search algorithm. The results of this hyperparameter tuning step are reported in Figure S13. The best hyperparameter set for each CNN developed, obtained using the aforementioned procedures, is reported in Table S4. The performance of each CNN is depicted in Figure 2 of the main text and in Figure S14. Note that the performance of the CNN for density in the testing set is particularly biased by an outlier. If this outlier is removed, the coefficient of determination changes from 0.68 to 0.79 and the mean absolute error from 0.09 g/mL to 0.07 g/mL.

S4.2 Temperature as an Additional Feature
The results of the Bayesian optimization (the first step in the hyperparameter optimization) for the convolutional neural networks that use temperature as an additional input are reported in Figure S15. Akin to the procedure carried out in the previous section, the best result of the Bayesian optimization for each dataset was used as the initial configuration seed for the local search algorithm. The results of this hyperparameter tuning step are reported in Figure S16. The best hyperparameter set for each CNN developed, obtained using the aforementioned procedures, is reported in Table S5.
a) The 2nd convolution layer is not used. b) The 2nd dense layer is not used.