Re ﬁ nement and extension of COSMO-RS-trained fragment contribution models for predicting the partition properties of C 10 – 20 chlorinated para ﬃ n congeners †

COSMO-RS-trained fragment contribution models (FCMs) to predict the partition properties of chlorinated para ﬃ n (CP) congeners were re ﬁ ned and extended. The improvement includes (i) the use of an improved conformer generation method for COSMO-RS, (ii) extension of training and validation sets for FCMs up to C 20 congeners covering short-chain (SCCPs), medium-chain (MCCPs) and long-chain CPs (LCCPs), and (iii) more realistic simulation of industrial CP mixture compositions by using a stochastic algorithm. Extension of the training set markedly improved the accuracy of model predictions for MCCPs and LCCPs, as compared to the previous study. The predicted values of the log octanol/water partition coe ﬃ cients ( K ow ) for CP mixtures agreed well with experimentally determined values from the literature. Using the established FCMs, this study provided a set of quantum chemically based predictions for 193 congener groups (C 10 – 20 and Cl 0 – 21 ) regarding K ow , air/water ( K aw ), and octanol/air ( K oa ) partition coe ﬃ cients, subcooled liquid vapor pressure (VP) and aqueous solubility ( S w ) in a temperature range of 5 – 45 (cid:1) C as well as the respective enthalpy and internal energy changes. The partition properties for chlorinated para ﬃ ns (CPs) are di ﬃ cult to determine both experimentally and theoretically due to the extremely high complexity of CP mixtures. COSMO-RS-trained fragment contribution models re  ned in this study enabled rapid calculations of partition properties for an extended set of individual CP congeners. Using FCMs, this work generated a large data set of quantum-chemically based predictions of partition properties for 193 congener groups ( i.e. , homologue sets). These data can be used for, e.g. , environmental fate and bioaccumulation models or be compared to  eld monitoring data to understand the environmental behavior of CPs.


Introduction
Chlorinated paraffins (CPs) are mixtures of polychlorinated nalkanes with different carbon chain lengths and chlorination patterns. Depending on the length of the carbon chain, CPs are classied into short-chain (SCCPs, C 10-13 ), medium-chain (MCCPs, C [14][15][16][17] ) and long-chain CPs (LCCPs, C 18+ ). SCCPs are restricted under the Stockholm Convention due to their recognized persistent, bioaccumulative and toxic properties. 1 MCCPs and LCCPs are also of environmental concern because of their frequent detection in various environmental and biological samples [2][3][4][5] including samples from humans. 6,7 Environmental property values such as equilibrium partition coefficients (K) need to be known for environmental fate and risk assessments, although the extremely high complexity of CP mixtures hampers the use of conventional experimental and theoretical approaches to obtain their property data.
In the previous study, 8 we proposed to calibrate a fragment contribution model (FCM) with log K values predicted by the COSMO-RS (conductor-like screening model for real solvents) method for developing property prediction models for individual CP congeners. This approach combines the advantages of both FCM and COSMO-RS methods. Thus, FCMs are an empirical model that is computationally simple and fast but need to be trained with a large and diverse set of data. FCMs are usually trained with experimental data, 9,10 which however are not sufficiently available for CP congeners. COSMO-RS is quantum-chemically based prediction theory for solute activity in solvent and can provide accurate predictions for partition properties such as K values. 11 Glüge et al. 12 demonstrated that COSMO-RS was the most accurate of the three methods that they compared for predicting partition properties of CPs. Our previous work 8 also showed that COSMO-RS predictions for partition coefficients of individual CP congeners were mostly within 1 log unit of the experimental data available. A strong advantage of COSMO-RS is that it can calculate properties from the molecular structure and does not need any additional empirical calibration. Nevertheless, the quantum chemical calculations are highly time-demanding for CP molecules and cannot be performed for thousands of congeners possibly present in CP industrial mixtures. The previous work 8 demonstrated that FCMs trained with 815 COSMO-RS-predicted K values can predict the original COSMO-RS calculations within an RMSE of 0.1-0.3 log units. The trained FCMs were able to provide predictions for octanol/water (K ow ), air/water (K aw ) and octanol/air (K oa ) partition coefficients for as many as 52 000 CP congeners within a reasonable time. The predicted values were shown to agree well with available experimental log K values for individual congeners.
To provide even more useful sets of property data for individual CP congeners, congener groups (dened here as congeners with the same molecular formula; also referred to as homologue groups or as isomers), and whole mixtures, the present work rened and extended the COSMO-RS-trained FCM approach in the following regards. First, the training and validation sets for FCM development were revised, covering MCCPs and LCCPs, which were le out of consideration in the previous work. 8 Second, COSMO-RS-trained FCMs for subcooled liquid saturation vapor pressure (VP) and aqueous solubility (S w ) were developed in addition to K ow , K aw , and K oa . Third, the present work calibrated FCMs for these properties at differing temperatures as well as enthalpy changes (DH) and internal energy changes (DU). Fourth, a Monte Carlo method was introduced that simulates synthetic pathways of CP molecules and predicts congener compositions of technical mixtures with varying degree of chlorination. 13 Combined with the trained FCMs, this stochastic method should offer more realistic property distributions of CP mixtures in comparison to the random generation of congeners as adopted in the previous study. All predicted data are provided as tables and gures in the ESI † together with R scripts for additional calculations to facilitate the use of outputs from this work.

Method overview
The workow of this study is summarized in Fig. 1. The properties considered are log K ow , log K aw , log K oa , log VP (in Pa), and log S w (in mol L À1 ) at 5, 15, 25, 35 and 45 C and corresponding temperature-dependence values which are required for interpolation of these properties (DH ow , DU aw , DU oa , DH vap , and DH diss , respectively). COSMO-RS was used to generate training and validation data for FCMs. A set of congener structures built by the Monte Carlo model was passed to the trained FCMs to predict property distributions for CP mixtures. Details are explained in the following sections.

COSMO-RS
In this work, COSMOconfX 20, TURBOMOLE 7.4 and COSMO-thermX 20 (all from COSMOlogic, Biovia, Dassault Systèmes) were used for implementation of COSMO-RS. The rst two programs provide an optimal set of conformers of each chemical and derive the cosmo and energy les that contain the information (e.g., energy, and surface screening charge density) needed for successive calculations in COSMOthermX. COSMO-confX and TURBOMOLE were run on the NIES supercomputer system (HPE Apollo 2000 System). COSMOthermX with the BP_TZVPD_FINE_20 parameterization was used to calculate the property values. S w was calculated as VP/(RTK aw ) for consistency and computational efficiency, where R is the gas constant and T is the absolute temperature. S w calculated here is for dry, pure subcooled liquid and does not consider mutual saturation with the water phase. The effect of mutual saturation is expected to be small because of low S w of CP congeners. Note that K ow was calculated with wet octanol and K oa with dry octanol. Temperature dependence was evaluated by calculating the properties for 278.15, 288.15, 298.15, 308.15 and 318.15 K. As the log of the COSMOtherm-predicted values was linear against 1/T, the slope of the following equation was used to derive DH and DU, where SP is a given solute property, R is the gas constant, and c is the regression constant. CP congeners can have many stereoisomers because each ÀCHClgroup can be chiral. Moreover, all CP molecules have many rotatable bonds and thus an enormous number of possible conformers exist. Due to these structural features, two considerations appear to be necessary when COSMOconfX is used. First, the default algorithm of the COSMOconfX sometimes generates stereochemically inconsistent output structures as compared to the original input structure, as stated in the previous article. 8 This problem can be circumvented by using the Windows version of COSMOconfX, removing the RDKit conformer generation step, and running only the Balloon method to generate initial candidate conformers. Second, Fig. 1 Workflow for predicting the property distributions of CP mixtures.
COSMOtherm prediction for CPs sometimes depends on the original input structure entered in the COSMOconfX program. It appears that the default algorithm of COSMOconfX cannot always nd the optimal conformation of a CP molecule in the gas phase. This results in sporadically large solvent-air (or low air-solvent) partition coefficients (up to 0.47 log unit difference). Some examples are provided in ESI-1 (see ESI-A, Fig. S1, S2 †). Indeed, by increasing the number of initial candidate conformers to consider (ca. 4 times), it was possible to reduce variability and obtain more repeatable predictions (Fig. S1, S2 †).

Fragment contribution models (FCMs)
FCMs are a linear regression model that uses the numbers of molecular substructures (i.e., fragments) as independent variables to describe the properties of interest. The implementation of FCMs for CP congeners was described in the previous article. 8 In brief, molecular fragments with one to four C atoms were counted for each CP congener. In total, 310 fragment types were considered, including 78 fragments that describe the diastereomeric structure (see the next paragraph for more explanation). For training of the FCMs, COSMOtherm-predicted property values were regressed against these fragment counts. The forward and backward stepwise approach was performed with Akaike's Information Criterion (AIC) as the evaluation metric to select an optimal set of fragments. In contrast to the previous work, this work only reports the model that uses all C 1 -C 4 fragments, denoted as the Level 4 model in the previous article, because this model always performed the best. The partial least squares regression (PLSR) was also computed but external validation showed that the PLSR-derived model was not better than the model directly from the stepwise regression; thus, the PLSR results are not reported here. Fragment counting and statistical analysis were all performed with R (3.6.2) using ChemmineR Some more explanations for fragments that include the diastereomeric structure may be useful. A C 2 -fragment -CHCl-CHCl-is the simplest example of fragments with a diastereomeric structure. Both C atoms can be chiral, and depending on their rotational congurations, two diastereomeric structures are possible: one with the two C atoms having the same rotational conguration and the other having the opposite rotational congurations. Thus, this study counted "the total number of (stereometrically nonspecic) -CHCl-CHCl-" and "the number of -CHCl-CHClwith the same rotational conguration" and used these counts as FCM variables. C 3 -and C 4 -fragments are obviously more complicated. The principle is that, for each fragment, all possible stereometrically specic structures are listed, meso-structures are identied to avoid double counting, and the numbers of enantiomeric structures are combined to account for difference in diastereomers.

Training and validation sets for FCMs
Overall, 1070 congeners for training and 420 congeners for validation were prepared for this work (Fig. 2). Among these, 815 training (grouped as "T0" in Fig. 2) and 120 validation congeners (V0) are those used in the previous study. 8 These original congeners are short and were chosen before because of relatively fast quantum chemical computation. Note, cosmo and energy les for T0 and V0 were all updated in this work using the rened COSMOconfX procedure, as discussed above. In addition, this work prepared 50 congeners each for C [11][12][13] and 30 congeners each for C 14 , C 16 , and C 18 as additional training compounds. Moreover, 15 n-alkanes (C 6-20 ) were added to enlarge the domain of training. The validation set was also extended by adding 30 congeners each for C [11][12][13][14][15][16][17][18][19][20] congeners.
For a given carbon chain length, structures of training and validation congeners were generated in a random manner. In the previous study, 8 H or Cl was assigned to a substitution position at a probability of 50%. Thus, on average, half of the positions were substituted with Cl, equivalent to the mean chlorination degree of 75 wt%. This is relatively high as compared to typical compositions of industrial mixtures (i.e., 30-70 wt%). Moreover, there was an indication that the FCMs trained in the previous study were comparatively inaccurate for low chlorinated congeners. 8 Therefore, for generation of additional training and validation congeners, the current study adopted a reduced probability of chlorination (53 wt% Cl on average) so that lower chlorinated congeners could be better represented.
To test the inuence of the training set on the prediction accuracy, training congeners were grouped into 3 sets (T0, T1, and T_all) and validation congeners into 5 sets (V0, V1, V2, V3 and V_All), as shown in Fig. 2, and the FCM results of all combinations were compared.

Monte Carlo model
Industrial CP mixtures contain an unknown number of congeners with differing molecular structures. To predict the property distributions of mixtures using FCMs, a set of molecular structures that represents the composition of the mixture needs to be prepared. In the previous study, 8 molecular structures were generated on a random basis. However, past studies have shown that Cl substitution does not randomly occur. 16,17 To obtain a more representative ensemble of molecular structures, this work used the Monte Carlo approach developed by Jensen et al. 13 The algorithm of Jensen et al. starts with a set of say 10 000 n-alkane molecules and simulates the free-radical chlorination reaction as a series of substitution of H with Cl. Different probabilities for Cl substitution are assigned to C atoms in the alkane chain, depending on the neighboring Cl substitution patterns. The probabilities were set based on the available experimental data. 13 As this stochastic simulation proceeds, the n-alkane molecules are increasingly chlorinated, exhibiting different substitution patterns, and lead to a nal set of CP structures which is expected to mimic the composition of the actual CP mixture.
In this work, the algorithm from Jensen et al. was used with some modications. First, while the original model randomly selected a C atom to be challenged by chlorine, this work randomly selected an H position instead. This change was made to generate stereoisomers. The two H atoms at a C atom were assumed to have the same probability for Cl substitution (i.e., the same probability to obtain R and S congurations). Second, the three H positions of a terminal C atom were attacked at 2/3 probability of the H positions inside the chain, so that the probability remains uniform with regard to all C atoms. Third, at each reaction step, 1% of the total alkane molecules were selected for possible reaction, instead of one molecule by Jensen's approach, in order to speed up the simulation. The reaction cycle was repeated until the wt% of Cl exceeded the pre-set value (e.g., 40 wt%). A list of SMILES strings was then generated according to the results of simulation and passed to the FCMs to predict the distributions of physicochemical properties for the given mixture. An R code is provided in the ESI zip e to run this Monte Carlo simulation.
The simulation was performed for C 10-20 alkanes with 30-70 wt% Cl. On the basis of preliminary test runs (see ESI-B, Fig. S3 †), the number of n-alkane molecules was set to 10 000 for each combination of the chain length and Cl wt% so that we can obtain reliable K distributions for all congener groups that make up more than 1 mol% of the mixture.

Training and validation sets for FCMs
Prediction accuracy of trained FCMs as evaluated by root mean squared errors (RMSEs) depended on the combination of training and validation sets. RMSEs for log K aw at 25 C are presented in Fig. 3 as an example (see more data in Fig. S4 and Table S1 †). T0-V0 is the combination of training and validation sets used in the previous study. 8 RMSEs for this combination appreciably improved; thus, RMSEs were 0.095, 0.246 and 0.190 for log K ow , log K aw and log K oa , respectively, in this study, while they were 0.123, 0.286 and 0.207, respectively, in the previous study. This improvement should result from more accurate calculations of COSMOtherm due to the improved COSMOconfX procedure because nothing has changed in the FCM method.
Training set T0 resulted in elevated RMSEs for validation sets V1, V2 and V3 (Fig. 3). Thus, the use of congeners with the mean Cl content of 75 wt% (T0) as training compounds led to less accurate predictions for congeners with 53 wt% Cl (V1, V2, V3). Training FCMs with T1 (including congeners with 75 and 53 wt% Cl) resulted in a marked improvement in predicting the V1, V2, and V3 sets. Still, predictions were less accurate for MCCPs (V2) and LCCPs (V3) compared to for SCCPs (V1). The inclusion of M/LCCPs in the training set (i.e., T_all) resulted in only a minor improvement in the prediction of V2 and V3 congeners, suggesting that extrapolation to longer congeners is less of a problem. Higher  RMSEs for V2 and V3 could be related, in part, to lower precision of COSMOtherm predictions for long molecules with a large number of possible conformers (see Section 2.2).
Overall, it can be concluded that extending the training set from T0 to T_all substantially improved the domain of applicability of the trained FCMs. Fig. 6 Mean numbers of C 1 -fragments (upper panels) and C 2 -fragments (lower panels) per molecule. Fragments with the mean number per molecule <0.5 were omitted to avoid overcrowded plots. "CHCl-CHCl*" is a diastereomerically specific fragment with the two C atoms that are in the same rotational configurations. 3.2 FCMs for K ow , K aw , K oa , VP and S w at 5-45 C and the respective DH and DU FCMs for log K ow , log K aw , log K oa , log VP and log S w at 5-45 C as well as their respective DH or DU were calibrated with the T_all set and validated with the V_all set. All results are presented in Fig. S5 and Tables S2. † RMSE values for training and validation are summarized in Fig. 4. An R code and its associated les are provided in the ESI zip le so that the reader can use the trained FCMs to obtain predictions for all properties considered here for given CP congeners. RMSE values for log K's, log VP and log S w were 0.05-0.15 for training and 0.09-0.28 for validation. These ranges are similar to those of the previous work. 8 The FCM tted well with the COSMOtherm-predicted values in the entire range of property values (Fig. S5 †). The validation RMSEs indicate that the trained FCMs can predict the COSMOtherm-predicted values for C [10][11][12][13][14][15][16][17][18][19][20] CP congeners to the accuracy of 0.1-0.3 log units on average. RMSE values for DH's and DU's were 0.5-1.3 kJ mol À1 for training and 0.9-2.1 kJ mol À1 for validation. Generally, the RMSE tends to be lower for liquid/liquid partition properties (i.e., K ow , S w , DH ow , DH diss ) than for liquid/air partition properties (i.e., K aw , K oa , VP, DU aw , DU oa , and DH vap ). This trend may be related to COSMOtherm's precision which is expected to be higher for the liquid/liquid partition properties, because energy terms and their errors calculated for the two liquid phases tend to be cancelled, whereas such cancelation does not occur for liquid/air partitioning. Fig. 4 also shows that RMSEs for both training and validation decrease with temperature. As the intermolecular interaction energy generally diminishes with temperature, the contribution of each fragment decreases, and so does the tting error.
FCM predictions are compared to the available experimental data for log K ow and log K aw of SCCP congeners (Fig. S6 †). FCM predictions in the previous work already agreed well with the experimental data, and FCM predictions from this study agreed    To give an overview of the abundance of fragment types generated by the Monte Carlo simulations, the mean numbers of the C 1 and C 2 fragments per molecule are plotted ( Fig. 6; see Fig. S8 and S9 † for other carbon chain lengths). CH 2 is the most abundant C 1 fragment in low-chlorinated mixtures (#50 wt% Cl). The number of CHCl increases with increasing degree of chlorination, and CHCl is the major C 1 fragment for mixtures with $60 wt% Cl. CCl 2 is minor except for highly chlorinated congener groups (i.e., the number of Cl $ the number of C). The terminal fragments are always minor because there can only be two per molecule. The terminal C atoms can also be chlorinated, but they are less oen chlorinated than C atoms within the chain. Distributions of the C 2 fragments show that CH 2 -CHCl is oen the major C 2 fragment that carries Cl. CHCl-CHCl rather than CH 2 -CCl 2 emerges when the number of CH 2 -CHCl approaches the total number of C atoms divided by 2. C 2 -fragments with double chlorinated C (CH 2 -CCl 2 and CHCl-CCl 2 ) occur only in highly chlorinated congeners. These fragment patterns reect the model parameterization that is based on the knowledge of the varying probabilities for chlorination of C atoms.

Property distributions for CP technical mixtures
For each mixture with a given combination of a carbon chain length (C [10][11][12][13][14][15][16][17][18][19][20] ) and degree of chlorination (30-70 wt% Cl), 10 000 CP molecules were generated by the Monte Carlo approach. Then, property values were predicted by the FCMs. The results for the 55 mixtures (11 C lengths Â 5 Cl degrees) were sorted according to the congener groups (i.e., molecular formula) and are shown with histograms and probability density curves (ESI-2 †). Fig. 7 shows two examples of the distributions of K ow , K aw , and K oa . Low chlorinated (30, 40 wt% Cl) mixtures are characterized by (i) bell-shape overall K distributions with some sharp spikes originating from low chlorinated (Cl 0-3 ) congeners, (ii) a narrow range (2 log units) of log K ow , (iii) a wide range (6 log units) of log K aw , and (iv) a strong correlation between log K aw and log K oa . In contrast, highly chlorinated (60, 70 wt% Cl) mixtures are characterized by (i) symmetric, bell-shaped overall K distributions for all three log K values, (ii) an increase of log K ow with the number of Cl, (iii) broad but Cl-independent log K aw , and (iv) no or an only weak correlation between log K aw and log K oa . Interestingly, these qualitative trends were common for all chain lengths from C 10 to C 20 . That is to say, while the absolute values of log K depend on the chain lengths, the relative distribution patterns as shown in Fig. 7 do not depend on the chain lengths. The distribution patterns of log VP and log S w are only shown in the ESI, † but the former is similar to log K oa and the latter to log K ow .

Does the property distribution of each congener group depend on the mixtures?
Identifying whether the property distribution of a congener group in one mixture is equivalent to that in another mixture is of interest. For instance, C 10 Cl 4 congeners exist in any of C 10mixtures with 30, 40, 50 and 60 wt% Cl in an amount greater than 1 mol%, according to the Monte Carlo simulations. In the 30 wt% Cl mixture, C 10 Cl 4 congeners represent an "advanced" fraction, having undergone the chlorination reaction to a greater extent than the majority of the other molecules in the mixture. In contrast, C 10 Cl 4 congeners in 60 wt% Cl mixture are "le behind" in the reaction and represent a relatively low chlorinated fraction in the mixture. There could be difference in the structure proles and thus in the property distributions. The property distributions predicted by the COSMO-trained-FCMs, however, show that such a difference is small if any (Fig. S10 †). The medians of the property values vary only within 0.2 log units for any congener group from C 10 to C 20 , suggesting that the partition properties of each congener group are virtually independent of the source CP mixtures.
Representative property values for specic congener groups were calculated by summing the predictions for 30, 40, 50, 60, and 70 wt% Cl mixtures. Table 1 lists the medians of the property values at 25 C for all congener groups considered. The ESI excel le provides other quantiles as well as values at temperatures other than 25 C. These data represent the largest and most comprehensive set of COSMO-RS based property values for CP congener groups that consider congener proles in the CP technical mixtures. It is notable that the property values have nonlinear relationships with the number of Cl atoms, as reported previously 8 and can also be seen in Fig. 8. Note however that the congener prole could change in environmental processes such as partitioning and transformation and thus property distributions may also vary in the environment.

Comparing experimental and predicted K ow distributions for technical mixtures
The predicted distributions of log K ow for a CP mixture with a given C length and Cl wt% were compared to the experimental data from Hilger et al., 19 who measured the range of log K ow for a CP mixture using an HPLC retention method (Fig. 9). Their Fig. 9 Experimental and predicted ranges of log K ow for CP mixtures. Experimental K ow data as described by Hilger et al. 19 using an HPLC retention method corresponding to the peak start, top and end are compared to the 2.5, 50, and 97.5 percentiles of log K ow predicted by COSMO-RS-trained FCMs. View Article Online data set includes the mixtures that contain congeners of a single chain length (C 10 , C 11 , C 12 , or C 13 ) with varying degrees of chlorination (45-75 wt% Cl). The predicted medians of log K ow agree well with the experimental values that correspond to the HPLC peak top times within 1.1 log unit in the worst case. The agreement is particularly high for relatively short and low chlorinated mixtures. Less accurate predictions for highly chlorinated mixtures could be related to the relatively low accuracy of the Monte Carlo method for highly chlorinated congeners. The authors of the paper 13 admitted that the model parameters for the reactivity of highly chlorinated molecules were not well based on any experimental data. Fig. 9 also shows that the predicted range of log K ow (as 2.5-97.5 percentiles) is generally narrower than the measured range. This result could be taken as an indication that real CP mixtures contain more diverse congeners than predicted by the Monte Carlo model. It should however be noted that both predicted and measured ranges are somewhat arbitrarily dened. The predicted range was set between 2.5 and 97.5 percentiles here, but a wider range (e.g., 1-99 percentiles) could also be considered. In the HPLC measurements, the start and end times had to be assigned to a broad HPLC peak. Moreover, peak broadening can, in general, occur due to diffusion and dispersion in addition to the variation of properties of mixture components.

Conclusions
This study extended the COSMO-RS-trained FCM approach to CP congeners with a carbon chain length up to C 20 . The quantum chemically based predictions for K ow , K aw , K oa , VP, and S w and their temperature dependence for each congener group were offered, in consideration of varying congener compositions in technical mixtures. These predictions may be used for various environmental fate models or correlated with eldmeasured environmental partition coefficients such as organic carbon/water partition coefficients and gas/particle partition coefficients to understand the environmental behavior of CPs. Further validation of the model predictions with congener (group)-specic experimental data would be desirable, particularly for data-poor MCCPs and LCCPs.

Conflicts of interest
There are no conicts to declare.