Prediction of nanoparticle transport behavior from physicochemical properties: machine learning provides insights to guide the next generation of transport models

In the last 15 years, the development of advection–dispersion particle transport models (PTMs) for the transport of nanoparticles in porous media has focused on improving the fit of model results to experimental data by inclusion of empirical parameters. However, the use of these PTMs has done little to elucidate the complex behavior of nanoparticles in porous media and has failed to provide the mechanistic insights necessary to predictively model nanoparticle transport. The most prominent weakness of current PTMs stems from their inability to consider the influence of the physicochemical conditions of the experiments on the transport of nanoparticles in porous media. Qualitative physicochemical influences on particle transport have been well studied and, in some cases, provide plausible explanations for some aspects of nanoparticle transport behavior. However, quantitative models that consider these influences have not yet been developed. With the current work, we intend to support the development of future mechanistic models by relating the physicochemical conditions of the experiments to the experimental outcome using ensemble machine learning (random forest) regression and classification. Regression results demonstrate that the fraction of nanoparticle mass retained over the column length (retained fraction, RF; a measure of nanoparticle transport) can be predicted with an expected mean squared error between 0.025 and 0.033. Additionally, we find that RF prediction was insensitive to nanomaterial type and that features such as the concentration of natural organic matter, the ζ-potential of nanoparticles and collectors, and the ionic strength and pH of the dispersion are strongly associated with the prediction of RF and should be targets for incorporation into mechanistic models. Classification results demonstrate that the shape of the retention profile (RP), such as hyperexponential or linearly decreasing, can be predicted with an expected F1-score between 60% and 70%.
This relatively low performance in the prediction of the RP shape is most likely caused by the limited data on retention profile shapes that are currently available.


Introduction
Predicting the transport behavior of nanomaterials (NMs) in the environment is important for managing both the risks and benefits associated with NM use. Progress towards this goal, however, has been slow. In the last 15 years, the development of advection-dispersion particle transport models (PTMs) for the transport of nanoparticles in porous media has focused on improving the fit of model results to experimental data by inclusion of empirical parameters. [1][2][3] However, inclusion of empirical parameters has done little to unravel the complex transport behavior of NMs and has failed to provide the mechanistic insights necessary for predicting NM transport in porous media.
There are many reasons for the poor state of predictive modeling of NM transport in porous media. One major problem stems from the extraordinary variability in retention behavior observed in column transport experiments. A prominent example of this can be seen with nano titanium dioxide (nTiO₂), one of the most widely studied NMs because of its numerous commercial and industrial applications (cosmetics and personal care products, food, hybrid organic-inorganic light-emitting diodes, solar cells). The transport behavior of nTiO₂, even in simple, synthetic soil systems (i.e., monodisperse, fully saturated, single-media soils), varies widely. For instance, Chowdhury et al. 4 observed strictly hyperexponential (HE) retention profiles (RPs) for the transport of nTiO₂ under various solution conditions and influent concentrations. Cai et al. 5 observed HE and exponential (EXP) RPs under various pH and ionic strength conditions. Choy et al. 6 observed exclusively suppressed inlet retention (SIR) under various flow velocities, while Chen et al. 7 observed both SIR and increasing retention with depth (IRwD) RPs under various solution conditions and in the presence of a purified humic acid.
Nonexponential retention profiles pose problems for PTMs because these models cannot predict, or even qualitatively describe, transport when nonexponential profile shapes are observed. 8 The source of this problem stems from model construction. Because most PTMs employ first-order kinetics, they cannot describe nonexponential retention profiles by virtue of their mathematical construction (n.b., a model to describe nonmonotonic retention is available, but has received little support 9 ). Further, there is no methodology in place to constrain the parameters of PTMs under these circumstances, and notable examples exist within the peer-reviewed literature where these models were used inappropriately. 8 The most fundamental problem facing predictive modeling, however, is that the mathematical construction of the majority of routinely applied PTMs considers neither the physicochemical properties of NMs, nor those of the system as a whole, explicitly. 8 Qualitative influences of physicochemical conditions on particle transport have been well studied and, under some circumstances, provide reasonable explanations for NM retention and transport behavior (e.g., agreement with Derjaguin and Landau, Verwey and Overbeek [DLVO] theory or extended DLVO theory). [10][11][12][13] However, quantitative PTMs that link physicochemical conditions to mechanistic behavior have not been developed, and understanding this link is an important goal for ongoing research. 14 Consequently, none of the parameters in the mass-balance equations for routinely applied PTMs truly reflect properties such as the type, size, shape, surface properties, or aggregation behavior of NMs. 8 Without consideration of the physical and chemical properties of both the nanoparticles and the surrounding environment, PTMs are merely descriptive tools that offer no predictive capability.
Machine learning, in its simplest form, enables computers to develop models that would be too complex, or otherwise untenable, to construct by hand. Recently, Bayesian neural networks were applied to predict the biological effects of NMs and to determine nanomaterial toxicity for risk assessment. 15,16 However, machine learning has not yet been applied in the context of nanomaterial transport modeling in the subsurface, largely because the data required to facilitate such an approach are difficult to obtain and validate and, until recently, were insufficient in number to support a data-driven approach. In this work, physicochemical conditions from transport experiments with NMs in saturated porous media are used to develop empirical models for the prediction of the resulting experimental outcome. Specifically, we examine the performance of random forest regression and classification machine learning models to predict (i) the retained fraction (RF) of NMs captured over the column length (i.e., a regression problem), and (ii) the NM retention profile (RP) shape resulting from a fully saturated transport column experiment (i.e., a classification problem). Further, this work quantitatively identifies and ranks the importance, and influence, of various physicochemical features on the transport of NMs in saturated porous media. A key point is that quantitative conclusions are drawn from statistical evaluation of the available NM transport experiments as a whole, not on the basis of a separate investigation of individual factors, as has frequently been done for solution conditions, 5,17-21 NOM, [22][23][24][25] and physical factors, 20,24,26-28 such as grain and particle size, flow velocity, influent concentration, and coating. The database developed for this work includes more than 200 transport experiments extracted from 20 peer-reviewed column transport publications.
To provide guidance for the construction of future mechanistic models, and to improve model performance, recursive feature elimination with 5-fold cross validation (RFECV) is employed to identify key features. In the short term, this work demonstrates that empirical prediction of NM transport is possible. However, the applicability of the developed method outside of the database is limited, as the generalizability has not yet been sufficiently demonstrated to claim otherwise. In the long term, this work provides insights from which new mechanistic transport models can be developed.

Nanomaterial transport experiment database
The database developed for this work includes 204 separate experiments extracted from 20 peer-reviewed NM column transport studies in saturated, homogeneous porous media (references provided in the ESI †). From each experiment, 19 physicochemical features were recorded (17 training features [physicochemical conditions] and 2 target features [experimental results]). The investigated applicability domain and range of training and target features employed for this study are presented in Table 1.
Only transport experiments with a retention profile, excluding those where retention was reported but not visually discernible, were included in the database. Data limitations prevented the dispersivity and Hamaker constant from being employed in the assessment, because values for these parameters were available for only 17% and 31% of the transport experiments considered, respectively. Target features, i.e., the experimental results, were not employed for training (i.e., RF was not employed to predict RP shape and vice versa). Of the 204 experiments, only 183 contained RP shape data; 175 contained RF data. The database structure is generically illustrated in Fig. 1.
Where experimental values for the collector ζ-potential were not available (19 of 204 experiments), gaps were filled, where possible, with referenced literature values obtained under similar experimental conditions. No ratio features, such as the ratio of column length to width or the ratio of particle to collector size, and no differential features, such as the pH-to-IEP distance or the particle-collector ζ-potential difference, were included.
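The gap-filling step for missing collector ζ-potentials can be sketched as follows; the column names, values, and the default ζ-potential are purely illustrative and are not values from the database:

```python
import pandas as pd

# Illustrative sketch of the gap-filling step: where a collector ζ-potential
# was not reported, a referenced literature value measured under similar
# conditions is substituted. All names and values here are hypothetical.
db = pd.DataFrame({
    "experiment_id": [1, 2, 3],
    "collector_zeta_mV": [-45.0, None, -38.0],  # None = not reported
    "pH": [7.0, 7.1, 5.5],
})

LITERATURE_DEFAULT_ZETA = -40.0  # assumed literature default (mV)

db["collector_zeta_mV"] = db["collector_zeta_mV"].fillna(LITERATURE_DEFAULT_ZETA)
```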

Random forest regression and classification
Random forest regression and classification methods were employed for this work because they are relatively insensitive to outliers and noise, and provide internal estimates of generalization error and feature importance. 29 Python and the sklearn package were employed to generate the machine learning models; 30 programming details are included in the ESI. † The random forest method takes its name from employing an ensemble of unpruned, random decision trees (i.e., a forest) as simple learners. In this work, each random forest consists of an ensemble of 1000 decision trees. Within a given forest, each decision tree represents a model that predicts the target feature value (regression) or class (classification) by splitting the training feature data set using simple decision rules learned by statistical analysis. 29 In a random forest, each decision tree is constructed from a random subset of the training feature data set drawn with replacement (a bootstrap sample); the samples excluded from a given tree's bootstrap sample (the out-of-bag samples) provide the internal estimates of generalization error. In contrast to standard decision tree logic, where each node is split using the best split among all variables, node splits during tree construction for a random forest are determined by the best split among a random subset of the features chosen for that node. 29 It is intuitive to think that the best performing individual tree within the forest is selected for prediction. However, prediction is not based on single trees, but determined by aggregating the predictions of all trees within the forest. This is done in different ways for classification and regression prediction: 29 regression prediction is determined by averaging the prediction results of each tree in the forest; class prediction is determined by voting, i.e., each tree is 'asked' to which class the new data belong, and the mode of the results is the class prediction. In this work, regression prediction performance was reported and assessed in terms of the mean squared error (MSE).
The MSE values are on a scale from 0 to 1 because the RF takes values between 0 and 1, where 1 corresponds to complete retention of the NM by the column. Classification prediction performance was reported and assessed in terms of the F1-score. 31 Further, it is important to note that random forests do not generate retention or concentration profiles. The outputs of the random forest method are estimates of the RF (regression) and the RP shape class (classification), along with the MSE (regression) and the F1-score (classification) as performance metrics.
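The aggregation behavior described above can be illustrated with a minimal scikit-learn sketch; the data are synthetic stand-ins and the feature/target construction is purely illustrative, not taken from the transport database:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.metrics import f1_score, mean_squared_error

rng = np.random.default_rng(0)
X = rng.random((100, 5))             # synthetic stand-ins for physicochemical features
y_rf = 0.8 * X[:, 0] + 0.1          # synthetic retained fraction in [0, 1]
y_rp = (X[:, 1] > 0.5).astype(int)  # synthetic two-class stand-in for RP shape

# Regression: the forest averages the predictions of its 1000 trees.
reg = RandomForestRegressor(n_estimators=1000, random_state=0).fit(X, y_rf)
mse = mean_squared_error(y_rf, reg.predict(X))

# Classification: each tree votes; the majority class is returned.
clf = RandomForestClassifier(n_estimators=1000, random_state=0).fit(X, y_rp)
f1 = f1_score(y_rp, clf.predict(X))
```

Here the forests are scored on their own training data purely to exercise the two metrics; the cross-validation and holdout procedure actually used is described in the following sections.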

Feature selection with recursive feature elimination with 5-fold cross validation (RFECV)
RFECV was employed to identify which physicochemical features are important to predicting RF and RP shape, and to avoid overfitting by reducing the number of features employed for training. Feature selection is a central problem in machine learning. At its core, the database from which the model is trained must include a representative set of features. However, it is not clear which features are important for prediction at the problem outset and steps must be taken to reduce over-fitting. Employing RFECV to identify and remove features of low importance to prediction enables the training database to be 'trimmed' to only the features responsible for prediction (n.b., due to randomness in the data selection process, the trimmed database may vary in size and composition between model runs). This effort is computationally expensive, but critical to gauge the performance of a model where the number of experiments is relatively small in comparison to the number of features (experiments ≃ 10 × features), as is the case here. For each model run, the training set is divided into 5 randomly partitioned subsets, or cross-validation folds. Each cross-validation fold consists of 20% of the training dataset, or 18% of the total database. Four folds are aggregated to form a temporary cross-validation database from which a random forest model is trained. The performance of the model is then recursively evaluated by systematically removing features of low importance from the training set, re-training the model, and then examining the predictive performance of the model using the remaining 5th fold (i.e., the RF and RP shape class from the remaining fold are predicted by using the physicochemical data from the experiments contained within the remaining fold). For each recursion step within an iteration, the feature with the lowest importance is eliminated and the random forest is re-trained, and re-tested, until a single feature remains.
Feature importance was determined in relation to the mean decrease in regression or classification accuracy, as determined by the random forest method, i.e., the importance of a feature is gauged by the increase in prediction error when that feature's data are randomly permuted. 29 The definition, calculation, and limitations of the feature importance method are investigated in greater detail in Breiman, 29 Nicodemus et al., 32 Louppe et al., 33 and Strobl et al. 34 Pedregosa et al. 30 provide a more detailed discussion of the RFECV method. Modifications made here to the RFECV method are described in the ESI. † In conclusion, identification of important features through RFECV is accomplished by repeatedly applying a feature selection algorithm on the cross-validation folds in a manner that assesses the model's predictive performance with different subsets of features (i.e., recursively removing the least important features to examine the model performance with a varying number and composition of features).
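A minimal sketch of the RFECV step with scikit-learn follows; the synthetic data, estimator settings, and scoring choice are assumptions for illustration, and the modifications to RFECV described in the ESI are not reproduced:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(1)
X = rng.random((90, 17))                   # 17 training features, as in the database
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)  # synthetic stand-in for the RP shape class

# Recursively eliminate the least important feature (step=1) and score each
# feature-set size on the held-out fold of a stratified 5-fold split.
selector = RFECV(
    estimator=RandomForestClassifier(n_estimators=100, random_state=1),
    step=1,
    cv=StratifiedKFold(n_splits=5),
    scoring="f1_weighted",
)
selector.fit(X, y)

optimum_n_features = selector.n_features_  # feature-set size at the best CV score
selected_mask = selector.support_          # boolean mask of the retained features
```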

Database partitioning and model structure
Partitioning the database into training, cross validation, and testing data sets is critical to evaluate the performance of the machine learning effort (n.b., the testing partition is called the 'holdout' partition to prevent confusion with cross validation testing). A schematic of the partitioning and model structure is shown in Fig. 2 and supplemented with a textual description.
Overall, 500 model runs were performed. For each model run the following steps were carried out:
1. All usable experiments within the database were randomly assigned to the holdout (10%) or training (90%) data sets. No training was performed on the holdout set.
• The training-holdout split was random for the regression problem.
• The training-holdout split was random for the classification problem, but the ratio of RP shape classes was preserved between datasets. The purpose of this stratified split is to mitigate the influence of class imbalance on the random forest classification, 35 which favors hyperexponential RPs (Table 2).
2. Recursive feature elimination with 5-fold cross validation (RFECV) was performed using random forests generated from the training set (specifically, from the temporary database formed by four folds in combination). Because there are five cross-validation folds, five RFECV iterations were performed in each model run.
• The cross validation training-testing split was random for the regression problem.
• The cross validation training-testing split was random for the classification problem, but the ratio of RP shape classes was preserved in each cross-validation fold.
• Regression performance was assessed using the MSE; classification performance was assessed using the F1-score; 31 these metrics are called "cross-validation score" below.
• For each of the five RFECV iterations, the RFECV routine generates 17 random forests (with the feature set size decreasing from 17 to 1) and reports the feature set size corresponding to the highest cross-validation score (i.e., the 'optimum number of features'). 30 If two or more feature set sizes have the same score, the smallest set size is recorded. The optimum number of features indicates the minimum number of features required to maximize the cross-validation score for a particular cross-validation fold.
• Aggregated cross-validation results show model performance as a function of feature set size for all 500 model runs, as shown in Fig. 3 and 6.
3. The training set was 'trimmed' to the features identified by RFECV (i.e., reduced to the features corresponding to the optimum number of features identified in step 2). A new random forest model was then trained on the trimmed training set and evaluated against the holdout set.
• For each model run, the values of the holdout set training features are used as model inputs and the accuracy of the prediction is evaluated against the holdout set target features.
• Aggregated holdout results, as displayed in Fig. 4 and 7, show the predictive performance as a function of feature set size for all 500 model runs (500 data points in total). From the aggregated results for all model runs, the optimal feature set size (optimum result) was derived by weighting the number of features, the median prediction error, and the variance, i.e., lower model performance with fewer features and lower variance is preferred over a larger feature set with better performance, but higher variance. The optimum feature set size was determined by minimizing the following performance-variance-feature set trade-off score:

S(n_features) = n_features · E · r_IQ (1)

where n_features is the number of features employed in a given model run, E is the median prediction error for the model runs with feature set size n_features, and r_IQ is the interquartile range of the holdout performance for the set of model runs with feature set size n_features. 36 E is the median MSE for the regression problem and 1 minus the median F1-score for the classification problem.
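The trade-off described above can be sketched as follows. The exact functional form of eqn (1) is not reproduced here; a simple multiplicative score (penalizing feature count, median error, and interquartile range simultaneously, and minimized over the candidate feature-set sizes) is assumed, and the result values are illustrative:

```python
# Hypothetical trade-off score: smaller is better, so larger feature sets,
# higher median errors, and higher variability are all penalized.
def tradeoff_score(n_features, median_error, iqr):
    return n_features * median_error * iqr

# Illustrative aggregated holdout results: {feature-set size: (median MSE, IQR)}
results = {5: (0.030, 0.010), 8: (0.027, 0.012), 11: (0.026, 0.020)}

scores = {n: tradeoff_score(n, e, r) for n, (e, r) in results.items()}
optimum_n = min(scores, key=scores.get)  # here the 5-feature set wins
```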
It is important to note that all 500 model runs were considered equally important, and the random forests trained on the trimmed feature sets were given the same weight. Because the training sets from which the random forest models are generated are drawn randomly, the optimum feature set identified through RFECV varies between model runs, and thus no "final" model exists.
To identify the most important features, we present a feature ranking in Fig. 5 and 8 that indicates how often a feature occurs in the 500 feature subsets that obtained the optimum score in the RFECV. This allows quantitative assessment of each feature in terms of its influence on the prediction. For example, important features will have an RFECV selection frequency near 100%, as these features are consistently selected. Weaker, but still relevant features will have lower, but nonzero selection frequencies, as these features are selected when stronger features are not present in the currently selected subset. Weak or relatively irrelevant features will have scores close to zero, as they are infrequent among the selected features. 30
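The selection-frequency ranking can be sketched as follows; the feature names and subsets are illustrative, not the 500 subsets from the actual model runs:

```python
from collections import Counter

# Count how often each feature appears among the optimum feature subsets
# selected by RFECV over the model runs (subsets below are hypothetical).
selected_subsets = [
    {"pH", "ionic_strength", "zeta_particle"},
    {"pH", "ionic_strength", "NOM_concentration"},
    {"pH", "zeta_collector"},
]

counts = Counter(f for subset in selected_subsets for f in subset)
n_runs = len(selected_subsets)
selection_frequency = {f: counts[f] / n_runs for f in counts}  # e.g. pH -> 1.0
```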

Prediction of the retained fraction (RF)
3.1.1 RF prediction during cross validation. In Fig. 3, the aggregate performance of the model in predicting RF is shown as a function of the number of features employed to train the model during cross validation. Visual inspection indicates that the increase in median cross-validation performance is relatively small (≪0.01 MSE) for more than 5-6 features. In general, RF prediction for the holdout set was better than during cross validation. This is expected, as the cross-validation data sets are smaller than the total training set (72% of the total data set is employed for each cross-validation iteration [4 of 5 folds of the training set]; 90% is employed for each holdout iteration [i.e., the total training set]). Note that there are no holdout scores for 1-4 features: during cross validation, model performance when the model was trained with fewer than five features was always worse than when it was trained with five or more features. Therefore, the 'optimum feature set' determined by RFECV (Section 2.4, Step 2), which is employed as a guide to trim the database to the most suitable set of features, never consisted of fewer than five features.

RF prediction for the holdout set
Several interesting findings are noted. First, our results suggest that the salt type has a relatively low importance in predicting the RF. This seems surprising, particularly as many studies report strong influences of bivalent cations (specifically Ca²⁺) that cannot be accounted for by ionic strength alone (e.g., bridging effects). 4,18,37,38 However, authors often lower the concentration of multivalent cations to be within the same range of influence as monovalent cations (e.g., Chen et al. 18 tested 0.56 mM NaCl against 0.02252 mM CaCl₂). This reduces the ability of the models to ascertain the true influence of salt type on NM transport.
Second, our results suggest that the pH is of greater importance in predicting the RF than the IEP, although both are generally considered relevant for assessing NM stability in suspension. 38,39 This is in line with previous work that qualitatively establishes the importance of pH. 5,17-21 Also, we find that the number of influent pore volumes and the influent concentration are critical to predicting the RF. This is well understood mechanistically, and provides further validation that relevant features are identified.
Third, and most intriguingly, we find that the NM type is relatively unimportant for predicting the RF. Several publications indicate that NP fate and behavior cannot be generalized and that each NP needs to be tested individually. 40,41 However, the machine learning results presented here provide evidence that this may not be the case. The interpretation of this finding is that, within the set of physicochemical features employed here, the behavior of the NMs is nearly entirely captured without needing to consider any NM-specific (i.e., associated with the NM type) interaction.

Prediction of the retention profile (RP) shape
3.2.1 RP shape prediction during cross validation. The aggregate performance of the model in predicting RP shape as a function of the number of features employed to train the model during cross validation is shown in Fig. 6. Visual inspection indicates that model performance does not strongly improve with increasing number of features.
3.2.2 RP shape prediction for the holdout set. The aggregate performance of the model to predict the RP shape of the experiments in the holdout set as a function of the number of RFECV-selected features employed to train the model is shown in Fig. 7. In general, the prediction of the RP shape class is poor. The highest expected F1-score was obtained when 11 features were employed for training (median F1-score = 68%; r_IQ = 0.14%). The optimum result, as determined by eqn (1), was obtained for feature sets consisting of only two features (median F1-score = 56.3%; r_IQ = 0.05%). The highest expected performance of the classifier (F1-score of 68%) lies only about 12 percentage points above guessing that all RPs are HE (56% of the profiles are HE), and the optimum result (classifiers using two features) provides virtually no predictive improvement. This result was not entirely unexpected, as significant class imbalance is present in the data; the methodological measures tailored to reduce these biases (the stratified training-holdout split procedure and RFECV [Section 2.4]) could not fully compensate for the limitations of the data set. However, it is also possible that relevant physicochemical features were not included, although we have exhausted the majority of measurable parameters. 42 The stability of the feature set selected by RFECV was evaluated by examining feature inclusion frequencies over all 500 model runs (Fig. 8). The poor class prediction, and the small improvement in prediction with increasing number of features, are consistent with these data limitations. Finally, although the optimum result consists of only two features, 11 features were included in more than 90% of model runs (Fig. 8).

Conclusions
In contrast to mechanistic models, which are unable to describe transport (let alone predict it) when nonexponential retention profiles are observed, the applied machine learning regression approach enables prediction of the RF with an MSE < 2.6 × 10⁻². This approach enables quantitative prediction of nanomaterial transport distance independently of a mechanistic understanding of NM behavior. We anticipate that this method could be used to support high-throughput risk screening to identify conditions with high or low vertical mobility of NMs in porous media without costly and time-intensive experimental work. Further, we foresee this approach facilitating the development of materials specifically designed to accumulate, or be transported, in response to varying physicochemical conditions. Mechanistic PTMs are important, but currently not suitable for quantitatively describing and predicting NM transport in porous media. As such, a hybrid mechanistic and machine-learning approach may offer a way to reconcile problems with mechanistic models. As a first step, ranges of physicochemical conditions may be defined where the current mechanistic models are sufficient. This might be accomplished by a reformulation of the presented multi-class classification problem as a binary problem (exponential vs. nonexponential RPs). For conditions that result in nonexponential RPs, transport equations need to be modified, or alternatively parameterized, using the established method to predict the RF.
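The binary reformulation suggested above can be sketched as a simple relabeling of the RP shape classes; grouping HE, SIR, and IRwD as nonexponential follows the profile shapes named in the text, while the label strings themselves are illustrative:

```python
# Map the multi-class RP shape labels onto the binary
# exponential vs. nonexponential problem.
NONEXPONENTIAL = {"HE", "SIR", "IRwD"}  # hyperexponential, suppressed inlet retention,
                                        # increasing retention with depth

def to_binary(rp_shape):
    return "nonexponential" if rp_shape in NONEXPONENTIAL else "exponential"

labels = ["HE", "EXP", "SIR", "HE", "IRwD"]
binary_labels = [to_binary(s) for s in labels]
```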
Empirical approaches such as the one employed here have limitations, too. For instance, if the data available for the assessment are not representative, the utility of the approach will be low. In particular, we caution against extending the applicability of this work to real soils, as no column transport studies in real soils were employed for this work. Furthermore, for the classification problem, serious data limitations currently prevent adequate understanding of how physicochemical conditions influence the shape of the retention profile. The poor performance of the classification predictor is most likely a result of the limited and highly biased data that were available for this work.
In this work we demonstrate that machine learning methods not only add value to the development of mechanistic models through identification of the important features affecting the fate of materials in the environment, but also have the potential to create a new, flexible, and prediction-oriented class of NM transport models. However, for an improved understanding of nanoparticle transport in porous media, several changes in experimental design and data presentation are required. First, and in relation to the physicochemical features identified as important in this work, the existing body of literature must be reviewed to determine what kind of data are already available, what the scope and meaning of these data are, and where exactly the most important data gaps exist. Second, future experimentation must proceed in a manner that fully exploits the information on the physicochemical conditions of column transport experiments, while minimizing cost and time resources. 43 On this basis, machine-learning methods can generate more transparent relationships between nanoparticle transport and experimental conditions and, thereby, provide a basis for the development of improved mechanistic models of nanoparticle transport in porous media.