In silico environmental chemical science: properties and processes from statistical and computational modelling

Quantitative structure–activity relationships (QSARs) have long been used in the environmental sciences. More recently, molecular modeling and chemoinformatic methods have become widespread. These methods have the potential to expand and accelerate advances in environmental chemistry because they complement observational and experimental data with “in silico” results and analysis. The opportunities and challenges that arise at the intersection between statistical and theoretical in silico methods are most apparent in the context of properties that determine the environmental fate and effects of chemical contaminants (degradation rate constants, partition coefficients, toxicities, etc.). The main example of this is the calibration of QSARs using descriptor variable data calculated from molecular modeling, which can make QSARs more useful for predicting property data that are unavailable, but also can make them more powerful tools for diagnosis of fate determining pathways and mechanisms. Emerging opportunities for “in silico environmental chemical science” are to move beyond the calculation of specific chemical properties using statistical models and toward more fully in silico models, prediction of transformation pathways and products, incorporation of environmental factors into model predictions, integration of databases and predictive models into more comprehensive and efficient tools for exposure assessment, and extending the applicability of all the above from chemicals to biologicals and materials.


Introduction
Progress in environmental chemical science is limited by the availability of data even more than in most domains of science. The complexity of environmental conditions, combined with the diversity of substances (chemical, biological, and material) that are of environmental concern, means that direct measurements will never be sufficient to meet the data needs of environmental scientists or regulators. Therefore, predicting chemical properties is a long-standing challenge that has received extensive study for many applications (chemical engineering, green chemistry, environmental chemistry, toxicology, pharmacology, etc.). Fortunately, advances in computer-based methods are making it increasingly feasible to estimate substance properties, evaluate their fate-determining processes, and predict their effects. These methods and their applications comprise the domain we refer to herein as "in silico environmental chemical science". The scope of this domain includes theoretical and statistical methods for calculating substance properties, fate, and effects. The theoretical and statistical methods used to calculate substance properties are rooted in very different disciplines, so the recent trend toward combining these approaches poses some novel challenges for developers and users of these models. One goal of this perspective is to show how these challenges become opportunities when methods are combined in a complementary way. To encourage this, we provide an overview of some core concepts, key developments, and opportunities, with emphasis on the properties that are the most fundamental determinants of chemical fate and effects. Another perspective in this issue 1 takes a similar approach, but focuses on biological effects, especially toxicity, and their regulatory implications.
We framed the introduction to this perspective in terms of prediction of substance properties because that is by far the most familiar rationale for work in this area. For example, comprehensive exposure assessment models for chemical contaminants that are used for regulatory decision making (EXAMS, 2 EUSES, 3 FOCUS, 4 etc.) require dozens of chemical properties, for which measured values often are not available, hence the widely-recognized need for methods of estimating the missing data. 5,6 The demand for methods that estimate environmental substance properties has mostly been met with statistical models, including "quantitative structure-activity relationships" (QSARs) and variations thereof. [7][8][9][10] This field is mature enough to have already engendered several generations of compilations of predictive models. 6 Prominent early examples are the Handbook of Chemical Property Estimation Methods compiled by Lyman et al., 11 and a similarly structured volume edited by Mackay and Boethling. 12 Since then, there has been a growing number of reviews and databases of QSARs, [13][14][15][16][17] comparative analyses of QSAR accuracy, [18][19][20] and efforts to codify methods of calibration and validation. [21][22][23][24][25] Many QSARs have been incorporated into software that facilitates their use for property prediction. 6 Currently, the two main examples of this are the Estimation Program Interface (EPI Suite) by the U.S. Environmental Protection Agency (EPA) 26,27 and the QSAR Toolbox by the Organization for Economic Cooperation and Development (OECD), 28 but others are under development.
However, the approach taken in this perspective is broader in that it recognizes that property prediction models, and the processes and methods of developing these models, have additional benefits. Besides prediction, another major benefit is "diagnostic", as in the diagnosis of mechanisms, categories, or other structures that provide greater understanding of the processes at issue. In chemistry, this process is generically referred to as correlation analysis, 29,30 and it often takes the specific form of linear free energy relationships (LFERs). 31,32 A third benefit, which sometimes is neglected, is for the validation of data (or models). The process of developing QSARs involves analysis of correlations, which should be simple if the variables are closely related, so scatter and outliers may be indicative of errors or bias.

Formulation of statistical models
In the past, and even now, almost all property prediction models have been based on empirical/statistical correlations between data for the response (target, dependent, y) variable and descriptor (independent, x) variable(s), as illustrated in Fig. 1. For training the model, the response variable is usually measured data (e.g., toxicity) and the descriptor variable may be measured or determined in other ways (e.g., various fragment types such as Hammett substituent constants). The statistical model usually is relatively simple (linear, with one or a few descriptor variables) and is derived through calibration: i.e., regression of available property data for a series of related compounds with one, or several, convenient descriptor variables. Usually, a subset of the training data set, or entirely new data, are used to validate the model. The resulting relationship can be used in reverse to predict values of the property for compounds that were not included in the original data set. The general paradigm represented in Fig. 1 still applies for more complex systems and models.
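The calibration, validation, and prediction steps described above can be sketched in a few lines of code. The following is a minimal illustration for a one-descriptor linear model, assuming ordinary least-squares regression; the descriptor and response values are synthetic, and the function names (`fit_line`, `predict`, `rmse`) are hypothetical helpers introduced only for this example.

```python
# Sketch of the calibrate / validate / predict cycle for a one-descriptor QSAR.
# All numbers are synthetic and serve only to illustrate the workflow.

def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(model, x):
    """Apply a fitted (slope, intercept) model to new descriptor values."""
    slope, intercept = model
    return [slope * xi + intercept for xi in x]

def rmse(observed, predicted):
    """Root-mean-square error between observed and predicted responses."""
    n = len(observed)
    return (sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n) ** 0.5

# Calibration (training) set: hypothetical descriptor x and response y (e.g., log k)
train_x = [-0.30, -0.10, 0.00, 0.20, 0.40, 0.70]
train_y = [1.62, 1.18, 1.05, 0.55, 0.22, -0.41]

# Validation set: compounds held out of the calibration
valid_x = [0.10, 0.50]
valid_y = [0.81, -0.02]

model = fit_line(train_x, train_y)                        # calibration
validation_rmse = rmse(valid_y, predict(model, valid_x))  # validation
new_prediction = predict(model, [0.30])[0]                # prediction for an untested compound

print("slope, intercept:", model)
print("validation RMSE:", validation_rmse)
print("predicted response at x = 0.30:", new_prediction)
```

Multivariable models follow the same pattern, with the single slope replaced by one coefficient per descriptor.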
In practice, the response variable is defined by external considerations (e.g., constants required for modelling partitioning or degradation of contaminants) and the development of the predictive model involves mostly the identification of descriptor variables and calibration of the relationship between them. With respect to the selection of descriptor variables for chemical processes, there are three general types: (i) substituent constants such as the σ constants that are defined and used with correlations in the form of the Hammett equation, (ii) molecular descriptors such as pKa, as used in the Brønsted equation, and (iii) reaction descriptors such as in the correlation of rate or equilibrium constants in one medium with those in another (i.e., "cross-correlations", see below). These three categories of descriptors are illustrated in Fig. 2, using as an example rate constants (k_i) for a reaction of substituted phenols with an environmental agent E. The environmental agent E could be O₃, MnO₂, a (co)metabolizing microorganism, etc. The distinction made in Fig. 2 between substituent, molecular, and reaction descriptors could be generalized for application to other types of environmental processes (e.g., volatilization, sorption, bioavailability, and toxicity).
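For orientation, the three descriptor types can be written in their generic textbook forms. The symbols below are assumptions introduced for illustration, not notation taken from Fig. 2: k_0 is the rate constant of the unsubstituted reference compound, ρ and α are fitted slopes (the sign convention for α varies between acid- and base-type correlations), a, b, and C are fitted constants, and E′ denotes a second agent or medium.

```latex
% (i) Substituent descriptor: Hammett equation
\log\left(\frac{k_i}{k_0}\right) = \rho\,\sigma_i

% (ii) Molecular descriptor: Bronsted-type correlation
\log k_i = \alpha\,\mathrm{p}K_{\mathrm{a},i} + C

% (iii) Reaction descriptor: cross-correlation between two agents or media
\log k_i^{E} = a\,\log k_i^{E'} + b
```

In each case the left-hand side is the response variable and the fitted constants are obtained by regression over a series of related compounds.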
Fig. 1 Conceptual model for the process of calibration, validation, and prediction using statistical models such as quantitative structure-activity relationships (QSARs). Only one response and one descriptor variable are represented by this 2-D scatter plot, but multivariable models work similarly.
The three types of descriptors represented in Fig. 2 have complementary advantages and disadvantages. The main advantage of the substituent approach is that constants for a limited number of substituents can be combined to provide values of the descriptor variable for new, more complex substrate molecules. However, not all substances can be adequately represented as the sum of independent substituents, due to proximity effects, etc. Correlations based on molecular properties are not limited by uncertainties over the additivity of substituent effects because values of their descriptor variables are determined on whole molecules, thereby incorporating the effects of interacting substituents. Correlations based on substituent or molecular properties do not include information about the reaction pathway or products, whereas this information may be incorporated in descriptor variables based on reaction properties. Again, there are advantages and disadvantages to the alternatives: if pathways or products are unresolved in the response variable data (e.g., using overall k's or K's measured in environmental media), there may not be sufficient information to select descriptor variables that correspond to specific reactions, but selection of substituent or molecular property descriptors does not require that information. On the other hand, if there is a need to resolve different pathways and products (e.g., to distinguish environmentally benign and harmful outcomes), then correlations based on descriptors that include information about reactions and products (i.e., the whole reaction) are required.
A variation on the model represented by Fig. 2 is the format sometimes called "cross-correlation analysis", in which two variables that typically would be response variables are related directly (e.g., rate constants for reaction with one oxidant vs. rate constants for reaction with another). 13,14,33 Cross-correlations can be used for prediction, validation, and classification, just like conventional QSARs.

Matching response and descriptor variables
The potential scope of in silico environmental chemical science includes phenomena ranging in scale from angstroms to kilometers. This continuum is illustrated in Fig. 3 with representative categories for both physical-chemical (left column) and biological-chemical (right column) systems. At the molecular end of this continuum, the system characteristics are relatively simple, in that they are fundamental and homogeneous (e.g., rate constants for electron transfer between donor and acceptor molecules in solution). At the environmental end of this continuum, system characteristics are relatively complex and heterogeneous (e.g., toxicity to a diverse community of organisms). All of these systems' characteristics are legitimate targets (as response variables) for predictive and/or diagnostic in silico modelling, depending on the context or purpose, such as whether the application is ranking of contaminants for regulation or tuning a treatment technology to produce less harmful by-products.
Just as different response variables correspond to different positions on the scale continuum in Fig. 3, a similar classification applies to descriptor variables. So, for example, the redox potential of a reactant corresponds to molecular scale processes, its octanol-water partition coefficient corresponds to membrane/grain scale processes, and its toxicity corresponds to the cell or community scales. As with the selection of response variables, valid descriptor variables may come from anywhere on the scaling continuum. However, the most easily justified and interpreted models are formulated with response and descriptor variables from similar scales. Thus, redox potential is a well-matched descriptor for response variables involving redox reaction rates, and partition coefficients are well matched for modelling bioavailability.
This principle of matching the physical scale of response and descriptor variables also applies to scale in the more abstract sense, as in the distinction made in Fig. 2 between substituent, molecular, and reaction level descriptors. As noted in the discussion of that figure, the three types of descriptors can be more or less effective depending on the type (scale) of response variable that they are matched with. A specific and even more fundamental example of matching as a criterion for selecting response and descriptor variables can be found in the discussion of descriptors for oxidation of phenols and anilines by Pavitt et al. 36 There the distinction was between descriptors that are properties measured in solution vs. descriptors that are calculated from theory assuming an elementary reaction step. The latter are more precisely defined, but may not fully match the solution chemistry that determines the response variable. In contrast, measured descriptor variables usually are less precisely defined, but this imprecision can make them more effective in QSARs, if the source of the imprecision is in some way shared by (covariant between) the descriptor and response variables.
Toward the larger-scale end of the continuum represented in Fig. 3, response variables for more complex or heterogeneous processes often are best described with multivariate "polyparameter" models composed of combinations of descriptors for the smaller scale steps that make up the overall process. In these cases, the key consideration for descriptor selection is not so much matching but rather balancing the smaller scale processes represented by each descriptor. The classic example of this is the Hansch-Fujita model, which represents biological effects with a linear combination of descriptors for partition and reaction processes. 37-39 A more recent example is the Abraham model, which represents partitioning effects in terms of descriptors for all of the factors that influence the partitioning process. [40][41][42][43] For balance, these descriptors should represent distinct, largely-independent (i.e., not overlapping or covariant) factors. In a case like the Abraham equation, the descriptors are also balanced by representing similar scale effects; for models representing more complex effects, like the Hansch-Fujita equation, a balanced set of descriptors may represent effects over a range of scales.
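As a reminder of what such polyparameter models look like, the two examples can be written in commonly quoted generic forms. These are given as a sketch: the exact terms used in any particular study may differ (e.g., Hansch-type models often add a quadratic log P term), and the lower-case coefficients are fitted constants specific to each system.

```latex
% Hansch-Fujita type model: biological response (1/C) from a partitioning
% descriptor (log P) and a reactivity descriptor (sigma)
\log\left(\frac{1}{C}\right) = a\,\log P + \rho\,\sigma + c

% Abraham solvation parameter model: partitioning-related property SP from
% solute descriptors E, S, A, B, V with system-specific coefficients
\log SP = c + e\,E + s\,S + a\,A + b\,B + v\,V
```

In the Abraham form, each solute descriptor (excess molar refraction E, dipolarity/polarizability S, hydrogen-bond acidity A, hydrogen-bond basicity B, and McGowan volume V) is intended to capture a distinct interaction, which is the sense of "balance" described above.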
For complex, large-scale processes and effects, statistical predictive/diagnostic models must be based on correlations between empirical data. However, for molecular scale processes and effects, an alternative to empirical data for descriptor variables is calculation from molecular structure theory (i.e., computational chemistry). While this approach has great appeal because of the potential to alleviate the need for field or laboratory measurement, and because of the high "precision" of theoretically calculated descriptors discussed above, the computational chemistry approach comes with other complications that limit its potential as an alternative to statistical correlation analysis.

Modelling from computational chemistry
Computational chemistry involves molecular modelling based on theory. 44 Starting from quantum mechanics, all chemical phenomena can, in principle, be calculated from theory, 45 but solving the exact equations directly is infeasible except for very small systems. To overcome this obstacle, many methods have been developed for approximating the difficult equations of quantum mechanics, so that they can be solved for molecular systems. The most promising of these methods rely on sophisticated combinations of two general strategies: (i) use of compact models that capture the key many-particle effects by construction and (ii) efficient stochastic sampling of many dimensions. There are many variations on these methods, which collectively make up the toolbox of computational chemistry (Fig. 4). Some of these methods are easily performed on modern computers, and therefore are available to most environmental chemists, but other methods require advanced computers and applied mathematical techniques, and therefore remain the domain of computational chemistry specialists.
The simplest computational chemistry models are based on molecular mechanics, in which the forces between the atoms are calculated using empirical interatomic potentials or molecular mechanical force fields. 46,47 The computational efficiency of these models makes it practical for them to simulate the dynamics and coupled interactions of tens of thousands of molecules over time-scales of milliseconds, 48 which makes it possible to study molecular behaviour in complex environmental phases. For example, molecular mechanics models have been used to investigate the structure of natural organic matter [49][50][51] and interactions of contaminants with mineral-water interfaces. 52-54 A major limitation to the use of molecular mechanics modelling, however, is that the required force field parameters are not very accurate for effects that are relevant to environmental conditions, such as the strong polarization and other chemical interactions of surrounding water molecules near highly charged ions and complex mineral surfaces. 55,56 Moreover, current molecular mechanics models typically are not designed to simulate chemical reactions (i.e., the making and breaking of chemical bonds) or phenomena that are kinetically limited over time frames that exceed a few microseconds. 46 Solutions to these limitations are active areas of research in computational chemistry and applied mathematics (e.g. ReaxFF 57 and "accelerated sampling" techniques 46,47,58 ) and recent advances in computational algorithms allow the integration over time to be parallelized, thereby allowing for increased simulation time-scales. 59,60
Fig. 4 Summary of computational chemistry methods, with respect to their theoretical rigor, and therefore potential accuracy, versus the complexity of systems they can address, and therefore relevance to environmental chemistry issues at different scales.
Semi-empirical methods are more complex than molecular mechanics models, and include simplified approximations of quantum mechanics that are sufficient to allow simulation of the making and breaking of bonds during chemical reactions. These methods often use a simplified Hamiltonian to model organic systems, 61,62 although more general Hamiltonians have been developed to model other parts of the periodic table. 62,63 Unlike the more rigorous computational models discussed below, semi-empirical methods are heavily parameterized with experimental data (or data from higher level models). This allows semi-empirical models to efficiently achieve useful accuracy for large molecules (>10 000 atoms with O(N) algorithms 64 ), although for small molecules or reaction energies the more rigorous models usually are more accurate. There are many examples of using semi-empirical methods in the early applications of computational chemistry to environmental systems, mostly for descriptor variables in calibration of QSARs. [65][66][67][68][69][70][71]
Currently, the most popular approximation to quantum mechanics for chemistry is Density Functional Theory (DFT), [72][73][74] which is based on approximations to the exact exchange-correlation functional 73 (e.g. LDA, GGA, hybrid GGA, meta-GGA) that are relatively computationally efficient. DFT's success and popularity can be attributed to several advantages it has over other contemporary computational chemistry approaches: the Hohenberg-Kohn theorem 75 and Kohn-Sham formulations 74 give it a well-established theoretical basis, many of the most popular exchange-correlation functionals satisfy formal theoretical constraints, it is competitive in accuracy for many interesting chemical phenomena, and it is computationally much less expensive than higher-level alternatives such as quantum Monte-Carlo methods and traditional many-body theory (discussed below). DFT has been used extensively in many research domains, including environmental chemistry.
While the DFT level of approximation is suitable for many applications, it is also becoming clear after many years of active development that there are limits to its accuracy. For example, the DFT calculated free energies of reaction for reactions involving bond breaking can have uncertainties of several kcal mol⁻¹ or more. 76,77 If the reaction occurs in aqueous solution, popular models of solvent effects (i.e. implicit solvent models) will contribute at least a few more kcal mol⁻¹ of uncertainty, 44 so overall errors of 5 kcal mol⁻¹ (~20 kJ mol⁻¹) or more are to be expected. This level of accuracy is not satisfactory for some purposes (e.g., direct calculation of absolute values of specific rate constants for contaminant degradation 78 ), but it may be satisfactory for triaging among possible chemical reaction pathways or for descriptor data in QSAR development. 79,80 In addition, the overall accuracy of DFT calculations can be improved by using methods that make use of empirical additivity rules for molecular properties, where various properties of larger molecules can be thought of as being made up of additive contributions of atoms, bonds, or collections of atoms and bonds (i.e., functional groups) of the molecule. 81,82 These approaches have proven to be effective for small organic molecules, 83-90 and recently they have been used in advanced computational algorithms that can be used to simulate extremely large molecules, even including complex proteins and DNA chains. 91,92
Compared with DFT, the higher level theory used in wave function and quantum Monte-Carlo methods 93 can give significantly more accurate results, if the underlying electronic structure is well understood. For small molecules, higher level wave function methods, such as coupled cluster theory and its variants, 94,95 are currently considered the most accurate many-body methods in use today. However, the computational cost of these methods increases very steeply with molecular size, such that only molecules containing a few atoms can be handled currently. Despite their high computational cost, many-body methods have the potential to considerably increase the accuracy of the study of many molecular phenomena (with a few well-known exceptions 96,97 ) and there has recently been significant progress made at accelerating and parallelizing these methods. 98 For systems composed of numerous small molecules that are difficult to study by experiment, as is often the case in environmental chemistry, modelling with many-body methods is feasible and attractive (because a large number of benchmark studies have shown that the errors of many-body methods are considerably smaller than for DFT, ranging from <1 kJ mol⁻¹ to up to 3 kJ mol⁻¹, depending on the species 86,99-101 ).
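To see why an energy error of about 5 kcal mol⁻¹ rules out absolute rate constants while errors of 1 to 3 kJ mol⁻¹ do not, the following sketch converts an energy error into a multiplicative error in a rate constant. It assumes, for illustration only, that the full error appears in the activation free energy and that the rate constant scales as exp(−ΔG‡/RT) (a transition-state-theory-like dependence); the function name is hypothetical.

```python
import math

R_KJ_PER_MOL_K = 8.314462618e-3  # gas constant in kJ mol^-1 K^-1
KCAL_TO_KJ = 4.184               # thermochemical calorie conversion

def rate_error_factor(energy_error_kj, temperature_k=298.15):
    """Multiplicative error in a rate constant implied by an error in the
    activation free energy, assuming k is proportional to exp(-dG/(R*T))."""
    return math.exp(energy_error_kj / (R_KJ_PER_MOL_K * temperature_k))

dft_error_kj = 5.0 * KCAL_TO_KJ  # the ~5 kcal/mol figure quoted in the text

cases = [
    ("DFT plus implicit solvent (5 kcal/mol)", dft_error_kj),
    ("many-body, lower bound (1 kJ/mol)", 1.0),
    ("many-body, upper bound (3 kJ/mol)", 3.0),
]
for label, error_kj in cases:
    factor = rate_error_factor(error_kj)
    print(f"{label}: {error_kj:.2f} kJ/mol -> k uncertain by a factor of about {factor:.3g}")
```

Under these assumptions the 5 kcal mol⁻¹ case corresponds to an uncertainty of more than three orders of magnitude in the rate constant, whereas the many-body range corresponds to less than one order of magnitude. Relative rankings across a series of related compounds can still be useful in the first case if the errors are largely systematic, which is consistent with the use of such data as QSAR descriptors.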
An approach that combines some of the advantages of the methods summarized above, and is especially useful for describing chemical reactivity in large-scale, complex environments, is the quantum mechanical/molecular mechanical (QM/MM) methodology. In this approach, the system is divided into two parts: a localized QM region surrounded by an MM region. In many applications, this allows for a small chemically active region to be modelled quantum mechanically, while the long-range effects (such as solvent or a protein backbone) can be represented by classical MM interactions. This is a computationally efficient and theoretically powerful method, but uncertainties in how best to divide the QM and MM regions of the model make it the domain of "expert" users, for now. Applications of QM/MM methods to environmental chemistry are still relatively few. [102][103][104][105]

Empirical calibration of computational chemistry data with experimental data
While few properties that directly impact the environmental fate and effects of substances can be calculated directly from molecular structure theory, properties that can be calculated from theory can be useful in the development of statistical models. Usually, these calculated properties are used as descriptor variable data in correlations with measured response variable data, so the resulting relationship has many of the same characteristics as traditional LFERs and QSARs (Fig. 1). These calculated descriptor variables can be substituent, molecular, or reaction properties (as in Fig. 2), they generally are computationally feasible only for molecular size-scale properties, and their selection is subject to the same considerations of matching and balance discussed above for traditional QSARs (Fig. 3). The major advantages of computationally derived descriptor variables are that they can be programmed to calculate in large batches and they include only the effects they are programmed to model. The latter is also their major disadvantage: they do not include any effects that are not already known to be relevant, or effects that are not practical to calculate from theory.
This mixture of advantages and disadvantages can be seen in the growing body of research done on QSARs using descriptors from computational chemistry. One such class of descriptors includes physico-chemical properties (solubility, Henry's law constants, partitioning constants) calculated using the conductor-like screening models COSMO-RS, 106-108 COSMOtherm, 109,110 and COSMO-SAC. [111][112][113] These are poly-parameter statistical models 40 using combinations of parameters that are balanced (i.e., mechanistically complementary and independent) and calculated based (partly) on theory.
Another class of descriptors that are obtained from computational chemistry calculations includes one-electron oxidation or reduction potentials (E1), which are used in QSARs for rates of contaminant degradation by redox reactions. 79,80,114,115 In this case, the calculated potentials require calibration using experimental data, and the experimental calibration data can be measured by several methods, including electrochemistry and pulse radiolysis. The electrochemical measurements can be confounded by nonideal behaviour, such as irreversibility, which is not included in the theoretical calculations, so there is a mismatch between these two variables that might result in less accurate calibrations. 116,117 Alternatively, E1 measured by pulse radiolysis 114,115,117-119 is a better estimate of reversible redox potentials, and therefore is better matched to potentials calculated from computational chemistry. However, E1 from pulse radiolysis is not necessarily more closely matched to the processes that are controlling solution-phase oxidation kinetics, so it may not provide the most useful, or even the most accurate, structure-activity relationships for oxidation reactions of environmental interest. 36
In principle, this approach could be extended to "fully in silico" calibration of QSARs: i.e., statistical correlations calibrated with descriptor and response variable data calculated from molecular modelling. This was the original goal in a study of the hydrolysis and reduction of nitro aromatic compounds 78,80 and oxidation of their corresponding aromatic amines. 79 However, complexity and uncertainty in the mechanism of the hydrolysis and oxidation reactions made it infeasible to calculate their rates entirely from theory, and even the comparatively well-defined and simple mechanism of reduction proved challenging to model for more than a few compounds. 80

Pathway as opposed to property prediction
A relatively new challenge that has emerged in recent years is the ability to predict transformation pathways as a function of environmental conditions. This is partly due to growing recognition that the resulting transformation products can be of more concern than the parent compound with respect to ecological and human toxicity. Our understanding of the process science underlying abiotic and biologically-mediated transformations has progressed to the point that it is now feasible to construct reaction libraries that "encode" the process science that is described in the peer-reviewed literature or publicly available government regulatory documents. The resulting libraries represent reactions as single-step transformations of functional groups, and can include purely chemical reactions (e.g., hydrolysis, reduction, and photolysis) and biologically-mediated processes (i.e., aerobic and anaerobic biodegradation and human metabolism).
The development of reaction libraries is accomplished by the use of reaction transform languages such as SMIRKS and SMARTS, 120 in conjunction with cheminformatics software tools. The execution of these reaction libraries predicts the major transformation pathways and their products. Although well-developed tools that execute reaction libraries for human metabolism are commercially available, currently only one tool is available for executing reaction libraries that predict environmental fate (enviPath [121][122][123]), and this tool currently contains rules only for aerobic biodegradation. Additional software and libraries for predicting contaminant pathways and products are under development, such as for ozonation of micropollutants under water treatment conditions. 124,125
A common challenge to developers of tools to predict transformation pathways is to minimize the prediction of irrelevant transformation products, sometimes referred to as the "combinatorial explosion", which has been defined as the prediction of many irrelevant transformation products when transformation pathways are iteratively applied to predict consecutive transformation reactions. 126 Strategies that have been used to minimize the problem of combinatorial explosion include assignment of likelihoods to the generalized transformation pathways in a defined reaction library, 126 a relative reasoning approach, 126 a combined absolute and relative approach, 127 a hybrid knowledge and machine learning-based approach, 121 and an approach based on the development of reaction rules for selectivity, reactivity, and exclusion. Fig. 5 provides an example of a reaction scheme for the hydrolysis of halogenated aliphatics (RX) containing vicinal halogens through HX elimination, and shows how this reaction scheme can be pruned by use of rules for selectivity as well as reactivity. In this example, the reaction scheme predicts that hydrolysis of 1,2-dibromo-3-chloropropane (DBCP) could yield four products; however, only one hydrolysis product (2-bromo-3-chloropropene) is predicted when rules for selectivity are included, which is consistent with experimental results. 128 The reactivity rule states that the ease of removal of halogens (labeled reactant atom 3 in the reaction scheme) increases with their atomic number (i.e., I > Br > Cl > F), because the carbon-halogen bond strength is greatest for the most electronegative halogen. 129 Execution of the selectivity rule, which states that the hydrogen attached to the β-carbon having the fewest hydrogen substituents is preferentially eliminated, 130 predicts that elimination of the hydrogen in the 2-position of DBCP is the only major pathway.
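The pruning logic in the DBCP example can be illustrated with a small rule-based sketch. This is not the SMIRKS-based implementation used by actual reaction-library tools; it is a simplified stand-in that assumes a linear carbon chain encoded as a hypothetical list of dictionaries, and it applies both rules globally across all candidate eliminations, which is how the outcome is described in the text.

```python
# Hypothetical encoding of 1,2-dibromo-3-chloropropane (DBCP), BrCH2-CHBr-CH2Cl:
# one entry per carbon, listing its halogens and its number of hydrogens.
DBCP = [
    {"halogens": ["Br"], "n_h": 2},  # C1
    {"halogens": ["Br"], "n_h": 1},  # C2
    {"halogens": ["Cl"], "n_h": 2},  # C3
]

# Reactivity rule from the text: I > Br > Cl > F (lower rank = removed more readily)
LEAVING_RANK = {"I": 0, "Br": 1, "Cl": 2, "F": 3}

def enumerate_eliminations(chain):
    """List every HX elimination: X leaves one carbon, H leaves a neighbouring
    carbon that carries at least one hydrogen. Carbons are numbered from 1."""
    candidates = []
    for i, carbon in enumerate(chain):
        for halogen in sorted(set(carbon["halogens"])):
            for j in (i - 1, i + 1):
                if 0 <= j < len(chain) and chain[j]["n_h"] > 0:
                    candidates.append(
                        {"x_carbon": i + 1, "halogen": halogen, "h_carbon": j + 1}
                    )
    return candidates

def apply_reactivity_rule(candidates):
    """Keep only eliminations that remove the best leaving group present."""
    best = min(LEAVING_RANK[c["halogen"]] for c in candidates)
    return [c for c in candidates if LEAVING_RANK[c["halogen"]] == best]

def apply_selectivity_rule(candidates, chain):
    """Keep only eliminations that take H from the carbon with the fewest hydrogens."""
    fewest = min(chain[c["h_carbon"] - 1]["n_h"] for c in candidates)
    return [c for c in candidates if chain[c["h_carbon"] - 1]["n_h"] == fewest]

all_candidates = enumerate_eliminations(DBCP)
pruned = apply_selectivity_rule(apply_reactivity_rule(all_candidates), DBCP)

print("unpruned candidates:", len(all_candidates))
print("after pruning:", pruned)
```

With this encoding, the unpruned scheme yields four candidate eliminations, and the two rules together leave only loss of Br from C1 with H from C2, which corresponds to the 2-bromo-3-chloropropene product named in the text.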
An additional challenge to the developers of these tools is the need to incorporate the effect of environmental conditions on rates and pathways for many classes of chemicals. For example, it has been well documented how changes in pH can have significant effects on both the rates and hydrolysis pathways of organophosphorus triesters in aquatic ecosystems. 129 This need to account for environmental conditions is discussed in greater detail below.

Incorporating environmental conditions
Insofar as the ultimate goal of in silico environmental chemical science is describing the fate and effects of substances in real (outdoor) environments, it is not always sufficient to model only response variables that are formulated as fundamental properties with the effects of environmental conditions largely factored out. However, leaving too many environmental effects factored into the response variable will limit the applicability of the model to environments with different conditions. Conceptually, the ultimate solution to this dilemma is to incorporate both substance and environmental properties into the model as descriptor variables of the overall response variable. These three types of variables correspond to the x, y, and z in the conceptual model shown in Fig. 6.
An example that clearly ts the conceptual model shown in Fig. 6 is the overall bioavailability (response variable, z) of a class of contaminants of soil and sediment that comes in complex mixtures (e.g., PCBs, PCDDs, PAHs). In this example, one independent variable axis (x) represents the range of properties of the family of congeners (e.g., K oc of different PCBs) and the other independent variable (y) represent the range of environmental conditions (e.g., quantity and composition of sorptive phases in the sediment across a site). An application of the conceptual model in Fig. 6 to degradation of contaminants is exemplied by work on the natural attenuation of chlorinated hydrocarbon (CHC) solvents in groundwater. 131,132 In that case, the overall response is the decrease in total contaminant load (in terms of concentration, equivalent toxicity, etc.); the environmental descriptor is the type and quantity of reducing materials that can dechlorinate the contaminants (iron oxides, suldes, microorganisms, etc.) and the contaminant descriptor is the reactivity of each contaminant with the various reductants (i.e., specic rate constants). The area under the surface could be the overall decrease in contaminant concentration at a site, or rate of decrease in concentration, change in equivalent toxicity, etc. depending on exactly how the response variable is formulated.
A traditional 2-D QSAR corresponds to a cross section through Fig. 6 in the x-z plane, for a particular y. In principle, a QSAR through the y-z plane can be defined at a particular x (e.g., dechlorination of one CHC by environmental phases with different reduction potentials), but this is a challenging frontier with few examples at this time. Ultimately, it would be desirable to fully define whole response surfaces such as those shown in Fig. 6, but this also is impractical with current methods. Progress toward the ultimate goal of QSARs that represent both substance and environmental properties has been limited, and the relatively few attempts to do that illustrate the challenges that arise. For example, rates of contaminant reduction in anoxic sediments have been described with a polyparameter QSAR that includes descriptors of both contaminant properties and sediment conditions, 133 but there is too much uncertainty over how to parameterize the environmental factors in such models for them to predict absolute rates of contaminant reduction in sediments with confidence.
Fig. 5 Example of pathway prediction with rules for reactivity and selectivity, showing initial steps for dehalogenation of 1,2-dibromo-3-chloropropane (DBCP) by elimination. Four products are predicted by the generalized reaction scheme, but pruning with reactivity and selectivity rules predicts only one major product, 2-bromo-3-chloropropene.
Fig. 6 Conceptual model for 3-dimensional QSARs with response variable on the z-axis, substance property on the x-axis and environmental properties on the y-axis. Traditional QSARs correspond to a cross-section of the surface in the x-z plane. The drawing is not based on any particular data so the surface shape is arbitrary and the axes are not numbered.
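The idea that a conventional QSAR is a cross-section through a response surface can be made concrete with a small sketch. The bilinear surface below, its coefficients, and the function names are all hypothetical. Like the drawing in Fig. 6, the surface is arbitrary and is not fitted to any data.

```python
def response(x, y, b0=-1.0, bx=0.8, by=0.5, bxy=0.2):
    """Hypothetical response surface z = b0 + bx*x + by*y + bxy*x*y.

    x is a substance descriptor and y is an environmental descriptor.
    """
    return b0 + bx * x + by * y + bxy * x * y


def cross_section_slope(y, x0=0.0, x1=1.0, **coeffs):
    """Slope dz/dx of the x-z cross-section at fixed y (a conventional 2-D QSAR)."""
    return (response(x1, y, **coeffs) - response(x0, y, **coeffs)) / (x1 - x0)


# Curved surface (bxy != 0): the QSAR slope depends on the environment y.
print(cross_section_slope(0.0), cross_section_slope(2.0))

# Surface without the interaction term (bxy = 0): every cross-section has the same slope.
print(cross_section_slope(0.0, bxy=0.0), cross_section_slope(2.0, bxy=0.0))
```

In this toy surface the slope of each x-z cross-section equals bx + bxy*y. A QSAR calibrated under one set of conditions therefore transfers to another only when the interaction term is negligible.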
While the full conceptual model represented by Fig. 6 is difficult to parameterize with specific data, general consideration of the model can provide some useful insights into the process of QSAR formulation. An example of this involves the fungibility of factor allocation among the three types of variables, which is a key aspect of the "art" of formulating successful QSARs, and a manifestation of the principles of matching and balancing described above. By fungibility, we mean that one arrangement of factor allocations sometimes can be replaced by another with similar results. For a specific example of this, consider the strategy of developing QSARs using response variables that are normalized to data for a reference substance in order to remove variability in the data due to experimental or environmental conditions. 14,79,134 This strategy effectively collapses the surface in Fig. 6 into a conventional 2-D QSAR by moving the information about environmental conditions from the y into the z axis. However, implicit in this strategy is the assumption that environmental effects are uniform across the range of QSARs based on substance descriptor variables (i.e., that the relationship in Fig. 6 is a flat plane, not a curved surface), and this is not always true. 79

Integration of databases, pathway prediction systems, and chemical property predictors
The primary role of chemical exposure assessment models used for regulatory decision making is to provide estimated environmental concentrations (EECs) of the chemical of interest and its potential transformation products in environmental media. Examples of models that calculate EECs for pesticide exposures include FIRST (FQPA Index Reservoir Screening Tool), GENEEC2 (GENeric Estimated Environmental Concentration), and SWCC (Surface Water Concentration Calculator).
The parameterization of these models requires chemical properties and knowledge of the dominant transformation products formed from the environmental transformation of the parent chemical as a function of environmental conditions. The data submitted for the chemical registration process prescribed by environmental laws, such as the Toxic Substances Control Act (TSCA), the Federal Insecticide, Fungicide, and Rodenticide Act (FIFRA), and the Registration, Evaluation, Authorization and Restriction of Chemicals (REACH) regulations, are typically measured in media-specific systems. For example, the rates and transformation product formation for chemically-mediated transformation processes such as hydrolysis and photolysis are measured in water under pH-controlled conditions, and biologically-mediated processes such as aerobic and anaerobic biodegradation are measured in soils and sediments. In reality, transformation pathways such as hydrolysis and abiotic reduction in anoxic sediments, photolysis and hydrolysis in aquatic ecosystems, and aerobic biodegradation and hydrolysis in aerobic soils, will occur simultaneously.
The parameterization of exposure assessment models requires the development of integrated tools that reflect this reality (i.e., have the ability to provide the data required for the estimation of environmental concentrations as a function of environmental conditions). 135,136 The hallmarks of an integrated tool for predicting environmental transport and transformation include (i) seamless connection to databases of measured and calculated chemical properties and chemical pathway prediction systems; (ii) calculation of chemical properties based on the execution of multiple property calculators and prediction of transformation pathways based on environmental conditions; (iii) simultaneous execution of multiple reaction libraries based on specific transformation pathways; (iv) parameterization and execution of QSARs for the calculation of transformation rate constants; (v) high-throughput analyses (i.e., run in batch mode); and (vi) open access to the general public.
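One reason simultaneous pathways matter for exposure estimates can be shown with a short sketch of parallel first-order kinetics. It assumes that each process is (pseudo-)first-order in the parent chemical and that the processes act independently. That is a common simplification, not a method prescribed by the tools discussed here. The function name and rate constants are hypothetical.

```python
import math


def combine_parallel_first_order(rate_constants):
    """Combine independent first-order processes acting on one parent chemical.

    rate_constants maps a process name to its rate constant (same units, e.g. 1/d).
    Returns the overall rate constant, the overall half-life, and the fraction
    of the parent chemical transformed by each process.
    """
    k_total = sum(rate_constants.values())
    fractions = {name: k / k_total for name, k in rate_constants.items()}
    return k_total, math.log(2) / k_total, fractions


# Hypothetical rate constants (1/d) for an anoxic sediment scenario.
k_total, half_life, fractions = combine_parallel_first_order(
    {"hydrolysis": 0.01, "abiotic reduction": 0.03}
)
print(k_total, half_life, fractions)
```

Under these assumptions the overall half-life is shorter than that of either process measured in isolation, and the product distribution follows the ratio of the rate constants. Single-medium test data alone cannot supply either result.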
Movement towards web-based databases and tools, and development of the software technologies that will enable seamless calls to these systems through web-based services, is accelerating the development of integrated computational systems. This ability for seamless linkage will reduce the need for duplicative efforts, resulting in significant savings of resources and time. Examples of databases and pathway prediction systems that are currently web-based, or are currently being updated as web-based tools, include EPA's ICSS Chemistry Dashboard, 137,138 CEFIC's AMBIT, 139 OECD's Toolbox, 28 and EAWAG's enviPath. 121,122 The ICSS Chemistry Dashboard is a web-based database for ~700 000 chemicals that maps curated physicochemical property data associated with chemical substances to their corresponding chemical structures. EnviPath is an aerobic biodegradation reaction library based on 332 biotransformation descriptions for 249 biotransformation rules. Web services are currently being developed for enviPath.
To address the need for a fully integrated tool, EPA's Office of Research and Development is currently developing the Chemical Transformation Simulator (CTS), with release to the general public planned for late 2017. The primary components of the CTS are a Physico-Chemical Property Calculator (PPC) and a Reaction Pathway Simulator (RPS) (Fig. 7). The PPC will allow the user to compare properties generated by a variety of calculators that take different approaches to estimating specific physicochemical properties. The calculators currently implemented include EPI Suite, which uses a fragment-based approach; TEST (Toxicity Estimation Software Tool), which uses QSAR-based approaches; and ChemAxon plug-in calculators, which use an atom-based fragment approach. The output derived from these calculators will enable the user to compare the calculated data with measured data extracted from readily accessible web-based databases (e.g., ICSS Chemistry Dashboard).
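The kind of side-by-side comparison described for the PPC can be sketched as follows. This is not the CTS interface. The function name, the calculator labels, and every numerical value are hypothetical placeholders chosen only to show the bookkeeping.

```python
from statistics import mean


def compare_estimates(estimates, measured=None):
    """Summarize one property as estimated by several calculators.

    estimates maps a calculator label to its estimate of the property.
    measured is an optional measured value of the same property.
    """
    values = list(estimates.values())
    summary = {
        "mean": mean(values),
        "spread": max(values) - min(values),
    }
    if measured is not None:
        # Signed deviation of each estimate from the measured value.
        summary["deviation"] = {
            name: value - measured for name, value in estimates.items()
        }
    return summary


# Placeholder log Kow estimates from three hypothetical calculators.
summary = compare_estimates(
    {"calculator_a": 2.0, "calculator_b": 2.5, "calculator_c": 3.0},
    measured=2.4,
)
print(summary)
```

A large spread between calculators, or a consistent deviation from the measured value, would flag a property estimate that deserves closer scrutiny before it is used to parameterize an exposure model.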
The RPS allows the user to select individual or multiple reaction libraries depending on the environmental media of interest. The beta version of the CTS has reaction libraries for hydrolysis, reduction, and human metabolism. A reaction library for photolysis is currently under development and will be available for the fully functional version of the CTS. This updated version of the CTS will have the ability to execute a reaction library for aerobic biodegradation through seamless linkage to the EAWAG PPS using web services that are currently being developed for this tool.
A Reaction Rate Calculator (RRC) is also under development for the fully functional version of the CTS. The RRC will provide for the parameterization and subsequent execution of QSARs for the prediction of transformation rates. Currently, rate constants for transformation processes represent a significant data gap for the parameterization of models used for estimating environmental concentration. The RRC will be limited by the availability of existing QSARs and the ability to construct new QSARs for this purpose.

Future prospects
The scope of this perspective reflects the maturity of traditional statistical QSAR methods for predicting the properties of chemicals that determine their environmental fate and effects; the great potential of theoretical/computational chemistry methods for improving the prediction of chemical properties or characterization of transformation pathways; and the transformative impact of integrating QSAR and molecular models, with informatic and internet tools, to make predictive modelling more accessible, efficient, and comprehensive. The emphasis on chemical contaminants reflects the balance of focus of most work on development and application of in silico models in environmental chemical science to date. However, some of the methods and results developed for modelling chemical fate and effects in the environment should also apply to other substances. This is evident in the still young but rapidly developing application of QSARs to materials, especially nanoparticles. [140][141][142][143][144][145] An even greater challenge lies in the extension of QSAR methods to biological properties like virulence (i.e., virulence factor activity relationships, VFARs). [146][147][148] The challenges involved in implementing useful VFARs are considerable, but may eventually succumb to the combination of advances from computational toxicology 1 and omic sciences. 149

Disclaimer
The views expressed in this article are those of the author(s) and do not necessarily represent the views or policies of the U.S. Environmental Protection Agency or other sponsoring agencies. Mention of trade names or products does not convey, and should not be interpreted as, official EPA approval, endorsement, or recommendation.
Fig. 7 Major components of the Chemical Transformation Simulator including Chemical Editor, the Reaction Pathway Simulator. Links to enviPath provide the ability to generate transformation products resulting from aerobic biotransformation. Links to the ICSS Chemistry Dashboard provide additional calculated and measured chemical properties, as well as curated chemical structures.