Cost-eﬀective materials discovery: Bayesian optimization across multiple information sources †

Applications of Bayesian optimization to problems in the materials sciences have primarily focused on consideration of a single source of data, such as DFT, MD, or experiments. This work shows how it is possible to incorporate cost-eﬀective sources of information with more accurate, but expensive, sources as a means to significantly accelerate materials discovery in the computational sciences. Specifically, we compare the performance of three surrogate models for multi-information source optimization (MISO) in combination with a cost-sensitive knowledge gradient approach for the acquisition function: a multivariate Gaussian process regression, a cokriging method exemplified by the intrinsic coregionalization model, and a new surrogate model we created, the Pearson-r coregionalization model. To demonstrate the eﬀectiveness of this MISO approach to the study of commonly encountered materials science problems, we show MISO results for three test cases that outperform a standard eﬃcient global optimization (EGO) algo-rithm: a challenging benchmark function (Rosenbrock), a molecular geometry optimization, and a binding energy maximization. We outline factors that aﬀect the performance of combining diﬀerent information sources, including one in which a standard EGO approach is preferable to MISO.


Introduction
At the forefront of materials sciences, there is a set of research topics that remain largely inaccessible, experimentally and computationally, due to their combinatorial complexity.As one topical example, the study of high entropy alloys has exploded into an almost insurmountable combinatorial problem. 1 In the biological sciences, the large conformational space of protein folding has proven to be exceedingly difficult to tackle. 2 In recent years, machine learning (ML) techniques have shown promise in their application to challenges in the physical and biological sciences, proving to be an effective means to tackle intractable compositional and/or high-dimensional problems.Several landmark studies are emerging, as evidenced by examples using ''deep learning'' approaches on protein folding such as AlphaFold, regression methods for transformation temperature predictions in shape change alloys, 4 the use of a random forest approach to predict the thermoelectric properties of materials, 5,6 the use of neural networks to predict molecular 7 and atomic 8 energies, and using Bayesian optimization to unravel the solution processing of the hybrid organic-inorganic perovskite (HOIP) combinatorial space. 9More nuanced applications have also arisen such as the use of deep transfer learning for materials property prediction, the necessity for multi-objective optimization in the case of a Department of Chemical and Biomolecular Engineering, Johns Hopkins University, Baltimore, MD 21218, USA.E-mail: hherbol@jhu.edub Uber AI, San Francisco, CA 94105, USA

New concepts
Bayesian optimization methods require an acquisition function (where to search next) and a surrogate model (mimicking the behavior of real systems).We create a novel algorithm that uses the only existing acquisition function capable of taking information from multiple sources (e.g., different experimental sources and/or simulation approaches) in conjunction with a new surrogate model that finds the optimal result meeting a pre-specified objective.Current surrogate models require many fitting parameters, restricting their applicability to less complex domains.Our new model minimizes the number of such hyperparameters and yet frequently performs far better than more complicated approaches.This opens the door to considering larger combinatorial problems than previously possible.We show that our new algorithm is successful at accelerating the search for optimal solutions of common materials science problems, like geometry optimization or optimal solvent choice.We identified that our multi-information source approach will work best in well-correlated systems.Noisy information sources make our approach only comparably effective to a standard EGO approach.Overall, this is an important new addition to existing Bayesian optimization tools, one that functions in decision-making more like our own brains, considering many pieces of information before deciding upon the best solution in the most effective manner.
durable antifogging superomniphobic supertransmissive nanostructured glass development, 11 the use of a support vector machine (SVM) to efficiently identify potential antimicrobial peptides, 12 the development of a novel molecular reaction fingerprint for the study of redox reactions, 13 the use of genetic algorithms to estimate parameters in large kinetic models, 14 the use of a variety of these methods to explore the organic photovoltaic (OPV) space, 15,16 the use of Gaussian process regression (GPR) to model quasar emission spectra for the detection of Ly a absorbers, 17 and a novel neural network design for encodingdecoding molecules to/from continuous space. 18lthough such studies, and many others, are demonstrating the breadth of applicability of ML to materials sciences, they invariably use a single source of data/information, which we will designate as an information source (IS).It is clear that improvements in speed and cost could be made by learning information from a cost-effective source and limiting predictions from other, more expensive, sources.Ultimately, using multiple information sources could lead to more robust predictions of materials' choices and/or properties, not to mention the potential for lowering the cost (in time and resources) if a cheaper information source can be used in place of a more expensive one.For materials studies, information sources could be provided by experimental data, continuum modeling predictions, molecular dynamics (MD) simulations, quantum mechanically derived density functional theory (DFT) calculations, various ML models -neural networks (NN), Gaussian processes (GPs), random forests, etc. -or even an intuitive rule of thumb.Each information source has its own inherent accuracy and cost.In this paper, we will show how to use a combination of sources of information within a Bayesian optimization framework to significantly accelerate materials discovery for common, important calculations in the computational sciences.
Bayesian optimization, from a high-level point of view, involves the process of using Bayes rule to optimize a function.To understand why this is relevant, we must first contemplate the problem of optimizing a continuous, black-box, noisy function, g(x).If we have no way of appreciating the function, our search is blind.However, what if we can devise a function, f, that, given our observations from D = g(x), allows us to make more accurate predictions on g(x)?In essence, can we find f (x|D) B g(x)?If so, we can then use f (x|D) to determine what candidate points x we should sample to maximize (or minimize) g(x).This can be accomplished using Bayes rule (eqn (1)).

Pð f jDÞ
In Bayesian optimization, we call this underlying model the surrogate model; it is frequently chosen to be a GPR.The choice of x from our surrogate model to sample next (where ''sample'' means calling our black-box function) is determined by some suitable acquisition function.For a more in-depth background into Bayesian optimization, we direct the reader to a succinct review by Peter Frazier. 191][22] This approach uses either spatial proximity (a cross-variogram) or correlation (crosscovariance) to define a matrix, called a coregionalization matrix, to interpolate one IS from data in another IS. 20,23his is illustrated in Fig. 1, in which a prediction made from a coarse IS is used to glean information on a finer IS.Kriging is perhaps best known as a geostatistics term for interpolation using a GP, while cokriging is simply the extension of this interpolation to multiple, highly correlated, data sets. 20Cokriging has been used to study a wide range of geological systems, from predicting surface temperatures from elevation 24 to capturing the correlation of rainfall measurements from radar to standard rain gauges. 25n what follows, we will use an IS index number to denote the ''accuracy'' of the source (with accuracy being subjective and based on the user's viewpoint).We denote IS 0 to be more accurate than IS 1 , and so on.
An alternative approach to handling data from several information sources was recently published as a method employing multi-information source optimization with a Knowledge Gradient (misoKG).In that work, a cost-sensitive Knowledge Gradient (csKG) acquisition function was used with a standard multivariate Gaussian process regression (MGP) surrogate model. 26These two components, an acquisition function and a surrogate model, are the building blocks of an optimization algorithm.Hence, consideration of the choices of these two components will figure prominently in the discussions below.Multi-information source optimization (MISO) works by sampling in such a way that the cost is minimized, while the accuracy of the predicted (optimal) result is maximized.The underlying csKG acquisition function allows for this by determining which point, and which source of information, to sample next.If the underlying GPR indicates no (or poor) correlation between the sources of information, then the model defaults to the well-known Knowledge Gradient (KG).At this point, we draw the reader's attention to the difference between MISO and multi-fidelity optimization: MISO can be seen as a generalization of multi-fidelity approaches from costeffective approximations to encompass any correlated source of information.
Each approach used here builds from an underlying Gaussian process, meaning that the code will need to learn empirical fitting parameters.These parameters, called hyperparameters, can vary depending on the choice of surrogate model.With more hyperparameters, it is possible to obtain a better regression; however, this comes at a cost.To fit hyperparameters, it is common to use an approach such as the Maximum Likelihood Estimation (MLE) or Maximum A Posteriori (MAP) estimation. 19The larger the hyperparameter space, the noisier the landscape, and the more difficult it becomes to adequately learn the hyperparameters.Assuming the same model for each information source, the number of hyperparameters in MGP scales linearly with the number of information sources.Coregionalization, on the other hand, will add at most mðm þ 1Þ 2 hyperparameters, being the components of a lower triangular matrix, where m is the number of information sources.This provides a strong impetus to consider coregionalization as a surrogate model.
In this work, we present a ''coregionalized csKG approach, using the csKG acquisition function with Bayesian optimization methods based on the intrinsic coregionalization model (ICM). 27nlike the original multivariate Gaussian process regression (MGP) surrogate model, where the number of hyperparameters effectively scales with the number of information sources, we show it is possible to define the coregionalization matrix using significantly fewer hyperparameters.Further, we introduce an entirely new approach capable of generating the necessary coregionalization matrix, based on Pearson-r correlation coefficients, 28 which has the considerable advantage of removing the need for additional hyperparameters altogether.We call this surrogate model the Pearson-r coregionalization model (PCM).As a ''proof of concept,'' we apply a unique combination of the csKG acquisition function and the new PCM surrogate model to explore complex compositional landscapes involved in the solution processing of a novel class of solar cell materials, known as hybrid organic-inorganic perovskites (HOIPs), and other common computational materials applications.

Nomenclature
To merge the naming conventions between the ML and DFT communities, we present two approaches to identify the models used in this paper.Within the ML community, it is common practice to identify an acquisition function/surrogate model pair by a single name.This can be seen in the case of the commonly used efficient global optimization (EGO) algorithm, 29 which merges expected improvement (EI) with GPR (or, in the case of the original paper, this is referred to as ''kriging'').Similarly, the misoKG algorithm uses a csKG acquisition function with an MGP surrogate model.To maintain this naming convention, we will then define ''PearsonKG'' to mean pairing the csKG acquisition function with the PCM surrogate model.In regards to the csKG with the ICM surrogate model, we note that a similar formulation exists with an alternative acquisition function: entropy search.This algorithm was dubbed multi-task Bayesian optimization (MTBO) 30 and, as such, leads to our choice of its name as ''MultiTaskKG.'' This naming scheme has the benefit of recognizing an algorithm by name; however, it does not allow readers to easily parse the details of the constituent models.Within the DFT literature, the solution for naming the choices of functional and basis-set that define the overall approach is to list the two separated by a forward slash.In the spirit of the DFT naming convention, we express the specific names favored by the ML community by a combination of the underlying acquisition function and surrogate model.As a result, we identify the aforementioned combinations as follows: EGO = EI/GPR misoKG = csKG/MGP MultiTaskKG = csKG/ICM PearsonKG = csKG/PCM Within this paper, we will refer to the algorithm name itself; however, we define the above in an effort to consolidate naming conventions within the ML and DFT literature.This extensible approach allows new researchers in the field to readily understand the taxonomy of algorithmic names in this area of machine learning.

Results
We benchmark three MISO surrogate models -PCM, ICM, and MGP -against a standard EGO approach using the Rosenbrock function as a first test case. 31The Rosenbrock function, with its long, narrow and very flat parabolic basin, is a difficult optimization problem that is commonly used for benchmark purposes.This test case also has probative value since it allows us to benchmark against the original misoKG paper by Poloczek et al. 26 As a second test case, we study the effects of differing DFT functionals/basis sets as sources of information for the geometry optimization of carbon monoxide.Finally, we revisit the HOIP work 9 and assess the benefits of using MISO approaches, in which different levels of theory within DFT are deployed as a set of information sources, as well as differing molecular systems.
Running a MISO approach does not guarantee that one information source will be sampled over another.As such, given that sampling from IS 0 does not necessarily have the same cost as sampling from IS 1 , we end up with a heterogeneous data set where cost varies between replications.This can be conceptualized by considering two experiments in which we run a MISO approach using two sources, IS 0 and IS 1 , whose costs we estimate to be 1000 and 1, respectively.If, on the first experiment, we sampled IS 0 4 times and IS 1 2 times, versus on the second experiment where we sampled IS 0 6 times, our 6th data point will be at a cumulative cost of either 4002 or 6000, depending on the experiment.As we replicate our evaluations several times for statistical significance, it is no longer possible to simply ''average across all replications.''Thus, we must homogenize the sampling domain so as to average the value that first exceeds a given cost.We then plot the x associated with the maximum (or minimum, depending on the problem) posterior mean of the IS 0 model.The ''best'' model is identified as the model that achieves global maximization with the least cost.
Hyperparameter optimization can be achieved in at least three different ways: (1) with data sampled across all IS (i.e., x, if x is sampled for all IS i , mathematically shown as 2) with data sampled only within IS 0 -the most expensive information source -or (3) with all sampled data.In order, we will call these IS Intersection , IS Costly , and IS Full .Additional benchmarks are shown in the ESI.† In this work, we learn the hyperparameters from the data we sampled initially, and then keep them fixed for the remainder of the optimization.As such, we only concern ourselves with IS Intersection and IS 0 .
Finally, as the Bayesian optimization is performed for a combinatorial problem, we discretize the domains we wish to optimize over.The larger the discretized domain, the more complex the problem, and the harder it is to find the global extrema.This class of problem is seen as ''combinatorial optimization'' for ''large discrete domains.''In the case of the Rosenbrock function, we illustrate below three different discretized domains as a way to illustrate the benefit of using MISO over that of using EGO, especially as the complexity of the problem increases.

The Rosenbrock function
Fig. 2 shows the results from a variety of MISO approaches in which IS 0 was the standard 2D Rosenbrock function (see eqn ( 5)), and a slightly noisier alternative was chosen as IS 1 (in which the amplitude of the sine noise, v, was set to 0.1).To achieve better statistical significance, we ran 200 replications of each and plotted with AE2 standard error of the mean (SE).
Fig. 2 A comparison of MISO approaches to a standard EGO approach to find the minimum of the Rosenbrock function (eqn ( 5)).In all cases, the three The value of a MISO approach in all panels of Fig. 2 can clearly be seen in comparison to a more standard Bayesian optimization approach like EGO, which is shown as a control.Candidate (x,y) data were sampled from [À2,2] 2 in increments of 0.016, 0.008, or 0.004 (for a sampled discrete domain size of 250, 500, or 1000, respectively).This comparison of domain complexity is illustrated in Fig. 2a-c, where the superiority of MISO approaches becomes increasingly apparent.In contrast, Fig. 2d shows results from MISO approaches in which IS 0 was the standard Rosenbrock function (see eqn (5)), but a significantly noisier alternative was chosen as IS 1 (in which the amplitude of the sine noise, v, was set to 10.0).Corresponding numerical results are shown in Table 1.

DFT information sources for geometry optimization in CO
We studied the ability of the same four statistical models to minimize the total energy of a carbon monoxide (CO) molecule.This effectively performs a geometry optimization, a common task using DFT, via Bayesian optimization.Information sources were taken as being either a single-point SCF calculation from a double-hybrid method, with a triple-z basis set (B2PLYP and Def2-TZVP), 32,33 which is an accurate and expensive option, or a simple (inexpensive) Hartree-Fock approach with ''three corrections'' 34 in which (1) a geometrical counterpoise correction to remove basis set superposition error, [35][36][37] (2) the D3BJ dispersion correction, 38 and (3) the MINIX basis set. 39Note that these information sources correlate well, with a Pearson-r correlation coefficient of 0.999.As we will show, this strong correlation is an important consideration.The results are shown in Table 2, where the ICM and PCM surrogate models perform the best (i.e., have the lowest mean cost).What is more strikingly apparent though is the improvement to the 99.9th percentile, where PearsonKG shows a 76% improvement compared to that of EGO.

Physical analytics pipeLine test case for HOIP materials
A major difficulty associated with studying hybrid organicinorganic perovskites (HOIPs) computationally lies in the fact that (1) possible candidate materials differ in composition, (2) the compositional space represents a large combinatorial problem, and (3) no MD force field exists that is suitable to use in cost-effective simulations for HOIPs candidates.As a result, computational research on HOIPs formation and growth is currently restricted largely to expensive DFT calculations.
Our previous work 9 showed the benefits of using Bayesian optimization to tame this complexity, and developed a probabilistic model for HOIP-solvent intermolecular binding energies.Here, we investigate the benefits of using multiple information source optimization approaches.The alternate IS in this case consist of data sources in which we varied (1) the number of solvents considered to be bound to the lead salt and (2) differing a Results using the same three statistical models, but for an extremely noisy Rosenbrock function.DFT levels of theory (defined as differing functionals and basis sets), which can vary considerably in expense of calculation.
As semi-empirical force fields for HOIPs become available in the future, these could also be used within a MISO approach.
Results of these tests are shown in Fig. 3.As the MISO surrogate models require an initial sampling from all the information sources prior to running the optimizer, this produces a systematic offset on the x-axis in Fig. 3a-c (and most readily observable in Fig. 3c) indicative of this additional training cost.In contrast, EGO starts optimizing sooner since its initial training is solely against the expensive IS 0 .The largest benefits to using MISO approaches can be seen in Fig. 3a, in a test case in which the two information sources are both DFT functionals (inexpensive GGA and expensive hybrid approaches).In contrast, in Fig. 3b and c, the two different information sources concern information garnered from the number of solvent molecules bound to the lead salt (one solvent molecule vs. three solvent molecules in Fig. 3b, and three solvent molecules vs. five solvent molecules in Fig. 3c).Here, the advantage of using a MISO approach is far less apparent.The origin of this change is, we believe, a function of how noisy the energy landscape becomes as the number of solvent molecules increases which, in turn, necessitates a growing importance of adequate sampling.
To assess the noise in the energy landscapes, we generate a cross-correlation table of possible HOIP-solvent information sources (Table 3).The various information sources are distinguished by the level of theory (GGA vs. hybrid) and the number of solvents (1, 3, or 5).The generalised gradient approximation (GGA) level of theory used was B97-D3 with a triple-z basis set; 38,40 while the hybrid functional was PW6B95 with a triple-z basis set. 33,41The naming convention used was either GGA or hybrid followed by N, where N was 1, 3, or 5 (signifying the number of solvents).Results for all the information sources are Fig. 3 Comparison of the impact of different information sources for HOIP materials in the performance of MISO methods versus a standard EGO approach.The results show that EGO performs as well as, and at times better than, MISO methods in cases (Fig. 4b and c) where the information sources are poorly correlated.MISO methods fall back to EGO-like performance in such cases (with the exception of the x-offset due to the initial sampling).
Table 3 The cross-correlation matrix of all HOIP information sources.

Correlation is calculated only on data that exists across both information sources
Hybrid-1 GGA-1 GGA-3 GGA-5 Hybrid-1 plotted in Fig. 4. We find that GGA-1 correlates best with other IS (i.e., it consistently has a correlation factor over 0.8).

Discussion
As we explore the ability of machine learning to tackle grand challenges in computational materials science, it is natural to want to take advantage of information from a variety of sources and to combine them in such a way that we can make predictions of materials properties or optimal materials discovery in the most cost-effective way possible.In this paper, we use a newly proposed acquisition function, csKG, 26 in concert with several surrogate models, including a new model proposed here, that can harness multiple information sources for costeffective predictions.This is the first application of a MISO approach to conduct common computational materials science calculations.
Our first study involved using the Rosenbrock function as a test of our optimization methods, since it is a challenging nonlinear, shallow-basin problem.The Rosenbrock benchmarks shown in Fig. 2  For the Rosenbrock test, the improvement over EGO was 80% for each of the three MISO models we tested.In regards to hyperparameter optimization, we find that when the model includes hyperparameters that capture the interplay between information sources (as in the case of MGP and ICM), it is necessary to include data across all information sources to adequately learn these parameters.The main improvement of our new model, PCM, over that of ICM and MGP can be seen in the results for the 99.9th percentile.This improvement also comes with the considerable advantage that we no longer need additional inter-IS hyperparameters in this approach.As Bayesian approaches are known to be capable of optimizing noisy functions, we find that when the information sources correlate well but are inherently noisy, MISO approaches continue to outperform EGO, with the PCM surrogate model remaining the best.Finally, looking at Fig. 2, when we consider a large combinatorial space for the optimization, the MISO models perform markedly better (over EGO) than when we consider a smaller combinatorial space.
Turning towards applications of this approach to computational materials sciences, our aim is to understand the effects of including different information sources on cost-effectiveness.Performing a geometry optimization of CO using Bayesian optimization is a prime example of a common computational chemistry calculation.Here, our information sources were different DFT functionals and basis sets.This is a representative real-life issue since these sources differ greatly in cost (certainly by over an order of magnitude).Finding that we have a close-to-unity Pearson correlation coefficient between these information sources, we anticipated that the various MISO methods would perform better than EGO.And, indeed, Table 2 shows that both the MISO surrogate models (ICM and PCM) outperform a standard EGO approach by, on average, 52% in cost.Further, in the case of the 99.9th percentile, MultiTaskKG and PearsonKG outperform EGO by 71% and 76%, respectively.Commonly, local optimizers are used when performing DFT geometry optimizations; however, at times, the desire to find optimal molecular conformations/packing is desired instead.This is a global optimization problem which involves considerations of optimally packing molecules via translational and rotational changes, subsequently followed by a geometry optimization.This work shows that it is possible to use MISO methods to deploy cheap sources of information for this search space (such as Hartree-Fock) with more expensive/accurate functionals for the final calculations, greatly reducing the overall time required to find this optimal packing.
Finally, in a challenging application related to the selection of solar cell materials, hybrid organic-inorganic perovskites, we looked at the cost-effectiveness of combining different information sources within our PAL code base.For these perovskite materials, it becomes even more apparent that the reliability of the cheaper information sources, i.e., IS l 4 0 , to represent the most trusted source, IS 0 , comes under scrutiny.The results in Fig. 3a show a slight benefit in using PearsonKG, corroborated in Table 4 by the reduced mean cost to reach an optimal solution.However, this benefit is rapidly reduced as the information source changes to noisier alternatives.We find the nature of the information sources is an important factor when considering a MISO approach.This effect is clear in Fig. 4, which compares information sources that choose different ways to represent both the solvation of the lead salt and the level of DFT theory used.Choices like Hybrid-1 and GGA-1, Fig. 4 Comparison of all the HOIP information sources sorted such that the results using the cheap GGA-1 function is monotonically increasing.Data from GGA-1 and the expensive Hybrid-1 functional (where the ''-1'' indicates a single solvent) can be seen to correlate well.In contrast however, information sources that involved more solvents (GGA-3 and which share a common number of solvent molecules (=1) model one another much better (i.e., are better correlated) than the alternatives (GGA-3 and GGA-5) in which the number of solvent molecules (and hence the inherent noise) varies.In the latter situation, we find no benefit to using a MISO approach (Fig. 3b and c).
In summary, we have developed a new surrogate model, PCM, and shown how it performs at least as well, and often better, than ICM (where ICM can be seen as the common ''go to'' model when it comes to coregionalization) for several archetypal test cases in the computational materials sciences.Importantly, the new PCM model does not involve any additional hyperparameters.Further, we show how the acquisition function, csKG, can be implemented in combination with these surrogate models.This combination opens the door for computational studies to incorporate any number of data streams of varying expense in a cost-effective way.This method allows the user to exploit the availability of less accurate, lower cost, alternative sources of information.We find that the best approach to take depends heavily on the correlation between information sources and the surrogate model that defines the potential.When an information source possesses the same GPR kernel (K x ) with similar/identical hyperparameters, as in the Rosenbrock benchmark, CO geometry optimization, and the HOIP benchmark comparing Hybrid-1 versus GGA-1, using csKG with the PCM surrogate method converges significantly faster towards a global optimum.In cases where the hyperparameters differ greatly, as exemplified by GGA-5 versus GGA-3, the best course of action appears to reside in using a traditional EGO approach.Finally, we have discovered that, when it is desirable to maximize an objective with an expensive DFT functional, a MISO approach using the PCM surrogate model and a more costeffective functional can greatly reduce the total cost.

The intrinsic coregionalization model
There are many methods of coregionalization, several of which are outlined in a review article by Alvarez et al. 42 The intrinsic coregionalization model (ICM) approach defines a coregionalization matrix (K s ) such that K = K s # K x (where # is the Kronecker product and K x is a user-defined kernel). 42The problem then arises how best to define K s .One method involves simply allowing K s = I, in which the hyperparameters of K x , parameterized against all IS, will capture the correlation between IS. 43,44 Another method ensures a PSD matrix by having K s = ES T SE in which E is a diagonal matrix of scalars with dimension M (the number of IS), and S is an upper triangular matrix. 45It should be noted that there exists an equivalent expression, K s = ELL T E, in which L is a lower triangular matrix.In general, the scalar matrix is not particularly common and, in most literature sources, we simply find that K s is written as LL T . 21

The Pearson-r coregionalization model
The Pearson-r coregionalization model (PCM) was developed based on the idea that the coregionalization matrix should capture the correlation between IS. 22 From this, the coregionalization matrix was not learned, but dynamically generated from the Pearson-r correlation coefficients (r) of sampled data. 28Specifically, each component of the coregionalization matrix was made using eqn (2), to which eqn (3) was calculated (r was only calculated for data that had been sampled at all IS using the python package, SciPy). 46This approach eradicates the need for additional hyperparameters (ontop of those in K x ), allowing for a more scalable model.(3)

Analytical model
The Rosenbrock function, also known as the ''banana function,'' is a well known, well studied and challenging test case for global optimization. 31Further, a noisy alternate form has already been developed by Lam et al. 47 As this function was used by Poloczek et al. 26 to benchmark misoKG and MGP, we use it here to compare our new approach to Poloczek's misoKG.The function itself is shown in eqn (4), where a, b, c A R.
In order to use the Rosenbrock function with MISO surrogate models, we need several IS.Accordingly, we define two IS as the Rosenbrock function, plus some additional varied term.The two IS used are shown in eqn (5): We can replicate the work shown in Poloczek et al. 26 by setting a = 1.0, b = 100.0,c = À456.3,u = 0, v = 0.1, and a cost ratio of 1000 : 1.The domain for x 2 D is given by D i 2 R 2 ji o N and bounded within the origin-centered-square of [À2,2] 2 .N, the number of discrete points of our domain D, was chosen to be either 250, 500, or 1000, in order to capture the effects of the search-space on the models.To probe the limits of our model, we expanded upon this benchmark by considering a significantly noisier secondary information source and allowing v = 10.0.Note that the global minimum of the Rosenbrock function (offset from 0) is set to c, in this case À456.3.

CO model
In order to model the CO molecule using multiple IS, a simple zero-mean, 5 2 Mate ´rn kernel was chosen. 48From this, the expensive source, IS 0 , was chosen to be a double-hybrid approach using the B2PLYP functional 32 and Def2-TZVP basis set. 33The cheaper source, IS 1 , was taken to be a corrected Hartree-Fock approach. 34All DFT calculations were performed using the Orca software. 49Since CO consists of only two atoms, x was taken as the interatomic distance (in Å), bound between [0.5, 2.0], in intervals of 0.001 Å.Five random sampled points were taken to initially train hyperparameters.

HOIP model
The probabilistic model for the HOIP system is the same as that outlined in Herbol et al. 9 This is illustrated in eqn ( 6), with the mean given in eqn (7) and the covariance matrix in eqn (8).
For S 0 (S x ,S x 0 ) we chose the well known 5 2 Mate ´rn kernel, which provides the covariance between measurements made at any two points. 48 the original PAL work, data points were generated to represent the intermolecular binding energy of three solvent molecules to a HOIP lead salt modeled using an ab initio GGA DFT functional.It would also be possible to generate DFT data corresponding to other situations in which a different number of solvent molecules was considered, given that a full shell surrounding the lead salt involves around 25 molecules. 50oreover, we can use various choices of level of DFT theory which may differ significantly in accuracy and cost.Overall sampling of the solvents around the salt were performed using Packmol, 51 LAMMPS, 52 and the OPLS-AA force field. 53Final geometry optimizations and energy calculations were made using Orca. 49e define information sources, for example, as GGA-3, where N3 indicates the case in which three solvent molecules are considered to be bound to the lead salt, and R2 represents the level of theory indexed within the PAL codebase (the label ''2'' corresponding to the GGA functional B97-D3 with a triple-z basis set).Simply changing the number of solvents bound to the lead salt, or the level of theory used, will give rise to a different IS label.In that regard, we have explored the following information sources for benchmarking purposes: GGA-5 -intermolecular binding energy of five solvent molecules to a perovskite lead salt using the GGA functional B97-D3 (179 data points).
GGA-3 -intermolecular binding energy of three solvent molecules to a perovskite lead salt using the GGA functional B97-D3 (240 data points).
GGA-1 -intermolecular binding energy of one solvent molecule to a perovskite lead salt using the GGA functional B97-D3 (480 data points).
Hybrid-1 -intermolecular binding energy of one solvent molecule to a perovskite lead salt using the hybrid functional PW6B95 (480 data points).
When using multiple information sources from among these options, we chose only those that exist across all IS.Thus, when we used two information sources as a test case, labeling them IS 0 and IS 1 , we chose Hybrid-1 and GGA-1, for which a total of 480 data points exist.However, when IS 0 and IS 1 are GGA-3 and GGA-1, respectively, only the 240 intersecting points in the data sets would be used.

Fig. 1
Fig.1Illustration depicting how one IS (shown as the top, more coarsely defined contour plot) can be used to learn about another more finely defined contour plot (shown as the bottom plot).Samples can be taken from either the coarse IS (green dots) or the fine IS (gold dots) source.If the ISs correlate well, samples taken on the top can be used to learn about the bottom, demonstrated here as transparent gold dots.

Table 1
Benchmarking the performance of three MISO statistical models for the minimization of the Rosenbrock function.This table shows the significant advantage of using the PearsonKG approach over other MISO approaches, most readily apparent in the 99.% 9th%-tile.Values reported in this table indicate the cost taken to be below À455.3, the lowest function evaluation across all methods (with the global minimum occurring at À456.3).All values in this table have been rounded to the nearest 1000 (the cost of IS 0 ) and then scaled by 1000 to be more easily compared

Table 2
Benchmarking four statistical models to minimize the total energy of a CO molecule.Values indicate the cost taken to be within 0.6 kcal mol À1 of the lowest function evaluation across all methods.The cost ratio used was approximately 11.5 to 3.5 (based on average computational time, in seconds, for each single point DFT calculation with the respective levels of theory), and as such can be seen as the total time to geometry optimization.All values in this table have been rounded to the nearest 10 and then scaled by 10 so as to be more easily compared.It is clear that the PearsonKG and MultiTaskKG approaches converge to the ground state geometry in significantly less time than the ''industry standard'' EGO This journal is © The Royal Society of Chemistry 2020

Table 4
Benchmarking MISO surrogate models (ICM and PCM) against EGO for the minimization of the HOIP objectives.Values reported in the table indicate the cost, taken to be within 0.6 kcal mol À1 of the lowest function evaluation across all methods.Lower cost indicates better performing models.All values in this table have been rounded to the nearest 10 and then scaled by 10 so as to be more easily compared