Gaussian Processes for Finite Size Extrapolation of Many-Body Simulations

Key to being able to accurately model the properties of realistic materials is being able to predict their properties in the thermodynamic limit. Nevertheless, because most many-body electronic structure methods scale as a high-order polynomial, or even exponentially, with system size, directly simulating large systems in their thermodynamic limit rapidly becomes computationally intractable. As a result, researchers typically estimate the properties of large systems that approach the thermodynamic limit by extrapolating the properties of smaller, computationally-accessible systems based on relatively simple scaling expressions. In this work, we employ Gaussian processes to more accurately and efficiently extrapolate many-body simulations to their thermodynamic limit. We train our Gaussian processes on Smooth Overlap of Atomic Positions (SOAP) descriptors to extrapolate the energies of one-dimensional hydrogen chains obtained using two high-accuracy many-body methods: Coupled Cluster theory and Auxiliary Field Quantum Monte Carlo (AFQMC). In so doing, we show that Gaussian processes trained on relatively short, 10-30-atom chains can predict the energies of both homogeneous and inhomogeneous hydrogen chains in their thermodynamic limit with sub-milliHartree accuracy. Unlike standard scaling expressions, our GPR-based approach is highly generalizable given representative training data and is not dependent on systems' geometries or dimensionality. This work highlights the potential for machine learning to correct for the finite size effects that routinely complicate the interpretation of finite size many-body simulations.


I. INTRODUCTION
Over the past few decades, ab initio electronic structure methods have transformed our ability to design materials by enabling researchers to predict the macroscopic and emergent behavior of solids from a basic knowledge of their constituent atoms.Researchers can now routinely model the electronic and geometric properties of systems ranging from quantum materials to heterogeneous catalysts with -or very near -chemical accuracy.However, the accuracy that accompanies many-body electronic structure methods such as Coupled Cluster (CC) theory, Quantum Monte Carlo (QMC), and many-body perturbation theories often comes at a steep cost: these methods typically scale as a high degree polynomial with system size.For example, Coupled Cluster Singles, Doubles, and Perturbative Triples [CCSD(T)] conventionally scales as O(N 3 M 4 ), where N is the number of electrons and M is the size of the basis set, while Auxiliary Field Quantum Monte Carlo (AFQMC) typically scales as O(N 2 M 2 +M 2 N). 1 In contrast, mean field methods such as Density Functional Theory (DFT) scale as O(N 2 log N) 2,3 or O(N), 4 when locality is a good approximation, but are only predictive when the degree of electron correlation is mild.Historically, the comparatively steep scaling of many-body methods has thwarted their direct application to solids with large unit and/or supercells, limiting their use to systems with just tens to, poa) Electronic mail: edgar_landinez_borda@brown.edub) Electronic mail: brenda_rubenstein@brown.edu tentially, hundreds of atoms.However, such smaller, more computationally-accessible systems cannot manifest the same long-range correlations as are present in larger, more realistic solids, and can exhibit spurious boundary effects that confound their interpretation.Indeed, given the remarkable accuracy of many modern electronic structure methods, these finite size errors are often the largest sources of error in many calculations of solids. 5,6This leads to a long-standing conundrum: if many-body methods may only be directly applied to smaller, finite systems, how can they be leveraged to predict the properties of larger, more realistic solids?
To increase the feasibility of many-body methods for the prediction of the properties of solids in their infinitesize, "thermodynamic limit" researchers have developed approaches that correct results for smaller systems to predict the properties of larger systems.Such so-called finite size corrections consist of two main contributions: one-body and twobody corrections, which ameliorate the one-and two-body contributions to the total energy, respectively.One-body finite size errors typically stem from shell-filling effects that lead to a mis-estimation of the kinetic energy 6,7 and can therefore be corrected by a judicious averaging over k-points. 8For example, in mean field theories, integrating over many points in the first Brillouin zone can be circumvented by instead approximating quantities using mean-value points known as Balderschi points. 9While many-body methods such as QMC methods need to integrate over the full simulation supercell, not just one point, twist averaging 10 provides a means of averaging over a set of angles (offset vectors) on the Brillouin zone of the supercell that results in a rapid convergence of the one-body effects. 6,10In contrast, two-body finite size effects stem from errors in the Coulomb and exchange-correlation interactions and are more challenging to correct.These effects can be alleviated by introducing modified versions of these interactions, such as model periodic Coulomb corrections. 6,7An alternative approach for correcting both one-and two-body finite size effects is to determine finite size corrections using methods that scale more gracefully with system size, such as Density Functional Theory (DFT). 6,7,11Such methods are used to estimate the differences in energies between smaller and larger systems, and then these differences are added to the smaller-sized many-body calculations.One such DFT-based approach is the Kwee, Zhang, Krakauer correction (KZK). 12hile such corrections are now widely applied to materials, they inherently lack the accuracy that would be possible if many-body corrections that take strong correlation into account were applied.
Even though these one-and two-body corrections markedly reduce finite-size errors, extrapolations to the thermodynamic limit are often still made to reduce any remaining errors.The simplest approach for performing these extrapolations is to fit many-body results obtained at smaller system sizes (e.g., 2x2 or 3x3 supercells) to functional forms that enable extrapolation to larger system sizes. 7,11Nevertheless, it is often unclear which functional form should be employed since it can vary with the geometry, dimensionality, and electronic phase of the material. 5,8This is especially true for systems with atypical geometries and boundary conditions.4][15] When a system's correlation energy converges slowly, the number of points necessary for accurate fitting can exceed computational constraints, limiting the overall utility of such extrapolations and the results they yield. 8,16ne potentially promising approach for estimating manybody corrections that can reduce this computational expense is machine learning.8][19] In the context of condensed matter physics, machine learning has been employed to accelerate the prediction and discovery of new materials based upon the properties of known materials 20 as well as to learn the presence of certain phases based upon their known signatures. 21,224][25][26][27][28] To approach the problem of determining accurate, many-body finite size corrections, one can analogously imagine using data from smaller system sizes to train machine learning algorithms to predict the properties of systems of larger sizes.An early such work used energies and densities from the density matrix renormalization group to learn the DFT kinetic energy functional of hydrogen chains in the thermodynamic limit. 29More recently, while this work was being prepared, Gaussian process regression techniques were shown to be able to successfully learn corrections to coupled cluster calculations in k-space.More specifically, Mihm et al. 8,30 employed the transfer structure factor to quantify the finite size effects present in coupled cluster theories' correlation energy.They then innovatively bypassed directly computing the structure factor for G values approaching zero (i.e., in the thermodynamic limit) by flexibly representing the structure factor using Gaussian Process Regression.
In this work, we leverage Gaussian Process Regression (GPR) 31 to learn finite size corrections in real-space to homogeneous (one-dimensional) and inhomogeneous (twodimensional) hydrogen chains modeled using the firstprinciples, many-body methods Coupled Cluster (CC) Theory and Auxiliary Field Quantum Monte Carlo (AFQMC).Kernel methods like Gaussian processes 31 are advantageous because they are not parametric and make use of Bayesian inference that can come at a lower O(N 3 t ) (where N t is the size of the training set) cost than more complicated parametric methods such as neural networks that scale with the number of layers employed. 17Gaussian processes have also been shown to make equally, if not more, accurate predictions than neural networks when less training data is available, which is an important consideration when training is to be performed on data generated using relatively expensive electronic structure calculations. 32We use Gaussian processes to first predict the energies of one-dimensional, homogeneous hydrogen chains of varying lengths using atomic environment descriptors that enable us to incorporate information regarding the geometry and electronic density of each atom and its neighbors.Importantly, even though machine learning methods are most accurate for interpolation, we demonstrate that training our models on the energies of one-dimensional hydrogen chains containing 10-30 atoms enables us to predict (extrapolate) the energies of chains of more than 100 atoms, nearing the thermodynamic limit, with sub-milliHartree accuracy.To contextualize the accuracy of our methods, we compare the accuracy of our predictions to that of polynomial fits to largersized systems, the so-called "subtraction trick," 33 and other alternative regression methods.Finally, to demonstrate the generalizability and robustness of our approach, we show that our technique can readily be adapted to also extrapolate the energies of heterogeneous chains of hydrogen dimers, which possess more free parameters, to their thermodynamic limit.This work thus illustrates that machine learning is a relatively cheap, yet accurate means of correcting for finite size effects in many-body simulations that can potentially address many of the challenges the many-body modeling community faces predicting the properties of solids in the thermodynamic limit.
In the spirit of a Faraday Discussion, in Section II, we begin with a discussion of the emerging synergies between machine learning techniques and stochastic electronic structure methods.We then describe the machine learning methods, descriptors, and electronic structure techniques we employ in our finite-size extrapolation research in Section III.We next present our primary results demonstrating our technique's ability to accurately correct for finite size errors in Section IV.We conclude by discussing the relative merits and potential applications of our algorithm in Sections V and VI.

II. MACHINE LEARNING IN STOCHASTIC ELECTRONIC STRUCTURE
Over the past decade, an increasing amount of research has shown that stochastic electronic structure and machine learning methods can form a very fruitful partnership that both accelerates and extends the capabilities of stochastic methods.Because stochastic electronic structure methods are often more expensive than other common electronic structure methods such as Density Functional Theory, machine learning techniques hold the promise of making stochastic electronic structure techniques less costly.At the same time, the high accuracy of most stochastic electronic structure techniques like Diffusion, 34,35 Full Configuration Interaction, 36 and Auxiliary Field 37,38 Quantum Monte Carlo methods can provide ML techniques with high-quality data that can be used to correct less accurate predictions.
0][41] QMC energies and forces calculated for representative configurations are used to train a variety of different neural networks, e.g., Behler-Parrinello Neural Networks, 42 or other architectures, which are in turn used to predict the energies and forces for other configurations, accelerating geometry relaxation and/or ab initio molecular dynamics simulations. 43,44For instance, in recent work, Diffusion Monte Carlo energies and forces were used to generate a force field using a hierarchical ∆-machine learning scheme based upon the Deep Potential Molecular Dynamics (DPMD) framework 45 that was able to successfully uncover a new phase of hydrogen. 27Since QMC has historically met challenges calculating forces, 46 recent work has also exploited machine learning architectures to learn force fields from energy data alone. 39Other ways to further reduce the cost of QMC data generation for training itself employ either ∆-ML 47 or transfer learning. 48These techniques first learn potentials and forces, using data from less accurate, but less costly theories and then correct those force fields by either adding a machine learned correction or updating the less accurate force field with select higher accuracy information.Both methods capitalize on the fact that less accurate theories can often reproduce much of the correct physical behavior of a system, meaning that high accuracy methods are effectively only needed to correct specific phenomena or regions of the potential energy surface.Further opportunities lie in better harnessing the statistical nature of stochastic methods to more efficiently train such force fields. 49Overall, QMC-quality force fields open up the grand possibilities of studying dynamics in large molecular or solid state systems with relatively little overhead, making QMC dynamics a practical reality.
Stochastic methods and machine learning techniques have also been fruitfully paired to develop new neural networkbased variational ansatze.The Variational Principle, which states that the ground state wave function of a system can best be approximated by varying the parameters and forms of trial wave functions to minimize the energy of the system, has long been used to produce wave function ansatze in computational quantum chemistry and physics.Often, such ansatze have been optimized using QMC (i.e., Variational Monte Carlo methods) and used either on their own or as starting points for projection-based QMC techniques. 6,50Historically, the forms of these variational ansatze have been specified based upon knowledge of the chemistry/physics they ultimately aim to describe (e.g., Gutzwiller 51 or pairing 52 wave functions) or confidence that their form is generalizable and expressive enough to describe the phenomenon under study (e.g., backflow wave functions). 53) Specifying the forms of trial wave functions based upon the physics expected can potentially lead to circular logic in which the physics that is expected to be seen is incorporated into a variational wave function form that then recovers that physics.
Recently, machine learning has been employed to overcome this limitation by providing a means of creating highly expressive variational wave functions.One of the most popular means of achieving this has been to use deep neural networks to specify a given variational wave function and then to optimize that neural network using the energy and/or variance as a loss function. 54Examples of such variational neural networks include DeepQMC, 55 FermiNet 23 , and PauliNet, 24 all of which have shown promise determining the ground states of challenging chemical systems.PauliNet and FermiNet, for example, use deep neural networks to learn a parameterized form of the Jastrow factor and backflow functions and maintain antisymmetry using Slater determinants.Unlike traditional methods that use single-particle orbitals, FermiNet employs functions invariant under two-electron permutations and incorporates back-flow-like transformations for enhanced accuracy. 56n alternative approach to combining the expression and optimization of wave functions with machine learning has been neural network quantum states. 57One promising form of neural network quantum states established by Carleo and Troyer are Restricted Boltzmann Machines, which implement a representation of the wave function through hidden and visible layers. 57The Boltzmann distribution models the probabilities associated with different configurations of visible and hidden nodes based on the energy; lower energies are favored to accommodate the variational principle which guides the optimization of wave function parameters.These wave functions can then be extrapolated to larger systems by reusing the learned features of the wave function to initialize a machine learning model applied to a similar, but larger system. 58,59This process of transferring the learning done for one type of problem to a related, but different problem makes seemingly out-of-reach problems, such as the thermodynamic limit, computationally feasible.Success with the transversefield Ising model, 57 Heisenberg model, 57 and molecules 60 has been demonstrated.Akin to the use of GPR in this work, Gaussian Processes have also been used to specify wave functions called Gaussian Process States. 61These wave functions are expressed as the exponential of a GP estimator and thus, as Gaussian processes more generally, are highly generalizable and can provide critical information about uncertainties.Such machine learning-based wave functions offer a potential means of achieving unprecedented levels of accuracy without the need for typically more expensive projection techniques.
Given these successes combining stochastic methods with machine learning approaches -and the many more we have not been able to discuss due to space constraints -here, we focus on the possibility of using machine learning methods to extend QMC's capabilities in a different way: by facilitating the extrapolation of QMC results to the thermodynamic limit.

A. Gaussian Approximation Potentials (GAP)
In this work, we employ Gaussian Process Regression (GPR) to predict finite size corrections for discrete hydrogen chains.We have focused on GPR 62 because it has previously been shown to yield high accuracy results with less training data than comparable methods. 32This is an especially desirable property when one is interested in performing regressions on data obtained from comparatively costly many-body simulations, since computational expense practically limits how much reference data can reasonably be collected.The Bayesian nature of GPR also makes it possible to compute the variance of its predictions, which greatly facilitates the interpretation of its results. 63For these reasons, we employ a GPR-based approach which is very similar in flavor to the Gaussian Approximation Potential (GAP) approach. 64e first summarize our approach at a high level and then provide more details in subsequent subsections.
A GPR is a random process which takes input vectors x i and maps them to random variables y = f (x) with a multivariate, normal joint distribution with covariance (K) 31,63 p( f (x)) ∼ N(µ, K). ( The target function f (x i ) (which yields the energy in this work) is characterized by the expectation value of the distribution µ = ⟨ f (x i )⟩.Like GAP, we use atomic environment descriptors 65 as input features, x i (which are vectors containing the atomic descriptors of a structure i).These capture the main features of the electron density of an atom and its neighborhood (its atomic environment) to represent the electronic characteristics of the atoms.The covariance determines how the features are correlated and is specified by the kernel function.In kernel methods such as GPR, 31,66 input features x i are mapped to a nonlinear, high-dimensional space through the function φ (x i ).Correlations between descriptors that represent different atomic structures are subsequently represented by taking their inner product in this nonlinear space to yield the kernel Nevertheless, the kernel can be defined in a more arbitrary way as long as it satisfies the properties of a covariance matrix. 66In order to make predictions, Bayesian inference can be used to compute new values of the target function. 31,63This is done by extending the distribution to unobserved data, y * .
The idea is to generate a distribution based on the observed data (y, X) using unseen data X * to generate the prediction y * with the corresponding joint distribution: where µ and µ * denote the means over the training and unobserved data, respectively, and K, K * , and K * * represent the covariances among the training data, training and unobserved data, and unobserved data, respectively.Based upon Bayes' Rule, the posterior distribution is Gaussian since the joint distribution is Gaussian.The posterior distribution can be expressed as while the predicted mean and variance for an unobserved point may be expressed as and The functional form of the prediction is equivalent to that produced by Kernel Ridge regression, 66 and can be written in the same way In this equation, the weights, α, 66 are given by Equation 7 can be written in terms of the coefficients given by Equation 8 where the α i are vectors of the coefficients obtained from the regression and K(x i , x * ) is the kernel between the unseen data, x * , and the training data, x i .Kernel methods such as GPR can thus be used to predict the total energy, E * total , using the equation The kernels can be tuned to optimize the prediction of the Gaussian process through the selection of their free parameters, known as hyper-parameters.The most common method of optimizing the posterior is the log-likelihood maximization method.In this work, we use three-way hold-out and log-likelihood maximization over the hyper-parameters.

B. Atomic environment descriptors and regression model
In contrast with physics-based approaches for describing a system, machine learning models are often more expressive, meaning that a single model has the potential to describe many different systems.One way to constrain the predictions of a machine learning model is to include prior physical information in the surrogate model.This can be achieved by making the model invariant to symmetries, including translational, rotational, or permutation symmetries, or constraints, in order to suppress spurious correlations.These symmetries or constraints are usually incorporated into the model in two ways: explicitly integrating these symmetries into the regression algorithm or designing features that are invariant to the symmetry transformations.
Here, we incorporate symmetries via the latter approach using Smooth Overlap of Atomic Positions (SOAP) descriptors that are invariant to rotation and translation.These atomic environment descriptors represent the electron density at some point r by the superposition of the Gaussian densities of atoms with the same atomic number Z in the neighborhood of that point where R i is the position of an atom, i, in the neighborhood and σ 2 is the variance of the Gaussian.This density may be expanded in terms of radial and angular basis functions where the g n (r) are the n radial basis functions that can be expressed in terms of polynomials or atomic orbitals and the Y lm correspond to the spherical harmonic functions.The c Z nlm coefficients of the expansion can be computed by integrating over the density In this work, we use the Dscribe library 67 to obtain the descriptors.This library implements SOAP descriptors using a partial power spectrum that only includes real spherical harmonics.Because the density depends on the square of the distances between points, it is already invariant to translation.A descriptor vector, p, is formed from elements of the power spectrum where n and n ′ ≤ n max run over the radial basis functions and l ≤ l max runs over the spherical harmonics.n max and l max define the maximum number of radial and angular functions in which the density in Equation 12is expanded, respectively.Z 1 and Z 2 are the atomic numbers of the species.The resulting power spectra are rotationally-and permutationally-invariant by construction.The original SOAP descriptors compare the local atomic environments using a kernel that is the dot product of the normalized power spectra between different configurations This kernel takes the overlap of two atomic environments.However, other kernels employ different ways of measuring the similarity of the environments that may lead to better results.One of the most common kernels because of its versatility and robustness is the Radial Basis Function (RBF) or Squared Exponential (SE) kernel where d(p, p ′ ) is the Euclidean distance, v 2 is a tunable amplitude, and l 2 is the global weight or length scale of the features.We choose to use the latter kernel throughout this work because of its flexibility and robustness for comparing features.

C. Comparing environments with global descriptors
The descriptor vector, p, of an atomic structure depends on the number of atoms of each species and is created by concatenating the different combinations of atomic species, each with n radial basis functions and a maximum angular number l max . 65As a result, structures with different numbers of atoms, M, N, have different numbers of descriptors.One way to deal with descriptor vectors of differing lengths is to pad the feature vectors with zeros such that their dimensions match those of the descriptor vectors with the largest number of features in the samples.A similar approach involves padding the dummy (missing) features with values selected to decrease the biases the missing features would otherwise introduce. 68 alternative that can reduce bias is the use of global descriptors.These descriptors characterize the whole structure, i.e., the features depend on all of the atoms, rendering the number of features independent of the number of atoms in the structure.However, such an approach may diminish the quality of the kernel, since the descriptors may not have enough resolution to distinguish subtle differences between structures because of their global nature.A very simple and intuitive method to make the kernel global is to construct an "average kernel:" 69 Such a kernel recursively compares the features of the atoms i and j in structures A and B, respectively, using the kernel, C, and averaging over its corresponding numbers of atoms N and M.This approach is equivalent to averaging the features of all of the atoms of each configuration and comparing them with the kernel C, which amounts to making the descriptors global The RBF kernel with the global descriptors then becomes It is important to note that, when global descriptors are employed, the total energy is no longer the simple sum of local contributions.Now, it explicitly depends on quantities that interrelate features of the whole structure.This overall description of atomic structures implicitly removes the need for descriptors that capture long-range order.Nonetheless, the resolution of the features still needs to be high enough to capture small structural changes, as mentioned earlier.The resolution of the kernel can be improved by weighting each global feature by some characteristic length, l i , according to Equation 19.This improves kernel performance by allowing fine-tuning of the parameters, but at the cost of adding more complexity to the model.In the following, we employ this combination of SOAP-averaged descriptors and the RBF kernel on linear hydrogen chains, which serve as an interesting and challenging benchmark.

IV. RESULTS
A. One-Dimensional, Homogeneous Hydrogen Chains

Coupled Cluster and AFQMC Database of Homogenous Hydrogen Chain Energies
To analyze the ability of our GPRs to predict the energies of solids in their thermodynamic limit, we first attempt to predict the finite size effects of linear hydrogen chains (LHC) stretched homogeneously, i.e., with their atoms equally-spaced, and with open boundary conditions.][72][73] As illustrated in Figure 1, Unrestricted Hartree-Fock theory (UHF) underbinds the hydrogen atoms, while Unrestricted Coupled Cluster Theory (UCCSD(T)) and AFQMC are able to relatively accurately reproduce the chains' energies near their equilibrium bond lengths, but can struggle to capture their energies at longer bond lengths closer to the dissociation limit. 1,70is system furthermore exhibits a metal-to-insulator transition when stretched homogeneously, which occurs at 1.8 bohr. 1 This transition is of second order, meaning that it is continuous with respect to the energy, but can be characterized by the polarization or spin correlation functions.70 Dimerization of pairs of hydrogen atoms in the chains can be observed by looking at the electron density profile along the chains, as in Figure 2. The maxima correspond to the nuclear positions, while the deep minima indicative of dimerization may be observed between pairs of atoms at all of the chain lengths depicted.Methods capable of predicting the energies as a function of bond length must implicitly be able to predict energies across these transitions.
In order to generate enough data for training, we created a database of the energies of hydrogen chains at varying bond lengths using UHF and two many-body methods -UCCSD(T) 74 and AFQMC 75 -in the minimal STO-6G basis.UCCSD(T) has long been considered the gold standard for accuracy for quantum chemistry calculations, 74,76 and is seeing an increasing number of applications to solids. 77,78FQMC 75 is a second-quantized QMC method that, despite its typical use of the phaseless approximation, 79 has been shown to achieve chemical accuracy in systems ranging from small molecules, [80][81][82] to complexes, 82,83 to strongly correlated solids. 84,85As a check on our databases, we produced and extended the benchmarks of Motta et al. 1 with sub-milliHartree accuracy (see Figure 1).To perform our UHF and UCCSD(T) calculations, we use the open source software PySCF. 86For the AFQMC calculations, we use the high-performance implementation of AFQMC in QMCPACK. 87Within QMCPACK, we employ UHF wave functions produced by PySCF as trial wave functions and perform calculations with a time step of 0.005, 1000 walkers, a Cholesky decomposition threshold of 10 −8 , and 10 4 steps in the phaseless approximation. 75Energies FIG. 2. Electron density as a function of atomic position (x) for 10atom hydrogen chains.Each curve depicts the electron density profile when the chain is stretched homogeneously at the bond lengths indicated in the legend.The change in the distance between and depth of adjacent local minima as the bond distance is increased reflects the onset of dimerization.The densities depicted here were computed using Full-Configuration Interaction 74 and the y-axis was shifted so that the profiles for all bond lengths could be clearly seen.
are computed with the hybrid estimator.Using all of these methods, we compute 250 points for each 10-60-atom chain with bond lengths ranging from 1 to 3.65 bohr.For chains of 70 to 100 atoms, we compute 40 points within the same range of bond lengths in order to conserve computational resources.The 10-30 atom data was used for training, while chains with larger numbers of atoms were used for benchmarking and analysis.

Energy Predictions Using Gaussian Process Regression
We use the smallest of our hydrogen chains of 10-30 atoms to train and test the GP regressions, which corresponds to 750 total samples.Samples were uniformly mixed by shuffling the data points at all bond lengths for each chain of a given size in the training set.This is to avoid training with an imbalanced data set.SOAP descriptors were constructed by using six GTOs as radial basis functions with a sigma of 1 bohr and six tesseral spherical harmonics as angular functions to build the atomic descriptors for all sizes and bond lengths.A cutoff radius that defines the extent of the atomic environment was set to 7 bohr for all chain sizes and bond lengths.This cutoff radius guarantees that the local environment of an atom consists of a maximum of 14 atoms at the shortest bond lengths studied and a minimum of 2 atoms at the longest bond lengths studied.It may be anticipated that the local atomic environment descriptors become linearly dependent when they have a large cutoff radius and are placed on bulk atoms that repeat throughout the chains.Nonetheless, descriptors placed on the edge atoms manifest asymmetries that reflect the finite extent of the chains.
The descriptors are first generated for all of the atoms of each chain in the database.A global descriptor is then obtained by averaging each descriptor over the atoms within each chain.Finally, feature selection is carried out by obtaining leverage scores from a CUR decomposition. 88The leverage scores are ordered in descending order and features are taken until 97% of the leverage score is accounted for.To perform the CUR decomposition, a Singular Value Decomposition (SVD) 88 is conducted given a singular value threshold that defines the rank of the decomposition.For this purpose, we used optimal hard thresholding, 89 which makes an optimal choice based on the dimensions and the estimated noise in the features or global descriptors matrix.We don't orthogonalize the features or use covariate principal coordinate analysis to improve our current feature selection, as further discussed in Section V. 90 A Gaussian kernel with multiple length scales allows more sensitivity to global descriptors without greatly increasing the complexity of the model.We used the maximum likelihood 31 method to optimize the kernel hyper-parameters.We employed up to 500 configurations for training and 250 for validation.

Accuracy of GPR Predictions
After training our GPRs on the UCCSD(T) and AFQMC energies of shorter hydrogen chains, we are able to predict the energies per atom of chains with larger numbers of hydrogen atoms over the same range of bond lengths in the database with reasonable accuracy.We predict the energies per atom using the mean and variance of the posterior distribution.
Figure 3 depicts the differences between the energies computed with the UCCSD(T) (left) and AFQMC (right) methods, and their respective GPR predictions.In both cases, the differences between the predictions and the calculated energies are less than 1 mHa.It is reassuring to note that the short chain length predictions are most accurate throughout the prediction interval, which is a consequence of training the Gaussian processes on short chains.Prediction errors grow with the lengths of the chains because the generalization error increases with system size.This is reflected in the larger confidence intervals that accompany the larger chain length predictions.Hydrogen chains have previously been observed to exhibit slower convergence at short bond lengths because their total chain lengths are not yet long enough to converge finite size effects that stem from long-range Coulomb interactions.This comparatively slow convergence is likely responsible for the larger error bars we observe at short bond lengths. 1 At bond lengths longer than 3 Bohr where dissociation begins to occur, the error is significantly smaller and expected to converge faster because chains with longer total lengths will more rapidly converge the long-range Coulomb interaction.
The quality and characteristics of the AFQMC-based GPR predictions are similar to that of the UCCSD(T)-based predictions.Higher accuracies are again observed for shorter chains and at larger bond lengths.The AFQMC-GPR differences are, FIG. 3. Energy differences between the calculated UCCSD(T) (left) and AFQMC (right) energies, and their respective E GPR predictions per atom for hydrogen chains of different lengths in mHa.The green dashed lines depict the bounds of 1 mHa energy differences.The shadows delineate 95% confidence intervals based on the predicted variance.The vertical dashed line denotes the bond length at which the metalinsulator transition occurs.6 however, noisier than the UCCSD(T) differences, which reflects the stochastic character of AFQMC.The AFQMC-GPR differences are, in general, smaller than the UCCSD(T)-GPR predictions, especially at intermediate bond lengths.Overall, the AFQMC predictions are slightly more accurate and homogeneous at all of the bond lengths studied, likely due to a larger consistency within the AFQMC data.

Extrapolation of Chain Energies to the Thermodynamic Limit
Given the sub-milliHartree accuracy of these predictions, we now turn to analyzing the performance of our GPR predictions for extrapolating the energies of very long, yet finite chains that approach the thermodynamic limit.In previous studies, 1 thermodynamic limit predictions were made by assuming the chain energies varied polynomially with N −1 , with orders ranging from 1 to 3 depending upon the convergence speed exhibited by the data. 1 To make use of such scaling laws, a polynomial must be fit to a large enough number of different chain sizes to capture the correct scaling behavior.To compare the performance of our GP regressions against this more conventional fitting procedure, we fit the energies of chains containing 10, 30, and 50 atoms, as was done in Reference 1.We contrasted the extrapolations produced by this polynomial fit with GPR results trained once across different bond lengths on chains of 10, 20, and 30 atoms.Indeed, the primary advantage of our method is that we can automat-ically predict the energy per atom of any chain by computing its global descriptor vector and using the posterior to predict its energy.As an added benefit, the confidence intervals based on the predicted variance provide an estimate of the uncertainty of the prediction, which is not available from typical polynomial regressions.
Figure 4 displays the convergence of the energy per atom to the thermodynamic limit for four representative bond lengths.The circles denote the UCCSD(T) calculations while the squares represent the GPR predictions at each size.As before, the shadows delineate 95% confidence intervals on the GPR calculations.The green dashed line denotes the polynomial regression at the given bond length and the red triangle represents the energy in the thermodynamic limit taken from Reference 1.The GPR prediction of the energy in the thermodynamic limit is made using a chain of 5000 hydrogen atoms.As an illustration of the speed of our regression technique, producing the descriptors for the 5000-atom chain took about 2 minutes on an Intel Core i7-8550U (Turbo 4.0 GHz, 4 Cores, 8 Threads) laptop.We note that the differences between the thermodynamic limit predictions made by the reference regression 1 and the polynomial regression performed on our dataset simply reflect the small differences between the two different databases.The GPR predictions are in good agreement with the reference and polynomial regressions, deviating most for bond lengths near the equilibrium bond length (around 1.8 bohr) where the convergence is less linear.Note that the energies converge one to two orders of magnitude more rapidly FIG. 4. Predictions of the energy per atom in the thermodynamic limit vs. N −1 based on UCCSD(T) results for hydrogen chains with bond lengths of 1.0, 1.4, 1.8, and 2.8 Bohr.Cyan circles denote UCCSD(T) calculations, while maroon squares denote the GPR predictions.The shadow depicts 95% confidence intervals, and the dashed lines depict the polynomial regression of second order at each bond length.The triangle represents the energy in the thermodynamic limit computed in Reference 1.The gap between points at small values of N −1 corresponds to chains between 100 and 5000 atoms, which are prohibitive to model even using less expensive theories.at larger bond lengths because the long-range Coulomb interaction is weaker at larger bond lengths, as described earlier.
Figure 5 similarly exhibits the convergence to the thermodynamic limit for the AFQMC database and its respective GPR predictions.One of the most noticeable differences relative to the UCCSD(T) calculations is that the AFQMC predictions seem more linear close to the equilibrium bond length.This means that the AFQMC calculations can more accurately resolve small, sub-milliHartree differences in the energies as a function of system size and therefore so can the AFQMC-based GPR.
The left panel of Figure 6 presents the energy of the hydrogen chains as a function of bond length directly calculated using UHF and UCCSD(T) for 100-atom chains, as well FIG. 5. Predictions of the energy per atom in the thermodynamic limit vs. N −1 based on AFQMC results for hydrogen chains with bond lengths of 1.0, 1.4, 1.8, and 2.8 Bohr.Cyan circles denote direct AFQMC calculations, while the maroon squares denote the GPR predictions.The shadow depicts 95% confidence intervals, and the dashed lines depict the polynomial regression of second order at each bond length.The triangle represents the energy in the thermodynamic limit computed in Reference 1.The gap between points at small values of N −1 corresponds to chains between 100 and 5000 atoms, which are too prohibitive to compute using even less expensive theories.Note that the TDL of Reference 1 (REF TDL) for the bond length of 2.8 Bohr was replaced by our TDL extrapolation using a polynomial regression because the reference value seemed to be in disagreement with the rest of the reference's data at that bond length.
as the reference 1 and GPR predictions in the thermodynamic limit.The energy differences between the N = 100 UCCSD(T), reference, and GPR predictions are hardly perceptible.The right panel of Figure 6 likewise presents the energy as a function of bond length for the largest, N = 100atom AFQMC calculations we were able to perform, in addition to the reference and GPR thermodynamic limit predictions.As in the UCCSD(T) case, the discrepancies are too small to discern at this scale.
To more closely examine how the GPR predictions converge with the number of atoms in the chains, in Figure 7, we plot the difference between the thermodynamic limit predictions of Ref. 1 and our UCCSD(T) (left) and AFQMC-based FIG. 6.Energy per atom computed for N = 100 chains and predicted for N = ∞ chains using the UCCSD(T) (left) and AFQMC (right) methods.The "REF → ∞" is the TDL extrapolation taken from Ref. 1.We plot this reference's extrapolation so that it can be contrasted with our GPR's prediction using 5000 atoms.
(right) GPR predictions on the milliHartree scale.For both methods, we take N = 5000 hydrogen chain GPR predictions to be representative of the thermodynamic limit.In the left-hand panel, we also plot UCCSD(T) results for N = 200 hydrogen chains, the largest we could directly simulate, to contrast N = 200 with N → ∞ results.We see that, at smaller bond lengths, discrepancies still remain between the N = 200 and N → ∞ results, signifying that finite size effects still influence the energies of even N = 200-length chains.
These discrepancies are also manifested in the larger confidence intervals that accompany the GPR predictions.Even so, GPR predictions at all bond lengths studied possess sub-milliHartree accuracy, and the discrepancies between the different chain length predictions disappear at the longest bond lengths studied.In contrast, the right panel of Figure 7 demonstrates that the AFQMC-based GPR predictions are in much better agreement with Ref. 1's thermodynamic limit predictions, even at shorter bond lengths.This is in line with the results presented earlier in Figure 3.
Since our calculations were performed with open boundary conditions (OBC), it is also worthwhile to compare our predictions to those produced using the "subtraction trick," 33 in which the energies of systems of different sizes are subtracted to eliminate surface effects from bulk energies.Figure 8 presents the differences in energy between our GPR predictions of the energies in the thermodynamic limit and those produced using the subtraction trick based on chains of different lengths.The differences in the energies predicted by these approaches is sub-milliHartree at all bond lengths stud-ied, further demonstrating that our GPR predictions are highly accurate relative to a widely-employed benchmark, while also illustrating the surprising accuracy of the subtraction trick.As the subtraction trick eliminates edge effects from energy predictions, this comparison especially highlights the GPR method's ability to correct for edge effects.It is satisfying to see that the energies predicted by the subtraction trick performed on chains of lengths 30 and 50, which should yield the most accurate predictions of the subtraction trick calculations, are in the greatest agreement with our GPR predictions, especially at intermediate bond lengths.As before, we see that our GPR predictions are in the greatest agreement with the subtraction trick results at longer bond lengths.Indeed, our GPR predictions almost perfectly agree with all three of the subtraction trick predictions at the longest bond lengths studied.Moreover, our AFQMC-based GPR predictions again converge more rapidly and reliably to the thermodynamic limit with increasing bond length.Overall, Figures 7 and 8 possess very similar features: the GPR predictions overestimate the energies at the shortest bond lengths and then oscillate between under-and overestimating the energies at intermediate bond lengths before coming to agreement at the longest bond lengths.This points to the overwhelming agreement between the polynomial regression and subtraction trick energies.These comparisons also demonstrate that the GPR predictions are not uniformly biased toward over-or underestimating energies.
A more quantitative comparison of the predictions generated by all of these methods can be found in the Supplementary Materials.FIG. 7. Differences between the energies predicted by Ref. 1 and the UCCSD(T) (left) and AFQMC (right) GPR-predicted energies in the thermodynamic limit (red triangles).For both the UCCSD(T) and AFQMC plots, we assume that the GPR prediction using 5000 atoms is representative of the GPR prediction in the thermodynamic limit.On the left, we also plot the UCCSD(T) energies for N = 200 hydrogen chains, the largest we could directly simulate.The shadows depict 95% confidence intervals for the GPR predictions.

B. Two-Dimensional, Inhomogeneous Hydrogen Chains
Given the success of GPR at predicting the energies of homogeneously-stretched hydrogen chains in the thermodynamic limit, we next examine the capacity for the same GPR techniques to predict the energies of inherently heterogeneous chains of hydrogen dimers.As depicted in Figure 9, these chains of hydrogen dimers are described by two key distances: the intra-dimer bond distance, a, and the inter-dimer bond distance, b.In the following, we generally fix the intradimer dis- tance, a, between 1.0 and 3.5 bohr, and vary the interdimer distance between 1.0 and a bohr, maintaining open boundary conditions.While these chains of dimers enable us to retain the same periodicity present in our earlier homogeneous chains, they also enable us to purposefully and controllably introduce heterogeneity into our systems that complicates our prediction problem.Indeed, these chains of dimers manifest several levels of correlation when stretched, typically necessitating the use of advanced quantum chemistry methods to make high-accuracy energy predictions. 91o study the performance of our GPR algorithm on these chains, we generate a database of dimer chain energies staring from UHF calculations with single Slater determinants that we again input into either CCSD(T) or AFQMC calculations.We model our hydrogen atoms using the minimal STO-6G basis set given the steep computational cost of the system with increasing system size.Chains of 5, 10, and 15 dimers for a total of 176 configurations were employed for training.The remaining 315 configurations of chains consisting of 20 to 50 dimers were subsequently used for testing and validation.The same atomic environment descriptors previously employed for the homogeneous chains were also employed here.
The energy surfaces for chains consisting of N = 30, 50, and 100 atoms are depicted in Figure 10.It can be seen that the GPR predictions are in qualitative agreement with the AFQMC database values, both for short chains (N=10) and long chains approaching the thermodynamic limit (N=100).In particular, GPR is able to well describe both the energy minimum around a=1.5 bohr, b=3.25 bohr, and highly stretched chains with both a and b greater than 3 bohr.More detailed slices of the potential energy surface for several values of a are depicted in Figure 11.
As is apparent from these plots, the approach to large interdimer separations is highly dependent upon the intra-dimer separation: for small a, the approach is steeper than for large a.This behavior is a sign of correlation between the a and b values and is non-trivial, given the seeming simplicity of the model.This makes the model a useful testbed for multidimensional extrapolations, as further discussed in the Supplementary Materials.
To visualize the energy surface, as shown in the Figure 12, we use triangulation over the sample points and then Delaunay smoothing.The thermodynamic limit was estimated using GPR regression on N=5000 atoms.On the left, we present the 3D energy surface for a chain of 15 dimers; the black dots denote the energies estimated by GPR in the thermodynamic limit.On the right, we provide a heat map corresponding to the plot on the left annotated with iso-energy contour lines predicted using GPR for systems of different sizes.In particular, the red and yellow dashed lines denote the energies for chains comprised of 15 and 50 dimers, respectively.The errors on these energies are all less than 1 mHa, which is within chemical accuracy.
This plot underscores how the contours change or shift with system size.We can see that the largest differences between the contours occur near the minimum of the plot around an intra-dimer distance of 1.5 bohr and an interdimer distance of 3 bohr.In this region of the surface, the N = 30 contours differ significantly from the N = 100 contours, which nearly align with the thermodynamic limit contours, suggesting that 100 atoms are nearly enough to converge simulations of this system to their TDL.Away from this minimum, the contours for all three system sizes concur, demonstrating that the system experiences weaker finite size effects for these parameters.GPR's success extrapolating the energies of this non-trivial, multidimensional model suggest that it is likely to have similar success on the more complex models and solids of interest to the wider scientific community.

V. DISCUSSION OF RESULTS
Although we employed Gaussian Process Regression in this work, a wide range of other machine learning approaches, including artificial neural networks, could also be used to perform these extrapolations.We opted to employ kernel methods like Gaussian processes because they are non-parametric and make use of Bayesian inference at a comparatively low, O(N 3 t ) cost, where N t is the size of the training set. 17It has been proposed 92 as a rule of thumb to use N t = 10 × d, where d is the dimension of the feature space, to train a GPR.In contrast, neural network-based approaches involve matrix-vector multiplications that scale with the number of neurons in the network, N n , and the dimension of the input vector, d, as O(N n d).If the number of neurons in the network is small, this implies that neural networks are less expensive to employ than GPR.However, neural networks typically necessitate the use of non-linear activation functions that may increase their overall cost.More importantly, neural networks often suffer from overfitting if care is not taken to reoptimize their number of nodes or structures.Overfitting is much less of a concern for GPR since GPR with the same kernel but more training points is guaranteed to be more accurate.In practice, NNs use at least 2 to 3 orders of magnitude more training data than GPR. 17,32hen training data is scarce -as it usually is when manybody electronic structure calculations are involved -GPRbased techniques hence become the method of choice. 32One may also ask whether using GPR on these low-dimensional data sets is more sophisticated than necessary and if other, less sophisticated regression techniques based on a small number of parameters could instead be employed.As demonstrated in the Supplementary Material, we have compared the performance of our GPR approach to that of Bayesian Multivariate Adaptive Regression Splines, a spline-based technique, and found that our GPR approach can extrapolate with significantly greater accuracy.We moreover show that, while one can extrapolate using a few simple parameters, this extrapolation is not readily generalizable to more complex situations in which the parameters to use are less obvious.Lastly, as illustrated throughout this manuscript, GPR inherently quantifies uncertainties, which are critical for being able to determine its accuracy relative to that of other methods.
Our studies of low-dimensional hydrogen chains naturally beg the question of how well our techniques can be generalized to more realistic multidimensional solids that are accompanied by an even more rapid growth in computational expense.Much like other GAP methods, our approach should be readily generalizable to higher dimensional systems, given sufficient data and high-quality features.Indeed, here, we took the first step toward demonstrating this by applying our model to both a one-dimensional and a nontrivial twodimensional system, and in a previous preprint, we demonstrated how a similar GPR-based approach could be leveraged to predict the energies of 3D alloys. 93The key challenge associated with higher-dimensional predictions is the curse of dimensionality: the higher the dimensionality of the space, the more data that is needed for training to learn the larger space with sufficient accuracy to make effective comparisons between different atomic environments.The resulting increase in cost can be slowed through a more judicious selection of features and design of kernels.CUR 88 decompositions and Kernelized Principal Covariates Regression 90 are excellent alternatives for identifying the most relevant features, which can significantly reduce the dimension of the descriptors of a given data set.More effective kernels may also be constructed through approaches that recursively evaluate the differences between structures. 69For example, De et al. proposed kernels based on regularized structure matching to optimize the comparisons between the atomic environments of different structures. 69Thus, with further technical developments, we believe that the techniques presented here should be readily generalizable to the even larger, more complicated solids that they would most benefit.

VI. CONCLUSIONS
In summary, in this work, we have presented a Gaussian Process Regression-based approach for predicting the manybody energies of hydrogen chains, the simplest examples of ab initio solids, in the thermodynamic limit.We have shown that, by training on databases of the energies of short (10-30-atom) homogeneous and inhomogeneous hydrogen chains with varying intra-and inter-dimer distances, we can predict the energies of these chains in the thermodynamic limit with sub-milliHartree accuracy relative to predictions made by alternative extrapolation techniques.These alternative techniques, including polynomial regressions and the "subtraction trick," typically necessitate computing the energies of chains much longer than the chains employed in our training sets.As such, our approach enables the highly accurate prediction of the energies of solids in the thermodynamic limit based upon relatively small systems, and hence, much less expensive calculations.Unlike many finite size extrapolation techniques which apply to systems with only certain geometries, densities, and/or dimensionality, as demonstrated by the easy generalizability of our method to both homogeneous and inhomogeneous chains, our approach is largely agnostic to the physical characteristics of the system studied; as long as there is sufficient and representative training data, our approach can be applied, making it particularly useful for some of the more complex systems of modern interest, such as those at interfaces or having irregular geometries.

VII. DATA AVAILABILITY
The data that support the findings of this study are openly available online at https://github.com/josuelandinez/LHC_Database.

VIII. SUPPLEMENTARY MATERIAL
A supplementary materials document has been provided that contains comparisons of the accuracy of our GPR methods to those of other regression techniques and tables of the energies predicted by the different approaches described here that underlie many of the plots provided in the main text.
IX. ACKNOWLEDGEMENTS E.J.L.B. would like to thank Amit Samanta for insightful discussions and Qiming Sun for his kind assistance with per-

FIG. 1 .
FIG. 1. Energy per atom vs. bond length for a 50-atom hydrogen chain using the UHF, UCCSD(T), and AFQMC methods in the STO-6G basis.The symbols depict the energies from calculations from Reference ('Ref') 1, while the dotted lines interpolate among 250 of our database energies.AFQMC error bars are too small to see.

FIG. 8 .
FIG.8.The difference in energies between our GPR predictions in the thermodynamic limit, E GPR (N → ∞), and extrapolated energies obtained using the subtraction trick, E ST .(Left) Differences based upon UCCSD(T) energies; (Right) differences based upon AFQMC energies.

FIG. 9 .
FIG. 9. Illustration of the linear chains of hydrogen dimers studied in this work with intradimer distance, a, and interdimer distance, b.

FIG. 10 .
FIG. 10.Energy surfaces, E/N, predicted for chains consisting of (left) 30, (center) 50, and (right) 100 atoms (15, 25, and 50 dimers, respectively) for several a and b values.AFQMC energies are given by the cyan circles, while GPR predictions are given by the maroon triangles.

FIG. 12
FIG. 12. (Left) Interpolated GPR potential energy surface, E/N, for a chain of 15 dimers as a function of a and b.The black dots denote the AFQMC training points from the database.(Right)Iso-contours of the energy surface, E/N, interpolated for chains of N = 30 (red) and N = 100 (yellow) atoms, and in the thermodynamic limit (dashed black lines).The color map denotes the GPR predictions of the energy in the thermodynamic limit.