Avoiding the 4-index transformation in one-body reduced density matrix functional calculations for separable functionals

One of the major computational bottlenecks in one-body reduced density matrix (1RDM) functional theory is the evaluation of approximate 1RDM functionals and their derivatives. The reason is that more advanced approximate functionals are almost exclusively defined in the natural orbital basis, so a 4-index transformation of the two-electron integrals appears to be unavoidable. I will show that this is not the case and that so-called separable functionals can be evaluated much more efficiently, i.e. only at cubic cost in the basis size. Since most approximate functionals are actually separable, this new algorithm is an important development to make 1RDM functional theory calculations feasible for large electronic systems.


Introduction
Though density functional theory (DFT) is formally exact, practical density functionals have great difficulty to capture strongly correlated phenomena such as the breaking of chemical bonds. [1][2][3] The calculation of excitation energies along the bond-breaking coordinate with the current approximations results in an even bigger disaster. [4][5][6] Also long-range charge transfer excitations pose a serious challenge for semi-local density functionals, 7,8 though some improvements have been reported with the help of rangeseparated hybrids, 9,10 direct modifications of the kernel 7,11,12 or the variational approach by Ziegler et al. [13][14][15] One-body reduced density matrix (1RDM)functional theory provides a promising route to alleviate most of these problems existent in practical DFT. It has been demonstrated that the ground state energy of small singly bonded molecular systems can be reasonably well reproduced along the full bond-breaking coordinate. [16][17][18][19][20] It has been shown for two-electron systems that the time-dependent extension of 1RDM functional is much more capable in dealing with bond breaking excitations and charge transfer excitations even within the adiabatic approximation, and also a significant amount of double excitations is captured. [21][22][23][24] Attempts are currently made to extend these results to general Nelectron systems. [25][26][27] A more extensive overview of the current status of 1RDM functional theory can be found in Ref. 28.
Though 1RDM functional theory has some appealing advantages compared to DFT, its practical use is currently limited due to two major computational bottle necks. One computational hurdle * Section of Theoretical Chemistry, Faculty  is the excruciatingly slow self-consistent field (SCF) convergence to obtain the ground state energy. Although several algorithms have been proposed, 29-32 a significant breakthrough on this difficulty has not yet been achieved. Another computational complication is the evaluation of the 1RDM functionals themselves. Only the simplest approximate 1RDM functionals are explicitly defined in terms of the 1RDM. More advanced approximations are defined implicitly via the natural orbitals (NOs) and (natural) occupation numbers, which are defined as the eigenfunctions and eigenvalues of the 1RDM respectively where x := rσ is a combined space-spin coordinate. Current implementations therefore rely on a 4-index transformation of the two-electron integrals to the NO basis which is a very costly operation and impairs any calculation on systems with a large number of electrons. The situation is far worse than in correlated methods such as coupled cluster (CC) and configuration interaction (CI), since the 4-index transformation needs to be performed at each step of the SCF procedure.
I will show in this article that the 4-index transformation can actually be avoided for so-called separable functionals. Separability allows a functional to evaluated directly in the atomic orbital (AO) basis, or any other basis employed in the computer code. This reduces the computational cost from formal m 5 to a formal m 4 scaling, where m is the size of the basis set. Most integral routines make use of screening techniques to only calculate significant two-electron integrals which further reduces the scaling to m 3 or even less. It turns out that most current approximate 1RDM functionals are separable functionals, with a only a few exceptions.
The paper is organized as follows. First the algorithm for the evaluation of separable 1RDM functionals and the first order derivatives is explained in detail. Particular attention is needed for the so-called diagonal corrections which are sometimes called 'self-interaction corrections'. Special care needs to be taken to avoid excessive computational cost and memory imprint for these corrections. The evaluation of some particular functionals will be discussed to illustrate the algorithm. The next section presents benchmark results for the new algorithm using alkanes of varying length as test systems. A significant speed-up is obtained for the evaluation of separable functionals and is most drastic for functionals without diagonal corrections.

Algorithm
To avoid the 4-index transformation we generalize the direct approach in Hartree-Fock (HF) theory to calculate the contribution of the two-body interaction to the energy and the Fock matrix. 33 For example, consider the Hartree (classical Coulomb) contribution to the energy which in terms of the HF orbitals has the following simple form where n i are the occupation numbers of the HF orbitals, i.e. the NOs of the HF system. The two-electron integrals are denoted in the chemist's notation, so they are defined as The interaction between the particles will be usually the Coulomb interaction w(r) = 1/r, but could also have some other form, e.g. the long-range part of the Coulomb interaction in a rangeseparated scheme w(r) = erf(µr)/r. [34][35][36] In an arbitrary basis, {χ µ }, the Hartree term can be expressed as where the matrix elements of the 1RDM are defined as Throughout the text I will always use Greek indices to refer to this arbitrary basis in which the two-electron integrals are supplied.
The latin indices will exclusively be used for the NO basis.
Suppose now that all the two-electron integrals are available in the basis {χ µ }. Typically this will be an atomic orbital basis or a plane wave basis. The Hartree contribution can be calculated without transforming any 4-index quantity by first performing the contraction over only one 1RDM as The summations are generally performed by looping over inte-grals stored on file or by calculating all integrals on the fly. In a second step the contraction with the other 1RDM is performed The only essential feature of the Hartree contribution to allow for this trick is that it is a Coulomb-type separable functional. With a Coulomb-type separable functional I mean a functional which can be expressed as a linear combination of a few terms of the form This is a Coulomb-type separable functional which can efficiently be evaluated in the same fashion as the Hartree term. Typically we will have f = g and f = f † , but this is not necessary for the trick to work. Its evaluation proceeds again via a Coulomb-like potential (6), which is generalized now to Likewise, an exchange-type separable functional can be expressed as a linear combination of a few terms of the form so just two indices are swapped around compared to the Coulomb case. The exchange-type separable functionals allow for a similar efficient evaluation as the Coulomb-type separable functionals by first forming an exchange-type potential and subsequently performing the final contraction When the basis functions are real, the Coulomb-and exchangetype versions are the only two possible distinct forms of a separable functional. In the case one uses complex basis functions, e.g. plane waves, one also might need to consider the versions where the complex conjugation has been interchanged in the last pair of the two-electron integral, i.e.
Not only the contribution to the energy can be calculated in this manner. Also the projected orbital derivatives needed for the construction of the Fock matrix 31,32 or direct orbital optimization, 29 can easily be calculated. Only the final contraction over the potentials needs to be left out and subsequently a transformation to the NO basis needs to be made. For most approximate functionals, f and g will be diagonal in the NO basis. So it is computationally beneficial to transform first the potentials to the NO basis and only then to multiply them by the diagonal f and g matrices where f k and g k denote the diagonal entries of the f and g matrices in the NO representation. The energetic contribution can now cheaply be obtained by taking the trace If the matrices f and g are not only diagonal in the NO basis, but if also their diagonal elements only depend on the occupation numbers with the same index, i.e. f k (γ) = f k (n k ) and g k (γ) = g k (n k ), the derivatives with respect to the occupation numbers become particularly simple Most approximate 1RDM functionals exhibit such a simple dependence on the occupation numbers. More complicated dependencies need to be worked out for each approximate functional separately.
Though separability might seem to be a very stringent condition on a general 1RDM functional, many approximate functionals used in 1RDM functional calculations are actually separable. The only non-separable functionals I am aware of are the empirical functional by Marques and Lathiotakis 37 , the automated version of the BBC3 functional by Rohr et al. 17 and the PNOF4 by Piris et al. 18 To use the proposed scheme to avoid the 4-index transformation, one only needs to rewrite the approximate 1RDM functional in a separable form, i.e. as a linear combination of terms of the form (8) and (10). An obvious example is the Müller functional, [38][39][40] since it was originally published in its separable form A function of the 1RDM, the square root in this case, is defined in the usual manner via its diagonal representation Approximate functionals which simply modify the square root to some other power, √ γ → γ α , also belong to this class of functionals which the 4-index transformation can trivially be avoided. [41][42][43][44] An explicit expression in terms of the 1RDM itself is not available for more advanced 1RDM functionals which classify NOs in strongly and weakly occupied groups and/or contain 'diagonal corrections'. Though these functionals are only explicitly defined in terms of occupation numbers and NOs, many of these functionals can still be rewritten in a separable form, allowing for the previously described tricks. These non-trivial separable forms of more advanced 1RDM functionals are best explained with an example. To this end we will use the BBC2 functional, which is defined in the NO representation as 16 where W H is the usual Hartree term introduced earlier (2) and F(n i , n j ) is defined as The involved expression for F(n i , n j ) renders an explicit expression in terms of γ virtually impossible. Nevertheless, the BBC2 functional can still be expressed in a separable form by using other (auxiliary) matrices. To this end we first neglect the i = j conditions in (21). This 'non-diagonal part' of the BBC2 functional can now be expressed in a separable form as where f (γ) occ The remaining diagonal part (correction) is now of the form Unfortunately the diagonal correction cannot be straightforwardly be written in a separable form and it seems that we still need to resort to a 4-index transformation of the two-electron integrals for their evaluation. However, the proposed trick can still be used by first constructing 1RDMs in which only one NO is occupiedγ The next step is to form the contraction with the two-electron integrals as v (i) The last step is to transform the potentials v (i) back to the NO representation and to form the final contraction The corresponding projected orbital derivatives required for the SCF can be obtained from the off-diagonal elements Though we could avoid the use of a 4-index transformation in this manner, the additional computational cost to calculate the diagonal correction of the BBC2 functional (24) is significant compared to the cost to calculate the non-diagonal part (22). The nondiagonal part only needs the contraction with 4 different auxiliary matrices (1 Coulomb and 3 exchange), whereas the diagonal part requires a contraction with m auxiliary matrices, so comprises a significantly more expensive part of the functional to evaluate. Furthermore, the complete construction of the orbital 1RMDsγ (i) and the corresponding potentials v (i) gives a significant memory imprint, since both scale cubically with the number of basis functions. Fortunately, the special structure of the diagonal correction allows one to avoid the explicit construction of these large matrices. Avoiding the explicit construction of the orbital 1RDMs,γ (i) , is readily achieved by not constructing them explicitly. Instead, the required elements are only constructed on a need-to-be basis when looping over the two-electron integrals. Now let us consider the high memory imprint of the potentials v (i) . Note that in the expression for the energy contribution (27) and for the orbital derivative (29), at least one of the lower indices is always equal to the upper index, so we can avoid the construction of many unnecessary elements. We do this by transforming the potentials v (i) partially to the NO basis immediately during their construction as The projected orbital derivatives and the contribution to the energy can now easily be calculated by transforming the last index also to the NO basis and forming Though we have now avoided the explicit construction of thē γ (i) matrices and the corresponding potentials v (i) , the operation count for the calculation of the diagonal correction is significantly higher than for the non-diagonal part of the functional. For separable functionals with a diagonal correction, the evaluation of the diagonal part therefore remains the computational bottleneck in the evaluation of their values and derivatives.
The formal scaling of the proposed algorithm for functionals without diagonal corrections is still of order m 4 due to the loop over the two-electron integrals to construct the Coulomb potentials (9) and exchange potentials (11). The situation is even worse for functionals with diagonal corrections, since the construction of the intermediate potentialsv (30) makes it formally an m 5 process. The main advantage of the proposed algorithm is that integral screening becomes way more effective than in a calculation via a 4-index transformation. Screening of the twoelectron integrals is a very common strategy in quantum chemistry software to avoid the calculation of many insignificant twoelectron integrals. The number of significant two-electron integrals turns out to be only of the order m 2 for the larger molecular systems of interest, 45 so the scaling can in principle be reduced by an additional factor 2. This is readily done by using the Schwarz inequality on the two-electron integrals 46 The integrals on the right-hand side are pre-calculated and only require m 2 storage. When the product of the square roots on the right-hand side are larger than some tolerance ε, the program will actually calculate the integrals on the left. Screening of twoelectron is particularly effective when the integrals are evaluated in a basis with a strong local character. This method and more advanced techniques have been implemented in virtually all quantum chemistry software packages, so can directly be exploited to reduce the scaling significantly as I will demonstrate in the following section.
It should be mentioned that the importance of separability for an efficient evaluation of energy expressions and derivatives has already been recognised for a few decades in quantum chemistry. The most famous example is the 2 nd order Møller-Plesset (MP2) energy correction, which in its usual form is non-separable where the indices i, j refer to the occupied Hartree-Fock (HF) orbitals, the indices a, b to the unoccupied HF orbitals and ε r are the HF orbital energies. By writing the denominator as a Laplace transform, the MP2 energy correction can be turned into a separable formt [47][48][49] The summation over the HF orbitals is now a separable expression and the strategy described before can now be used to evaluate the integrant efficiently. The price to pay is that the contraction needs to be evaluated for several values of t to evaluate the integral numerically with some suitable numerical integration scheme, but this only increases the computational cost by a prefactor. A similar trick can be used to bring the random phase approximation (RPA) in the linear scaling regime via the spherical Laplace transform. 50 Similar tricks will probably be useful to rewrite additional approximate 1RDM functionals in a separable form.

Benchmarking
To test the new strategy to evaluate the energy of separable functionals, I have constructed a modular Fortran 2003/2008 implementation, interfaced to a modified version of the GAMESS-US program package. 51-53 The build-system of GAMESS-US has been replaced by FORAY, 54 to avoid figuring out module dependencies by hand and allow for parallel compilation. The implementation is currently only intended for serial runs. A parallel implementation will be considered later. The cut-off criterion for the twoelectron integrals has been set to ε = 10 −10 . The evaluation of the energy and gradients has been implemented in 3 different manners 1. The straightforward version using the 4-index transformation of the two-electron integrals.
2. Evaluation in AO basis using integrals stored on disk.
3. A direct option which does not store the two-electron integrals, but recalculates them each time when they are needed.
In Fig. 1 I plot timings for the Müller functional (18). For the input 1RDM I have used the HF orbitals as NOs. Typically, the occupation numbers are fractional in 1RDM calculations. This has been mimicked to some extend by setting the occupations of the occupied HF orbitals to 0.9 and the occupations of the virtual HF orbitals to 0.1 N/(m − N). As test systems I used (linear) alkanes C n H 2n+2 in a spherical cc-pVTZ basis. 55 The highest point group symmetry of each system has been used to calculate only the symmetry unique integrals. For even n the point group is C 2h and for odd n this is C 2v . The two smallest alkanes allow for the use of higher point groups (T d for methane and D 3d for ethane). This additional reduction in symmetry unique two-electron integrals is reflected by the strong deviation from the general trend in the plot in Fig. 1. The asymptotic scaling behavior (exponent) of the different strategies has been estimated by fitting a power function, A e α m , to the results for a large enough basis size. The exact points taken into account varies per calculation and has been indicated by the straight line connection the points in Figs 1 and 2. The naïve implementation via the 4-index transformation of the two-electron integrals is clearly the most expensive one in Fig. 1. Its main cost is the 4-index transformation, which is reflected in its high asymptotic scaling of order m 4.8 . The evaluation in the AO basis is clearly more efficient. Reading the integrals from file is slightly faster than recalculating them, but the asymptotic scaling is basically the same. In this benchmark, the file reading is particularly fast due to the solid state drive (SSD) and the direct option is actually slow, since only one core is used. It is therefore expected that a parallel implementation would easily beat the ' AO file' option, since the different processes would only start to compete for disk access. An additional bottle neck for the ' AO file' option is that the two-electron integral file becomes excessively large. The last reported point for the ' AO file' option (C 20 H 42 : 1330 cartesian basis functions) has already a two-electron file of 250 GB. The AO direct option does not have this bottle neck and can easily handle larger systems. The scaling of the ' AO direct' option has been estimated to order m 2.3 . The exponent has not converged however, since the formation of the required 1RDMs in AO basis (γ and √ γ) involves a matrix-matrix product which should scale worse. Assuming the BLAS implementation is based on Strassen's algorithm, 56 an asymptotic scaling of at least order m log 2 7 ≈ m 2.8 would be expected.
In Fig. 2 the computational time for the evaluation of the BBC2 functional with and without its diagonal part (24) are shown when using the ' AO direct' option. The first thing to notice is the effectiveness of screening by the Schwarz inequality. Both for the full BBC2 functional and only the non-diagonal part, the use of screening leads to a drastic reduction in the asymptotic scaling. Now let us concentrate on the calculations with only the nondiagonal part of the BBC2 functional (22). Though the nondiagonal part of the BBC2 functional requires the contraction with 4 different 1RDMs, instead of only 2 as required for the Müller functional, the increase in computational cost is marginal. Only the prefactor is increased slightly by A BBC2 − A Müller /A Müller = 5%, but the asymptotic scaling remains the same, m 2.3 .
On the other hand, the calculation of the diagonal corrections poses a significant increase in computational cost. They add an order of magnitude to the computational cost. If no screening is used, this even means that the asymptotic scaling is similar to the scaling of the 4-index transformation. Avoiding the use of small integrals is therefore crucial for the algorithm to improve over the standard evaluation by the 4-index transformation. If screening is used, however, the use of an AO based evaluation of the diagonal correction provides a major increase in efficiency compared to the traditional approach via a 4-index transformation. The asymptotic computational cost has been reduced from order m 5 to m 3.2 , so much larger quantum systems are accessible with the new implementation, especially when the algorithm is additionally parallelized. The effect of the cutoff criterion on the total energies and gradients has also been investigated. In the upper panel of Fig. 3 the relative absolute error in the energy, |(E ε −E 0 )/E 0 |, has been plotted as a function of the cartesian basis size for the alkane series. For the full range of alkanes, the error in the energy can be systematically reduced by lowering the value of the cutoff parameter, ε. Only for the small alkanes the reduction in the error is more erratic, probably due to more or less error cancelations. In the lower panel of Fig. 3 the averaged absolute error (open circles) and maximum error (filled circles) is shown for the alkane series. Again, the error in the gradient can be systematically reduced by lowering the cutoff parameter, ε.
In Fig. 4 I show the same errors for the full BBC2 functional for the different cutoff parameters, ε = {10 −9 , 10 −10 , 10 −11 }. We see that the inclusion of the diagonal contributions to the BBC2 functional (24) looks very similar for the larger alkanes, So the error in the energy and gradient in the same ballpark when the diagonal part of the BBC2 functional is included. Only for smallest alkanes the errors are somewhat larger, probably due to less fortuitous error cancellations.

Conclusion
An algorithm to avoid the 4-index transformation for non-trivial separable 1RDM functionals has been presented, which reduces the formal computational cost to m 4 . Diagonal corrections could be handled with a similar strategy, though additional modifications were needed to avoid excessive use of memory. The formal scaling of the diagonal corrections remains m 5 unfortunately. The main advantage of the newly proposed strategy for the evaluation of 1RDM functionals and their derivatives, is that now integral screening techniques can be used more effectively, leading to a significant reduction in the asymptotic scaling of the computational cost. Using screening by the Schwarz inequality, I have shown that the asymptotic scaling can be reduced to m 3.2 for separable functionals including diagonal corrections. For separable functionals without diagonal corrections the asymptotic scaling is even further reduced to m 2.3 . It is expected, however, that for larger systems the scaling for functionals without diagonal corrections should be about order m 2.8 due to the matrix-matrix products involved. All in all, a significant boost in the speed of the evaluation of approximate 1RDM functionals has been achieved. As the evaluation was one of the major computational bottlenecks in practical 1RDM functional calculations. This is an important step to make the use of 1RDM functionals feasible for large molecular systems.