New scaling relations to compute atom-in-material polarizabilities and dispersion coefficients: part 2. Linear-scaling computational algorithms and parallelization

We present two algorithms to compute system-specific polarizabilities and dispersion coefficients such that required memory and computational time scale linearly with increasing number of atoms in the unit cell for large systems. The first algorithm computes the atom-in-material (AIM) static polarizability tensors, force-field polarizabilities, and C6, C8, C9, C10 dispersion coefficients using the MCLF method. The second algorithm computes the AIM polarizability tensors and C6 coefficients using the TS-SCS method. Linear-scaling computational cost is achieved using a dipole interaction cutoff length function combined with iterative methods that avoid large dense matrix multiplies and large matrix inversions. For MCLF, Richardson extrapolation of the screening increments is used. For TS-SCS, a failproof conjugate residual (FCR) algorithm is introduced that solves any linear equation system having Hermitian coefficients matrix. These algorithms have mathematically provable stable convergence that resists round-off errors. We parallelized these methods to provide rapid computation on multi-core computers. Excellent parallelization efficiencies were obtained, and adding parallel processors does not significantly increase memory requirements. This enables system-specific polarizabilities and dispersion coefficients to be readily computed for materials containing millions of atoms in the unit cell. The largest example studied herein is an ice crystal containing >2 million atoms in the unit cell. For this material, the FCR algorithm solved a linear equation system containing >6 million rows, 7.57 billion interacting atom pairs, 45.4 billion stored non-negligible matrix components used in each large matrix-vector multiplication, and ∼19 million unknowns per frequency point (>300 million total unknowns).

is not close to zero for sufficiently small finite j  (eqn (S12)),

Here, x0 is any initial guess for x. This shift always makes y0 = 0 the initial guess for y. The conditioning matrix C must be non-singular. Since A is Hermitian, the matrix M = C†AC will automatically be Hermitian for any conditioning matrix C. As shown in Section S2.2 below, the algorithm's convergence speed is largely determined by matrix M's eigenspectrum. As with other conjugate gradient methods, convergence is most rapid if the eigenvalues are clustered together.2 Thus, conditioning's primary goal is to make the eigenspectrum more clustered together.2 In this article, the dot product of two vectors v and w is defined as <v | w> = Σi v̄i wi. If the linear equation system (i.e., eqn (S19)) is inconsistent (i.e., has no exact solution), this FCR algorithm returns a statement that the linear equation system has no exact solution along with a value y = yFCR that minimizes the least-squares problem

Minimize F = <z | z>, z = W − My

This represents a best possible choice for y irrespective of whether an exact solution to eqn (S19) exists. There are some applications where this least-squares fit has utility even if an exact solution to eqn (S19) does not exist.
When the eigenvalues are extremely spread out in value, it takes a higher-order polynomial (and hence a larger number of FCR iterations) to make Poly(λi) ≈ 1/λi for all of the eigenvalues contributing to W (eqn (S35)). Thus, the primary goal of conditioning (eqn (S20)) is to make the eigenvalues less spread out in value.
Combining eqn (S31) and (S35) means the Krylov subspaces can be expanded in terms of eigenvectors, where span refers to the space generated by all possible linear combinations of the listed vectors and Nrows is the number of rows in matrix M. For example, if matrix M has Nrows = 1 million but only 12 distinct eigenvalues, then FCR will converge to the exact result in at most 6 iterations in exact arithmetic.
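The dependence of the iteration count on the number of distinct eigenvalues can be illustrated numerically. The sketch below uses a plain single-search-direction conjugate residual loop, not the paper's two-direction FCR (so convergence here takes roughly one iteration per distinct eigenvalue rather than half that); the matrix, tolerances, and helper name are illustrative assumptions:

```python
import numpy as np

def conjugate_residual(A, b, tol=1e-10, max_iter=None):
    # Textbook conjugate residual method for a Hermitian matrix A
    # (a sketch, not the paper's two-direction FCR variant).
    n = len(b)
    x = np.zeros(n)
    r = b - A @ x
    p = r.copy()
    Ar = A @ r
    Ap = Ar.copy()
    iters = 0
    for _ in range(max_iter or n):
        denom = Ap @ Ap
        if denom < tol**2:
            break
        alpha = (r @ Ar) / denom
        x += alpha * p
        r_new = r - alpha * Ap
        iters += 1
        if np.linalg.norm(r_new) < tol:
            break
        Ar_new = A @ r_new
        beta = (r_new @ Ar_new) / (r @ Ar)
        p = r_new + beta * p
        Ap = Ar_new + beta * Ap
        r, Ar = r_new, Ar_new
    return x, iters

# Symmetric 50x50 matrix with exactly 3 distinct eigenvalues {1, 2, 5}
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((50, 50)))
eigs = np.array([1.0] * 20 + [2.0] * 20 + [5.0] * 10)
A = Q @ np.diag(eigs) @ Q.T
b = rng.standard_normal(50)
x, iters = conjugate_residual(A, b)
print(iters, np.linalg.norm(A @ x - b))
```

Despite Nrows = 50, the iteration count tracks the number of distinct eigenvalues rather than the matrix dimension.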

S2.2.3 Singular coefficients matrix
Matrix M is singular if and only if one of its distinct eigenvalues equals zero. Singular matrices are not invertible. Let λ = 0 denote this particular zero eigenvalue. Let cλ be the coefficient in eqn (S35) associated with this particular eigenvalue. If cλ ≠ 0, then let V be the eigenvector (associated with the zero eigenvalue λ) that makes a non-zero contribution to W in eqn (S35).

The eigenvectors of a Hermitian matrix can always be chosen to form an orthonormal basis that spans the whole Nrows-dimensional space. Consequently, for singular matrices the eigenvector(s) spanning kernel(M) are orthogonal to all eigenvectors not in kernel(M).
For singular matrices, there are obviously two separate cases. Case 1: When cλ = 0, the zero eigenvalue of M does not contribute anything to the vector W. In this case, the eigenvector(s) V associated with the zero eigenvalue do not appear as the Krylov subspace sequence is built up as W, MW, M2W, …. Since the conjugate residual method is based on this Krylov subspace sequence, the eigenstate(s) (λ = 0, V) are completely irrelevant to the problem in exact arithmetic, because these zero eigenstates simply do not appear in the equation to be solved nor in its solution. In this case, the linear equation system My = W is consistent and its solution in exact arithmetic proceeds analogously to the non-singular case.
Case 2: When cλ ≠ 0, a zero eigenstate makes a non-zero contribution to W in eqn (S35). Examining eqn (S39), an exact solution would be reached if eqn (S44) held for every eigenstate contributing to W. For λ = 0, eqn (S44) simplifies to 0 = 1, which is manifestly inconsistent. Therefore, the linear equation system My = W is manifestly inconsistent in this case. Because My multiplies each eigenvector contributing to y by its eigenvalue, My does not contain any contributions from zero eigenstates. Therefore, whenever W contains a non-zero contribution from a zero eigenstate, no exact solution exists.
For a singular coefficients matrix (whether the linear equation system is consistent or inconsistent), each iteration removes conditioned residual along two z-search directions that are orthogonal to each other and to all prior z-search directions. Therefore, convergence is reached in at most ceiling(Λ/2) iterations, where Λ is the number of distinct eigenvalues contributing to W; this is the same result as for the non-singular case given in Section S2.2.2 above.
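The consistent and inconsistent singular cases can be checked numerically with a generic least-squares solver (NumPy's `lstsq` stands in here for FCR; the matrix and vectors are illustrative assumptions):

```python
import numpy as np

# Singular symmetric 6x6 matrix with eigenvalues {0, 1, 3}
rng = np.random.default_rng(1)
Q, _ = np.linalg.qr(rng.standard_normal((6, 6)))
A = Q @ np.diag([0.0, 0.0, 1.0, 1.0, 3.0, 3.0]) @ Q.T
kernel = Q[:, :2]                     # eigenvectors with zero eigenvalue

# Case 1: W has no kernel component -> consistent, exact solution exists
W1 = A @ rng.standard_normal(6)       # guaranteed to lie in range(A)
y1, *_ = np.linalg.lstsq(A, W1, rcond=None)
print(np.linalg.norm(A @ y1 - W1))    # expected ~0: exact solution

# Case 2: W has a kernel component -> inconsistent; the least-squares
# residual equals the projection of W onto kernel(A)
W2 = W1 + kernel[:, 0]
y2, *_ = np.linalg.lstsq(A, W2, rcond=None)
resid = W2 - A @ y2
proj = kernel @ (kernel.T @ W2)
print(np.allclose(resid, proj, atol=1e-8))
```

Because A is symmetric, range(A) is orthogonal to kernel(A), so the least-squares fit removes everything except the kernel component of W, exactly as described for FCR below.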

S2.2.4 Deriving the convergence criteria
Regardless of whether the linear equation system is non-singular consistent, singular consistent, or singular inconsistent, the least-squares solution corresponds to removing all contributions to W that are not in kernel(M), with the final residual being the portion of W that lies within kernel(M). Examining eqn (S39), this FCR method always proceeds normally along all z-search directions that are not in kernel(M). Because these z-search directions are linear combinations of eigenstates that are orthogonal to kernel(M), the projection of the conditioned residual onto kernel(M) is unaffected during this search.

Manifestly, the algorithm is converged. Therefore, a needed condition to construct a fail-proof conjugate residual method is that one of the z-search directions (say p(i)) is chosen so that the z-search always reduces the conditioned residual norm in each and every iteration until convergence. Therefore, the FCR algorithm has two convergence criteria. If z(i) → 0, the FCR algorithm converged to an exact solution (within the convergence tolerance). If instead <Mz | Mz> → 0 while z(i) ≠ 0, then the linear equation system is inconsistent and the FCR algorithm converged to the least-squares solution (within the convergence tolerance).

S2.3 Defining conjugate search directions that span the Krylov subspaces
Each iteration involves two search directions, p(i) and q(i), subject to the constraints given in eqn (S47)-(S52). In exact arithmetic, all z-search directions are chosen to be orthogonal, thus ensuring a conjugate method. Following eqn (S53) and (S54), the conditioned residual is perpendicular to the current and prior search subspaces.

Because two conjugate search directions are chosen from a Krylov subspace that increases by two directions per iteration (and are orthogonal to all prior z-search directions), it follows that combining eqn (S55) and (S57), and then combining eqn (S49), (S52), and (S53), yields the relations used below. As stated above in eqn (S51)-(S52), p(i) and q(i) are M2-conjugate to all prior search directions that span the Krylov subspaces of orders ≤ 2i−2. It thus follows that analogous relations hold for j ≥ 0, where the notation p,q means the equation holds for both p and q. These equations are for exact arithmetic.
These requirements will be fulfilled by the choices derived below. First, we derive the conjugacy conditions. Contracting eqn (S65) with the prior search directions gives an expression that must equal zero to satisfy eqn (S52). By eqn (S60), only one component is non-zero; for convenience its subscript is dropped. Contracting and making use of eqn (S67) gives the coefficient value. This choice for p(i) satisfies the requirement in eqn (S47) above, which rearranges to show the remaining terms are automatically zero in exact arithmetic. The following ordered sequence of assignment operations gives q(i) with these required properties.

RSC Advances
This choice satisfies the conjugacy condition.

S2.4 Vector lengths in exact arithmetic
The above relationships prove the z-search directions are orthogonal and members of the appropriate Krylov subspaces, but additional analysis is required to prove the search directions have non-zero lengths (at least until convergence).
If a zero-length search direction were to occur on any iteration except the last one, then the correct Krylov subspaces may not be spanned by these vectors. Note that a vector v equals zero if and only if its squared length <v | v> is also zero. Also, the squared length of any vector is non-negative. Here is a proof that p(i) ≠ 0 until convergence is reached. From eqn (S53), taking the dot product of eqn (S65) with itself yields eqn (S95). From eqn (S93), the two terms on the right side of eqn (S95) must be non-negative. As mentioned elsewhere in Section S2, the coefficient in eqn (S82) is set to zero when its corresponding denominator would be zero.

S2.5 Robust convergence and round-off error resistance
The Orthodir conjugate gradient algorithm can work for both positive definite and indefinite Hermitian coefficients matrices M.3 However, it suffers from the accumulation of round-off errors, because in the Orthodir algorithm the y-search direction at each successive iteration is computed from the prior search directions without explicit reference to the residual's current value. The FCR algorithm solves this problem by making the new search direction p(i) depend explicitly on the residual's current value (eqn (S73)). Thus, irrespective of any round-off errors that have occurred, this search direction will never become uncoupled from the residual's value.
Round-off errors can potentially cause the q(i) direction to stray from its precise value. It will now be proved that the conditioned residual norm necessarily decreases in each iteration until FCR converges, even when using fixed precision arithmetic. Rearranging eqn (S49) gives eqn (S103). Taking the dot product of eqn (S103) with itself gives eqn (S104), and hence eqn (S105). Substituting eqn (S33), (S61), and (S62) into (S105) gives eqn (S106). Expanding p(i) using eqn (S73) gives eqn (S107),

where the indicated quantities are zero via eqn (S53). Substituting eqn (S107) into (S106) gives eqn (S108). As discussed in Section S2.2.4 above, FCR is not yet converged while the conditioned residual is non-zero. Inserting this into eqn (S108) shows the conditioned residual norm strictly decreases while FCR is not yet converged. Hence, the FCR algorithm always decreases the conditioned residual norm in every iteration until convergence. Because the required conjugacy conditions (eqn (S99)-(S102)) hold even for fixed precision arithmetic, eqn (S108) remains valid for fixed precision arithmetic.

S2.6 Computational algorithm
Convergence thresholds: The following three parameters control convergence thresholds. (1) FCR will exit if the maximum absolute value of every conditioned residual component is less than CR_convergence_tol. (2) FCR will exit if SQRT(<Mz | Mz>) is less than Mz_length_tolerance. The division_tolerance = (Mz_length_tolerance)² sets the threshold for when <Mq | Mq> and related denominators are considered to be effectively zero. Mz_length_tolerance should be set to a value much smaller than CR_convergence_tol. (3) max_CR_steps specifies the maximum number of FCR iterations that can be performed.
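A minimal sketch of a conjugate-residual loop wired up with these three exit criteria, using identity conditioning (C = I, so the conditioned residual z is the ordinary residual). The function name and test matrix are illustrative assumptions, and this plain single-direction loop is not the full two-direction FCR algorithm:

```python
import numpy as np

def fcr_sketch(M, W, CR_convergence_tol=1e-10,
               Mz_length_tolerance=1e-14, max_CR_steps=200):
    # Conjugate-residual loop with the three thresholds named in the text.
    division_tolerance = Mz_length_tolerance**2
    y = np.zeros_like(W)
    z = W - M @ y                      # conditioned residual (C = I)
    Mz = M @ z
    p, Mp = z.copy(), Mz.copy()
    num = z @ Mz                       # <z | Mz>; stays > 0 for SPD M
    history, status = [], "max_steps_reached"
    for _ in range(max_CR_steps):
        history.append(np.linalg.norm(z))
        if np.max(np.abs(z)) < CR_convergence_tol:
            status = "exact_solution"          # criterion (1)
            break
        if np.sqrt(Mz @ Mz) < Mz_length_tolerance:
            status = "inconsistent_least_squares"  # criterion (2)
            break
        denom = Mp @ Mp
        if denom < division_tolerance:
            status = "inconsistent_least_squares"  # guarded division
            break
        alpha = num / denom
        y += alpha * p
        z = z - alpha * Mp
        Mz = M @ z
        num_new = z @ Mz
        beta = num_new / num
        p = z + beta * p
        Mp = Mz + beta * Mp
        num = num_new
    return y, status, history

rng = np.random.default_rng(2)
B = rng.standard_normal((30, 30))
M = B @ B.T + 30.0 * np.eye(30)        # Hermitian positive definite test
W = rng.standard_normal(30)
y, status, history = fcr_sketch(M, W)
print(status, len(history))
```

The recorded `history` lets one verify numerically the monotone decrease of the residual norm proved in Section S2.5.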

S3. Proof the wp lookup table method yields error proportional to the interval squared
Using Mathcad's symbolic engine, this was expanded to a Taylor series in powers of interval. This formula lends itself to extremely easy magnitude analysis. Over the interval of interest, inserting eqn (S134) and (S135) into (S133) yields an unsigned relative error of

interval²/16 (S136)

where the third-order and higher terms can be neglected, because 0 < interval << 1. Because C6,total is the sum of positive C6,AB values, the unsigned relative error (URE) of C6,total is necessarily less than or equal to the largest URE of the C6,AB values. Therefore, the URE in C6,total is bounded as shown in eqn (S137). Substituting eqn (S138) into the center of eqn (S137) yields the right-most inequality of eqn (S137). For Num_lookup = 10⁵, a URE ≤ ~4×10⁻¹⁰ is expected.

S4. Linear-scaling algorithm for setting up the lists of interacting atom pairs

S4.1 Overview

Figure S1 is a flow diagram of the linear-scaling algorithm to set up the lists of interacting atom pairs. Data is grouped to enable fast computation by avoiding all array searches. For example, information is ordered such that arrays do not have to be searched to identify which atoms belong in each spatial region. Also, each array allocation is performed once, rather than continuously appending arrays (which would be extremely slow). This is accomplished by first performing a 'dry run' code block that executes a sequence to count up the required array size, followed by array allocation, followed by a code block that writes data to the allocated array. The four key steps to construct these lists are:
1. Define basis vectors and unit cell parallelepiped: A parallelepiped of non-zero volume is constructed to enclose the system's unit cell. Three basis vectors correspond to this parallelepiped's non-collinear edges. For a periodic direction, the basis vector is the corresponding periodic lattice vector. For a non-periodic direction, the basis vector is chosen to be of a non-zero length that fully encloses all the nuclear positions. Periodic basis vectors can be non-perpendicular to each other (e.g., triclinic unit cells), but each non-periodic basis vector is chosen to be perpendicular to the other basis vectors.

2. Divide the unit cell parallelepiped into spatial regions: This unit cell parallelepiped is divided into a whole number of spatial regions along each basis vector. A periodic direction produces an infinite number of periodic images of each region, while a non-periodic direction has only the reference image. Atoms in the reference unit cell are classified by region, and a sorted list is prepared such that atoms of the same region are adjacent in this sorted list. Because each spatial region is defined such that its volume is less than that of a sphere of dipole interaction cutoff length radius, the number of atoms in each region is always below a threshold. The code ignores regions that do not contain any atoms. Because empty regions are skipped, having a few atoms in the center of an enormous unit cell would execute quickly.

3. Construct arrays listing interacting region pair images: Two spatial region images interact if the minimum distance between inter-region points is ≤ the dipole interaction cutoff length. Because the spatial regions and their images are coordinate system indexed, a list of interacting region pair images is constructed without having to construct a double summation over all region pairs.
Thus, even for an extremely large unit cell (e.g., containing billions of atoms) divided into many regions (e.g., millions), the list of interacting region pair images is constructed in time and memory scaling linearly with increasing unit cell size. Different regions can interact with different periodic images. For example, in an extremely large unit cell, a region near the center would interact only with nearby regions in the reference unit cell, while a region not too far from the left edge would interact with some regions in the reference unit cell and some other region images in the left-translated unit cell. Thus, a first array is constructed listing pairs of regions having any interacting images, and a second array is prepared that lists which specific images of each particular region pair interact. Region pairs that do not interact are not included in these two arrays.

4. Construct two lists of interacting atom pairs: Because interacting atom pairs must be contained in interacting region pair images, the code identifies the interacting atom pairs by executing an outer loop over the interacting region pair images and inner loops over the atoms in these regions (along with tests for inclusion criteria). Using the list of atoms sorted by region makes this process cache access friendly. For each such atom pair, tests are performed to determine if it meets the small and large list inclusion criteria. If so, its information is added to the small and/or large lists. Because the number of interacting region pair images scales linearly with large Natoms and the number of atoms in each region is below a threshold, these small and large lists are constructed in linear-scaling computational time and memory for large Natoms.
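For illustration, the four steps above reduce, in the simplest non-periodic case with the region width set to the full cutoff length instead of one-third of it, to a standard cell-list construction. The following Python sketch is only a toy stand-in for the Fortran implementation in the ESI (function name and parameters are illustrative):

```python
import numpy as np
from collections import defaultdict

def build_interacting_pairs(coords, cutoff):
    # Step 2: bin atoms into regions of width >= cutoff. Only occupied
    # regions get dictionary keys, so empty regions are skipped for free.
    cell = np.floor(coords / cutoff).astype(int)
    regions = defaultdict(list)
    for i, c in enumerate(cell):
        regions[tuple(c)].append(i)
    pairs = []
    # Steps 3-4: outer loop over interacting region pairs (here, the 27
    # neighboring cells), inner loops over the atoms in those regions.
    for key, atoms in regions.items():
        kx, ky, kz = key
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for dz in (-1, 0, 1):
                    nbr = (kx + dx, ky + dy, kz + dz)
                    if nbr not in regions or nbr < key:
                        continue          # visit each region pair once
                    for i in atoms:
                        for j in regions[nbr]:
                            if nbr == key and j <= i:
                                continue  # unique atom pairs only
                            if np.linalg.norm(coords[i] - coords[j]) <= cutoff:
                                pairs.append((min(i, j), max(i, j)))
    return sorted(pairs)

rng = np.random.default_rng(3)
coords = rng.uniform(0.0, 8.0, size=(150, 3))
cutoff = 1.2
pairs = build_interacting_pairs(coords, cutoff)
print(len(pairs))
```

Because each atom is tested only against atoms in a bounded number of neighboring regions, the cost scales linearly with the number of atoms, which is the essential point of steps 2-4.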
The zip archive included in the Electronic Supplementary Information contains a folder named "construct_lists_of_interacting_atom_pairs". This folder contains example Fortran code that implements the method described here. Readers can examine this for finer coding details. This code is in the form of modules that should be called by a main program (not included). The main program must supply the required input data (e.g., number of atoms, coordinates for each atom, unit cell information, etc).
The following Sections S4.2, S4.3, S4.4, and S4.5 explain important mathematical equations and procedural details for the four key steps to construct the lists of interacting atom pairs.

Figure S1: Flow diagram for setting up the lists of interacting atom pairs. The inputs followed by (MCLF) are needed only for the MCLF method and not the TS-SCS method.

Input Data
Parameters: dipole interaction cutoff length, multibody screening parameter (MCLF), distance to attenuation ratio cutoff. Unit cell properties: number of atoms, number of periodic directions, lattice vector for each periodic direction, whether to ignore periodic boundary conditions. Properties for each atom in the material: coordinates, unscreened damping radius (MCLF), wp, static unscreened polarizability

Construct arrays listing interacting region pair images
Compute: the sum limit along each basis vector, and the largest body diagonal of one spatial region. Construct using data grouping: one array listing the spatial region pairs having any interacting images, and another array listing which images interact for each particular interacting region pair

Construct two lists of interacting atom pairs
Both the small and large lists contain only translation symmetry unique atom pairs. The small list contains 'overlapping atom images' as determined by the distance to attenuation ratio being less than the distance to attenuation ratio cutoff. For each entry in the small list, the first atom resides in the reference unit cell, while the second atom may or may not reside in the reference unit cell. The large list contains atom pairs for which any images are within the dipole interaction cutoff length. The large list stores only sums collected over all periodic images for an atom pair; no individual image data is stored in the large list.

S4.2 Define basis vectors and unit cell parallelepiped
A parallelepiped of non-zero volume is constructed to enclose the system's unit cell.

3. Three periodic boundary conditions: The basis vectors are set equal to the periodic lattice vectors. The origin is set to (0,0,0).

Define the matrix of basis vectors as the matrix formed from the three basis vectors. Then, the unit cell volume is the absolute value of this matrix's determinant. The reciprocal-space coordinates of each atom are obtained from its Cartesian coordinates using matmul, the matrix multiplication function. Then a modulo is applied to select the image of atom A that resides within the reference unit cell:

inverse_coords(:,atom1) = modulo(inverse_coords(:,atom1), 1.0) (S143)

This ensures the reciprocal space coordinates lie within the interval ([0,1), [0,1), [0,1)).
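A minimal numerical sketch of this wrapping step, assuming a hypothetical triclinic cell with the basis vectors stored as matrix columns:

```python
import numpy as np

# Hypothetical triclinic cell; columns of H are the three basis vectors.
H = np.array([[6.0, 1.0, 0.2],
              [0.0, 5.0, 0.5],
              [0.0, 0.0, 4.0]])
volume = abs(np.linalg.det(H))            # unit cell volume

r = np.array([13.7, -2.1, 9.3])           # an atom image outside the cell
frac = np.linalg.solve(H, r)              # reciprocal-space coordinates
frac_wrapped = np.mod(frac, 1.0)          # analogue of eqn (S143)
r_wrapped = H @ frac_wrapped              # image inside the reference cell
shift = frac - frac_wrapped               # whole-number lattice translations
print(frac_wrapped)
```

After wrapping, every reciprocal-space coordinate lies in [0,1), and the atom has moved by an exact whole-number combination of lattice translations.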

S4.3 Divide the unit cell parallelepiped into spatial regions
First, a few geometric factors are computed from the unit cell parallelepiped, including the whole number of regions along each basis vector (eqn (S150)-(S152)). The ceiling function rounds up to an integer. The factor (3/dipole_interaction_cutoff_length) sets the preferred length of a region to approximately one-third the dipole interaction cutoff length. This guarantees the size of each region is always substantially smaller than the dipole interaction cutoff length while also not creating an unnecessarily large number of regions.
During a loop performed over atoms, each atom is assigned to a spatial region as labeled by three whole number indices in the interval ([1,ref_regions_a], [1,ref_regions_b], [1,ref_regions_c]). Along the first basis vector, the corresponding region index for atom1 is

region_a = NINT(ref_regions_a * inverse_coords(1,atom1) + 0.5) (S153)

with analogous expressions for the second and third basis vectors, where NINT rounds to the closest integer (0.5 being rounded up). If any region does not contain any atoms, this region is marked as empty and skipped in all subsequent analysis.
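Eqn (S153) can be sanity-checked with a short sketch. Fortran's NINT with halves rounded up is emulated here as floor(y + 0.5); the function name is illustrative:

```python
import math

def region_index(frac_coord, n_regions):
    # eqn (S153): region = NINT(n_regions*frac + 0.5), with NINT's
    # round-half-up behavior emulated as floor(y + 0.5)
    y = n_regions * frac_coord + 0.5
    return math.floor(y + 0.5)

# Wrapped fractional coordinates in [0,1) always map into [1, n_regions]
print(region_index(0.0, 4), region_index(0.999, 4))
```

The +0.5 shift inside NINT is what pins frac = 0 to region 1 and frac just below 1 to region n_regions, so no index ever falls outside the allowed interval.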
Finally, data grouping is performed to prepare a list of atoms and their coordinates such that all atoms from the same region are grouped together in this list. The starting and ending index values for each region in this list are computed and stored. This makes it extremely easy to perform loops over all atoms in a particular region without having to do any array searches.
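This grouping is essentially a counting sort with stored start/end offsets. A minimal NumPy sketch (the region assignment array is a hypothetical example):

```python
import numpy as np

# Hypothetical region assignment for 7 atoms across 3 regions
region_of_atom = np.array([2, 0, 1, 2, 0, 0, 1])
n_regions = 3

counts = np.bincount(region_of_atom, minlength=n_regions)
ends = np.cumsum(counts)                 # one-past-the-end index per region
starts = ends - counts                   # starting index per region
sorted_atoms = np.argsort(region_of_atom, kind="stable")

# Loop over all atoms of region 1 without any array search:
atoms_in_region_1 = sorted_atoms[starts[1]:ends[1]]
print(atoms_in_region_1)                 # -> [2 6]
```

Because the start and end offsets are precomputed, iterating over any region's atoms is a contiguous slice, which is both search-free and cache friendly, matching the text's description.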

S4.4 Construct arrays listing interacting region pair images
When constructing a list of representative interacting region pair images, the first region can always be chosen within the reference unit cell. Since there are no periodic translations along a non-periodic direction, for a non-periodic direction the second region in the interacting region pair image also always lies within the reference unit cell. Along a periodic direction, the second region in the interacting region pair image can be a periodically translated image of some region such that this image has some points located closer than dipole_interaction_cutoff_length to some points in the first region. Therefore, periodicsumlimitA, periodicsumlimitB, and periodicsumlimitC are defined such that only region images with integer translation indices

−periodicsumlimitA ≤ L1 ≤ periodicsumlimitA (S154)
−periodicsumlimitB ≤ L2 ≤ periodicsumlimitB (S155)
−periodicsumlimitC ≤ L3 ≤ periodicsumlimitC (S156)

need to be tested when setting up the list of interacting region pair images. For a non-periodic direction, the corresponding periodic sum limit is obviously set to zero. If ignore_PBC = true, then periodic boundary conditions are ignored and all periodic sum limits are set to zero. For a periodic direction, the corresponding periodic sum limit is set to the corresponding one of the following. The longest body diagonal (aka largest_body_diag) of one spatial region is computed by finding the maximum real-space distance corresponding to a change of ((ref_regions_a)⁻¹, ±(ref_regions_b)⁻¹, ±′(ref_regions_c)⁻¹) in reciprocal space coordinates. The real-space distance along these diagonals is computed for all four (±, ±′) combinations, and the maximum real-space distance among these four combinations is chosen.
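The largest_body_diag computation over the four (±, ±′) sign combinations can be sketched as follows (the cell matrix and region counts are illustrative assumptions):

```python
import numpy as np

# Hypothetical triclinic cell (columns of H are the basis vectors),
# divided into hypothetical whole numbers of regions per basis vector.
H = np.array([[6.0, 1.0, 0.2],
              [0.0, 5.0, 0.5],
              [0.0, 0.0, 4.0]])
n_regions = np.array([4, 3, 3])          # (ref_regions_a, _b, _c)

# A region's body diagonals correspond to reciprocal-space changes of
# ((ref_regions_a)^-1, ±(ref_regions_b)^-1, ±'(ref_regions_c)^-1);
# the first sign can be fixed because v and -v have equal length.
diagonals = [H @ (np.array([1.0, sb, sc]) / n_regions)
             for sb in (1.0, -1.0) for sc in (1.0, -1.0)]
largest_body_diag = max(np.linalg.norm(d) for d in diagonals)
print(largest_body_diag)
```

Checking only four sign combinations instead of eight exploits the inversion symmetry of vector lengths, which is why the first component's sign can be fixed.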
The condition for two region pair images to be included in the list of interacting region pair images is that the distance from the 'lower left' corner of the first region to the 'lower left' corner of the second region image should be ≤ (dipole_interaction_cutoff_length + largest_body_diag). Including the largest_body_diag ensures that a region pair image is excluded only when all points in the first region are farther away than dipole_interaction_cutoff_length from all points in the second region image. This list is assembled in linear-scaling computational time by using the following computational efficiencies. First, only translation indices satisfying eqn (S154)-(S156) need to be considered for the second region image. Second, according to eqn (S150)-(S152) the whole number of regions along each basis vector in the reference unit cell is chosen such that the region widths (along directions parallel to each basis vector) are as large as feasible without being greater than one-third of the dipole_interaction_cutoff_length. This means a line segment of dipole_interaction_cutoff_length placed parallel to a basis vector direction could completely cross at most four regions and partially cross at most five regions. Let each region in the reference unit cell be identified by whole number triplets in the interval ([1, ref_regions_a], [1, ref_regions_b], [1, ref_regions_c]). Then let (region_A, region_B, region_C) be the region indices identifying a first region in the unit cell and (i, j, k) be the region indices identifying a second region in the reference unit cell. Then, the second region could potentially have one (or