A collection of forcefield precursors for metal–organic frameworks

A host of important performance properties for metal–organic frameworks (MOFs) and other complex materials can be calculated by modeling statistical ensembles. The principle challenge is to develop accurate and computationally efficient interaction models for these simulations. Two major approaches are (i) ab initio molecular dynamics in which the interaction model is provided by an exchange–correlation theory (e.g., DFT + dispersion functional) and (ii) molecular mechanics in which the interaction model is a parameterized classical force field. The first approach requires further development to improve computational speed. The second approach requires further development to automate accurate forcefield parameterization. Because of the extreme chemical diversity across thousands of MOF structures, this problem is still mostly unsolved today. For example, here we show structures in the 2014 CoRE MOF database contain more than 8 thousand different atom types based on first and second neighbors. Our results showed that atom types based on both first and second neighbors adequately capture the chemical environment, but atom types based on only first neighbors do not. For 3056 MOFs, we used density functional theory (DFT) followed by DDEC6 atomic population analysis to extract a host of important forcefield precursors: partial atomic charges; atom-in-material (AIM) C6, C8, and C10 dispersion coefficients; AIM dipole and quadrupole moments; various AIM polarizabilities; quantum Drude oscillator parameters; AIM electron cloud parameters; etc. Electrostatic parameters were validated through comparisons to the DFT-computed electrostatic potential. These forcefield precursors should find widespread applications to developing MOF force fields.


Introduction
Metal-organic frameworks (MOFs) are a kind of coordination network comprised of metal atoms connected by organic linkers. 1 In this work, we are interested in MOFs that are porous crystals. Because of their nanoporous structures, these materials attract much interest for gas storage, gas separation, catalysis, and other applications. [2][3][4][5][6] Many thousands of MOF crystal structures have been deposited in the Cambridge Structural Database (CSD). 1,7,8,[115][116][117] Most oen, these crystal structures were measured using X-ray diffraction crystallography (XRDC). Since hydrogen atoms contain no core electrons, they diffract X-rays extremely weakly. 9 This makes it much harder to rene hydrogen atom positions than positions for heavier atoms. [9][10][11] Consequently, hydrogen atom positions may be unresolved in some of the reported crystal structures. Other complications include the presence of disordered atoms, solvent molecules, and/or free ions in some of the reported MOF crystal structures.
In 2014, Chung et al. reported a Computation Ready Experimental (CoRE) MOF database that was constructed by rst searching the CSD to identify MOFs and then partially cleaning these structures. 7 The searching step was designed to identify structures containing metal atoms bonded to non-metal atoms that form 3-dimensional networks. The cleaning process was intended to x or discard structures containing disordered atoms and partial occupancies. The cleaning process was also intended to remove solvent molecules and other small adsorbates in the MOF's pores but to retain charge-balancing ions. Finally, missing hydrogen atoms were added to some of the structures. However, this cleaning process was imperfect resulting in some structures with errors. 12,13,115 This 2014 CoRE MOF database was the starting point for our study. It contains a total of 5109 structures, of which 4764 structures were modied during the cleaning process and 345 retained their original CSD structures. 7 Fig. 1 illustrates some of the chemical diversity within this database. Since organic compounds contain carbon, porous metal-linker networks that do not contain any carbon atoms are called metal-inorganic frameworks (MIFs). 14,15 The 2014 CoRE database contains some structures that are MIFs. 13 Several follow-ups were subsequently made to the CoRE MOF database. In 2016, Nazarian et al. reported DDEC net atomic charges (NACs) for 2932 structures. 16 In 2017, Nazarian et al.  115 reported a comparative analysis between the 2014 CoRE MOF database and the CSDderived MOF database of Moghadam et al. 116 Although many thousands of MOFs have been synthesized to date, this is only a tiny fraction of the number of MOFs that could potentially be made. 1 Owing to the large number of different metal atoms and organic linkers that could be combined in various ways, there is a nearly innite number of potentially synthesizable MOFs. 1,[18][19][20] Computational chemistry can be an efficient way to search this vast chemical space to help identify the most promising materials to later synthesize and experimentally test. 21 This will avoid unnecessary efforts to synthesize a large number of materials that do not perform well for the desired applications.
A host of important performance properties for MOFs and other complex materials can be calculated by modeling statistical ensembles (e.g., constant NVE, NVT, NPT, mVT, mPT, NPH, etc.). Here, the properties can be roughly divided into thermodynamic properties that occur at equilibrium and dynamic properties that describe transient system response to nonequilibrium. Gas adsorption isotherms are oen computed in the grand canonical (mVT) or Gibbs ensembles. 22,23 Gas diffusion constants are oen computed in the canonical (NVT) or microcanonical (NVE) ensembles using equilibrium molecular dynamics. 24,25 System behaviors under extreme conditions, such as high pressures, low or high temperatures, applied electromagnetic elds, irradiation, high stresses, and corrosive conditions are of emerging interest.
When modeling statistical ensembles, an interaction model is required to compute the system's energy as a function of chemical conguration. Several different types of interaction models exist depending on the length scale at which the system is modeled. The smallest length scale considers individual electrons using an exchange-correlation theory (e.g., DFT + dispersion functional 26,27 ) to solve the Schrodinger equation for the material's electron distribution (i.e., quantum chemistry). A somewhat larger length scale considers individual atoms interacting via a classical force eld in atomistic simulations. 21,28,29 Coarse-grained and continuum models treat even larger length scales. 30,31 Quantum chemistry using an exchange-correlation theory has different advantages and limitations compared to classical atomistic simulations using a force eld. An exact exchangecorrelation theory (e.g., full conguration interaction) would describe all material types but is extremely computationally expensive. In practice, approximate exchange-correlation theories are used that describe an extremely broad range of different material types with acceptable accuracy and computational efficiency. Because exchange-correlation theories are not material specic, they should not have to be reparameterized for a new material type. In contrast, classical force elds have to be extensively parameterized for individual atoms or atom types. 29,32 Because computing the system's energy is usually much faster using a classical force eld than an exchange-correlation theory, classical atomistic simulations can usually be performed faster and over larger length and time scales than quantum chemistry calculations.  Classical force elds typically contain bonded and nonbonded terms. Bonded terms describe exibility (bond stretching, angle bending, torsion, and/or out-of-plane parameters) and have been incorporated into many exible force elds for MOFs. [33][34][35][36][37][38] Non-bonded terms describe electrostatic interactions (NACs, atomic multipoles, charge penetration, polarizability [39][40][41][42][43] ), London dispersion interactions, and/or short-range exchange-repulsion. 44,45 When optimizing the bonded parameter values, it is orders of magnitude more efficient to t them to atom-in-material forces across multiple geometries than to only t them to the system's total energy across multiple geometries. 46,47 Because each geometry yields many atom-in-material forces but only one total energy, fewer geometries are needed to t the bonded parameter values to forces than to energies.
A classical force eld must contain many-body dispersion and/or many-body polarization to accurately describe system properties over a wide range of conditions. 48 As shown in Fig. 2, dipolar interactions between two particles can be purely attractive, but dipolar interactions between many noncollinear particles are partly attractive and partly repulsive. Consequently, tting two-body potentials (e.g., Lennard-Jones) to two-particle interaction curves gives force elds that oen overestimate liquid-phase densities. 49 If the parameters of the two-body potential are adjusted to yield correct liquid-phase densities, then the two-particle interaction curve will be incorrectly described. 45 This problem can be xed by explicitly including many-body dispersion and/or many-body polarization in the force eld. For example, Kiss and Baranyai showed polarizability must be included in a force eld to describe liquid water accurately over a wide range of conditions. 50 Currently, there is a bottleneck to effectively model statistical ensembles: the quantum chemistry calculations are computationally expensive while the classical force elds are tedious to accurately parameterize. To resolve this bottleneck, either the quantum chemistry calculations should be sped up by orders of magnitude or classical force eld parameterization should be made orders of magnitude easier. Some progress was made, but much more work remains to be done. Car-Parrinello molecular dynamics (CPMD) is a quantum chemistry method in which the wavefunction does not have to be self-consistently solved at each time step, because it evolves via an effective Lagrangian. CPMD improves the computational efficiency of ab initio molecular dynamics, but CPMD is limited to materials containing a band gap (i.e., insulators and semi-conductors). 52-54 VASP can perform ab initio molecular dynamics even for metals, but this requires computing selfconsistent orbitals at each time step. 54,55 Recently, many studies focused on rst-principles derived classical force elds whose parameters are extracted from quantum chemistry calculations. 22,35,40,44,45,49,56,57 Other studies parameterized force elds by tting to experimental data 36,58 or using machine learning (ref. [59][60][61][62].
This work is a large-scale computation of atom types and non-bonded parameters for MOFs. Fig. 3 summarizes our project workow. The 5109 structures from the 2014 CoRE MOF database included 345 structures directly from CSD plus 4764 structures modied by Chung et al. 7 We attempted DFT calculations on all of the modied structures except the largest ones containing >$1700 atoms; however, some DFT calculations did not converge. DFT calculations converged for 4445 structures that formed the calculated group, while the unconverged and large structures comprised 319 uncalculated structures. The acceptance or rejection criteria described in Section 2.4 were then applied to the CSD, calculated, and uncalculated structures. Atom typing was then applied to all of the accepted structures. A total of 8607 different atom types were identied. Various forceeld precursors were computed and reported in the ESI † for 3056 accepted calculated structures.

Electron density calculation
The periodic quantum chemistry calculations of MOFs were performed using the PBE 63 exchange-correlation functional and the VASP 64,65 soware. Frozen-core all-electron calculations were performed using the projector augmented wave 66,67 (PAW) method that uses a scalar-relativistic treatment of relaxed valence electrons and high-level relativistic treatment of frozen-core electrons. 68 The PAW potentials recommended on the VASP website were used for all these calculations. The planewave energy cutoff was 400 eV. Following prior recommendation, 69 the number of k-points along each lattice vector times the lattice vector length was $16Å. If the k-points mesh was 1 Â 1 Â 1 then the Gamma point was used, otherwise we used Monkhorst-Pack 70 k-point grids. Real-space grids were chosen to avoid aliasing errors (i.e., PREC ¼ accurate). 69 The magnetic alignment of unpaired electron spins was optimized starting from the VASP default guess (which corresponds to a ferromagnetic alignment).
Calculations for the small organic molecules in Section 3.3 were performed using Gaussian16 (ref. 71) with the B3LYP 72,73 functional and def2QZVPPDD 74,75 basis set. Geometries of these molecules were optimized at this level of theory. Table 1 lists the computed forceeld precursors. The net atomic charge, atomic dipole, atomic quadrupole, electron cloud parameters, hr 3 i and hr 4 i radial moments, and atomic spin moment were computed in the Chargemol program using DDEC6 partitioning. 69,76,77 The electron cloud parameters t the electron density tail of each atom in the material to an exponential decay function. Specically, the logarithm of the spherically averaged electron density assigned to atom A was t to a straight line ln(r avg A (r A )) z a À br A (1) using least squares regression over the r A values for which 10 À4 # r avg A (r A ) # 10 À2 e bohr À3 . r A is the distance from positionr to atom A's nuclear position. The tted intercept a, slope b, and squared correlation coefficient R 2 are reported. The R 2 values were nearly always > 0.99 indicating nearly perfect linearity. These electron cloud parameters have several uses in molecular mechanics force elds. First, they are useful to describe charge penetration (also called cloud penetration). 76 Second, they are useful to describe short-range exchange-repulsion. 44 van Vleet et al. showed the Born-Mayer exponential term describing short-range exchange-repulsion has an effective exponent of $0.84 Â b. 44 Although their results were derived for iterated stockholder atom 78 (ISA) partitioning, 44 we expect the same relation to hold for DDEC6 partitioning. Third, these electron cloud parameters are useful to parameterize dispersion damping functions (e.g., Tang-Toennies 79,80 ) that keep the dispersion energy nite as the distance between two atoms approaches zero. 44,45 The polarizabilities, dispersion coefficients, and quantum Drude oscillator (QDO) parameters were computed using the MCLF method 74 and soware (ref. 81). The MCLF method yields several different types of polarizabilities and dispersion coefficients. The forceeld polarizability is a non-directionally screened polarizability that is suitable for use as an input parameter in polarizable force elds. 74 This avoids double counting the directional screening, because directional screening naturally arises during the classical atomistic simulation when the force eld is used. 74 The uctuating polarizability is the polarizability associated with the uctuating dipoles of the London dispersion interaction. 74 The static polarizability and static polarizability tensor quantify the system's response to an externally applied constant electric eld. 74 Various dispersion coefficients are associated with different kinds of uctuating multipoles: C 6 (dipole-dipole), C 8 (dipole-quadrupole), C 9 (dipole-dipole-dipole), C 10 (dipole-octupole and quadrupole-quadrupole). 74,82,83 The C 9 coefficients were not printed to the forceeld precursors le, because they can be readily computed 74 from the printed forceeld polarizability and QDO parameters. The MCLF method includes convenient mixing rules (based on a QDO model) to easily calculate each of these dispersion coefficients between unlike atoms using only parameters of the individual atoms. 74 Of key importance, the C 6 dispersion coefficient does not equal the r À6 coefficient of the Lennard-Jones force eld. 84 Because the Lennard-Jones force eld does not explicitly include higher-order dispersion (e.g., C 8 , C 10 , etc.) terms, the Lennard-Jones r À6 coefficient must be made articially higher than C 6 to effectively compensate for the neglected C 8 and C 10 dispersion terms. 74,84 Near the minimum energy separation between two atoms, higher-order dispersion (e.g., C 8 , C 10 , etc.) can contribute $35% of the total dispersion energy. 79 A QDO is a quantum harmonic oscillator containing a pseudoelectron attracted to a pseudonucleus. 85,86 Three QDO parameters were computed for each atom in the material: (a) the pseudoelectron's effective charge, (b) the effective QDO frequency, and (c) the QDO's reduced mass. 74,85 Force elds based on QDO models describe multibody dispersion and multibody polarization beyond the dipole approximation but require advanced simulation techniques. [85][86][87]

Quantifying electrostatic model accuracy
The root mean squared error (RMSE) of an electrostatic model quanties its error compared to the quantum mechanically computed (e.g., DFT) electrostatic potential over a chosen set of grid points: 88 where N grid_points is the number of grid points used to compute RMSE. V model offset is the average difference between V QM (r) and V model (r) over these grid points. 89 These grid points are normally chosen to be outside the material's van der Waals surface. 90 For a MOF crystal, these grid points occur inside the MOF's pores.
In this study, the RMSE grid points were uniformly distributed in the volume of space that simultaneously met all three of the following criteria: (i) the material's total electron density, r(r), was #10 À4 e bohr À3 , (ii) the grid point was no closer than 5 bohr to any atom in the material, and (iii) the grid point was closer than 12 bohr to at least one atom in the material. If any one of these three criteria were not met, the grid point was not used.
The relative root mean squared error (RRMSE) is the ratio of the electrostatic model's RMSE to the RMSE of a null model for which V null_model (r) ¼ 0. 91 Note that V null_model offset equals the average of V QM (r) over the RMSE grid points, V QM avg . Therefore, The RRMSE quanties the fraction of electrostatic potential variations that are not described by the model. For example, an RRMSE ¼ 0.1 means that 90% of the electrostatic potential variations are captured by the model and 10% are not. This RRMSE denition was used in several prior studies 76,88,91,92 coauthored by one of us and differs from the even earlier literature 89,93 by including the potential offsets in this RRMSE denition. These potential offsets must be included in the RMSE and RRMSE denitions, because in periodic materials the electrostatic potential has no obvious spatial position where it approaches zero value and the energy of a system having no net charge is invariant to a constant electrostatic potential shi. In this work, the DFT electrostatic potential V QM (r) was output from VASP using the keyword LVHAR ¼ .true. We wrote an OpenMP parallelized Fortran program to compute RMSE and RRMSE. The RMSE and RRMSE were computed for the following four electrostatic models: (A) NACs only (aka monopole order), (B) NACs plus spherical cloud penetration, (C) NACs plus atomic dipoles (aka dipole order), and (D) NACs plus atomic dipoles plus spherical cloud penetration.

Structure acceptance or rejection
The 2014 CoRE MOF database was developed with goals of xing disordered atoms, adding missing hydrogens, removing solvent molecules, discarding unxable structures, and other cleaning protocols. 7 These goals were partially but not fully achieved. To partially address remaining structural deciencies in the 2014 CoRE MOF database, we performed: (i) reasonability checks on our computed forceeld precursors for the calculated structures and (ii) connectivity checks for all structure categories. Table 2 lists reasons for rejecting a MOF structure.
The sum of bond orders (SBO) was used as a screening criterion for reasonableness. DDEC6 bond orders and SBO for each atom in a material were computed using the Chargemol program. 69,94  ). The accepted SBO ranges for P and As were set to [3.0, 6.0] to accommodate their ability to form $3 (e.g., PF 3 , AsF 3 ) to $6 heuristic bonds (e.g., H 2 PO 4 À , H 2 AsO 4 À ). We do not claim these choices are perfect, but they can screen out some bad structures.
A calculated structure was rejected if any metal atom had negative NAC. In this article, the denition of metal atom included all elements not listed in Table 3 except rare gases (He, Ne, Ar, Kr, Xe, Rn) and At. The rational for this criterion is that metal atoms in MOFs are typically bound to more electronegative elements (e.g., N, O, etc.).
A calculated structure was rejected if any one of the four RRMSE described in Section 2.3 was greater than 0.5. This criterion ensured each of our four electrostatic models described the electrostatic potential with an error less than half of a no charges model. In other words, each electrostatic model described the majority of electrostatic potential variations in the MOF's pores.
Structures were rejected from all categories if they contained any isolated atoms. An atom was considered isolated if it was not connected to any other atom based on the radii listed in Table 5.
For CSD and uncalculated structures, some structures were rejected based on strange connectivity identied by visual inspection. Because this visual inspection was not performed systematically, we do not claim all instances of bad connectivity were identied. Table 4 lists the category breakdown for 2014 CoRE MOF structures. Calculated structures were those for which an electron density distribution was calculated using DFT. Uncalculated structures were non-CSD structures for which the electron density distribution was not calculated. This could be due to one of two factors: (i) the MOF structure was so large that DFT calculation was not practical within our computational budget or (ii) the DFT calculation did not readily converge. CSD structures were those for which the geometry came directly from the CSD database. 7 We do not report forceeld precursors for the   CSD structures, because CSD licensing terms may not allow us to publicly distribute their geometries. The second column of Table 4 lists the total number of structures in each category. The third and fourth columns list the number of rejected and non-rejected structures in each category, respectively. The ESI † contains detailed lists of MOFs in each category. For each rejected structure, the ESI † lists the specic reason(s) that structure was rejected.

Forceeld precursors
Files containing forceeld precursors for 3056 accepted calculated structures are contained in the ESI. † The data collected for each atom in a MOF includes: the coordinates; the net atomic charge (NAC); C 6 , C 8 and C 10 dispersion coefficients; uctuating polarizability; forceeld polarizability; electron cloud parameters; hr 3 i and h r 4 i moments; QDO parameters; atomic dipole and quadrupole; atomic screened static polarizability tensor and isotropic polarizability; atomic spin moment. The data collected for the MOF's unit cell includes: the lattice vectors, the total spin magnetic moment, the net charge, and the RRMSE of the electrostatic potential for electrostatic models including (A) NACs only, (B) NACs plus spherical cloud penetration, (C) NACs plus atomic dipoles, and (D) NACs plus atomic dipoles and spherical cloud penetration. For each MOF, these data were written to a xyz le whose chemical structure can be visualized with the Jmol 95,96 program. In the future, these forceeld precursors could be used as building blocks to construct force elds for these materials. Fig. 4 plots histograms of electrostatic potential RRMSE for the accepted calculated structures. At monopole order (with or without spherical cloud penetration), the histograms peak at RRMSE ¼ (0.1, 0.2). At both monopole and dipole orders, spherical cloud penetration had negligible effect on the histograms. Dipole order showed dramatic improvement over monopole order.

RRMSE of electrostatic potential
Some additional comments are in order. Even though spherical cloud penetration had negligible effect on the RRMSE values, this does not necessarily mean charge penetration effects are not important for constructing accurate electrostatic models. In MOF's, the RRMSE probes volume in space that is likely occupied by an adsorbate atom's nuclear position. Because charge penetration extends over the volume occupied by an adsorbate atom's full electron density distribution, it affects the electrostatic potential over a much larger volume than that probed by RRMSE calculation. For this reason, charge penetration can be more important than indicated by the RRMSE values. Prior studies demonstrated intermolecular electrostatic interaction energies are substantially improved by including charge penetration. 97-99 Therefore, including spherical cloud penetration (aka 'charge penetration') will likely improve the adsorbate-MOF and adsorbate-adsorbate electrostatic interaction energies even though its effects on RRMSE were negligible.
When constructing force elds, decisions have to be made about tradeoffs between simplicity and accuracy. For best accuracy, the force eld should include NACs, atomic dipoles (and possibly atomic quadrupoles), and spherical charge penetration. However, the simplest force eld would include NACs only and neglect atomic multipoles and charge penetration.

Atom types
Atom types are widely used in force elds. 29 The main idea of atom typing is to classify similar atoms into the same atom type to facilitate forceeld parameterization, where all atoms of the same atom type are normally assigned identical forceeld parameters. 32 Biomolecular force elds have used atom types for several decades. [100][101][102][103][104][105][106] The universal force eld (UFF) and its extensions to MOFs (e.g., UFF4MOF) focused on atom types for structural optimizations. [107][108][109] The connectivity based atom contribution (CBAC) method dened atom types in porous materials (e.g., MOFs and covalent organic frameworks) using rst neighbors only. 110,111 Two signicant limitations of the CBAC method are: (i) its atom types are based on rst neighbors only which limits chemical transferability and (ii) the NACs for its atom types were obtained by CHELPG analysis of small fragment clusters but it is unclear how accurately these may reproduce the fully periodic electrostatic potential. 110,111 In our work, atom types are dened based on both rst and second neighbors leading to better chemical transferability. Also, the NACs for our atom types were calculated using the fully periodic electrostatic potential, rather than small fragment clusters. Fig. 5 illustrates our atom typing scheme. When multiple neighbor groups are present, they are sorted by the atomic number of the neighbor atoms from smallest to largest. If a rst neighbor has no second neighbors, its lack of second neighbor is indicated by (0) in the atom type label. For example, the atom type 6[1-(0),1-(0),1-(0),6- (1,1,8)] indicates a central carbon atom with four rst neighbors (H, H, H, C) where each of the rst neighbor H atoms is not directly bonded to any second neighbors and the rst-neighbor C atom is directly bonded to H, H, and O plus the central atom. As another example, the atom type 15 [6-(6),6-(6,6),6-(6,6),29-(7,7,7)] demonstrates that 6-(6) is listed prior to 6-(6,6). As another example, the atom type 7[1-(0),6-(1,1,6),6-(6,8)] demonstrates that 6-(1,1,6) is listed prior to 6- (6,8).
During atom typing, two atoms were considered directly connected if the distance between them was no greater than the sum of the atom typing radii listed in Table 5. These radii have the following history. Starting with the atom connectivity radii of OpenBabel version 1.100.1, we added 0.18Å (i.e., a skin distance was incorporated into the radii). Second, we reduced some transition metal radii to eliminate unreasonable metalmetal bonds and excessively high coordination numbers.
Third, the carbon and oxygen radii were slightly increased to maintain carbon/oxygen-metal bonds aer decreasing the metal radii. These radii values are unique only in an  Paper approximate sense, because small increases or decreases in the radii values may be feasible. Large changes in these radii values are not feasible, because bad atom connectivity would result. These atom type denitions gave a total of 1313 rstneighbor atom types and 7033 second-neighbor atom types for the 3056 accepted calculated MOFs. Of these, 783 rstneighbor and 3015 second-neighbor atom types appeared in more than one MOF.
Is second-neighbor-based atom typing optimal? To explore this question, the averages and standard deviations of 17 different forceeld precursors were computed and listed in the ESI † for each rst-neighbor-based and each second-neighborbased atom type: NAC; C 6 , C 8 , and C 10 dispersion coefficients; uctuating, forceeld, and static polarizabilities; hr 3 i and hr 4 i radial moments; a, b, and R 2 electron cloud parameters; atomic dipole moment magnitude; three QDO parameters; and atomic spin moment. Fig. 6 presents histograms of the standard Fig. 6 Histograms of standard deviations of NACs, forcefield polarizabilities, and C 6 dispersion coefficients for first-neighbor-based and secondneighbor-based atom types. Second-neighbor-based atom types included both first and second neighbors. Each standard deviation was computed across different occurrences of the same atom type. The second-neighbor-based atom types showed much smaller standard deviations than the first-neighbor-based atom types. deviations for NAC, forceeld polarizability, and C 6 across rstneighbor-based and second-neighbor-based atom types. Only atom types appearing in more than one accepted calculated structure were included in this analysis. Second-neighbor-based atom types showed remarkably better chemical transferability compared to rst-neighbor-based atom types. For example, 91.9% of the second-neighbor-based atom types had NAC standard deviations #0.05e, while only 59.5% of the rstneighbor-based atom types did.
This atom typing is theoretically justied, because in organic molecules electronic effects play a major role in group reactivity and induction contributes a signicant portion of that effect. In organic chemistry, the inductive effect is the electron donating or withdrawing ability of a chemical group that can be transmitted to other parts of the molecule through chemical bonds. 112 The inductive effect usually decreases across each bond and is usually limited to 2 to 3 bonds lengths. 112 Our decision to base atom typing on rst and second neighbors is a pragmatic one. As shown in Fig. 6, using only rst neighbors leads to large standard deviations in the calculated forceeld parameters. The standard deviation is greatly reduced by incorporating second neighbors in the atom types. Incorporating third neighbors into the atom type denition would lead to an enormous number of different atom types, thus making forceeld parameterization too tedious. Therefore, second-neighbor-based atom types are optimal. Fig. 7 explores an extreme case of third neighbor inductive effects. In malonic acid (Fig. 7B), the four highly electronegative oxygen groups withdraw electrons increasing the acidity of the circled hydrogen atom compared to the circled hydrogen in propane (Fig. 7A). Specically, the circled hydrogen in Fig. 7B can be stripped away from the carbon atom by strong base while the one in Fig. 7A does not possess the same reactivity. The NACs of those hydrogens are 0.06 for Fig. 7A and 0.18 for Fig. 7B. If these hydrogens are replaced with a methyl group, the methyl carbon NACs are À0.37 and À0.34 respectively. These differences in NACs are within an acceptable range for atoms sharing the same atom type. Table 6 summarizes the breakdown of atom types in the 3608 accepted structures. A total of 8607 different atom types appeared in these structures. 3015 atom types appeared in more than one accepted calculated structure, while 4018 atom types appeared in only one accepted calculated structure. An additional 1574 atom types appeared only in the uncalculated and/ or CSD structures. These results clearly show that many atom types appeared in only one MOF in the dataset. These results also show the nearly innite chemical diversity of MOFs. For example, the 296 accepted CSD structures contained 320 new atom types that did not appear in either the calculated or uncalculated structures.
Do a relatively small number of atom types completely describe a large percentage of the accepted calculated structures? Table 7 summarizes calculations performed to explore this question. Starting with a 'frequency threshold', any MOF containing any atom type that appeared in fewer than 'frequency threshold' different MOFs (within the 3056 accepted calculated structures) was discarded. All atom types appearing in the remaining MOFs occurred in at least 'frequency threshold' different MOFs in the 3056 accepted calculated structures. Then, the number of distinct atom types appearing in the remaining MOFs was counted and reported in the second column of Table 7. The results show the chemical diversity of MOFs in this dataset is well-mixed. In other words, there was not a large segregated subset of MOFs in this dataset that are completely described by a few atom types.
Zn and Cu were the most common metal elements in the 2014 CoRE MOF database. 7 Among the 3056 accepted calculated structures, 785 structures were built entirely of the elements Cu, Zn, C, H, N, and O. These 785 structures contained 1314 different atom types.
Finally, we tested the transferability of atom typing across different classes of materials. Applying the second-neighbor atom typing procedure discussed above to the molecular test set of Bleiziffer et al., 62 we obtained 32 530 atom types from

Conclusions
In this article, we considered the problem of how to automatically extract forceeld precursors from quantum chemistry calculations using DDEC6 atomic population analysis. We focused on calculating non-bonded parameters for each atom in the material: net atomic charge, atomic dipole and quadrupole, electron cloud parameters, atomic spin moment, dispersion coefficients, various polarizabilities, and QDO parameters. Our rst main result was calculated values of these atomistic descriptors for 3056 MOFs that will serve as building blocks to construct classical force elds for these materials.
Regarding the electrostatic parameters, our results conrm the general belief that NACs can oen reproduce the electrostatic potential surrounding a material with acceptable accuracy. Including atomic dipoles dramatically reduced the electrostatic potential errors; therefore, it is preferable to include atomic dipoles in the electrostatic model. Spherical cloud penetration had negligible effect on the RRMSE histograms; however, charge penetration effects can still be important at short interatomic distances.
Our second main result was the practical assessment of atom typing for MOFs. We improved values of the atom typing radii that dene whether two atoms in a material are directly bonded to each other. Our results showed that atom types based on both rst and second neighbors adequately captures the chemical environment, but atom types based on only rst neighbors does not. Specically, the standard deviation of calculated forceeld precursors was relatively large across atoms sharing similar rst neighbor environments but relatively small across atoms sharing similar rst and second neighbor environments. Including third neighbors in the atom type denition would create an unnecessarily large and burdensome number of different atom types. Therefore, atom typing including both rst and second neighbors is optimal.
Our results demonstrate the large chemical diversity of MOFs. 8607 different atom types were identied in the 3608 non-rejected MOFs. These atom types should be useful to develop future force elds for MOFs. Although the MOF to atom type ratio was lower than 1.0 in our study, there are two key reasons to be believe this ratio will improve in the future: (1) the number of chemical elements from which MOFs can be synthesized is practically limited to $100, because heavier chemical elements undergo rapid radioactive decay. Therefore, Fig. 8 Histogram of absolute difference of average DDEC6 NAC between 130k molecules dataset and 3056 accepted calculated MOFs dataset. This histogram includes 780 atom types that have more than one occurrence in each dataset.  5  1167  1120  30  166  219  10  572  682  35  138  188  15  385  508  40  120  151  20  273  334  45  106  107  25  208  268  50  94  77 new MOFs will reuse many of the same chemical elements. (2) There is an almost innite number of different ways to combine common metals and common organic linkers to form MOFs. 18,19 To date, only a tiny fraction of these hypothetical MOFs have been chemically synthesized. 18,19 As more and more of these hypothetical MOFs are synthesized, many atom types will be reused leading to a higher MOF to atom type ratio. Moreover, parameterizing only the popular atom types could describe a substantial percentage of MOFs with higher MOF to atom type ratio.
In summary, more research is urgently needed to develop accurate interaction models for MOFs. Because of the large chemical diversity of real and hypothetical MOFs, it is impractical to evaluate all possible MOFs experimentally. Therefore, computational assessment is required. The interaction model is a crucial ingredient of computational assessment. Two major types of interaction models are exchange-correlation theories (e.g., DFT + dispersion) and classical force elds. The former needs orders of magnitude improvement in computational speed to yield rapid ab initio molecular dynamics. The latter requires extensive parameterization to yield accurate force elds. To actually develop working force elds, the non-bonded interaction terms studied herein will need to be combined with bonded interaction terms (e.g., bond springs, angle springs, torsion and/or out-of-plane parameters) and further parameterization of short-range exchange-repulsion. However, in simulations using a rigid framework approximation (e.g., Monte Carlo simulations of adsorption in rigid frameworks), the bonded interaction terms are not required.

Funding
National Science Foundation (NSF) CAREER Award DMR-1555376 provided nancial support. Supercomputing resources were provided by the Extreme Science and Engineering Discovery Environment (XSEDE). 114 XSEDE is funded by NSF grant ACI-1548562. XSEDE project grant TG-CTS100027 provided allocations on the Stampede2 cluster at the Texas Advanced Computing Center and the Comet cluster at the San Diego Supercomputing Center.
Authors' contributions T. C. performed the calculations and wrote Python scripts to prepare input les, analyze output les, perform atom typing, determine whether to accept or reject structures, and write les containing force-eld precursors and atom type statistics. T. A. M. wrote the programs to compute the electrostatic potential RRMSE and electron cloud parameters. T. A. M. obtained funding for the project. Both authors designed the study, interpreted data, and wrote the manuscript.

Conflicts of interest
There are no conicts of interest to declare.