Apples to apples comparison of standardized to unstandardized principal component analysis of methods that assign partial atomic charges in molecules

Articles by Cho et al. (ChemPhysChem, 2020, 21, 688–696) and Manz (RSC Adv., 2020, 10, 44121–44148) performed unstandardized and standardized, respectively, principal component analysis (PCA) to study atomic charge assignment methods for molecular systems. Both articles used subsets of atomic charges computed by Cho et al.; however, the data subsets employed were not strictly identical. Herein, an element by element analysis of this dataset is first performed to compare the spread of charge values across individual chemical elements and charge assignment methods. This reveals an underlying problem with the reported Becke partial atomic charges in this dataset. Due to their unphysical values, these Becke charges were not included in the subsequent PCA. Standardized and unstandardized PCA are performed across two datasets: (i) 19 charge assignment methods having a complete basis set limit and (ii) all 25 charge assignment methods (excluding Becke) for which Cho et al. computed atomic charges. The dataset contained ∼2000 molecules having a total of 29 907 atoms in materials. The following five methods (listed here in alphabetical order) showed the greatest correlation to the first principal component in standardized and unstandardized PCA: DDEC6, Hirshfeld-I, ISA, MBIS, and MBSBickelhaupt (note: MBSBickelhaupt does not appear in the 19 methods dataset). For standardized PCA, the DDEC6 method ranked first followed closely by MBIS. For unstandardized PCA, Hirshfeld-I (19 methods) or MBSBickelhaupt (25 methods) ranked first followed by DDEC6 in second place (both 19 and 25 methods).


Introduction
Many factors should be considered when assessing the performance of methods for assigning partial atomic charges. [1][2][3] Six factors that bear special consideration here include the following: (1) The method should have a well-dened mathematical limit as the basis set is improved towards completeness (aka 'a complete basis set limit') and have atomic charge values that do not depend on the orientation of the external coordinate system (aka 'rotational invariance'). 3 (2) An assigned atomic charge should correspond to assigning some non-negative number of electrons to the atom. This means the assigned atomic charge should not exceed the atom's atomic number.
(3) Ideally, the method should work reliably across diverse material types including those containing both surface and buried atoms.
(4) The assigned atomic charges should exhibit similar values across similar chemical bonding environments (i.e., good chemical and conformational transferability). While the precise denition of 'similar chemical bonding environments' may vary, one possible denition is based on the two chemical environments having the same bond connectivity graphs including rst and second neighbors. 4 (5) The assigned atomic charges should exhibit strong statistical correlations to related chemical and physical properties. (6) The charge assignment method should be computationally efficient and convenient.
This article is primarily concerned with statistical correlations between different methods for assigning atomic charges. This relates to factor 5 above. Colloquially, one can think of analyzing correlations between different charge assignment methods as a form of democratic voting. The charge assignment method that exhibits the highest summed correlation to all charge assignment methods in the group has been 'voted' by the group to be the most representative of that group.
This 'voting' turns out to be far more important than one might naively expect. Rather than simply being a popularity contest, this 'voting' indicates which quantitative descriptor (e.g., charge assignment method) is positioned to exhibit average or better statistical correlations to each of many related properties. 1 An analogy is useful to understand how this works. Imagine a group of darts. The dart in the group's center always lands closer than approximately 50% or more of these darts to each and every conceivable target. 1 Now if we have a group of methods for assigning atomic charges, a centrally located method would correlate better than approximately 50% or more of these methods to each of many properties related to atomic charges. 1 This frees us from the bias of having to 'choose' which particular target property should be used to rank the charge assignment methods. This revolutionary idea is illuminated by the seven conuence principles that were recently introduced and proved. 1 This turns out to be closely related to standardized principal component analysis (PCA), because the rst principal component (i.e., PC1) is dened as the normalized linear combination of standardized charge assignment methods that maximizes the sum of squared correlations between PC1 and all the charge assignment methods in the group. 1 In standardized PCA, each independent descriptor (charge assignment method in this case) is standardized to have an average of zero and a variance of 1. 5 This standardization gives each independent descriptor equal power to vote. In standardized PCA, the principal components are the eigenvectors of the correlation matrix. PC1 is the eigenvector with the largest eigenvalue, PC2 is the eigenvector with the second largest eigenvalue, and so on. The eigenvalues sum to the number of independent descriptors, N.
Unstandardized PCA gives a larger voting power to an independent descriptor having a larger variance. The average charge transfer magnitude of a charge assignment method equals its standard deviation, which is the square root of the variance. 1 Hence, the QTAIM method (which has a large average charge transfer magnitude) receives more voting power than the Hirshfeld method (which has a small average charge transfer magnitude). 1 In unstandardized PCA, the principal components are the eigenvectors of the covariance matrix. PC1 is the eigenvector with the largest eigenvalue, PC2 is the eigenvector with the second largest eigenvalue, and so on. The eigenvalues sum to the trace of the variance-covariance matrix (i.e., the sum of variances of the independent descriptors). In unstandardized PCA, PC1 is dened as the normalized linear combination of charge assignment methods that maximizes its variance.
Cho et al. reported an unstandardized PCA on the covariance matrix of atomic charges computed by different atomic population analysis methods. 6 There are two aspects of Cho et al.'s data analysis procedure that require reanalysis. As explained by Manz, 1 a small number of bad datapoints were included in the unstandardized PCA of Cho et al. The nature of these bad datapoints was such that the reported atomic charges of a few charge assignment methods summed to the wrong system net charge for a handful of systems. Each of these bad datapoints was either corrected or not included in the standardized PCA of Manz. 1 The second aspect that requires reanalysis is that Cho et al.'s presentation of the PCA results used different numbers of charge assignment methods on different pages of their journal article. 6 Their complete dataset consisted of computed atomic charges for 26 different charge assignment methods applied to ∼2000 molecules from the GMTKN55 (ref. 7) collection. Table II  on page 692 of their article shows the squared correlation  matrix between 18 of these different charge assignment  methods. Table III on page 693 lists the eigenvalues and rst six principal component vectors for unstandardized PCA using 21 of these different charge assignment methods. Table IV on page  694 lists the squared correlation coefficient between individual charge assignment methods and PC1 for unstandardized PCA based on 16 of these different charge assignment methods.
Manz presented standardized PCA for the 20 of these different charge assignment methods that have a well-dened limit as the basis set is improved towards completeness. 1 For comparison, he also presented standardized PCA that included all 26 charge assignment methods. Except for the correction/ removal of a small number of bad datapoints as explained above and the somewhat differing numbers of charge assignment methods included in the PCA, Manz's standardized PCA used the same underlying dataset of molecules and computed atomic charges as Cho et al.
An apples to apples comparison between standardized PCA and unstandardized PCA results for this dataset is critically needed, because of the different conclusions reported by Cho et al. and Manz. For unstandardized PCA, Cho et al. reported on p. 688 of ref. 6: "The single charge distributions that have the greatest statistical similarity to the rst principal component are iterated Hirshfeld (Hirshfeld-I) and a minimal-basis projected modication of Bickelhaupt charges." For standardized PCA, Manz reported that the DDEC6 method had the highest correlation to the main principal component. 1 As explained above, the datasets used in those two studies were not exactly equal. The main purpose of this article is to resolve this issue by providing a clean comparison between standardized and unstandardized PCA for the same dataset.
Another purpose of this article is to develop a better understanding of the large magnitude datapoints in this dataset. This will be done by examining the ranges and box plots for individual chemical elements and individual charge assignment methods. As discussed in the sections below, this produced some interesting and unexpected ndings.
Quantum chemistry calculations of Li 4 C, SiF 4 , and AlF 3 were performed using Gaussian 16 with geometry optimization. 35 These geometries were converged such that the maximum force was less than 0.00045 hartrees bohr −1 and the maximum displacement was less than 0.0018 bohr. Aer geometry optimization, the DDEC6 atomic charges were computed for these molecules and found to match (within AE∼0.01 e) the values reported in Cho et al.'s dataset. The Foster-Boys 36 localized orbitals of these molecules were prepared and plotted in Multiwfn 37 (version 3.6).
Throughout this entire work, the unit for atomic charge is e, which is the absolute value of the charge of one electron.

Results and discussion
3.1 Elemental analysis of each charge assignment method to identify extreme atomic charges Table 1 lists the number of atoms and charge range for each chemical element in the dataset. In Table 1, the number of atoms listed for each chemical element is per charge assignment method. The largest numbers of atoms were for H followed by C followed by O and N. The average, minima, and maxima values listed in Table 1 are for all of the data values across the listed chemical element and charge assignment methods. For example, across 26 charge assignment methods (including Becke), the 8917 × 26 = 231 842 carbon atom charge values had an average = −0.05, a minimum value = −6.73, and a maximum value = 7.46. These average, minimum, and maximum values are provided to give the reader a sense of the range of values present in the dataset.
The charge ranges were unexpectedly large for Al, B, C, H, N, and O. If an atom loses all of its electrons, the largest physical charge it could have would equal its atomic number (i.e., the number of protons in its nucleus). The maximum atomic charges of 5.43 for H, 7.46 for C, and 8.24 for N exceed this physical bound.
The last row in Table 1 refers to all of the chemical elements and represents the entire dataset. 29 907 was the total number of atomic charges reported per charge assignment method; the total number of numeric values in the dataset was 29 907 × 26 = 777 582. The listed overall average atomic charge value of 0.0017 is the average of these 777 582 data values; while this overall average atomic charge is informative, it does not represent anything other than the average of these 777 582 data values. While the average, minimum, and maximum values provide useful insights into the dataset, for fuller understanding of the dataset a more extensive statistical analysis is  required and is provided by the box plots shown in Fig. 1. The data listed in Table 1 and the box plots shown in Fig. 1 are for the atomic charge values not standardized variables. The box plots shown in Fig. 1 were prepared for each chemical element for each charge assignment method. The Becke charges showed an enormously large range with some unphysically large atomic charges for H, C, N that exceeded the atomic numbers of these chemical elements. Furthermore, some of the Becke negative atomic charges for B, H, and C had extremely large magnitudes that are not physically realistic. These results are not currently explainable. In the Becke method, electrons are assigned to each atom using Becke's multigrid integration weights 10 for some chosen set of atomic radii. The Becke method was believed to be a stockholder-type 15 electron density partitioning method that assigns atom-inmaterial electron densities r A [r] using a non-negative atomic weighting function w A [r]: If this were true, then the Becke method should never assign a negative number of electrons to an atom in a material, and thus the Becke partial atomic charge should never exceed the atomic number. Since some of the Becke partial atomic charges for H, C, and N reported in Cho et al.'s dataset 6 exceeded those elements' atomic numbers, there must be an underlying problem with how they were computed. Therefore, I had to remove all the Becke method data from this dataset when performing further statistical analysis.
The last three columns in Table 1 list the average, minimum, and maximum charge of each chemical element across the dataset of 25 charge assignment methods that does not include the Becke method. Except for H, the maximum atomic charge for each chemical element is now less than or equal to its atomic number. To better understand some of the atomic charges with large magnitudes, Table 2 lists details for each instance of a H atom having charge >1.00 and each instance of any other atom having a net charge larger in magnitude than 3.00. The ADCH and HLY methods gave some H atoms with charges >1.00; because these are not stockholder-type charge partitioning methods, they sometimes assign a negative number of electrons to an atom in a material. Several methods assigned atomic charges more negative than −3.00 to the C atom in CLi 4 and/or CHLi 5 . The QTAIM method assigned atomic charges >3.00 to some of the P, S, and Si atoms in several molecules.
Returning to the box plots in Fig. 1, the CM5, EEQ, Hirshfeld, and VDD methods gave the smallest ranges of atomic charges; atomic charges for these methods were between −1.5 and +1.5. The previously computed average charge transfer magnitudes for these molecular systems followed the order Hirshfeld < VDD < Mulliken < ACP < CM5 < ADCH < EEQ </ < QTAIM. 1 From these two observations, we conclude the Hirshfeld and VDD methods consistently give relatively small magnitudes of atomic charges. Behavior of the ISA method is interesting, because although its average charge transfer magnitude 1 is moderate, sometimes it gives outliers with high magnitudes. For example, the atomic charge of C in Li 4 C was −4.02. If each Li atom only retained its core electrons the C atomic charge in this molecule would be −4. The ISA charge of −4.02 appears to indicate a slight loss of core electrons from the Li atoms, which seems physically dubious. For reasons that are not currently understood, for the Ros-Schuit method the Br atom box plot showed an extremely large range compared to the Br atom box plot for all of the other charge assignment methods.  To better understand these results, Fig. 2 plots the Foster-Boys localized valence orbitals of Li 4 C, AlF 3 , and SiF 4 . In Li 4 C, these localized valence orbitals have a tetrahedral symmetry with most of the electron density located on the C atom; however, there is clearly some shared electron density in the bonding regions between the C and Li atoms. The AlF 3 molecule is planar with most of the electron density of the valence orbitals located on the F atoms; however, there is clearly some shared electron density in the bonding regions between the Al and F atoms. The SiF 4 molecule is tetrahedral with most of the electron density of the valence orbitals located on the F atoms; however, there is clearly some shared electron density in the bonding regions between the Si and F atoms. These results reect the element electronegativity values. F is more electronegative than Al and Si. [38][39][40] C is more electronegative than Li. [38][39][40] These orbital plots show the DDEC6 computed atomic charges of −2.88 for C in Li 4 C, 1.84 for Al in AlF 3 , and 1.84 for Si in SF 4 are plausible.

Comparing standardized to unstandardized PCA over identical datasets
In this section, standardized PCA is compared to unstandardized PCA over identical datasets. Although such a comparison is not revolutionary science, it nevertheless is a signicant scien-tic advance in two respects. The prior studies of Cho et al. for unstandardized PCA and Manz for standardized PCA included some bad datapoints and were performed over somewhat different subsets of the same parent dataset. 1,6 Cho et al.'s study included a small number of missing and bad datapoints that were corrected or removed in Manz's study. 1 Moreover, both of those studies included the Becke data that are shown in the previous section to be erroneous. This raises the question of how robust the conclusions of those studies are to the removal of the bad datapoints.
As proved in ref. 1, for standardized PCA some performance measures are robust to corruption of any one of the independent descriptors. This robustness arises, because in standardized PCA each one of the independent descriptors contributes exactly 1.0 to the trace of the correlation matrix. For example, when performing standardized PCA over a set of 20 independent descriptors, each independent descriptor contributes exactly 5% to the trace of the correlation matrix: 1.0/20.0 = 5%. As a consequence, standardized PCA limits the potential impact that could be incurred by an error in one of the independent descriptors. This robustness is obviously a key advantage of using standardized PCA as opposed to using unstandardized PCA.
In unstandardized PCA, the contributions of different independent descriptors to the trace of the variance-covariance matrix can be different. Consequently, a large corruption of one of the independent descriptors can have an uncontrolled impact on the unstandardized PCA results. As shown in Section 3.1 above, the Becke charges in Cho et al.'s dataset were corrupted by a large amount. Because of this, it is not safe to assume that the unstandardized PCA results or conclusions that were reported by Cho et al. 6 would automatically still hold once these bad datapoints are removed. Therefore, the conclusions of ref. 6 cannot automatically be assumed valid once it is discovered that some bad datapoints were included in their study.
In my view, the best way to resolve these issues is to reanalyze the dataset using both standardized PCA and unstandardized PCA with the bad datapoints removed. Such a reanalysis shows which of the previously proposed conclusions are valid and which are invalid (if any). It is absolutely essential to perform and publish such a reanalysis with the bad datapoints removed; otherwise, the unstandardized PCA results and conclusions that were reported by Cho et al. 6 have to be set aside as inconclusive (i.e., as no longer conclusive), because their validity cannot be established without such a reanalysis.
In addition to the issue of bad datapoints discussed above, a second issue that needs to be addressed is the previously published unstandardized PCA included a slightly different subset of charge assignment methods than the previously published standardized PCA. Cho et al. reported that the MBSBickelhaupt and Hirshfeld-I atomic charges were most strongly correlated to the PC1 of unstandardized PCA including 16 charge assignment methods; the DDEC6 and MBIS methods also exhibited almost as high of correlations to PC1. 6 Manz reported that the DDEC6 atomic charges consistently exhibited the highest correlations to PC1 for standardized PCA across 20 methods with a complete basis set limit and across all 26 charge assignment methods; the MBIS, ISA, and Hirshfeld-I methods also exhibited almost as high of correlations to PC1. When including all 26 methods, the MBSBickelhaupt method also exhibited almost as high correlation to PC1 in standardized PCA as DDEC6 and MBIS. 1 The second question that must be addressed is whether differences in the conclusions of those two studies is due to standardized versus unstandardized PCA or whether it is due to the use of slightly different datasets (e.g., the inclusion of sligthly different subsets of charge assignment methods) for the analysis. The only way to denitively address this question is to perform unstandardized and standardized PCA on the same dataset and compare results.
Here, I performed standardized and unstandardized PCA on a dataset of 19 charge assignment methods having a complete basis set limit and across all 25 charge assignment methods. These datasets do not include the Becke method data. Table 3 summarizes the eigenvalues and the percentage of covariance (for unstandardized PCA) or correlation (for standardized PCA) accounted for by each principal component. In all cases, PC1 accounted for between 84.5% to 87.8% of the covariance or correlation while PC2 accounted for #7.1%. When using 19 charge methods with a complete basis set limit, PC2 in standardized PCA accounted for less than one variable's worth of correlation. On the other hand, when included all 25 charge methods, PC2 in standardized PCA accounted for 1.06 variable's worth of correlation; thus, PC2 may be considered signicant in this case.
Coefficients for the rst three principal components and correlation of each charge method to PC1 are listed in Table 4 (19 methods unstandardized PCA), i ) is the following normalized linear combination of the various independent descriptors: where C (k,j) is the coefficient for independent descriptor j in the kth principal component, and X (j) i is the value of independent descriptor j for the ith datapoint. In this work, there are 29 907 datapoints representing the different atoms in materials. In this work, the independent descriptors are the different methods for assigning atomic charges (e.g., DDEC6, Hirshfeld, QTAIM, VDD, etc.). The total number of independent descriptors (e.g., the number of different charge assignment methods) included in the PCA is V, and eqn (5) is the corresponding normalization condition for the kth principal component. 1    For unstandardized PCA with 19 and 25 methods, the QTAIM method (which has the largest average charge transfer magnitude) had the highest coefficient in PC1 but relatively low correlation to PC1. The QTAIM method also had the largest magnitude coefficient in PC2. For unstandardized PCA with 19 methods having a complete basis set limit, the Hirshfeld-I, DDEC6, MBIS, and ISA methods had the highest correlation to PC1. For unstandardized PCA including all 25 methods, MBSBickelhaupt, DDEC6, MBIS, and Hirshfeld-I had the highest correlation to PC1. These results are roughly consistent with Table 6 Principal component coefficients for unstandardized PCA of all 25 charge assignment methods. In the first four columns, the methods are ordered from largest to smallest coefficient in PC1. In the last two columns, the methods are ordered from largest to smallest correlation to PC1. The last column lists the correlation of each charge assignment method to PC1  For standardized PCA, the rank of methods according to correlation to PC1 is always identical to the rank according to coefficient in PC1. 1 For standardized PCA with 19 methods having a complete basis set limit, the DDEC6, MBIS, ISA, and Hirshfeld-I methods had the highest correlation to PC1. For standardized PCA including all 25 methods, the DDEC6, MBIS, MBSBickelhaupt, and Hirshfeld-I methods had the highest correlation to PC1. These rankings are identical to those when the Becke method is included, as previously reported in ref. 1. Specically, rankings of the 19 methods in standardized PCA reported here are identical to those for the 20 methods reported in ref. 1, except the Becke method gets the last (i.e., 20th ranking) when it is added to the dataset. With the exception of CM5 which is effectively tied with MBSMulliken, and i-ACP which is effectively tied with EEQ, the ranking of the 25 methods in standardized PCA reported here are identical to those for the 26 methods reported in ref. 1, except the Becke method gets the last (i.e., 26th ranking) when it is added to the dataset.
For a more complete understanding of rankings in standardized PCA, Table 8 (19 methods) and Table 9 (25 methods) show the method rankings according to three additional ranking criteria: (a) the sum of correlations between all of the individual charge assignment methods and a particular charge assignment method, (b) the number of charge assignment methods having correlation U ab > 0.8 to a particular charge assignment method, and (c) the number of charge assignment methods having correlation U ab > 0.9 to a particular charge assignment method. As proved in ref. 1, ranking criterion (a) is equivalent to ranking

Conclusions
In prior literature, a detailed standardized PCA was performed on a slightly different dataset than a detailed unstandardized PCA, even though both datasets were derived from a common parent dataset. 1,6 The slight differences in datasets made interpreting the differing conclusions of those two works difficult.
To address this issue, herein I compared standardized to unstandardized PCA for the same dataset of partial atomic charges computed across ∼2000 molecules using various charge assignment methods. This analysis was performed both for 19 charge assignment methods having a complete basis set limit and for all 25 charge assignment methods, which do not include the Becke method.
Analysis of maximum and minimum charge values together with box plots for each chemical element for each charge assignment method revealed important information. Most importantly, the reported Becke charges were found to be incorrectly computed. The Becke method is generally believed to be a stockholder-type charge partitioning approach that assigns a non-negative number of electrons to each atom in a material; however, the Becke charges reported by Cho et al. 6 showed several instances of assigning negative numbers of Table 9 Rank of each charge assignment method according to its amount of correlation to other charge assignment methods. The S a and U[a,f] ranking criteria always give the same order of methods. This electrons to atoms. Consequently, the Becke charge data was not included in the PCA of the present study. Many of the charge assignment methods exhibited large charge magnitudes for the Li 4 C, SiF 4 , and AlF 3 molecules. Each of these molecules has two chemical elements with a large electronegativity difference. To understand this behavior better, localized valence orbitals for these three molecules were plotted in Fig. 2. These localized valence orbitals showed high bond polarities with electron density concentrated on the more electronegative atom(s) and in the bonding regions between atoms.
The main takeaways from this work are as follows. First, standardized PCA yielded more consistent rankings both across different ranking criteria and with respect to adding or removing some methods from the analysis. Second, the following ve methods (listed here in alphabetical order) showed the greatest correlation to the rst principal component in standardized and unstandardized PCA: DDEC6, Hirshfeld-I, ISA, MBIS, and MBSBickelhaupt (note: MBSBickelhaupt does not appear in the 19 methods dataset). For standardized PCA, the DDEC6 method ranked rst followed closely by MBIS. For unstandardized PCA, Hirshfeld-I (19 methods) or MBSBickelhaupt (25 methods) ranked rst followed by DDEC6 in second place (both 19 and 25 methods).
For a proper context, the above conclusions of this work must also be considered in light of the following known properties (established in the prior literature not in this work) of these ve charge assignment methods. MBSBickelhaupt is not recommended, because its atomic charges are sensitive to rotation of the external coordinate system. 1 ISA oen gives erratic results for materials with buried atoms. 1,41,42 For molecules, the average charge transfer magnitudes follow the trend MBSBickelhaupt z MBIS z Hirshfeld-I > ISA > DDEC6 z electrostatic potential tting charges. 1 DDEC6 charges have been more thoroughly tested and shown to work across a wider range of material types including many dense solids. 13,43

Conflicts of interest
There are no conicts of interest to declare.