Local structure order parameters and site fingerprints for quantification of coordination environment and crystal structure similarity

Structure characterization and classification is frequently based on local environment information of all or selected atomic sites in the crystal structure. Therefore, reliable and robust procedures to find coordinated neighbors and to evaluate the resulting coordination pattern (e.g., tetrahedral, square planar) are critically important for both traditional and machine learning approaches that aim to exploit site or structure information for predicting materials properties. Here, we introduce new local structure order parameters (LoStOPs) that are specifically designed to rapidly detect highly symmetric local coordination environments (e.g., Platonic solids such as a tetrahedron or an octahedron) as well as less symmetric ones (e.g., Johnson solids such as a square pyramid). Furthermore, we introduce a Monte Carlo optimization approach to ensure that the different LoStOPs are comparable with each other. We then apply the new local environment descriptors to define site and structure fingerprints and to measure similarity between 61 known coordination environments and 40 commonly studied crystal structures, respectively. After extensive testing and optimization, we determine the most accurate structure similarity assessment procedure to compute all 2.45 billion structure similarities between each pair of the ≈70 000 materials that are currently present in the Materials Project database.


Introduction
Crystal structure databases 1-16 play an increasingly important role in materials science, chemistry, and related elds. Publication statistics gathered from Web of Science 17 on November 23, 2019, indicate that this trend started in the early 1990s ( Fig. 1) and that the underlying potential is still not exhausted. The steady increase is (most likely) linked to continuously increasing computing power and memory storage, and it has fostered the creation of many different crystallographic databases that catalog existing materials such as the Cambridge Crystallographic Data Centre (CCDC) in 1965, 1 the Inorganic Crystal Structure Database (ICSD) in 1983, 2,3 CRYSTMET in 1993, 4,5 Pauling File in 2002, 6,7 the Crystallography Open Database (COD) in 2003 8 (previously known as "The American Mineralogist crystal structure database" 9 ), and the Pearson's Crystal Data (PCD) in 2007. 10 Furthermore, computational databases, which mainly use crystallographic databases as a source, and fully hypothetical structure databases are currently being created more and more: the Predicted Crystallography Open Database (PCOD) appeared in 2005, the Materials Project (MP) database 11,12 and the Automatic FLOW for Materials Discovery database (AFLOW) in 2011, 13 the Fig. 1 Publication statistics for research articles that include "database" as keyword in the topic field within the Web of Science™ Core Collection for category "Chemistry, Inorganic & Nuclear." Data was retrieved on November 23, 2019. Furthermore, we highlight the inception years of established materials databases.
Harvard Clean Energy Project (CEP) 14 and the Open Quantum Materials Database (OQMD) in 2013, 15 as well as the Novel Materials Discovery Laboratory (NOMAD) in 2015. 16 As computational resources still continue to grow 18 and to become more omnipresent and accessible, the computational chemistry, physics, and materials science communities have focused their efforts more and more on automation tools for materials database analysis and on employing statistical and machine learning (SML) 19,20 to help expedite materials discoveries and chemical innovations. 21 This includes, for example, predicting properties (e.g., formation energies, crystal structure dimensionalities, phase diagrams, band gaps, elastic moduli, ionic conductivity) of diverse materials from classes and families such as AX binary compounds, 22 M 2 AX ternary phases, 23 delafossite and related layered phases of composition ABX 2 , 24 conventional 25 and double perovskite halides (or elpasolites), 26 zeolites 27,28 and other silicates, 29 and other inorganic materials 30-36 as well as polymers; 37 indicating possible synthesis approaches by screening and predicting synthesis parameters and reactions of inorganic materials, 38,39 metal-organic frameworks, 40 and organic molecules; 41,42 generating interatomic potentials; [43][44][45][46] and expediting ab initio [47][48][49][50] calculations. [51][52][53][54][55] Another important scientic problem is, in this context, the classication and categorization of entire crystal structures and the assessment of similarity between two materials ( Fig. 2A). [56][57][58] Conventionally, crystal structures are characterized by their chemistry, crystal system, and space group. Other schemes employ the coordination number and pattern of the constituting atomic sites. 59 Because of the plethora of ways to classify structures, dening and automatically nding prototype structures is currently a very active research area. [60][61][62] In particular, the usage of coordination number and pattern has culminated in a larger current effort of the community to leverage ngerprinting. 58,[63][64][65][66] This is the process of combining crucial information about the structure and/or its constituting local environments around each atom into a vector that represents the structure as a whole and includes, for example: a twodimensional ngerprint based on simulated diffraction patterns; 57 the Coulomb matrix; 67 a many-body tensor representation; 68 deep tensor neural networks; 45 Voronoi tessellation; 69 radial distribution functions with 70 and without 71 incorporating partial atomic charges; and local environmentbased crystal ngerprints. 72 Crystal structure classication approaches that are specically based on coordination information of the constituent atomic sites have the advantage that, apart from globally classifying the structure, they carry easily interpretable local information. This facilitates to ensure causality between the descriptor-property relation and the underlying mechanism, 73 when using the corresponding local coordination descriptors for design rules or machine learning applications. A critically important ingredient is then the effectiveness of and comparability across coordination site descriptors (Fig. 2B) and the resulting site ngerprints that characterize the coordination environment.
To address this we introduce here (i) a new neighbor nding method, (ii) several new local structure order parameters (LoStOPs), which are optimized with a tailor-made Monte Carlo procedure, as well as (iii) new site ngerprint and (iv) new structure ngerprints (Section 2), which are freely available through the python 74 packages pymatgen 75,76 and matminer. 65,77 We extensively test and optimize the new tools on known coordination environments and commonly investigated crystalline prototype materials with the main purpose of assessing site and structure similarity (Section 3).

Methodology
In this section, we introduce our methods for nding neighbors of a given central site in a crystal structure, [78][79][80] for performing pattern matching on the resulting coordination environment, 81,82 and for using that information to generate ngerprints that aim to characterize coordination environments and, ultimately, entire crystal structures on the basis of the coordination descriptors. Furthermore, we optimize our novel coordination environment descriptors [local structural order parameters (LoStOPs)] to improve inter-motif comparability, 83 and we describe the similarity measures that we test for comparing site and structure ngerprints. All of the methods and algorithms listed here are freely available through the local_env module in pymatgen 75,76 and the featurizers module in matminer. 65,77 2.1 Neighbor nding approaches Two general and very popular neighbor nding approaches are employed in this work, which use (1) distance 80 and (2) topology-based, 78,79 information, respectively, to decide which neighbors are included in the near neighbor list. In both cases, an initial tentative neighbor list is constructed with a large hard cutoff radius (typically: 7-10ångström), which can, however, be dynamically increased, as we explain below. The methods are therefore approaches how to prune this initial (long) neighbor list.
The rst approach, which we call "minimum distance" neighbor nding (MDNF), consists of 3 basic steps (Fig. 3A): Find neighbor k that has the smallest distance, We choose the label "minimum distance" over "relative distance" for this neighbor nding approach in order to avoid confusion with similar methods 80 that use bond lengths 84 and/ or atomic/ionic radii 85 to compute dimensionless (relative) neighbor distances.
The second approach (VNF) uses Voronoi decomposition 78,79 to identify neighbors from the tentative list by employing the solid angle as a weighting measure. 86 In this case, we search for the largest solid angle among all tentative neighbors, divide all solid angles by this maximum angle, and use a threshold (typically, 0.05) to cut out all neighbors that have a fractional solid angle that is smaller than the threshold.
A third neighbor-nding scheme, which we label "Crys-talNN" (CNN), assigns probabilities to multiple coordination environments (Fig. 4). The algorithm begins similarly to the Voronoi strategy in which each neighbor is assigned a weight based on solid angle and facet area. Next, all weights are normalized such that the maximum weight is 1. All distinct values of normalized weight (multiple neighbor sites can belong to the same weight) are arranged from 0 to 1. For each distinct weight, the probability of all sites with that weight or greater being considered neighbors is proportional to the integral of the area under a curve (in our case, a semicircle that has a value of zero at zero weight and a value of 1 at a weight of 1) starting from 0 and ending in that weight. Thus, neighbors with higher weights are given a larger "section of the pie" in terms of likelihood to be part of the coordination environment. Also note that this method computes coordination likelihoods, w CN¼i , on the basis of these neighbor coordination probabilities, which quantify the probability that a given coordination environment  Schematic of the novel neighbor-finding approach, "Crys-talNN". (A) An example neighbor environment; the Voronoi weights of each neighboring site can be considered proportional to the length of its corresponding colored lined segment. (B) Demonstrating the calculation of coordination probabilities from the normalized Voronoi weights. For example, the section of the semicircle between the pink and purple segments indicates the small probability that only the strongest neighbor (pink) should be considered a neighbor. The area between the purple and blue segment indicates the larger probability that the system should be 2-coordinated (the two strongest neighbors, pink and purple). The highest coordination probability for this particular system would be 4-fold coordinated, corresponding to pink, purple, blue and green sites being neighbors.
should be considered i-fold coordinated. In an upcoming paper, we investigate the performance of various existing methods 79,80,84,85,87,88 and CNN to predict coordination numbers in inorganic materials, where our novel method performs particularly well.

Local structure order parameters
We use local structure order parameters (LoStOPs) such as those introduced by Peters 81 (q bcc ) and by Zimmermann et al. 80,82 (q tet and q oct ) to determine the degree to which the angles in a given observed coordination environment agree with those in the perfect target environment. The LoStOPs are being increasingly exploited for rating the feasibility 89,90 of zeolites, 27,28 nding interstitials and evaluating diffusional pathways in materials, 80 hierarchically visualizing structural similarity, 91 and predicting the magnetic ordering of inorganic materials. 92 In this work, we vastly extend the existing library from 3 to 20 by introducing new LoStOPs ( Table 1). The new LoStOPs permit the detection of both highly symmetric motifs (e.g., Platonic solids 93 such as a tetrahedron or an octahedron) as well as less symmetric motifs (e.g., Johnson solids 94 such as a square pyramid). This is possible because we change our ansatz slightly while keeping the key ideas.
We still test whether a given coordination environment (e.g., the blue T-motif in Fig. 5) resembles a perfect target motif (gray "template" in Fig. 5) by using the neighbors of the central site to locally set up spherical coordinate systems. 81,82 And, we still check all permutations explicitly; that is, we use each neighbor as a tentative North pole for creating the coordinate system (marked with a dark-blue "N" in each conguration) and each remaining neighbor as a prime meridian. We also stay with the general strategy of using cosine functions and Gaussian kernels [without normalization constant 1=ð ffiffiffiffiffiffi 2p p sÞ] to penalize positions of the remaining neighbors that are not at expected positions. The crucial difference to our earlier motif resemblance metrics 80-82 is that we do not average over all permutations anymore. Instead, we use the highest motif resemblance, given all the individual resemblance values, q m,j , each of which is obtained with one single neighbor j as the North pole for resemblance evaluation to motif type m around central site i (Fig. 5). For example, the LoStOP for the T-shaped coordination environment, q T , is given by:  where N nn is the number of those near neighbors that are to be considered for LoStOP calculation, q ijl is the polar angle of neighbor l (i.e., the angle between North pole neighbor j, central site i, and neighbor l), Dq is the parameter penalizing positions of l that are not forming a perfect 90 angle with the bond between i and j, and neighbor k is used to construct the prime meridian; hence, 4 ijkl is zero for l ¼ k. Note that we also redene the tetrahedral and octahedral LoStOPs using the new ansatz regarding the permutations.
The new LoStOPs allow detection of even more motifs than suggested by Table 1 (cf., Fig. 6) because the linear/bent LoStOP can be used for various bent angles, the trigonal bipyramid LoStOP can be used for regular see-saw motifs, and we can use q tet to detect triangular non-coplanar arrangements with tetrahedral angles. Later, when we dene site ngerprints, we dene multiple instances of the bent LoStOP with different target angles. Note that the new LoStOPs, just as the originals ones, are invariant to translational and rotational operations, which represents an important prerequisite to be used as an element of a numerical materials ngerprint. 63 Furthermore, the new LoStOPs still smoothly vary between 0 and 1, and a value of 1 ags, as usual, perfect resemblance with the underlying target motif, whereas 0 indicates no resemblance or match.
2.2.1 Optimization. Any LoStOP computation requires the a priori choice of certain calculation parameters such as the penalty for the neighbor that is closest to the South pole position in an octahedron-like environment to not exactly form a straight line with the (tentative) North pole neighbor and the central site. Our goal for setting these parameters was to maximize the comparability between different motifs (e.g., degree of resembling a perfect tetrahedron equals degree of a perfect square pyramid). Consequently, we had to make a decision about which distortion in one motif (e.g., moving one of the non-North pole neighbors in a tetrahedron slightly towards the South pole position) should be compared to another distortion in a different motif (e.g., moving two neighbors in the basal plane of the square pyramid closer together). Identifying all possible elemental distortions would have been unfeasible, notwithstanding intermediate distortion pathways (cf., Fig. 4 in ref. 80). Also, it would have been somewhat arbitrary to weight the different elemental and intermediate distortions for their occurence or relevance: should they contribute equally or are some less likely or less important than others?
To circumvent these issues we decided to employ a numerical procedure that leverages our recently introduced Einstein crystal test rig procedure. 80 This order parameter testing framework assigns Gaussian-distributed random perturbations to all sites in the initially perfect coordination environment using the polar form of Box-Muller transforms 95 as implemented in numpy. 96 The resulting spatial distributions of the atoms around their perfect motif positions resemble those in an Einstein crystal or molecule. [97][98][99] The atom displacement distribution width, s EM , is an input parameter and provides a well-controllable way to a priori dene the average distortion degree of the entire coordination environment. Thus, it also provides a denition for comparing distortion states between different motifs. Furthermore, because we aim at the assessment of real materials and their coordination patterns and because those can be subject to thermal uctuations, we, hence, use a physically meaningful reference model.
As a reference point, we use the Einstein molecule response behavior of the octahedral order parameter, q oct . For varying Einstein molecule distortion degrees (s EM ¼ 0.01/0.1 in 0.01 increments), we compute histograms of the order parameter by perturbing all atoms in an octahedral coordination motif 10 000 times. From each density distribution p(q oct |s EM ) (blue line in Fig. 7A), we compute cumulative probability distributions P(q oct |s EM ) (gold-brown line and points), which are smooth functions of q oct . To reduce the number of data points we This journal is © The Royal Society of Chemistry 2020 RSC Adv., 2020, 10, 6063-6081 | 6067 consider only 9 values of the cumulative distributions in the following: the points at P ¼ 0.1, 0.2, 0.3, ., 0.9. Below, we use and refer to these points in an inverse manner: the order parameter values as a function of the Einstein molecule distortion degree for a given cumulative probability value, q(s EM |P).
Using the octahedral order parameter results as reference data, our optimization procedure for any remaining order parameter (e.g., trigonal pyramid, q tri_pyr ) is a basic Monte Carlo approach and consists of following steps: (1) Create a set of new trial calculation parameters (e.g., for q tri_pyr , the penalty, Dq, for polar angles being different than their expected values of 90 ) by randomly perturbing the current parameters within a pre-dened maximum range.
(2) Compute the q(s EM |P) data (in Fig. 7B: gold-brown points) with the motif for which the OP is designed (here: q tri_pyr and a trigonal pyramid).
(3) Calculate the mean absolute error (MAE) between these data points and the octahedral reference at the same Einstein molecule distortion degrees (i.e., sum up the absolute difference between points and lines in Fig. 7B).
(4) Compare the new MAE, D new , to the MAE resulting from the previous parameter set, D old . If exp[Àk MC (D new À D old )] is larger than a random number drawn from a uniform distribution, then save the new trial parameter set; else, continue to use the previous parameter set. (5) Go to step 1. We usually perform 2000 MC trials and, in each separate trial, we compute the q i (s EM |P) data by 1000 Einstein molecule perturbations. We carry out at least two optimization runs. In the rst optimization, we used a small MC parameter, k MC ¼ 100, and a larger maximum parameter perturbation range, which allowed sampling of a larger parameter space. The second run proceeded with a larger MC parameter and a smaller maximum perturbation range, which forced the system to quickly go to the (nearest) MAE minimum. The rst optimization served as a check whether or not there is another minimum that might even have a lower MAE than the one that is closest to the initial state during optimization. Finally, the inset of Fig. 7B indicates that the procedure is appropriate because we obtain a convex parameter space.

Site ngerprint
We dene site ngerprints on the basis of the above described coordination likelihood and local structure order parameters. The site ngerprints are single-column vectors, the elements of which exhibit an ordered structure. Specically, the different coordination features are arranged in blocks that represent well-dened coordination numbers so that a site ngerprint v can be written as: where v i denote the (sub)ngerprint that contains all coordination features fullling the condition that the coordination number is i. For example, v 4 can carry the likelihood of being 4fold coordinated, the square planar LoStOP, and the tetrahedral LoStOP, but it cannot carry the likelihood of being 3-fold coordinated or the bcc (8-fold coordinated) LoStOP. While it might be necessary to increase the length to also include v 13 and v 14 , our tests have shown that there is little benet of doing so because the vast majority of sites in crystal structures typically have a coordination number #12. In the following, we describe specic details of three site ngerprint (CrystalNNFingerprint, OPSiteFingerprint, and ChemEnvFingerprint) as they are currently implemented in matminer 65,77 (version: 0.3.3). Note that there is a mutual strategy in computing the rst two ngerprints: grouping the features according to their underlying coordination number, as outlined in eqn (3). The CrystalNNFingerprint (CNN ngerprint) computes the ngerprint of a given site i as follows. First, we choose the coordination features to be included (e.g., w CN¼4 , q sq_plan , and q tet ). Second, the neighbors of site i are determined with the CrystalNN neighbor-nding algorithm. Then, we loop over all theoretically possible coordination numbers between 1 and the maximum of the set of coordination numbers that underly all This journal is © The Royal Society of Chemistry 2020 chosen features from step 1. For each coordination number j, we compute the coordination likelihood, w CN¼j . Subsequently, we compute each LoStOP (e.g., q tet ) that is to be considered for this coordination number (here: CN ¼ 4) using j neighbors (here: 4) with the highest coordination weight, we multiply the value with w CN¼j (here: q tet Â w CN¼4 ), and we then add the resulting value to the (growing) site ngerprint vector. If there is no feature type for a given coordination number j, a single zero entry is added for the entire j-block. Furthermore, note that the implementation of the CNN site ngerprint in matminer 65,77 features two convenient presets for rapid calculation setup: the "cn" preset only computes coordination likelihoods (w CN¼j ), whereas the "ops" preset adds all available LoStOP features in addition to the w j 's.
Similar to the CNN site ngerprint, the OPSiteFingerprint (OPS ngerprint) rst requires the choice of the features to be included. The main differences to the CNN ngerprint are: (1) The minimum distance neighbor nding method is used with the default fractional cutoff (1.1).
(2) A more elaborate binning scheme is used to determine whether neighbors belong to the same shell or not, which basically employs a bin width variation approach.
(3) Based on the variational neighbor nding results, multiple values for a given LoStOP are obtained, from which the most stable (i.e., most frequently occurring) value is extracted via histogramming.
(4) Coordination weights (w j ) are not used (i.e., the ngerprint fully relies on the LoStOP features). However, an additional distance variation factor, f d , can be chosen which is, for a motif with N nn neighbors around central site i given by: Lastly, the ChemEnvFingerprint (CE ngerprint) makes full use of the ChemEnv module in pymatgen, 75,76 which provides alternative tools for automatically identifying the coordination environments of atoms in materials. 100 Two presets are available ("simple" and "multi_weights"), which directly relate to the corresponding ChemEnv strategies. The principal neighbor nding approach is Voronoi decomposition (VNN). The features that are computed with the CE ngerprint are the continuous symmetry measures 101 (CSM) between a given motif and all available ideal coordination environments supported by ChemEnv. 100 In particular, the considered environments are: (1)

Structure ngerprint
On the basis of the site ngerprints of all atoms in a crystal structure we can compute meaningful structure ngerprints (Fig. 8). Our approach consists of four steps: (1) Choose a site ngerprint type.
(2) Choose the statistics to be computed [e.g., only the mean or the mean, the standard deviation, the minimum, and the maximum].
(3) Compute the site ngerprint feature vector of each atom in a structure. Note that all feature vectors have the same length and that each element in one site ngerprint is of the same type (e.g., tetrahedral LoStOP value) as the element at the same location in a ngerprint of another site.
(4) Calculate statistics across all values of a given feature vector element type (e.g., tetrahedral LoStOP value) and arrange them to a new ngerprint that is representative of the coordination patterns in the entire structure. In analogy to the site ngerprint, the structure ngerprints are arranged in an ascending order with respect to the coordination number underlying the different feature vector elements (e.g., rst come the statistics values from feature types relating to CN ¼ 1, then CN ¼ 2, .). For example, if we consider a generic site ngerprint that is only based on site features w CN¼1 , q sgl_bd |CN ¼ 1, w CN¼2 , and q L |CN ¼ 2 and if we, furthermore, only consider the mean, x i ; and the standard deviation, s x i , as statistics types to be computed, we obtain:

Similarity measures
We consider three common similarity measures for comparing two ngerprint vectors: the Euclidean distance, the dot product, and the cosine similarity. The Euclidean distance, d, is dened as the L 2 norm of the ngerprint difference vector, v i À v j : where v i denotes ngerprint vector i and N dim the size (or, number of elements) of the vector. The distance can assume values between 0, indicating highest possible similarity, and ffiffiffiffiffiffiffiffiffiffi N dim p for lowest theoretically possible similarity, the latter for the condition that the vector elements are constrained between 0 and 1. The L 2 norm is a frequently used (dis)similarity measure (cf., ref. 73 and references therein).
The dot product, s dot , is dened by: whereas the cosine similarity, s cos , is given by: Conceptually, the cosine similarity is especially attractive because we operate in positive space (i.e., vector elements are between 0 and 1) and the similarity measure itself will thus also assume values between 0 and 1. However, note that there should be straightforward ways to convert the distance from a non-normalized dissimilarity measure to a normalized similarity measure; for example, via: Finally, we measure the dissimilarity between two probability density distributions p 1 and p 2 via the overlapping coef-cient (OVL), 102 which is dened by the following integral: where X denotes the variable or observable over which the two distributions are dened.

Results & discussion
In this section, we present the benchmarking results for (i) using site ngerprints that are based on coordination number likelihood and the local structure order parameters (LoStOPs) to distinguish different local coordination environments and (ii) using structure ngerprints that are, in turn, based on site ngerprints to distinguish different prototype crystal structures.

Site ngerprints
We systematically test the performance of the here introduced site descriptors [local structure order parameters (LoStOPs)] in conjunction with the concept of a site ngerprint to help distinguishing different local coordination environments by employing 61 standard coordination motifs. 103,104 The motifs are available via the ChemEnv module 100 in pymatgen. 75,76 In order to focus on the LoStOPs performance we neglect the coordination number likelihood, and we decouple the pattern matching part (i.e., LoStOP calculation) from the neighbor nding part. The latter is possible because the separate motifs are well-dened by identication of a central site and the neighboring sites as provided by the ChemEnv module. 100 The (simplied) site ngerprint that we use to distinguish unique coordination environments is a vector, v OP , with 37 components. The vector elements, v OP i , are zero if the (observed) motif coordination number is not equal to the target coordination number, whereas the vector element is q j if the coordination numbers are equal: Since we only have a single motif-specic order parameter for coordination numbers beyond 8, we additionally evaluate the bond-orientational order parameters 105 q 2 , q 4 , and q 6 for coordination numbers from 8 to 13 because those OPs are also helpful in discerning different structural motifs. 80,106,107 Thus, the resulting vector is: 6070 | RSC Adv., 2020, 10, 6063-6081 This journal is © The Royal Society of Chemistry 2020 where q bent (a) refers to the bent LoStOP with a target angle of a, q tet |CN ¼ 3, aims at identifying trigonal non-coplanar environments, and q tri_bipyr |CN ¼ 4 should work together with q tri_pyr |CN ¼ 4 to identify regular see-saw motifs (angle: 120 ). In Fig. 9, we compare results obtained using four site ngerprint similarity metrics: the square root of the dot product, ffiffiffiffiffiffiffi ffi s dot p ; (top le), the modied (top right) and conventional (bottom le) cosine similarity, s * cos and s cos , respectively, and the distance (or L 2 norm), d, (bottom right). By construction of the simplied site ngerprint v OP , the square root of the dot product and the cosine similarities are zero (blue color) for any two site ngerprints obtained from coordination environments that have different coordination numbers [e.g., tetrahedron (CN ¼ 4) vs. octahedron (CN ¼ 6)] because those are strictly orthogonal. The conventional cosine similarity is, however, in many cases close to unity (yellow color) when we consider coordination environments which have the same coordination number (cf., yellow/gray triangles along the diagonal in bottom le panel).
To understand this potentially unexpected behavior let us consider some concrete examples: the 6-fold coordinated motifs octahedron, pentagonal pyramid, and trigonal prism. The three non-zero elements of the ngerprints are q hex_plan |CN ¼ 6, q oct |CN ¼ 6, as well as q pent_pyr |CN ¼ 6. Removing the zero entries, the resulting motif ngerprints are: Thus, the dot products are 1.02, 0.53, and 0.62 for motif pairs oct-pent_pyr, oct-trig_prism, and trig_prism-pent_pyr, respectively, whereas the corresponding cosine similarities are 0.8, 0.83, and 0.997. Since the maximum dot product is around 3.2 given our motif test set, all three dot products are small compared to this upper bound. On the contrary, the cosine similarities are all close to the inherent upper bound of 1. Evidently, the normalization term kv OP i k Â kv OP j k changes the dot products in an undesirable manner, which is true in particular for the last pair (trig_prism-pent_pyr). The problem is that the "length" of the ngerprint carries actually important information. The "angle" or alignment of the ngerprints alone is, thus, not sufficient information for similarity purposes (it is however crucially important and works well if we consider coordination environments with different CN, as we have seen). To clarify this point further consider the pair pentagonal pyramid-trigonal prism. The ngerprint of the rst motif is approximately the same as the ngerprint of the second motif multiplied by 2: v OP pent_pyr z 2 Â v OP trig_prism . Although the two ngerprints are parallel, they have very different interpretations. The rst ngerprint, v OP pent_pyr , has an order parameter that equals 1 and, thus, ags a perfect motif, whereas all elements of the second ngerprint, v OP trig_prism are markedly smaller than 1 (maximum: z0.5), thus, signifying no perfect motif match (given our set of LoStOPs). Hence, the cosine similarity can be a misleading similarity metric, especially if we consider samecoordination number motifs. Before moving on, we like to highlight at this point the nice symmetric behavior of q pent_pyr and q oct : q pent_pyr gives a value of 0.5 for a perfect octahedron and, q oct , in turn, gives 0.5 for a perfect pentagonal pyramid.
Omitting the normalization term of the cosine similarity but taking the square root of the dot product to preserve the units, we obtain a similarity metric that can alleviate the problem with putatively similar same-coordination number environments (cf., top le panel in Fig. 9), as can replacing the typical normalization constant kv OP i k$kv OP j k with max 2 (kv OP i k, kv OP j k) to some extent (top right panel). So, choosing the square root of the dot product instead of the cosine similarity has here the advantage that the vast majority of coordination environments are identied as dissimilar, as expected. On the other hand, we lose the nice behavior of the similarity metric to be constrained between zero and unity. The majority of square root of the dot products for same-coordination number motifs are #1. Thus, ffiffiffiffiffiffiffi ffi s dot p ¼ 1 seems to be a reasonable ad-hoc threshold value for distinguishing any coordination environments by means of the here introduced site ngerprints and the square root of the dot product similarity metric.
Finally, the distance dissimilarity metric (bottom right panel), for which we expected high values for the majority of cases (i.e., also blue color because the range was inverted), gives a rather diffuse picture. A subtle trend might be seen because high (z2: blue) and moderate (z1: gray) values are predominant along the axes, whereas small distances (z0: yellow) accumulate in the center of the graph. But the overall results strongly suggest that the square root of the dot product is the best similarity metric for coordination environment comparison using the here introduced order parameter-based site ngerprints.
The detailed analysis of the site ngerprints underscores that the LoStOPs can reliably distinguish between motifs sharing the same coordination number (e.g., tetrahedron vs. square planar). This capability has already been exploited to develop a computational tool that can automatically detect sites in crystal structures (e.g., "tet" and "oct") and translate the information into human readable text (e.g., "This site is a tetrahedral site"), 108 thus, demonstrating the unique power of the here introduced tools and concepts.

Structure ngerprints
The order parameter-based site ngerprints work well for distinguishing many different isolated coordination environments, especially if the square root of the dot product is used as a similarity metric. In this section, we aim to leverage this newly discovered powerful capability to accomplish a more complex, yet exceptionally important, task in materials science: automatically distinguishing and quantitatively comparing different (prototype) crystal structures using the earlier introduced concept of structure ngerprints.
In order to thoroughly test our structure ngerprints we have constructed a benchmark test set consisting of 40 groups of This journal is © The Royal Society of Chemistry 2020 RSC Adv., 2020, 10, 6063-6081 | 6071 relevant prototype structures (Fig. 10) ranging from metals over semiconductors to insulators. Some of the groups stem from one of our previous papers, 80 but we have systematically and signicantly increased the number of groups using a canonical reference: "The Major Ternary Structure Families" by Muller and Roy. 59 We included the main structure of any structure family given by ref. 59. For example, we included regular spinel, but we did not include spinels with Jahn-Teller distortions such as hausmannite. In the future, we will further increase the test set. Given a structure prototype described in ref. 59, we proceed by identifying the respective structure in the Materials Project 11 database-typically via the chemical formula and the space group. Subsequently, we use pymatgen's structure matcher in conjunction with a framework comparator for nding more structures of the same (framework) prototype but with different chemistry, as we did previously. 80 This procedure yields 6528 structures, each of which belongs to one of the 40 prototype groups and all together amounting to a test set that comprises almost 10% of the entire MP database 11 (currently: 12 69 640 inorganic structures).
We consider the cosine similarity and the distance dissimilarity metrics for testing the performance of distinguishing prototype structure groups on the basis of our structure ngerprints. Furthermore, we test 3 different site ngerprint denitions with different options and presets, resulting in 8 distinctly different site ngerprint types, and we investigate 7 This journal is © The Royal Society of Chemistry 2020 different combinations of tentatively relevant statistics types: the mean of each coordination feature across all sites; mean and maximum; mean and minimum; mean and standard deviation; mean, standard deviation, and minimum; mean, standard deviation, and maximum; as well as mean, standard deviation, minimum, and maximum. In summary, we thus test Fig. 10 The 40 prototype structures considered in this work for testing the performance of structure similarity assessment with our novel structure fingerprints. In parentheses, we provide the number of materials that we found in the MP database 11,12 for a given prototype group. 2 Â 8 Â 7 ¼ 112 different ways of assessing structure similarity for each of the 40 structure groups separately and for all groups together. Note that each structure group is weighted equally; that is, we remove the bias of a group that has many structure members (e.g., bcc + CsCl + Heusler: 3366) in comparison to a group that only features a few structures (low-cristobalite: 2). The best performing combination (OVL: 1.9%) is the Crys-talNNFingerprint with "ops" preset and all additional ags turned off, mean and maximum as statistics types, as well as distance as comparison metric (Fig. 11). The mean as statistics type alone already provides excellent results (OVL: 2.6%), which can be regarded as a reection of Pauling's 5th rule, the Rule of Parsimony: 109 "the number of essentially different kinds of constituents in a crystal tends to be small." Furthermore, it is interesting that the "cn" preset variant, which uses only coordination number likelihoods, w CN¼j , and, thus, no order parameters at all, performs similarly well (OVL: 1.8%) while having 183/72 ¼ 2.5 times less features in the structure ngerprint. Concepts like the Bayesian 110 or Akaike information criterion 111 that aim at guiding model selection might suggest here that we added too many parameters to our model. However, we stress that the local structure order parameter features are not simply "random" features that we test in a "black box" approach. The features have clear and, most importantly, desirable interpretations or capabilities: they can reliably identify coordination environments, as the previous section unequivocally highlighted. Since the structure ngerprint results underscore that the most basic coordination information suffices to reliably distinguish different structures, this hints at a different problem.
The issue seems to be that our test set might not be diverse enough in order to make the LoStOPs more critical components of the structure ngerprints. This can be understood when recalling that occurrence statistics of different coordination motifs are not uniform, but there are certain motifs that are observed disproportionately more frequently. In particular, tetrahedra and octahedra are the most frequently occurring coordination motifs for anions. For example, we found in the structures that we used from ref. 59 that 23 structures have (anion) octahedral sites, 21 structures with tetrahedral sites, as well as 3 with trigonal planar and with 2 trigonal non-coplanar sites. Furthermore, almost all of these tetrahedral and octahedral sites are nearly perfect Platonic polyhedra. Consequently, the issue or bias that there are only a few important perfect motifs and that the vast majority of coordination environments considered in the previous section do not occur frequently enough is a natural external condition in the present context. Nonetheless, our results indicate that the key to a reliable coordination environment analysis 80 and, thus, to a coordination environment-based structure similarity assessment lies in the accuracy of the neighbor nding method.
The full collection of results for structure group (dis)similarity is provided in the ESI. † In Fig. 12, we present the data using the setup that gives the smallest overall overlapping coefficient. The blue lines indicate the distributions obtained from calculating distances between ngerprints from structure of the same prototype group, whereas the orange lines are the distribution of structure ngerprint distances to members of other prototype groups. The results for each prototype group separately underline that our approach using the overlapping coefficient is a valuable choice because, in most cases, there is a nice consistent separation between the blue area on the le and orange area in the center and on the right. Very few groups exhibit slightly misleading results such as low-cristobalite (right column, 6th panel from the top). Despite the very low OVL (0.5%) observed for this group, there are a lot of unlike structures having a smaller distance to a member of the lowcristobalite group than the peak location representing the (little) distance variation within the low-cristobalite prototype group itself.
We emphasize that our new structure similarity quantication procedure has a very important advantage to conventional structure matching or similarity methods. It does not only provide a global similarity value (e.g., RMSD). The separate elements of the ngerprints and, thus, difference vectors also carry local coordination information in a very compact way, which can be readily used in machine learning approaches and in formulating design rules. For example, consider diamond (C, mp-66), rocksalt (NaCl, mp-22862), and a-iron (Fe, mp-13) as well as a Heusler (Cu 2 MnSn, mp-22221), a spinel (MgAlO 3 , mp-3536), and a perovskite material (CaTiO 3 , mp-5827). Ignoring zero entries, the values of the best performing ngerprint for similarity assessment ("CrystallNN*" with "ops" preset and mean and maximum values) are provided in Table 2.
Diamond's leading ngerprint elements are the mean 4-fold coordination likelihood and the mean tetrahedral LoStOP. Because diamond 112 consists exclusively of tetrahedrally coordinated carbon atoms both LoStOP entries are 1, whereas all other 4-fold coordinated LoStOPs are much smaller (#0.25). Similarly, rocksalt 113 has only octahedrally coordinated Na + and Cl À ions, for which reason the mean 6-fold coordination likelihood and the mean octahedral LoStOP is 1 and all other 6-fold coordination descriptors considerably smaller.
All sites in a-iron 114 and the full Heusler material 115 Cu 2 MnSn are expected to be coordinated 8-fold in a body centered cubic manner. Among the leading ngerprint elements, we nd in fact w CN¼8 and q bcc |CN ¼ 8, but their deviation from 1 (0.58 and 0.54, respectively) and a slightly higher maximum value together with max(w CN¼14 ) ag a special case. The second nearest neighbors in BCC structures are quite close to a given site in comparison to nearest neighbors, 80 and the structure produces rather large Voronoi facets for those second shell atoms. 107 Hence, it is not unreasonable to consider the second nearest neighbors as being directly coordinated to a given siteonly with a smaller coordination likelihood. Also note that, in the case of a-iron, the 6 second nearest neighbors exhibit an as small contribution as 0.21 to the 14-fold coordination likelihood if the contribution of the 8 nearest neighbors can be set to w CN¼8 ¼ 0.58.
The MgAlO 3 spinel structure possesses perfect MgO 4 tetrahedra, which share corners with slightly distorted AlO 6 octahedra. 59 Since the oxygen atoms are 4-fold coordinated just as the Mg ions, 59 80% of the sites are 4-fold coordinated, which explains the high value of w CN¼4 ¼ 0.7 (all Mg and O site w CN¼4 6074 | RSC Adv., 2020, 10, 6063-6081 This journal is © The Royal Society of Chemistry 2020 $ 0.98). The fact that the mean tetrahedrality is markedly larger than 1/5 indicates that there are more sites that have a certain tetrahedral character than only Mg sites with q tet ¼ 1: the oxygen sites with q tet z 0.5. However, several of the maximum values of the other 4-fold coordination motif LoStOPs are also around 0.5, thus, underlining that the oxygen atoms have no distinctly identiable coordination motif. The distorted AlO 6 octahedra are clearly represented by max(w CN¼6 ) ¼ 1 and max(q oct |CN ¼ 6) ¼ 0.83. Fig. 11 Overlapping coefficient, OVL, as a function of (i) the underlying site fingerprint and (ii) the statistics run across all site fingerprints in a structure, both together centrally define a given structure fingerprint, and (iii) as obtained by using the cosine similarity, s cos , (A) and the distance (dis)similarity metric, d, (B) for comparing two structure, respectively. Note that "CrystalNN, cn" and "CrystalNN, ops" refer to the Crys-talNNFingerprint in matminer using preset "cn" and "ops," respectively, an additional asterisk indicates that the two available fingerprint options are set to none, "ChemEnvSite, simple" and "ChemEnvSite, multi-weight" refer to the ChemEnvSiteFingerprint in matminer using the "simple" and "multi-weight" preset, respectively, "OPSite" and "OPSite, bond-OPs" refer to the OPSiteFingerprint with the default OP set and using bondorientational order parameters 105 for all coordination numbers tested (i.e., CN ¼ 1, ., 12), and "std. dev.", "min.", as well as "max." refer to standard deviation, minimum, and maximum, respectively.
This journal is © The Royal Society of Chemistry 2020 RSC Adv., 2020, 10, 6063-6081 | 6075 Ultimately, ideal perovskite materials such as CaTiO 3 have Ca atoms that are 12-fold coordinated by oxygen atoms in a cuboctahedron fashion, perfect TiO 6 octahedra, and oxygen atoms that are considered 6-fold coordinated (2 Ti atoms are nearest neighbors forming linear bonds, but the 4 next nearest Ca neighbors are also considered coordinated to oxygen atoms). 59 All these features are nicely reproduced by our structure ngerprint. 60% of the sites (oxygen atoms) can to some degree (w CN¼2 z 0.5) be considered 2-fold coordinated (0.6 Â w CN¼2 z w CN¼2 ¼ 0.31). And, we can infer the linear bond character of these sites by comparing max(w CN¼2 ) and max(q lin |CN ¼ 2) values. The perfect TiO 6 octahedron results in Fig. 12 Distributions of distances between structure fingerprints, kv struct,i À v struct,j k, for measuring dissimilarities between structures that belong to the same prototype group (blue) and between structures that belong to different prototype groups (orange) for the optimal fingerprint settings, etc. The overlapping coefficients (OVLs), which quantify the overlap between the two distributions in each panel, are also provided. The top left panel represents results averaged over data from all prototype structure groups, whereas the remaining panels display the results for a specific target prototype structure group. max(w CN¼6 ) ¼ max(q oct |CN ¼ 6) ¼ 1, whereas the corresponding mean values of z0.5 indicate that more sites than the Ti atoms possess distinct octahedral coordination geometry: the oxygen atoms. Finally, w CN¼12 ¼ 0.2 and max(w CN¼12 ) ¼ 1 clearly highlight the 20% of sites (Ca) that are perfectly 12-fold coordinated, and q cuboct |CN ¼ 12 ¼ 0.2 and max(q cuboct |CN ¼ 12) ¼ 1 that the coordination geometry is a perfect cuboctahedron.

Discussion
Two aspects of our work require additional discussion: our novel neighbor-nding algorithm "CrystallNN" (CNN) and the metric used to assess similarity-especially between structures.
3.3.1 Novel neighbor-nding method. The structure similarity analysis suggests that CNN is a valuable new way of nding neighbors and computing coordination numbers. It yields better structure similarity assessments than the minimum-distance near-neighbor nding method used in the OPSite structure ngerprint and the conventional Voronoi method used by the ChemEnvSite ngerprint (Fig. 11). Here, it does not matter whether or not to use our new order parameters for structure similarity assessment, given our structure test set. Essentially, the usage of CNN-based coordination likelihoods lends the CrystallNN structure ngerprint its similarity assessment power. Note also that we are currently investigating the performance of CNN for predicting coordination numbers in greater detail and that CNN frequently outperforms conventional approaches in the upcoming work.
We introduced, to the best of our knowledge, a new element to the neighbor-nding issue: the coordination likelihood computation using the normalized Voronoi weights in conjunction with a semicircle. We invented this approach in an adaptive data analysis 116 fashion. Given a small number of pathological cases, we optimized our algorithm via trial-anderror. We stress here however that we did not use the entire benchmark test set for structure similarity assessment (Fig. 10). Therefore, the algorithm development procedure resembled a model selection process, where we took care about not using the same test set for testing.
The new element of our method improved structure similarity assessment. However, we have to underscore that the approach might appear purely algorithmic in nature. It lacks any theory or physical model justifying, for example, the specic use of the semicircle. This can be interpreted as a conceptual disadvantage over more direct or intuitive approaches such as the minimum distance neighbor-nding method.
3.3.2 Similarity metric issue. The results of the site ngerprints clearly indicate that dot product-based similarity metrics are advantageous for more stringently distinguishing dissimilar coordination environments. The situation is more nuanced in the case of the structure ngerprints because the overlapping coefficient (OVL) suggests that the distance (dis) Table 2 Examples of structure fingerprints

Structure
Leading ngerprint elements This journal is © The Royal Society of Chemistry 2020 RSC Adv., 2020, 10, 6063-6081 | 6077 similarity metric gives the best performance. However, this outcome should be viewed cautiously.
On an average, the cosine similarity yields an OVL that is z0.01 smaller (or z11%) than the distance (dis)similarity. Using only the CNN data ("CrystalNN*, ops" and "CrystalNN*, cn" in Fig. 11), the trend is opposite (zÀ0.001, zÀ7%) but considerably weaker. Hence, the (dis)similarity metric does not signicantly impact the outcome of the best performing group of ngerprints that we have considered. This is conrmed by comparing results of individual material prototype groups of the best combination of ngerprint and (dis)similarity metric (Fig. 12) with the same ngerprint but cosine similarity metric (ESI: Fig. 26 †). The OVLs are usually almost identical. There are in fact only two cases (tetragonal BaTiO 3 and ilmenite) for which the OVLs differ more evidently (zÀ5% and z+4%, respectively). However, the corresponding OVLs are also exceptionally large (36% and 12%, respectively) so that we conclude that even these few extreme cases represent only moderate discrepancies between the two (dis)similarity metrics. Moreover, distribution features such as fragmentation (tetragonal BaTiO 3 ) and second similarity peaks (aragonite) are similarly clearly observable with both (dis)similarity metrics.
Because normalization is very desirable in most machine learning approaches, 117 we conclude that the cosine similarity metric is-at the very least-an oen unjustiably ignored metric. We can solidly recommend to use the cosine similarity as an ad-hoc assessment metric for computational materials science applications.

Conclusions
Despite the facts that coordination chemistry was already established more than a century ago 118 and Brunner, more than 4 decades ago, still realized that "the term coordination ha[d] no satisfying denition," 87 determination and evaluation of coordinated neighbors in crystal structures is even today continuing to be a nontrivial scientic task-let alone automated versions of those processes. To aid in lling this gap, we have, in this work, introduced a new neighbor nding method and several new local structure order parameters (LoStOPs). We have used a Monte Carlo framework to maximize comparability between different LoStOPs. Subsequently, we have employed the new descriptors to dene feature vectors that are characteristic of known coordination environments (site ngerprints and similarities) and that are capable of distinguishing commonly investigated crystalline prototype materials (structure ngerprints and similarities). In-depth testing has enabled us to give recommendations which type of ngerprint should be combined with which similarity metric in order to most reliably categorize site environments and crystal structures. We actively utilize our novel capabilities on the Materials Project 11 website 12 to assess the crystal structure similarity between different materials. To this end, we have computed all z2.45 billion structure similarity distances between each pair of the z70 000 materials in the MP database. The structure similarity data greatly facilitate browsing of the website and, thus, exploration of the MP materials database because it adds an intuitive connection mechanism.
We hope that our novel coordination descriptors will be helpful in the context of inverse design approaches, 119,120 where, specically, the distinction between perfect and slightly distorted coordination environments 121 or building blocks 27,28 is decisive for the target property. Finally, we believe that the here introduced methods and concepts, which are freely available through the python 74 packages pymatgen 75,76 and matminer, 65,77 will greatly facilitate future data-driven and machine learning studies at the interface between materials science, chemistry, and engineering, for example, by producing standardized metadata. 122

Conflicts of interest
There are no conicts to declare.