Classification of spatially resolved molecular fingerprints for machine learning applications and development of a codebase for their implementation

Mardochee Reveil * and Paulette Clancy
Robert Frederick Smith School of Chemical and Biomolecular Engineering, Cornell University, Ithaca, NY 14853, USA. E-mail: mr937@cornell.edu

Received 9th January 2018 , Accepted 19th February 2018

First published on 20th February 2018


Direct mapping between material structures and properties for various classes of materials is often the ultimate goal of materials researchers. Recent progress in the field of machine learning has created a unique path to develop such mappings based on empirical data. This new opportunity calls for advanced structural representations suitable for use with current machine learning algorithms. A number of such representations, termed “molecular fingerprints” or descriptors, have been proposed over the years for this purpose. In this paper, we introduce a classification framework to better explain and interpret existing fingerprinting schemes in the literature, with a focus on those with spatial resolution. We then present the implementation of SEING, a new codebase for computing those fingerprints, and we demonstrate its capabilities by building k-nearest neighbor (k-NN) models for force prediction that achieve a generalization accuracy of 0.1 meV Å⁻¹ and an R² score as high as 0.99 at testing. Our results indicate that simple and generally overlooked k-NN models could be very promising compared to approaches such as neural networks, Gaussian processes, and support vector machines, which are more commonly used for machine learning-based predictions in computational materials science.



Design, System, Application

Within the structure–property relationship framework, the atomistic structure of a material is a strong determinant of its properties and suitability for specific applications. Therefore, material design projects usually require a proper understanding of the underlying structures, which can be obtained from quantum or classical calculations. An exciting development in computational material design is the use of artificial intelligence (AI) and machine learning (ML) techniques to accelerate the analysis, design, and discovery of “structural signatures” that impact properties. However, a key requirement for this approach to become more useful has been the availability of spatially resolved numerical representations of atomic structures that are invariant under all property-preserving operations such as geometric translations and rotations. Although this requirement has proven more elusive than expected, significant progress has been made over the past decade or so, with the proposed representations achieving various degrees of performance. In this paper, those representation schemes, traditionally termed “fingerprints”, are categorized and reviewed. A new package called SEING is also created and released to the community to streamline and facilitate the use of those fingerprints in ML applications. This work can help accelerate the use of AI in materials science applications for better understanding and easier discovery of a wide range of new materials.

1. Introduction

Recent developments in machine learning (ML) algorithms have afforded significant progress in diverse application areas, including medical diagnosis,1–3 speech recognition,4–6 face recognition,7,8 the discovery of new planets,9 and many others. Following some profound successes in such areas, researchers in materials science are increasingly interested in investigating the promise of similar techniques for the prediction of material properties,10–12 the discovery of new materials,13–15 force field development16–19 for use in molecular dynamics simulations, and so on. In these applications, it has been shown that data from both experiments and simulations can be leveraged to make predictions of a large number of properties including gas uptake capacity in metal organic frameworks,11 crystal structure predictions,20 band gap energies of double perovskites,21 heat capacity of organic molecules,10 etc.

“Feature extraction,” or “feature engineering,” is a key step in machine learning studies to ensure that accurately predictive models can be built from available data. In this context, feature engineering refers to the process of identifying, evaluating and applying one or more transformations (usually mathematical operations) to a dataset to enhance its suitability for machine learning use. For applications involving prediction of properties based on molecular structures, this process usually leads to the computation of a vector or matrix called a “fingerprint,” or “descriptor,” that represents a chemical structure or environment (see Fig. 1). Beyond machine learning applications, fingerprints obtained in this manner can also be used for a systematic study of structure–property relationships, including structural analysis and structural comparison for the identification of configurations of interest and for property prediction.18


Fig. 1 Comparison between the current way of conducting machine learning-based molecular studies (given in dashed lines and highlighted in red) and the proposed method using SEING (highlighted in blue). By providing a unified framework and package for fingerprint generation in an “off-the-shelf” fashion, SEING will make such machine learning studies more straightforward and effective.

To be suitable for machine learning studies, especially force field development, fingerprints have to satisfy a number of requirements such as invariance under rotation, translation and permutation of atoms of the same chemical nature.18,22,23 Differentiability and speed of computation are also desirable.18,23 The need for symmetrization of atomic coordinates to satisfy the requirement of invariance under permutation of chemically identical atoms, especially for use in neural networks, has been recognized for at least two decades.24 This need arises because Cartesian coordinates, traditionally used to represent molecular systems, are not invariant under those operations. For example, permutation of two atoms of the same type does not affect the properties of a molecular system, but it changes the ordering of its Cartesian coordinates. Alternative representations that are permutation-invariant are therefore needed. By the same argument, rotation and translation invariance are also necessary. Differentiability matters mainly for force predictions, because forces are derivatives of the energy with respect to atomic positions.
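The invariance argument can be checked numerically. The following is a minimal sketch, not part of SEING: it uses a toy descriptor (the sorted list of interatomic distances, chosen here purely for illustration) and arbitrary coordinates to show that rotation, translation and relabeling change the Cartesian coordinates but leave the descriptor untouched.

```python
import math

def rotate_z(points, angle):
    """Rotate 3D points about the z axis by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y, z) for x, y, z in points]

def translate(points, shift):
    """Shift every point by the same vector."""
    return [(x + shift[0], y + shift[1], z + shift[2]) for x, y, z in points]

def sorted_pair_distances(points):
    """Toy descriptor (illustrative only): sorted interatomic distances."""
    n = len(points)
    return sorted(math.dist(points[i], points[j])
                  for i in range(n) for j in range(i + 1, n))

# Three atoms of the same species; coordinates are arbitrary values in Å.
atoms = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.5)]

# Rotate, translate, then relabel (permute) the atoms.
moved = translate(rotate_z(atoms, 0.7), (3.0, -1.0, 2.0))
permuted = [moved[2], moved[0], moved[1]]

reference = sorted_pair_distances(atoms)
transformed = sorted_pair_distances(permuted)

coords_changed = permuted != atoms
descriptor_unchanged = all(
    math.isclose(a, b, abs_tol=1e-12) for a, b in zip(reference, transformed)
)
```

A descriptor this simple discards angular information, which is one reason the more elaborate fingerprints reviewed in this paper exist.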

Many different fingerprints have been proposed over the years for machine learning applications. Examples include the Coulomb matrix,25 bag of bonds,26 symmetry functions,27 bispectrum,28 Zernike,23 and others. More recently, Artrith et al. proposed a way to generate fingerprints for compositions containing many species.29 A method to automatically find appropriate descriptors was suggested by Ghiringhelli et al.30 A fingerprint called a “depth map” (D-map) that readily inverts into molecular geometries was tested by Yao et al.31 Other descriptors such as bonds-in-molecule32 and many-body expansion approaches have also been recently explored. Sadeghi et al. evaluated a number of fingerprints in the form of n × n matrices, where n is the number of atoms in the system,33 including contact34 and overlap35 matrices, which are useful for evaluating similarity between structures. Because so many different fingerprints have been proposed in the literature, we propose a fingerprint categorization scheme that affords an easier and more systematic analysis.

With such a large number of available descriptors, the use of machine learning for molecular studies usually begins with an evaluation to find the best fingerprint for the problem at hand. Until now, most fingerprints have not been readily available to the community in an “off-the-shelf” code. As a result, researchers invariably have to implement the fingerprints from scratch, a time-consuming effort that is wastefully duplicated across the community. This lack of an accessible fingerprinting codebase provides one of the main motivations for this paper.

Here we introduce a new C/C++ package called SEING (an old French word for a signature) that focuses solely on fingerprint generation. The subsequent steps of selecting and using a suitable machine learning algorithm are left outside the scope of this platform, as several mature tools are already available for such tasks from the machine learning community (e.g. TensorFlow,36 scikit-learn,37,38 Keras,39 and so on). SEING aims to be a comprehensive repository for molecular fingerprints proposed in the literature that are specifically intended for machine learning applications. As such, SEING is built in a modular fashion so that new fingerprinting techniques can be readily incorporated. As illustrated in Fig. 1, this code is intended to significantly reduce the time for implementing and generating spatially resolved molecular fingerprints from hours/days to seconds/minutes. In this sense, it will alleviate, if not eliminate, perceived barriers towards implementing and conducting machine learning studies of molecular systems by increasing accessibility in the same way that, say, LAMMPS increased accessibility to molecular dynamics simulations.

A number of other fingerprints exist in the field of drug design to which the larger materials science community probably owes the term “molecular fingerprint”.40–42 Those molecular representations, for the most part, were created years before the relatively new fingerprints addressed in this paper and were specifically designed for comparisons and quantitative structure–activity relationship studies of molecules relevant to biological applications. Notwithstanding their usefulness otherwise, most of those fingerprints are not particularly suitable for spatially resolved machine learning studies of molecular systems as they usually do not satisfy the requirements discussed above. They are intentionally left outside of the scope of this paper and the SEING package. Other tools such as CDK,43 RDKit44 and PaDEL45 have been created and released to the therapeutic drug discovery community to help streamline the process of generating those fingerprints.

The remainder of this paper is organized as follows: with a focus on spatially resolved fingerprinting methods, we start by proposing a categorization scheme (a taxonomy of sorts) which is capable of incorporating most, if not all, of the existing fingerprints available in the literature that are suitable for molecular systems in materials science applications. We then review a handful of selected fingerprints based on their relevance to, and prior use in, force field development. A comprehensive review of all existing fingerprinting strategies is beyond the scope of this study, but the reader is referred to the following partial reviews for more details: Ward and Wolverton,22 Behler18,46 and Faber et al.10 Finally, we describe the philosophy behind the development of the SEING package and finish with sample applications of a SEING-based workflow for force predictions. The reader should note that the terms “fingerprint” and “descriptor” are used interchangeably in this paper and, unless noted otherwise, refer to molecular representations developed for machine learning applications.

2. Fingerprints

2.1. Overview of fingerprinting strategies

As mentioned in the preceding section, Cartesian coordinates are deemed inappropriate for machine learning applications. Therefore, a number of alternative representations or descriptors have been proposed over the years. In Table 1, we list major fingerprints in the current literature designated by type and the year in which they were first proposed, covering the past ten years. From a high-level perspective, fingerprints can be divided into atom-centered (or local) and global descriptors, based on whether they represent the environment around a given atom or are constructed to represent an entire molecule or crystal (see Fig. 2). Such a differentiation has already been used previously in the literature.18,23,46 Atom-centered fingerprints are particularly suitable to build machine learning-based force fields and for the prediction of local properties. Global fingerprints are, by design, more suitable for the prediction of macroscopic properties.
Table 1 Overview of the main fingerprinting strategies for machine learning applications proposed over the past ten years (2007–2017) using the classification scheme developed in this paper

Atom-centered fingerprints
    Symmetry functions: Behler–Parrinello27; Behler16; AGNI47; AGNI2.0 (ref. 48); Jose–Artrith–Behler49; wACSF50
    Basis set decomposition: Bispectrum28; Zernike23; Artrith–Urban–Ceder29
    Kernel-based: SOAP51; GRAPE52

Global fingerprints
    Pair distances only: Extended connectivity53; Coulomb matrix25; PRDF54; Bonds in molecule32; Contact matrix34; Bag of bonds26; Overlap matrix35; Fourier series55; Hamiltonian matrix33; Hessian matrix33
    Distances, angles, dihedrals: BAML56; Many-body expansion31; MBTR57; MARAD10; HDAD10
    Graph-based: Molecular graph58; Property-labeled materials fragments59
    Others: ALF60; Motif-based61; D-Map31; RAC62



Fig. 2 Two distinct strategies exist to generate fingerprints: global and atom-centered. The former is generally suitable for macroscopic property predictions whereas the latter can be used for local property predictions and is particularly relevant to force field development.

Atom-centered fingerprints can be further divided according to whether they are based on symmetry functions, basis set decomposition or kernels. Symmetry function-based representations use summations over bonds, angles or dihedrals of the same nature, or summations over functions thereof, for all atoms within a given sphere around the atom of interest. Fingerprints that fall within this category include Behler–Parrinello16,27 and Jose–Artrith–Behler.49 They usually differ in the functional forms used in the summation. Basis set decompositions, on the other hand, rely on a local atomic density function defined around the atom of interest, which is then expressed as a linear combination of a suitable basis set. Subsequent operations are usually applied to ensure invariance under rotation and/or translation. Examples of such fingerprints include Bispectrum,28 Zernike,23 and the Artrith–Urban–Ceder method, which was specially developed to support multi-component systems.29 The smooth overlap of atomic positions (SOAP)51 and graph-approximated energy (GRAPE)52 methods fall into a third category in which local environments are compared directly with the help of an appropriate kernel function without the explicit computation of a fingerprint.

Global fingerprints, on the other hand, can be divided into ones that involve only pair distances and ones that include higher-order interactions. The former subcategory has the largest number of fingerprints and includes extended connectivity,53 Coulomb matrix,25 as well as contact,34 overlap,35 Hessian and Hamiltonian matrices,33 partial radial distribution functions (PRDF),54 bag-of-bonds,26 Fourier series decomposition of radial distribution functions,55 bonds-in-molecule,32 connectivity count and encoded distances.63 Those fingerprints are built using different schemes or functions of some or all pair distances in the molecule or crystal of interest. Sometimes, the fingerprints also use information related to the nature of the different elements, such as partial charges, atomic numbers, etc.
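As a concrete instance of a pair-distance-based global fingerprint, the sketch below builds a Coulomb matrix, assuming its usual definition (diagonal terms 0.5·Z^2.4, off-diagonal terms Z_i·Z_j divided by the interatomic distance). The geometry is an illustrative water-like arrangement, and reducing the matrix to its sorted row norms is a simplification chosen here to show permutation invariance; it is not a scheme taken from the works cited above.

```python
import math

def coulomb_matrix(numbers, positions):
    """Coulomb matrix from atomic numbers and Cartesian positions."""
    n = len(numbers)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                matrix[i][j] = 0.5 * numbers[i] ** 2.4
            else:
                distance = math.dist(positions[i], positions[j])
                matrix[i][j] = numbers[i] * numbers[j] / distance
    return matrix

def sorted_row_norms(matrix):
    """Permutation-invariant summary: row norms in decreasing order."""
    norms = (math.sqrt(sum(v * v for v in row)) for row in matrix)
    return sorted(norms, reverse=True)

# Water-like geometry; coordinates are illustrative values.
numbers = [8, 1, 1]
positions = [(0.0, 0.0, 0.0), (0.96, 0.0, 0.0), (-0.24, 0.93, 0.0)]

cm = coulomb_matrix(numbers, positions)
fp = sorted_row_norms(cm)

# Relabel the atoms: the raw matrix changes, the sorted norms do not.
order = [2, 0, 1]
cm_perm = coulomb_matrix([numbers[p] for p in order],
                         [positions[p] for p in order])
fp_perm = sorted_row_norms(cm_perm)
```

The raw matrix depends on atom ordering, which is why an ordering or symmetrization step of some kind is needed before such a matrix can serve as a machine learning input.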

The latter subcategory includes recently developed fingerprints such as bond angle machine learning (BAML),56 many-body expansion (MBE),31 many-body tensor representation,57 molecular atomic radial angular distribution (MARAD),10 and the histogram of distances, angles and dihedral angles (HDAD).10 Those fingerprints include higher-order interaction terms such as angles and dihedrals and usually lead to more accurate predictions compared to their pair distances-only counterparts.

A third subcategory of global fingerprints is comprised of graph-based descriptors, such as molecular graphs58 and property-labelled materials fragments,59 which are suitable for use in graph-based convolutional neural networks. Some other descriptors can be included in their own subcategory; these include the atomic local frame (ALF),60 motif-based fingerprint61 and the revised autocorrelation (RAC) fingerprints62 which are inspired by descriptors used in cheminformatics. This subcategory would also include the D-map31 fingerprint that can be readily mapped back to chemical structures.

2.2. Symmetry function fingerprints

In general, symmetrization strategies involve a direct summation over pairwise distances and/or angles between three atoms, or a summation over parametric functions of those distances and angles. These “symmetry functions” have the effect of making atoms lose their individuality to all other symmetrically and chemically equivalent atoms. Examples include the atom-centered radial and angular functions, G^rad and G^ang, proposed by Behler and Parrinello (BP)16,27 and given by eqn (1) and (2), respectively. For a given atom i, the BP method uses distances R_ij to all atoms j within a cutoff distance R_c, as well as angles θ_ijk centered on atom i and involving atoms j and k with j ≠ k. The cutoff function, f_c, ensures a smooth transition to a contribution of zero for atoms that are outside of R_c. An example of a cutoff function is given in eqn (3). Generating a finite number of those symmetry functions based on a set of predefined parameter values, corresponding to the width η and center R_s of the Gaussians in G^rad as well as λ, η and ζ in G^ang, ensures that fingerprints are an effective representation of the chemical environments that they seek to capture. The number of parameters to use, as well as their specific values defining the spatial resolution achieved, is a design decision left to the discretion of the modeler.
 
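$$G_i^{\mathrm{rad}} = \sum_{j \neq i} e^{-\eta (R_{ij} - R_s)^2} f_c(R_{ij})$$ (1)

$$G_i^{\mathrm{ang}} = 2^{1-\zeta} \sum_{j,k \neq i;\ j \neq k} \left(1 + \lambda \cos\theta_{ijk}\right)^{\zeta} e^{-\eta \left(R_{ij}^2 + R_{ik}^2 + R_{jk}^2\right)} f_c(R_{ij})\, f_c(R_{ik})\, f_c(R_{jk})$$ (2)

$$f_c(R_{ij}) = \begin{cases} \frac{1}{2}\left[\cos\left(\pi R_{ij}/R_c\right) + 1\right], & R_{ij} \le R_c \\ 0, & R_{ij} > R_c \end{cases}$$ (3)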
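The radial and angular symmetry functions can be sketched in a few lines of Python. This is a minimal illustration rather than SEING's implementation, and it assumes the standard Behler–Parrinello forms: Gaussian radial terms, angular terms of the form (1 + λ cos θ)^ζ with a 2^(1−ζ) prefactor, and a cosine cutoff. The parameter values and the four-neighbor cluster are arbitrary.

```python
import math

def cutoff(r, rc):
    """Cosine cutoff: 1 at r = 0, decaying smoothly to 0 at r >= rc."""
    return 0.5 * (math.cos(math.pi * r / rc) + 1.0) if r <= rc else 0.0

def g_rad(center, neighbors, eta, rs, rc):
    """Radial symmetry function for one (eta, rs) pair."""
    total = 0.0
    for p in neighbors:
        r = math.dist(center, p)
        total += math.exp(-eta * (r - rs) ** 2) * cutoff(r, rc)
    return total

def g_ang(center, neighbors, eta, lam, zeta, rc):
    """Angular symmetry function for one (eta, lam, zeta) triple."""
    total = 0.0
    for j, pj in enumerate(neighbors):
        for k, pk in enumerate(neighbors):
            if j == k:
                continue
            rij = math.dist(center, pj)
            rik = math.dist(center, pk)
            rjk = math.dist(pj, pk)
            dot = sum((a - c) * (b - c) for a, b, c in zip(pj, pk, center))
            cos_theta = dot / (rij * rik)
            # Clamp guards against a tiny negative base caused by rounding.
            base = max(0.0, 1.0 + lam * cos_theta)
            total += (base ** zeta
                      * math.exp(-eta * (rij ** 2 + rik ** 2 + rjk ** 2))
                      * cutoff(rij, rc) * cutoff(rik, rc) * cutoff(rjk, rc))
    return 2.0 ** (1.0 - zeta) * total

def bp_fingerprint(center, neighbors, rc=6.5):
    """Concatenate radial and angular terms (illustrative parameters)."""
    radial = [g_rad(center, neighbors, eta, 0.0, rc) for eta in (0.05, 0.5)]
    angular = [g_ang(center, neighbors, 0.005, lam, zeta, rc)
               for lam in (1.0, -1.0) for zeta in (1.0, 4.0)]
    return radial + angular

def rotate_z(points, angle):
    """Rotate points about the z axis (the central atom sits at the origin)."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y, z) for x, y, z in points]

center = (0.0, 0.0, 0.0)
neighbors = [(2.0, 0.0, 0.0), (0.0, 2.5, 0.0), (0.0, 0.0, 3.0), (-1.5, 1.5, 0.5)]

fp = bp_fingerprint(center, neighbors)
fp_rotated = bp_fingerprint(center, rotate_z(neighbors, 1.1))
```

Because every term depends only on distances and angles, `fp` and `fp_rotated` agree to rounding error, and reordering `neighbors` leaves the sums unchanged.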

Other types of symmetry and cutoff functions have been proposed by different authors. For example, pair-centered symmetry functions have been used by Jose et al.49 to construct neural network potentials. Another approach, based on permutation-invariant polynomials, has been suggested by Jiang et al.64 and by Li et al.65 In this method, Cartesian coordinates are replaced by a summation over symmetrized Morse-like monomials that include all possible nuclear permutations in the system. AGNI47,48 fingerprints are another example of symmetry function-based fingerprints developed to predict atomic forces directly. As shown by eqn (4), AGNI fingerprints are very similar to the G^rad components of the BP fingerprints (see eqn (1)). However, they include a direction-specific coefficient given by the ratio of the α component (where α = x, y or z) of the pair distance, r^α_ij, to the pair distance r_ij between atoms i and j, akin to a derivative of the BP fingerprints in that direction. In addition, especially in their newer version of AGNI, those authors showed that using the Gaussian centers a_k (labeled R_s in G^rad) as the main parameter, rather than the width w (labeled η in G^rad) of the Gaussians, led to better predictive capabilities for the machine learning models.

 
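$$V_i^{\alpha}(k) = \sum_{j \neq i} \frac{r_{ij}^{\alpha}}{r_{ij}} \frac{1}{\sqrt{2\pi}\, w} \exp\left[-\frac{1}{2}\left(\frac{r_{ij} - a_k}{w}\right)^2\right] f_c(r_{ij})$$ (4)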
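The directional character of AGNI fingerprints can be illustrated with a short sketch. It assumes a normalized Gaussian of width w centered at a_k, a cosine cutoff, and the sign convention that r^α_ij points from atom i to atom j; these choices, the function names and the parameter values are assumptions made for illustration and are not taken from the SEING source.

```python
import math

def cutoff(r, rc):
    """Cosine cutoff: 1 at r = 0, decaying smoothly to 0 at r >= rc."""
    return 0.5 * (math.cos(math.pi * r / rc) + 1.0) if r <= rc else 0.0

def agni_component(center, neighbors, axis, a_k, w, rc):
    """One AGNI-style component along `axis` (0 = x, 1 = y, 2 = z)."""
    total = 0.0
    for p in neighbors:
        r = math.dist(center, p)
        if r == 0.0 or r > rc:
            continue
        direction = (p[axis] - center[axis]) / r  # assumed sign: i -> j
        gauss = (math.exp(-0.5 * ((r - a_k) / w) ** 2)
                 / (math.sqrt(2.0 * math.pi) * w))
        total += direction * gauss * cutoff(r, rc)
    return total

def agni_fingerprint(center, neighbors, axis, centers, w, rc):
    """Fingerprint vector: one component per Gaussian center a_k."""
    return [agni_component(center, neighbors, axis, a_k, w, rc)
            for a_k in centers]

center = (0.0, 0.0, 0.0)
centers = [1.0, 2.0, 3.0]  # illustrative Gaussian centers in Å
w, rc = 0.5, 6.5           # illustrative width and cutoff in Å

# A single neighbor on the +x axis: x components are positive, y vanish.
single = [(2.0, 0.0, 0.0)]
fx_single = agni_fingerprint(center, single, 0, centers, w, rc)
fy_single = agni_fingerprint(center, single, 1, centers, w, rc)

# A mirror-symmetric pair: contributions along x cancel.
pair = [(2.0, 0.0, 0.0), (-2.0, 0.0, 0.0)]
fx_pair = agni_fingerprint(center, pair, 0, centers, w, rc)
```

The cancellation in the symmetric case mirrors the vanishing net force on an atom in a symmetric environment, which is what makes this kind of fingerprint a natural input for direct force prediction.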

Although symmetry functions have been successfully used for a number of applications, one significant drawback of this approach has been the growth of the descriptor size with the number of elements. Traditionally, a new set of radial functions is created for each atom type and a new set of angular functions for every combination of elements, leading to a significant increase of fingerprint size with the number of species. This usually makes it impractical to use symmetry functions for systems containing more than four elements.18 However, the recent weighted atom-centered symmetry functions (wACSF)50 method proposed by Gastegger et al. attributes a species-specific weight to each term in the sums for G^rad and G^ang, hence avoiding the increase in fingerprint dimensionality with the number of elements. A similar approach was used by Artrith et al.29

2.3. Bispectrum fingerprints

The bispectrum fingerprinting approach, previously used for image or pattern recognition,66 was first proposed by Bartók et al.28 for the representation of molecular structures. It relies on the atomic density around a center atom, i, as defined by eqn (5), where f_cut(r) = 1/2 + cos(πr/r_cut)/2 and δ is the Dirac delta function. The local atomic density (LAD) is invariant to permutation by construction. It is further projected onto the four-dimensional unit sphere and expressed in terms of 4D spherical harmonics. The bispectrum components, B_{j1,j2,j}, are then built from the coefficients, c^j_{m′m}, of the expansion according to eqn (6), where C^{jm}_{j1 m1 j2 m2} are Clebsch–Gordan coefficients and j, j1, j2 ≤ J_max, with J_max being the only parameter. This additional transformation ensures invariance of the fingerprint to rotation and translation.
 
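$$\rho_i(\mathbf{r}) = \delta(\mathbf{r}) + \sum_{j} f_{\mathrm{cut}}(|\mathbf{r}_{ij}|)\, \delta(\mathbf{r} - \mathbf{r}_{ij})$$ (5)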
 
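$$B_{j_1,j_2,j} = \sum_{m_1', m_1 = -j_1}^{j_1} \sum_{m_2', m_2 = -j_2}^{j_2} \sum_{m', m = -j}^{j} \left(c^{j}_{m'm}\right)^{*} C^{jm}_{j_1 m_1 j_2 m_2}\, C^{jm'}_{j_1 m_1' j_2 m_2'}\, c^{j_1}_{m_1' m_1}\, c^{j_2}_{m_2' m_2}$$ (6)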

This method has two main advantages over symmetry functions. First, it eliminates the need to choose a number of parameters defining the spatial resolution of the fingerprints, a choice that increases the fingerprint size in the case of symmetry functions. Second, it can be systematically improved by including more spherical harmonics in the summation (as controlled by J_max). This method can also be easily extended to multi-component systems using a factor w_j for the contribution of each atom j in the LAD. In their original paper, Bartók et al. used Gaussian process regression with the bispectrum representations to predict a number of properties for carbon, silicon and germanium crystals.28 They called their resulting model a Gaussian approximation potential (GAP), which differs from another bispectrum-based approach by Thompson et al., called the spectral neighbor analysis potential (SNAP), in which atom energies are assumed to be linearly dependent on the bispectrum components.67 This linearity assumption affords easier fitting of the resulting potential. However, a direct and comprehensive comparison between those two methods has not yet been performed.

2.4. Zernike fingerprints

Zernike moments, originally used for 3D shape retrieval in the machine learning community,68 were recently proposed by Khorshidi et al. to build molecular fingerprints.23 In their approach, a LAD function similar to eqn (5) is defined and expanded with 3D Zernike basis functions. Zernike basis functions are defined as products of Zernike polynomials (which are radial basis functions defined inside the unit sphere) and spherical harmonics (which form a basis set on the surface of the unit sphere). This approach is similar to the construction of the bispectrum fingerprint in the sense that an expansion of the LAD with respect to a basis set is sought; but the two methods differ in the use of radial basis functions in the Zernike case versus the use of 4D spherical harmonics for the bispectrum approach. This procedure leads to Zernike moments, c^m_nl, which are the coefficients in the expansion given by eqn (8), where Z^m_nl are Zernike basis functions. The coefficients are calculated using eqn (7). Invariance under rotation is then achieved by building the Zernike fingerprint with values corresponding to the norm of the vector c_nl = (c^{−l}_nl, c^{−l+1}_nl, …, c^{l}_nl) for different values of n and l.
 
$$c_{nl}^{m} = \left\langle Z_{nl}^{m}(\tilde{r}, \theta, \phi),\ \rho(\tilde{r}, \theta, \phi) \right\rangle$$ (7)
 
$$\rho(\tilde{r}, \theta, \phi) = \sum_{n} \sum_{l} \sum_{m=-l}^{l} c_{nl}^{m}\, Z_{nl}^{m}(\tilde{r}, \theta, \phi)$$ (8)

Since the procedure is so similar to the one used for bispectrum fingerprints, Zernike fingerprints share the advantages offered by bispectrum fingerprints and are, moreover, more computationally efficient to calculate.

2.5. SOAP fingerprints

The smooth overlap of atomic positions (SOAP) proposed by Bartók et al.51 is a completely different approach to fingerprinting. In this method, a similarity measure (called SOAP) between two atomic environments given by their respective values of LAD ρ and ρ′ is used directly for learning and predictions, instead of a descriptor. The SOAP, given by S, is defined as the inner product between the two LADs of the reference atoms (see eqn (9)). The similarity kernel, k, is then obtained by integrating S over all possible rotations (R̂) of one of the environments, as shown in eqn (10). A Gaussian-based LAD given by eqn (11) is used for SOAP instead of Dirac delta functions to avoid underestimating similarity between two slightly different environments. If the Gaussian-based LAD is expressed in terms of radial basis functions and spherical harmonics, it can be shown that the kernel given by eqn (10) becomes eqn (12), where b and b′ are the bispectrum of the two environments as defined in eqn (6). This significant result shows that the need to construct descriptors can sometimes be circumvented with no loss of generality in favor of similarity measures that allow for direct comparison between atomic environments.
 
image file: c8me00003d-t9.tif(9)
 
image file: c8me00003d-t10.tif(10)
 
image file: c8me00003d-t11.tif(11)
 
image file: c8me00003d-t12.tif(12)

2.6. Discussion on fingerprint cataloging

In this section, we have proposed a cataloging scheme for fingerprints for machine learning applications. It is important to note that the fingerprints mentioned in this paper are primarily based on the spatial coordinates of atoms in the molecular system of interest. This essentially assumes that such coordinates must have been acquired through means such as ab initio calculations, prior to performing the machine learning studies. Moreover, though they are usually designed to include all atoms in a molecular system, global fingerprints can sometimes be used as local fingerprints by including only atoms that fall within a given cutoff distance of a central atom of interest.

A whole other class of fingerprints can be constructed based solely on topological or connectivity information and/or nature of the chemical species and chemical motifs (such as fragments, rings, etc.) in the structure. Such fingerprints do not explicitly encode spatial coordinates of atoms but can successfully be used to predict various properties using machine learning techniques.11,21,69 A number of the fingerprints from cheminformatics (which are excluded from this study) would also fall into that class. While the categorization scheme suggested here is arbitrary, it provides a suitable framework to better understand and analyze different fingerprints proposed in the literature.

3. The SEING package

Given the increasing number of fingerprints proposed in the literature, it would clearly be desirable for potential users to have a software package that allows quick and efficient evaluations of the suitability of different fingerprinting options for a given problem. Ideally, such a software package would include options to use all existing fingerprints. It should also be modular so that newly proposed fingerprints can be easily incorporated. Because some fingerprints, such as the bispectrum, are computationally expensive to calculate, speed of computation would also be a requirement. Moreover, such a package should be easy to use and to incorporate into the overall workflow of using machine learning for molecular systems.

SEING is the name of a package that we have developed and are hereby releasing with those requirements in mind. SEING is written in C/C++ for fast computation of fingerprints and is designed in a modular fashion for extensibility. Packages such as AMP,23 AeNET70 and TensorMol71 include utilities for fingerprint calculations, but their primary focus is on neural network approaches to machine learning force field development, whereas SEING focuses solely on fingerprinting methods. As such, SEING allows more flexibility in the choice of machine learning algorithm and allows applications beyond machine learning force field development.

An overview flowchart showing how SEING works is given in Fig. 3. The atomic coordinates of the system are typically provided in an XYZ file which is read and manipulated as an “AtomicSystem” object within SEING and used to instantiate the fingerprint calculator of interest. Support for other coordinate file formats will be added in the future. Other inputs such as the parameter values for the fingerprint of interest are provided in an input file. SEING also implements its own neighbor-searching algorithm for faster computation of local fingerprints. In SEING, every fingerprint is implemented as a separate calculator. When a local fingerprint is needed, the “calculate_fingerprint” function of the calculator instance is called, with the atom of interest and its neighbors as arguments. In the case of a global fingerprint, the entire “AtomicSystem” object is used. From a development perspective, this allows any fingerprint-specific logic to be implemented within the calculator class which remains valid as long as an appropriate “calculate_fingerprint” function is exposed.
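The calculator pattern described above can be mimicked in a few lines. The sketch below is written in Python for brevity and is not SEING's C/C++ code: only the names "AtomicSystem" and "calculate_fingerprint" come from the text, while the class layout, the `from_xyz` and `neighbors` helpers, the `GaussianRadialCalculator` class, and the brute-force neighbor search without periodic boundaries are hypothetical choices made for illustration.

```python
import math
from dataclasses import dataclass

@dataclass
class AtomicSystem:
    """Species labels and Cartesian positions read from XYZ-formatted text."""
    species: list
    positions: list

    @classmethod
    def from_xyz(cls, text):
        lines = text.strip().splitlines()
        n_atoms = int(lines[0])
        species, positions = [], []
        for line in lines[2:2 + n_atoms]:  # second line of XYZ is a comment
            symbol, x, y, z = line.split()[:4]
            species.append(symbol)
            positions.append((float(x), float(y), float(z)))
        return cls(species, positions)

    def neighbors(self, i, cutoff):
        """Brute-force neighbor search (hypothetical; no periodic images)."""
        return [j for j in range(len(self.positions))
                if j != i
                and math.dist(self.positions[i], self.positions[j]) <= cutoff]

class GaussianRadialCalculator:
    """Hypothetical local calculator exposing calculate_fingerprint."""

    def __init__(self, system, etas, cutoff):
        self.system, self.etas, self.cutoff = system, etas, cutoff

    def calculate_fingerprint(self, atom, neighbor_ids):
        center = self.system.positions[atom]
        dists = [math.dist(center, self.system.positions[j])
                 for j in neighbor_ids]
        return [sum(math.exp(-eta * r * r)
                    * 0.5 * (math.cos(math.pi * r / self.cutoff) + 1.0)
                    for r in dists)
                for eta in self.etas]

XYZ_TEXT = """3
toy aluminium cluster
Al 0.0 0.0 0.0
Al 2.8 0.0 0.0
Al 0.0 9.0 0.0
"""

system = AtomicSystem.from_xyz(XYZ_TEXT)
calculator = GaussianRadialCalculator(system, etas=(0.05, 0.5), cutoff=6.5)
fingerprints = [calculator.calculate_fingerprint(i, system.neighbors(i, 6.5))
                for i in range(len(system.positions))]
```

Any other fingerprint could be swapped in by writing another class with the same `calculate_fingerprint` method, which is the kind of extensibility that a one-calculator-per-fingerprint design is meant to provide.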


Fig. 3 Flowchart showing the general procedure used by SEING to compute fingerprints.

From a user's perspective, SEING has minimal requirements for installation and can be easily compiled on most operating systems. Using the code requires a coordinate file and an input file containing the type of fingerprint needed and any fingerprint-specific parameters. SEING implements two strategies to account for systems with multiple species: augmented and weighted. The “augmented” strategy increases the dimensionality of a given fingerprint by appending sub-fingerprints for each species and species combination, whereas in the “weighted” strategy, any summation over atoms is modified by assigning a species-specific weight to each term. This weight can be the atomic number, electronegativity, or any other value chosen by the user. Also, when available, derivatives of a fingerprint can be easily calculated and appended to the feature vector. More details on code installation, instructions for using the code, and how to contribute to the code are provided in the official documentation and user guide, accessible at https://seing.readthedocs.io. The source code is hosted on GitHub at https://github.com/mreveil/seing.
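The difference between the two multi-species strategies can be made concrete with a radial-only sketch. The function names, the Gaussian-times-cosine-cutoff term, and the neighbor list are illustrative assumptions and do not reproduce SEING's input format or code; angular terms, which would add one sub-fingerprint per species combination in the augmented case, are omitted.

```python
import math

def radial_sum(neighbors, eta, rc, weight=lambda species: 1.0):
    """Sum of Gaussian terms over (species, distance) neighbor pairs."""
    total = 0.0
    for species, r in neighbors:
        if r <= rc:
            fc = 0.5 * (math.cos(math.pi * r / rc) + 1.0)
            total += weight(species) * math.exp(-eta * r * r) * fc
    return total

def augmented(neighbors, species_list, etas, rc):
    """One sub-fingerprint per species, appended one after another."""
    fingerprint = []
    for s in species_list:
        subset = [(t, r) for t, r in neighbors if t == s]
        fingerprint.extend(radial_sum(subset, eta, rc) for eta in etas)
    return fingerprint

def weighted(neighbors, weights, etas, rc):
    """Single fingerprint; each term is scaled by a species weight."""
    return [radial_sum(neighbors, eta, rc, weight=lambda s: weights[s])
            for eta in etas]

# Illustrative neighbor list: (species, distance in Å).
neighbors = [("Ga", 2.4), ("As", 2.5), ("As", 4.0), ("In", 3.9)]
etas = (0.05, 0.5)
atomic_numbers = {"Ga": 31, "As": 33, "In": 49}

fp_augmented = augmented(neighbors, ["Ga", "As", "In"], etas, rc=6.5)
fp_weighted = weighted(neighbors, atomic_numbers, etas, rc=6.5)
```

With three species and two η values, the augmented fingerprint has six components while the weighted one keeps two; the weighted components are the weight-scaled sums of the corresponding augmented ones, so the weighted form is more compact but mixes the species contributions.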

Our intention is that the availability of SEING will allow researchers to forgo a custom implementation of every fingerprint that they wish to use; this will allow them to focus on the predictive task at hand. Current fingerprinting methods implemented in SEING include symmetry functions, bispectrum, AGNI and Zernike with more options in the pipeline for future additions. Since it is open-source, SEING also welcomes contributions from the community for bug-tracking and bug-solving, as well as implementation of new fingerprints and the addition of new features. In the next section, we will present examples of using SEING in a machine learning workflow.

4. Sample case study using SEING

To illustrate how SEING can be used in a machine learning workflow, we use a reference dataset published by Huan et al. as part of the development of their machine learning-based AGNI48 force field. The use of a published reference dataset allows us to better evaluate the value of using SEING for quick fingerprinting. It also provides a better comparison with published machine learning accuracy achieved on the same dataset. Density functional theory (DFT)-based molecular dynamics simulation data for Al, Cu, Ti and W performed at a temperature of 300 K are taken from this dataset to use in this case study. More details on the dataset and how it was generated can be found in the paper published by Huan et al.48

We start by extracting all the frames from the MD trajectories and then use SEING to generate BP, Zernike and AGNI fingerprints for each atom in all the frames for all four systems (Al, Cu, Ti and W). Parameters used for the BP fingerprints are η = {0.05, 20.0, 50.0, 100.0} for the radial components G^rad, and η = 0.005, γ = {1.0, −1.0} and ζ = {1.0, 4.0} for the angular components G^ang. Derivatives of G^rad and G^ang were also computed and appended to the fingerprints, leading to a fingerprint dimensionality of 16. A value of n_max = 5 was used to create Zernike fingerprints, for which derivatives were also computed and added to the descriptor, leading to a fingerprint dimensionality of 24. For the AGNI fingerprints, a Gaussian width of 0.1 Å was used, as suggested in the AGNI paper,48 together with 32 uniformly distributed Gaussian centers a_k between 0.0 Å and the cutoff of 6.5 Å, leading to a fingerprint dimensionality of 32 (compared to a dimensionality of 48 in the original paper, which used a cutoff of 8 Å). A cutoff of 6.5 Å was used for all the systems and fingerprinting schemes. The reference databases of fingerprints and associated forces for Al, Cu, Ti and W contained 9568, 9568, 3584 and 4784 entries, respectively.
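The dimensionalities quoted above can be checked by counting parameter combinations. The sketch below rests on assumptions that are not stated in the text: that appending derivatives adds one entry per function, that the Zernike descriptor has one component per (n, l) pair with l ≤ n and n − l even, and that the AGNI centers are spaced uniformly with both end points included.

```python
from itertools import product

# Parameters quoted in the text.
bp_radial_eta = [0.05, 20.0, 50.0, 100.0]
bp_angular = list(product([0.005], [1.0, -1.0], [1.0, 4.0]))  # (eta, gamma, zeta)
n_max = 5
n_agni_centers = 32
cutoff = 6.5  # angstrom

# Assumption: each function contributes its value plus one derivative entry.
bp_dim = 2 * (len(bp_radial_eta) + len(bp_angular))

# Assumption: one Zernike component per (n, l) with l <= n and n - l even.
zernike_pairs = [(n, l) for n in range(n_max + 1)
                 for l in range(n + 1) if (n - l) % 2 == 0]
zernike_dim = 2 * len(zernike_pairs)

# AGNI: one component per Gaussian center a_k (end points included by assumption).
agni_centers = [cutoff * k / (n_agni_centers - 1) for k in range(n_agni_centers)]
agni_dim = len(agni_centers)

print(bp_dim, zernike_dim, agni_dim)  # 16 24 32
```

Under these assumptions the counts reproduce the values of 16, 24 and 32 reported for the BP, Zernike and AGNI fingerprints.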

We then use the k-nearest neighbors algorithm (k-NN) as implemented in the Python scikit-learn package37,38 to train a model using a randomly selected subset representing 20% of the dataset, reserving the other 80% for subsequent testing/evaluation. k-NN is a non-parametric machine learning algorithm that can be used for classification72 and regression.73 When used for regression, the k-NN prediction for a query point is based on the average value associated with its k nearest neighbors, where k is a tunable hyper-parameter. Here we used a weighted average of the forces associated with the neighbors (in fingerprint space) of a given fingerprint, where each contribution to the average is weighted by the inverse of the neighbor distance.
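The inverse-distance weighted average described above can be sketched in a few lines of dependency-free Python. This is an illustration, not the code used in the study: the one-dimensional "fingerprints", the sine-shaped "forces" and the helper name `knn_predict` are assumptions made for the example.

```python
import math
import random

def knn_predict(train_X, train_y, query, k=3):
    """Inverse-distance weighted k-NN regression for a single query point."""
    # Sort training points by Euclidean distance in fingerprint space.
    nearest = sorted((math.dist(x, query), y) for x, y in zip(train_X, train_y))[:k]
    if nearest[0][0] == 0.0:  # exact match: avoid dividing by zero
        return nearest[0][1]
    weights = [1.0 / d for d, _ in nearest]
    return sum(w * y for w, (_, y) in zip(weights, nearest)) / sum(weights)

# Toy stand-in for (fingerprint, force) pairs.
X = [[0.1 * i] for i in range(100)]
y = [math.sin(x[0]) for x in X]

# Random 20% / 80% train/test split, mirroring the proportions in the text.
indices = list(range(len(X)))
random.Random(0).shuffle(indices)
n_train = int(0.2 * len(X))
train_idx, test_idx = indices[:n_train], indices[n_train:]

train_X = [X[i] for i in train_idx]
train_y = [y[i] for i in train_idx]
predictions = [knn_predict(train_X, train_y, X[i], k=3) for i in test_idx]
print(len(train_idx), len(predictions))
```

In scikit-learn, the same weighting scheme is available through `KNeighborsRegressor(n_neighbors=3, weights="distance")`.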

Hyper-parameter tuning was performed with 10-fold cross-validation, whereby the training dataset was randomly split into ten subsets of approximately equal size: training was performed on nine of them, and the tenth subset was reserved for validation. This process was repeated ten times with the same k value, so that each subset served once for validation, and the mean score achieved was recorded as an estimate of the generalization error. A simple grid search on k values showed that, for both BP and Zernike fingerprints, a k value of 3 is sufficient for accurate predictions; this value was therefore used for training and subsequent testing. Although not shown here, we have found that the R2 score decreases for high k values. We attribute this to the fact that the further away the neighbors are, the more dissimilar they are to the chemical environment of interest, and they should therefore not be used in the force average.
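A minimal sketch of this tuning procedure, written without external dependencies, is given below. The toy data, the candidate k values and the helper names are assumptions made for illustration; with scikit-learn, the same search could be set up with `GridSearchCV` and `cv=10`.

```python
import math
import random

def knn_predict(train_X, train_y, query, k):
    """Inverse-distance weighted average over the k nearest training points."""
    nearest = sorted((math.dist(x, query), y) for x, y in zip(train_X, train_y))[:k]
    if nearest[0][0] == 0.0:  # exact match: avoid dividing by zero
        return nearest[0][1]
    weights = [1.0 / d for d, _ in nearest]
    return sum(w * y for w, (_, y) in zip(weights, nearest)) / sum(weights)

def r2_score(y_true, y_pred):
    """Coefficient of determination."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def cross_val_score(X, y, k, n_folds=10, seed=0):
    """Mean R2 over n_folds splits; each fold is held out exactly once."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::n_folds] for i in range(n_folds)]
    scores = []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        tr_X = [X[i] for i in train]
        tr_y = [y[i] for i in train]
        pred = [knn_predict(tr_X, tr_y, X[j], k) for j in fold]
        scores.append(r2_score([y[j] for j in fold], pred))
    return sum(scores) / n_folds

# Toy stand-in for the training set of (fingerprint, force) pairs.
X = [[0.05 * i] for i in range(200)]
y = [math.sin(x[0]) for x in X]

# Simple grid search over candidate k values.
cv_scores = {k: cross_val_score(X, y, k) for k in (1, 3, 5, 10, 50)}
best_k = max(cv_scores, key=cv_scores.get)
print(cv_scores, best_k)
```

On this toy problem the mean score drops for the largest k, which is the same qualitative trend reported above for the force data.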

As shown in Fig. 4, for all twelve cases studied here, we found that the k-nearest neighbor models lead to excellent force prediction, with root mean square (RMS) errors on the order of 0.0001 eV Å−1 and mean absolute errors (MAE) on the order of 0.01 eV Å−1. Overall, the AGNI fingerprints show superior performance compared to both the BP and Zernike fingerprints. The least accurate model among the twelve cases is the BP-based aluminum model, which showed an R2 score of 0.88 and an RMS error of 0.0016 eV Å−1. We have verified (not shown here) that this relatively low performance is due to the training size of only 20% of the dataset, and we attribute the poor performance for a small training set to the fact that BP fingerprints, with the parameters used here, are unable to properly capture subtle differences in Al configurations. Better performance could be achieved if the BP parameters were tuned to allow for higher spatial resolution of the BP fingerprints. By comparison, the best RMS error achieved in the original study published by Huan et al.48 was 0.016 eV Å−1, one order of magnitude higher than the worst RMS error achieved in this study. However, Huan et al.'s study included other comprehensive and rigorous quality metrics (in addition to the RMS error) which are not used in this comparison with our method.
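For reference, the three metrics quoted in this paragraph are assumed here to follow their standard definitions. The symbols below are introduced only for illustration: F_i^pred and F_i^ref are the predicted and reference force values for test sample i, N is the number of test samples, and the barred quantity is the mean reference force.

```latex
\mathrm{RMS} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(F_i^{\mathrm{pred}} - F_i^{\mathrm{ref}}\right)^{2}},
\qquad
\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left|F_i^{\mathrm{pred}} - F_i^{\mathrm{ref}}\right|,
\qquad
R^{2} = 1 - \frac{\sum_{i=1}^{N}\left(F_i^{\mathrm{pred}} - F_i^{\mathrm{ref}}\right)^{2}}
                 {\sum_{i=1}^{N}\left(F_i^{\mathrm{ref}} - \bar{F}^{\mathrm{ref}}\right)^{2}}
```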


Fig. 4 Performance of k-nearest neighbor machine learning models for three different fingerprinting schemes, AGNI (a, d, g and j), BP (b, e, h and k) and Zernike (c, f, i and l), and four different materials systems: Al (a–c), W (d–f), Cu (g–i) and Ti (j–l). Root mean square errors as low as 0.03 meV Å−1 are achieved for the force prediction. The AGNI fingerprint outperforms both BP and Zernike in this study.

Beyond illustrating how the capabilities afforded by SEING can be leveraged to quickly build machine learning-based predictive models, this study also shows that k-NN is a promising alternative to neural network,18,74 Gaussian process,28,75 kernel ridge regression,57,76 and support vector machine regression77 models, which are commonly used by the community.10 Some advantages offered by k-NN include faster training, easier implementation and the ability to model highly unusual functions with no assumptions regarding their form. Moreover, the neighbor distance can be used as a proxy for the quality of the prediction. Although this latter advantage is not exclusive to k-NN, it is more interesting here because the entire algorithm is already based on neighbor distances. In practice, this means that if the nearest neighbors are too far from the test point, one can reasonably have less confidence in the prediction and use that as an opportunity to populate that region of fingerprint space with more training points. A similar idea was previously suggested by Janet et al.69 The k-NN algorithm also allows the user to forgo clustering and sampling strategies such as the ones used in the development of the AGNI method,48 whilst still achieving excellent predictive capabilities. The suitability of k-NN based force models for conducting molecular dynamics simulations has not been explored here and remains an open question. Also, since the k-NN method is, by design, a local interpolation in the region of the test point, it is not expected to generalize to highly dissimilar systems. However, extended extrapolation capabilities are usually not expected of non-physics based methods such as machine learning-based force fields.
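The use of neighbor distance as a confidence proxy can be sketched as follows. The threshold `max_dist`, the use of the mean neighbor distance as the indicator, and the toy data are hypothetical choices made for illustration; a suitable threshold would have to be calibrated for each fingerprint and system.

```python
import math

def predict_with_confidence(train_X, train_y, query, k=3, max_dist=0.5):
    """Return (prediction, mean neighbor distance, trusted flag)."""
    nearest = sorted((math.dist(x, query), y) for x, y in zip(train_X, train_y))[:k]
    mean_dist = sum(d for d, _ in nearest) / len(nearest)
    if nearest[0][0] == 0.0:  # exact match: avoid dividing by zero
        pred = nearest[0][1]
    else:
        weights = [1.0 / d for d, _ in nearest]
        pred = sum(w * y for w, (_, y) in zip(weights, nearest)) / sum(weights)
    # Flag the prediction as untrusted when the neighbors are far away.
    return pred, mean_dist, mean_dist <= max_dist

# Toy training set covering fingerprint values between 0.0 and 0.6.
train_X = [[0.0], [0.2], [0.4], [0.6]]
train_y = [0.0, 0.4, 0.8, 1.2]

inside = predict_with_confidence(train_X, train_y, [0.3])   # well covered
outside = predict_with_confidence(train_X, train_y, [3.0])  # far from all data
print(inside, outside)
```

Queries flagged as untrusted in this way are natural candidates for new reference calculations to be added to the training set.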

5. Conclusions

The increasing number of molecular fingerprints for machine learning studies proposed in the literature has created the need for a software package that can serve as a repository for existing and future advances in fingerprinting. SEING, the C/C++ package that we have built, is intended for exactly that purpose. It can significantly streamline and reduce the time needed to perform machine learning-based computational materials science studies. In this paper, after a succinct review of existing fingerprints in the literature, we provided the key design principles of SEING and illustrated how it can be leveraged to quickly build models for force prediction, as in the case of Al, Cu, Ti and W crystals. Our k-NN models show excellent predictive capabilities, achieving generalization RMS errors as low as 0.0003 eV Å−1 on reserved test data accounting for 80% of the reference dataset. With the open-source release of SEING, we hope the community will embrace its use and further contribute to its development.

Conflicts of interest

There are no conflicts to declare.

Acknowledgements

The authors wish to acknowledge the following members of the Clancy group for useful discussions: Henry Herbol and Nikita Sengar. Mardochee Reveil thanks the GEM Consortium and the Colman family for generous funding through the GEM and Colman Fellowship at Cornell. Mardochee Reveil also thanks Corning Incorporated for their support of his graduate research. This work benefited from computing resources provided by the Cornell Institute of Computational Science and Engineering (ICSE).

References

  1. I. Kononenko, Artif. Intell. Med., 2001, 23, 89–109.
  2. K. K. Wong, L. Wang and D. Wang, Comput. Med. Imaging Graph., 2017, 57, 1–3.
  3. K. Kourou, T. P. Exarchos, K. P. Exarchos, M. V. Karamouzis and D. I. Fotiadis, Comput. Struct. Biotechnol. J., 2015, 13, 8–17.
  4. A. Graves, A. R. Mohamed and G. Hinton, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, 2013, pp. 6645–6649.
  5. L. Deng, G. Hinton and B. Kingsbury, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, 2013, pp. 8599–8603.
  6. A. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, A. Coates and A. Y. Ng, ArXiv e-prints, 2014.
  7. D. Yi, Z. Lei, S. Liao and S. Z. Li, ArXiv e-prints, 2014.
  8. S.-J. Wang, H.-L. Chen, W.-J. Yan, Y.-H. Chen and X. Fu, Neural Process Lett., 2014, 39, 25–43.
  9. S. E. Thompson, F. Mullally, J. Coughlin, J. L. Christiansen, C. E. Henze, M. R. Haas and C. J. Burke, Astrophys. J., 2015, 812, 46.
  10. F. A. Faber, L. Hutchison, B. Huang, J. Gilmer, S. S. Schoenholz, G. E. Dahl, O. Vinyals, S. Kearnes, P. F. Riley and O. A. von Lilienfeld, J. Chem. Theory Comput., 2017, 13, 5255–5264.
  11. M. Fernandez, N. R. Trefiak and T. K. Woo, J. Phys. Chem. C, 2013, 117, 14095–14105.
  12. W. W. Tipton and R. G. Hennig, J. Phys.: Condens. Matter, 2013, 25, 495401.
  13. S. Curtarolo, W. Setyawan, S. Wang, J. Xue, K. Yang, R. H. Taylor, L. J. Nelson, G. L. Hart, S. Sanvito, M. Buongiorno-Nardelli, N. Mingo and O. Levy, Comput. Mater. Sci., 2012, 58, 227–235.
  14. B. Meredig, A. Thompson, J. W. Doak, M. Aykol, S. Rühl and C. Wolverton, npj Comput. Mater., 2015.
  15. G. Hautier, C. C. Fischer, A. Jain, T. Mueller and G. Ceder, Chem. Mater., 2010, 22, 3762–3767.
  16. J. Behler, J. Chem. Phys., 2011, 134, 074106.
  17. J. Behler, Phys. Chem. Chem. Phys., 2011, 13, 17930–17955.
  18. J. Behler, Angew. Chem., Int. Ed., 2017, 56, 12828–12840.
  19. C. M. Handley and P. L. A. Popelier, J. Phys. Chem. A, 2010, 114, 3371–3383.
  20. A. R. Oganov and C. W. Glass, J. Chem. Phys., 2006, 124, 244704.
  21. G. Pilania, A. Mannodi-Kanakkithodi, B. P. Uberuaga, R. Ramprasad, J. E. Gubernatis and T. Lookman, Sci. Rep., 2016, 19375.
  22. L. Ward and C. Wolverton, Curr. Opin. Solid State Mater. Sci., 2017, 21, 167–176.
  23. A. Khorshidi and A. A. Peterson, Comput. Phys. Commun., 2016, 207, 310–324.
  24. H. Gassner, M. Probst, A. Lauenstein and K. Hermansson, J. Phys. Chem. A, 1998, 102, 4596–4605.
  25. M. Rupp, A. Tkatchenko, K.-R. Müller and O. A. von Lilienfeld, Phys. Rev. Lett., 2012, 108, 058301.
  26. K. Hansen, F. Biegler, R. Ramakrishnan, W. Pronobis, O. A. von Lilienfeld, K.-R. Müller and A. Tkatchenko, J. Phys. Chem. Lett., 2015, 6, 2326–2331.
  27. J. Behler and M. Parrinello, Phys. Rev. Lett., 2007, 98, 146401.
  28. A. P. Bartók, M. C. Payne, R. Kondor and G. Csányi, Phys. Rev. Lett., 2010, 104, 136403.
  29. N. Artrith, A. Urban and G. Ceder, Phys. Rev., 2017, 96, 014112.
  30. L. M. Ghiringhelli, J. Vybiral, S. V. Levchenko, C. Draxl and M. Scheffler, Phys. Rev. Lett., 2015, 114, 105503.
  31. K. Yao, J. E. Herr and J. Parkhill, J. Chem. Phys., 2017, 146, 014106.
  32. K. Yao, J. E. Herr, S. N. Brown and J. Parkhill, J. Phys. Chem. Lett., 2017, 8, 2689–2694.
  33. A. Sadeghi, S. A. Ghasemi, B. Schaefer, S. Mohr, M. A. Lill and S. Goedecker, J. Chem. Phys., 2013, 139, 184118.
  34. F. Pietrucci and W. Andreoni, Phys. Rev. Lett., 2011, 107, 085504.
  35. L. Zhu, M. Amsler, T. Fuhrer, B. Schaefer, S. Faraji, S. Rostami, S. A. Ghasemi, A. Sadeghi, M. Grauzinyte, C. Wolverton and S. Goedecker, J. Chem. Phys., 2016, 144, 034203.
  36. M. Abadi, et al., TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems, https://tensorflow.org, 2015.
  37. L. Buitinck, G. Louppe, M. Blondel, F. Pedregosa, A. Mueller, O. Grisel, V. Niculae, P. Prettenhofer, A. Gramfort, J. Grobler, R. Layton, J. VanderPlas, A. Joly, B. Holt and G. Varoquaux, ECML PKDD Workshop: Languages for Data Mining and Machine Learning, 2013, pp. 108–122.
  38. F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot and E. Duchesnay, J. Mach. Learn. Res., 2011, 12, 2825–2830.
  39. F. Chollet, et al., Keras, https://github.com/fchollet/keras, 2015.
  40. D. Filimonov, V. Poroikov, Y. Borodina and T. Gloriozova, J. Chem. Inf. Comput. Sci., 1999, 39, 666–670.
  41. M. Sastry, J. F. Lowrie, S. L. Dixon and W. Sherman, J. Chem. Inf. Model., 2010, 50, 771–784.
  42. P. Willett, Drug Discovery Today, 2006, 11, 1046–1053.
  43. C. Steinbeck, Y. Han, S. Kuhn, O. Horlacher, E. Luttmann and E. Willighagen, J. Chem. Inf. Comput. Sci., 2003, 43, 493–500.
  44. G. Landrum, RDKit: Open-source cheminformatics, http://rdkit.org.
  45. C. W. Yap, J. Comput. Chem., 2011, 32, 1466–1474.
  46. J. Behler, J. Chem. Phys., 2016, 145, 170901.
  47. V. Botu and R. Ramprasad, Phys. Rev. B: Condens. Matter Mater. Phys., 2015, 92, 094306.
  48. T. D. Huan, R. Batra, J. Chapman, S. Krishnan, L. Chen and R. Ramprasad, npj Comput. Mater., 2017, 3, 89–109.
  49. K. V. J. Jose, N. Artrith and J. Behler, J. Chem. Phys., 2012, 136, 194111.
  50. M. Gastegger, L. Schwiedrzik, M. Bittermann, F. Berzsenyi and P. Marquetand, ArXiv e-prints, 2017.
  51. A. P. Bartók, R. Kondor and G. Csányi, Phys. Rev. B: Condens. Matter Mater. Phys., 2013, 87, 184115.
  52. G. Ferré, T. Haut and K. Barros, J. Chem. Phys., 2017, 146, 114107.
  53. D. Rogers and M. Hahn, J. Chem. Inf. Model., 2010, 50, 742–754.
  54. K. T. Schütt, H. Glawe, F. Brockherde, A. Sanna, K. R. Müller and E. K. U. Gross, Phys. Rev. B: Condens. Matter Mater. Phys., 2014, 89, 205118.
  55. O. A. von Lilienfeld, R. Ramakrishnan, M. Rupp and A. Knoll, Int. J. Quantum Chem., 2015, 115, 1084–1093.
  56. B. Huang and O. A. von Lilienfeld, J. Chem. Phys., 2016, 145, 161102.
  57. H. Huo and M. Rupp, ArXiv e-prints, 2017.
  58. S. Kearnes, K. McCloskey, M. Berndl, V. Pande and P. Riley, J. Comput.-Aided Mol. Des., 2016, 30, 595–608.
  59. O. Isayev, C. Oses, C. Toher, E. Gossett, S. Curtarolo and A. Tropsha, Nat. Commun., 2017, 8, 15679.
  60. S. M. Kandathil, T. L. Fletcher, Y. Yuan, J. Knowles and P. L. A. Popelier, J. Comput. Chem., 2013, 34, 1850–1861.
  61. T. D. Huan, A. Mannodi-Kanakkithodi and R. Ramprasad, Phys. Rev. B: Condens. Matter Mater. Phys., 2015, 92, 014106.
  62. J. P. Janet and H. J. Kulik, J. Phys. Chem. A, 2017, 121, 8939–8954.
  63. C. R. Collins, G. J. Gordon, O. A. von Lilienfeld and D. J. Yaron, arXiv, 2016, https://arxiv.org/abs/1701.06649.
  64. B. Jiang and H. Guo, J. Chem. Phys., 2013, 139, 054112.
  65. J. Li, B. Jiang and H. Guo, J. Chem. Phys., 2013, 139, 204103.
  66. R. Kondor, CoRR, 2007, abs/cs/0701127.
  67. A. Thompson, L. Swiler, C. Trott, S. Foiles and G. Tucker, J. Comput. Phys., 2015, 285, 316–330.
  68. M. Novotni and R. Klein, Comput. Aided Des., 2004, 36, 1047–1062.
  69. J. P. Janet and H. J. Kulik, Chem. Sci., 2017, 8, 5137–5152.
  70. N. Artrith and A. Urban, Comput. Mater. Sci., 2016, 114, 135–150.
  71. K. Yao, J. E. Herr, D. W. Toth, R. Mcintyre and J. Parkhill, ArXiv e-prints, 2017.
  72. T. Cover and P. Hart, IEEE Trans. Inf. Theory, 1967, 13, 21–27.
  73. L. Devroye, L. Gyorfi, A. Krzyzak and G. Lugosi, Ann. Stat., 1994, 22, 1371–1385.
  74. T. B. Blank, S. D. Brown, A. W. Calhoun and D. J. Doren, J. Chem. Phys., 1995, 103, 4129–4137.
  75. A. P. Bartók and G. Csányi, Int. J. Quantum Chem., 2015, 115, 1051–1057.
  76. M. Rupp, Int. J. Quantum Chem., 2015, 115, 1058–1073.
  77. R. M. Balabin and E. I. Lomakina, Phys. Chem. Chem. Phys., 2011, 13, 11710–11718.

This journal is © The Royal Society of Chemistry 2018