Mardochee Reveil* and Paulette Clancy

Robert Frederick Smith School of Chemical and Biomolecular Engineering, Cornell University, Ithaca, NY 14853, USA. E-mail: mr937@cornell.edu

Received 9th January 2018, Accepted 19th February 2018

First published on 20th February 2018

Direct mapping between material structures and properties for various classes of materials is often the ultimate goal of materials researchers. Recent progress in the field of machine learning has created a unique path to develop such mappings based on empirical data. This new opportunity has motivated the development of advanced structural representations suitable for use with current machine learning algorithms. A number of such representations, termed “molecular fingerprints” or descriptors, have been proposed over the years for this purpose. In this paper, we introduce a classification framework to better explain and interpret existing fingerprinting schemes in the literature, with a focus on those with spatial resolution. We then present the implementation of SEING, a new codebase for computing those fingerprints, and we demonstrate its capabilities by building k-nearest neighbor (k-NN) models for force prediction that achieve a generalization accuracy of 0.1 meV Å^{−1} and an R^{2} score as high as 0.99 at testing. Our results indicate that simple and generally overlooked k-NN models could be very promising compared to approaches such as neural networks, Gaussian processes, and support vector machines, which are more commonly used for machine learning-based predictions in computational materials science.

## Design, System, Application

Within the structure–property relationship framework, the atomistic structure of a material is a strong determinant of its properties and suitability for specific applications. Therefore, material design projects usually require a proper understanding of the underlying structures, acquirable using quantum or classical calculations. An exciting development in computational material design is the use of artificial intelligence (AI) and machine learning (ML) techniques to accelerate the analysis, design, and discovery of “structural signatures” that impact properties. However, a key requirement for this new progression to become more useful has been the availability of spatially resolved numerical representations of atomic structures that are invariant under all property-preserving operations, such as geometric translations and rotations. Although this requirement has proven more elusive than expected, significant progress has been made over the past decade or so, with proposed representations achieving various degrees of performance. In this paper, those representation schemes, traditionally termed “fingerprints”, are categorized and reviewed. A new package called SEING is also created and released to the community to streamline and facilitate the use of those fingerprints in ML applications. This work can help accelerate the use of AI in materials science applications for better understanding and easier discovery of a wide range of new materials.

“Feature extraction,” or “feature engineering,” is a key step in machine learning studies to ensure that accurate predictive models can be built from available data. In this context, feature engineering refers to the process of identifying, evaluating and applying one or more transformations (usually mathematical operations) to a dataset to enhance its suitability for machine learning use. For applications involving the prediction of properties based on molecular structures, this process usually leads to the computation of a vector or matrix called a “fingerprint,” or “descriptor,” that represents a chemical structure or environment (see Fig. 1). Beyond machine learning applications, fingerprints obtained in this manner can also be used for a systematic study of structure–property relationships, including structural analysis and structural comparison for the identification of configurations of interest and for property prediction.^{18}

To be suitable for machine learning studies, especially force field development, fingerprints have to satisfy a number of requirements, such as invariance under rotation, translation and permutation of atoms of the same chemical nature.^{18,22,23} Differentiability and speed of computation are also desirable.^{18,23} The need for symmetrization of atomic coordinates to satisfy the requirement of invariance under permutation of chemically identical atoms, especially for use in neural networks, has been recognized for at least two decades.^{24} This need arises because Cartesian coordinates, traditionally used to represent molecular systems, are not invariant under those geometric operations. For example, permuting two atoms of the same type does not affect the properties of a molecular system, but it does change its Cartesian coordinates; hence the need for alternative representations that are permutation-invariant. By the same argument, rotation and translation invariance are also necessary. Differentiability is usually desirable for force predictions.
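These invariance requirements can be made concrete with a short sketch. The Python snippet below (with purely hypothetical coordinates) shows that raw Cartesian coordinates change under atom permutation and rotation, while a simple descriptor built from sorted pair distances does not:

```python
import numpy as np

def pair_distances(coords):
    """Sorted list of all pair distances: invariant to rotation,
    translation, and permutation of identical atoms."""
    n = len(coords)
    d = [np.linalg.norm(coords[i] - coords[j])
         for i in range(n) for j in range(i + 1, n)]
    return np.sort(d)

# Toy 3-atom configuration (hypothetical coordinates, in Angstrom)
coords = np.array([[0.0, 0.0, 0.0],
                   [1.0, 0.0, 0.0],
                   [0.0, 1.5, 0.0]])

# Permuting two atoms changes the Cartesian representation...
permuted = coords[[1, 0, 2]]
assert not np.allclose(coords, permuted)

# ...but leaves the sorted pair-distance descriptor unchanged.
assert np.allclose(pair_distances(coords), pair_distances(permuted))

# A rigid rotation about z also leaves the descriptor unchanged.
theta = 0.7
R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta),  np.cos(theta), 0.0],
              [0.0, 0.0, 1.0]])
rotated = coords @ R.T
assert np.allclose(pair_distances(coords), pair_distances(rotated))
```

Note that this toy descriptor satisfies the invariance requirements but not, by itself, the differentiability and transferability needs of a practical fingerprint.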

Many different fingerprints have been proposed over the years for machine learning applications. Some examples include the Coulomb matrix,^{25} bag of bonds,^{26} symmetry functions,^{27} bispectrum,^{28} Zernike,^{23} and others. More recently, Artrith et al. proposed a way to generate fingerprints for compositions containing many species.^{29} A method to automatically find appropriate descriptors was suggested by Ghiringhelli et al.^{30} A fingerprint called a “depth map” (D-map) that readily inverts into molecular geometries was tested by Yao et al.^{31} Other descriptors such as bonds-in-molecule^{32} and many-body expansion approaches have also been recently explored. Sadeghi et al. evaluated a number of fingerprints taking the form of n × n matrices, where n is the number of atoms in the system,^{33} including contact^{34} and overlap^{35} matrices, which are useful for evaluating similarity between structures. Given the large number of fingerprints proposed in the literature, we introduce a fingerprinting categorization scheme that affords an easier and more systematic analysis.

With such a large number of available descriptors, the use of machine learning for molecular studies usually begins with an evaluation to find the best fingerprint for the problem at hand. To date, most fingerprints have not been readily available to the community in “off-the-shelf” code. As a result, researchers invariably have to implement fingerprints from scratch, a time-consuming and wasteful duplication of effort across the community. This lack of an accessible “fingerprinting” codebase provides one of the main motivations for this paper.

Here we introduce a new C/C++ package called SEING (an old French word for a signature) that focuses solely on fingerprint generation. The subsequent steps of selecting, and using, a suitable machine learning algorithm are left outside the scope of this platform, as several mature tools are already available for such tasks from the machine learning community (e.g. TensorFlow,^{36} scikit-learn,^{37,38} Keras,^{39} and so on). SEING aims to be a comprehensive repository for molecular fingerprints proposed in the literature that are specifically intended for machine learning applications. As such, SEING is built in a modular fashion so that new fingerprint techniques can readily be incorporated in the future. As illustrated in Fig. 1, this code is intended to significantly reduce the time for implementing and generating spatially resolved molecular fingerprints from hours/days to seconds/minutes. In this sense, it will alleviate, if not eliminate, perceived barriers to conducting machine learning studies of molecular systems by increasing accessibility in the same way that, say, LAMMPS increased accessibility to molecular dynamics simulations.

A number of other fingerprints exist in the field of drug design to which the larger materials science community probably owes the term “molecular fingerprint”.^{40–42} Those molecular representations, for the most part, were created years before the relatively new fingerprints addressed in this paper and were specifically designed for comparisons and quantitative structure–activity relationship studies of molecules relevant to biological applications. Notwithstanding their usefulness otherwise, most of those fingerprints are not particularly suitable for spatially resolved machine learning studies of molecular systems as they usually do not satisfy the requirements discussed above. They are intentionally left outside of the scope of this paper and the SEING package. Other tools such as CDK,^{43} RDKit^{44} and PaDEL^{45} have been created and released to the therapeutic drug discovery community to help streamline the process of generating those fingerprints.

The remainder of this paper is organized as follows: with a focus on spatially resolved fingerprinting methods, we start by proposing a categorization scheme (a taxonomy of sorts) capable of incorporating most, if not all, of the existing fingerprints in the literature that are suitable for molecular systems in materials science applications. We then review a handful of selected fingerprints based on their relevance to, and prior use in, force field development. A comprehensive review of all existing fingerprinting strategies is beyond the scope of this study, but the reader is referred to the following partial reviews for more details: Ward and Wolverton,^{22} Behler^{46,18} and Faber et al.^{10} Finally, we describe the philosophy behind the development of the SEING package and finish with sample applications of a SEING-based workflow for force predictions. The reader should note that the terms “fingerprint” and “descriptor” are used interchangeably in this paper and, unless noted otherwise, refer to molecular representations developed for machine learning applications.

Fingerprints | Classification | 2007–2010 | 2011–2013 | 2014–2016 | 2017
---|---|---|---|---|---
Atom-centered | Symmetry functions | Behler–Parrinello^{27} | Behler^{16} | AGNI^{47} | AGNI2.0 (ref. 48)
 | | | Jose–Artrith–Behler^{49} | | wACSF^{50}
 | Basis set decomposition | Bispectrum^{28} | | Zernike^{23} | Artrith–Urban–Ceder^{29}
 | Kernel-based | | SOAP^{51} | | GRAPE^{52}
Global | Pair distances only | Extended connectivity^{53} | Coulomb matrix^{25} | PRDF^{54} | Bonds in molecule^{32}
 | | | Contact matrix^{34} | Bag of bonds^{26} |
 | | | Overlap matrix^{35} | Fourier series^{55} |
 | | | Hamiltonian matrix^{33} | |
 | | | Hessian matrix^{33} | |
 | Distances, angles, dihedrals | | | BAML^{56} | Many-body expansion^{31}
 | | | | | MBTR^{57}
 | | | | | MARAD^{10}
 | | | | | HDAD^{10}
 | Graph-based | | | Molecular graph^{58} | Property-labeled materials fragments^{59}
 | Others | ALF^{60} | | Motif-based^{61} | D-Map^{31}
 | | | | | RAC^{62}

Atom-centered fingerprints can be further divided according to whether they are based on symmetry functions, basis set decomposition or kernels. Symmetry function-based representations use summations over bonds, angles or dihedrals of the same nature, or summations over functions thereof, for all atoms within a given sphere around the atom of interest. Fingerprints that fall within this category include Behler–Parrinello^{16,27} and Jose–Artrith–Behler.^{49} They usually differ by the functional forms used in the summation. Basis set decompositions, on the other hand, rely on a local atomic density function defined around the atom of interest, which is then expressed as a linear combination of a suitable basis set. Subsequent operations are usually applied to ensure invariance under rotation and/or translation, and so on. Examples of such fingerprints include bispectrum,^{28} Zernike,^{23} and the Artrith–Urban–Ceder method, which was specially developed to support multi-component systems.^{29} The smooth overlap of atomic positions (SOAP)^{51} and graph-approximated energy (GRAPE)^{52} methods fall into a third category in which local environments are compared directly with the help of an appropriate kernel function, without the explicit computation of a fingerprint.

Global fingerprints, on the other hand, can be divided into ones that involve only pair distances and ones that include higher-order interactions. The former subcategory has the largest number of fingerprints and includes extended connectivity,^{53} Coulomb matrix,^{25} as well as contact,^{34} overlap,^{35} Hessian and Hamiltonian matrices,^{33} partial radial distribution functions (PRDF),^{54} bag-of-bonds,^{26} Fourier series decomposition of radial distribution functions,^{55} bonds-in-molecule,^{32} connectivity count and encoded distances.^{63} Those fingerprints are built using different schemes or functions of some or all pair distances in the molecule or crystal of interest. Sometimes, the fingerprints also use information related to the nature of the different elements, such as partial charges, atomic numbers, etc.
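To make the pair-distance subcategory concrete, the sketch below implements the Coulomb matrix in its standard form (M_{ii} = 0.5Z_{i}^{2.4}, M_{ij} = Z_{i}Z_{j}/r_{ij}) and uses the sorted eigenvalue spectrum as a permutation-invariant vector; the water-like geometry is purely illustrative, not an optimized structure:

```python
import numpy as np

def coulomb_matrix(Z, coords):
    """Coulomb matrix: diagonal encodes the element (0.5 * Z^2.4),
    off-diagonal encodes pairwise nuclear repulsion Z_i Z_j / r_ij."""
    n = len(Z)
    M = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i == j:
                M[i, j] = 0.5 * Z[i] ** 2.4
            else:
                M[i, j] = Z[i] * Z[j] / np.linalg.norm(coords[i] - coords[j])
    return M

def cm_eigen_fingerprint(Z, coords):
    """Sorted eigenvalue spectrum: a permutation-invariant vector."""
    return np.sort(np.linalg.eigvalsh(coulomb_matrix(Z, coords)))[::-1]

# water-like toy geometry (coordinates in Angstrom, illustrative only)
Z = np.array([8, 1, 1])
coords = np.array([[0.0, 0.0, 0.0],
                   [0.96, 0.0, 0.0],
                   [-0.24, 0.93, 0.0]])

fp = cm_eigen_fingerprint(Z, coords)
# swapping the two hydrogens leaves the fingerprint unchanged
fp_swapped = cm_eigen_fingerprint(Z[[0, 2, 1]], coords[[0, 2, 1]])
assert np.allclose(fp, fp_swapped)
```

The eigenvalue spectrum is one of several published ways to make the Coulomb matrix permutation-invariant; sorting rows by norm is another common choice.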

The latter subcategory includes recently developed fingerprints such as bond angle machine learning (BAML),^{56} many-body expansion (MBE),^{31} many-body tensor representation,^{57} molecular atomic radial angular distribution (MARAD),^{10} and the histogram of distances, angles and dihedral angles (HDAD).^{10} Those fingerprints include higher-order interaction terms such as angles and dihedrals and usually lead to more accurate predictions compared to their pair distances-only counterparts.

A third subcategory of global fingerprints is comprised of graph-based descriptors, such as molecular graphs^{58} and property-labelled materials fragments,^{59} which are suitable for use in graph-based convolutional neural networks. Some other descriptors can be included in their own subcategory; these include the atomic local frame (ALF),^{60} motif-based fingerprint^{61} and the revised autocorrelation (RAC) fingerprints^{62} which are inspired by descriptors used in cheminformatics. This subcategory would also include the D-map^{31} fingerprint that can be readily mapped back to chemical structures.

G^{rad}_{i} = Σ_{j≠i} e^{−η(r_{ij} − R_{s})^{2}} f_{c}(r_{ij}) | (1) |

G^{ang}_{i} = 2^{1−ζ} Σ_{j,k≠i} (1 + γ cos θ_{ijk})^{ζ} e^{−η(r_{ij}^{2} + r_{ik}^{2} + r_{jk}^{2})} f_{c}(r_{ij}) f_{c}(r_{ik}) f_{c}(r_{jk}) | (2) |

f_{c}(r) = 0.5[cos(πr/R_{c}) + 1] for r ≤ R_{c}, and 0 otherwise | (3) |

Other types of symmetry and cutoff functions have been proposed by different authors. For example, pair-centered symmetry functions have been used by Jose et al.^{49} to construct neural network potentials. Another approach, based on permutation-invariant polynomials, has been suggested by Jiang et al.^{64} and by Li et al.^{65} In this method, Cartesian coordinates are replaced by a summation over symmetrized Morse-like monomials that include all possible nuclear permutations in the system. AGNI^{47,48} fingerprints are another example of symmetry function-based fingerprints, developed to predict atomic forces directly. As shown by eqn (4), AGNI fingerprints are very similar to the G^{rad} components of the BP fingerprints (see eqn (1)). However, they include a direction-specific coefficient given by the ratio of the α component (where α = x, y or z) of the pair distance, r^{α}_{ij}, divided by the pair distance r_{ij} between atoms i and j, akin to a derivative of the BP fingerprints in that direction. In addition, especially in the newer version of AGNI, those authors showed that using the Gaussian centers a_{k} (labeled R_{s} in G^{rad}) as the main parameter, rather than the width w (labeled η in G^{rad}) of the Gaussians, led to better predictive capabilities for the machine learning models.

V^{α}_{i}(a_{k}) = Σ_{j≠i} (r^{α}_{ij}/r_{ij}) (1/(√(2π)w)) e^{−(r_{ij} − a_{k})^{2}/(2w^{2})} f_{c}(r_{ij}) | (4) |
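The structural difference between a radial symmetry-function component and a direction-resolved AGNI-style component can be sketched as follows. The normalization is simplified here, and the parameter values and neighbor displacement vectors are hypothetical rather than taken from any published potential:

```python
import numpy as np

def f_cut(r, rc):
    """Cosine cutoff function: decays smoothly to zero at r = rc."""
    return np.where(r <= rc, 0.5 * (np.cos(np.pi * r / rc) + 1.0), 0.0)

def g_rad(r_ij, eta, r_s, rc):
    """Behler-Parrinello-style radial symmetry function for one atom,
    summed over the distances r_ij to its neighbors."""
    return np.sum(np.exp(-eta * (r_ij - r_s) ** 2) * f_cut(r_ij, rc))

def agni_component(disp, alpha, a_k, w, rc):
    """AGNI-style component along Cartesian direction alpha (0, 1, 2):
    each Gaussian term is scaled by r_ij^alpha / r_ij, which makes the
    fingerprint direction-resolved (normalization simplified here)."""
    r = np.linalg.norm(disp, axis=1)
    return np.sum((disp[:, alpha] / r)
                  * np.exp(-((r - a_k) / w) ** 2) * f_cut(r, rc))

# hypothetical neighbor displacement vectors (in Angstrom)
disp = np.array([[1.2, 0.0, 0.3],
                 [-0.8, 1.1, 0.0],
                 [0.0, -1.4, 0.9]])
r = np.linalg.norm(disp, axis=1)

radial = g_rad(r, eta=0.05, r_s=0.0, rc=6.5)
directional = agni_component(disp, alpha=0, a_k=1.0, w=0.1, rc=6.5)
```

Note that `radial` is strictly non-negative, whereas `directional` can take either sign, which is what allows AGNI-style fingerprints to map onto force components directly.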

Although symmetry functions have been successfully used for a number of applications, one significant drawback of this approach has been the growth of the descriptor size with the number of elements. Traditionally, a new set of radial functions is created for each atom type and a new set of angular functions for every combination of elements, leading to a significant increase in fingerprint size with the number of species. This usually makes it impractical to use symmetry functions for systems containing more than four elements.^{18} However, the recent weighted atom-centered symmetry functions (wACSF)^{50} method proposed by Gastegger et al. attributes a species-specific weight to each term in the sums for G^{rad} and G^{ang}, hence avoiding the growth of fingerprint dimensionality with the number of elements. A similar approach was used by Artrith et al.^{29}
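The weighting idea can be sketched in a few lines: a single radial sum covers neighbors of all species by weighting each term with a species-dependent factor. The atomic number is used here purely as an example of a possible weight, and the distances and parameters are hypothetical:

```python
import numpy as np

def f_cut(r, rc):
    """Cosine cutoff function."""
    return np.where(r <= rc, 0.5 * (np.cos(np.pi * r / rc) + 1.0), 0.0)

def weighted_g_rad(r_ij, z_j, eta, r_s, rc):
    """wACSF-style radial sum: neighbors of all species enter a single
    sum, with each term weighted by a species-dependent factor
    (the atomic number here, as one possible choice)."""
    return np.sum(z_j * np.exp(-eta * (r_ij - r_s) ** 2) * f_cut(r_ij, rc))

# hypothetical mixed-species neighbors: distances (Angstrom) and Z
r_ij = np.array([1.1, 1.6, 2.3, 2.9])
z_j = np.array([8, 1, 1, 6])  # O, H, H, C

# one descriptor value covers all four species at once; an "augmented"
# scheme would instead need a separate G_rad set per element.
val = weighted_g_rad(r_ij, z_j, eta=0.05, r_s=0.0, rc=6.5)
```

The fingerprint length is thus set only by the number of (η, R_{s}) parameter pairs, independent of how many chemical species are present.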

ρ_{i}(r) = δ(r) + Σ_{j} f_{c}(r_{ij}) w_{j} δ(r − r_{ij}) | (5) |

B_{j_{1},j_{2},j} = Σ_{m,m′} (c^{j}_{m′,m})* Σ_{m_{1},m′_{1}} Σ_{m_{2},m′_{2}} C^{jm}_{j_{1}m_{1}j_{2}m_{2}} C^{jm′}_{j_{1}m′_{1}j_{2}m′_{2}} c^{j_{1}}_{m′_{1},m_{1}} c^{j_{2}}_{m′_{2},m_{2}} | (6) |

This method has two main advantages over symmetry functions. First, it eliminates the need to choose the many parameters that define the spatial resolution of the fingerprint and that, in the case of symmetry functions, inflate the fingerprint size; second, it can be systematically improved by adding more spherical harmonics to the summation (as controlled by J_{max}). This method can also be easily extended to multi-component systems using a factor w_{j} for the contribution of each atom j to the LAD. In their original paper, Bartók et al. used Gaussian process regression with the bispectrum representations to predict a number of properties for carbon, silicon and germanium crystals.^{28} They called the resulting model a Gaussian approximation potential (GAP), which differs from another bispectrum-based approach by Thompson et al., the spectral neighbor analysis potential (SNAP), in which atom energies are assumed to be linearly dependent on the bispectrum components.^{67} This assumption afforded easier fitting of the resulting potential. However, a direct and comprehensive comparison between those two methods has not yet been performed.

c^{m}_{nl} = 〈Z^{m}_{nl}(r, θ, ϕ), ρ(r, θ, ϕ)〉 | (7) |

F_{nl} = (Σ_{m=−l}^{l} |c^{m}_{nl}|^{2})^{1/2} | (8) |

Since the procedure is so similar to the one used for bispectrum fingerprints, Zernike fingerprints share the advantages offered by bispectrum fingerprints and are, moreover, more computationally efficient to calculate.

(9) |

(10) |

(11) |

(12) |

A further class of fingerprints can be constructed based solely on topological or connectivity information and/or the nature of the chemical species and chemical motifs (such as fragments, rings, etc.) in the structure. Such fingerprints do not explicitly encode the spatial coordinates of atoms but can successfully be used to predict various properties using machine learning techniques.^{11,21,69} A number of the fingerprints from cheminformatics (which are excluded from this study) also fall into this class. While the categorization scheme suggested here is somewhat arbitrary, it provides a suitable framework with which to understand and analyze the different fingerprints proposed in the literature.

SEING is the name of the package that we have developed and are hereby releasing with those requirements in mind. SEING is written in C/C++ for fast computation of fingerprints and is designed in a modular fashion for extensibility. Packages such as AMP,^{23} AeNET^{70} and TensorMol^{71} include utilities for fingerprint calculations, but their primary focus is machine learning force field development using neural network approaches, whereas SEING focuses solely on the fingerprinting methods. As such, SEING allows more flexibility in the choice of machine learning algorithm and permits applications beyond machine learning force field development.

An overview flowchart showing how SEING works is given in Fig. 3. The atomic coordinates of the system are typically provided in an XYZ file which is read and manipulated as an “AtomicSystem” object within SEING and used to instantiate the fingerprint calculator of interest. Support for other coordinate file formats will be added in the future. Other inputs such as the parameter values for the fingerprint of interest are provided in an input file. SEING also implements its own neighbor-searching algorithm for faster computation of local fingerprints. In SEING, every fingerprint is implemented as a separate calculator. When a local fingerprint is needed, the “calculate_fingerprint” function of the calculator instance is called, with the atom of interest and its neighbors as arguments. In the case of a global fingerprint, the entire “AtomicSystem” object is used. From a development perspective, this allows any fingerprint-specific logic to be implemented within the calculator class which remains valid as long as an appropriate “calculate_fingerprint” function is exposed.
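The calculator pattern described above can be illustrated with a short Python sketch. SEING itself is written in C/C++ and its actual class interfaces will differ; the “PairDistanceCalculator” below is a made-up toy fingerprint used only to show the shape of the abstraction:

```python
import numpy as np

class FingerprintCalculator:
    """Illustrative base class mirroring the calculator pattern the
    text describes: each fingerprint exposes calculate_fingerprint."""
    def calculate_fingerprint(self, atom, neighbors):
        raise NotImplementedError

class PairDistanceCalculator(FingerprintCalculator):
    """Toy local fingerprint: sorted distances to the nearest
    neighbors, zero-padded to a fixed length."""
    def __init__(self, size=4):
        self.size = size

    def calculate_fingerprint(self, atom, neighbors):
        d = np.sort(np.linalg.norm(neighbors - atom, axis=1))[:self.size]
        return np.pad(d, (0, self.size - len(d)))

# a central atom and its neighbors, as a local calculator would see them
atom = np.zeros(3)
neighbors = np.array([[1.0, 0.0, 0.0],
                      [0.0, 2.0, 0.0]])
calc = PairDistanceCalculator(size=4)
fp = calc.calculate_fingerprint(atom, neighbors)  # -> [1.0, 2.0, 0.0, 0.0]
```

Any fingerprint-specific logic lives inside the calculator class; the surrounding workflow only needs the shared `calculate_fingerprint` entry point, which is what makes new fingerprints easy to drop in.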

From a user's perspective, SEING has minimal requirements for installation and can be easily compiled on most operating systems. Using the code requires a coordinate file and an input file containing the type of fingerprint needed and any fingerprint-specific parameters. SEING implements two strategies to account for systems with multiple species: augmented and weighted. The “augmented” strategy increases the dimensionality of a given fingerprint by appending sub-fingerprints for each species and species combination, whereas in the “weighted” strategy, any summation over atoms is modified by assigning a species-specific weight to each term. This weight can be the atomic number, electronegativity, or any other value chosen by the user. Also, when available, derivatives of a fingerprint can be easily calculated and appended to the feature vector. More details on code installation, instructions for using the code, and how to contribute to the code are provided in the official documentation and user guide, accessible at https://seing.readthedocs.io. The source code is hosted on GitHub at https://github.com/mreveil/seing.

Our intention is that the availability of SEING will allow researchers to forgo a custom implementation of every fingerprint that they wish to use; this will allow them to focus on the predictive task at hand. Current fingerprinting methods implemented in SEING include symmetry functions, bispectrum, AGNI and Zernike with more options in the pipeline for future additions. Since it is open-source, SEING also welcomes contributions from the community for bug-tracking and bug-solving, as well as implementation of new fingerprints and the addition of new features. In the next section, we will present examples of using SEING in a machine learning workflow.

We start by extracting all the frames from the MD trajectories and then using SEING to generate BP, Zernike and AGNI fingerprints for each atom in all the frames for all four systems (Al, Cu, Ti and W). Parameters used for the BP fingerprints are η = {0.05, 20.0, 50.0, 100.0} for G^{rad} radial components, and η = 0.005, γ = {1.0, −1.0} and ζ = {1.0, 4.0} for G^{ang} angular components. Derivatives of G^{rad} and G^{ang} were also computed and appended to the fingerprints, leading to a fingerprint dimensionality of 16. A value of n_{max} = 5 was used to create Zernike fingerprints for which derivatives were also computed and added to the descriptor, leading to a fingerprint dimensionality of 24. For the AGNI fingerprints, a Gaussian width of 0.1 Å was used as suggested in the AGNI paper^{48} and 32 uniformly distributed Gaussian centers a_{k} between a distance of 0.0 Å and cutoff of 6.5 Å were used as parameters leading to a fingerprint dimensionality of 32 (compared to a dimensionality of 48 for the original paper with a cutoff of 8 Å). A cutoff of 6.5 Å was used for all the systems and fingerprinting schemes. The reference database size of the fingerprints and associated forces for Al, Cu, Ti and W was, respectively, 9568, 9568, 3584 and 4784.

We then use the k-nearest neighbors algorithm (k-NN) as implemented in the Python scikit-learn package^{37,38} to train a model using a randomly selected subset representing 20% of the dataset, reserving the other 80% for subsequent testing/evaluation. k-NN is a non-parametric machine learning algorithm that can be used for classification^{72} and regression.^{73} When used for regression, the k-NN prediction for a query point is based on the average value associated with the k nearest neighbors, where k is a tunable hyper-parameter. Here we used a weighted average of the forces associated with the neighbors (in fingerprint space) of a given fingerprint, with contributions weighted by inverse neighbor distance.
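A minimal version of this workflow, using synthetic stand-in fingerprints and targets rather than SEING output and DFT forces, might look like the following:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)

# stand-in data: random "fingerprints" with a smooth synthetic target
# (real runs would load SEING fingerprints and reference forces)
X = rng.uniform(size=(2000, 4))
y = np.sin(X).sum(axis=1)

# 20% of the data for training, 80% held out, as in the text
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.2, random_state=0)

# inverse-distance weighting: closer neighbors count more
knn = KNeighborsRegressor(n_neighbors=3, weights="distance")
knn.fit(X_train, y_train)
print(f"R^2 on held-out set: {knn.score(X_test, y_test):.3f}")
```

The `weights="distance"` option implements the inverse-distance averaging described above; the default `weights="uniform"` would give all k neighbors equal say.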

Hyper-parameter tuning was performed with 10-fold cross-validation, whereby the training dataset was randomly split into ten subsets of approximately equal size: training was performed on nine of them, with the tenth subset reserved for validation. This process was repeated ten times with the same k value, and the mean score achieved was recorded as an estimate of the generalization error. A simple grid search on k values showed that, for both BP and Zernike fingerprints, a k value of 3 is sufficient for accurate predictions and was therefore used for training and subsequent testing. Although not shown here, we have found that the R^{2} score decreases for high k values. We attribute this to the fact that the further away the neighbors, the more dissimilar they are from the chemical environment of interest, and the less they should contribute to the force average.
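The cross-validated grid search over k can be reproduced in a few lines of scikit-learn (again on synthetic stand-in data; the grid of k values here is illustrative):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(1)
X = rng.uniform(size=(400, 4))  # synthetic stand-in fingerprints
y = np.sin(X).sum(axis=1)       # synthetic target

# 10-fold cross-validation over a simple grid of k values;
# the mean CV score estimates the generalization error
search = GridSearchCV(
    KNeighborsRegressor(weights="distance"),
    param_grid={"n_neighbors": [1, 3, 5, 10, 20]},
    cv=10, scoring="r2")
search.fit(X, y)

print("best k:", search.best_params_["n_neighbors"])
print("mean CV R^2:", round(search.best_score_, 3))
```

`GridSearchCV` handles the fold splitting, repeated fitting, and score averaging internally, so the procedure described in the text reduces to one `fit` call.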

As shown in Fig. 4, for all twelve cases studied here, we found that the k-nearest neighbor models lead to excellent force prediction, with root mean square (RMS) errors on the order of 0.0001 eV Å^{−1} and mean absolute errors (MAE) on the order of 0.01 eV Å^{−1}. Overall, the AGNI fingerprints show superior performance compared to both the BP and Zernike fingerprints. The least accurate model among the twelve cases is the BP-based aluminum model, which showed an R^{2} score of 0.88 and an RMS error of 0.0016 eV Å^{−1}. We have verified (not shown here) that this relatively low performance is due to the training size of only 20% of the dataset, and we attribute this poor performance for a small training set to the inability of BP fingerprints to properly capture subtle differences in Al configurations. Better performance could be achieved if the BP parameters were tuned to allow for higher spatial resolution of the BP fingerprints. For comparison, the best RMS error achieved in the original study by Huan et al.^{48} was 0.016 eV Å^{−1}, one order of magnitude higher than the worst RMS error achieved in this study. However, Huan et al.'s study included other comprehensive and rigorous quality metrics (in addition to the RMS error) which are not used in this comparison with our method.

Beyond illustrating how the capabilities afforded by SEING can be leveraged to quickly build machine learning-based predictive models, this study also shows that k-NN is a promising alternative to neural networks,^{18,74} Gaussian processes,^{28,75} kernel ridge regression,^{57,76} and support vector machine regression^{77} models, which are commonly used by the community.^{10} Advantages offered by k-NN include faster training, easier implementation and the ability to model highly unusual functions with no assumptions regarding their form. Moreover, the neighbor distance can be used as a proxy for the quality of the prediction. Although this latter advantage is not exclusive to k-NN, it is particularly natural here because the entire algorithm is already based on neighbor distances. In practice, this means that if the nearest neighbors are too far from the test point, one can reasonably have less confidence in the prediction and use that as an opportunity to populate that region with more training points. A similar idea was previously suggested by Janet et al.^{69} The k-NN algorithm also allows the user to forgo clustering and sampling strategies such as the ones used in the development of the AGNI method,^{48} whilst still achieving excellent predictive capabilities. The suitability of k-NN based force models for conducting molecular dynamics simulations has not been explored here and remains an open question. Also, since the k-NN method is, by design, a local interpolation in the region of the test point, it is not expected to generalize to highly dissimilar systems. However, extended extrapolation capabilities are usually not expected for non-physics-based methods such as machine learning-based force fields.
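The neighbor-distance confidence proxy can be sketched as follows: a query far from the training data yields large neighbor distances, flagging a low-confidence prediction and a region that could be populated with more training points (the data and query points here are illustrative):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(2)
X = rng.uniform(size=(500, 4))  # synthetic stand-in fingerprints
y = np.sin(X).sum(axis=1)       # synthetic target

knn = KNeighborsRegressor(n_neighbors=3, weights="distance").fit(X, y)

# one query inside the training domain, one far outside it
inside = np.full((1, 4), 0.5)
outside = np.full((1, 4), 5.0)

d_in, _ = knn.kneighbors(inside)
d_out, _ = knn.kneighbors(outside)

# a large mean neighbor distance flags a low-confidence prediction
assert d_out.mean() > d_in.mean()
```

Because `kneighbors` is already computed during prediction, this confidence estimate comes essentially for free, in contrast with, e.g., ensembling strategies for neural networks.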

- I. Kononenko, Artif. Intell. Med., 2001, 23, 89–109 CrossRef CAS PubMed .
- K. K. Wong, L. Wang and D. Wang, Comput. Med. Imaging Graph., 2017, 57, 1–3 CrossRef PubMed .
- K. Kourou, T. P. Exarchos, K. P. Exarchos, M. V. Karamouzis and D. I. Fotiadis, Comput. Struct. Biotechnol. J., 2015, 13, 8–17 CrossRef CAS PubMed .
- A. Graves, A. R. Mohamed and G. Hinton, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, 2013, pp. 6645–6649 Search PubMed .
- L. Deng, G. Hinton and B. Kingsbury, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, 2013, pp. 8599–8603 Search PubMed .
- A. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, A. Coates and A. Y. Ng, ArXiv e-prints, 2014 Search PubMed .
- D. Yi, Z. Lei, S. Liao and S. Z. Li, ArXiv e-prints, 2014 Search PubMed .
- S.-J. Wang, H.-L. Chen, W.-J. Yan, Y.-H. Chen and X. Fu, Neural Process Lett., 2014, 39, 25–43 CrossRef .
- S. E. Thompson, F. Mullally, J. Coughlin, J. L. Christiansen, C. E. Henze, M. R. Haas and C. J. Burke, Astrophys. J., 2015, 812, 46 CrossRef .
- F. A. Faber, L. Hutchison, B. Huang, J. Gilmer, S. S. Schoenholz, G. E. Dahl, O. Vinyals, S. Kearnes, P. F. Riley and O. A. von Lilienfeld, J. Chem. Theory Comput., 2017, 13, 5255–5264 CrossRef CAS PubMed .
- M. Fernandez, N. R. Trefiak and T. K. Woo, J. Phys. Chem. C, 2013, 117, 14095–14105 CAS .
- W. W. Tipton and R. G. Hennig, J. Phys.: Condens. Matter, 2013, 25, 495401 CrossRef PubMed .
- S. Curtarolo, W. Setyawan, S. Wang, J. Xue, K. Yang, R. H. Taylor, L. J. Nelson, G. L. Hart, S. Sanvito, M. Buongiorno-Nardelli, N. Mingo and O. Levy, Comput. Mater. Sci., 2012, 58, 227–235 CrossRef CAS .
- B. Meredig, A. Thompson, J. W. Doak, M. Aykol, S. Rühl and C. Wolverton, npj Comput. Mater., 2015 Search PubMed .
- G. Hautier, C. C. Fischer, A. Jain, T. Mueller and G. Ceder, Chem. Mater., 2010, 22, 3762–3767 CrossRef CAS .
- J. Behler, J. Chem. Phys., 2011, 134, 074106 CrossRef PubMed .
- J. Behler, Phys. Chem. Chem. Phys., 2011, 13, 17930–17955 RSC .
- J. Behler, Angew. Chem., Int. Ed., 2017, 56, 12828–12840 CrossRef CAS PubMed .
- C. M. Handley and P. L. A. Popelier, J. Phys. Chem. A, 2010, 114, 3371–3383 CrossRef CAS PubMed .
- A. R. Oganov and C. W. Glass, J. Chem. Phys., 2006, 124, 244704 CrossRef PubMed .
- G. Pilania, A. Mannodi-Kanakkithodi, B. P. Uberuaga, R. Ramprasad, J. E. Gubernatis and T. Lookman, Sci. Rep., 2016, 6, 19375 CrossRef CAS PubMed .
- L. Ward and C. Wolverton, Curr. Opin. Solid State Mater. Sci., 2017, 21, 167–176 CrossRef CAS .
- A. Khorshidi and A. A. Peterson, Comput. Phys. Commun., 2016, 207, 310–324 CrossRef CAS .
- H. Gassner, M. Probst, A. Lauenstein and K. Hermansson, J. Phys. Chem. A, 1998, 102, 4596–4605 CrossRef CAS .
- M. Rupp, A. Tkatchenko, K.-R. Müller and O. A. von Lilienfeld, Phys. Rev. Lett., 2012, 108, 058301 CrossRef PubMed .
- K. Hansen, F. Biegler, R. Ramakrishnan, W. Pronobis, O. A. von Lilienfeld, K.-R. Müller and A. Tkatchenko, J. Phys. Chem. Lett., 2015, 6, 2326–2331 CrossRef CAS PubMed .
- J. Behler and M. Parrinello, Phys. Rev. Lett., 2007, 98, 146401 CrossRef PubMed .
- A. P. Bartók, M. C. Payne, R. Kondor and G. Csányi, Phys. Rev. Lett., 2010, 104, 136403 CrossRef PubMed .
- N. Artrith, A. Urban and G. Ceder, Phys. Rev. B, 2017, 96, 014112 CrossRef .
- L. M. Ghiringhelli, J. Vybiral, S. V. Levchenko, C. Draxl and M. Scheffler, Phys. Rev. Lett., 2015, 114, 105503.
- K. Yao, J. E. Herr and J. Parkhill, J. Chem. Phys., 2017, 146, 014106.
- K. Yao, J. E. Herr, S. N. Brown and J. Parkhill, J. Phys. Chem. Lett., 2017, 8, 2689–2694.
- A. Sadeghi, S. A. Ghasemi, B. Schaefer, S. Mohr, M. A. Lill and S. Goedecker, J. Chem. Phys., 2013, 139, 184118.
- F. Pietrucci and W. Andreoni, Phys. Rev. Lett., 2011, 107, 085504.
- L. Zhu, M. Amsler, T. Fuhrer, B. Schaefer, S. Faraji, S. Rostami, S. A. Ghasemi, A. Sadeghi, M. Grauzinyte, C. Wolverton and S. Goedecker, J. Chem. Phys., 2016, 144, 034203.
- M. Abadi, et al., TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems, https://tensorflow.org, 2015.
- L. Buitinck, G. Louppe, M. Blondel, F. Pedregosa, A. Mueller, O. Grisel, V. Niculae, P. Prettenhofer, A. Gramfort, J. Grobler, R. Layton, J. VanderPlas, A. Joly, B. Holt and G. Varoquaux, ECML PKDD Workshop: Languages for Data Mining and Machine Learning, 2013, pp. 108–122.
- F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot and E. Duchesnay, J. Mach. Learn. Res., 2011, 12, 2825–2830.
- F. Chollet, et al., Keras, https://github.com/fchollet/keras, 2015.
- D. Filimonov, V. Poroikov, Y. Borodina and T. Gloriozova, J. Chem. Inf. Comput. Sci., 1999, 39, 666–670.
- M. Sastry, J. F. Lowrie, S. L. Dixon and W. Sherman, J. Chem. Inf. Model., 2010, 50, 771–784.
- P. Willett, Drug Discovery Today, 2006, 11, 1046–1053.
- C. Steinbeck, Y. Han, S. Kuhn, O. Horlacher, E. Luttmann and E. Willighagen, J. Chem. Inf. Comput. Sci., 2003, 43, 493–500.
- G. Landrum, RDKit: Open-source cheminformatics, http://rdkit.org.
- C. W. Yap, J. Comput. Chem., 2011, 32, 1466–1474.
- J. Behler, J. Chem. Phys., 2016, 145, 170901.
- V. Botu and R. Ramprasad, Phys. Rev. B: Condens. Matter Mater. Phys., 2015, 92, 094306.
- T. D. Huan, R. Batra, J. Chapman, S. Krishnan, L. Chen and R. Ramprasad, npj Comput. Mater., 2017, 3, 89–109.
- K. V. J. Jose, N. Artrith and J. Behler, J. Chem. Phys., 2012, 136, 194111.
- M. Gastegger, L. Schwiedrzik, M. Bittermann, F. Berzsenyi and P. Marquetand, arXiv e-prints, 2017.
- A. P. Bartók, R. Kondor and G. Csányi, Phys. Rev. B: Condens. Matter Mater. Phys., 2013, 87, 184115.
- G. Ferré, T. Haut and K. Barros, J. Chem. Phys., 2017, 146, 114107.
- D. Rogers and M. Hahn, J. Chem. Inf. Model., 2010, 50, 742–754.
- K. T. Schütt, H. Glawe, F. Brockherde, A. Sanna, K. R. Müller and E. K. U. Gross, Phys. Rev. B: Condens. Matter Mater. Phys., 2014, 89, 205118.
- O. A. von Lilienfeld, R. Ramakrishnan, M. Rupp and A. Knoll, Int. J. Quantum Chem., 2015, 115, 1084–1093.
- B. Huang and O. A. von Lilienfeld, J. Chem. Phys., 2016, 145, 161102.
- H. Huo and M. Rupp, arXiv e-prints, 2017.
- S. Kearnes, K. McCloskey, M. Berndl, V. Pande and P. Riley, J. Comput.-Aided Mol. Des., 2016, 30, 595–608.
- O. Isayev, C. Oses, C. Toher, E. Gossett, S. Curtarolo and A. Tropsha, Nat. Commun., 2017, 8, 15679.
- S. M. Kandathil, T. L. Fletcher, Y. Yuan, J. Knowles and P. L. A. Popelier, J. Comput. Chem., 2013, 34, 1850–1861.
- T. D. Huan, A. Mannodi-Kanakkithodi and R. Ramprasad, Phys. Rev. B: Condens. Matter Mater. Phys., 2015, 92, 014106.
- J. P. Janet and H. J. Kulik, J. Phys. Chem. A, 2017, 121, 8939–8954.
- C. R. Collins, G. J. Gordon, O. A. von Lilienfeld and D. J. Yaron, arXiv, 2016, https://arxiv.org/abs/1701.06649.
- B. Jiang and H. Guo, J. Chem. Phys., 2013, 139, 054112.
- J. Li, B. Jiang and H. Guo, J. Chem. Phys., 2013, 139, 204103.
- R. Kondor, CoRR, 2007, abs/cs/0701127.
- A. Thompson, L. Swiler, C. Trott, S. Foiles and G. Tucker, J. Comput. Phys., 2015, 285, 316–330.
- M. Novotni and R. Klein, Comput. Aided Des., 2004, 36, 1047–1062.
- J. P. Janet and H. J. Kulik, Chem. Sci., 2017, 8, 5137–5152.
- N. Artrith and A. Urban, Comput. Mater. Sci., 2016, 114, 135–150.
- K. Yao, J. E. Herr, D. W. Toth, R. Mcintyre and J. Parkhill, arXiv e-prints, 2017.
- T. Cover and P. Hart, IEEE Trans. Inf. Theory, 1967, 13, 21–27.
- L. Devroye, L. Gyorfi, A. Krzyzak and G. Lugosi, Ann. Stat., 1994, 22, 1371–1385.
- T. B. Blank, S. D. Brown, A. W. Calhoun and D. J. Doren, J. Chem. Phys., 1995, 103, 4129–4137.
- A. P. Bartók and G. Csányi, Int. J. Quantum Chem., 2015, 115, 1051–1057.
- M. Rupp, Int. J. Quantum Chem., 2015, 115, 1058–1073.
- R. M. Balabin and E. I. Lomakina, Phys. Chem. Chem. Phys., 2011, 13, 11710–11718.

This journal is © The Royal Society of Chemistry 2018