Thermochemistry of gas-phase and surface species via LASSO-assisted subgraph selection
Graph theory-based regression techniques, such as group additivity, have been widely used for fast estimation of the thermochemistry of large molecules. The essence of these techniques lies in the graphs into which molecules are decomposed. These graphs are selected heuristically; as a result, they may not give optimal accuracy and are hard to choose for non-nearest-neighbor electronic effects such as ring strain, steric hindrance, and resonance. Here, we use the least absolute shrinkage and selection operator (LASSO), a regression method that performs feature selection, to select the optimal set of graph descriptors for predicting the standard enthalpy of formation, ΔfH°. We gather gas-phase hydrocarbon data from the NIST WebBook and Burcat's database. We find that models using graph descriptors selected by LASSO from the exhaustively enumerated graph descriptor space predict ΔfH° more accurately than traditional group additivity. We also compare our framework with state-of-the-art machine-learning models on the QM9 data set, where its mean absolute error of 1.39 kcal mol−1 is comparable to that of published machine-learning models. To cope with the computational cost of complete enumeration, we present (1) a semi-supervised LASSO learning method and (2) an adsorbate subgraph mining algorithm. The former prunes the graph descriptor space on the fly during the LASSO regression and is applied to a gas-phase hydrocarbon data set. The latter enumerates a truncated graph descriptor space from adsorbate graphs of surface science data. For lignin monomer adsorbates on Pt(111), considered here as an illustrative example, descriptors selected from the adsorbate subgraph space give a mean absolute error of 2.08 kcal mol−1 and a root mean square error of 3.03 kcal mol−1. Finally, we discuss a simple method for identifying outliers in descriptor space that lead to large model errors, so that accuracy can be improved by adding suitable data.
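To make the role of LASSO concrete, the following minimal sketch shows how an L1 penalty can pick a small set of subgraph descriptors out of a larger candidate set. It is not the implementation used in this work: the subgraph labels, the occurrence counts, and the contribution values are invented for illustration, the solver is a plain cyclic coordinate descent, and the objective is assumed to be (1/(2n))·||y − Xw||² + α·||w||₁ without an intercept.

```python
# Toy LASSO descriptor selection (illustrative assumptions throughout).

def soft_threshold(z, gamma):
    """Shrink z toward zero by gamma; values within [-gamma, gamma] become exactly 0."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0


def lasso_coordinate_descent(X, y, alpha, n_sweeps=500):
    """Minimise (1/(2n)) * ||y - Xw||^2 + alpha * ||w||_1 by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    residual = list(y)  # equals y - Xw for the initial w = 0
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_sweeps):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # descriptor never occurs; its coefficient stays 0
            # Correlation of column j with the residual that excludes descriptor j.
            rho = sum(X[i][j] * (residual[i] + X[i][j] * w[j]) for i in range(n)) / n
            w_new = soft_threshold(rho, alpha) / col_sq[j]
            delta = w_new - w[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                w[j] = w_new
    return w


# Hypothetical candidate subgraphs (labels are placeholders, not real descriptors).
names = ["C-C", "C=C", "subgraph_A", "subgraph_B"]

# Rows: molecules; columns: how often each candidate subgraph occurs (made-up counts).
X = [
    [1, 0, 1, 0],
    [2, 1, 0, 0],
    [3, 0, 0, 1],
    [0, 1, 0, 1],
    [1, 2, 0, 0],
    [2, 0, 1, 0],
]

# Made-up contributions: only the first two subgraphs matter in this toy model.
w_true = [-5.0, 12.0, 0.0, 0.0]
y = [sum(c * x for c, x in zip(w_true, row)) for row in X]

w = lasso_coordinate_descent(X, y, alpha=0.1)
selected = [name for name, coef in zip(names, w) if coef != 0.0]

print("coefficients:", [round(c, 3) for c in w])
print("selected descriptors:", selected)
```

In this toy setting the penalty drives the coefficients of the two irrelevant subgraphs to exactly zero, while the retained coefficients are shrunk slightly toward zero relative to the values used to generate the data. Larger values of `alpha` would discard more descriptors at the cost of a larger fitting error.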