Issue 4, 2018

Thermochemistry of gas-phase and surface species via LASSO-assisted subgraph selection


Graph theory-based regression techniques, such as group additivity, have been widely implemented for fast estimation of the thermochemistry of large molecules. The essence of these techniques lies in the graphs into which molecules are decomposed. These graphs are selected based on heuristics; as a result, they may not give optimal accuracy, and they are hard to choose for non-nearest-neighbor effects such as ring strain, steric hindrance, and resonance structures. Here, we explore LASSO, a feature selection algorithm, to select the optimal set of graph descriptors for predicting the standard enthalpy of formation, ΔfH°. We gather hydrocarbon gas-phase data from the NIST WebBook and Burcat's database. We find that models using LASSO-selected graph descriptors from the exhaustively enumerated graph descriptor space predict ΔfH° more accurately than traditional group additivity. We compare our framework with state-of-the-art machine-learning models on the QM9 data set; its mean absolute error of 1.39 kcal mol−1 is comparable to that of published machine-learning models. To cope with the computational cost of complete enumeration, we present (1) a semi-supervised LASSO learning method and (2) an adsorbate subgraph mining algorithm. The former prunes the graph descriptor space on the fly during the LASSO regression and is applied to a gas-phase hydrocarbon data set. The latter enumerates a truncated graph descriptor space from adsorbate graphs of surface science data. For lignin monomer adsorbates on Pt(111), considered here as an illustrative example, descriptors selected from the adsorbate subgraph space give a mean absolute error of 2.08 kcal mol−1 and a root mean square error of 3.03 kcal mol−1. We also discuss a simple method for identifying outliers in descriptor space that cause large model errors, so that accuracy can be improved by adding suitable data.
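
The idea of LASSO-based descriptor selection can be illustrated with a minimal sketch. It is not the authors' implementation: the descriptor names, the count matrix, and the target values below are synthetic placeholders, and a plain cyclic coordinate-descent solver stands in for whatever LASSO solver the paper uses. Each row of `X` is assumed to hold the number of times each candidate subgraph occurs in one molecule, and the L1 penalty drives the weights of uninformative subgraphs to exactly zero.

```python
# Toy LASSO descriptor selection via cyclic coordinate descent (pure Python).
# All descriptor names and numbers here are synthetic, not data from the paper.

def soft_threshold(rho, alpha):
    """Shrink rho toward zero by alpha; return 0.0 inside [-alpha, alpha]."""
    if rho > alpha:
        return rho - alpha
    if rho < -alpha:
        return rho + alpha
    return 0.0

def lasso_coordinate_descent(X, y, alpha, n_sweeps=500):
    """Minimize (1/(2n)) * ||y - Xw||^2 + alpha * ||w||_1."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    residual = list(y)  # equals y - Xw because w starts at zero
    z = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_sweeps):
        for j in range(p):
            if z[j] == 0.0:
                continue  # descriptor never occurs; leave its weight at zero
            # Correlation of column j with the residual that excludes column j.
            rho = sum(X[i][j] * (residual[i] + X[i][j] * w[j]) for i in range(n)) / n
            new_w = soft_threshold(rho, alpha) / z[j]
            delta = new_w - w[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                w[j] = new_w
    return w

# Hypothetical subgraph descriptors and their counts in six molecules.
descriptor_names = ["subgraph_A", "subgraph_B", "subgraph_C", "subgraph_D"]
X = [
    [2, 0, 1, 0],
    [0, 3, 0, 1],
    [1, 0, 2, 0],
    [0, 1, 0, 2],
    [3, 1, 0, 0],
    [0, 0, 1, 3],
]

# Synthetic "true" contributions: only subgraph_A and subgraph_C matter.
synthetic_weights = [-5.0, 0.0, 2.0, 0.0]
y = [sum(x * w for x, w in zip(row, synthetic_weights)) for row in X]

weights = lasso_coordinate_descent(X, y, alpha=0.5)
selected = [name for name, w in zip(descriptor_names, weights) if w != 0.0]

print("selected descriptors:", selected)
print("weights:", [round(w, 3) for w in weights])
```

On this toy problem the two descriptors with zero synthetic contribution receive a weight of exactly zero and drop out, while the retained weights are shrunk slightly toward zero by the penalty. In a realistic setting the columns would come from enumerated subgraphs, and the penalty strength `alpha` would need to be tuned, for example by cross-validation.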


Publication details

The article was received on 18 Dec 2017, accepted on 13 Feb 2018 and first published on 13 Feb 2018

Article type: Paper
DOI: 10.1039/C7RE00210F
Citation: React. Chem. Eng., 2018,3, 454-466

    Thermochemistry of gas-phase and surface species via LASSO-assisted subgraph selection

    G. H. Gu, P. Plechac and D. G. Vlachos, React. Chem. Eng., 2018, 3, 454
    DOI: 10.1039/C7RE00210F
