‘Diet GMTKN55’ offers accelerated benchmarking through a representative subset approach

Tim Gould
Qld Micro- and Nanotechnology Centre, Griffith University, Nathan, Qld 4111, Australia. E-mail: t.gould@griffith.edu.au

Received 1st September 2018, Accepted 24th October 2018

First published on 24th October 2018

The GMTKN55 benchmarking protocol introduced by [Goerigk et al., Phys. Chem. Chem. Phys., 2017, 19, 32184] allows comprehensive analysis and ranking of density functional approximations with diverse chemical behaviours. But this comprehensiveness comes at a cost: GMTKN55's 1500 benchmarking values require energies for around 2500 systems to be calculated, making it a costly exercise. This manuscript introduces three subsets of GMTKN55, consisting of 30, 100 and 150 systems, as ‘diet’ substitutes for the full database. The subsets are chosen via a stochastic genetic approach, and consequently can reproduce key results of the full GMTKN55 database, including ranking of approximations. Some results are also included for the recent MGCDB84 database.

Density functional theory (DFT) has established itself as a dominant approach for computational modelling of chemical systems. But this success has come at a cost – there is now a “zoo” of hundreds, if not thousands, of approximations to choose from (see, e.g., ref. 1–4). Selecting a functional (here called a density functional approximation, DFA) has thus become a highly non-trivial task. This task is made harder by selection bias in scientific publishing: it is rare to find a new approach that doesn't claim to solve some important problem or other, whereas there is a delay until problems introduced by new approaches are publicised. Thus, there is significant trial and error in determining if a newly introduced (or indeed old) DFA is likely to offer systematic improvements over existing methods.

Benchmarking studies, which rank DFAs on their ability to reproduce key chemical physics calculated at a higher level of theory, have thus become an integral part of method selection. The recently introduced “general main group thermochemistry, kinetics, and non-covalent interactions” (GMTKN55) benchmark database (ref. 5) sets an unsurpassed level of rigour in assessing the overall quality of DFAs, and must be considered the state of the art for benchmarking on chemical systems. The reactions, atomisation energies, and other energy differences included in the database incorporate an extensive range of important chemical processes. The ranking systems introduced in that work critically evaluate DFAs on a comprehensive range of tests using a set of 1500 energy differences. Approaches which perform well on these tests should thus be expected to perform well in main group chemistry generally.

But testing a given method against GMTKN55 requires almost 2500 single-point calculations. While this number of calculations is quite feasible in modern, well-developed quantum chemistry codes, it is more difficult to achieve using the planewave codes typically used in materials science problems. Furthermore, methods that are highly innovative are often implemented inefficiently at first, with optimisation coming only after initial successes (e.g., the work of Román-Pérez and Soler (ref. 6) for vdW-DFs (ref. 7)). Comprehensive testing across GMTKN55 may thus be infeasible or impossible at precisely the stage when it is most needed – when methods are being designed.

There is therefore a clear need for smaller databases that can be used to accurately assess methods for which comprehensive testing is infeasible. Ideally, such a database should not sacrifice rigour, and must capture the key features of the existing state of the art. Along these lines, very recent work by Chan (ref. 8) introduced the MG8 database, composed of 64 reactions that could, after some linear fitting, reproduce key properties of the 5000 element MGCDB82 database (ref. 9). Earlier work along these lines has also been done on smaller benchmark sets (ref. 10 and 11).

This work establishes that a key ranking system of GMTKN55, called WTMAD-2, can be accurately reproduced using just 100 systems selected from the full database without any fitting. The number of selected systems can be arbitrarily varied, which lets us produce a 150 system set with slightly better accuracy than the 100 element set, and a 30 system ‘starvation’ set with degraded results.

The rest of the paper is arranged as follows: first, the process used to find the selected systems is described. Then, some results are presented showing how well the smaller sets reproduce results from the full database. The list of 100 systems is then presented and discussed, with the lists for 30 and 150 systems included in the ESI. Finally, the paper concludes by discussing an important feature of the reduced database – namely, its potential use in a physically comprehensive benchmark database that includes systems in the solid phase.

Before proceeding, we must first note that GMTKN55 is not the only very large benchmarking database available. Mardirossian and Head-Gordon recently introduced the larger MGCDB84 database with almost 5000 data points (ref. 9). However, unlike the authors of GMTKN55, Mardirossian and Head-Gordon do not propose a clear ranking metric that assesses an approximation on its performance across the whole database, other than an unweighted mean absolute error. Thus, for present purposes of ranking, the WTMAD-2 metric of GMTKN55 is taken as the state of the art.

This work draws from the full GMTKN55 database provided on its official website (ref. 12). This data was “scraped” (i.e., the website was algorithmically navigated to obtain data), and all errors were then stored as ΔE_dI in a table indexed by DFA d and system I. No new calculations were carried out. Interestingly, this process returned just 1499 systems, not the 1505 identified in the original paper. However, WTMAD-2 results calculated with the scraped data were within 2.1% of those reported in the GMTKN55 database, with most well inside 1%, making the scraped data more than fit for purpose. The one exception was the results for BLYP (and its dispersion-corrected counterparts), which were consequently discarded, leaving 213 DFAs rather than 216.

The mean absolute deviation (MAD) for each DFA d was then found using the WTMAD-2 scheme (see Section 4 of ref. 5 for details), giving

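$$\mathrm{MAD}(d) = \frac{1}{N} \sum_{I=1}^{N} W_I \, |\Delta E_{dI}| \qquad (1)$$

Here N is the total number of systems in the database. The weights W_I depend only on the benchmark set B(I) containing I, and take the form

$$W_I = \frac{56.84\ \mathrm{kcal\,mol^{-1}}}{\frac{1}{N_{B(I)}} \sum_{J \in B(I)} |\Delta E_J|} \qquad (2)$$

where N_B(I) is the number of elements in B(I), ΔE_J is the reference energy for each system J in B(I), and the numerator is the scaling constant of the WTMAD-2 scheme (ref. 5). These weights normalise the deviations to ensure that sets with large energies, where moderate errors are not so important, do not dominate over sets with small energies, where moderate errors can be critical.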
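
The weighting can be illustrated with a short Python sketch. It assumes the WTMAD-2 form described in ref. 5, in which each set is weighted by a constant (56.84 kcal/mol) divided by the mean absolute reference energy of that set. The function names, the dictionary layout and the toy numbers in the usage note are illustrative assumptions, not part of the original work.

```python
from collections import defaultdict

# Scaling constant of the WTMAD-2 scheme (ref. 5), in kcal/mol.
WTMAD2_SCALE = 56.84


def set_weights(ref_energies):
    """Return one weight per benchmark set.

    ref_energies maps (set_name, system_id) -> reference energy (kcal/mol).
    The layout is a hypothetical choice made for this sketch.
    """
    by_set = defaultdict(list)
    for (set_name, _system_id), energy in ref_energies.items():
        by_set[set_name].append(abs(energy))
    # Weight = constant / mean absolute reference energy of the set.
    return {name: WTMAD2_SCALE / (sum(vals) / len(vals))
            for name, vals in by_set.items()}


def weighted_mad(errors, weights):
    """Weighted MAD for one DFA.

    errors maps (set_name, system_id) -> error of the DFA for that system.
    Every system carries the weight of the set that contains it.
    """
    total = sum(weights[set_name] * abs(err)
                for (set_name, _system_id), err in errors.items())
    return total / len(errors)
```

For example, with two toy sets whose mean absolute reference energies are 20 and 2 kcal/mol, `set_weights` returns weights of about 2.84 and 28.42, so an error of given size counts ten times more in the small-energy set.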

This data then had to be condensed to a representative set of Ns systems, where Ns is the target subset size. This process involved two optimisation phases, both loosely based on evolution:

(1) First, a primeval soup was introduced, in which “species” S = {I_1, I_2, …, I_Ns}, each composed of Ns systems, were selected at random. The mean absolute deviation (MAD) was then calculated for each species, and an error

$$\mathrm{Err}(S) = \frac{1}{N_{\mathrm{DFA}}} \sum_{d} \left| \mathrm{MAD}_S(d) - \mathrm{MAD}(d) \right| \qquad (3)$$

was assigned to it. Here $\mathrm{MAD}_S(d) = \frac{1}{N_s} \sum_{I \in S} W_I \, |\Delta E_{dI}|$ is the MAD for the subset and N_DFA is the number of DFAs. The full set MAD(d) is defined in (1), with W_I defined in (2). The MAD of MADs is thus taken over all DFAs d, to obtain a single number.

If S was lower in error than the previous lowest-error species, it was allowed to live, and joined a set L = {S_1, S_2, …} of live contenders, where Err(S_i+1) < Err(S_i) by construction. Typically 50 000 random selections yielded about 10–20 elements in L.

(2) In the next phase, breeding and survival of the fittest were introduced. Two elements, P1 and P2, of the live set L were selected at random to serve as parents. These produced a child, C, composed of half the elements I of P1, selected at random, and half the elements of P2, ensuring no duplication. This process is analogous to sharing genes. Errors [using (3)] were then calculated for both parents and the child.

In the event that the child had a lower error than at least one parent, the parent with the higher error was discarded and replaced by the child, e.g., L = {…P1,…P2,…} → {…C,…P2,…} when P1 is the worse parent and Err(C) < Err(P1). Otherwise the child died and L was left unchanged. This cycle was repeated over 50 000 generations, allowing the population to evolve to lower errors. Finally, the species with the highest fitness (lowest overall error) was selected.
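
The two phases can be sketched in Python as follows. This is a minimal illustration of the procedure as described in the text, not the author's code. The function names, the `wabs` table of pre-weighted absolute errors (one row per DFA, one column per system) and the handling of odd Ns and of overlapping parents are assumptions made for the sketch.

```python
import random


def subset_error(subset, wabs, full_mad):
    """Mean over DFAs of |MAD_S(d) - MAD(d)| for one species.

    wabs[d][i] holds W_I * |dE_dI|; full_mad[d] is the full-set MAD.
    """
    n_s = len(subset)
    total = 0.0
    for row, target in zip(wabs, full_mad):
        total += abs(sum(row[i] for i in subset) / n_s - target)
    return total / len(wabs)


def primeval_soup(n_systems, n_s, wabs, full_mad, trials, rng):
    """Phase 1: keep every random species that beats the best so far."""
    live, best = [], float("inf")
    for _ in range(trials):
        species = frozenset(rng.sample(range(n_systems), n_s))
        err = subset_error(species, wabs, full_mad)
        if err < best:
            live.append(species)
            best = err
    return live


def breed(p1, p2, n_s, rng):
    """Child: half of P1 plus systems of P2 not already chosen."""
    half = n_s // 2
    child = set(rng.sample(sorted(p1), half))
    pool = [i for i in sorted(p2) if i not in child]
    child.update(rng.sample(pool, n_s - half))
    return frozenset(child)


def evolve(live, n_s, wabs, full_mad, generations, rng):
    """Phase 2: a child replaces the worse parent if it has lower error."""
    live = list(live)
    for _ in range(generations):
        if len(live) < 2:
            break
        a, b = rng.sample(range(len(live)), 2)
        child = breed(live[a], live[b], n_s, rng)
        errs = {k: subset_error(live[k], wabs, full_mad) for k in (a, b)}
        worst = max(errs, key=errs.get)
        if subset_error(child, wabs, full_mad) < errs[worst]:
            live[worst] = child
    # The fittest species is the one with the lowest error.
    return min(live, key=lambda s: subset_error(s, wabs, full_mad))
```

A run would call `primeval_soup` and then pass its result to `evolve`, using a seeded `random.Random` for reproducibility. Because a child only ever replaces a parent with a higher error, the lowest error in the population cannot increase from one generation to the next.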

The optimisation was thus able to evolve to a subset of Ns systems (specifically, the fittest species found above) that could reproduce MAD(d) with good accuracy across all DFAs d. Calculations take a matter of minutes on a laptop, and the best runs out of multiple trials were taken as the final lists. The errors of the sets reported here, with Ns = 30, 100 and 150, are Err(S30) = 0.362, Err(S100) = 0.165 and Err(S150) = 0.126, respectively, or 3.9%, 1.8% and 1.4% of the MAD averaged over all DFAs.

Fig. 1 shows WTMAD-2 errors for the full database and the various subsets found via the evolutionary algorithm. As in the original GMTKN55 paper, we separate the approaches into: (i) double-hybrids involving mixtures of Hartree–Fock (HF), post-HF correlation and density functional (DF) terms; (ii) hybrids involving HF and DF terms; (iii) meta-GGAs which involve densities, gradients and kinetic energy density terms; and (iv) generalised gradient approximations (GGAs) involving the density and its gradient only. This division reflects the broad families of low-cost modern quantum chemistry approaches. In terms of Perdew's Jacob's Ladder classification (ref. 56), these correspond to rungs five, four, three and two, respectively. The order used here is determined by the highest accuracy achieved within the given category.

Fig. 1 Comparison of WTMAD-2 errors for the Ns = 30, 100 and 150 subsets (top to bottom) with those for the full database. Here, results are divided into four categories: double-hybrid DFAs (reds), hybrid DFAs (greens), meta-GGAs (earth tones) and GGAs (blues). The paler colour in each category represents MADs for the subset, and the darker colour is for the full database. The blended colour represents overlap. Only the ten best methods are shown in each category.

The high quality of the subsets is evident in the plot, which shows the subset database and full database giving very similar values and rankings across all types of methods. It is clear that rankings from the reduced subsets can be used as a substitute for the full set, albeit with some errors. The small 30 element ‘starvation’ set is quantitatively worse than the others, but nonetheless manages to reproduce key trends. Additional plots showing the worst cases are included in the ESI, and are of similar quality.

Kendall rank correlation coefficients (ref. 57)

$$\tau = \frac{n_c - n_d}{n(n-1)/2} \qquad (4)$$

where n_c and n_d are the numbers of concordant and discordant pairs among the n DFAs being ranked, are reported in Table 1 to quantify agreement between the full database and the subsets. These highlight the general accuracy of the subsets as a ranking tool, as the worst case is 0.79 for Ns = 30, which still indicates a strong degree of correlation between the rankings from the subsets and the full database.

Table 1 Kendall rank correlation coefficients τ [eqn (4)] for each subset, separated by approach. Values close to one indicate good agreement between rankings from the full database and the chosen subset

Ns     Double-hybrid   Hybrid   Meta-GGA   GGA
30     0.883           0.854    0.789      0.901
100    0.860           0.933    0.918      0.948
150    0.930           0.938    0.906      0.967
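
For readers who wish to repeat this comparison on their own data, the rank correlation can be computed with a few lines of Python. The sketch below implements the simple pair-counting form (often called tau-a) and assumes no special treatment of ties; the function name and argument layout are illustrative. Library routines such as `scipy.stats.kendalltau` offer tie-corrected variants.

```python
from itertools import combinations


def kendall_tau(full_values, subset_values):
    """Kendall rank correlation between two lists of per-DFA errors.

    A pair of DFAs is concordant when both lists order it the same way
    and discordant when they order it oppositely. Tied pairs add zero.
    """
    n = len(full_values)
    score = 0
    for i, j in combinations(range(n), 2):
        s = (full_values[i] - full_values[j]) * (subset_values[i] - subset_values[j])
        score += (s > 0) - (s < 0)
    return score / (n * (n - 1) / 2)
```

Here `full_values` would hold the full-database MAD of each DFA in a category and `subset_values` the corresponding subset MADs, in the same DFA order.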

Finally, Table 2 shows the full set of systems used in the Ns = 100 data set. The systems included come from 41 benchmark sets, 75% of the full list of 55, and involve between one and nine elements from each set. The high fraction of sets that are sampled is interesting, and suggests that the evolutionary algorithm finds key systems that exemplify the different chemical physics scrutinised in the various benchmark sets that form the full GMTKN55 database. The fraction of benchmark sets sampled for Ns = 30 is 20%, while 89% of sets are sampled for Ns = 150. This result may indicate a direction to explore for future improvements in benchmarking, as it suggests that selected case studies can yield almost as much information as results for the full set, at least when combined with other tests. Furthermore, some benchmark sets are never sampled in the reduced subsets, suggesting they might duplicate information from other sets.

Table 2 Full list of benchmark sets and specific elements used in the Ns = 100 ‘diet GMTKN55’ ranking set. These structures are sufficient to reproduce the full WTMAD-2 ranking with good accuracy, using the weights W listed here. Results for Ns = 30 and 150 are presented in the ESI (note b)

Set          Ref.         W       IDs
ACONF        a            30.99   2, 5, 8
ADIM6        a            16.93   2
AHB21        13           2.53    1, 15
ALKBDE10     14           0.56    1
Amino20x4    a            23.31   23, 48, 56
BH76         15, 16       3.05    6, 18, 54, 74
BH76RC       a            2.66    16
BHPERI       17           2.72    1, 17, 23, 26
BHROT27      a            9.06    1, 10, 20
BSR36        18           3.51    4, 21
BUT14DIOL    19           20.30   28, 39, 63
CARBHB12     a            9.42    3, 4, 6, 11
CDIE20       20           14.02   2, 14
CHB6         13           2.12    1
DC13         a            1.03    6, 12
FH51         21           1.83    25
G21EA        22           1.69    5, 21, 25
G21IP        22, 23       0.22    28
G2RC         24           1.11    15
HAL59        a, 25, 26    12.38   1, 8, 21, 38, 43, 56
HEAVY28      27           45.79   5, 24
ICONF        a            17.40   16
IL16         13, 28       0.52    3
INV24        29           1.78    12, 19, 22
ISO34        30           3.90    7, 12
ISOL24       31           2.59    13
MB16-43      32           0.12    4, 31
MCONF        33           11.43   7, 22, 30, 42
PArel        a            12.28   3, 5, 10, 20
PCONF21      34           35.05   7, 8, 18
PNICO23      35           13.30   13, 17
RC21         36           1.59    4
RG18         a            98.00   1, 15
RSE43        37           7.48    28, 30, 43
S66          38           10.40   11, 17, 24, 31, 45
SCONF        39           12.36   10
SIE4x4       a            1.69    7, 10
UPU23        40           9.93    5
W4-11        41           0.19    12, 37, 39, 58, 85, 92, 104, 136, 138
WATER27      42           0.70    17, 20
YBDE18       43           1.15    1

a Contains systems from one of the GMTKN databases (ref. 5 and 44–46).
b These include additional benchmark sets (ref. 47–55).

Due to the reduced number of systems, an assessment using one of the subsets reported here would involve, at worst, a few hundred calculations (specifically, 82/240/336 calculations for 30/100/150 systems), a small fraction of the 2462 calculations required for the full GMTKN55 database. This is a substantial saving in terms of time, with little cost to accuracy if only rankings and overall quality estimates are required. The 30 element subset is probably too inaccurate to be used on its own for detailed assessment, but may be sufficient when used in combination with other tests or in parameter optimisation.
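
The number of calculations exceeds the number of benchmark values because each energy difference involves several species, some of which are shared between reactions. The Python sketch below shows one way such a count could be made, together with the cost fractions implied by the numbers quoted above. The function names and the toy reaction list are hypothetical; the counts 82, 240, 336 and 2462 are those given in the text.

```python
FULL_CALCULATIONS = 2462  # single-point calculations for full GMTKN55 (from the text)


def count_calculations(reactions):
    """Number of unique single-point calculations for a list of reactions.

    Each reaction is a list of (set_name, species_label) pairs; a species
    appearing in several reactions is only calculated once.
    """
    return len({species for reaction in reactions for species in reaction})


def fraction_of_full(n_calculations, full=FULL_CALCULATIONS):
    """Cost of a subset relative to the full database."""
    return n_calculations / full


# Toy example with made-up labels: two reactions sharing one monomer.
toy_reactions = [
    [("TOYSET", "dimer_1"), ("TOYSET", "monomer_A"), ("TOYSET", "monomer_B")],
    [("TOYSET", "dimer_2"), ("TOYSET", "monomer_A"), ("TOYSET", "monomer_C")],
]
print(count_calculations(toy_reactions))

# Cost fractions for the three subsets reported in the text.
for n_s, n_calc in {30: 82, 100: 240, 150: 336}.items():
    print(f"Ns = {n_s}: {n_calc} calculations, "
          f"{100 * fraction_of_full(n_calc):.1f}% of the full database")
```

On these figures the three subsets require roughly 3%, 10% and 14% of the calculations needed for the full database.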

To conclude, this work derives and presents ‘diet’ versions of the GMTKN55 database for use in future benchmarking studies. The reduced sets, involving 30, 100 and 150 energy differences, are able to reproduce not just quantitative errors, but also rankings, with a generally good degree of accuracy. They are thus able to serve as reliable proxies for the full GMTKN55 database, to be used when, e.g., calculations are very expensive. Density functional developers who cannot afford to test on the full GMTKN55 set are thus advised to use the reduced sets reported here, to establish a metric of relative success that is comparable to best practice. The same procedure can be applied to other databases, such as MGCDB84 (ref. 9). The ESI reports the equivalent to Fig. 1 for MGCDB84, calculated using an adapted WTMAD-2.

That said, it must be stressed that GMTKN55 has far greater analytic power than any subset thereof, and its use is certainly advised whenever possible. The reduced sets introduced here are not intended to replace comprehensive benchmarking, but to supplement it. One possible application would be to use a subset for pre-screening, e.g. while developing a new approach, followed by tests on GMTKN55 once best candidates have been established. Another possibility might be to use a subset for optimisation of empirical methods.

Another important benefit of the reduced sets, and in fact the initial driving motivation behind this work, is that they offer a route to a truly comprehensive benchmarking protocol for general chemical physics that covers materials science and solid state problems, in addition to the comprehensive gas phase chemistry results in GMTKN55. To understand why the ‘diet’ sets will help here, note that it is unlikely that we will have 1500 solid phase benchmark standard energy differences in the near future, due to: (a) the increased methodological challenges posed by such systems (e.g., many methods fail for small and zero gap systems), and (b) the significantly higher numerical cost of calculating such systems to benchmark accuracy. One hundred diverse systems (or perhaps just thirty, initially) is a much more feasible size for a solid state benchmark set. To avoid sample size bias, such a set must be paired with a comparably sized, but representative, sample of standard chemistry, such as the ones presented here. Work along these lines is being pursued.

Conflicts of interest

There are no conflicts to declare.


References

  1. K. Burke, J. Chem. Phys., 2012, 136, 150901.
  2. A. D. Becke, J. Chem. Phys., 2014, 140, 18A301.
  3. R. O. Jones, Rev. Mod. Phys., 2015, 87, 897.
  4. S. Grimme and P. R. Schreiner, Angew. Chem., Int. Ed., 2018, 57, 4170–4176.
  5. L. Goerigk, A. Hansen, C. Bauer, S. Ehrlich, A. Najibi and S. Grimme, Phys. Chem. Chem. Phys., 2017, 19, 32184–32215.
  6. G. Román-Pérez and J. M. Soler, Phys. Rev. Lett., 2009, 103, 096102.
  7. D. C. Langreth, M. Dion, H. Rydberg, E. Schröder, P. Hyldgaard and B. I. Lundqvist, Int. J. Quantum Chem., 2005, 101, 599–610.
  8. B. Chan, J. Chem. Theory Comput., 2018, 14, 4254–4262.
  9. N. Mardirossian and M. Head-Gordon, Mol. Phys., 2017, 115, 2315–2372.
  10. B. J. Lynch and D. G. Truhlar, J. Phys. Chem. A, 2003, 107, 8996–8999.
  11. R. Haunschild and W. Klopper, Theor. Chem. Acc., 2012, 131, 1112.
  12. GMTKN55 – A database for general main group thermochemistry, kinetics, and non-covalent interactions, 2018, https://www.chemie.uni-bonn.de/pctc/mulliken-center/software/GMTKN/gmtkn55.
  13. K. U. Lao, R. Schäffer, G. Jansen and J. M. Herbert, J. Chem. Theory Comput., 2015, 11, 2473–2486.
  14. H. Yu and D. G. Truhlar, J. Chem. Theory Comput., 2015, 11, 2968–2983.
  15. Y. Zhao, B. J. Lynch and D. G. Truhlar, Phys. Chem. Chem. Phys., 2005, 7, 43–52.
  16. Y. Zhao, N. González-García and D. G. Truhlar, J. Phys. Chem. A, 2005, 109, 2012–2018.
  17. A. Karton and L. Goerigk, J. Comput. Chem., 2015, 36, 622–632.
  18. S. N. Steinmann, G. Csonka and C. Corminboeuf, J. Chem. Theory Comput., 2009, 5, 2950–2958.
  19. S. Kozuch, S. M. Bachrach and J. M. Martin, J. Phys. Chem. A, 2014, 118, 293–303.
  20. L.-J. Yu and A. Karton, Chem. Phys., 2014, 441, 166–177.
  21. J. Friedrich and J. Hänchen, J. Chem. Theory Comput., 2013, 9, 5381–5394.
  22. L. A. Curtiss, K. Raghavachari, G. W. Trucks and J. A. Pople, J. Chem. Phys., 1991, 94, 7221–7230.
  23. L. Goerigk and S. Grimme, J. Chem. Theory Comput., 2010, 6, 107–126.
  24. L. A. Curtiss, K. Raghavachari, P. C. Redfern and J. A. Pople, J. Chem. Phys., 1997, 106, 1063–1079.
  25. S. Kozuch and J. M. L. Martin, J. Chem. Theory Comput., 2013, 9, 1918–1931.
  26. J. Rezac, K. E. Riley and P. Hobza, J. Chem. Theory Comput., 2012, 8, 4285–4292.
  27. S. Grimme, J. Antony, S. Ehrlich and H. Krieg, J. Chem. Phys., 2010, 132, 154104.
  28. S. Zahn, D. R. MacFarlane and E. I. Izgorodina, Phys. Chem. Chem. Phys., 2013, 15, 13664–13675.
  29. L. Goerigk and R. Sharma, Can. J. Chem., 2016, 94, 1133–1143.
  30. S. Grimme, M. Steinmetz and M. Korth, J. Org. Chem., 2007, 72, 2118–2126.
  31. R. Huenerbein, B. Schirmer, J. Moellmann and S. Grimme, Phys. Chem. Chem. Phys., 2010, 12, 6940–6948.
  32. M. Korth and S. Grimme, J. Chem. Theory Comput., 2009, 5, 993–1003.
  33. U. R. Fogueri, S. Kozuch, A. Karton and J. M. L. Martin, J. Phys. Chem. A, 2013, 117, 2269–2277.
  34. D. Rěha, H. Valdes, J. Vondrasek, P. Hobza, A. Abu-Riziq, B. Crews and M. S. de Vries, Chem. – Eur. J., 2005, 11, 6803–6817.
  35. D. Setiawan, E. Kraka and D. Cremer, J. Phys. Chem. A, 2015, 119, 1642–1656.
  36. S. Grimme, Angew. Chem., Int. Ed., 2013, 52, 6306–6312.
  37. F. Neese, T. Schwabe, S. Kossmann, B. Schirmer and S. Grimme, J. Chem. Theory Comput., 2009, 5, 3060–3073.
  38. J. Rězáč, K. E. Riley and P. Hobza, J. Chem. Theory Comput., 2011, 7, 2427–2438.
  39. G. I. Csonka, A. D. French, G. P. Johnson and C. A. Stortz, J. Chem. Theory Comput., 2009, 5, 679–692.
  40. H. Kruse, A. Mladek, K. Gkionis, A. Hansen, S. Grimme and J. Sponer, J. Chem. Theory Comput., 2015, 11, 4972–4991.
  41. A. Karton, S. Daon, J. M. L. Martin and B. Ruscic, Chem. Phys. Lett., 2011, 510, 165–178.
  42. V. S. Bryantsev, M. S. Diallo, A. C. T. van Duin and W. A. Goddard III, J. Chem. Theory Comput., 2009, 5, 1016–1026.
  43. Y. Zhao, H. T. Ng, R. Peverati and D. G. Truhlar, J. Chem. Theory Comput., 2012, 8, 2824–2834.
  44. L. Goerigk and S. Grimme, J. Chem. Theory Comput., 2010, 6, 107–126.
  45. L. Goerigk and S. Grimme, J. Chem. Theory Comput., 2011, 7, 291–309.
  46. L. Goerigk and S. Grimme, Phys. Chem. Chem. Phys., 2011, 13, 6670–6688.
  47. P. Jurečka, J. Sponer, J. Cerny and P. Hobza, Phys. Chem. Chem. Phys., 2006, 8, 1985–1993.
  48. L. Goerigk and S. Grimme, J. Chem. Theory Comput., 2011, 7, 291–309.
  49. S. Grimme, H. Kruse, L. Goerigk and G. Erker, Angew. Chem., Int. Ed., 2010, 49, 1402–1405.
  50. R. Sure, A. Hansen, P. Schwerdtfeger and S. Grimme, Phys. Chem. Chem. Phys., 2017, 19, 14296–14305.
  51. A. Karton, R. J. O'Reilly, B. Chan and L. Radom, J. Chem. Theory Comput., 2012, 8, 3128–3136.
  52. A. Karton, R. J. O'Reilly and L. Radom, J. Phys. Chem. A, 2012, 116, 4211–4221.
  53. T. Schwabe and S. Grimme, Phys. Chem. Chem. Phys., 2007, 9, 3397–3406.
  54. S. Grimme, Angew. Chem., Int. Ed., 2006, 45, 4460–4464.
  55. T. Takatani, E. G. Hohenstein, M. Malagoli, M. S. Marshall and C. D. Sherrill, J. Chem. Phys., 2010, 132, 144104.
  56. J. P. Perdew and K. Schmidt, AIP Conference Proceedings, 2001, pp. 1–20.
  57. M. G. Kendall, Biometrika, 1938, 30, 81–93.


Electronic supplementary information (ESI) available. See DOI: 10.1039/c8cp05554h

This journal is © the Owner Societies 2018