Open Access Article. This article is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported Licence.

Jiawen Li,^{a} Jinzhe Zhang,^{a} Ryo Tamura*^{abcd} and Koji Tsuda*^{acd}
^{a}Graduate School of Frontier Sciences, The University of Tokyo, 5-1-5 Kashiwanoha, Kashiwa 2778561, Japan. E-mail: tsuda@k.u-tokyo.ac.jp
^{b}International Center for Materials Nanoarchitectonics (WPI-MANA), National Institute for Materials Science, 1-1 Namiki, Tsukuba, Ibaraki 305-0044, Japan. E-mail: tamura.ryo@nims.go.jp
^{c}Research and Services Division of Materials Data and Integrated System, National Institute for Materials Science, 1-1 Namiki, Tsukuba, Ibaraki 305-0044, Japan
^{d}RIKEN Center for Advanced Intelligence Project, 1-4-1 Nihonbashi, Chuo-ku, Tokyo 103-0027, Japan

Received 27th November 2021, Accepted 4th April 2022

First published on 4th April 2022

In automatic materials design, samples obtained from black-box optimization offer an attractive opportunity for scientists to gain new knowledge. Statistical analyses of the samples are often conducted, e.g., to discover key descriptors. Since most black-box optimization algorithms are biased samplers, post hoc analyses may result in misleading conclusions. To cope with this problem, we propose a new method called self-learning entropic population annealing (SLEPA), which combines entropic sampling with a surrogate machine learning model. SLEPA's samples come with weights that allow correct estimation of the joint distribution of the target property and a descriptor of interest. In short peptide design, SLEPA was compared with pure black-box optimization in estimating the residue distributions at multiple thresholds of the target property. While black-box optimization was better at the tail of the target property, SLEPA was better over a wide range of thresholds. Our result shows how to reconcile statistical consistency with efficient optimization in materials discovery.

Samples obtained from black-box optimization offer an attractive opportunity for scientists to gain new knowledge. Statistical analyses of the samples are often conducted to discover new structure–property relationships^{10} or key descriptors that explain a favorable property.^{11} However, such analyses must be done with caution, because most existing black-box optimizers are biased samplers. Fig. 1a and b illustrate this issue schematically. This paper aims to develop algorithmic methods that achieve efficient optimization and statistical consistency at the same time.

The density of states (DoS) refers to the probability distribution of observables of a physical system such as a spin glass.^{12} Observables may include energy, pressure and other physical parameters. If the DoS is obtained, the free energy of the target physical system can be evaluated and the system is regarded as understood. Exact estimation of the DoS requires enumeration of all possible physical configurations, which involves prohibitive computational cost. Estimation of the DoS is nevertheless possible with sampling methods including the Wang–Landau method^{13} and stochastic approximation Monte Carlo.^{14} These methods are known under the umbrella term entropic sampling. They have mechanisms to obtain samples from low-energy regions, which are hardly found by naive sampling procedures due to their small occurrence probability. In this paper, we use entropic sampling in the context of interpretable materials design. The target property and a descriptor are designated as the energy of the system and an additional observable, respectively, and their joint distribution is estimated as the DoS.

Among several methods of entropic sampling, we employ entropic population annealing (EPA)^{15} as our base method and extend it to self-learning EPA (SLEPA) by incorporating a machine learning model. In both methods, Boltzmann distributions at multiple temperatures are introduced, and samples are obtained from each distribution using Markov chain Monte Carlo (MCMC) (Fig. 1c). As the temperature is decreased, the samples gradually concentrate towards optimal solutions. Then, the DoS is estimated from the samples using the multiple histogram method (MHM).^{16} The difference between EPA and SLEPA is that EPA evaluates the energy (i.e., the target property) whenever a new sample is created, while SLEPA uses a value predicted by the surrogate machine learning model. SLEPA tends to be more cost effective, but the distribution shift caused by the prediction error has to be corrected eventually in DoS estimation.

In computational experiments, SLEPA and EPA were evaluated on a peptide design problem, where scientists are interested in finding sequential signatures that determine the properties.^{17} We therefore focus on estimating the residue distributions of qualified sequences, i.e., those within a small percentile of the target property. Multiple qualification thresholds ranging from the 0.01 to the 10 percentile were used. When an evolution strategy (ES),^{6} a pure black-box optimizer, was applied, its estimation accuracy was best at the tightest thresholds. Simple random sampling was best at the loosest thresholds. Over a wide range of thresholds, however, SLEPA and EPA outperformed both. SLEPA showed better estimation accuracy than EPA, indicating that the use of a surrogate model enhances sample efficiency. This result illustrates how to reconcile statistical consistency with efficient optimization and shows the usefulness of our approach when acquiring knowledge is more important than optimization.

〈O〉 = Σ_{x}O(d(x), e(x))P_{β}(x), | (1)

where the sum runs over all candidate materials x and P_{β}(x) is the Boltzmann distribution defined in eqn (4).

The observable is determined according to the user's interest. Suppose one is interested in the descriptor value that yields a favorable target property, e(x) ≤ δ, where δ is a qualification threshold. The observable should then be set as O(d(x), e(x)) = d(x)I(e(x) ≤ δ), where I(·) is the indicator function, which is one if the condition in the parentheses is satisfied and zero otherwise. Also, if one needs to draw a two-dimensional distribution as in Fig. 1, the density in the grid cell e_{l} ≤ e(x) ≤ e_{h}, d_{l} ≤ d(x) ≤ d_{h} can be obtained by setting O(d(x), e(x)) = I(d_{l} ≤ d(x) ≤ d_{h}, e_{l} ≤ e(x) ≤ e_{h}).
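As a concrete illustration, the thresholded observable above can be evaluated over a set of sampled sequences. The following is a minimal sketch; the function name and toy data are ours, not from the paper:

```python
import numpy as np

def thresholded_observable(d, e, delta):
    """O(d(x), e(x)) = d(x) * I(e(x) <= delta): the descriptor value
    is kept only for qualified samples (e <= delta), otherwise zero."""
    d = np.asarray(d, dtype=float)
    e = np.asarray(e, dtype=float)
    return d * (e <= delta)

# Toy data: descriptor values d(x) and target-property values e(x).
d = [0.2, 1.5, 0.7, 2.0, 0.9]
e = [3.0, 0.5, 1.2, 4.0, 0.8]

# Only samples with e <= 1.0 keep their descriptor value.
obs = thresholded_observable(d, e, delta=1.0)
```

Averaging `obs` under the appropriate weights then gives the expectation of the observable.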

To understand our method, it is helpful to reformulate 〈O〉 from another viewpoint, namely a histogram-based view. Let E and D denote the sets of all possible values of e(x) and d(x), respectively. The observable expectation of eqn (1) can be rewritten as

〈O〉 = Σ_{e∈E}Σ_{d∈D}O(d, e)n(d, e)exp(−βe + f), | (2)

n(d, e) = Σ_{x}I(d(x) = d, e(x) = e), | (3)

where n(d, e) is the joint density-of-states, counting the configurations with descriptor value d and energy e.

P_{β}(x) = exp(−βe(x) + f), | (4) |

(5) |

(6) |

q_{m} ∝ exp(−(β_{i+1} − β_{i})e(x_{m})). | (7) |

The particle set is updated by drawing M particles with replacement. As a result, a particle may vanish or multiply (see Fig. 2). Reweighting is similar to natural selection in evolutionary algorithms,^{7} because poor particles with high e are more likely to vanish. It is known that, when the temperature difference |β_{i+1} − β_{i}| is too large, resampling may lead to loss of diversity, i.e., all particles end up having the same ancestor.^{23}
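The reweighting step of eqn (7) can be sketched as follows. This is a minimal illustration assuming the energies of all particles are already available; function names and toy data are ours:

```python
import numpy as np

rng = np.random.default_rng(0)

def resample(particles, energies, beta_i, beta_next):
    """One reweighting step of population annealing: draw M particles
    with replacement, with probability q_m proportional to
    exp(-(beta_next - beta_i) * e(x_m)), as in eqn (7)."""
    e = np.asarray(energies, dtype=float)
    logq = -(beta_next - beta_i) * e
    logq -= logq.max()           # subtract max for numerical stability
    q = np.exp(logq)
    q /= q.sum()                 # normalize the weights
    idx = rng.choice(len(particles), size=len(particles), p=q)
    return [particles[i] for i in idx], e[idx]

# Toy example: 6 particles; low-energy particles are likely to multiply.
particles = list("abcdef")
energies = [5.0, 1.0, 3.0, 0.5, 4.0, 2.0]
new_particles, new_e = resample(particles, energies, beta_i=0.5, beta_next=1.0)
```

The log-domain normalization avoids overflow when the temperature step or the energy scale is large.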

n(e) = Σ_{i}r_{i}(e)(h_{i}(e)/N_{i})exp(β_{i}e − f_{i}), | (8)

where h_{i}(e) is the histogram of energies sampled at β_{i}, N_{i} is the number of samples at β_{i}, and the weights satisfy Σ_{i}r_{i}(e) = 1.

exp(−f_{i}) = Σ_{e∈E}n(e)exp(−β_{i}e). | (9)

The weights r_{i}(e) are determined to minimize the estimation error of n(e). MHM assumes that h_{i}(e) follows a Poisson distribution with mean N_{i}n(e)exp(−β_{i}e + f_{i}). Since the variance of a Poisson distribution is equal to its mean, the variance of h_{i}(e) is V[h_{i}(e)] = N_{i}n(e)exp(−β_{i}e + f_{i}). The variance of p_{i}(e) ≈ h_{i}(e)/N_{i} is therefore

V[p_{i}(e)] = n(e)exp(−β_{i}e + f_{i})/N_{i}. | (10)

Due to the statistical independence of samples at different temperatures, the estimation error of n(e), i.e., the variance of n(e), is obtained as

V[n(e)] = Σ_{i}r_{i}(e)^{2}exp(β_{i}e − f_{i})n(e)/N_{i}. | (11)

The optimization problem for the weights r_{i}(e) is then defined as

min_{r(e)} Σ_{i}r_{i}(e)^{2}exp(β_{i}e − f_{i})n(e)/N_{i} subject to Σ_{i}r_{i}(e) = 1. | (12)

Using a Lagrange multiplier, we obtain

r_{i}(e) = N_{i}exp(−β_{i}e + f_{i})/Σ_{j}N_{j}exp(−β_{j}e + f_{j}). | (13)

Substituting it into eqn (8), we obtain the optimized representation of n(e),

n(e) = Σ_{i}h_{i}(e)/Σ_{j}N_{j}exp(−β_{j}e + f_{j}). | (14)

The DoS n(e) and the free energies f_{i} are obtained by solving the nonlinear system consisting of eqn (14) and (9) by fixed-point iteration. Once f_{i} is obtained, r_{i}(e) and 〈O〉 can be estimated.
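The fixed-point iteration for eqn (14) and (9) can be sketched as follows. This is a schematic implementation under simplifying assumptions that are ours: energies are pre-binned to integer values 0, 1, 2, …, the iteration count is fixed, and convergence checks are omitted:

```python
import numpy as np

def mhm_dos(h, betas, n_iter=200):
    """Fixed-point iteration of the multiple histogram method.
    h[i, k]: histogram count of energy bin e_k at inverse temperature betas[i].
    Returns the unnormalized DoS n(e_k) and the free energies f_i."""
    h = np.asarray(h, dtype=float)
    betas = np.asarray(betas, dtype=float)
    n_temps, n_bins = h.shape
    e = np.arange(n_bins, dtype=float)   # energy value of each bin (toy choice)
    N = h.sum(axis=1)                    # number of samples per temperature
    f = np.zeros(n_temps)
    for _ in range(n_iter):
        # n(e) = sum_i h_i(e) / sum_j N_j exp(-beta_j e + f_j)
        denom = (N[:, None] * np.exp(-betas[:, None] * e[None, :]
                                     + f[:, None])).sum(axis=0)
        n = h.sum(axis=0) / denom
        # exp(-f_i) = sum_e n(e) exp(-beta_i e)
        f = -np.log((n[None, :] * np.exp(-betas[:, None]
                                         * e[None, :])).sum(axis=1))
    return n, f
```

With a single temperature at β = 0 the fixed point is reached immediately and the estimated DoS is simply the normalized histogram, which is a convenient sanity check.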

q_{ij} ∝ exp[−β_{i}(e(x_{ij}) − ẽ_{i}(x_{ij}))]. | (15) |

Fig. 4 shows the Hellinger distance between the residue distribution obtained by each method and the ground truth distribution obtained by complete enumeration. The Hellinger distance is defined as the Euclidean distance between the square roots of probabilities. It is robust against zero probabilities. In all methods, the distance gets smaller as the number of particles M is increased. Over the whole range of M, the performance of ES, one of our baseline methods, deteriorates as the qualification threshold is relaxed. This is understandable, because ES performs pure optimization and has no mechanism to maintain statistical consistency. Simple random sampling, on the other hand, performs poorly when the qualification threshold is small due to the lack of any optimization mechanism. SLEPA and EPA show stably good results at all qualification thresholds, reflecting the fact that these methods are designed to optimize while maintaining statistical consistency. SLEPA exhibits better results than EPA in most cases. This result indicates that the use of machine learning is effective in increasing the number of MCMC steps to improve distribution estimation. The best distribution estimates (M = 200) by SLEPA, EPA, and ES are shown in Fig. 5.
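The Hellinger distance as defined above is simple to compute. This is a minimal sketch following the definition in the text; some references additionally divide by the square root of two to bound the distance by one, a factor we omit to match the definition above:

```python
import numpy as np

def hellinger(p, q):
    """Euclidean distance between the square roots of two probability
    vectors. Zero probabilities pose no problem: sqrt(0) = 0."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.linalg.norm(np.sqrt(p) - np.sqrt(q)))

# Toy comparison of an estimated and a ground-truth residue distribution.
d = hellinger([0.7, 0.3, 0.0], [0.6, 0.3, 0.1])
```

Because the square root damps large probabilities and leaves zeros at zero, the distance is well behaved even when some residues never appear in the sample.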

Fig. 7 (a) Hellinger distance to the ground truth distribution at M = 200. SLEPA (K_{i} = 10 and 100) and EPA (K_{i} = 1, 10, and 100) are compared. (b) Number of qualified samples.

In our experiments, SLEPA with a Gaussian process was better than EPA, but the performance of SLEPA depends on the choice of the surrogate model and its predictive ability. The use of a surrogate model can be disadvantageous when the model performs poorly.

While SLEPA can reveal correlations between variables, it is well known that correlation does not necessarily imply causation.^{27} According to multiple studies, however, if the joint distribution satisfies certain criteria, it is possible to identify the causal direction from it.^{28,29} The combination of SLEPA and such causal discovery methods may detect causal relationships among variables related to materials properties.

The parameters of SLEPA consist of the number of MCMC steps K_{i}, the number of particles M, the inverse temperature array β_{i}, and the hyperparameters of the adopted machine learning model. First, the number of MCMC steps should be set as large as possible to ensure convergence in each iteration, but setting it above 1000 would be meaningless. For accurate DoS estimation, the number of particles M should also be large, but it is constrained by the capacity of the evaluator. For example, if SLEPA is used together with a simulator that can evaluate z materials in parallel, setting M to a value larger than z may result in unwanted latency. The inverse temperature array is determined by first setting the range [β_{min}, β_{max}] and dividing it into τ equally spaced points. As mentioned, a small τ leads to a large change of temperature that is harmful to resampling, while a large τ causes slow optimization. As in most black-box optimizers, we start from uniformly random samples, thus β_{min} = 0. Setting β_{max} would be the trickiest part: if it is too high, the probability is concentrated on one particle in almost all iterations, and if it is too low, the accuracy of the DoS would be poor. To set β_{max} appropriately, we recommend computing the unnormalized probability exp(−β_{max}e(x)) for all initial examples to check whether the probability is too concentrated or too scattered. Finally, the hyperparameters should be set according to the guidelines of the machine learning model. In our experiments, we used the automatic hyperparameter initialization and update functions of PHYSBO.^{24}
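The recommended check of exp(−β_{max}e(x)) on the initial samples can be sketched as follows. The concentration diagnostics used here, the maximum normalized weight and the effective sample size, are our suggestion for operationalizing the check, not a prescription from the paper:

```python
import numpy as np

def weight_concentration(energies, beta_max):
    """Diagnose whether exp(-beta_max * e(x)) over the initial samples is
    too concentrated. Returns the largest normalized weight and the
    effective sample size (ESS). A maximum weight near 1 (ESS near 1)
    suggests beta_max is too high; near-uniform weights (ESS near M)
    suggest it is too low."""
    e = np.asarray(energies, dtype=float)
    logw = -beta_max * e
    logw -= logw.max()           # stabilize the exponential
    w = np.exp(logw)
    w /= w.sum()
    ess = 1.0 / np.sum(w ** 2)
    return w.max(), ess

# Toy check on M = 100 random initial energies.
rng = np.random.default_rng(1)
e0 = rng.normal(size=100)
wmax, ess = weight_concentration(e0, beta_max=5.0)
```

Scanning several candidate values of β_{max} and picking one whose ESS is neither close to 1 nor close to M is one practical way to apply the recommendation.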

The integration of statistical physics, machine learning, and materials design offers a great opportunity for algorithmic development. When SLEPA is applied in the latent space of variational autoencoders,^{2} highly structured objects such as chemical compounds^{30} and crystal structures^{31} can be generated. To solve multiobjective optimization problems, SLEPA can be extended to estimate a multidimensional DoS.^{32} In addition, other entropic samplers such as replica-exchange Monte Carlo,^{18} the broad histogram method^{33} and transition matrix Monte Carlo^{34} may find their own advantages in the context of materials design.

- K. Terayama, M. Sumita, R. Tamura and K. Tsuda, Acc. Chem. Res., 2021, 54, 1334–1346.
- B. Sanchez-Lengeling and A. Aspuru-Guzik, Science, 2018, 361, 360–365.
- N. Szymanski, Y. Zeng, H. Huo, C. Bartel, H. Kim and G. Ceder, Mater. Horiz., 2021, 8, 2169–2198.
- D. P. Tran, S. Tada, A. Yumoto, A. Kitao, Y. Ito, T. Uzawa and K. Tsuda, Sci. Rep., 2021, 11, 10630.
- D. R. Jones, M. Schonlau and W. J. Welch, J. Glob. Optim., 1998, 13, 455–492.
- I. Rechenberg, Comput. Methods Appl. Mech. Eng., 2000, 186, 125–140.
- P. C. Jennings, S. Lysgaard, J. S. Hummelshøj, T. Vegge and T. Bligaard, npj Comput. Mater., 2019, 5, 1–6.
- A. Zhavoronkov, Y. A. Ivanenkov, A. Aliper, M. S. Veselov, V. A. Aladinskiy, A. V. Aladinskaya, V. A. Terentiev, D. A. Polykovskiy, M. D. Kuznetsov and A. Asadulaev, et al., Nat. Biotechnol., 2019, 37, 1038–1040.
- S. Ju, T. Shiga, L. Feng, Z. Hou, K. Tsuda and J. Shiomi, Phys. Rev. X, 2017, 7, 021024.
- Y. Zhang, J. Zhang, K. Suzuki, M. Sumita, K. Terayama, J. Li, Z. Mao, K. Tsuda and Y. Suzuki, Appl. Phys. Lett., 2021, 118, 223904.
- H. Sahu, W. Rao, A. Troisi and H. Ma, Adv. Energy Mater., 2018, 8, 1801032.
- K. Binder and D. W. Heermann, Monte Carlo Simulation in Statistical Physics: An Introduction, Springer, Heidelberg, 5th edn, 2010.
- F. Wang and D. P. Landau, Phys. Rev. Lett., 2001, 86, 2050–2053.
- F. Liang, C. Liu and R. J. Carroll, J. Am. Stat. Assoc., 2007, 102, 305–320.
- L. Barash, J. Marshall, M. Weigel and I. Hen, New J. Phys., 2019, 21, 073065.
- A. M. Ferrenberg and R. H. Swendsen, Phys. Rev. Lett., 1989, 63, 1195–1198.
- A. T. Müller, G. Gabernet, J. A. Hiss and G. Schneider, Bioinformatics, 2017, 33, 2753–2755.
- K. Hukushima and K. Nemoto, J. Phys. Soc. Jpn., 1996, 65, 1604–1608.
- K. Hukushima and Y. Iba, AIP Conf. Proc., 2003, 200–206.
- S. Kirkpatrick, C. D. Gelatt and M. P. Vecchi, Science, 1983, 220, 671–680.
- H. Christiansen, M. Weigel and W. Janke, Phys. Rev. Lett., 2019, 122, 060602.
- W. Wang, J. Machta and H. G. Katzgraber, Phys. Rev. E: Stat., Nonlinear, Soft Matter Phys., 2015, 92, 013303.
- T. Li, T. P. Sattar and S. Sun, Signal Process., 2012, 92, 1637–1645.
- Y. Motoyama, R. Tamura, K. Yoshimi, K. Terayama, T. Ueno and K. Tsuda, 2021, arXiv:2110.07900.
- G. Zhao and E. London, Protein Sci., 2006, 15, 1987–2001.
- M. Cocchi and E. Johansson, Quant. Struct.-Act. Relat., 1993, 12, 1–8.
- J. Pearl, Causality, Cambridge University Press, 2009.
- D. Janzing, J. Mooij, K. Zhang, J. Lemeire, J. Zscheischler, P. Daniušis, B. Steudel and B. Schölkopf, Artif. Intell., 2012, 182, 1–31.
- S. Shimizu, T. Inazumi, Y. Sogawa, A. Hyvärinen, Y. Kawahara, T. Washio, P. O. Hoyer and K. Bollen, J. Mach. Learn. Res., 2011, 12, 1225–1248.
- R. Gómez-Bombarelli, J. N. Wei, D. Duvenaud, J. M. Hernández-Lobato, B. Sánchez-Lengeling, D. Sheberla, J. Aguilera-Iparraguirre, T. D. Hirzel, R. P. Adams and A. Aspuru-Guzik, ACS Cent. Sci., 2018, 4, 268–276.
- C. J. Court, B. Yildirim, A. Jain and J. M. Cole, J. Chem. Inf. Model., 2020, 60, 4518–4535.
- M. K. Fenwick, J. Chem. Phys., 2008, 129, 09B619.
- M. Kastner, M. Promberger and J. Munoz, Phys. Rev. E: Stat. Phys., Plasmas, Fluids, Relat. Interdiscip. Top., 2000, 62, 7422.
- J.-S. Wang and R. H. Swendsen, J. Stat. Phys., 2002, 106, 245–285.

## Footnote

† Electronic supplementary information (ESI) available. See https://doi.org/10.1039/d1dd00043h

This journal is © The Royal Society of Chemistry 2022