GFlowNets for AI-Driven Scientific Discovery

Tackling the most pressing problems for humanity, such as the climate crisis and the threat of global pandemics, requires accelerating the pace of scientific discovery. While science has traditionally relied on trial and error and even serendipity to a large extent, the last few decades have seen a surge of data-driven scientific discoveries. However, in order to truly leverage large-scale data sets and high-throughput experimental setups, machine learning methods will need to be further improved and better integrated in the scientific discovery pipeline. A key challenge for current machine learning methods in this context is the efficient exploration of very large search spaces, which requires techniques for estimating reducible (epistemic) uncertainty and generating sets of diverse and informative experiments to perform. This motivated a new probabilistic machine learning framework called GFlowNets, which can be applied in the modeling, hypotheses generation and experimental design stages of the experimental science loop. GFlowNets learn to sample from a distribution given indirectly by a reward function corresponding to an unnormalized probability, which enables sampling diverse, high-reward candidates. GFlowNets can also be used to form efficient and amortized Bayesian posterior estimators for causal models conditioned on the already acquired experimental data. Having such posterior models can then provide estimators of epistemic uncertainty and information gain that can drive an experimental design policy. Altogether, here we will argue that GFlowNets can become a valuable tool for AI-driven scientific discovery, especially in scenarios of very large candidate spaces where we have access to cheap but inaccurate measurements or to expensive but accurate measurements. This is a common setting in the context of drug and material discovery, which we use as examples throughout the paper.


Introduction
The climate crisis, antibiotic resistance and the prospect of new pandemics are some of the biggest threats to humanity, posing immense risks to global health and food security. One important common aspect to all these threats and others is that significant new scientific discoveries are required to mitigate them. According to the 2022 report by the Intergovernmental Panel on Climate Change (IPCC) [1], limiting global warming will require the adoption of alternative fuels, as well as improvements in the efficiency of energy production and material synthesis. The discovery of new materials, such as electrocatalysts that improve the energy efficiency of chemical reactions, can therefore play a crucial role in such a transition. Correspondingly, growing risks of antimicrobial resistance and pandemics make it essential to accelerate the pipeline for discovery of new drugs. Consequently, the well-being of our societies will strongly depend on the pace of our scientific discoveries. Historically, scientific discovery has been the outcome of either serendipity-such as penicilin and teflon [2]-or the rather slow experimental science loop: observations are accumulated from past experiments, which are carefully analyzed by experts who produce new hypotheses and design experiments that will eventually yield new, valuable observations to continue the cycle (see Figure 1). While this model has well served the progress of science for centuries and will continue to do so in certain domains, the fully manual version of this cycle is too slow for the pressing emergencies of our time. A bottleneck in the cycle occurs when the analysis of data, production of hypotheses and experimental specification are manual. This is further exacerbated when the search space of candidates is dauntingly large, as is the case for drug discovery where there exist 10 60 feasible small molecules, according to estimations [3].
The scale at which scientific experiments can be conducted is rapidly increasing, enabled by advances in robotics, biotechnology and computational capabilities, among others [4]. For example, we can now easily and cheaply collect high-dimensional images and videos, electron microscopy data or the gene expression of millions of cells. Furthermore, we can also conduct thousands or even millions of experiments in parallel to screen new candidate molecules, experiment with a sequence of reactions, etc. If experimental interventions can be combined, we can sample from a combinatorially large space of possible experiments at each step. The avenues opened by such large-scale availability of data and compute have been identified as the fourth paradigm in scientific discovery [5]. Nonetheless, our current tools are not enough to truly utilize all the information and resources at our disposal [6]. In this context, the maturation of tailored machine learning (ML) methodologies offers the possibility to not only analyze and make sense of the data, but also to improve the generation of hypotheses and design of experiments, accelerating the experimental science loop.
ML techniques have been employed in all of the main steps of the experimental science loop illustrated in Figure 1: (a) analyzing and modeling the data accumulated from experiments, (b) characterizing and generating hypotheses compatible with the data, (c) designing the next experiments, and (d) performing the experiments. Analysis and modeling of data is a naturally appealing scenario for ML methods, which are typically designed for extracting predictive patterns from large data sets. For instance, machine learning methods have been quite successful in modeling the quantum mechanical properties of small molecules [7]. ML approaches have also been studied in the context of designing candidate experiments [8]. A standard example of this is leveraging tools from reinforcement learning (RL) and Bayesian optimization (BO) for searching candidates that optimize (as a reward) some property of the candidate [9,10]. However, as we discuss in detail in Section 2, existing approaches often lack a principled treatment of the challenges introduced by limited data, uncertainty and underspecification of the objectives in scientific domains.
In this paper, we discuss how a novel machine learning (ML) framework, called Generative Flow Networks [11,12]-or GFlowNets for short-can help in addressing the shortcomings of existing ML approaches in scientific domains. GFlowNets are general purpose inference machines, which enable generating samples with a probability proportional to some reward function. As a consequence, GFlowNets have emerged as a potentially transformative tool for scientific discovery, as they can be used to generate hypotheses, design experiments and model the experimental observations, key steps of the experimental science loop (Figure 1)-note that ML-augmented robotics can also be useful in the experimental step, but we do not discuss this here.
Throughout the paper, we consider the following motivating examples to make the discussion more concrete.
Example 1.1. An important part of fighting growing antimicrobial resistance and emergent infectious diseases is speeding up the discovery of novel small organic molecules (and peptides) that inhibit the action of target bacteria or one to several target proteins. The primary goal is to search the space of molecules (including peptides) for candidates that bind to the target of interest and inhibit or activate its function. The space of molecules is combinatorially large, and the accurate evaluation of the desired activity (e.g. binding affinity) in-vitro or even in-silico is expensive. Additionally, aside from binding to a target, there are several other pharmacological criteria which a molecule needs to satisfy for use as a therapeutic, such as low toxicity to humans, good Absorption, Distribution, Metabolism, and Excretion ("ADME"), and ease of synthesis. Example 1.2. Discovery of novel materials for applications in the generation, storage, and use of clean energy that have better efficiency and rely on sustainable raw materials are critical in aiding efforts to reduce rising global temperatures. The goal here is discovering novel inorganic as well as organic materials, and there are often several specific metrics to optimize simultaneously. A representative example is the development of storage materials for lithium/sodium ions in lithium/sodium ion batteries, where the class of metal oxides (e.g. Figure 2) alone already represents a combinatorially large search space. Inside a lithium/sodium ion battery, a good candidate metal oxide for energy storage should show high energy density, together with other metrics such as high capacity retention, low irreversible capacity, and low cost of synthesis. Example 1.3. Modeling the genetic pathways through which diseases progress within humans plays a critical role in our understanding of human biology. These causal models can help us understand the behavior of various interventions such as therapeutics within the complex environment of the human body. The goal is to learn such causal models with the help of targeted interventions. Even the causal structure of a single cell is a major challenge and raises difficult questions to appropriately scale algorithms and combine learning from data with prior knowledge from biology.
Organization. In Section 3 we discuss existing ML methodology, highlighting the distinctive features of GFlowNets. Next in Section 4 we introduce ideas around amortized inference, and discuss how GFlowNets emerge from these ideas. In Section 4.2, we highlight scientific discovery problems where GFlowNets have been applied successfully. In Section 4.3, we discuss how GFlowNets can also be used to represent a distribution over causal models linking multiple random variables of interest and to estimate the posterior distributions of interest along with the marginalized quantities, such as Bayesian posterior densities, that are important to evaluate the information gain from an experiment. To conclude, in Section 5 we chart a potential path towards a unified framework for scientific discovery driven by GFlowNets.

Challenges for AI in Scientific Discovery
In the last few decades, ML has enabled remarkable technological advances ranging from agents that can surpass humans at the game of Go [13] to breakthrough advances in protein folding prediction [14]. These advances have been enabled, in part, by the availability of extremely large datasets and often of a well-specified objective to be optimized. In many scientific discovery applications, however, the limited available data, the uncertainty intrinsic to measurements and the underspecification of objectives pose serious challenges for leveraging ML approaches. In this section, we discuss the relevance of these challenges and argue how GFlowNets can circumvent them.

Limited Data and Uncertainty
One critical challenge in leveraging learning-based approaches for scientific discovery is the limited availability of data and the associated uncertainty. Part of the uncertainty is due to unreliable measurements, called aleatoric uncertainty, and part is due to having a limited amount of training data, called epistemic uncertainty, which is the uncertainty associated with theories or models and their parameters [15], due to the finite size of datasets. Epistemic uncertainty emerges because multiple theories or models or settings of parameters can be compatible with the given data, and Bayesian posteriors on these can capture the epistemic uncertainty. By design, current state-of-the-art ML approaches rely on access to large data sets to extract useful patterns. But owing to experimental limitations, it can be extremely expensive or impossible to obtain large amounts of accurate and precise data at the scale required my modern ML approaches in many applications of interest. In drug discovery, for example, obtaining experimental binding affinities for ligands with a target protein, at the scale required for ML methods, is often very challenging. Furthermore, the differences in experimental techniques and measurement noise can lead to different binding affinities for the same ligand by orders of magnitude. The same is true in materials science: as an example, in the research for new battery materials, a few thousands of data points are already considered high-throughput [16], and the measurement metrics such as energy density can vary significantly based on minute details of the experimental setup. Thus, it is essential for models to account for the aleatoric and epistemic uncertainty within the context of scientific discovery.
Additionally, scientific phenomena occurring in nature tend to be complex and often a product of complicated processes. Standard machine learning models that learn a function mapping an input to an output from data can fail to generalize to unseen scenarios where the phenomenon occurs. Incorporating the causal structure of the phenomenon can introduce effective inductive biases that can allow models to generalize to novel scenarios [17]. Whereas a given model will account for uncertainty in outcomes, i.e., aleatoric uncertainty, by modeling Bayesian posteriors we can account for all the models that fit well the data, thus also capturing the epistemic uncertainty, which is what we want to reduce through experiments. As we discuss in Section 4.3, GFlowNets can be employed to model the posterior distribution of causal models that fit the data well.

Underspecification and Diversity
ML approaches often assume access to some reward signal or scoring function to evaluate the quality and utility of experimental designs. For instance, to design drug-like molecules, the true objective is to find ligands that specifically inhibit the target protein within the human body. However, this objective can hardly be specified and conveniently quantified as a simple scalar reward. In practice, an estimate of the binding affinity of the molecule with the target protein is used instead as the reward signal to search for candidate molecules. This comes with two caveats: first, the estimation of binding affinity is not immune to systematic and random errors; second, the binding energy alone cannot account for many of the factors that can influence the effect of the ligand within the human body. For instance, a molecule (or a series of similar molecules) that only optimizes this binding energy may be potentially useless or even harmful in-vivo because it could bind to many other proteins and thus be toxic (more broadly, the molecule could have poor ADME). These aspects make it critical to find diverse hypotheses-in this case, diverse motifs of molecules-to account for the underspecification and uncertainty in the reward signal. Nonetheless, popular approaches to ML-aided scientific discovery, like RL [9] and Bayesian optimization [10] aim to discover a single maximizer of the the reward signal, not accounting for underspecification of the reward signal itself. Diversity of candidate solutions is also particularly relevant in the evolution of new generations of technologies. For example, before the use of pervoskites (or other new materials) in solar cells, optimization of power conversion efficiency and manufacturing in silicon-based solar cells was yielding diminishing returns, and only the use of radically different materials was able to change this situation. By sampling proportional to the reward, GFlowNets, can mitigate the problem of missing out potentially interesting findings due to underspecification in the target reward, generating a diverse set of high-reward candidate solutions to the discovery problem at hand.

Background
In this section, we review the relevant fundamental areas in machine learning and set the stage for Section 4 where we discuss how GFlowNets can address the challenges presented in Section 2.
Preliminaries. . To establish notation, consider an example problem from organic synthesis where chemists want to find new and/or efficient synthetic pathways to obtain a desired molecule (e.g. a drug candidate).
• Design. Let x ∈ X be the design for an experiment, where X denotes the design space of all possible experiments. For Example 1.1 and Example 1.2, each experimental design, x, specifies a candidate molecule, antibody or metal oxide material.
x can also represent interventions on specific genes for Example 1.3, or experimental parameters for a specific procedure (e.g. synthesis conditions). If an experiment is modeled in-silico, it can also include the fidelity of the approximations used, with the associated computational cost.
• Parameters. We use θ to represent the parameters of our mathematical model of the underlying phenomenon of interest. Like the experimental design x, θ can parameterize a wide range of objects. For instance, θ could represent the parameters of the physical process of supramolecular interactions/protein binding in Example 1.1, or a set of parameters describing the model of energy capacity and mechanisms of capacity loss in a battery for Example 1.2, or a structural causal model describing the causal interaction genetic pathways in Example 1.3, or more general cases such as parameters of a neural network.
• Outcome. We use y to denote the experimental outcome, for example the yield of a chemical reaction. The outcome, y, may also be multi-dimensional, and may include all measurements recorded during the experiment. y can represent the binding affinity to a target, toxicity to humans and synthetic accessibility in Example 1.1. For Example 1.2, y can represent energy capacity and retention loss, as well as materials purity and X-ray diffraction patterns. In Example 1.3, y can even include images and videos of specifc interventions on a population of cells. The experimental outcomes are often structurally rich but might lack the abstractions necessary for effective modeling.
• Dataset. Finally, we denote the data collected from previous experiments as We model the outcome y as a consequence of the design x. The likelihood -denoted p(y | θ, x)-is parameterized by θ, and is a measure of how likely each experimental outcome, y, is to occur, given a particular design, x, and model parameters, θ. This likelihood acts as our abstract model of the underlying phenomenon. If the likelihood is known analytically or can be computed efficiently it is called an explicit model. For example, we might model the outcome of an experiment with a Gaussian distribution, with mean as some function f θ (x), such that p(y | θ, x) = N (f θ (x); σ). Alternatively, if the likelihood is intractable, it is called an implicit model. Implicit models are common in the context of scientific discovery because we often model processes via simulators that allow us to sample from p(y | θ, x), without being able to evaluate the likelihood.
We assume that the data set of observed variables D = {x 1 , . . . , x n }, is drawn from the joint distribution p(x, θ) for some θ. In order to use that data to update our model of likely outcomes, we need an approach to solving the inference problem: estimating the posterior probability distribution p(θ | D) over the latent variable given the observed data. This problem appears in various contexts across domains. For instance, given observations of the experimental outcomes, say the binding affinities of several ligands to a particular target, we might be interested in estimating the distributions over parameters in the model that describe the process of binding. In principle this distribution can be estimated using Bayes' rule, In practice, however, computing the posterior exactly is intractable for high-dimensional θ and x. Thus, approximate inference methods have been studied extensively. We briefly summarize two broad classes of methods: Markov Chain Monte Carlo (MCMC) and Variational Inference (VI).

Approximate Inference
MCMC methods [18] are designed to generate samples from a target distribution, where a correct sampling procedure is not known but the density of the desired distribution is known up to a normalizing constant. They approximate sampling from the desired distribution by constructing a sequence of samples whose asymptotic distribution (as the sequence becomes longer) matches the desired target distribution. At each step of the sequence a new sample is generated by performing a small random perturbation from the previous sample. When trying to sample from a Bayesian posterior p(θ | D), we can compute the unnormalized form of the posterior as the product of the prior and the likelihood, p(θ)p(D | θ). Unfortunately, in high-dimensional problems and problems where the modes of the distribution occupy a tiny relative volume and can be far from each other, MCMC methods can take exponentially more time to properly sample from all the modes (or even just move from one mode to another). Methods to improve the performance of MCMC for sampling from high-dimensional distributions [19,20,21] are limited to certain classes of distributions and do not apply to sampling complex objects such as graphs and sequences which are important in a number of applications in scientific discovery. Variational Inference methods [22] approach the problem of sampling from the posterior using instead an optimization-based approach, finding a member of family of distributions that is closest to the posterior distribution that we seek. In particular, they search for a distribution q(θ)-by optimizing the values of q or parameters that define it-so as to minimize the reverse Kullback-Liebler (KL) divergence E θ∼q [log q(θ) − log p(θ, D)]. Because this measure of "closeness" is a reverse KL, VI methods tend to drop most modes of the true posterior or even focus on just one of them [23,24].
GFlowNets address this issue of mode dropping that plagues both MCMC and typical VI methods. They are similar to amortized variational methods (i.e., they learn a parameterization for q) but use a different training objective which favors a greater diversity of samples by allowing the use of exploration in the space of samples [25] as we discuss in Section 4.1.

Experimental Design
Experiments are the primary interface for interaction between our abstract models and the complexities of the real world. A key element of scientific methodology has been the careful design of experiments that allow the acquisition of knowledge corresponding to a reliable understanding of the underlying phenomena. However, experiments are expensive-either computationally, financially or in time. Therefore, we need methods to design experiments that maximize the amount of information our models learn from each experiment. This task of automated experimental design has been extensively studied in statistics and machine learning.
The field of experimental design studies the problem of designing "useful" experiments effectively. The usefulness of an experiment is defined by a utility function (or reward) U (·; D) : X → R, which may change as a function of the data, D, that we have observed from previous experiments. Given this utility function, we are typically interested in selecting the most useful designs, x * , The process of experimental science is often iterative, as illustrated in Figure 1. We design an experiment, perform the experiment and observe the experimental outcome, update our model based on the observations and then design the next experiment guided by the updated model. This is referred to as sequential experimental design. The sequential experimental design setting is thus formalized at any iteration k as follows, in terms of the estimated utility of the experiment x considered and the past data D k−1 : consists of the designs and outcomes of the experiments performed till iteration k.
Designing utility functions that accurately reflect the value of an experiment while being efficient to compute has been a problem of interest in various communities. Classical work on experimental design relied on the Fisher information matrix to quantify the information about parameters θ contained in the experimental outcome y [26,27]. This measure of information can be efficiently computed in linear models, where the outcome depends linearly on the design [8]. When this relationship is nonlinear, a variety of methods exist to select among a version space of nonlinear functions that are consistent with what what we have observed; see Settles [28] for a survey. In scientific discovery, we are not agnostic to the set of functions that could explain our observations: we typically have significant prior knowledge from the literature and previous experiments that what we can use to weigh the relative likelihood of potential experimental outcomes. Bayesian experimental design, introduced below, provides a principled approach to incorporating these priors into our choice of future experiments.

Bayesian Experimental Design
Bayesian Experimental Design (BED) or Bayesian Optimal Experimental Design (BOED) [29,8] approaches experimental design by modeling a rational agent that aims to maximize their expected utility of new experiments with respect to prior beliefs. The utility function provides a real-valued score for each potential outcome, y, of any potential experimental design, x. At each round of experimentation, the agent selects an experimental design 1 , x, that gives the highest expected utility weighted by how likely each outcome is to occur under the agent's prior beliefs (parameterized by θ). By specifying the agent's prior belief, we can encode scientific knowledge of known relationships and uncertainties in the observed outcomes, thereby making the procedure more efficient at exploring unknown parts of the experimental design space.
A common choice for the agent's utility function is the mutual information [MI; 30] between the experimental parameters θ and the outcome y observed upon performing an experiment y, given the dataset D of previous experiments.
MI can be interpreted as how much information can we expect to gather-or equivalently how much we reduce our uncertainty-about some random variable of interest (say the model parameters θ) thanks to the experimental outcome. Equations 5 and 6 give two equivalent definitions of MI, but both are intractable in general so we will need to rely on approximations.
Assuming access to the likelihood function, p(y | θ, x), we first need to estimate the various posterior distributions, depending on which of the above formulations of the MI we choose. To sample one of the sums, we need to estimate or at least sample from either the unknown and generally intractable posterior over parameters p(θ | D), or the marginal likelihood p(y | x) (where θ has been summed out). Estimating high-dimensional posteriors can be challenging and marginalizing out θ in p(y | θ, x) can also be intractable. Additionally, the MI itself involves a high dimensional integral which can be intractable. Consequently, developing efficient estimators to approximate the MI has been one of the central challenges in BED. Estimators of MI have been developed for both implicit and explicit models. The estimators typically leverage tools from approximate inference, including MCMC and VI discussed in Section 3.1, to approximate the posterior over parameters or the marginal likelihood [31,32,33,34,35,36], and the likelihood in the case of implicit models [37].
Towards Amortization Conventional approaches discussed above consist of two distinct steps: estimating or approximating the posterior over parameters or marginal likelihood to estimate the MI and then maximizing this estimator of MI to find the optimal experiment. Despite being applied in various contexts in scientific discovery [38,37], each of these steps alone can be computationally expensive. [39] introduced a unified stochastic gradient method to combine estimation of MI and the selection of the optimal experiment. They propose jointly optimizing the parameters φ of the MI estimator and the experiment design x using gradient based methods. [40] formalize the conventional BED approach in terms of a design policy π, which directly maps a history of experimental data h t = [(x 1 , y 1 ), . . . (x t , y t )] to the next experiment design x t+1 . Once this policy is trained, the next experiment can be selected directly using the policy, instead of the usual MI estimation and optimization. In essence, the training of the policy amortizes the cost of estimating and optimizing the MI. [41] extends this to implicit models.
These are examples of amortized inference: instead of expensive Monte-Carlo sampling to estimate or maximzes some expected value at run-time, we pre-train a function that directly produces an approximation of the desired quantities. We elaborate on learned amortized inference in Section 4, with a focus on GFlowNets and a discussion of its advantages over MCMC methods.
Current BED methods have been limited to continuous design spaces. While recent work has considered extensions to incorporate discrete designs [42], they are limited to small problem domains. BED methods in general are hard to scale to larger problem settings.

Bayesian Optimization
A typical problem encountered in various domains of scientific interest is that of optimizing the value of some expensive to compute black-box function f . For instance, consider the task of designing novel molecules to inhibit the activity of a particular target protein. We are interested in searching for a molecule x which minimizes the binding energy of the molecule with the target protein, f : Further, we only observe noisy measurements y on evaluating f , as described by p(y | f (x)). In the context of the binding energy, each experimental evaluation can be expensive and take weeks to perform in the lab and the observed results are noisy. Consequently the goal here is to discover x with the fewest possible evaluations of f . This problem is studied broadly within the framework of global optimization [43]. Solving the global optimization problem for any general function f is NP-hard in discrete spaces, and intractable without special structural constraints (e.g. convexity) on f in the continuous case. Methods for finding approximate solutions to this problem have received significant attention in the literature due to the broad applicability. Among the wide variety of techniques studied, Bayesian Optimization (BO) [44,45,46] is a popular and widely used approach. Bayesian Optimization has been applied extensively to a wide variety of scientific problems [47,48,49,50]. Broadly, Bayesian optimization consists of an iterative process to search for the global optimum in a sample-efficient manner, relying on tools from Bayesian Inference. Algorithm 1 provides a general overview of the Bayesian optimization approach.

Algorithm 1: Bayesian Optimization
The two key ingredients of a Bayesian optimization algorithm are the surrogate model p(f | D) which approximates f (and its epistemic uncertainty) and the acquisition function α(x, p(f | D)) which quantifies the utility of acquiring a point. Gaussian Processes [GPs; 51] are an appealing choice for the surrogate model owing to simple analytical form for the posterior p(f | D). Consequently, GPs are the default choice in modern BO methods [52].
From an information theoretic perspective, in each round we are interested in acquiring candidates that maximize the mutual information between the observed value and the global optimum of the function: arg max This objective resembles the information gain from Section 3.3, where the parameter of interest is θ = x = arg max x∈X f (x). Indeed, BO can be viewed as an instantiation of experimental design with an implicit model f , where we are interested in a particular random variable, the location of the maximum value of f , rather than all the parameters.

Acquisition Functions
The acquisition function plays a critical role in Bayesian Optimization and over the years various acquisition functions have been proposed. Expected Improvement [53]  . [59,60] instead considered the mutual information between the outcome and the max value of f rather than the arg max, resulting in the Max Value Entropy Search (MES) acquisition function: [61] further proposed considering a lower-bound on the mutual information resulting in a general purpose information-theoretic acquisition function GIBBON, which is also applicable to various extensions we discuss below.
Extensions Inspired by various practical applications, several extensions to the standard Bayesian Optimization setting have been studied. A common scenario is where f can be evaluated on multiple different candidates in parallel. In practice, we can often evaluate multiple candidates with nearly the same cost as a single candidate. For example, phage display can produce libraries of millions of antibodies in one batch. Batch Bayesian Optimization [62,63] is an extension of BO where in each round we acquire a batch of candidates instead of a single candidate. Additionally, we might have access to oracles with different costs and fidelities to evaluate f ; for example, to obtain the simulated binding affinity of a molecule to a protein, oracles can include free energy perturbation and molecular docking, where the former is substantially more accurate but the computational cost is orders of magnitude higher.. This setting is studied in Multi-fidelity Bayesian optimization [64,65,66]. Another important aspect in practical applications is multiple objectives. For example, we are interested in multiple properties such as the drug-likeness, toxicity to humans, and synthesizability in addition to binding energy in the drug discovery setting. Multi-Objective Bayesian Optimization [58,67,68] methods study this problem setup. Recent work has also incorporated physical inductive biases as priors for efficient Bayesian Optimization [69,70]. Finally, while traditional BO methods mainly consider continuous x, recent work has enabled BO on discrete spaces [71,72]. Despite recent progress, BO methods have typically been limited to small problems due to challenges in scaling surrogate models to larger domains. Additionally, as mentioned in Section 2, as BO is concerned with maximizing or minimizing a function, it can miss out on diversity which is critical for many scientific applications.

Causal Discovery
An important goal of the scientific methodology is understanding the causes and effects of certain phenomena based on prior observations and experiments. Causal Discovery studies the problem of learning the causal structure from data. A Bayesian Network [73] is a representation of the joint distribution over d random variables {Y 1 , . . . , Y d }, whose conditional independencies are encoded in a compact graphical way. These random variables correspond to the nodes of a directed acyclic graph (DAG) G that determines the factorization of the joint distribution as where Pa G (Y i ) is the set of parent nodes of Y i in the graph G, and θ i are the parameters of the conditional distribution associated with the random variable Y i .
Although in a Bayesian Network the edges connecting the nodes in G only encode associations between random variables, a causal graphical model enhances this framework with notions of causality. In a causal graphical model, again represented by a DAG, G, over the random variables, any directed edge Y i → Y j represents a direct causal influence of Y i on Y j . This allows the model to not only represent the joint distribution of the system (i.e., passively observing the system), but also the effects of actively experimenting on it. A causal model specifies the distribution that would be obtained under any intervention. For example, a "DO-intervention" sets the value of a variable, ignoring its usual causes. However, because the same causal mechanisms (the conditionals P (Y i | Pa G (Y i ); θ i )) are shared across all interventions, if the causal mechanisms (i.e. θ) and the graph G have been inferred correctly, a causal model can generalize to distributions never seen during training (i.e., out-of-distribution), corresponding to new interventions.

Causal Structure Learning
The structure G of a causal graphical model is often assumed to be known, where for example the causal relationships are determined using expert knowledge. This allows us to perform a number of tasks using these models, such as inference (either probabilistic, or causal), or learning the parameters θ of the causal model from observations of the system.
However in the context of scientific discovery, the objective is precisely to discover causal relationships that may have eluded experts thus far. For example, in the development of a disease, we want to find what factors (e.g. social factors, proteomes, pathogens) are involved. In this situation, we would like to learn the structure of the causal graphical model (or at least part of it) using data, stored in a dataset D. This data could either come from passive observations of the system (called observational data, e.g. the statistics of protein expressions in patients), or from active experiments (called interventional data, e.g. genome-wide gene perturbation). This problem is known as causal structure learning, or causal discovery [74,75,76,77,78].

Bayesian Causal Discovery
Similar to how there may be multiple theories explaining the same phenomenon, there may also be multiple models that could explain our observations equally well, even in the limit where we have a very large amount of data. Concretely, this means that many standard structure learning methods would typically choose an arbitrary model, which could lead to undesirable (and potentially harmful) outcomes. For example, if we had a system with only two (correlated) random variables A and B, there would be no way in general to distinguish between the two causal models A → B and B → A using observational data only, even though both models have significantly different causal conclusions. Moreover, in practice, the amount of data available in D to identify the causal model may be scarce, and this introduces another source of variability: since causal discovery methods only return a single candidate, some theory may be favored only due to the limited evidence. Ideally, we would like to quantify our epistemic uncertainty to avoid model misspecification. This can be done using the Bayesian posterior over the causal structures G, given a dataset D, similar to the description in Sec. 4.1.6. Using Bayes' rule, the posterior is given by In the expression above, P (G) represents our prior belief, and may encode some a priori knowledge about the structure G. For example, we may encourage causal graphs to be sparse, i.e. to limit the number of parents for any node in the graph. While this prior may be designed based on expert knowledge, this usually encodes only soft beliefs about the causal model, and does not represent a single graph G unlike when the structure is assumed to be known. The term P (D | G) is called the marginal likelihood, and is defined by integrating over all possible values of the causal mechanisms where P (D | θ, G) is the likelihood of the data under a specific choice of causal structure and mechanisms, and P (θ | G) is a prior distribution over causal mechanisms. The marginal likelihood represents how well the data D fits a certain hypothesis, given by the causal model G, regardless of the choice of the causal mechanisms themselves.

Active Learning of Causal Structures
With an approximation of the Bayesian posterior over causal graphs, we can also leverage the tools from Bayesian Experimental Design in Section 3.3 in order to design interventions on the system that would refine our beliefs about its causal structure [91]. This creates a feedback loop between the estimation of our uncertainty about the causal graph based on data, the decisions about which experiments to perform, and the acquisition of new experimental data (see Figure 1). Active causal discovery can be used to learn the causal structure of either the whole system [92,93,94,95], or part of it [85]. Recently, Toth et al. [96] proposed a novel framework called Active Bayesian Causal Inference (ABCI) to infer not only the causal graph, but also jointly learning the posterior over causal queries of interest.

Generative Flow Networks
We begin by introducing the broader ideas around using neural networks to efficiently learn a mapping from sampled observations to proposed high-dimensional probability distributions and intractable sums, as a general substitute for the popular MCMC-based inference. These ideas lead into GFlowNets, which provide a general framework for amortized inference with neural networks.

Learning to Perform Amortized Inference
Let us first look at how neural nets can be used to amortize the inference problem introduced in Section 3.1, i.e., by being trained to approximately perform the sampling or summing task that is otherwise intractable. Specifically, we consider how an intractable expectation or sum can be transformed into a tractable training task to approximate the desired sum. This is the fundamental principle underlying GFlowNets.

Simple Mean Squared Error Criterion to Amortize an Intractable Expectation
Consider a set of intractable expectations that we would like to approximate, for a pair of random variables x and y that can both take an exponential number of values or live in a high-dimensional space: which is then intractable because of the exponential number of terms in the sum. If we know how to sample from p(y | x), we could, however, train a neural net S with input x, stochastic target output R(x, y) and Mean Squared Error (MSE) loss where y ∼ p(y | x), to train the estimator S with parameters θ. When we sample training examples (x, y), the stochastic gradients ∂L(x,y) ∂θ would make S converge to S if it has enough capacity and is trained long enough [97].
For any new x, we would then have an amortized estimator S(x) which in one pass through the network would give us an approximation of the intractable sum S(x). We can consider this an efficient alternative to doing a Monte-Carlo approximation Figure 2: Example schematics of a GFlowNet. (a) The objects whose distribution is modeled by a GFlowNet should be compositional, like graphs or sequences, built through a sequence of actions; (b) Representation of the sequences of actions by which a GFlowNet can construct a molecule by composing smaller fragments, as a directed acyclic graph (DAG) whose nodes represent partially constructed molecules, and a reward is provided when the molecule is completed. Figure 3: An illustration of the feasibility of systematic generalization enabled by ML methods: if we have already discovered the three modes shown of a reward function, a learner that can generalize may guess the presence of a 4th mode, because the first three modes seem to align on a grid. The existence of such generalization structure is why amortized ML samplers can potentially do much better than MCMC samplers.
which would require a potentially large number of samples and computations of R(x, y) for each x at run-time, especially if p(y | x)R(x, y) is a rich multimodal function (for which averaging just a few samples of y does not give us a good estimator of the expectation). Besides the advantage of faster run-time, a crucial potential advantage of the amortized version is that it could benefit from generalizable structure in the product p(y | x)R(x, y): if observing a training set of (x, y, R(x, y)) triplets can allow us to generalize to new (x, y) pairs, then we may not need to train S with an exponential number of examples before it captures the generalizable structure and provides good answers (i.e., approximates E Y |x [R(x, Y )] well) on new x's. This ability to generalize from structure in the data is actually what explains the remarkable success of ML (and in particular of deep learning) in the vast set of modern AI applications.
When we do not have a p(y | x) that we can sample from easily, we can, in principle, use MCMC methods that form chains of samples of y's whose distribution converge to the desired p(y | x), and where the next sample is generally obtained from the previous one by a small stochastic change that favors increases in p(y | x). Unfortunately, when the modes of the summand, p(y | x)R(x, y), occupy a small volume in the search space (i.e. throwing darts does not find them) and these modes are well-separated (by low-probability regions), especially in high dimension, it tends to take exponential time to mix from one mode to another. However, such an MCMC approach leaves money on the table: the attempts (x, y, R(x, y)) contain information that one could use to train an ML model. To the extent that the space is sufficiently structured, such a model could guess where the yet unseen modes might be given the location of the already observed modes, as illustrated in Figure 3.

GFlowNet Criterion to Obtain a Sampler and Estimate Intractable Sums
Let us consider the situation where we do not have a handy p(y | x) and our objective is just to approximate a set of intractable sums (for any x) where we have the constraint that R(x, y) ≥ 0 and S(x) > 0. This may be useful to estimate a normalization constant for energy-based models or Bayesian posteriors (where y corresponds to learnable parameters and x to observed data). Hence we may also be interested in the sampling policy Now, GFlowNet losses are derived from a set of constraints that we would like to be true: We can define estimators π and S and train them with a loss such as or with an interpretation of R as unnormalized probabilities that we want well calibrated in the log-domain, where (x, y) are sampled from a training distribution p(x, y) that has full support. It can then be shown [11,12] that with π and S with enough capacity and trained for long enough they both converge to their desired value: This is a crucial property of GFlowNets: they are trained to sample objects y (given x) with probability proportional to a given reward function R(x, y).

Marginalizing over Compositional Random Variables
To make the notion of intractable sum more concrete, it is good to think of y (and potentially x as well) as a compositional object, like a subset of variable-value pairs or a graph. For example, we can construct compositional objects sequentially through a series of steps where a new piece of the compositional object is inserted at each step, such as the molecular fragments composed to form a larger molecule in Fig. 2b. The sampling policy π then sequentially and stochastically chooses a constructive action at each step, and after each step we get a partially constructed object s which we call a GFlowNet state. A sequence of such states and actions forms a GFlowNet trajectory τ . In the basic GFlowNet framework, the actions are stochastic, but the next state is deterministically obtained from the previous state and the action, because they are not happening in an external environment but are part of the internal computation of a sampler. In addition, the GFlowNet mathematical results, as they currently stand, assume that each step is constructive, i.e., we cannot return to the same partially constructed object s twice. This means that the set of all possible trajectories forms a directed acyclic graph (DAG), illustrated in Figure 2. Because the transitions are deterministic, we can specify a trajectory τ with a sequence of states (or, equivalently, an initial state and a sequence of actions). A special "exit" action is also defined to declare the construction of the object y is completed. The policy π(y | x) is now specified by a forward transition distribution P F (s | s ) which specifies how to generate each constructive step given the previous state, and we are interested in parameterizing and learning this P F . For instance, consider the molecule graph example illustrated in Figure 2. We can construct a molecule graph sequentially using nodes representing atoms as building blocks. Starting from a special empty state, which we denote as s 0 (this can be an empty graph, null set or empty sequence or chosen based on the value of a conditioning variable x, maybe specifying some desired characteristics of the molecule), the object y can be constructed through a sequence of steps, each consisting of adding a single block a ∈ A. We assume that the actions are limited to be constructive, and deletion of blocks is not allowed. At each step, we have a partially-constructed object s ∈ S, where S denoted the space of all possible partially-constructed objects and Y ⊂ S. Another assumption we make throughout is that these states are Markovian, that is, they incorporate all the information from their history. This results in a directed acyclic graph (DAG) G which is defined by a tuple (S, E), where the set of nodes corresponds to S, and an edge s → s ∈ E indicates that object s can be constructed by adding a block a ∈ A to s, s a − → s . We can define a trajectory as a sequence of steps describing the construction of an object τ = s 0 a1 − → s 1 . . . an − − → y . 2 Let R(x, y) denote the utility (reward) for a given object y in context x. For instance, it can be the binding energy for the ligand molecule y with a given target protein x.
Formally, the training objective is to learn a stochastic policy π which sequentially generates an object y with a probability proportional to its reward, i.e., π(y | x) ∝ R(x, y).

Multiple Parent States
Besides the sequential nature of the generative process for y, an interesting complication is that there may be many ways (in fact exponentially many trajectories) to construct y from some starting point and context x. This means that a partially constructed object, i.e., a state s, may have multiple parents s for which an action a exists that leads to s. Otherwise (when each state only has one parent), the DAG is a tree, which makes the computation much simpler. But when it is not a tree, it turns out to be convenient to consider and parameterize a backward transition probability function P B (s | s) which is consistent with that DAG and the associated forward transition probabilities P F . The constraint in Eq. 17 can be reformulated in several ways, in particular what is called the detailed balance constraint: where F (s) is called the flow at state s and plays a role similar to S(x) above, i.e., it is an intractable sum, and there is a starting state s 0 = x from which a trajectory is initiated, as well as a constraint that the flow into a terminal state s = y equals R(x, y). Similarly to the simpler case above, this can be turned into a training loss that we want to minimize, but now over all (x, τ ) pairs or over all (s, s ) pairs. As a consequence of satisfying the detailed balance constraint at all (s, s ) pairs, the initial flow becomes equal to the normalizing constant [11]:

Learning Objectives
In practice, we would like to approximate P F (· | ·; θ), P B (· | ·; θ), and F (·; θ) with learnable parameters θ, and we want to choose those parameters to satisfy as well as possible the detailed balance constraint ∀s, s ∈ S. The detailed balance constraint can be converted to the following loss to learn the parameters θ: Several alternative learning objectives for GFlowNets have been proposed, especially for longer trajectories to sample the object y [98,99]. Trajectory Balance [98, TB] is a prominent learning objective for training GFlowNets. Contrary to the detailed balance objective, which considers constraints on pairs of states, trajectory balance jointly applies the detailed balance constraint over entire trajectories. A learnable parameter Z is introduced which at convergence is equal to the desired sum. The trajectory balance loss over a trajectory τ is defined as follows: Algorithm 2 illustrates the typical approach for training GFlowNets, by sampling trajectories from a sampling policyP F which is typically a tempered P F or a mixture of P F with a uniform policy to enable exploration, and optimizing the loss induced by the learning objective with respect to θ, with stochastic gradient descent 3 . As a result of this procedure, we learn a P F with the objective that the marginal likelihood of a trajectory terminating at a terminal state x, denoted as π(x) become proportional to the reward R(x). The learnable objects P F , P B , F are typically parameterized by neural networks. These neural networks must have appropriate inductive biases depending upon the type of objects we are constructing. These neural networks must also have enough capacity to model the underlying distribution. In Section 4.2 and Section 4.3 we discuss specific cases of leveraging GFlowNets for problems of molecule generation and causal modeling respectively.

Implications for Bayesian ML
Let us consider the special case where y = θ is a latent parameter, i.e., in a Bayesian setting, and x = D is the available data. Then we can define the reward function from the parameter prior P (θ) (how plausible are these parameters a priori) and the data likelihood P (D | θ) (how well does this choice of parameters fit the data). Training a GFlowNet provides us with an approximate sampler for the posterior over parameters (the policy π(θ | D)) given data as well as an estimator of the normalizing constant of the Bayesian posterior, through the learned initial flow S(D). Hence, we have used amortization to turn a tractable function (prior times likelihood, as a function of θ) into estimators of these generally intractable quantities. We get a fast sampler for the posterior with no need for a Markov chain going through a large number of candidate samples. With that sampler, we can generate many independent samples θ | D (possibly in parallel) that are likely to visit the larger modes of the posterior (where the reward is larger).
To make computations more ML-friendly (especially for large datasets) while training the GFlowNet, we can note that the GFlowNet squared loss objectives naturally lend themselves to the case where the reward or log-reward is stochastic and is an unbiased estimator of the true reward. For example, we can typically decompose the overall dataset log-likelihood log P (D | θ) into a sum of per-example or per-minibatch terms, and we can introduce a multiplicative correction to account for the prior P (θ): where the expectation over Z is just a sum over the Z's in D. This makes it possible to train the GFlowNet posterior estimator using stochastic gradient descent on single examples or minibatches, which is the state-of-the-art to train deep nets.

Why GFlowNets?
Let us look at how GFlowNets differ from other related conceptual frameworks: • Markov Chain Monte-Carlo: GFlowNets do not construct Markov chains with semantics like those in MCMC methods. GFlowNets and MCMC approaches differ fundamentally: with MCMC approaches, the chain is irreducible (every state is reachable from every other state), and the stationary distribution of this chain is of interest. In order to generate samples from this stationary distribution, we need to run long chains (ideally infinitely long ones). In GFlowNets, however, the chains only need every state to be accessible from the initial state, and what we have is a bounded sequence of stochastic transitions. The expensive stochastic "search" (to reach low energy configurations) normally performed by MCMC is replaced by the training phase of the GFlowNet, using the principle of amortized inference, so that during inference a sample can be generated in a single short trajectory by the policy. To exemplify, if we were to run an MCMC to generate samples of desirable molecules, we would probably have a molecule at every state, and we would have actions that can stochastically transform one molecule into a nearby one [100], whereas a typical way to sample a desirable molecule with GFlowNets is constructive, where we start from an empty object (which is not really a molecule) and we sequentially and stochastically add molecular fragments: the state corresponds to these intermediate objects, which may or may not correspond to a well-formed molecule [11]. That being said, the overall objective is the same: turn a given energy function (or unnormalized distribution) into a generative procedure for obtaining samples from that distribution. Both MCMC and GFlowNets can also be used to marginalize, i.e., compute intractable sums, although again in different ways. MCMC turn the exact sum into a Monte-Carlo approximation of it while GFlowNets perform amortized inference, i.e., they train a neural net whose output will, after training, approximate the sum. This becomes useful when we have more than one marginalization to do, say given a context, and thus the neural net can take that context as input and rapidly produce an estimator of the intractable sum as output.
• Reinforcement Learning: GFlowNets learn policies to sample trajectories that land in a terminal state with probability proportional to the reward of the terminal state rather than trajectories that maximize the expected reward, as in standard deep reinforcement learning. As shown by Bengio et al. [11], Jain et al. [101], this results in a diversity of samples which is important when the reward function is an imperfect proxy for the property that we actually care about: it avoids putting all our eggs in the wrong basket.
• Deep generative Models: Traditional generative models in deep learning such as variational autoencoders or VAEs [102,103] or GANs [104] require positive samples to model the distribution of interest, whereas GFlowNets are trained from a reward function.
• Variational Inference: variational inference trains an approximate sampler (and the corresponding density) so as to reduce the forward KL-divergence (the evidence lower bound or ELBO) with a given distribution function (playing the same role as the reward). This requires on-policy training (sampling from the learned sampler), which has a tendency for focusing on a single mode rather than find a diversity of modes. Instead (see Malkin et al. [25] for details), GFlowNet objectives enable off-policy training without requiring the high-variance importance sampling correction necessary with the ELBO.
To summarize, GFlowNets shine in problems with the following properties: • It is possible to define or learn a non-negative reward function which will specify from what distribution the GFlowNet should sample.
• The reward function of interest is highly multimodal. This emphasizes the advantage of GFlowNets in terms of diversity of samples. If the reward function was unimodal, existing RL or variational inference methods (which tend to focus on a single mode) could be used instead.
• It is advantageous to sample sequentially, e.g., there is compositional structure that can be exploited by sequential generation.
Current Limitations. Until recently, GFlowNets have been limited to sampling from distributions over discrete objects (e.g. graphs). Recent work by Lahlou et al. [105] presents a theoretical framework for extending GFlowNets to sample from distributions over continuous spaces. Leveraging this framework for sampling from distributions over high-dimensional continuous or mixed (discrete and continuous) spaces remains an open problem. Another potential limitation, shared with other reinforcement learning methods, is that effective credit assignment over very long trajectories (for generating large objects, such as proteins) is more difficult. Pan et al. [106] take initial steps to tackle this problem, proposing a way to assign partial rewards earlier in the generated trajectory which results in more effective credit assignment. Another open question is that of the best policy for sampling training trajectories for a GFlowNet. Existing theoretical results from Bengio et al. [12] assume that the policy sampling trajectories should have full support over the space of trajectories, but designing this policy (beyond the heuristics discussed in Section 4.1.5) for sample-efficient learning is an open problem.

Diverse Candidate Generation
A fundamental problem in chemistry is the synthesis of novel chemical structures (e.g. molecules) that satisfy some criteria. As alluded to in Section 4.1, generation of molecules to optimize for a particular chemical property is an appealing use case for GFlowNets, because GFlowNets will tend to generate a diverse set of molecules optimizing that property. An important problem in the context of drug discovery, which we introduced in Example 1.1 is to discover molecules that bind to a particular target, potentially inhibiting the target in the process. From the computational design perspective, molecular docking simulations can give scoring functions to approximately evaluate proposed molecules. More recently, graph neural networks which approximate the binding energy [107] are used to approximate docking as they are much faster. As discussed in Section 2 as these scoring functions serve as approximations to the underlying process, it is important to generate diverse candidates for downstream applications, to avoid putting all our eggs in the same basket. [11] leverage GFlowNets for the problem of diverse molecule generation 4 . Soluble epoxide hydrolase (sEH) in the 4JNC configuration is studied as a target in the paper. It is a useful target as it plays a role in certain respiratory and heart diseases [108,109]. Autodock Vina [110] was used for docking the generated molecules to evaluate the binding energy. Docking each molecule with Autodock Vina can be quite slow and takes several minutes to run, making it prohibitively expensive to train a policy directly directly using it as a reward. Instead, the authors rely on a graph neural network, trained using a data set of docking scores for 300, 000 molecules, as the reward for training the policy. The molecules are generated using fragments, as illustrated in Figure 2. At each step, the policy picks a fragment from a library to add to the partially constructed molecule, and choose where to place that fragment. The library of fragments is derived from the Zinc database [111].
The molecule design problem possesses all the key properties discussed in Section 4.1.7 for GFlowNets to be effective -(a) there is compositional structure in generation as molecules are built using subgraphs with unique chemical properties (b) the reward function is an approximation of what we really care about, as the docking score and its approximation by a neural network (which has epistemic uncertainty associated with it due to finite training) are approximations of the underlying phenomenon of a molecule binding to a target and inhibiting it, and (c) the reward is multi-modal since there can be multiple motifs of molecules that bind well to a given target. [11] showed that GFlowNets result in substantial improvements over existing methods on this molecule generation task. In particular, as shown in Figure 4a, GFlowNets discover significantly more modes of the reward function (i.e. many different molecules that have high predicted docking score) relative to other reinforcement learning (PPO) and MCMC (MARS) approaches. Sampling proportional to the reward results in high reward and diverse samples. Though it is important to note that while using GFlowNets results in significant improvements in the diversity of generated samples, they do not always lead to the highest scoring candidates, because there is a natural trade-off between diversity and reward. [112] introduce metrics to study the ability of GFlowNets to explore novel regions in molecular space.
Further, [11] also consider an active learning setup, starting with a data set of 2000 molecules. In each round, a surrogate model is trained on the data set. This surrogate model is used as the reward for the GFlowNet. Next, molecules are generated with the GFlowNet policy, evaluated with docking, and added to the data set for the next round. Using GFlowNet to acquire the batches of molecules results in significant improvements in the reward over the initial data set, shown in Figure 4b, demonstrating the potential of GFlowNets to accelerate large-scale virtual screenings.
From a practical perspective, we are often interested in multiple objectives rather than a single one. For instance, in the context of drug discovery, an ideal drug candidate should specifically inhibit the target but also be synthesizable in large quantities, soluble, and and harmless to humans; alternatively, in material science, to have an efficient solar cell means the optimization of current, voltage, and fill factor. Typically, there are very few candidates which simultaneously satisfy all the objectives, which might even be conflicting with each other (e.g. in solar cells, photocurrent can increase with a photoactive material with lower bandgap, but the voltage decreases). Instead, there exists a set of candidates with the optimal trade-offs between the objectives where further optimizing an objective is impossible without compromising another, i.e. the Pareto front. Moreover, as with the single objective case, diversity is still critical in the multi-objective case. Multi-Objective GFlowNets [113,MOGFNs;] extend GFlowNets to tackle multi-objective optimization problems. Building upon scalarization approaches in multi-objective optimization [114], MOGFNs decompose the multi-objective optimization problem into a family of sub-problems which can solved simultaneously. This family of sub-problems is modeled simultaneously with a reward conditional GFlowNet [12]. MOGFNs demonstrate state-of-the-art performance on a variety of small molecule generation and protein design tasks. MOGFNs consistently generate diverse pareto-optimal candidates. For instance, MOGFNs are able to generate molecules that bind to the sEH target, while achieving a high synthesizability and QED score. The active learning setting is further explored by [101] 5 . Incorporating ideas from Bayesian Optimization discussed in Section 3.4, GFlowNet-AL [101] incorporates information about the epistemic uncertainty of the surrogate model in the reward for the GFlowNet with an acquisition function. This epistemic uncertainty helps in guiding the GFlowNet to optimize the promising less explored regions in the state space. As such, instead of maximizing the acquisition function in Algorithm 1, [101] propose sampling proportional to the acquisition function. Equipped with information about the epistemic uncertainty in the reward and other improvements, GFlowNet-AL outperforms various existing methods on a variety of biological sequence design tasks, including generation of peptide sequences with antimicrobial properties. The state space S consists of partially constructed sequences with each action being the addition of a token from a vocabulary (e.g. a residue from a set of amino acids) to the end of the current partial sequence. Candidate sequences generated by GFlowNets are significantly more diverse and have high rewards. The diversity of generated candidates demonstrates the potential of GFlowNets to accelerate the process of discovering novel antibiotics to tackle the growing and highly concerning phenomenon of antimicrobial resistance [115].
These initial empirical successes demonstrate the potential of GFlowNets to make a significant impact in improving experimental design for a wide variety of scientific problems.

Bayesian Causal Discovery with GFlowNets
As we have seen in Section 4.1.6, GFlowNets offer a general solution to approximate Bayesian posteriors, like the one in Eq. 10. GFlowNets are all the more adapted to the problem of Bayesian causal discovery that causal structures G, represented as a DAG, are compositional objects. Deleu et al. [116] 6 used this observation to introduce a GFlowNet whose states are DAGs, and where some graph G is created sequentially by adding one edge at a time, starting from the completely disconnected graph over d nodes; the structure of this GFlowNet is shown in Figure 5 (left). Similar to Eq. 24, for a fixed dataset D, the reward function of the GFlowNet is defined as R(G) = P (D | G)P (G). By constraining the set of valid actions at every state, the edges are added in such a way that they will never introduce a cycle, which guarantees that graphs remain acyclic at every stage of the construction. Therefore, all the states of the GFlowNet are valid causal structures. Deleu et al. [116] leveraged this property and showed that such a GFlowNet may be trained using a modification of the detailed balance loss [12, see also Eq. 20], specifically adapted to the case where all the states are terminating. To evaluate the reward function R(G) though, one needs to evaluate the marginal likelihood P (D | G) in Eq. 11, which is in general intractable [117]. Deleu et al. [116] experimented only with models such as multinomial-Dirichlet (for discrete data) and linear-Gaussian (for continuous data), for which the marginal likelihood may be computed efficiently in closed form. Alternatively though, instead of approximating the (marginal) Bayesian posterior P (G | D) over structures only, we could approximate the posterior P (θ, G | D) over both the causal structures G and the causal mechanisms θ [118], to avoid the intractable integration in Eq. 11.
Beyond DAGs. Work on causal discovery focuses primarily on the causal graphical model framework introduced in Section 3.5 that assumes a DAG structure, but acyclicity is indeed an assumption that may not hold in certain domains. For instance, in Gene Regulatory Networks, there are feedback loops between multiple genes interacting with one another [119]. Nevertheless, when we consider the temporally unfolded cyclic graph, i.e., the dynamics, we are back to a DAG. Some recent work has studied the problem of learning the structure of non-acyclic causal models [120,121,122]. The GFlowNet approach discussed above naturally extends to cases where the causal graph might be cyclic. The masking mechanism introduced by Deleu et al. [116] prevents cycles from being introduced at every step where an edge is being added, ensuring the generated graphs are DAGs. If we remove this additional constraint and allow the GFlowNet to introduce cycles in the generated graph, then it would approximate the posterior distribution over cyclic causal models which fit the given observations.

Bayesian Posteriors for Scientific Discovery
Posterior Predictive Once the structure of the causal graphical model is known, we can use the model to perform inference, i.e. answering (possibly causal) questions about the system of interest, in the context of different interventions (which we can interpret as setting some variables, i.e., designing an experiment). If we have information about the epistemic uncertainty, through the Bayesian posterior P (G | D), we can even go one step further and average out the predictions made with all possible causal models, weighted using the posterior distribution. Concretely, given a new observation y in the context of an intervention x this corresponds to evaluating This is called the posterior predictive, or Bayesian model averaging [123,124]. The advantage of Eq. 25 over making predictions using a single causal model is that multiple concurrent theories may now participate in those predictions. In this way, we avoid using only theory, which may be incorrect, and we obtain a more conservative answer, thus avoiding catastrophic outcomes due to a single theory (say a particular causal graph G) being confidently wrong.
Amortized Posterior Predictive Instead of performing a Monte-Carlo average to estimate P (y | x, D) for each candidate x as per Eq. 25 (which may be fairly expensive if we want to train a policy that is trained by considering a large number of possible x's), we can use a neural network g φ (x, y) to amortize that calculation. This can be done by training g over (x, y, G) triplets with squared loss Other amortization approaches are possible. For example, using the GFlowNet framework, a policy Q(y i | x i , y i−1 1 , x i−1 1 ) can be trained to first sample one outcome y i at a time, given the input experiment specification x i and the previous experimental results y i−1 1 of the previous experiments x i−1 1 . The GFlowNet constraint to satisfy is that where the trained posterior predictive takes explicitly a partial dataset (x i−1 1 , y i−1 1 ) as input, similarly to neural processes [125] and Q(G | D) is the GFlowNet causal graph sampler as described above, except that we allow interventions (different choices of x) in the data.
Interpretability Bayesian posteriors over causal models also provide a natural tool for interpretability, since they encode the belief that a causal structure fits the observed data. By inspecting which causal structures contain a certain edge, we can obtain a belief that certain causal relationships between two random variables exist [126]. This is called the edge marginal distribution: Fig. 5 (right) shows a comparison of the edge marginals computed with the posterior approximation returned by DAG-GFlowNet [116] against the exact edge marginals, highlighting the capacity of GFlowNets to accurately approximate the posterior over graphs P (G | D). Note that this kind of comparison is typically limited to small problems where the true posterior P (G | D) may be computed efficiently in closed-form, and in general one may be concerned with the calibration of these estimated marginals [90].

Towards a Unified Framework for Scientific Discovery with GFlowNets
In this paper, we have introduced GFlowNets as a tool for modeling and for experimental design in the context of the scientific discovery loop (Fig. 1). We have summarized how they have been used and could be further used on both fronts. In this concluding section, we outline research directions based on this early work and aimed at providing scientists with a powerful ML-based framework applicable when it is possible to iteratively generate informative experimental data.

Exploiting Amortized Causal Bayesian Modeling for Defining the Utility of an Experiment
An appealing theoretical framework for defining the objective of an experiment is that of information gain introduced in Section 3.2: "how much information about a random variable of interest can we expect to gain through the experiment?" A good policy for experimental design should propose experiments with a high value of this information gain as a reward. Note that this framework is broadly applicable to a wide range of interactive learning domains, such as reinforcement learning and active learning, encapsulating the fundamental problem of exploration. In general, the decision to perform an experiment may not simply be based on the number of bits of information gained but also on the risks and costs involved [127]. We can however incorporate the notion of a cost or budget by considering the information gain per unit of cost incurred as the reward. Information gain can be measured in principle by the mutual information between the outcome of an experiment (a random variable since the experiment has not taken place yet) and the variable of interest (about which we seek to gain information), given the experimental specification and any other knowledge (including data) we may already have. In the simplest and purely unsupervised knowledge-seeking scenario, the variable of interest may be the causal model explaining the outcomes of experiments. In a more targeted scenario, for example in drug discovery, it would be the set of molecules that have certain desirable characteristics (e.g., affinity with a target protein is above a threshold and toxicity is below a threshold and synthesis cost is below a threshold). Let Y be the experimental outcome, x be the experiment specification, and V the variable of interest about which we seek to gain information. The information-theoretic utility for our experimental design would then be defined as where D is our dataset of prior experimental results {(x, y)} and any other constraint we want to exploit to condition the probabilities. With V typically being a much higher-dimensional object than Y , Eq. 32 tends to be more practical numerically. To evaluate the expression in Eq. 32 we can leverage the ideas of amortization and GFlowNets to estimate (a) the numerator and denominator probabilities P (y | v, x, D), P (y | x, D) (b) a sampler for the joint P (y, v | x, D), and (c) an estimatorÎ of the MI itself, as a function of x. In the case where where the variable of interest are the parameters of some underlying process v = θ (as introduced in Section 3.2), we can follow the amortization approach outlined in Section 4.3.2 to estimate the posterior predictive P (y | x, D) in the denominator. On the other hand, the likelihood in the numerator P (y | θ, x, D) = P (y | θ, x) is available in the case of explicit models, and can be approximated in the case of implicit models as discussed in Section 3.3. Next to learn a sampler for the joint P (y, θ | x, D) we first approximate samples from the posterior over parameters Q(θ | D) following Section 4.1.6. By combining the samples from the posterior Q(θ | D) and the likelihood P (y | x, θ) we can learn a sampler Q(y, θ | x, D) = Q(θ | D)P (y | x, θ) to approximate the joint P (y, θ | x, D).
As for the estimator of MI itself, one possibility is to train a neural networkÎ(x) to amortize the expected value over (θ, y) given (x, D) using the samples from the joint Q(y, θ | x, D) and the estimators for the probabilities in the log-prob ratio from with a squared loss where x is sampled from a dataset or a generative model of inputs, θ ∼ Q(θ | D) and y ∼ P (y | x, θ). With enough capacity and training time, Q converges to P andÎ(x) converges to I(Y ; θ | x, D). What is particularly interesting if x is in a high-dimensional space is that it generally won't be necessary to see more than one value of y and θ for each value of x in order to trainÎ, as usual in supervised learning. This can work if it is possible for the learning procedure forÎ to generalize from the (θ, x, y) triplets used to train it.

Modeling and Causality
One open challenge on the modeling side is to leverage GFlowNets to model Bayesian posteriors beyond causal models. [128] take an initial step in this direction, using GFlowNets to model the posterior over dropout masks in a neural network. Moreover, in the context of causal models, many challenges remain to extend the work done by [116]. This includes (a) accommodating larger causal graphs efficiently (b) making it possible to handle unobserved causal variables by also learning how the raw inputs (e.g., images) may be related to the causal variables (c) learning how experimental choices relate to interventions on the causal variable when this is not known perfectly a priori.

Experimental design
Our discussion has focused primarily on the case where we are interested in information gain about the parameter θ to drive knowledge acquisition in the experimental design loop. As we discussed in the previous section, it is possible to incorporate any random variable V (Eq. 31) that we can learn to model and sample. An interesting case is one where the random variable V is an extremum, e.g., the top molecular candidates, for some task, as in Eq. 9. Furthermore, in many practical experimental settings, we have access to measurements of varying fidelities as the outcome of our experiments (e.g., computer simulations with varying accuracy-computational cost trade-offs). Such a multi-fidelity setting is discussed above (Sec. 3.4) but needs to be incorporated within the GFlowNet framework. Another important practical scenario also introduced in the same section is the existence of multiple objectives, with early work to incorporate that in the GFlowNet framework by Jain et al. [113]. Finally, as introduced in Section 3.5 these experimental design tools could be integrated within the causal discovery framework, by focusing the knowledge acquisition on the causal graph itself, an object of great value to scientists from an interpretability point of view. This could be achieved by using the graph itself as the target variable V (or a part of it), to drive the experimental design to accelerate the discovery of the causal structure.