Stephen Gow *a, Mahesan Niranjan b, Samantha Kanza a and Jeremy G. Frey a
aDepartment of Chemistry, University of Southampton, University Road, Southampton, Hants SO17 1BJ, UK. E-mail: s.r.gow@soton.ac.uk
bDepartment of Electronics and Computer Science, University of Southampton, University Road, Southampton, Hants SO17 1BJ, UK
First published on 30th August 2022
The growth of machine learning as a tool for research in computational chemistry is well documented. For many years, this growth was heavily driven by the paradigms of supervised and unsupervised learning. Recently, however, there has been increased interest in the use of a third paradigm: reinforcement learning. This approach, in which an agent interacts with an environment to learn which actions it should take to maximise a long-term objective, is particularly suited to problems of planning or sequential decision making. In this review, we present an accessible summary of the theory behind reinforcement learning (and its common extension, deep reinforcement learning) tailored specifically to chemistry researchers. We also review the applications of reinforcement learning which already exist within the world of chemistry, and consider the future direction of research based on this promising technique.
Supervised and unsupervised learning paradigms essentially learn static or dynamic mappings in a space of features. Unsupervised learning aims to characterise the probability distributions of these features, or discover subspaces in which the derived data may live, while supervised learning uses labelled data to infer relationships between features. Such data is usually seen as arising from an underlying probabilistic generating mechanism, sampled in an independent and identically distributed (IID) manner.2
A machine learning paradigm that is distinct from the above two is that of reinforcement learning (RL),3 which addresses planning or sequential decision making problems. Reinforcement learning assumes a setting in which an agent interacts with an environment to acquire data and learn about the environment, and executes actions that will maximise a long term objective. The environment in this setting makes transitions between states whenever the agent executes an action. The resulting state the environment transitions into is either fully or partially observable by the agent, which is also given a short term reward resulting from the chosen action. A diagram of this process is shown in Fig. 1.
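The interaction loop described above can be sketched in a few lines. This is a minimal illustration only: the `ToyEnvironment` class (a five-state chain with a reward at the far end), the `run_episode` helper and both policies are hypothetical names introduced here, not part of any library or of the work reviewed.

```python
import random


class ToyEnvironment:
    """Hypothetical 1-D chain: states 0..4, reward 1.0 on reaching state 4."""

    def __init__(self, n_states=5):
        self.n_states = n_states
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action is -1 (left) or +1 (right); the state is clipped to the chain
        self.state = max(0, min(self.n_states - 1, self.state + action))
        done = self.state == self.n_states - 1
        reward = 1.0 if done else 0.0
        return self.state, reward, done


def run_episode(env, policy, max_steps=100):
    """Agent-environment loop: observe state, act, receive reward and next state."""
    state = env.reset()
    total_reward = 0.0
    trajectory = []
    for _ in range(max_steps):
        action = policy(state)
        next_state, reward, done = env.step(action)
        trajectory.append((state, action, reward, next_state))
        total_reward += reward
        state = next_state
        if done:
            break
    return total_reward, trajectory


def random_policy(state):
    return random.choice([-1, +1])


def always_right(state):
    return +1
```

Here the environment is fully observable, so the state returned by `step` is exactly the state of the chain; a partially observable variant would return only some function of it.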
The problem then is to combine the information contained in the observed state transitions and the immediate rewards received so as to act in a way that meets a long term objective. The challenge, of course, is that greedily accumulating the immediate rewards need not be optimal, because such decisions could drive the agent along trajectories that subsequently offer low rewards. The rule by which an agent selects an action in any given state is referred to as a policy, and discovering an optimal policy while characterising the uncertain behaviour of the environment is the learning challenge in this paradigm. The formulation is closely related to the subject of stochastic optimal control.4
Recent developments in reinforcement learning have been heavily influenced by the use of powerful non-linear function approximation methods derived from neural network architectures, and the algorithmic developments around them.5 Such function approximations are used to model parametric forms of optimal policies or to approximate what we could loosely refer to as the usefulness of reaching any state along the trajectory towards the optimal one in the long term. It is this combination of learning paradigm and neural architectures that has led to several advances in artificial intelligence, such as defeating the world's best human players of the complex game of Go.6
Machine learning is now relatively widespread in chemistry research. Both supervised and unsupervised learning, driven by the rise of artificial neural networks, have seen use in virtually every important discipline within the field.7 Until recently, reinforcement learning was less widely utilised. In the last five years this has begun to change, and examples of RL can now be found in a wide variety of areas from drug discovery to reaction control. This review aims to act as a guide to the use of reinforcement learning within chemistry, setting out the theory required to use it, discussing the practical issues which may be encountered, and reviewing the applications that already exist.
The remainder of this paper is organised as follows. In Section 2, we introduce the theoretical principles of reinforcement learning. In Section 3, we present a brief review of deep neural network architectures. In Section 4, we review the applications of RL to real problems within chemistry that have been conducted to date. Finally, we consider the future of RL as a tool in chemistry research.
A reinforcement learning agent is designed to select a particular action from a set of possible actions, given the current state of a system within a set of possible states. Both the state space and the action space may be continuous or discrete; the action space may differ between states. Actions are selected using a policy which gives the probability of selecting an action given the current state. Each action that may be selected both changes the state of the system at the next time step and leads to the agent receiving a different reward, which is defined by a reward function. The agent wishes to maximise the total reward received during the length of the simulation. In most applications, the simulation length is finite and reaching a specified number of time steps or a specified subset of the state space will cause the simulation to end, but infinite-step simulations are also possible. A key challenge in reinforcement learning is the exploit-explore problem of finding the correct balance between taking actions which are known to give high expected rewards and exploring actions which have not yet been investigated.
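One simple and widely used heuristic for the exploit-explore problem is epsilon-greedy action selection. The sketch below applies it to a multi-armed bandit (a single-state problem); the function names, the Gaussian reward noise and the `noise_sd` parameter are assumptions made for illustration.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon explore at random, otherwise exploit the best estimate."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    # ties are broken in favour of the lowest-numbered action
    return max(range(len(q_values)), key=lambda a: q_values[a])


def run_bandit(true_means, epsilon=0.1, n_steps=5000, noise_sd=1.0, seed=0):
    """Estimate the mean reward of each action from noisy samples."""
    rng = random.Random(seed)
    estimates = [0.0] * len(true_means)
    counts = [0] * len(true_means)
    for _ in range(n_steps):
        action = epsilon_greedy(estimates, epsilon, rng)
        reward = rng.gauss(true_means[action], noise_sd)
        counts[action] += 1
        # incremental update of the sample mean for the chosen action
        estimates[action] += (reward - estimates[action]) / counts[action]
    return estimates, counts
```

With `epsilon=0.0` and noiseless rewards, an agent facing `true_means=[0.0, 0.5, 2.0]` never leaves the first action, because nothing ever prompts it to try the others; any positive `epsilon` guarantees that every action continues to be sampled occasionally.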
Reinforcement learning is often framed in the context of the Markov decision process,8 in which the current state is all that is needed to determine an action: knowing the previous history of the states and actions taken to reach the current state provides no additional information with which to select the next action. This allows the agent to learn efficiently, as the reward of taking the same action while in the same state is effectively constant. However, this is also a strong assumption, as there are real processes in which the most rewarding action may depend on the process history as well as the current state. These are referred to as hidden state problems, since the behaviour of the system would be Markovian if the state observed by the agent captured all of the information which defined the reward.9 They can be handled via the more general partially observable Markov decision process, allowing the agent to interact with an incomplete representation of the state space; a discussion of how different methods compare in this respect was provided by Jaakkola et al.10
The design of the reward function plays an important role in RL, as this determines what the agent is aiming to maximise through its policy learning. Reward functions can be discrete or continuous, and may include rewards at every time step of the simulation or at only the final time step. They may also include a discount value to decrease the importance of expected rewards as the number of time steps beyond the current one increases. The use of additional rewards not motivated directly by the underlying problem to help the agent to learn is called reward shaping, and has been the subject of significant research.11 Designing a reward function to ensure the agent performs well for the specified task is challenging and somewhat subjective, and most RL work in chemistry has approached this using subject-specific expert knowledge. In inverse reinforcement learning, the reward function is treated as unknown and a good reward function is learned by the agent based on a set of examples with properties which the agent wishes to replicate. Further details of this approach can be found in Hadfield-Menell et al.12
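The effect of a discount value can be made concrete with a short calculation. In this sketch, `gamma` denotes the discount factor and `rewards` is the list of rewards received from the current time step onwards; both names are chosen here for illustration.

```python
def discounted_return(rewards, gamma=0.99):
    """Sum of rewards, each weighted by gamma to the power of its delay."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def returns_to_go(rewards, gamma=0.99):
    """Discounted return from every time step of an episode."""
    out, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        out.append(g)
    out.reverse()
    return out
```

For example, with `gamma=0.5` a reward of 1 at every one of three steps gives a return of 1.75, while a single reward of 1 at the final step gives 0.25: the later a reward arrives, the less it contributes.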
Another important distinction for RL is whether an algorithm is model-based, in which a model for the dynamics of the environment is used during learning and action selection, or model-free, in which case a policy is learned and actions selected without the use of such a model. An algorithm may also be on-policy, if the states and actions used in training the agent are generated by the current policy, or off-policy if they are generated in some other way.
The scope of problems which can be addressed with traditional RL is limited by practical concerns around the size of the state space and action space. For example, in value learning, the expected value of every state–action pair must be recorded, which quickly becomes infeasible as the number of possible states increases. Recent advances in the field have been driven by deep reinforcement learning, in which unknown functions of the state and action space such as the policy or value function are approximated via a deep neural network. Since these functions can be approximated based on a small sample of observations, the dimensionality of the state space is no longer a limitation and more complex problems can be tackled.5 This approach does introduce additional complexities arising from the additional variance imposed by the approximation, and in the choice of neural network used; the types of neural network which may be considered for the purpose are discussed in Section 3.
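To illustrate why tabular value learning scales poorly, the following sketch runs Q-learning on a hypothetical five-state chain. The table `q` holds one entry per state-action pair, so its size grows with the product of the two spaces. The environment, the purely random behaviour policy and all parameter values are assumptions made for this example.

```python
import random

ACTIONS = (-1, +1)          # move left or right along the chain
N_STATES = 5
GOAL = N_STATES - 1


def step(state, action):
    """Deterministic chain dynamics with a reward of 1.0 on reaching the goal."""
    next_state = max(0, min(GOAL, state + action))
    done = next_state == GOAL
    return next_state, (1.0 if done else 0.0), done


def q_learning(n_episodes=300, alpha=0.5, gamma=0.9, seed=0):
    rng = random.Random(seed)
    # one table entry for every state-action pair
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(n_episodes):
        state, done = 0, False
        while not done:
            action = rng.choice(ACTIONS)  # random behaviour policy (off-policy)
            next_state, reward, done = step(state, action)
            best_next = 0.0 if done else max(q[(next_state, a)] for a in ACTIONS)
            target = reward + gamma * best_next
            q[(state, action)] += alpha * (target - q[(state, action)])
            state = next_state
    return q


def greedy_policy(q):
    """Action with the highest learned value in each non-terminal state."""
    return {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(GOAL)}
```

Ten table entries suffice here, but a state space of partially built molecules has no practical enumeration, which is the motivation for replacing the table with a neural network approximation.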
Most policy learning algorithms are based on two key concepts: policy gradient and actor-critic. Actor-critic methods consist of two modules: an “actor” model estimating the policy function, and a “critic” model estimating the value function. The parameters of the actor are updated based on the output of the critic given the current parameters to determine a policy which will lead to good expected rewards. Fig. 2 presents this form of RL agent diagrammatically. In contrast to Fig. 1, the agent is now split into the actor module choosing actions based on the current state and the value function estimate, and the critic module learning the value function based on the current state and reward.
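A single actor-critic update can be sketched as follows, assuming a tabular critic, a softmax actor and a one-step temporal difference error as the critic's signal. These are simplifying choices made here for illustration; practical implementations replace both tables with neural networks.

```python
import math


def softmax(prefs):
    """Convert action preferences into selection probabilities."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]


def actor_critic_update(theta, values, s, a, r, s_next, done,
                        gamma=0.9, alpha_actor=0.1, alpha_critic=0.5):
    """Update the critic (values) and the actor (theta) from one transition."""
    # Critic: one-step temporal difference error for the observed transition
    target = r + (0.0 if done else gamma * values[s_next])
    td_error = target - values[s]
    values[s] += alpha_critic * td_error

    # Actor: move preferences along the gradient of log pi(a|s), scaled by the TD error
    probs = softmax(theta[s])
    for b in range(len(theta[s])):
        grad_log_pi = (1.0 if b == a else 0.0) - probs[b]
        theta[s][b] += alpha_actor * td_error * grad_log_pi
    return td_error
```

A positive TD error means the action turned out better than the critic expected, so the actor raises the probability of selecting it again in that state; a negative error lowers it.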
Actor-critic methods are frequently characterised by the nature of the critic module, such as the pioneering temporal difference actor-critic (TD-AC) algorithm of Barto et al.,17 and the advantage actor-critic (A2C) algorithm of Peters and Schaal;18 the latter was developed further by Mnih et al.19 into asynchronous advantage actor-critic (A3C). The actor may optimise its parameters using any method, but in recent work it is common for a version of policy gradient learning to be used.
In policy gradient methods, the parameters of the policy are optimised by updating them in proportion to the partial derivative of the performance of the policy with respect to the parameters (the gradient function). This is equivalent to optimising a stochastic policy via a function approximating the true policy. The optimisation step may be conducted using any gradient ascent method. Direct computation of the gradient is difficult as it depends on both the action selection under the current policy, and the distribution of states under an optimal policy; however, the policy gradient theorem states that it is possible to obtain an unbiased estimate of the gradient given a sample of experiences from a reasonable function approximation.20 The earliest policy gradient algorithm, named REINFORCE or ‘vanilla’ policy gradient, was introduced by Williams.21 This is a form of on-policy Monte Carlo approximation using the likelihood ratio trick, and remains popular due to its ease of implementation. Other approaches to simple policy gradient learning include finite difference methods; a comparison of these early algorithms is given by Riedmiller et al.22 However, these simple methods have several disadvantages: they can be slow to converge, require many samples to learn a near-optimal policy, have very high variance, and will often converge to a local optimum instead of a global one.
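A REINFORCE-style update can be sketched for a tabular softmax policy as below. The gradient is accumulated over one complete episode generated by the current policy and then applied in a single step. The data layout (an episode as a list of `(state, action, reward)` tuples, and `theta` as a table of action preferences) is an assumption made for this example.

```python
import math


def softmax(prefs):
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]


def reinforce_update(theta, episode, gamma=1.0, alpha=0.1):
    """Monte Carlo policy gradient step from one on-policy episode."""
    # Discounted return from each time step to the end of the episode
    returns, g = [], 0.0
    for _, _, reward in reversed(episode):
        g = reward + gamma * g
        returns.append(g)
    returns.reverse()

    # Likelihood ratio estimate: sum over t of gamma^t * G_t * grad log pi(a_t|s_t)
    grad = [[0.0] * len(row) for row in theta]
    for t, ((state, action, _), g_t) in enumerate(zip(episode, returns)):
        probs = softmax(theta[state])
        for b in range(len(probs)):
            grad_log_pi = (1.0 if b == action else 0.0) - probs[b]
            grad[state][b] += (gamma ** t) * g_t * grad_log_pi

    # Gradient ascent on the policy parameters
    for s, row in enumerate(grad):
        for b, g_sb in enumerate(row):
            theta[s][b] += alpha * g_sb
    return grad
```

Because every action in an episode is credited with the whole of the subsequent return, estimates from individual episodes vary widely, which is one source of the high variance noted above.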
These weaknesses led to the development of more advanced methods incorporating actor-critic techniques. Schulman et al.23 developed Trust Region Policy Optimisation (TRPO), an algorithm based on a series of approximations to a theoretical result concerning policy improvement by minimising surrogate loss functions. A restriction on the divergence between the old and new policy is enforced to prevent overly large updates which may harm convergence, while still allowing large updates within a trust region where it is safe to do so. Proximal Policy Optimisation (PPO)24 follows a similar strategy but works with a clipped surrogate objective function, increasing the range of problems to which it can be applied and reducing the complexity of implementation while maintaining many of the benefits of TRPO.
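The clipped surrogate objective used by PPO is short enough to write out directly. In this sketch, `ratios` are the probability ratios between the new and old policy for each sampled action, `advantages` are the corresponding advantage estimates, and `clip_eps` is the clipping range; the function name and the default value of 0.2 are choices made here for illustration.

```python
def ppo_clipped_objective(ratios, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective (to be maximised)."""
    total = 0.0
    for ratio, adv in zip(ratios, advantages):
        clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        # Taking the minimum removes any incentive to move the ratio
        # outside [1 - clip_eps, 1 + clip_eps] in the favourable direction
        total += min(ratio * adv, clipped_ratio * adv)
    return total / len(ratios)
```

For a positive advantage the objective stops increasing once the ratio exceeds `1 + clip_eps`, and for a negative advantage it stops increasing once the ratio falls below `1 - clip_eps`, which plays a role similar to the trust region in TRPO without requiring a constrained optimisation.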
Another way to improve the efficiency of actor-critic policy gradient algorithms is to utilise off-policy updates to reduce the amount of interaction required with the environment. Deep Deterministic Policy Gradient (DDPG)25 is an early example of this which can be viewed as an extension of deep Q-learning to a continuous action space: learning is conducted via gradient descent on the Q function and gradient ascent on the policy function, resulting in a model-free algorithm which is easy to apply and does not require discretisation of the action space. Fujimoto et al.26 combined DDPG with ideas from double Q-learning by using two critic networks, delaying policy updates until value learning is complete, and adding a regularisation step with the combined effect of reducing variance and potential overestimation bias; the resulting algorithm is named Twin Delayed Deep Deterministic Policy Gradient (TD3). DDPG and TD3 are extremely efficient, but can suffer from sensitivity to hyperparameter selection. An off-policy method that aims to address this is the soft actor-critic algorithm,27 in which the actor network aims to simultaneously maximise expected reward and entropy.
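The two-critic idea in TD3 can be illustrated by the calculation of the value target for a single transition. The sketch assumes a scalar action and takes the target actor and the two target critics as plain callables; these signatures, the parameter names and the default values are hypothetical.

```python
import random


def td3_target(reward, next_state, done, target_actor, target_q1, target_q2,
               gamma=0.99, noise_sd=0.2, noise_clip=0.5,
               action_low=-1.0, action_high=1.0, rng=random):
    """Value target using the smaller of two critic estimates."""
    # Regularisation: perturb the target action with clipped Gaussian noise
    noise = max(-noise_clip, min(noise_clip, rng.gauss(0.0, noise_sd)))
    next_action = target_actor(next_state) + noise
    next_action = max(action_low, min(action_high, next_action))

    # Taking the minimum of the two critics counteracts overestimation bias
    q_min = min(target_q1(next_state, next_action),
                target_q2(next_state, next_action))
    return reward + (0.0 if done else gamma * q_min)
```

Both critics are then regressed towards this shared target, while the actor is updated less frequently than the critics.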
Finally, the popular decision-making algorithm Monte Carlo tree search (MCTS) can also be viewed as a form of reinforcement learning. MCTS was introduced by Coulom28 as a search algorithm for decision-making processes, drawing on a form of Monte Carlo simulation named rollout within a decision tree. More recent developments in MCTS have incorporated RL techniques, and several scholars have noted the links between the general MCTS formulation and reinforcement learning; a detailed study of the relationship between the two fields was undertaken by Vodopivec et al.29
The choice of which algorithm to use is a difficult one and somewhat subjective, although it will depend on the nature of the environment (for example if the state space is represented in a discrete or a continuous form). There have been several studies conducted to aid in this decision, including comparisons on the cart-pole problem by Nagendra et al.,30 a guide to statistical comparisons by Colas et al.,31 and a comparison across several benchmark problems by Jordan et al.32 It should be noted that the list of RL algorithms given here is not exhaustive, and that the development of algorithms is an active area of research – a recent example being the Invariant Decoupled Advantage Actor-critic method of Raileanu and Fergus,33 which aims to improve the ability of an agent to generalise to new environments.
A neural network must contain an input layer and an output layer. Hidden layers between the input and output layers are required for all but the simplest computations to be achieved. Deep neural networks contain multiple hidden layers; while there is no general definition for the lowest number of layers required for a network to be considered deep instead of shallow, the most common threshold is that of any network with two or more hidden layers. Schmidhuber34 provides a more detailed account of the history of deep neural networks.
The number of layers in a neural network, and the number of nodes in each layer, can have a significant effect on the behaviour of the network. A network with more or larger layers is more powerful and can fit well to a wider range of underlying data generation processes, at the expense of increased complexity and a risk of overfitting to the training data. This is however rarely considered in reinforcement learning literature, where it is common to simply present the network used without consideration of alternatives. One approach adopted by some authors is to treat neural network architecture as an optimisation problem and search the space of possible architectures for the best model; this is discussed by Benardos and Vosniakos35 and Luo et al.36 among others.
In principle, any directed acyclic graph may be used as the basis for a feedforward NN. The most common network structure is a fully connected network, in which every node in a layer has a connection to every node in the next layer. An example of a fully connected feedforward NN with two hidden layers is shown in Fig. 3. This type of network is often referred to as a multilayer perceptron (MLP), although this terminology is somewhat confusing as other authors restrict it to specific types of fully connected network. Feedforward neural networks are usually trained using iterative optimisation methods such as gradient descent or stochastic gradient descent, with the backpropagation algorithm used to compute the gradient with respect to a loss function.
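The forward pass of a fully connected feedforward network is a sequence of weighted sums followed by activation functions. The sketch below uses plain Python lists; the layer representation (a tuple of weights, biases and activation) and the ReLU activation are choices made here for illustration.

```python
def relu(z):
    return max(0.0, z)


def identity(z):
    return z


def dense(inputs, weights, biases, activation):
    """One fully connected layer: every output node sees every input node."""
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


def mlp_forward(x, layers):
    """Pass the input through each layer in turn."""
    for weights, biases, activation in layers:
        x = dense(x, weights, biases, activation)
    return x


# Two hidden layers of two nodes each, followed by a single output node
example_layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0], relu),
    ([[1.0, 0.0], [0.0, -1.0]], [0.0, 0.0], relu),
    ([[2.0, 3.0]], [0.5], identity),
]
```

Training would adjust the weights and biases by gradient descent on a loss function, with the gradients obtained by backpropagation; that step is omitted here.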
Feedforward neural networks have found widespread use in fields as diverse as facial recognition,37 dynamic systems control,38 geology and mining,39 and urban sustainability.40 In some application areas, however, the lack of ability for information to move backwards between layers can be a limitation.
The presence of cycles in RNNs affects the way in which the network can be trained, and methods based on backpropagation can be problematic due to repeated multiplication of gradients causing terms to vanish or explode. This was addressed by Hochreiter and Schmidhuber48 through the creation of Long Short-Term Memory (LSTM), a system in which information is stored in a cell and the flow of information to and from the cell is regulated by a series of gates. The Gated Recurrent Unit (GRU) is a more recent development with a similar but simpler structure, and was shown by Chung et al.49 to offer comparable performance on a selected set of problems.
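The vanishing and exploding behaviour can be seen in the simplest possible case: a scalar linear recurrence in which the hidden value is multiplied by a fixed weight at every time step. This toy setting is an assumption made for illustration, and the function name is hypothetical.

```python
def gradient_through_time(recurrent_weight, n_steps):
    """Gradient of the final hidden value with respect to the initial one
    for the scalar recurrence h_t = recurrent_weight * h_(t-1)."""
    grad = 1.0
    for _ in range(n_steps):
        grad *= recurrent_weight  # one factor per time step (chain rule)
    return grad
```

With a weight of 0.5 the gradient after ten steps is about 0.001, and with a weight of 2.0 it is 1024; gating mechanisms such as those in LSTM and GRU are designed to keep this product under control.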
LSTM and GRU architectures can struggle with certain tasks due to limitations of their structure. Difficulties in sequence prediction with these methods motivated Joulin and Mikolov50 to investigate augmentation of RNNs with a neural stack to increase the memory capability of the network. Similar work on extending RNNs with structures analogous to stacks and queues was conducted by Grefenstette et al.51 to overcome challenges in learning natural language transductions. Stack-augmented RNNs (stack-RNN) extend the network with cells consisting of multiplicative gates defining a memory stack, on which a PUSH (insert) or POP (remove) operation may be conducted to change the makeup of the vectors stored within the network's memory. This allows the network to learn longer-term dependencies between features of the training data.
The networks G and D may take several forms, but are typically FNNs or RNNs. GANs were extended to handle sequential data using reinforcement learning in the SeqGAN method of Yu et al.53 Application areas include text-to-image generation,54 astronomical image restoration55 and several problems in medical imaging.56 GANs are a developing area of machine learning, both in terms of structure and applications; a more detailed review of recent progress was provided by Alqahtani et al.57
The variational autoencoder (VAE) was introduced by Kingma and Welling62 as an application of neural networks to variational Bayesian methods for approximate inference. It consists of two networks, an encoder and a decoder, which map inputs to and from a latent variable space which is otherwise intractable. The encoder and decoder must be selected from the stable of other neural network architectures: early work by Bowman et al.63 made use of RNN encoders and decoders for language modelling, and similar techniques have since been applied to automatic chemical design by Gómez-Bombarelli et al.64 and Griffiths and Hernández-Lobato.65
Another significant encoder–decoder structure is the transformer architecture of Vaswani et al.66 The encoder and decoder are built from a series of layers comprising FNNs and sub-layers based on attention mechanisms, a form of mapping which enhances the importance of some parts of the input data while reducing the importance of other parts. In contrast to RNNs, transformers process an entire input at once, even if the input is sequential in nature. Transformers have found success in natural language processing tasks, and have largely replaced RNNs as the method of choice in the field.67 Recent papers by Chen et al.68 and Janner et al.69 link the transformer method to reinforcement learning by framing RL as a sequence modelling problem.
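The attention mapping at the core of the transformer can be sketched with plain lists. This is the scaled dot-product form, written for a single attention head with no learned projections; those simplifications and the function names are assumptions made for this example.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]


def attention_weights(query, keys):
    """Relative importance of each input position for one query."""
    d_k = len(keys[0])
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k)
              for key in keys]
    return softmax(scores)


def attention(queries, keys, values):
    """Each output is a weighted average of the value vectors."""
    outputs = []
    for query in queries:
        weights = attention_weights(query, keys)
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs
```

All queries are handled independently of one another, which is why an entire sequence can be processed at once instead of one element at a time as in an RNN.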
The area in which computational methods have shown the most promise is in the generation of candidate molecules with promising properties. In principle, a generative model can be used to propose molecules which are more likely to be successful for a given task. While this is not a new idea, the rise of neural networks in recent years has led to drug design becoming one of the most active areas of machine learning research in chemistry, and a wide variety of approaches have been taken. Authors such as Mandlik et al.,71 Schneider and Clark,72 Vamathevan et al.1 and Mouchlis et al.73 have provided detailed coverage of the topic. Most relevant to this review, however, is the large body of work considering molecular design as a sequential problem in which decisions are made to improve the properties of the molecule(s) being proposed. Reinforcement learning is an ideal tool to achieve this.
The use of reinforcement learning in drug discovery began with the work of Guimaraes et al.74 and Sánchez-Lengeling et al.75 The problem formulation is based on SMILES, a widely-used string representation of a molecule. The states of the RL agent are partially-completed SMILES strings, and the action space is the selection of the next character to be added to the string. Constructing a string in this fashion is a difficult problem, as SMILES syntax is notably fragile. The vast majority of combinations of characters are invalid, including those generated at intermediate time steps en route to a valid string. Among valid strings, very small changes may dramatically alter the chemical properties of the associated molecule, so it is difficult to construct a string which is likely to correspond to a desirable molecule. Non-RL methods to do this have been proposed with some success, notably that of Ikebata et al.,76 but even among these the proportion of valid, chemically desirable sequences is relatively low.
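The character-by-character formulation can be sketched as a small decision process. The vocabulary, the `<END>` token and the `roughly_well_formed` check below are hypothetical and greatly simplified: the check only tests that branches and ring-closure digits are paired, which is far weaker than real SMILES validity (in practice a cheminformatics toolkit would be used to parse the string).

```python
VOCAB = ["C", "N", "O", "(", ")", "=", "1", "<END>"]


def step(state, action):
    """State is the partial string; an action appends one token or ends the episode."""
    if action not in VOCAB:
        raise ValueError(f"unknown token: {action}")
    if action == "<END>":
        return state, True
    return state + action, False


def roughly_well_formed(smiles):
    """Crude syntax check: balanced parentheses and paired ring-closure digits."""
    depth = 0
    open_rings = set()
    for ch in smiles:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
        elif ch.isdigit():
            open_rings ^= {ch}  # first occurrence opens a ring, second closes it
    return bool(smiles) and depth == 0 and not open_rings


def terminal_reward(state, done):
    """Reward only at the final step: 1.0 for a well-formed string, else 0.0."""
    if not done:
        return 0.0
    return 1.0 if roughly_well_formed(state) else 0.0
```

Even under this weak check, intermediate states such as `C1CC` or `CC(=O` are not well formed, although they lie on the path to `C1CCCCC1` and `CC(=O)O`, which illustrates why rewards at intermediate steps are hard to define for this representation.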
The method introduced by Sánchez-Lengeling et al.,75 named Objective-Reinforced Generative Adversarial Networks for Inverse-design Chemistry (ORGANIC), is based on the SeqGAN approach of Yu et al.53 It utilises a GAN neural network structure, with an LSTM RNN as the generator network and a CNN as the discriminator. The reward function is a linear combination of the discriminator and a quality metric based on the properties of the sequence; Monte Carlo rollout is used at intermediate time steps to estimate expected future rewards, with policy gradient optimisation used for policy learning. This work was later extended by Putin et al.77 using a differentiable neural computer in place of LSTM, allowing longer and more complex sequences to be learned and generated. The resulting Reinforced Adversarial Neural Computer (RANC) architecture was shown to generate a higher proportion of unique molecules with desirable properties.
Also in 2017, Olivecrona et al.78 proposed a different RL method for drug discovery. With the same state and action spaces, this method trains an RNN with three layers of 1024 GRUs each on a set of 1.5 million SMILES strings corresponding to existing molecules, and uses RL to augment the likelihood of the RNN so that molecules with desirable properties will be constructed. The authors state that a policy-based learning approach is more appropriate than a value-based approach for this problem, and make use of the REINFORCE algorithm to learn an optimal policy. The reward functions used are based solely on the desirability of the sequences created, for example to promote DRD2 activity. REINVENT, a direct extension of this approach incorporating a memory unit in the scoring function so that a more diverse range of molecules is proposed, was developed later by the same research group and published in a paper by Blaschke et al.79
There are many further examples of RL agents constructing SMILES strings one character at a time. Popova et al.80 used the REINFORCE algorithm with a reward function based only on molecular properties, with no intermediate rewards, while making use of a generator-predictor structure: a memory-augmented single-layer stack-RNN was chosen to generate new SMILES strings directly without recourse to descriptor-based modelling, while an LSTM RNN was used for property prediction. This approach was shown to propose novel compounds to inhibit Janus kinase 2 (JAK2), a protein linked to several important cellular processes in the human body. Later, Yoshimori et al.81 maintained the same architecture as Olivecrona et al.,78 while replacing the reward function with the output of the LigandScout 3D pharmacophore model of Wolber and Langer82 to aid the discovery of molecules with the desired pharmacophores.
Neil et al.83 conducted a thorough exploration of several approaches to molecular design using RL. Focusing on multi-objective optimisation, the authors tested several different neural network architectures and policy optimisation methods against a set of 19 benchmarks for molecular design. Proximal policy optimisation with a single LSTM RNN was found to perform best of the RL approaches, outperforming both advantage actor-critic and vanilla policy gradient as well as GAN architectures (although as noted above, GANs have advanced significantly in the years since). The non-RL method Hillclimb-MLE, based on repeated maximum likelihood estimation, was also found to offer competitive performance levels.
Recent work by Pereira et al.84 used six-layer RNNs incorporating both LSTM and GRU structures within their layers to both generate SMILES strings and determine the reward to be given to the generated molecule. Given a target receptor to inhibit, the IC50 is defined as the concentration of a substance required to inhibit the activity of the receptor by 50%. The RL reward function is based on the negative predicted logIC50 of the compound generated with respect to the target, with a penalty term if the molecule is lacking in novelty. This has the effect of biasing the generator model towards both desired chemical properties and increased molecular diversity. Another recent paper, Born et al.,85 directly incorporated information on the target disease (in this case cancer) into the molecular generation process. Molecules are generated using two VAEs running in conjunction, one generating a gene expression profile and the second generating SMILES strings for molecules based on the generated gene expression profile, with a reward function based on the IC50 as estimated by the PaccMann drug sensitivity model.
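A reward of the general shape described for Pereira et al. can be sketched as follows. The exact functional form used by those authors is not given here, so everything below is an assumption: novelty is judged by an exact string match against a set of known molecules, the penalty is a fixed constant, and the predicted logIC50 is supplied by the caller instead of by a trained model.

```python
def ic50_reward(predicted_log_ic50, smiles, known_smiles, novelty_penalty=1.0):
    """Higher reward for lower predicted logIC50, reduced for non-novel molecules."""
    # A potent compound has a low IC50, so the sign is flipped
    reward = -predicted_log_ic50
    # Hypothetical novelty test: penalise molecules that have been seen before
    if smiles in known_smiles:
        reward -= novelty_penalty
    return reward
```

The sign change means that the agent is rewarded for proposing compounds predicted to inhibit the target at low concentration, while the penalty discourages it from repeatedly proposing the same molecules.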
Another distinct approach to SMILES string generation for drug design is that introduced by Krishnan et al.86 This approach draws on transfer learning, in which policies learned by an agent in one problem may be applied to another related problem; it has previously seen use in both molecular library generation87 and RL research.88,89 A stack-augmented RNN with gated recurrent units is trained to generate strings, and is combined with a set of molecules known to inhibit proteins similar to the intended target using transfer learning to create a target-specific generative model. A property prediction model is then used to guide the generator towards more desirable molecules, with a reward function based on the predicted docking score of the generated molecule. The method was tested on the problem of inhibiting the JAK2 protein; it was able to both reproduce existing JAK2 inhibitors without similar molecules being included in the training set, and to design potential new inhibitors.
The wide variety of reward functions used in the work described so far highlights the importance of reward function construction in drug design problems. To bypass this challenge, Agyemang et al.90 treated drug design as an inverse RL problem and inferred the reward function for the agent from the SMILES strings of known molecules with desirable properties. Molecule generation is handled by a multiple layer stack-augmented RNN, with PPO for policy learning. The authors demonstrated this method on several examples, including JAK2 inhibition.
As previously mentioned, the challenging syntax of SMILES strings makes their construction difficult. For this reason, Thiede et al.91 worked with SELFIES strings, in which every combination of characters is valid and the substrings generated during the construction process can be directly interpreted. Optimisation is handled via PPO, while the reward function is defined by a combination of an extrinsic reward (based on the predicted properties of the molecule) and an intrinsic reward named curiosity to encourage increased exploration of the state space.
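Combining an extrinsic and an intrinsic reward can be illustrated with a simple count-based bonus. This is a stand-in chosen for brevity and is not the curiosity measure of Thiede et al.: the bonus here decays with the number of times a state has been visited, and the class name and `weight` parameter are hypothetical.

```python
import math
from collections import Counter


class CountBasedCuriosity:
    """Adds an exploration bonus that shrinks as a state becomes familiar."""

    def __init__(self, weight=0.1):
        self.weight = weight
        self.visits = Counter()

    def total_reward(self, state, extrinsic_reward):
        self.visits[state] += 1
        # Intrinsic term: largest on the first visit, then decaying
        intrinsic = self.weight / math.sqrt(self.visits[state])
        return extrinsic_reward + intrinsic
```

States may be any hashable object, for example a SELFIES string; rarely visited strings receive a larger bonus, which pushes the agent towards unexplored regions of the state space.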
An alternative to string-based molecular representations is the two-dimensional graph, which offers increased robustness and interpretability; for example, a partially-constructed graph can be interpreted as a molecular substructure. Fig. 7 shows a highly simplified example of how a molecular graph may be constructed using reinforcement learning. The action space consists of two distinct action types: addition of a new atom in the molecular graph and addition of a bond between existing atoms. A stopping criterion to determine when a molecule is considered “complete” is also required. The methods used in practical applications are significantly more complex than this, but are usually based on similar principles.
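The simplified action space described above can be sketched as a small class. The valence table, the class and method names, and the rule that an action which would exceed an atom's valence is rejected are all assumptions made for this illustration; hydrogens are left implicit.

```python
MAX_VALENCE = {"C": 4, "N": 3, "O": 2}


class MolecularGraph:
    """Partially built molecule: atoms are nodes, bonds are weighted edges."""

    def __init__(self):
        self.atoms = []        # element symbol of each node
        self.bonds = {}        # (i, j) with i < j -> bond order
        self.complete = False

    def used_valence(self, i):
        return sum(order for pair, order in self.bonds.items() if i in pair)

    def add_atom(self, element):
        """Action type 1: add a new atom. Returns its index, or None if rejected."""
        if self.complete or element not in MAX_VALENCE:
            return None
        self.atoms.append(element)
        return len(self.atoms) - 1

    def add_bond(self, i, j, order=1):
        """Action type 2: bond two existing atoms. Returns True if applied."""
        n = len(self.atoms)
        if self.complete or i == j or not (0 <= i < n and 0 <= j < n):
            return False
        key = (min(i, j), max(i, j))
        if key in self.bonds:
            return False
        for atom in key:
            if self.used_valence(atom) + order > MAX_VALENCE[self.atoms[atom]]:
                return False  # would exceed the allowed valence
        self.bonds[key] = order
        return True

    def stop(self):
        """Stopping action: the molecule is considered complete."""
        self.complete = True
```

Because invalid bonds are rejected at the moment they are proposed, every intermediate graph is itself chemically interpretable, in contrast to a partial SMILES string.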
The pioneering RL work for graphical molecular construction was that of You et al.92 Given a set of scaffold subgraphs corresponding to atoms or molecular substructures, the state space of the RL agent is defined as the set of graphs which can be constructed from these subgraphs, and the action space as the set of possible extensions to the existing graph by either connecting existing nodes in the graph or adding an additional scaffold subgraph. Two deep graph convolutional networks were used as a generator and a discriminator in an adversarial setting, with rewards based on molecular properties and adversarial loss. This method was tested on problems of molecular property optimisation and property targeting, and demonstrated to offer significant improvements over earlier approaches. Later, Khemchandani et al.93 integrated the work of You et al.92 with a neural network method for property prediction introduced by Yang et al.94 to create a new method for generation of molecules with desirable properties. A recent paper by Atance et al.95 uses a gated graph neural network in place of a convolutional network, with a memory-aware RL approach developed from that of Olivecrona et al.78 and a best agent reminder loss function. This method shows signs of strong performance on QED optimisation and DRD2 activity tasks.
Zhou et al.96 took a different approach, aiming to optimise existing molecules for drug discovery. By making only chemically valid changes – atom addition, bond addition or bond removal – to molecules, the authors ensure the final molecule is itself chemically valid, and avoid the need for pre-training on existing data to reduce the risk of bias in the model. A deep Q-network structure utilising a fully connected feedforward NN with four layers is used for value function learning, with a reward function (which includes intermediate rewards but weights the final reward most highly) defined by the molecular properties to be optimised.
In a similar vein, Ståhl et al.97 introduced DeepFMPO, a temporal difference actor-critic RL method with two LSTM RNNs, which discovers novel molecules that optimise multiple objectives through molecular modification. First, a library of molecular fragments is created by fragmenting a set of initial molecules, and fragments are encoded in such a way that similar molecules have similar encodings. The agent alters one fragment in the molecule at each time step, with rewards given for molecular validity and improvement in the target properties. The reward function is updated over time as the agent discovers more molecules with desirable properties.
Two papers published in 2020 focus on ensuring that the proposed new molecules can be synthesised and that the synthesis route for the molecule can be determined. Gottipati et al.98 developed an approach named policy gradient for forward synthesis. The state space is the set of molecules to be used in a chemical reaction, and the action space is split into two parts: an intermediate action of selecting a reaction template, and a final action of selecting a reactant to combine with the current molecule to produce a new molecule with desirable chemical properties. A k-nearest neighbour algorithm is used to discretise the otherwise vast reactant space, and the selection policy is optimised using twin delayed deep deterministic policy gradient (TD3). Three neural networks are used, one each for predicting the best reaction template, computing the action to be taken, and estimating the Q-value of the product molecule. Each network is a fully connected feedforward NN with four layers, although the number of neurons in the hidden layers varies between networks. Similarly, Horwood and Noutahi99 take a given set of initial molecules and reactants and define the transitions between states in terms of sequences of chemical reactions. Actions are treated hierarchically in terms of selecting a reaction template and then a reactant. Molecules are represented using Morgan fingerprints and reaction templates by SMARTS syntax. An advantage actor-critic algorithm is used for policy optimisation.
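The role of the k-nearest neighbour step can be illustrated with a short sketch. It assumes that the policy outputs a continuous "proto-action" vector in the same feature space as the available reactants, and that the k closest real reactants are then ranked by a critic. The function names, the Euclidean distance and the `q_value` callable are assumptions made for illustration and are not taken from the work of Gottipati et al.

```python
import numpy as np


def k_nearest_reactants(proto_action, reactant_features, k=3):
    """Return indices of the k reactants closest to a continuous proto-action."""
    distances = np.linalg.norm(reactant_features - proto_action, axis=1)
    return np.argsort(distances)[:k]


def select_reactant(proto_action, reactant_features, q_value, k=3):
    """Pick the candidate with the highest critic score among the k neighbours.

    q_value is a hypothetical callable mapping a feature vector to a score.
    """
    candidates = k_nearest_reactants(proto_action, reactant_features, k)
    scores = [q_value(reactant_features[i]) for i in candidates]
    return int(candidates[int(np.argmax(scores))])
```

The effect is that the agent can act in a continuous space while only ever selecting reactants that actually exist in the library.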
There has also been interest in using RL for three-dimensional molecular design, which can take advantage of spatial information not captured by string or graphical representations. Simm et al.100 developed an RL agent which designs molecules by placing atoms onto a canvas in 3D space. The reward function is based on fundamental physical properties of the constructed molecules, in effect encouraging the agent to learn the laws governing atomic interactions in three dimensions. The state space is the set of currently placed atoms and their positions plus a bag of atoms remaining to be placed (the initial set of atoms to be placed must be specified in advance), with the action space being the selection of the next atom to be placed and its location, and the placement action chosen based on the distance and dihedral angle with respect to an already-placed focal atom when the canvas is not empty. The same authors develop the method further in Simm et al.101 to exploit symmetry in the design process using rotationally covariant state–action representations and neural network architectures.
Another example of 3D molecular construction is the work of Bolcato and Boström,102 an extension of the DeepFMPO method of Ståhl et al.97 discussed above. The revised method uses conformer search on molecular fragments for 3D alignment so that electrostatic and shape similarities may be considered instead of simpler 2D similarity.
Recently, Meldgaard et al.103 proposed a method combining imitation learning and reinforcement learning to ensure that the generated molecules are stable. Imitation learning is a learning scheme in which an ML model may learn from demonstrations and has some history of use alongside RL.104 First, an imitation learning agent is trained on a molecular database to construct existing molecules and predict their stability using quantum chemistry and 3D information. A deep RL agent using a Q-learning algorithm then explores the molecular space to discover new stable molecules, while simultaneously updating the prediction model in areas of the space a long way from the training data.
Cho et al.105 presented a method to predict the 3D structures of small molecules. Given the SMILES string of a target molecule and a randomised initial structure for its atoms, the agent takes actions corresponding to the movements of atoms within the molecule in 3D space, with rewards given by density functional theory energy calculations and the policy optimised using deep deterministic policy gradient. The agent was tested on five simple hydrocarbon targets and was able to correctly locate the lowest-energy structure of all five molecules.
Ahuja et al.106 modified a conventional approach for optimisation of molecular geometries via minimisation of potential energy using the BFGS algorithm to additionally draw on RL. The state of the agent at each time step consists of the gradients of the potential energy surface and the updates made to the positions of the atoms in the molecule at each previous time step. The action to be taken is to propose a correction term to the BFGS calculation for the next atomic position update, with policy learning via PPO, and a neural network architecture consisting of several feedforward NNs linked by a self-attention layer for information sharing (the authors note the similarity of this structure to the transformer architecture discussed above). Since the aim is to reach an optimal geometry quickly, the reward function is fixed to −1 per time step. This approach is shown to reduce the number of time steps required to reach an optimum compared to non-RL methods on a set of test molecules.
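To see why a fixed reward of −1 per time step encourages fast convergence, consider an episode that reaches the optimum after T steps. The symbols r_t for the reward at step t, T for the episode length and γ for a discount factor are notation assumed here for illustration. Without discounting, the return is

$$G = \sum_{t=1}^{T} r_t = \sum_{t=1}^{T} (-1) = -T,$$

so maximising the return is equivalent to minimising the number of steps. With a discount factor 0 < γ < 1 the same conclusion holds, since

$$G_\gamma = \sum_{t=1}^{T} \gamma^{t-1} (-1) = -\frac{1 - \gamma^{T}}{1 - \gamma}$$

becomes more negative as T increases.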
The earliest reinforcement learning work in this field was conducted by Eastman et al.107 in the shape of a deep RL agent for the RNA design or inverse folding problem. Given a sequence of RNA bases, the agent aims to find a new sequence which will fold to a target structure by repeated modification of the existing sequence using a form of local search. Policy optimisation is handled by an asynchronous advantage actor-critic algorithm, while the neural network used consists of several convolutional layers operating on either one or seven RNA bases at a time, plus a final fully connected layer for value function estimation and a softmax layer to output action probabilities. This approach is shown to improve on existing non-RL algorithms for RNA design on a test set of 100 target structures, although the authors note that further performance improvements should be possible.
Independently, Runge et al.108 introduced LEARNA, a deep RL algorithm to sequentially design an RNA sequence to fold to a target structure. Uniquely within the chemistry literature, the neural network architecture is determined via architecture search: the network always includes an embedding layer and a fully-connected NN with at most 2 layers, and may also include a CNN with at most 2 layers and an LSTM RNN with at most 2 layers. PPO is used for policy learning, with the hyperparameters and environment parameters jointly optimised alongside the network architecture. The reward function is based on the Hamming distance between the observed structure from the folding process and the target structure, and incorporates a local improvement step. The authors also demonstrate an extension based on meta-learning, which concerns agents that can learn how to improve their own learning given relevant previous experience109 and has been used to improve the training of RL agents.110,111 Meta-LEARNA, which transfers knowledge learned from solving one RNA sequence problem to others, was found to improve on the other algorithms tested, solving a higher proportion of problems in a significantly shorter time across three sets of test problems.
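A Hamming-distance-based reward of the kind described above can be sketched as follows, assuming that both the folded and target structures are written as equal-length dot-bracket strings. The normalisation to the range [0, 1] and the function names are illustrative assumptions; the exact functional form used in LEARNA, including its local improvement step, is not reproduced here.

```python
def hamming_distance(folded, target):
    """Number of positions at which two equal-length structures differ."""
    if len(folded) != len(target):
        raise ValueError("structures must have equal length")
    return sum(a != b for a, b in zip(folded, target))


def structure_reward(folded, target):
    """Reward of 1.0 for a perfect match, decreasing linearly with mismatches."""
    return 1.0 - hamming_distance(folded, target) / len(target)
```

For example, comparing the folded structure `(....)` with the target `((..))` gives a Hamming distance of 2 and hence a reward of 1 − 2/6.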
The wider problem of general biological sequence design (for example of DNA or proteins) is tackled by Angermueller et al.112 Given a type of biological sequence to be designed, the state space is defined as the set of possible sequence prefixes, and the action space as the vocabulary of characters which can appear in the sequence (in the previously stated examples, DNA nucleotides and amino acids respectively). The reward function is treated as partially unknown: intermediate rewards are set to zero and the final reward penalises overly similar sequences, but other than this the reward function must be approximated during sequence construction. Several surrogate models of varying complexity are considered for the reward function, predominantly supervised regression models, including a Bayesian ensemble of neural networks. The policy is determined by a new algorithm named DyNA PPO, a form of model-based optimisation using PPO with additional simulated data from one or more of the candidate models for the reward function, with the effect of increasing the sample efficiency of the agent.
Conformer search – the prediction of stable 3D molecular geometries for flexible molecules – is also a relevant and significant challenge, analogous to geometry optimisation for small molecules. This is approached using RL by Gogineni et al.113 To represent the conformer search problem as a Markov decision process, the state of the RL agent is taken as the sequence of conformers observed during the search process so far. The action space is a discretised version of the torsional space, consisting of the choice of torsion angles for each rotatable bond in the current molecule. Node embedding is handled via a message passing graph convolutional neural network, a graph pooling operator and LSTM memory unit to convert the embeddings into a full representation with historical information incorporated, while torsion action selections are determined by a feedforward NN. The authors also demonstrate that the agent's performance may be enhanced via the concept of curriculum learning, a strategy in which the agent learns progressively over a series of increasingly complex related problems building up to the problem of interest.114
Li et al.118 took the first steps into deep RL for the 2D HP model by introducing FoldingZero, a protein self-folding architecture based on a deep convolutional network and upper confidence bound tree searching plus an actor-critic RL algorithm to iteratively improve both steps. In this formulation, the action space is of size three instead of four, as the folding moves conducted must form a self-avoiding walk. Rewards are provided for the amount of H–H contact in the final folded protein. Later, Jafari and Javidi119 utilised an RNN with three hidden LSTM layers for deep RL via a vanilla policy gradient algorithm, with the action space defined identically to that of Czibula et al.117 but with a novel reward function: the reward is −1 if amino acids are lattice neighbours and not consecutive in order and is a small positive value otherwise, and the agent attempts to maximise the absolute value of the cumulative reward. Results demonstrated an improvement in both minimising the free energy in the final folded state and in the time required to reach a solution from several starting positions compared to previous bidimensional folding techniques.
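The 2D HP lattice setting can be made concrete with a short sketch. It assumes a relative move encoding (turn left, go forward, turn right), which is one natural way of obtaining an action space of size three; whether FoldingZero uses exactly this encoding is not stated above, and all function names are hypothetical. The score counts H–H pairs that are lattice neighbours but not adjacent in the chain.

```python
def rotate(direction, move):
    """Rotate a unit lattice direction by a relative move: 'L', 'F' or 'R'."""
    dx, dy = direction
    if move == "L":
        return (-dy, dx)
    if move == "R":
        return (dy, -dx)
    return (dx, dy)  # "F": keep going straight


def fold(moves):
    """Place residues on a 2D lattice, with the first bond fixed along +x.

    Returns the list of positions, or None if the walk is not self-avoiding.
    """
    positions = [(0, 0), (1, 0)]
    direction = (1, 0)
    for move in moves:
        direction = rotate(direction, move)
        x, y = positions[-1]
        nxt = (x + direction[0], y + direction[1])
        if nxt in positions:
            return None
        positions.append(nxt)
    return positions


def hh_contacts(sequence, positions):
    """Count H-H pairs that are lattice neighbours but not chain neighbours."""
    count = 0
    for i in range(len(sequence)):
        for j in range(i + 2, len(sequence)):
            if sequence[i] == "H" and sequence[j] == "H":
                (xi, yi), (xj, yj) = positions[i], positions[j]
                if abs(xi - xj) + abs(yi - yj) == 1:
                    count += 1
    return count
```

For the sequence `HPPH`, the moves `["L", "L"]` fold the chain around a unit square, bringing the two H residues into contact and giving a count of one.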
Two recent attempts have been made to address protein folding in a more realistic framework. The master's thesis of Gao120 uses a graphical representation of the 3D structures of proteins: nodes in the graph contain information about residues in the conformation and edges contain information about relationships between them. The RL agent operates by changing the torsion angles of the conformation one at a time, so the action space consists of selecting the torsion angle to be altered and the number of degrees by which to rotate it. Protein folding is viewed as an infinite-step process, with rewards given so that the agent aims to minimise the free energy of the conformation. The neural network architecture is very similar to that used by Gogineni et al.113 for the broader conformer search problem.
Panou and Reczko121 applied RL to the protein folding software game Foldit Standalone, introduced by Kleffner et al.122 as a development of the online game Foldit of Cooper et al.123 Foldit works by allowing human users to interact with an image of a protein in several ways, with a score awarded to each protein configuration based on the negative of its energy. In the work of Panou and Reczko,121 15 of the possible interactions with a protein form the action set for the RL agent, which works on pre-processed images of proteins taken from the game. Deep Q-learning with a CNN composed of several convolutional layers for feature extraction and two fully connected layers for prediction is used for policy optimisation, with the reward function based on the change in Foldit score resulting from an action. The agent was trained on 20 proteins and tested on 20 more; while performance is highly dependent on hyperparameter settings, an optimised agent is able to deliver consistent structural improvements within a reasonable time, although it does not exceed human performance at the same task.
Ideas derived from reinforcement learning have also been applied to more general protein conformation sampling. Shamsi et al.124 presented an extension to count-based adaptive sampling for exploration of protein conformation landscapes, in which reinforcement learning is used to choose the set of points in the protein structure at which molecular dynamics simulations should be run to identify low-energy states. The use of RL in this work is limited, however, as the policy is never optimised and must be supplied by a human; instead, the reward function is used to optimise a set of weight parameters corresponding to the directions in which sampling may move from an initial state. Barozet et al.125 developed two methods to sample the conformational space of protein loop portions, an important feature of many proteins which are often represented unrealistically by a single conformation. The more successful of the two methods in terms of sampling speed uses an RL-inspired heuristic for the selection of a tripeptide to be added during the loop generation process. Scoring is based on the probability of successfully closing a protein loop given previous loops generated using relevant tripeptides, although this is not a reward function in the traditional sense as there is no cumulative reward to be maximised.
For chemical retrosynthesis, Schreck et al.128 designed an RL agent which is given a set of possible reactions leading to a product (either the final product or an intermediate) and a set of commercially available molecules from which the target must be synthesised. The action space consists of selecting a reaction at each step in the synthesis chain, and the reward function is designed to minimise the cost of synthesising the target product, subject to the total number of reactions required being no greater than a specified threshold. A form of deep Q-learning is used for policy improvement. A different RL treatment of the problem was developed by Segler et al.129 via a method to propose molecular transformations and determine their validity based on Monte Carlo tree search and three separate neural networks for expansion policy, rollout policy and prediction. Koch et al.130 later applied this approach to bioretrosynthesis, with extensions to handle biological compounds via reaction rules and combined biochemical scoring.
Having determined a reaction of interest, the next step is to choose the conditions under which the reaction will take place. This can significantly influence the amount of the target reaction product which can be produced, and there is substantial literature on optimising reaction conditions via various methods, for instance Gao et al.131 and Shields et al.132 The key work on RL for reaction optimisation is that of Zhou et al.133 The state space of the agent is defined as the set of all possible combinations of reaction conditions, and the action space by the set of changes that could be made to the current conditions. A policy gradient algorithm and an LSTM RNN with two hidden layers are used. When tested on four real reactions, the agent consistently finds the optimal reaction conditions faster than non-RL algorithms.
In a related problem domain, Li et al.134 use asynchronous advantage actor-critic RL to control the molecular weight distribution in atom transfer radical polymerisation. Given the current reaction system, the agent may add a fixed amount of four possible reagents to the reaction system, provided the reagent budget has not been reached. A reward of 1 is given if the actual molecular weight distribution is very close to the target molecular weight distribution, and 0.1 given if it is further away but still sufficiently close to be worthy of further exploration in this approximate region. Two different neural networks are considered, with a CNN found to outperform a simple fully-connected NN.
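The tiered reward described above can be sketched in a few lines. The distance measure (maximum absolute difference between the two distributions), the threshold values `tight` and `loose`, and the zero reward for distributions outside both thresholds are all assumptions made for illustration; only the reward values of 1 and 0.1 come from the description above.

```python
import numpy as np


def mwd_reward(actual, target, tight=0.05, loose=0.2):
    """Tiered reward comparing actual and target molecular weight distributions.

    The thresholds and the distance measure are hypothetical choices.
    """
    actual = np.asarray(actual, dtype=float)
    target = np.asarray(target, dtype=float)
    gap = np.max(np.abs(actual - target))
    if gap <= tight:
        return 1.0   # very close to the target distribution
    if gap <= loose:
        return 0.1   # close enough to merit further exploration
    return 0.0       # assumed: no reward otherwise
```

The smaller intermediate reward gives the agent a learning signal in the neighbourhood of the target, which would otherwise be difficult to find by random exploration alone.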
Ma et al.135 treat control of the molecular weight distribution in nonlinear polymerisation reactions as a process control problem. The agent chooses the values to which the initiator and monomer flow rates in the polymerisation process are to be set, with a positive reward given if the difference between target and observed molecular weights is sufficiently small, and negative reward otherwise; the final molecular weights are given greater importance than intermediate values. DDPG and two feedforward NNs are used for policy optimisation. The training regime is adapted to incorporate historical measurement data as semi-batch experiments are non-Markovian. A different reaction control problem was also approached using DDPG by Alhazmi and Sarathy,136 focusing on the reaction temperature in a continuous stirred tank reactor (CSTR) network with complex dynamics and measurement uncertainty. The problem of temperature control of a CSTR system via reinforcement learning was earlier considered by Pandian and Noel,137 with the conclusion that using deep RL directly is a better approach than simply using RL to tune the parameters of a more traditional controller.
Also relevant is the recent work of Rajak et al.138 on the problem of optimal synthesis planning for inorganic materials. The authors use deep RL with policy gradient optimisation to generate time sequences of reaction conditions for quantum material synthesis via chemical vapour deposition.
Another approach to better understanding and explaining chemical reaction mechanisms is to locate the transition states in the reaction and use these to determine the factors that characterise the reaction. This is a difficult problem, as the set of possible factors is typically much larger than the subset which actually affect the reaction mechanism, but recent computational developments have led to renewed interest in attempting to learn reaction mechanisms automatically. RL was introduced to this effort by Zhang et al.139 via a method to determine reaction dynamics and transition state locations using molecular dynamics simulation and RL with two fully connected feedforward NNs. The learning approach is named variational target optimisation and is a combination of two techniques closely related to actor-critic methods which were earlier described by the same research group.140
Reinforcement learning has recently begun to play a role in research into catalysis. Although the key problem of identifying new catalysts to improve reaction efficiency has yet to be attempted using RL, there has been promising work on related problems, including that of Yoon et al.141 on prediction of kinetic pathways and barriers in potential catalysts under reaction conditions. The state of the system consists of information about the energy surface structure, and the agent chooses one of four possible actions to modify the surface, with actor-critic TRPO for policy optimisation. The final reward is positive if kinetically feasible surface separation is observed and negative if the upper energy bound is exceeded, with positive intermediate rewards given when transition states are observed. The resulting CatGym method was tested on an Ni–Pd–Au alloy catalyst and was able to explore the surface efficiently and generate kinetic pathways to low-energy configurations. Similarly, Lan and An142 used a combination of deep RL and density functional theory to discover reaction pathways in the Haber–Bosch process with an Fe(111) catalyst, identifying a path with a lower free energy barrier than previously known pathways. The agent uses PPO with two fully connected NNs, with a state space defined by vectors of length 23 containing information about the reaction at different surface sites and the number of gas species along the reaction path, and rewards based on the free energy barrier connecting the previous and current state after an action is taken.
Finally, an unusual application of reinforcement learning to improve the yield of a chemical product was developed by Hubbs et al.143 in the form of an agent to determine optimal production schedules for a chemical manufacturing process. Here, the action space consists of constructing a schedule to determine which product is to be produced on each day of the simulation, and rewards are given based on the profit made from the production schedule. The RL implementation uses advantage actor-critic and a neural network with 12 hidden layers.
However, this is perhaps an area of opportunity, as RL has seen some use in quantum physics which may have applications to chemistry fields. For example, both Niu et al.148 writing on an application of RL to quantum control and Bolens and Heyl149 on RL for digital quantum simulation highlight their potential uses in quantum chemistry but do not attempt this themselves. Similarly, the work of Nguyen et al.150 on measurement of double quantum dot devices may also be relevant to quantum chemistry research in future.
RL has recently seen use in battery modelling. Unagar et al.156 calibrate a model for lithium-ion battery discharge by using an RL agent to select the degradation parameters of the model. The agent uses the Lyapunov variation of the actor-critic algorithm and two fully connected NNs, and is shown to generate better calibrated models than a Kalman filter method. Li et al.157 use deep Q-learning to construct an energy management strategy for battery electric vehicles, with the reward function constructed so that the agent will minimise the energy loss of the system while ensuring its safety.
The growth of RL within chemistry has been notably non-uniform, with much more focus on some areas than others. In particular, molecular design for drug discovery has seen an extremely large volume of work, with many different variants of the wider design problem being attempted independently by different research groups using several varieties of RL. Biochemistry problems of sequence design and protein analysis have also been a key area of RL research in recent years. In contrast, the use of RL in fields such as chromatography, crystallography and quantum chemistry has so far been extremely limited. It is possible that there is untapped potential for RL to aid researchers in some of these areas, as planning and sequential decision making problems to which it is well suited are present in these fields.
One area of concern arising from this review is the relatively limited impact of previous RL research on practical chemistry. For example, despite the many promising computational results obtained in drug discovery problems, it is somewhat concerning to note that few (if any) of the molecules discovered through these methods have yet been synthesised. Similarly, despite the encouraging performance of RL in reaction optimisation tasks, there is little evidence that this has influenced the choice of reaction conditions in real problems. There are many reasons why this may be the case, but it does highlight the importance of ensuring that practical chemists – and not just those already engaged in computational work – are in a position to take advantage of new developments.
In practical terms, reinforcement learning has become significantly more accessible in recent years. As interest in the topic has increased, so too has the volume of easily available literature. In addition, open-source code libraries are available in a variety of programming languages, removing a significant hurdle to easy implementation of RL. This, combined with its increasingly widespread use across different disciplines, suggests that the importance of RL as a practical technique within chemistry will only continue to grow.
This journal is © The Royal Society of Chemistry 2022