Yuanqi Gao,^{a} Xian Wang,^{b} Nanpeng Yu*^{c} and Bryan M. Wong*^{d}
^{a}Department of Electrical and Computer Engineering, University of California-Riverside, Riverside, CA, USA
^{b}Department of Physics and Astronomy, University of California-Riverside, Riverside, CA, USA
^{c}Department of Electrical and Computer Engineering, University of California-Riverside, Riverside, CA, USA. E-mail: nyu@ece.ucr.edu
^{d}Department of Chemical and Environmental Engineering, Materials Science and Engineering Program, Department of Chemistry, and Department of Physics and Astronomy, University of California-Riverside, Riverside, CA, USA. E-mail: bryan.wong@ucr.edu

Received 1st June 2022, Accepted 9th September 2022

First published on 13th September 2022

We present an efficient deep reinforcement learning (DRL) approach to automatically construct time-dependent optimal control fields that enable desired transitions in dynamical chemical systems. Our DRL approach gives impressive performance in constructing optimal control fields, even for cases that are difficult to converge with existing gradient-based approaches. We provide a detailed description of the algorithms and hyperparameters as well as performance metrics for our DRL-based approach. Our results demonstrate that DRL can be employed as an effective artificial intelligence approach to efficiently and autonomously design control fields in quantum dynamical chemical systems.

The conventional approach to solving these quantum control problems is to maximize the desired transition probability using either gradient-based methods or other numerically intensive methods.^{14–17} Such approaches include stochastic gradient descent over quantum trajectories,^{18} the Krotov method,^{19} gradient ascent pulse engineering (GRAPE),^{20} and the chopped random basis (CRAB)^{21} algorithm. While each algorithm has its own purposes and advantages, the majority of these approaches require complex numerical methods to solve for the optimal control fields. Moreover, due to the nonlinear nature of these inverse problems, the number of iterations and floating-point operations in these algorithms can be extremely large, sometimes even leading to unconverged results for relatively simple one-dimensional problems.^{16,22}

To address the previously mentioned computational bottlenecks, our group recently explored the use of supervised machine learning to solve these complex, inverse problems in quantum dynamics.^{23} In contrast to supervised machine learning, reinforcement learning (RL) techniques have attracted recent attention since these methods are designed to solve sequential decision-making tasks, which are naturally suited to quantum control problems. However, all prior RL studies to date have focused on low-dimensional spin-1/2 systems, which generally require a relatively small number of control pulses (typically 10–100) to converge.^{24–29} More specifically, the RL algorithms used in previous quantum control problems (such as tabular Q learning or policy gradient) assume a finite set of admissible control pulses and quantum state representations. While such discretizations are feasible for finite-dimensional, spin-1/2 Hilbert spaces, they are typically ineffective for continuous (i.e., chemical/material) Hamiltonian systems.

In this work, we develop, for the first time, an extremely efficient RL approach for solving quantum control problems in dynamical chemical systems. Our RL formulation utilizes modern deep learning frameworks and has a computational performance that scales linearly with the control time horizon. We test our new machine learning approach against a wide range of quantum control benchmarks to demonstrate that our RL approach significantly improves the fidelity and reduces the computation time compared to conventional gradient-based approaches. This paper is organized as follows: Section II reviews the background of quantum control in continuous systems and formulates it as a reinforcement learning problem. Section III presents the reinforcement learning techniques. Section IV provides the numerical results, and Section V concludes the paper with a discussion and perspectives on prospective applications.

i∂ψ(x,t)/∂t = [−(1/2m)∂²/∂x² + V(x) − x·E(t)]ψ(x,t) (1)

With the time-dependent Schrödinger equation defined in eqn (1), the quantum control problem can be stated as follows: given a starting state ψ_{0}(x) and a desired final state ψ_{f}(x), what is the temporal form of the electric field E(t), t ∈ (0,T) that propagates the state ψ_{0}(x) to ψ_{f}(x)? In other words, the quantum control formalism seeks the electric field that maximizes the following functional:

P[E(t)] = |⟨ψ_{f}(x)|ψ(x,T)⟩|² = |∫ψ_{f}*(x)ψ(x,T)dx|² (2)

In this work, we harness new RL techniques to automatically construct optimal control fields, E(t), that enable desired transitions in these dynamical systems. To test the performance of our RL approach, we compare against the NIC-CAGE (Novel Implementation of Constrained Calculations for Automated Generation of Excitations) code,^{38} which solves the quantum control problem using a traditional gradient-based approach. Specifically, the NIC-CAGE code utilizes analytic gradients based on a Crank–Nicolson propagator, which are computationally more efficient than other matrix exponential approaches (such as those used in the GRAPE^{39} or QuTIP^{40,41} packages) or higher-order time-propagation methods.^{42} As such, a comparison against the execution times of the already optimized NIC-CAGE code serves as an excellent benchmark test of the performance of our RL methods. Before describing our reinforcement learning approach, we first provide a brief review of Markov decision processes (MDPs) in the next section.
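For readers unfamiliar with this propagation scheme, the following minimal Python sketch illustrates a single Crank–Nicolson step for a one-dimensional Hamiltonian with a dipole-type field coupling; the grid construction, function name, and coupling form are illustrative assumptions rather than the actual NIC-CAGE implementation.

```python
import numpy as np

def crank_nicolson_step(psi, V, E_t, dx, dt, m=1.0):
    """One Crank-Nicolson step of i dpsi/dt = [-(1/2m) d2/dx2 + V(x) - x*E(t)] psi
    on a uniform grid (illustrative sketch only)."""
    n = psi.size
    x = (np.arange(n) - n // 2) * dx

    # Kinetic operator from a central finite-difference Laplacian
    lap = (np.diag(np.ones(n - 1), -1) - 2.0 * np.eye(n) + np.diag(np.ones(n - 1), 1)) / dx**2
    H = -lap / (2.0 * m) + np.diag(V - x * E_t)

    # Solve (I + i*dt/2*H) psi_next = (I - i*dt/2*H) psi
    A = np.eye(n) + 0.5j * dt * H
    B = np.eye(n) - 0.5j * dt * H
    return np.linalg.solve(A, B @ psi)
```

The Crank–Nicolson update is unconditionally stable and norm-preserving, which is why it is a common choice for this class of time-propagation problems.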

The goal of the learning agent is to find a policy π(a|s), which is a rule for taking actions based on states, such that the expected discounted return v^{π}(s) is maximized:

(3)

(4)

Another function commonly used in reinforcement learning is the action-value function defined as

q^{π}(s,a) = E_{π}[Σ_{k=0}^{∞} γ^{k}r_{t+k+1} | s_{t} = s, a_{t} = a] (5)

(6)

Time variable.
The time variable, t, of the MDP is naturally defined as the time in the quantum control problem, which we discretize into evenly spaced intervals of duration τ.

State.
The state at time step t is defined as s_{t} = [P^{0}_{t}, P^{1}_{t}, …, P^{K}_{t}, g_{t}]. P^{k}_{t}, k = 0, …, K, is the squared magnitude of the projection of the current wavefunction, ψ(x,t), onto the kth eigenstate, ψ^{k}(x), of the time-independent Schrödinger equation:

P^{k}_{t} = |∫ψ^{k}(x)*ψ(x,t)dx|² (7)

We include the various P^{k}_{t} terms in our state space since they give the reinforcement learning agent additional information about the current wavefunction. The variable K is a design parameter that is described further in Section IV. The variable g_{t} is the gradient of the fidelity with respect to the electric field, E, evaluated at E(t − 1):

g_{t} = ∂P^{κ}_{t}/∂E|_{E = E(t−1)} (8)

To calculate this gradient, we re-express P^{k}_{t} as

(9)

where the propagation operator appearing in eqn (9) performs one propagation step of the wavefunction (i.e., propagates one step of the time-dependent Schrödinger equation in eqn (1)). We approximate the integral in eqn (9) with a finely spaced Riemann sum and leverage the auto-differentiation engine from the PyTorch deep learning framework^{44} to calculate the gradient. Adding this gradient information provides the machine learning agent with the direction in which the fidelity can possibly be improved.
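As an illustration of how this gradient can be obtained in practice, the sketch below differentiates the fidelity through a single propagation step using PyTorch autograd; the `propagate_one_step` callable and the grid spacing `dx` are hypothetical placeholders, not the code used in this work.

```python
import torch

def fidelity_gradient(psi_prev, psi_target, E_prev, propagate_one_step, dx):
    """Gradient of P^k_t with respect to the previous field value E(t-1).

    psi_prev, psi_target : complex torch tensors on a uniform spatial grid.
    propagate_one_step   : differentiable callable (psi, E) -> psi_next (placeholder).
    """
    E = torch.tensor(float(E_prev), dtype=torch.float64, requires_grad=True)
    psi_next = propagate_one_step(psi_prev, E)

    # Riemann-sum approximation of the projection integral in eqn (9)
    overlap = torch.sum(torch.conj(psi_target) * psi_next) * dx
    fidelity = torch.abs(overlap) ** 2

    fidelity.backward()      # autodiff through the propagation step
    return E.grad.item()     # g_t, appended to the RL state
```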

Action.
The action at time step t is defined as the amplitude of the electric field, a_{t} = E(t), where the minimum and maximum amplitudes are restricted to E_{min} and E_{max}, respectively.

Reward.
The reward to the agent after taking an action a_{t} is defined as r_{t+1} = P^{κ}_{t+1} (i.e., the immediate next fidelity score).

This brief explanation completes the formulation of quantum control as a reinforcement learning problem. In the next section, we provide the technical details for utilizing RL to solve our quantum control problem in reduced-dimensional chemical systems.

(10)

To learn the optimal policy, RL algorithms must properly balance the conflicting objectives of exploring the state-action space as much as possible to collect environment feedback, while only visiting useful portions to act optimally. This is known as the exploration–exploitation trade-off. In the terminology of quantum control, the RL algorithm must explore different external electric fields, E(t), before it recognizes the optimal one. However, to be efficient, this exploration should not take too long since it may undermine computational performance. We briefly review two of the popular methods to balance exploration and exploitation in our work.

Epsilon greedy.
The Epsilon greedy^{46} algorithm explores the state-action space by following the current greedy policy, while occasionally taking a random action uniformly sampled from the action space:

a_{t} = argmax_{a} q_{θ}(s_{t},a) if u ≥ ε, otherwise a uniformly random action, with u ∼ 𝒰(0,1), (11)

where 𝒰(0,1) is a uniform distribution between 0 and 1, and ε is a constant between 0 and 1. In practice, ε may start with a large value and is gradually annealed as the training progresses.
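A minimal sketch of this exploration rule (with an illustrative annealing schedule; the decay constants are assumptions, not the values used in this work) is:

```python
import numpy as np

def epsilon_greedy_action(q_values, epsilon):
    """Return argmax_a Q(s,a) with probability 1 - epsilon, otherwise a uniformly random action."""
    if np.random.rand() < epsilon:
        return np.random.randint(len(q_values))   # explore
    return int(np.argmax(q_values))               # exploit

# Illustrative annealing: epsilon decays from 1.0 toward a small floor value
epsilon_schedule = [max(0.05, 0.995 ** k) for k in range(1000)]
```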

Entropy bonus.
In the maximum entropy RL framework,^{47} the entropy of the policy is added to the reward to maintain high stochasticity of the policy when the collected reward value is small:

r^{h}(s,a) = r(s,a) + αH(π(·|s)), (12)

where α is the temperature parameter that controls the influence of the entropy on the reward. Policies learned from r^{h}(s,a) tend to have a higher stochasticity than the ones learned from r(s,a) alone. Therefore, the sampled actions a_{t} ∼ π(·|s) have a higher chance of visiting a larger portion of the state-action space.
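For a discrete action distribution, the entropy-augmented reward of eqn (12) can be sketched as follows (this discrete form is purely illustrative; the continuous SAC policy used later handles the entropy term internally):

```python
import torch

def entropy_augmented_reward(reward, action_probs, alpha):
    """r_h = r + alpha * H(pi(.|s)) for a discrete policy (illustrative only)."""
    entropy = -(action_probs * torch.log(action_probs + 1e-12)).sum()
    return reward + alpha * entropy
```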

In our work, two DRL algorithms are harnessed to solve the quantum control problem: deep Q learning and soft actor-critic, both of which are described below.

J(θ) = E[(r_{t+1} + γ max_{a′} q_{θ−}(s_{t+1},a′) − q_{θ}(s_{t},a_{t}))²] (13)

θ ← θ − δ∇J(θ), (14)

In this paper, we adopt two important extensions to the basic deep Q learning algorithm, namely, the dueling architecture^{48} and double deep Q learning,^{49} which have achieved improved performance on other control benchmarks. The dueling architecture decomposes the Q value estimate into a state value and an advantage function estimate according to the formula q^{π}(s,a) = v^{π}(s) + A^{π}(s,a). As a result, the dueling Q network replaces the output of the standard neural network, q_{θ}(s,a), with two intermediate output streams: v_{θ,β}(s) and A_{θ,α}(s,a). The final output, which is the Q value estimate, is given by the following aggregation of the two intermediate streams:

q_{θ}(s,a) = v_{θ,β}(s) + A_{θ,α}(s,a) − (1/|A|)Σ_{a′}A_{θ,α}(s,a′) (15)
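A minimal PyTorch sketch of such a dueling network is given below; the layer sizes and the number of discretized actions are illustrative assumptions rather than the architecture used in this work.

```python
import torch
import torch.nn as nn

class DuelingQNetwork(nn.Module):
    """Dueling architecture: q(s,a) = v(s) + A(s,a) - mean_a' A(s,a'), cf. eqn (15)."""
    def __init__(self, state_dim, n_actions, hidden=128):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.value = nn.Linear(hidden, 1)              # v_{theta,beta}(s)
        self.advantage = nn.Linear(hidden, n_actions)  # A_{theta,alpha}(s,a)

    def forward(self, s):
        h = self.trunk(s)
        v = self.value(h)
        a = self.advantage(h)
        return v + a - a.mean(dim=-1, keepdim=True)    # aggregated Q value estimate
```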

The double deep Q learning network (DDQN) approach modifies the loss function in eqn (13) as:

J(θ) = E[(r_{t+1} + γq_{θ−}(s_{t+1}, argmax_{a′}q_{θ}(s_{t+1},a′)) − q_{θ}(s_{t},a_{t}))²] (16)
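The key difference from eqn (13) is that the action in the bootstrapped target is selected by the online network but evaluated by the target network, which the following sketch illustrates (tensor shapes and the omission of terminal-state masking are simplifying assumptions):

```python
import torch

def double_dqn_target(q_online, q_target, next_states, rewards, gamma):
    """Bootstrapped target used in the double deep Q learning loss."""
    with torch.no_grad():
        best_actions = q_online(next_states).argmax(dim=1, keepdim=True)   # select with online net
        next_q = q_target(next_states).gather(1, best_actions).squeeze(1)  # evaluate with target net
        return rewards + gamma * next_q
```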

(17)

u_{t} = ξ_{t}·σ_{ϕ}(s_{t}) + μ_{ϕ}(s_{t}), (18)

a_{t} = tanh(u_{t}). (19)
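A minimal PyTorch sketch of this reparameterized, tanh-squashed sampling (assuming ξ_{t} is drawn from a standard normal distribution, and including the standard change-of-variables correction to the log-probability) is:

```python
import math
import torch

def sample_squashed_gaussian(mu, log_sigma):
    """Reparameterized action sampling: u = xi*sigma + mu, a = tanh(u), cf. eqn (18) and (19)."""
    sigma = log_sigma.exp()
    xi = torch.randn_like(mu)          # xi ~ N(0, 1), assumed noise distribution
    u = xi * sigma + mu                # eqn (18)
    a = torch.tanh(u)                  # eqn (19): squash the action into (-1, 1)

    # Gaussian log-probability plus the tanh change-of-variables correction
    log_prob = (-0.5 * ((u - mu) / sigma) ** 2 - log_sigma - 0.5 * math.log(2.0 * math.pi)).sum(-1)
    log_prob = log_prob - torch.log(1.0 - a.pow(2) + 1e-6).sum(-1)
    return a, log_prob
```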

The training of the policy neural network, π_{ϕ}(a|s), and the value networks, q_{θ^{i}}(s,a) for i = 1, 2, is carried out as follows. At each iteration, the value networks are trained to minimize the temporal difference error J(θ^{i}):

(20)

θ^{i} ← θ^{i} − δ∇J(θ^{i}), i = 1, 2, (21)

(22)

ϕ ← ϕ + δ∇J(ϕ), (23)

θ^{i−} ← ρθ^{i−} + (1 − ρ)θ^{i}, i = 1, 2. (24)
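The target-network update of eqn (24) is a simple Polyak average, sketched below for a generic pair of PyTorch modules (the function name is illustrative):

```python
import torch

@torch.no_grad()
def soft_update(target_net, online_net, rho):
    """theta_target <- rho*theta_target + (1 - rho)*theta, as in eqn (24)."""
    for p_target, p in zip(target_net.parameters(), online_net.parameters()):
        p_target.mul_(rho).add_((1.0 - rho) * p)
```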

Algorithm 1 RL for QOC

1: Initialize neural network weights and s_{0}
2: for t = 0, 1, … do
3:   Sample a_{t} ∼ π(·|s_{t}). The sampling is defined by eqn (11) for deep Q learning and eqn (17)–(19) for SAC
4:   E(t) ← a_{t}·Ē
5:   Perform one environment step to obtain s_{t+1}, r_{t+1} according to Section II A
6:   Store (s_{t}, a_{t}, r_{t+1}, s_{t+1}) into the replay buffer
7:   Sample a mini-batch from the replay buffer
8:   Train the RL algorithm by performing eqn (16) for deep Q learning and eqn (20)–(24) for SAC
9:   if P^{κ}_{t} exceeds the fidelity threshold then
10:    Break

At each time step t, the agent takes an action, a_{t}, according to the state s_{t} and converts it to an electric field E(t) = a_{t}Ē, where Ē is the upper/lower limit of the magnitude of the electric field. The environment then transitions to the next state according to the Markov decision process defined in Section II C. The agent–environment transition (s_{t},a_{t},r_{t+1},s_{t+1}) is stored in the replay buffer. Tuples of this form are randomly sampled to train the neural networks of the deep Q learning and SAC algorithms. The procedure stops when the fidelity P_{t}^{κ} exceeds a pre-defined threshold, which we set to 0.99 in this work.
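For completeness, the overall interaction loop of Algorithm 1 can be sketched as follows; the `env`, `agent`, and `replay_buffer` interfaces are hypothetical placeholders standing in for the components described above.

```python
def run_episode(env, agent, replay_buffer, E_bar, threshold=0.99, max_steps=10000):
    """Minimal sketch of Algorithm 1 (illustrative interfaces only)."""
    s = env.reset()                             # initial state s_0
    for t in range(max_steps):
        a = agent.sample_action(s)              # eqn (11) for DQN, eqn (17)-(19) for SAC
        s_next, r = env.step(a * E_bar)         # E(t) = a_t * E_bar; one propagation step
        replay_buffer.append((s, a, r, s_next))
        agent.train_step(replay_buffer)         # DQN or SAC update on a sampled mini-batch
        if env.fidelity() > threshold:          # stop once P^kappa_t exceeds 0.99
            break
        s = s_next
```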

(25)

Algorithm setup.
The hyperparameters for our RL algorithm are provided in Table 1, which we manually tuned on 10 selected potentials. Nevertheless, we found that the algorithm's performance was insensitive to small variations for most of the hyperparameters. Unless specified otherwise, these values were used for all of our subsequent simulations.

Hardware/software setup.
The NIC-CAGE package is implemented in Python with the NumPy and SciPy packages; the one-step forward propagation of the Schrödinger equation and the RL algorithms are implemented in Python with the PyTorch deep learning framework. All simulations were executed on the Extreme Science and Engineering Discovery Environment (XSEDE) Comet computing cluster at the University of California, San Diego. To provide a fair comparison between the NIC-CAGE and RL approaches examined in this work, each computation utilized 2 Intel Xeon E5 cores.

The converged electric fields and fidelities for all methods are shown in Fig. 3a–c, and the corresponding power spectra are shown in Fig. 3d.

Fig. 3 Electric fields, E(t), computed by the (a) NIC-CAGE algorithm and various reinforcement learning algorithms: (b) SAC and (c) DDQN. The power spectrum for all cases is shown in panel (d).

All of the ML algorithms examined in this work automatically construct an electric field that propagates the initial state to the desired target state with a fidelity larger than 0.99. However, the difference is that the NIC-CAGE algorithm concurrently updates the electric field for all time steps in each iteration, whereas the RL algorithms developed in this paper sequentially add a new electric field data point at each time step (i.e., the RL agent "learns" the electric field in an automated fashion). Fig. 3d plots the power spectrum for each of the electric fields shown in Fig. 3a–c. Interestingly, the NIC-CAGE and SAC algorithms produce relatively smooth electric fields and power spectra, whereas the DDQN approach relies on discretized actions, resulting in an electric field and power spectrum with significant noise. As such, the SAC algorithm can be used to improve other RL methods, such as those used to construct optimal fields for spin-1/2 systems in quantum computing. The control fields obtained in these prior studies are typically not smooth,^{26} making them difficult to realize in experiments, whereas the SAC algorithm used here can ameliorate these artifacts.

When the effective mass is set to a larger value of m = 10.0, computing the optimal electric field becomes significantly more difficult. Large masses pose significant difficulties since they correspond to quantum optimal control of macroscopic objects (for example, quantum mechanical tunneling through a potential energy barrier is significantly more difficult for a larger mass than a smaller one). For some potentials, the NIC-CAGE algorithm does not converge to a high-fidelity solution within a reasonable computation time. To further compare the performance of the various RL algorithms against the traditional NIC-CAGE approach, we classified all the potentials into three groups based on the range of fidelities obtained by the NIC-CAGE algorithm: [0.75,1.0] designates easy cases, [0.01,0.75) are medium-difficulty cases, and [0.0,0.01) are hard cases. There are 89, 47, and 149 potentials in each group, respectively.

The fidelity vs. computation time for all methods is shown in Fig. 4–6. For the easy cases, the NIC-CAGE benchmark converges to high-fidelity solutions but requires long computation times. In contrast, the DDQN algorithm gives a fidelity similar to the NIC-CAGE benchmark but with less computational effort. The SAC algorithm is only slightly worse than the DDQN method. Examining the medium-difficulty and hard cases (which account for ∼66% of all the tested potentials), we find that both DDQN and SAC significantly improve on the fidelity compared to the gradient-based NIC-CAGE approach. In particular, both RL methods are significantly more effective in scenarios where the NIC-CAGE algorithm only gives a low-fidelity solution, as shown by the hard cases in Fig. 6. It is worth noting that the quantum control problem for some potentials can be difficult to converge with RL, as shown by the outliers in each of the bar plots. We discuss these special cases in the next subsection and demonstrate that increasing the computation time improves their fidelity for RL (whereas these cases still remain unsolvable with the gradient-based NIC-CAGE algorithm).

As shown in Fig. 4–6, DDQN generally produces better results than the SAC method. To identify which extension contributes the most to this superior performance, we carried out an ablation study. Table 2 shows the fidelity for all cases across Fig. 4–6 arranged in the 10, 50, and 90 percentiles. Each row is a variant of deep Q learning: basic DQN, double DQN, DQN with the dueling architecture, and double DQN with the dueling architecture. We found that both extensions improved the performance of DQN, whether used alone or in combination.

| Method | Easy | Medium | Hard | All |
|---|---|---|---|---|
| DQN | 0.931 | 0.927 | 0.833 | 0.879 |
| Double DQN | 0.957 | 0.949 | 0.847 | 0.898 |
| Dueling DQN | 0.958 | 0.958 | 0.836 | 0.894 |
| Dueling + double | 0.961 | 0.949 | 0.862 | 0.907 |

Looking forward, we anticipate that the RL techniques in this work could be used as efficient (and sometimes superior) alternatives to gradient-based approaches in quantum control problems. In particular, our RL approaches are expected to be even more efficient in high-dimensional quantum systems or applications with a large number of qubits. For both of these examples, calculations of the high-dimensional gradients would be computationally expensive, whereas the RL approach (which does not require these gradients) would be significantly more efficient. As such, these new RL techniques could be a viable option for obtaining optimal control fields of large quantum systems where gradient-based calculations are intractable or prohibitively expensive.

- D. Castaldo, M. Rosa and S. Corni, Phys. Rev. A, 2021, 103, 022613.
- K. C. Nowack, F. H. L. Koppens, Y. V. Nazarov and L. M. K. Vandersypen, Science, 2007, 318, 1430–1433.
- M. Kues, C. Reimer, P. Roztocki, L. R. Cortés, S. Sciara, B. Wetzel, Y. Zhang, A. Cino, S. T. Chu, B. E. Little, D. J. Moss, L. Caspani, J. Azaña and R. Morandotti, Nature, 2017, 546, 622–626.
- E. M. Fortunato, M. A. Pravia, N. Boulant, G. Teklemariam, T. F. Havel and D. G. Cory, J. Chem. Phys., 2002, 116, 7599–7606.
- H. J. Williams, L. Caldwell, N. J. Fitch, S. Truppe, J. Rodewald, E. A. Hinds, B. E. Sauer and M. R. Tarbutt, Phys. Rev. Lett., 2018, 120, 163201.
- A. Bartana, R. Kosloff and D. J. Tannor, Chem. Phys., 2001, 267, 195–207.
- B. L. Brown, A. J. Dicks and I. A. Walmsley, Phys. Rev. Lett., 2006, 96, 173002.
- M. J. Wright, J. A. Pechkis, J. L. Carini, S. Kallush, R. Kosloff and P. L. Gould, Phys. Rev. A: At., Mol., Opt. Phys., 2007, 75, 051401.
- M. B. Oviedo and B. M. Wong, J. Chem. Theory Comput., 2016, 12, 1862–1871.
- N. V. Ilawe, M. B. Oviedo and B. M. Wong, J. Chem. Theory Comput., 2017, 13, 3442–3454.
- N. V. Ilawe, M. B. Oviedo and B. M. Wong, J. Mater. Chem. C, 2018, 6, 5857–5864.
- M. Maiuri, M. B. Oviedo, J. C. Dean, M. Bishop, B. Kudisch, Z. S. D. Toa, B. M. Wong, S. A. McGill and G. D. Scholes, J. Phys. Chem. Lett., 2018, 9, 5548–5554.
- B. Kudisch, M. Maiuri, L. Moretti, M. B. Oviedo, L. Wang, D. G. Oblinsky, R. K. Prud'homme, B. M. Wong, S. A. McGill and G. D. Scholes, Proc. Natl. Acad. Sci. U. S. A., 2020, 117, 11289–11298.
- P. Brumer and M. Shapiro, Acc. Chem. Res., 1989, 22, 407–413.
- J. Somlói, V. A. Kazakov and D. J. Tannor, Chem. Phys., 1993, 172, 85–98.
- W. Zhu, J. Botina and H. Rabitz, J. Chem. Phys., 1998, 108, 1953–1963.
- C. Brif, R. Chakrabarti and H. Rabitz, New J. Phys., 2010, 12, 075008.
- M. Abdelhafez, D. I. Schuster and J. Koch, Phys. Rev. A, 2019, 99, 052327.
- D. J. Tannor, V. Kazakov and V. Orlov, in Control of Photochemical Branching: Novel Procedures for Finding Optimal Pulses and Global Upper Bounds, ed. J. Broeckhove and L. Lathouwers, Springer US, Boston, MA, 1992, pp. 347–360.
- N. Khaneja, T. Reiss, C. Kehlet, T. Schulte-Herbrüggen and S. J. Glaser, J. Magn. Reson., 2005, 172, 296–305.
- T. Caneva, T. Calarco and S. Montangero, Phys. Rev. A: At., Mol., Opt. Phys., 2011, 84, 022326.
- W. Zhu and H. Rabitz, J. Chem. Phys., 1998, 109, 385–391.
- X. Wang, A. Kumar, C. R. Shelton and B. M. Wong, Phys. Chem. Chem. Phys., 2020, 22, 22889–22899.
- C. Chen, D. Dong, H.-X. Li, J. Chu and T.-J. Tarn, IEEE Trans. Neural Netw. Learn. Syst., 2013, 25, 920–933.
- T. Fösel, P. Tighineanu, T. Weiss and F. Marquardt, Phys. Rev. X, 2018, 8, 031084.
- X.-M. Zhang, Z. Wei, R. Asad, X.-C. Yang and X. Wang, Npj Quantum Inf., 2019, 5, 1–7.
- M. Y. Niu, S. Boixo, V. N. Smelyanskiy and H. Neven, Npj Quantum Inf., 2019, 5, 1–8.
- M. Bukov, A. G. Day, D. Sels, P. Weinberg, A. Polkovnikov and P. Mehta, Phys. Rev. X, 2018, 8, 031086.
- J. Mackeprang, D. B. R. Dasari and J. Wrachtrup, Quantum Mach. Intell., 2020, 2, 1–14.
- P. V. D. Hoff, S. Thallmair, M. Kowalewski, R. Siemering and R. D. Vivie-Riedle, Phys. Chem. Chem. Phys., 2012, 14, 14460–14485.
- S. Thallmair, D. Keefer, F. Rott and R. de Vivie-Riedle, J. Phys. B: At., Mol. Opt. Phys., 2017, 50, 082001.
- T. Brixner and G. Gerber, ChemPhysChem, 2003, 4, 418–438.
- M. Dantus and V. V. Lozovoy, Chem. Rev., 2004, 104, 1813–1860.
- E. B. Wilson, J. C. Decius and P. C. Cross, Molecular Vibrations: the Theory of Infrared and Raman Vibrational Spectra, Dover Publications, New York, NY, 1955.
- K. Fukui, Acc. Chem. Res., 1981, 14, 363–368.
- K. Fukui, J. Phys. Chem., 1970, 74, 4161–4163.
- B. M. Wong, A. H. Steeves and R. W. Field, J. Phys. Chem. B, 2006, 110, 18912–18920.
- A. Raza, C. Hong, X. Wang, A. Kumar, C. R. Shelton and B. M. Wong, Comput. Phys. Commun., 2021, 258, 107541.
- N. Khaneja, T. Reiss, C. Kehlet, T. Schulte-Herbrüggen and S. J. Glaser, J. Magn. Reson., 2005, 172, 296–305.
- QuTiP: Quantum Toolbox in Python, https://qutip.org/docs/latest/modules/qutip/qobj.html#Qobj.expm.
- J. Johansson, P. Nation and F. Nori, Comput. Phys. Commun., 2013, 184, 1234–1240.
- H. Gharibnejad, B. Schneider, M. Leadingham and H. Schmale, Comput. Phys. Commun., 2020, 252, 106808.
- R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, MIT Press, 2018.
- A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga and A. Lerer, Automatic differentiation in PyTorch, 2017.
- M. L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming, John Wiley & Sons, 2014.
- V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al., Nature, 2015, 518, 529–533.
- T. Haarnoja, A. Zhou, P. Abbeel and S. Levine, arXiv, 2018, preprint, arXiv:1801.01290.
- Z. Wang, T. Schaul, M. Hessel, H. Hasselt, M. Lanctot and N. Freitas, International Conference on Machine Learning, 2016, pp. 1995–2003.
- H. Van Hasselt, A. Guez and D. Silver, Proceedings of the AAAI Conference on Artificial Intelligence, 2016.

## Footnote

† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d2cp02495k
