Carlos Floyd,*ab Aaron R. Dinner ab and Suriyanarayanan Vaikuntanathan*ab
aDepartment of Chemistry, The University of Chicago, Chicago, Illinois 60637, USA. E-mail: csfloyd@uchicago.edu; svaikunt@uchicago.edu
bThe James Franck Institute, The University of Chicago, Chicago, Illinois 60637, USA
First published on 15th May 2025
Active nematics are paradigmatic active matter systems which generate micron-scale patterns and flows. Recent advances in optical control over molecular motors now allow experimenters to control the non-equilibrium activity field in space and time and, in turn, the patterns and flows. However, engineering effective activity protocols remains challenging due to the complex dynamics. Here, we explore a model-free approach for controlling active nematic fields using reinforcement learning. Combining machine learning with trial-and-error exploration of the system dynamics, reinforcement learning bypasses the need for accurate parameterization and model representation of the active nematic. We apply this technique to demonstrate how local activity fields can induce effective interactions between nematic defects, enabling them to follow designer dynamical laws. Moreover, the sufficiency of our low-dimensional system observables and actions suggests that coarse projections of the active nematic field can be used for precise feedback control, making experimental or biological implementation of such feedback loops plausible.
Recent works have focused on guiding active nematic flows through the external control of spatiotemporally dynamic activity fields α(r,t). Experimental advances in optical actuation of molecular motors using light fields motivate studying activity fields as externally controlled functions.18–26 Light-based control has also recently been developed over Ca2+-powered assembly and contraction of proteins found in protists27,28 and over cytoskeletal network growth.29 In simulations, optimal time-dependent activity fields α(r,t) can be derived using knowledge of the nematohydrodynamic equations of motion to guide nematic and polar defects along desired paths.30–33 Other techniques allow targeted modulation of nematic channel flow and design of localized “topological tweezers” for precise defect manipulation.34–36
A key difficulty in implementing the above-mentioned control methods is their reliance on accurate system models and parameterization. Here, we explore controlling active nematics using a model-free machine learning technique called reinforcement learning (RL).37,38 This approach allows a program to develop a closed-loop control policy for a dynamical system purely through trial and error, without relying on precise model specifications.
To our knowledge RL has not yet been applied to control active nematics, although previous studies used it to control other active matter systems such as flockers, driven colloids, and branched actin networks.39–41 These studies focus on design tasks that aim to achieve a specific static property of the system, such as target net flocking motion or average cluster size. In contrast, we study control tasks that guide active nematic defects to follow prescribed virtual interactions with one another (see Fig. 1).42,43 Our goal is to learn activity field protocols that can effectively override the natural defect dynamics (such as passive Coulomb-like attraction) to impose a user-specified interaction law, a more general design objective termed “cyberphysics” in ref. 38. Additionally, we show that efficient protocols can be learned even when feedback is limited to imperfect information such as a coarse projection of the full nematic field configuration.
Fig. 1 Schematic overview of the closed-loop control over active nematic defect dynamics enabled by a trained RL policy. One step of the feedback loop is depicted.
![]() | (1) |
![]() | (2) |
![]() | (3) |
Although defects in an active nematic fluid evolve under complex hydrodynamics (see eqn (10) and (11) below), theoretical work has shown that their motion can be approximated using simpler one- and two-body dynamical equations.35,45,46 The positions p± and orientations ϕ± approximately obey differential equations which are coupled to each other through elastic interactions, and they are also coupled to the local activity profile α(r) and its gradients. These equations are derived from the hydrodynamic Stokes flow of the active nematic fluid and involve numerical prefactors such as the effective defect size, friction coefficient, elasticity constant, shear viscosity, and others. As predicted by these equations, in the absence of activity (i.e., for passive nematics) a pair of oppositely-charged defects will attract each other without rotating in a Coulomb-like interaction until they annihilate, preserving the total topological charge. This default behavior is illustrated in Fig. 3. These effective dynamical equations enable the design of precise “activity tweezers” that manipulate defect dynamics, but this technique is complicated by the difficulty of estimating the model parameters involved.35,47 Here, we simply use the fact that defect positions and orientations couple in some way to activity gradients to find tweezer-like activity protocols purely through exploration (i.e., RL). This approach is thus agnostic to the underlying model parameters and dynamics.
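As a schematic caricature of this default behavior (and not of the full nematohydrodynamics or the effective equations of ref. 35, 45 and 46), the sketch below integrates a two-body Coulomb-like attraction between a ±1/2 defect pair; the mobility `mu` and the core radius `r_core` at which the pair is considered annihilated are arbitrary illustrative values rather than the model-derived prefactors discussed above.

```julia
# Illustrative two-body caricature of passive defect relaxation: oppositely charged
# defects drift toward each other with speed ~ 1/separation and annihilate when their
# cores overlap. The mobility mu and core radius r_core are arbitrary stand-ins.
function passive_pair_trajectory(p_plus, p_minus; mu = 1.0, r_core = 1.0, dt = 0.01, nsteps = 100_000)
    traj = [(copy(p_plus), copy(p_minus))]
    for _ in 1:nsteps
        r = p_minus .- p_plus
        d = hypot(r[1], r[2])
        d < r_core && break                 # pair annihilates, conserving total charge
        f = mu .* r ./ d^2                  # Coulomb-like attraction, |f| ~ 1/d
        p_plus  = p_plus  .+ dt .* f        # defects approach each other without rotating
        p_minus = p_minus .- dt .* f
        push!(traj, (copy(p_plus), copy(p_minus)))
    end
    return traj
end

traj = passive_pair_trajectory([0.0, 0.0], [30.0, 0.0])
length(traj)    # number of integration steps until annihilation
```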
For every task we run several experiments using different task parameters or different seeds for the pseudo-random number generator, and the outcome of each experiment is a trained policy function πθ(S). An experiment is divided into episodes, each of which begins by resetting the active nematic field using some distribution of initial conditions (Fig. 4). Every episode consists of a fixed number of steps, Nsteps, although an episode may terminate before this number of steps if the number of defects changes through annihilation or creation. At every step, the RL agent views the state S and chooses a new action A, corresponding to a given configuration of the activity field α(r,t), which is applied for the duration of the step. A step consists of Ndt iterations of the numerical integrator. The fixed time interval which elapses during one step is thus Ndtdt where dt is the time resolution of the integrator; in simulation units we have dt = 1. The experiment ends when a number of episodes, Neps, is reached or when the wall time of the program exceeds a specified value. Throughout this paper we set Ndt = 50, Nsteps = 75, and we run experiments for 3 hours, corresponding to roughly Neps = 150 episodes per experiment. The choice to fix wall time rather than Neps is arbitrary and made simply for convenience.
Fig. 4 Schematic illustration of the structure of an experiment, episode, and step as used in our RL training program.
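To make the nesting of experiments, episodes, and steps concrete, the following minimal, runnable sketch reproduces the loop structure described above. The "system" and the helper functions (reset_nematic!, observe_state, select_action, apply_activity!, step_integrator!, compute_reward, terminated) are toy stand-ins introduced only for this illustration; in the actual program these calls go to the nematic solver and the DDPG agent.

```julia
# Toy stand-in for the active nematic: only the defect separation and defect count.
mutable struct ToySystem
    rsep::Float64
    ndefects::Int
end

# Placeholder stubs (hypothetical names) marking where the real solver/agent plug in.
reset_nematic!()              = ToySystem(30.0 + 40.0 * rand(), 2)
observe_state(sys::ToySystem) = (sys.rsep - 55.0) / 50.0          # low-dimensional state
select_action(S)              = clamp(-S + 0.05 * randn(), -1.0, 1.0)
apply_activity!(sys, A)       = nothing                           # would set alpha(r, t)
step_integrator!(sys, A, dt)  = (sys.rsep += 0.02 * A * dt)       # stand-in dynamics
compute_reward(sys)           = -abs(sys.rsep - 55.0)
terminated(sys)               = sys.ndefects != 2                 # annihilation/creation

function run_toy_experiment(; Ndt = 50, Nsteps = 75, dt = 1.0,
                              wall_limit = 3 * 3600.0, max_episodes = 150)
    t_start, episode = time(), 0
    while time() - t_start < wall_limit && episode < max_episodes
        episode += 1
        sys = reset_nematic!()                   # new episode: re-drawn initial condition
        for s in 1:Nsteps
            S = observe_state(sys)
            A = select_action(S)
            apply_activity!(sys, A)              # activity held fixed during the step
            for _ in 1:Ndt
                step_integrator!(sys, A, dt)     # Ndt * dt time units elapse per RL step
            end
            R = compute_reward(sys)              # the agent stores (S, A, R, T) here
            terminated(sys) && break             # end episode early if defects change
        end
    end
    return episode
end

run_toy_experiment()
```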
The main actor and critic networks are updated using stochastic gradient descent (SGD) with mini-batches of Nbatch = 32 samples from a replay buffer, which stores tuples (S(s), A(s), R(s), T(s)) (state, action, reward, and termination flag) collected at every step s. For each experiment, an initial random policy is used for 10Nbatch steps to populate the replay buffer, after which the main actor network μθ(S) is used for action selection, with SGD updates performed every 5 steps. How the mini-batches are used to update the network parameters is described in ref. 48 and 49. All neural networks have two hidden layers with 32 neurons each, and they are trained using the ADAM optimizer with a learning rate of 0.001 and a weight norm clip of 1.0. The discount factor is γ = 0.99 (see ref. 48 and 49 for a definition) and the weight transfer factor is ρ = 0.995. These parameter values were chosen to be consistent with previous implementations of the DDPG method.48,49
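The two bookkeeping ingredients quoted above, uniform mini-batch sampling from the replay buffer and the soft "weight transfer" update of the target networks with ρ = 0.995, can be sketched independently of any deep-learning package as follows. The Transition container and the raw-array "networks" are illustrative only; the actual actor and critic gradient steps follow ref. 48 and 49.

```julia
# Replay-buffer sampling and Polyak target update, sketched with plain arrays.
struct Transition
    S::Vector{Float64}   # state
    A::Vector{Float64}   # action
    R::Float64           # reward
    T::Bool              # termination flag
end

# Uniform sampling of Nbatch = 32 transitions from the buffer.
sample_minibatch(buffer::Vector{Transition}, Nbatch::Int = 32) =
    buffer[rand(1:length(buffer), Nbatch)]

# Soft update theta_target <- rho * theta_target + (1 - rho) * theta_main, parameter-wise.
function polyak_update!(target_params, main_params; rho = 0.995)
    for (pt, pm) in zip(target_params, main_params)
        @. pt = rho * pt + (1 - rho) * pm
    end
    return target_params
end

# Toy usage: a buffer of 10 * Nbatch random transitions and a tiny two-layer "network".
buffer = [Transition(randn(1), randn(1), randn(), false) for _ in 1:10 * 32]
main   = [randn(32, 1), randn(32), randn(1, 32), randn(1)]
target = deepcopy(main)
batch  = sample_minibatch(buffer)
polyak_update!(target, main)
```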
The actions are chosen as A = πθ(S) + ε, where ε ∼ 𝒩(0, 0.05) is a small Gaussian noise term added to the policy output to encourage exploration of the available actions. The resulting action A is then clipped to the range [−1,1]. State values provided as inputs to the network are normalized to lie approximately within the same range. For each task, the action values are linearly mapped from this range to the corresponding scaled parameters of the activity field α(r,t), such that −1 maps to the lowest scaled values and 1 maps to the largest.
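A minimal sketch of this action post-processing (noise injection, clipping, and the linear map onto a physical parameter range) is given below; the helper names are ours, and the range [−5, 5] in the usage lines corresponds to the task-1 activity amplitude described later.

```julia
# Exploration noise, clipping to [-1, 1], and the linear map onto a physical range.
# sigma = 0.05 is the noise scale quoted above; helper names are illustrative.
noisy_action(policy_output; sigma = 0.05) =
    clamp.(policy_output .+ sigma .* randn(length(policy_output)), -1.0, 1.0)

# Map a clipped action component from [-1, 1] onto the task-specific range [lo, hi].
to_physical(a, lo, hi) = lo + (a + 1.0) / 2.0 * (hi - lo)

A      = noisy_action([0.3])            # a one-component action
alpha0 = to_physical(A[1], -5.0, 5.0)   # e.g. an activity amplitude in [-5, 5]
```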
To numerically integrate the active nematic hydrodynamics equations in eqn (10), we use a custom Julia implementation of Heun's finite difference method. The implementation was previously described and validated in ref. 50–52. We parameterize the nematic system so that the unit of length is equal to the equilibrium nematic persistence length (see Appendix).53 To train the RL policy, we combine this numerical solver with an implementation of the DDPG algorithm provided by the ReinforcementLearning.jl package.54
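For reference, a generic Heun (explicit trapezoidal) update for an array-valued field is sketched below; this is a textbook illustration of the scheme named above, not the solver of ref. 50–52.

```julia
# Generic Heun (explicit trapezoidal) step for dQ/dt = f(Q) on an array-valued field.
function heun_step(Q, f, dt)
    k1 = f(Q)                          # slope at the current field (predictor)
    Qp = Q .+ dt .* k1                 # explicit Euler predictor
    k2 = f(Qp)                         # slope at the predicted field (corrector)
    return Q .+ (dt / 2) .* (k1 .+ k2) # trapezoidal combination
end

Q  = randn(64, 64)                     # toy field on a 64 x 64 grid
Q1 = heun_step(Q, q -> -q, 0.1)        # one step of the relaxational dynamics dQ/dt = -Q
```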
Fig. 5 Results for task 1: to reach a target horizontal separation l0. (A) Schematic illustration of the task, in which a nearly uniform disk of activity is applied to the center of the +1/2 defect. The amplitude of the activity is controlled as an action, and its radial profile for different amplitudes is shown on the right. The dashed green line denotes the position of the +1/2 defect in the activity field at the start of each step. (B) Trajectories of the activity amplitude (top) and resulting horizontal separation rsep (bottom) with a trained RL policy. The dashed line represents l0 = 55 for this task. (C) Snapshots of the active nematic field at the end of three different steps for the episode in panel (B). See Fig. 3B caption for a description of the visualization. Color here denotes the amplitude of the activity. The black dashed line represents the target separation l0 = 55. (D) Two trajectories of rsep with a trained RL policy for five values of l0, shown as different colors. The values of l0 are shown as dashed lines. (E) Fitted values of lfit0, plotted against their target values of l0. Standard deviations are shown as shaded areas (smaller than symbols).
The state used in the RL algorithm at step s is S(s) = (rsep(s) − l0)/50 (where 50 is a rough scale factor), and the action is the amplitude α0 of the activity profile scaled to the range [−5,5]. We note that allowing the activity to change sign is unphysical as a typical active nematic fluid has either only extensile (α > 0) or contractile (α < 0) force dipoles. We allow activity to change sign in this task as a first illustration of a simple control mechanism, and in the remaining tasks we constrain the sign of the activity and only vary its magnitude or position. The shape of the activity profile is an azimuthally symmetric function centered on p+ with radial dependence
![]() | (4) |
A typical episode for l0 = 55 is shown in Fig. 5B and C. The policy applies a strong negative activity field to the +1/2 defect, which propels it away from the −1/2 defect. Upon reaching the desired separation of 55 lattice units, the applied activity weakens in magnitude and, in a feedback control-like manner, nudges the +1/2 defect further backward whenever the −1/2 defect moves toward it due to the attractive elastic interaction between them. This maintains the separation at rsep = 55 for the remainder of the episode. Although in principle the activity could be applied when the separation is at the spring equilibrium, rsep = l0, to exactly balance the attractive force, the actual separation oscillates slightly around this point due to the discretization of the simulation domain. We also do not expect the algorithm to learn a numerically exact policy.
We train RL policies for a range of values of l0 and observe that in each case the algorithm converges to a policy in which the target separation is reached by the end of each episode (Fig. 5D). We fit the learned lfit0 by taking the average of rsep over the last 25 steps of each episode, over the last 40 episodes of each experiment, and over 10 experiments using different random initial seeds. The learned lfit0 matches the target l0 (Fig. 5E) in each case. Thus, RL can learn how to adjust the activity amplitude as a function of defect position to reach a target separation, for a simple physical set-up in which the defects are aligned and move primarily due to the activity-induced propulsion of the +1/2 defect.
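This nested averaging can be written compactly as below; the data layout rsep_data[experiment][episode] (a vector of rsep values over the steps of that episode) is an assumption made only for this illustration.

```julia
using Statistics

# Nested averaging used to estimate l0_fit: mean over the last 25 steps of each episode,
# then over the last 40 episodes of each experiment, then over the 10 experiments.
episode_mean(traj)        = mean(traj[end-24:end])                     # last 25 steps
experiment_mean(episodes) = mean(episode_mean.(episodes[end-39:end]))  # last 40 episodes
l0_fit(rsep_data)         = mean(experiment_mean.(rsep_data))          # all experiments

# Toy data: 10 experiments x 50 episodes x 75 steps, fluctuating around rsep = 55.
rsep_data = [[55.0 .+ 0.5 .* randn(75) for _ in 1:50] for _ in 1:10]
l0_fit(rsep_data)    # ≈ 55
```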
drsep/dt = −k0(rsep − l0), | (5) |
![]() | (6) |
![]() | (7) |
We train the RL program to achieve these target dynamics by leveraging a recently demonstrated modality of defect motion under activity gradients, rather than the modality of +1/2 defect propulsion under constant activity used in the previous task. −1/2 defects have been shown to couple to second-order and higher gradients in α(r),35 and, using this fact, we parameterize the activity field in the vicinity of the −1/2 defect using the profile of eqn (4) with shape parameters 1 and 5, which has non-zero gradients of all orders. We allow the RL program to adjust the amplitude α0 of the activity field (in the range [0,12]) and the horizontal offset r− between the −1/2 defect and the center of the activity field (in the range [−10,10]); see Fig. 2 and 6A.
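The two-component action for this task maps linearly onto the amplitude and offset ranges quoted above; a sketch (with our own helper names) is:

```julia
# Decode the two-dimensional action for task 2: component 1 sets the amplitude alpha0
# in [0, 12], component 2 the horizontal offset r_minus in [-10, 10] between the
# -1/2 defect and the center of the activity field.
scale(a, lo, hi) = lo + (clamp(a, -1.0, 1.0) + 1.0) / 2.0 * (hi - lo)

function decode_task2_action(A::AbstractVector)
    alpha0  = scale(A[1],   0.0, 12.0)   # activity amplitude
    r_minus = scale(A[2], -10.0, 10.0)   # horizontal offset of the field center
    return alpha0, r_minus
end

decode_task2_action([0.5, -0.2])         # -> (9.0, -2.0)
```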
Fig. 6 Results for task 2: for rsep to obey the dynamics of an overdamped spring with rest length l0 = 50 and varying stiffness k0. (A) Schematic illustration of the task, in which an inhomogeneous activity field is applied near the −1/2 defect. The position of the activity field and its amplitude are controlled as actions, and its radial profile for different amplitudes is shown on the right. (B) Top: Trajectories of the activity amplitude (black) and horizontal distance from the −1/2 defect (red) for a trained RL policy. Bottom: The resulting horizontal separation rsep trajectory. The thick dashed line on the bottom represents l0 = 50, the target stiffness is k0 = 0.0015, and the thin dashed line denotes an exponential fit with kfit0 = 0.00124. (C) Snapshots of the active nematic field at the end of three different steps for the episode in panel (B). See Fig. 3B caption for a description of the visualization. The black dashed line represents the target separation l0 = 50. (D) Top: A set of 20 trajectories of rsep with a trained RL policy for k0 = 0.001, with exponential fits shown as dashed lines. Bottom: Same as top, but with k0 = 0.004. (E) Fitted values kfit0 plotted against their target values of k0. Standard deviations are shown as shaded areas. For this plot we exclude episodes which randomly start within 2 of rsep = 50 to focus on trajectories in which rsep changes appreciably during the episode. See Video S1 (ESI†) for a movie of the trained RL algorithm performing this task.
We fix l0 = 50 in eqn (5) and vary k0, the overdamped spring constant. We show typical trajectories obtained with a trained policy for k0 = 0.0015 in Fig. 6B and C. The policy simultaneously adjusts the amplitude and offset of the activity profile relative to the −1/2 defect so that the defect separation rsep gradually approaches the “rest length” of l0 = 50. Viewing several episodes for policies trained on k0 = 0.001 and k0 = 0.004, we see that the deviation from the target decays roughly exponentially with a rate that depends on the target stiffness (Fig. 6D).
We consider a range of k0 values and average kfit0, which we obtain as the fitted rate of exponential approach of rsep(s) to l0, over the last 40 episodes of each experiment and over 10 experiments using different random initial seeds (Fig. 6E). We observe good agreement between k0 and kfit0, albeit with deviations at high values of k0. At these values the policy has difficulty pulling the defects faster than the nematic material timescales allow. Despite these limitations, the generally good agreement between k0 and kfit0 indicates that the RL program can utilize a range of physical effects (including coupling to gradients of activity) to pull defects and make them obey target dynamics rather than just target static configurations.
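One simple way to extract such a rate from a single trajectory is a linear least-squares fit of log|rsep − l0| against elapsed time; the recipe below is our illustration and may differ in detail from the fitting used for Fig. 6E.

```julia
# Estimate kfit0 from a trajectory rsep(s): under the target dynamics the deviation
# |rsep - l0| decays as exp(-k0 * t), so the slope of log|rsep - l0| vs. time is -kfit0.
function fit_k(rsep::AbstractVector, l0::Real, dt_step::Real)
    t = dt_step .* (0:length(rsep)-1)            # elapsed time at each RL step
    y = log.(abs.(rsep .- l0) .+ 1e-12)          # guard against log(0) at the target
    tbar, ybar = sum(t) / length(t), sum(y) / length(y)
    slope = sum((t .- tbar) .* (y .- ybar)) / sum((t .- tbar) .^ 2)
    return -slope
end

# Synthetic check: a trajectory decaying at k0 = 0.0015, with Ndt * dt = 50 time units
# elapsing per RL step, recovers kfit0 ≈ 0.0015.
rsep = 50.0 .+ 30.0 .* exp.(-0.0015 .* 50.0 .* (0:74))
fit_k(rsep, 50.0, 50.0)
```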
d sin(ϕ+)/dt = −kθ sin(ϕ+), | (8) |
Fig. 7 Results for task 3: for sin(ϕ+) to obey the dynamics of an overdamped spring with zero rest length and varying stiffness kθ. (A) Schematic illustration of the task, in which an inhomogeneous activity field is applied near the +1/2 defect. The negative angle ϕα is indicated. This angle is controlled as an action, and the activity field's radial profile is shown on the right. The dashed green line denotes the position of the +1/2 defect in the activity field at the start of each step. (B) Top: Trajectory of the angular position ϕα of the activity field for a trained RL policy. Bottom: The resulting trajectory for orientation of the +1/2 defect sin(ϕ+). The thick dashed line on the bottom represents sin(ϕ+) = 0, the target stiffness is kθ = 0.0007, and the thin dashed line denotes an exponential fit with kfitθ = 0.00067. (C) Snapshots of the active nematic field at the end of three different steps for the episode in panel (B). See Fig. 3B caption for a description of the visualization. The black dashed wedge represents the current orientation sin(ϕ+) which approaches the target value of 0. (D) Top: A set of 20 trajectories of sin(ϕ+) with a trained RL policy for kθ = 0.00025, with exponential fits shown as dashed lines. Bottom: Same as top, but with kθ = 0.0008. (E) Fitted values kfitθ plotted against the target value of kθ. For this plot we exclude episodes which randomly start within 0.05 of sin(ϕ+) = 0 to focus on trajectories in which sin(ϕ+) changes appreciably during the episode. See Video S2 (ESI†) for a movie of the trained RL algorithm performing this task.
We vary the stiffness kθ in eqn (8). We show typical trajectories obtained with a trained policy for kθ = 0.0007 in Fig. 7B and C. Under the policy, the angle ϕα fluctuates around a mean value so as to cause the +1/2 defect to rotate at approximately the desired exponential rate, kθ, toward ϕ+ = ζ+ = 0. In Fig. 7D we show several episodes for kθ = 0.00025 and kθ = 0.0007, indicating a clear difference in the learned decay rate of ζ+ toward 0. Fitting kfitθ for the last 40 episodes of each experiment and over 10 experiments with different random initial seeds, we observe good agreement with the target value of kθ (Fig. 7E). We thus see that RL can make both the positions and orientations of defects appear to follow dynamical laws that differ from those intrinsic to the system.
Our current implementation has several limitations that could be improved in future work. First, we employed a fairly restrictive parameterization of the controlled activity field, chosen to ensure a low-dimensional and easily trainable action space. This design supports the idea that simple spatiotemporal patterns can couple effectively to defect dynamics in active nematics. However, future work could refine this approach, for example, by using multi-agent control40,57 to explore higher-dimensional settings that might allow reconstruction of more spatially intricate control fields such as those shown to function as precise “topological tweezers”.35,36 Second, our model of active nematics includes several simplifying assumptions: we adopt a high-friction limit,35,58 assume negligible density variations,45 and omit detailed modeling of the dynamics of the activity-carrying agents, such as dispersal.59,60 These choices were made for computational efficiency, as the RL algorithm requires many trial iterations to train. Our algorithm converged in under three hours without heavy optimization, however, suggesting substantial room to scale up in future implementations. A key advantage of the RL framework is its agnosticism to the specific physical model, which means that, at least in principle, it can learn effective control strategies across a range of active fluid models. Finally, our reward computation could be improved. We observe small deviations from ideal spring-like behavior in cases with soft spring constants (near k0 = 0) or when the defects are near their equilibrium separation. In these regimes, the expected defect motions are minimal, so small erroneous motions are weakly penalized. Achieving more uniform adherence to spring-like behavior may require a reweighted reward function, and further refinements could enable more complex, multi-objective control tasks.
The scientific merit of demonstrating control over active nematics via RL is at least two-fold. First, as mentioned earlier, RL offers a practical method for engineering soft active matter systems with minimal reliance on accurate model specification, although it does present several technical challenges. When deploying RL to control a new system, careful attention must be given to parameterizing the control fields and designing the reward functions in a way that allows the algorithm to efficiently learn to solve the abstract optimization problem. Like other machine learning applications, this process involves fine-tuning various hyperparameters, such as learning rates and neural network architectures. Additionally, reliable state observations are important (although there exist stochastic variants of RL using uncertain state measurements), and accurately measuring nematic field configurations in experiments requires considerable technical attention.61,62 Despite these challenges, RL has a proven track record of solving highly complex problems, often surpassing human capabilities,63,64 making its application to controlling active matter systems quite promising.
Second, the ability of RL programs to successfully manipulate the dynamics of active nematic defects using very low-dimensional state and action spaces suggests that an effective low-dimensional description is sufficiently accurate to capture the defects' dynamics. This finding supports the theoretical arguments presented in ref. 35 and 45. Relatedly, in previous work we demonstrated that dynamic activity fields can be iteratively constructed without RL and without knowledge of the active nematic dynamics using physically motivated yet imperfect local feedback rules.65 Recent experiments have similarly demonstrated closed-loop control over active nematic flow speeds which matches predictions from a highly coarse-grained model.18 We have also demonstrated that low-dimensional RL may in principle be used to provide feedback control over other biomimetic mechanochemical protein networks such as the ultrafast Ca2+-powered contractile proteins found in certain protists.27,28 The sufficiency of imperfect, low-dimensional information to guide active nematics defects has implications for designing experimentally efficient feedback control over living systems and for understanding what information is sufficient to enable biologically plausible control over mechanochemical dynamics during biophysical processes like morphogenesis.7–15
![]() | (9) |
The tensor Q couples to a flow field v and spontaneously seeks to minimize its local free energy. We consider the overdamped limit and the limit of high substrate friction (as in ref. 35, 45, 58), yielding the following equations of motion:
∂tQij = Sij(v) + ΓHHij, | (10) |
vi = (1/γv)∂k(σaik(Q) + σEik(Q)). | (11) |
Here, Hij is the symmetric and traceless part of −δF/δQij, where F is the free energy functional (see below), Sij is a flow-coupling term, and γv is a coefficient of friction. In this overdamped limit, v is an instantaneous function of Q, so eqn (10) is closed in Q. This simplification allows for computational efficiency, which is advantageous in training RL programs. The active stress tensor is given by67,68
σaij = −αQij, | (12) |
![]() | (13) |
The Landau–de Gennes free energy is:
![]() | (14) |
![]() | (15) |
The isotropic part of the passive stress in this model depends only on f as we neglect pressure contributions resulting from variations in density of the fluid.45
The flow coupling term is
Sij(v) = −vk∂kQij + ΦikQkj+ + Qik+Φkj − 2ξQij+(Qkl∂kvl),
![]() | (16) |
![]() | (17) |
![]() | (18) |
Φij = ξΨij − Ωij. | (19) |
In these equations, ξ, A0, U, ΓH, L, γv are parameters whose meanings are described in ref. 21, 52, 66, to which we refer the reader for additional details on this model. Throughout the paper we use ξ = 0.7, A0 = 0.1, U = 3.5, ΓH = 1.5, L = 0.1, and γv = 10 (all in simulation units). For these choices the equilibrium nematic persistence length is the same as one length unit.53
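To make the high-friction limit concrete, the sketch below evaluates the velocity of eqn (11) from the active stress of eqn (12) alone (the elastic stress σE and all other terms are omitted for brevity). The grid size, the random Q components, the Gaussian activity spot, and the helper names are illustrative choices; γv = 10 matches the value quoted above.

```julia
# Velocity from the active stress alone in the high-friction limit of eqn (11).
# Q is stored as its two independent components (Qxx, Qxy) on a periodic grid;
# derivatives are central finite differences with grid spacing h.
dx(A; h = 1.0) = (circshift(A, (-1, 0)) .- circshift(A, (1, 0))) ./ (2h)
dy(A; h = 1.0) = (circshift(A, (0, -1)) .- circshift(A, (0, 1))) ./ (2h)

function overdamped_velocity(Qxx, Qxy, alpha; gamma_v = 10.0)
    sxx = -alpha .* Qxx                     # active stress, eqn (12): sigma^a = -alpha * Q
    sxy = -alpha .* Qxy
    syy = -sxx                              # tracelessness of Q implies sigma_yy = -sigma_xx
    vx = (dx(sxx) .+ dy(sxy)) ./ gamma_v    # v_i = (1/gamma_v) * d_k sigma_ik
    vy = (dx(sxy) .+ dy(syy)) ./ gamma_v
    return vx, vy
end

# Toy usage: weak random Q components and a Gaussian activity spot on a 64 x 64 grid.
Qxx, Qxy = 0.1 .* randn(64, 64), 0.1 .* randn(64, 64)
alpha    = [5.0 * exp(-((i - 32)^2 + (j - 32)^2) / 50) for i in 1:64, j in 1:64]
vx, vy   = overdamped_velocity(Qxx, Qxy, alpha)
```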
Footnote
† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d5sm00063g |