Michele Caraglio*, Harpreet Kaur, Lukas J. Fiderer, Andrea López-Incera, Hans J. Briegel, Thomas Franosch and Gorka Muñoz-Gil
Institut für Theoretische Physik, Universität Innsbruck, Technikerstraße 21A, A-6020, Innsbruck, Austria. E-mail: michele.caraglio@uibk.ac.at
First published on 8th February 2024
Finding the best strategy to minimize the time needed to find a given target is a crucial task both in nature and in reaching decisive technological advances. By considering learning agents able to switch their dynamics between standard and active Brownian motion, here we focus on developing effective target-search behavioral policies for microswimmers navigating a homogeneous environment and searching for targets of unknown position. We exploit projective simulation, a reinforcement learning algorithm, to acquire an efficient stochastic policy represented by the probability of switching the phase, i.e. the navigation mode, in response to the type and the duration of the current phase. Our findings reveal that the target-search efficiency increases with the particle's self-propulsion during the active phase and that, while the optimal duration of the passive phase decreases monotonically with the activity, the optimal duration of the active phase displays a non-monotonic behavior.
In many relevant circumstances, the agent has no a priori knowledge of the target location and has to develop effective stochastic strategies that minimize, at least on average, the search time in an environment with randomly distributed targets. Motivated by observational data, physical intuition, and analytical tractability, Lévy walks21–23 and intermittent searches1,24–26 are among the statistical strategies that have received major attention in the past. In the former, the agent undergoes straight runs at constant speed with run lengths l drawn from a Lévy distribution p(l) ∝ l^(−(α+1)), with 0 < α < 2, and the target is detected if the searcher passes closer than a threshold distance, which also acts as a small-length cut-off that makes the Lévy distribution normalizable. The optimal value of α depends sensitively on model details such as the revisitability and mobility of the targets or the complexity of the environment.22,27–29 Intermittent-search strategies have been proposed based on the observation that fast movements allow quick exploration of the whole environment but may, on the other hand, significantly degrade perception abilities.3 In these strategies, phases of diffusive motion permitting target detection alternate with phases of ballistic motion, which allow quick relocation to different positions at the cost of not being able to detect the target. In the simplest version of the model, the agent switches from one phase to the other with a fixed rate, leading to exponentially distributed phase durations, but other distributions have also been considered.30,31 The mean search time of these strategies can be minimized under broad conditions. In particular, it has been shown that there is an optimal duration of the ballistic nonreactive phase which depends only on the dimensionality of the system and is independent of the details of the slow reactive phase.26 Intermittent-search strategies also remain robust for different target distributions such as patches32 and Poissonian distributions in one dimension.33
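As an illustration of the Lévy-walk ingredient described above, the following minimal sketch draws run lengths from a power law p(l) ∝ l^(−(α+1)) with a lower cut-off. The function name `sample_run_length` and the parameter `l_min` (standing for the detection threshold that acts as cut-off) are hypothetical names chosen for this example, and inverse-transform sampling is just one possible way to generate such samples.

```python
import random

def sample_run_length(alpha, l_min, rng=random):
    """Draw l from p(l) ∝ l^-(alpha+1) for l >= l_min (inverse-transform sampling).

    The cumulative distribution is F(l) = 1 - (l_min / l)^alpha, so that
    l = l_min * u^(-1/alpha) with u uniform in (0, 1].
    """
    if not 0 < alpha < 2:
        raise ValueError("alpha must lie in (0, 2)")
    u = 1.0 - rng.random()  # uniform in (0, 1], avoids a division by zero
    return l_min * u ** (-1.0 / alpha)
```

Smaller values of α give heavier tails, i.e. more frequent very long runs.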
In the past decade, machine learning has emerged as a revolutionary tool helping to elucidate various aspects of active matter systems.34 In particular, reinforcement learning (RL)35 and genetic algorithms36 have proved to be powerful tools able to identify successful swimming strategies improving the navigation performances of microswimmers and their odds of reaching a target. Promising and worthy results have been obtained in several situations including simple energy landscapes,37 viscous surroundings,38–40 complex motility fields,41 and steady or turbulent flows.42–46 However, previous literature has mainly focused on increasing the net flux of particles in a certain direction or on optimizing point-to-point navigation towards a target whose position is fixed and then implicitly learned during the learning process. Thus, notwithstanding the increasing popularity of machine-learning algorithms in the active matter field34,47 and the previously mentioned seminal works, investigation of stochastic target-search problems with randomly distributed targets via machine-learning approaches remains largely unexplored, with only a couple of very recent exceptions.48,49
Muñoz-Gil et al.48 have applied RL methods to learn optimal foraging strategies outperforming the efficiency of Lévy walks in the case of revisitable, sparsely distributed targets. In their setup, the learning agent performs a stepwise motion with constant velocity and, at each step, decides whether to maintain the current direction or to turn in a new random one, this choice being based only on the length of the current straight segment. Whenever the agent detects a target, it receives a reward and, through several trials, it optimizes its policy, learning an efficient distribution of the length of the straight segments. Compared with traditional analytical approaches, this approach has the advantage of not being restricted to a specific ansatz for the straight-segment length distribution. However, it remains focused on investigating known idealized scenarios, which are not entirely apt for describing the behavior of real microswimmers.
On the other hand, Kaur et al.49 rely on the active Brownian particle (ABP),50 which well describes the behavior of artificial microswimmers, and show that genetic algorithms manage to address the problem of finding targets of unknown positions for particles able to decide if and when switching their behavior between standard passive Brownian diffusion and directed ABP motion. In particular, they use the algorithm NeuroEvolution of Augmenting Topologies51 to evolve an initial population of particles taking random decisions towards a population in which the majority of particles are optimized to solve the target-search problem. However, their findings are limited by the fact that, in their setup, a given individual particle acts deterministically in the sense that it always selects the same duration for each phase from a set of predetermined durations.
In the present manuscript, we combine the two former approaches: while, as in ref. 49, we again resort to agents able to switch their behavior between passive and active Brownian motion, we here exploit the powerful RL framework employed in ref. 48, thus allowing our agents to learn a distribution of durations for each of the two phases. This results in a stochastic strategy maximizing the foraging efficiency in a homogeneous environment, which can eventually be tested in experiments with artificial Janus particles52,53 where the activity is controlled by an external illuminating system.38
Including these notions into the standard ABP model50 in a homogeneous environment results in the following set of Langevin equations, discretized according to the Itô rule,
![]() | (1) |
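$$\mathbf{r}_{t+\Delta t} = \mathbf{r}_t + \phi_t\, v\, \mathbf{u}_t\, \Delta t + \sqrt{2D\,\Delta t}\;\boldsymbol{\xi}_t \tag{2}$$
$$\vartheta_{t+\Delta t} = \vartheta_t + \sqrt{2D_\vartheta\,\Delta t}\;\eta_t \tag{3}$$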
Here Δt is the integration time step, r_t = (x_t, y_t) is the position at time t, and u_t = (cos ϑ_t, sin ϑ_t) denotes the instantaneous orientation of the self-propulsion velocity with constant modulus v. D and D_ϑ are the translational and rotational diffusion coefficients, respectively. Finally, the components of the vector noise ξ_t = (ξ_{x,t}, ξ_{y,t}) and the scalar noise η_t are independent random variables, distributed according to a Gaussian with zero mean and unit variance. Note that when the particle is in the passive Brownian phase (ϕ_t = 0), the spatial evolution is decoupled from the orientational diffusion of the self-propulsion.
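A single integration step of the dynamics described above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes that the position is advanced by the self-propulsion term ϕ_t v u_t Δt plus a Gaussian displacement of amplitude √(2DΔt) per component, and that the orientation diffuses with amplitude √(2D_ϑΔt) in both phases. The function name `abp_step` and its argument names are hypothetical.

```python
import math
import random

def abp_step(x, y, theta, phase, v, D, D_theta, dt, rng=random):
    """One Euler (Itô) step of the intermittent ABP.

    phase = 0 is the passive Brownian phase, phase = 1 the active one.
    Returns the updated (x, y, theta).
    """
    amp = math.sqrt(2.0 * D * dt)  # translational noise amplitude
    x_new = x + phase * v * math.cos(theta) * dt + amp * rng.gauss(0.0, 1.0)
    y_new = y + phase * v * math.sin(theta) * dt + amp * rng.gauss(0.0, 1.0)
    # The orientation diffuses regardless of the phase
    theta_new = theta + math.sqrt(2.0 * D_theta * dt) * rng.gauss(0.0, 1.0)
    return x_new, y_new, theta_new
```

With `phase = 0` the drift term vanishes, so the orientation has no effect on the position, which is the decoupling noted in the text.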
Our homogeneous environment is modeled as a two-dimensional square box of size L × L with periodic boundary conditions. A circular target of radius R = 0.05L is located randomly inside the box. Every time the agent finds this target (i.e. the distance between the center of the target and the particle position is smaller than the target radius R), it gets a positive reward, the target is destroyed, and a new target appears at a new random location inside the box. Due to the periodic boundary conditions, this environment is formally equivalent to an infinite domain with a lattice of targets. Since the reward is given only when a target is found, the agent maximizes its total reward by learning to minimize the target-search time, i.e. to optimize the target-search efficiency.
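The target-detection rule in a periodic box can be sketched as below. This is only an illustrative sketch under the assumptions stated in the text (box size L, target radius R = 0.05L, uniform respawn); the helper names `min_image_distance` and `check_target` are hypothetical, and the minimum-image convention is one standard way to measure distances under periodic boundary conditions.

```python
import random

def min_image_distance(x, y, tx, ty, L=1.0):
    """Distance between (x, y) and (tx, ty) in an L x L periodic box."""
    dx = (x - tx + 0.5 * L) % L - 0.5 * L
    dy = (y - ty + 0.5 * L) % L - 0.5 * L
    return (dx * dx + dy * dy) ** 0.5

def check_target(x, y, target, R=0.05, L=1.0, rng=random):
    """Return (reward, target).

    On a hit the reward is 1 and a new target is placed uniformly at
    random in the box; otherwise the reward is 0 and the target is kept.
    """
    if min_image_distance(x, y, target[0], target[1], L) < R:
        return 1, (rng.random() * L, rng.random() * L)
    return 0, target
```

For example, a particle at (0.01, 0.5) detects a target centered at (0.99, 0.5), since the two points are only 0.02L apart across the periodic boundary.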
In the following, we fix the length unit as the size of the box L and the time unit as the typical time τ := L²/4D required by a passive particle to cover this distance. The model thus has two free dimensionless parameters: the Péclet number Pe := vτ/L, measuring the magnitude of the activity, and the persistence ℓ* := v/(D_ϑ L), representing the persistence of directed motion in the ABP phase. The size and the speed of natural and artificial self-propelled microswimmers range over a few orders of magnitude.50 Furthermore, the environmental conditions, such as the size of the search space and of the targets, as well as the target density and shape, can also vary widely. Nevertheless, one can translate our model units to physical units by considering a typical experiment with a box size L = 10² μm and Janus particles with diffusion coefficient D = 2 μm² s⁻¹, resulting in τ = 1.25 × 10³ s.
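As a worked check of this unit conversion, the time unit follows directly from the quoted L and D. The self-propulsion speed is then obtained from the definition Pe = vτ/L, here for the illustrative choice Pe = 100, the high-activity value considered in the results:

```latex
\tau = \frac{L^2}{4D}
     = \frac{(10^{2}\,\mu\mathrm{m})^2}{4 \times 2\,\mu\mathrm{m}^2\,\mathrm{s}^{-1}}
     = 1.25 \times 10^{3}\,\mathrm{s},
\qquad
v = \mathrm{Pe}\,\frac{L}{\tau}
  = 100 \times \frac{10^{2}\,\mu\mathrm{m}}{1.25 \times 10^{3}\,\mathrm{s}}
  = 8\,\mu\mathrm{m}\,\mathrm{s}^{-1}.
```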
We first consider the case in which the learning particle, when in the ABP phase, has a large activity and a persistence length equal to the box size. To do so, we set the persistence to ℓ* = 1 and the Péclet number to Pe = 100, which means that the ratio between the typical length traveled because of the self-propulsion and the typical length traveled due to diffusion is 1 at the minimal phase duration, corresponding to the integration time step Δt = 10⁻⁴τ, and grows up to 100 for a phase duration equal to the time unit τ (see Methods section for details). In such a situation, the learning particle outperforms the target-search performances of a purely passive particle already after two episodes, see Fig. 1. During subsequent episodes, the average time required to find the target keeps decreasing, following a stretched exponential behavior, and after 10³ episodes it is about 4 times smaller than the benchmark value corresponding to the fully passive particle. Furthermore, the spread of the average search times among the N different agents also decreases during the learning process, with the difference between the first quartile and the third quartile reducing from about 1.4τ to about 0.1τ over the 10³ episodes, see Fig. 1. The observed stretched exponential learning shows that convergence to the optimal policy does not have a constant rate during the learning process but that this rate decreases with increasing number of episodes. We are unable to explain this behavior starting from a closer inspection of the considered environment and/or of the PS algorithm. On the other hand, we also stress that the exact value of the stretching exponent β has no particular importance and that it depends on the parameters of both the model and the PS algorithm.
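The quoted ratio of lengths can be verified with a short calculation. As an assumption for this check, the typical diffusive length after a time t is taken to be the root-mean-square displacement in two dimensions, √(4Dt), and the active length is taken to be the ballistic distance vt, neglecting rotational diffusion. Using τ = L²/4D and Pe = vτ/L:

```latex
\frac{\ell_\mathrm{act}(t)}{\ell_\mathrm{diff}(t)}
  = \frac{v\,t}{\sqrt{4Dt}}
  = \frac{\mathrm{Pe}\,L\,(t/\tau)}{L\,\sqrt{t/\tau}}
  = \mathrm{Pe}\,\sqrt{t/\tau},
\qquad
\mathrm{Pe} = 100:\quad
\left.\frac{\ell_\mathrm{act}}{\ell_\mathrm{diff}}\right|_{t = 10^{-4}\tau} = 1,
\quad
\left.\frac{\ell_\mathrm{act}}{\ell_\mathrm{diff}}\right|_{t = \tau} = 100.
```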
An important question is how the target-search efficiency depends on the activity of the particle. To address this issue, we investigate how the average time to reach the target during the 10³-th episode varies when changing the Péclet number. This is reported in the inset of Fig. 1, which shows that the learning particle has performances comparable to those of a passive particle as long as the Péclet number is smaller than about Pe ≈ 10, and then the average time to reach the target decreases with increasing activity until it reaches a plateau for Pe ≳ 200. Such a phenomenology is consistent with the results already found in ref. 49 and is intuitively understood as follows: since the typical distance covered by pure diffusion grows with time as t^(1/2) while the one due to the self-propulsion grows approximately as t, for small activities diffusion dominates the relocation process during short phases. On the other hand, having long active phases is not favorable because of the particle's inability to find the target when in the ABP phase. Thus, at low Péclet numbers, the learning particle tries to maximize the time spent in the passive phase, with the resulting performances equivalent to those of a purely passive particle. In contrast, at large Péclet numbers the self-propulsion velocity is enough to allow relocation at a distance larger than the target size even for very short active phases, and the better performances of the learning particle are in accordance with the idea that an intermittent-search strategy is more efficient than a simple diffusive process.
Additional insight into how the PS algorithm encodes successful strategies can be gained by directly investigating the policy, i.e. the probabilities of switching phase given the state. This is done in Fig. 2, which reports these switching probabilities (panel a) and related observables as learned after 10³ episodes for Pe = 2, 20, and 100, corresponding to a low, intermediate, and high value of the activity, respectively. Among the related observables, we report the probability of having a phase with a certain duration ω̄ conditioned to being in phase ϕ, P(ω = ω̄|ϕ) (Fig. 2b), and the cumulative probability of having a phase duration shorter than ω̄ conditioned to being in the phase ϕ, P(ω < ω̄|ϕ) (Fig. 2c). These quantities can be obtained directly from the switching probabilities p(s) = p(ϕ,ω) related to a given state s = (ϕ,ω) as
$$P(\omega = \bar{\omega}\,|\,\phi) = p(\phi,\bar{\omega}) \prod_{\omega'=1}^{\bar{\omega}-1} \left[1 - p(\phi,\omega')\right] \tag{4}$$
$$P(\omega < \bar{\omega}\,|\,\phi) = \sum_{\omega'=1}^{\bar{\omega}-1} P(\omega = \omega'\,|\,\phi) \tag{5}$$
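The step from switching probabilities to phase-duration statistics can be sketched numerically. The sketch below assumes that a phase lasts exactly k steps if the agent does not switch at durations 1, …, k − 1 and switches at duration k; the function name `duration_distribution` and the list-based representation of the policy are hypothetical choices made for this example.

```python
def duration_distribution(p_switch):
    """Phase-duration statistics from per-duration switching probabilities.

    p_switch[k] is the switching probability at duration k + 1 (in time
    steps) for a fixed phase. Returns (pmf, cdf_strict) where
    pmf[k] = P(duration = k + 1) and cdf_strict[k] = P(duration < k + 1).
    """
    pmf, cdf_strict = [], []
    survive = 1.0  # probability of not having switched at any shorter duration
    for p in p_switch:
        cdf_strict.append(1.0 - survive)
        pmf.append(survive * p)
        survive *= 1.0 - p
    return pmf, cdf_strict
```

For a constant switching probability the result reduces to a geometric distribution, as expected for a memoryless policy.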
Further observables reported in Fig. 2 are the average duration of a phase (panel d) and the fraction of time spent in the passive phase (panel e) as a function of the Péclet number. However, before discussing the details of the learned policies, it is important to clarify that, for large enough ω, the value of the switching probability p(s) always drops to the corresponding value in the initial (arbitrary) policy. In fact, the longer the phase duration ω of a given state s = (ϕ,ω), the more rarely this state is visited during the learning process, with the frequency of these visits depending on the switching probabilities associated with the states s = (ϕ,ω′) having the same phase ϕ and shorter phase duration ω′ < ω. This results in a practical limitation in sampling states with a large phase duration. In spite of this issue, the target-search abilities of the trained agent are not affected since the rarer it is to visit a given state, the smaller the contribution of the action following that state to the overall performances of the particle.
For low activity (Pe = 2), the probability of switching from the passive to the active phase decreases from the value 10⁻² corresponding to the initial policy to a value of about 10⁻³ for very short phase duration. Such a probability increases with a power-law behavior for about three decades, until it quickly converges to the initial policy value for a duration of the phase larger than about 10⁻¹τ (see Fig. 2a, lower panel). On the other hand, the probability of switching from the active to the passive phase (Fig. 2a, upper panel) increases from the initial policy's value 10⁻³ to about 4 × 10⁻² and drops to the initial policy for a duration of the phase larger than about 10⁻²τ. These results, together with the corresponding ones in panels b and c, indicate that, for low activity, the trained particle prefers to alternate relatively long passive phases with short active ones, confirming the previously mentioned expectations. For Pe = 20, the probability of having a phase with a certain duration ω shows a peak at around ω = 10⁻²τ both when conditioned to be in the ABP phase (ϕ = 1) and in the BP one (ϕ = 0), see Fig. 2b. Consequently, for this value of the activity, the best strategy consists of alternating between active and passive phases both having a typical duration of about 10⁻²τ (see also Fig. 2d), with the duration of the passive phase having a larger variance, as indicated by the fact that the peak of the distribution conditioned to being in the active phase is narrower than the one of the distribution in the passive phase. We stress that, because of its self-propulsion, an ABP with Pe = 20 and ℓ* = 1 in a time interval of 10⁻²τ covers a typical distance of about 0.2L, which is twice the target diameter. Finally, for Pe = 100, Fig. 2 shows that the distribution of phase durations displays a sharp peak at around 10⁻³τ for the ABP phase and a rather broad peak at a few integration time steps for the BP phase. Concomitantly, the learned strategy alternates between very short active phases with an average duration of about 1.4 × 10⁻³τ and even shorter passive phases lasting about 0.5 × 10⁻³τ on average. In this case, the typical distance traveled during the active phase because of the particle's self-propulsion is about 0.14L, which is of the same order as the one registered in the case of Pe = 20 even though the activity is now 5 times larger.
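The two distances quoted above follow from the definition Pe = vτ/L if the active displacement over a duration t is approximated by the ballistic estimate vt. This approximation is an assumption of the check below; it is reasonable here because both durations are shorter than the persistence time 1/D_ϑ = ℓ*τ/Pe, which for ℓ* = 1 equals 5 × 10⁻²τ at Pe = 20 and 10⁻²τ at Pe = 100.

```latex
\frac{v\,t}{L} = \mathrm{Pe}\,\frac{t}{\tau}:
\qquad
\mathrm{Pe} = 20,\; t = 10^{-2}\tau \;\Rightarrow\; v\,t = 0.2\,L = 2 \times (2R),
\qquad
\mathrm{Pe} = 100,\; t = 1.4 \times 10^{-3}\tau \;\Rightarrow\; v\,t = 0.14\,L,
```

where 2R = 0.1L is the target diameter.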
It is interesting to note that, as reported in Fig. 2d, while the average passive phase duration monotonically decreases with the Péclet number, its counterpart for the active phase has a non-monotonic behavior that can be rationalized as follows: both at large and at low values of the activity the ABP phases are very short, but for two different reasons. At low activity, these are short because active relocation to a distance greater than the target size would require too much time and the agent responds by minimizing the time spent in this phase. In contrast, for large activity, very short active phases are already sufficient to allow the particle to relocate elsewhere in the simulation box and improve the target-search performances of the smart particle. For intermediate Péclet numbers, the agent instead finds an optimal duration of the active phase reflecting the compromise between the utility of active phases for quick relocation and the fact that during these phases the target cannot be detected. The combination of the monotonic behavior of the average passive-phase duration and the non-monotonic behavior of the average active-phase duration also results in a non-monotonic behavior of the fraction of time spent in the passive phase. In fact, this quantity is close to 1 for small Péclet numbers, decreases to about 0.25 for Pe ≈ 100, and then increases again to 0.5, which is the value expected for extremely large levels of activity, see Fig. 2e.
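The link between the average phase durations and the fraction of time spent in the passive phase can be made explicit. Assuming that passive and active phases strictly alternate, so that over a long trajectory both occur equally often, the fraction f₀ is the ratio of the mean passive duration to the sum of the two mean durations. The symbols f₀, ⟨ω⟩₀ and ⟨ω⟩₁ are notation introduced only for this check. With the averages quoted in the text for Pe = 100:

```latex
f_0 = \frac{\langle\omega\rangle_0}{\langle\omega\rangle_0 + \langle\omega\rangle_1}
    \approx \frac{0.5 \times 10^{-3}\tau}{0.5 \times 10^{-3}\tau + 1.4 \times 10^{-3}\tau}
    \approx 0.26,
```

in line with the value of about 0.25 read from Fig. 2e. In the same picture, f₀ = 0.5 corresponds to passive and active phases having equal average duration.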
Finally, Fig. 3 shows the distributions of times needed to find the target by agents adopting the learned policies previously discussed for Pe = 2, 20, and 100. These distributions are exponential, meaning that the kinetics is completely characterized by the mean first-passage time. For a comparison, the distributions obtained by a searching particle adopting the initial policy and by a completely passive particle are also reported. Concerning the results obtained by adopting the initial policy, note that, even if the policy remains the same, in principle the resulting distribution depends also on the value of the activity. However, this dependence appears to be very weak, as revealed by the similar behavior of the distribution corresponding to the three different Péclet numbers. For low activity (Pe = 2) the distribution of the searching times related to the learned policy is very similar to that of a simple BP, confirming the passive-like behavior of the agents in this Péclet regime. As expected, increasing the activity, the distribution for the optimized particle becomes more and more narrow. In particular, for Pe = 100, the large majority of targets is found within the unit time τ.
Fig. 3 Distribution of times needed to find the target collected during a time interval of length 10⁶τ, for Pe = 2, 20, and 100 (panels a, b, and c, respectively). We consider here agents behaving according to the policies learned after 10³ episodes as reported in Fig. 2 (solid bars), to the initial policy prior to learning (black line), and completely passive particles (blue line). The magenta line is the exponential distribution with decay time given by the average time to find the target.
With respect to previous literature on stochastic target search,1,21–26,48 which mainly applies to generic scenarios, our investigation is more focused on the microscopic world, namely we are interested in natural or artificial microswimmers. This is the main reason to resort to the active Brownian particle model. In fact, this model, besides being the paradigmatic model in the framework of non-equilibrium dynamics,57–60 also provides a faithful representation of the behavior of artificial microswimmers such as the Janus particles.50 Remarkably, nowadays it is already possible to perform experiments in which the activity of artificial microswimmers is controlled by an external illuminating system.38 Thus, the target-search strategies developed in the present manuscript can potentially be tested in a laboratory. Furthermore, the intermittent active Brownian dynamics that we introduce in the Model section can be also considered, in the case of relatively large activity and persistence, as a first proxy for the run-and-tumble dynamics which is the typical theoretical model describing the motion of bacteria.10,50,61
The proposed framework offers new insight into target-search problems in homogeneous environments and paves the way to further research. In particular, it can be leveraged to explore more complex scenarios such as, for instance, target search with resetting events,62–64 multiple and/or motile targets problems,1 or searchers with multiple migration modes, the latter being relevant to dendritic cells searching for infections.65 Moreover, other possible developments, particularly relevant for the envisioned medical and environmental application of smart active particles, entail heterogeneous environments involving the presence of obstacles, boundaries, and energy barriers.29,66–68 Finally, endowing the agent with a limited memory of the recently visited locations69 or with the ability to sense directional cues coming from the target itself may also be an extension going in the direction of better modelling biological microswimmers.
The core idea of this algorithm is to use a particular kind of memory, called episodic and compositional memory (ECM), which is mathematically described by a graph connecting units called clips. Clips can be either percept or decision units, corresponding to states and actions respectively, or a combination of those. We design our target-search problem as a Markov decision process,35 i.e. at each learning step, the agent is in some state s, takes an action a according to a policy defined by the conditional probabilities π(a|s), and receives a reward as a consequence of this action. In such a case, the ECM structure consists of a layer of states fully connected with a layer of actions. Each edge of the graph, i.e. each state-action pair (s,a), is assigned a real-valued weight h(s,a), called the h-value, which determines the policy according to
$$\pi(a|s) = \frac{h(s,a)}{\sum_{a'} h(s,a')} \tag{6}$$
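The normalization of h-values into a policy can be illustrated in a few lines. This sketch assumes the policy is the h-value of each action divided by the sum of the h-values over all actions available in that state, which matches the two-action expression for the switching probability given in the Methods; the function name `policy` is hypothetical.

```python
def policy(h_row):
    """Turn the h-values of one state into action probabilities.

    h_row[a] is the (positive) h-value of action a in the given state.
    """
    total = sum(h_row)
    return [h / total for h in h_row]
```

For instance, h-values of 0.99 (keep the phase) and 0.01 (switch) give a switching probability of 0.01, and rescaling all h-values of a state by a common factor leaves the policy unchanged.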
This last feature of the PS algorithm makes it particularly apt to solve our target-search problem. The reason is that, on average, the equations of motion (1–3) have to be iterated a large number of times before a target is found and the agent obtains its reward. Consequently, the reward signal is very sparse and has only a very low correlation with the particular state-action pair encountered when the target is found. Approaches taking into account long sequences of visited state-action pairs, as the PS algorithm, should then be preferred with respect to typical action-value methods such as one-step Q-learning or SARSA.35 Indeed, we failed to obtain successful target-search policies when applying standard Q-learning to our setup.
Applying the PS framework to the model illustrated in the dedicated section and taking into consideration that in our case the action a can be described as a binary variable, with a = 1 corresponding to a switch of the phase (passive or directed motion) and a = 0 to maintaining the current phase, a single learning step consists of the following operations:
• Given the current state s_t = (ϕ_t, ω_t), the probability of switching phase p_t is determined as p_t = π(a_t = 1|s_t) = h(s_t,1)/[h(s_t,0) + h(s_t,1)];
• The glow matrix is damped following the update rule G ← (1 − η)G, where η is called the glow parameter and determines how much a delayed reward should be discounted;
• The glow matrix is updated by adding a unit to the visited state-action pair, g(st,at) ← g(st,at) + 1;
• The position and the direction of the particle evolve according to eqn (2) and (3);
• The whole matrix of h-values is updated according to the learning rule of the PS model, H ← H − γ(H − H_0) + ℛG, where ℛ is the reward, being zero if no target is found by the particle located at the updated position and 1 otherwise. Here, γ is called the damping parameter and specifies how quickly the H matrix returns to an initial matrix H_0.
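The steps listed above can be sketched as a single function. This is a minimal illustration under stated assumptions, not the authors' implementation: the h-values and glow values are stored in dictionaries keyed by (state, action), the learning rule is taken to be the standard PS update h ← h − γ(h − h₀) + reward × g, and `reward_fn` is a hypothetical callback standing for the evolution of the particle and the target check.

```python
import random

def ps_learning_step(h, h0, g, state, reward_fn, gamma, eta, rng=random):
    """One PS learning step; h, h0, g map (state, action) -> float.

    Returns (action, reward), with action 1 = switch phase, 0 = keep it.
    """
    # Switching probability from the h-values of the current state
    p_switch = h[(state, 1)] / (h[(state, 0)] + h[(state, 1)])
    action = 1 if rng.random() < p_switch else 0
    # Damp the whole glow matrix, then mark the visited state-action pair
    for key in g:
        g[key] *= 1.0 - eta
    g[(state, action)] += 1.0
    # Hypothetical callback: evolve the particle and check for a target
    reward = reward_fn(action)
    # Damping towards h0 plus reward distributed according to the glow
    for key in h:
        h[key] += -gamma * (h[key] - h0[key]) + reward * g[key]
    return action, reward
```

Because the glow decays geometrically, state-action pairs visited long before a target is found still receive a share of the reward, only a smaller one.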
Since the reward is different from zero only when the target is found, it is possible to reduce the computational cost by iterating the single-step update rule of the H matrix analytically and updating this matrix as a whole only when the target is found. In more detail, if a target is found at time step t_2 and the previous target was found at time step t_1, then the update rule of the whole H matrix at time t_2 (given that the last time it was updated was at the end of step t_1) is H ← H_0 + (1 − γ)^(t_2 − t_1) (H − H_0) + ℛG. In doing so, one has to pay attention that, during the learning steps t with t_1 < t ≤ t_2, the switching probability p_t has to be determined according to p_t = h̃(s_t,1)/[h̃(s_t,0) + h̃(s_t,1)], where h̃(s_t,a_t) is a temporarily updated h-value obtained as h̃(s_t,a_t) = h_0(s_t,a_t) + (1 − γ)^(t − t_1 − 1) [h(s_t,a_t) − h_0(s_t,a_t)], with h_0(s_t,a_t) the element of the initial matrix H_0 corresponding to state s_t and action a_t.
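The equivalence between repeated single-step damping and a single delayed update can be checked numerically. The sketch below assumes that, in the absence of reward, each step applies h ← h − γ(h − h₀), so that n such steps give h₀ + (1 − γ)ⁿ(h − h₀); both function names are hypothetical.

```python
def iterate_damping(h, h0, gamma, n):
    """Apply the reward-free single-step rule h <- h - gamma*(h - h0) n times."""
    for _ in range(n):
        h = h - gamma * (h - h0)
    return h

def closed_form_damping(h, h0, gamma, n):
    """Closed form of n reward-free damping steps."""
    return h0 + (1.0 - gamma) ** n * (h - h0)
```

Since the number of steps between two target detections is typically large, evaluating the closed form once per detection, or on demand for the single state being visited, avoids touching the whole matrix at every integration step.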
The initial policy is such that the probabilities of switching phase are 10⁻² and 10⁻³ when in the passive and in the active phase, respectively. This is obtained by setting, for each t, h_0(s_t, a_t = 1) = 10⁻² and h_0(s_t, a_t = 0) = 1 − 10⁻² if the state is in a passive phase, and h_0(s_t, a_t = 1) = 10⁻³ and h_0(s_t, a_t = 0) = 1 − 10⁻³ if the state is in an active phase. All the terms of the G matrix are initialized to zero at the beginning of each episode.
We set the integration time step to Δt = 10⁻⁴τ and, to have a finite set of states, we limit the duration of a given phase ω to be no longer than τ. This results in a total of 2 × 10⁴ states s_t = (ϕ_t, ω_t), with ϕ_t = 0, 1 (see Model section) and ω_t = 1,…,10⁴. The glow and the damping parameters are considered hyperparameters of the model and, for each value of the activity Pe and of the persistence ℓ*, are adjusted to obtain the best learning performances. Their values are reported in Table 1. Finally, to investigate how the learning process evolves, we split the whole process into several episodes, each lasting 20τ. At the beginning of each episode, each element of the glow matrix is initialized to zero.
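The state space and the initial h-values described above can be set up as in the following sketch. The dictionary layout keyed by ((phase, duration), action) and the names `N_STEPS`, `P0`, and `initial_h_values` are hypothetical choices for this illustration; only the numerical values are taken from the text.

```python
N_STEPS = 10**4              # maximal phase duration tau in units of dt = 1e-4 tau
P0 = {0: 1e-2, 1: 1e-3}      # initial switching probability for passive (0) and active (1)

def initial_h_values(n_steps=N_STEPS, p0=P0):
    """Build h0 for all states (phase, omega) and actions 0 (keep) and 1 (switch)."""
    h0 = {}
    for phase, p in p0.items():
        for omega in range(1, n_steps + 1):
            h0[((phase, omega), 1)] = p
            h0[((phase, omega), 0)] = 1.0 - p
    return h0
```

With these settings there are 2 × 10⁴ states and 4 × 10⁴ state-action pairs, and an episode of length 20τ corresponds to 2 × 10⁵ integration steps.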
Pe | ≤5 | 10 | 20 | 50 | ≥100 |
---|---|---|---|---|---|
γ | 10⁻⁷ | 10⁻⁶ | 10⁻⁶ | 10⁻⁶ | 10⁻⁵ |
η | 10⁻² | 10⁻³ | 10⁻³ | 10⁻² | 10⁻² |
This journal is © The Royal Society of Chemistry 2024