Mixing of a binary passive particle system using smart active particles

Thomas Jacob; Siddhant Mohapatra; Rajalingam A; Sam Mathew; Pallab Sinha Mahapatra

PMC · DOI:10.1038/s41598-025-33076-6·December 20, 2025

Mixing of a binary passive particle system using smart active particles

Thomas Jacob, Siddhant Mohapatra, Rajalingam A, Sam Mathew, Pallab Sinha Mahapatra

PDF

Open Access

TL;DR

Smart active particles, guided by machine learning, can mix passive particles more efficiently than traditional methods.

Contribution

A novel approach using smart active particles with adaptive behavior to achieve optimal mixing of passive particles.

Findings

01

Smart active particles outperform conventional run-and-tumble particles in mixing efficiency.

02

Optimal mixing occurs when active particles are confined to an eccentric zone, inducing global rotation.

03

The passive particles' motion transitions directionally toward the system center.

Abstract

The controlled activity of active entities interacting with a passive environment can generate emergent system-level phenomena, positioning such systems as promising platforms for potential downstream applications in targeted drug delivery, adaptive and reconfigurable materials, microfluidic transport, and related fields. The present work aims to realise an optimal mixing of two segregated species of passive particles by introducing a small fraction of active particles (\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $2%$ \end{document} by composition) with adaptive and intelligent behaviour, directed by a trained Artificial Neural Network-based agent. While…

Linked entities

Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.

Species4

Salmonella enterica(species)Bacillus subtilis(species)Pseudomonas aeruginosa(species)Escherichia coli(E. coli · species)

Chemicals1

SAPs

Diseases1

SaM

Figures9

Click any figure to enlarge with its caption.

Panel (a) showcases the initial distribution of particles in the confined circular domain. The active particles are coloured red, while the two passive particle species are coloured blue and orange. The wall particles are demarcated by their circumference in black. Ratio of radius of domain to that of a passive particle $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\rho = 15$$\end{document}$ , while the packing

Panel (a) illustrates the interaction between the reinforcement learning agent and the environment. Here, the agent refers to an Artificial Neural Network (ANN). Action $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$a_{t}$$\end{document}$ transforms the state of the environment from $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsb

Panel (a) displays snapshots of the system at different time instances $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\tau$$\end{document}$ . The active particles are coloured red, while the two passive species are coloured orange and blue, respectively. Panel (b) showcases the locations (coloured by time stamp) of the active particles following run-and-tumble (RT) dynamics. (Note: Both the panels involve depicti

Time progression of the mixing of the two passive species enabled by smart active particles (SAPs) controlled by a non-trained agent (top panel) and a trained agent (bottom panel) is presented through snapshots of the domain. The SAPs (coloured red) can only move in the four cardinal directions ( $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\Gamma =\pi /2$$\end{document}$ ), and are set to tumble instantaneousl

Panel (a) illustrates the locations of the tumbling events of the SAPs in the case of (i) a non-trained SAP (NTSAP), and (ii) a trained SAP (TSAP) for a representative test episode. Panel (b) demonstrates the trajectories of three TSAPs for a period of $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$5\times 10^{4}$$\end{document}$ time steps, to further highlight the motion of the TSAPs being constrained to only a

The temporal evolution of the probability distribution of $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\zeta =\frac{n_{o}}{n}$$\end{document}$ (i.e., the ratio of the passive neighbours of dissimilar type/species $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemar

Panel (a) illustrates the motion trajectories for three representative passive particles selected based on their initial positions: (a-i) off-centre, (a-ii) centre, and (a-iii) adjacent to the wall. The colour bar indicates the time until a maximum of $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\tau =10^{8}$$\end{document}$ time steps. The red and the cyan disks represent the initial and final positions of the

The effect of the tumble step $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\Gamma$$\end{document}$ and run duration $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\delta$$\end{document}$ of the SAPs is demonstrated with respect to

Funding1

—https://doi.org/10.13039/501100003845Indian Institute of Technology Madras

Keywords

MixingActive matterReinforcement learningEngineeringMathematics and computingPhysics

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsMicro and Nano Robotics · Pickering emulsions and particle stabilization · Modular Robots and Swarm Intelligence

Full text

Introduction

Active matter encompasses a broad class of intrinsically non-equilibrium systems, forming a ubiquitous part of natural as well as synthetic systems across various size scales. These systems consist of units which constantly dissipate energy and, through local interactions, give rise to emergent system-spanning behavioural patterns, otherwise termed collective behaviour. Such system behaviour has been reported in several experimental forays on natural^1–3^ and artificial^4,5^ systems. The 1990 s marked the development of the mathematical machinery to understand and interpret active systems, with pioneering works by Vicsek et al.^6^ and Toner and Tu^7^. These seminal works led to an increased interest in the numerical modelling of active systems and their applications to a wide range of fields, encompassing the life sciences, physical sciences, safety science, and econometrics. Reviews by Ramaswamy et al.^8^, Vicsek and Zaefiris^9^, Marchetti et al.^10^, and more recently, by Shaebani et al.^11^, and Gompper et al.^12^ provide a detailed account of theoretical paradigms and numerical approaches that have shaped the current research in active matter. In the biological world, evolutionary pressures driven by basic functional needs are considered central to the emergence of social behaviour in organisms. Some instances of this social behaviour among macroscale organisms include flocking/swarming for predator confusion^13–16^, navigation of complex environments^17^, schooling for hydrodynamic efficiency^18^, and herding for coordinated escape^19^. In the microscopic world, collective behaviour is often driven by physico-chemical processes or biological cues, some examples of which are the chemotaxis of Escherichia coli causing swarming^20^, the mechanotaxis of Pseudomonas aeruginosa through twitching mobility leading to the formation of rafts^21^, and the chemical gradient-driven active nematic motion of rod-like cells resulting in the formation of lanes^22^. Irrespective of the size scales, the collective behaviour observed in active systems stems purely from localised interactions involving proximate entities. In such dynamic systems, natural or artificial, the active entities often interact with passive structures/entities^23–27^. The scope and versatility of the phenomena deriving from such interactions, most notably clustering, homogeneous mixing, phase separation, and active transport, have prompted extensive studies on active-passive mixtures of varying size ratios^28,29^, activity^30–32^, and particle proportions^33^. Experiments introducing a minute fraction ( $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\approx 1\%$$\end{document}$ by area) of active particles in a dense aggregation of passive colloids (varying between $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$10\%$$\end{document}$ to $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$90\%$$\end{document}$ by area) demonstrated that even a highly limited active component can significantly alter the structure and dynamics of the system^34^. Microscopic parameters, such as particle activity and interaction strengths, as well as macroscopic properties such as particle concentration, have been found to affect the emergent behaviour in active and active-passive systems, raising pertinent questions about the parameter space in which such systems operate.

Recent advancements in Machine Learning (ML) have provided a bottom-up approach to understanding active systems. Supervised as well as unsupervised learning techniques have been deployed in various applications involving active matter, such as pattern recognition and classification^35,36^, predictive modelling^37,38^, optimal navigation strategies^39^, and swarm optimisation^40^. Several studies demonstrate the efficacy of reinforcement learning (RL) in discerning the optimal parameter set of active systems driven towards a specific objective. In the purview of control on a particle scale, a widely studied problem is that of optimal navigation of the particle towards a target location under different environmental conditions such as complex flow fields^39,41–43^, spatially varying motility landscapes^44^, physical or potential barriers^45^, and stochasticity in the surroundings^46^. Some studies have also focused on controlling the particle motion through selective activation using attraction/repulsion forces^47^, optical exposure^40,48^, among others. Therefore, a system of randomly interacting particles could be trained to be more efficient in achieving the desired goals by controlling one or more parameters of the system- usually speed and/or direction.

In the literature, Q-learning, a reinforcement learning algorithm, stands out as a prominent tool for training relatively less complex problems such as grid world navigation^49^, maze solving^50^, and path planning^39,45,51^, often using Q-tables for state-action mapping. However, with problems requiring a larger state action space (see the Methods section for technical details), increased dimensionality, and more extensive exploration, the learning process becomes progressively complex and cannot be handled by Q-tables. Artificial Neural Networks (ANNs) must be employed to handle the increased complexity. Deep Q Networks (DQN) is one of the most successful initial implementations using a Deep Neural Network (DNN), with a value-based off-policy algorithm. Since the advent of DQN, several algorithms have emerged to train deep neural networks effectively. These algorithms can be classified into three categories: value-based (e.g., DQN, DDQN), policy-based (e.g., REINFORCE), and actor-critic-based (e.g., PPO, SAC). Each of these algorithms, while being successful for certain problems, also presents its own set of challenges. Due to the decision-making role played by the agent when integrated with active matter systems (such as propulsion/directional control), the selection of the type of algorithm is contingent upon the characteristics of the action space and the nature of the control problem. Discrete action scenarios are the primary application of value-based algorithms, such as DQN. Policy-based as well as hybrid (combining policy and value) algorithms can generally handle both continuous and discrete action spaces. It is noteworthy that value-based algorithms have been predominantly applied to single-particle manoeuvring. However, in systems with multiple particles, it is preferable to switch to hybrid algorithms which use an actor-critic framework.

In the current study, a small number of active particles are introduced into a two-dimensional binary athermal bath, consisting of two initially segregated passive species, inside a circular confinement. The objective of the active particles is to agitate the passive particles and achieve an efficient mixture of the species. The motion of the active particles is controlled by an agent that has been trained through reinforcement learning to maximise an objective function (defined as a mixing index of the passive system). The present work involves passive particles driven by direct impact with active particles, in contrast to the existing literature^40,47,48^, which employed various methods to induce self-propulsion in the targeted particles. Additionally, the physical properties of both the passive species are assumed to be similar, for the sake of simplicity; therefore, they are differentiated only by colour. The complexity of the problem arises during the training stage, due to the large number of possible actions in any given system state, especially when the objective function is a macroscopic quantity that encompasses the entire system. Considering the highly non-linear nature of mapping the system states to probable actions, the method demands a non-conventional approach involving Artificial Neural Networks (ANN). Using Reinforcement Learning (RL) concepts, in which an agent learns to achieve an objective through repeated experiences, the current work demonstrates that a minute fraction of active particles, trained to perform simple discrete actions, suffices to efficiently mix a binary passive system. The next section outlines the numerical methodology, including the equations of motion governing both active and passive particles, as well as the reinforcement learning framework. It also describes the method of quantifying mixing among passive particles, which is further used to define the objective function of the optimisation problem. The subsequent sections pertain to the mixing performance of active particles following run-and-tumble dynamics, before discussing and comparing it to the mixing efficacy when employing active particles controlled by a trained RL agent.

Numerical methodology

Simulation environment

Monodisperse passive disc-shaped particles are uniformly distributed inside a confined two-dimensional circular domain as represented in Fig. 1(a). All particles have the same radius r, and the ratio of the domain radius R to the particle radius is defined as the radius ratio $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\rho$$\end{document}$ . A circular confinement is selected to avoid entrapment of passive particles in the corners of standard rectangular/square domains. The wall of the bounded system consists of one layer of immovable circular discs of the same size as the interior particles. The packing in the system is denoted by an area fraction $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\phi =Nr^2/R^2$$\end{document}$ , where N is the number of interior particles. It is to be noted that the packing fraction takes into account the passive particles only, while the active particles are represented in numbers, $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$N_{a}$$\end{document}$ .

Fig. 1. Panel (a) showcases the initial distribution of particles in the confined circular domain. The active particles are coloured red, while the two passive particle species are coloured blue and orange. The wall particles are demarcated by their circumference in black. Ratio of radius of domain to that of a passive particle $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\rho = 15$$\end{document}$ , while the packing fraction of passive particles $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\phi _{p}=0.65$$\end{document}$ , and the number of active particles $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$N_{a}=3$$\end{document}$ . Panel (b) displays the self-propulsion drive of any active particle i in the direction $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\theta _{i}$$\end{document}$ at speed v, while panel (c) illustrates the inter-particle repulsion drive $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$F_{ij}$$\end{document}$ and $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$F_{ji}$$\end{document}$ acting on particles i and j, respectively, on overlap. The strength of this drive scales linearly with the extent of overlap. Panel (d) illustrates the neighbourhood selection scheme for any particle i in the calculation of the mixing index. A metric-based neighbourhood selection is used, when any particle within a distance of $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$r_{c}$$\end{document}$ from the centre of i is considered a neighbour of particle i. According to the example in panel (d), particle i has seven neighbours ( $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$n=7$$\end{document}$ ), four of which are of the same type (blue) as i, while the rest are of the opposite type (orange; $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$n_{o}=3$$\end{document}$ ). Hence the number fraction of opposite species for particle i is $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\zeta =\frac{n_{o}}{n}=\frac{3}{7}$$\end{document}$ .

The governing equations of motion of the active and the passive particles are delineated in Eqs. 1 and 2, adopted from the model used by Henkes et al.^52^. The current model assumes non-inertial and athermal particles and is valid for particles moving slowly or in highly viscous environments (such that inertial effects are negligible in comparison to viscous damping). The active particles are acted upon by two drives: the self-propulsion drive and the inter-particle repulsion drive (see Eq. 1).

\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\begin{aligned} \boldsymbol{\dot{x}_i}=v\boldsymbol{\hat{n}}_i + \mu \sum _{j} \textbf{F}_{ij} \end{aligned}$$\end{document}

Here, $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\mathbf {x_{i}}$$\end{document}$ is the position of active particle i with respect to the origin, v is the self-propulsion speed in the direction $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\boldsymbol{\hat{n}}_i$$\end{document}$ , $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$F_{ij}=k(r_{i}+r_{j}-d_{ij}) \quad {\textbf {if}} \quad r_i + r_j> d_{ij}$$\end{document}$ is the repulsive force exerted on particle i due to overlap with any particle j, $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\mu$$\end{document}$ is the translational mobility, k is the coefficient of the repulsive force, $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$r_{i}$$\end{document}$ and $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$r_{j}$$\end{document}$ are the radii of the particles i and j, respectively. $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$d_{ij}=\Vert \mathbf {x_i}-\mathbf {x_j}\Vert$$\end{document}$ is the distance between the centres of particles i and j. The direction of motion of the active particle i is controlled by the term $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\boldsymbol{\hat{n}_i} = \begin{pmatrix} \cos \theta _{i} \\ \sin \theta _{i} \end{pmatrix}$$\end{document}$ , where $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\theta _{i}$$\end{document}$ is the angle subtended by the desired direction of propulsion of the particle i with the x-axis (see Fig. 1(b)). $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\theta _{i}$$\end{document}$ can change continuously over the range $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$[0,2\pi )$$\end{document}$ , or assume discrete angles based on prescribed movement criteria. By modulating the direction of propulsion $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\theta _{i}$$\end{document}$ , different types of motion can be observed in the active particles, such as run-and-tumble (RT)^53,54^, run-reverse^55,56^, and directed migration^57,58^.

\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\begin{aligned} \boldsymbol{\dot{x}_p}=\mu \sum _{j} \textbf{F}_{pj} \end{aligned}$$\end{document}

The equation for passive particles (Eq. 2) differs from their active counterpart due to their inability to self-propel. Therefore, the passive particles are subjected only to the inter-particle repulsion drive. In the current work, all particles are assumed to be athermal (assuming negligible thermal diffusivity). The interaction of the interior particles (irrespective of activity) with the wall particles occurs through a repulsion drive similar to the one mentioned in Eqs. 1 and 2. When an interior particle i overlaps with a wall particle w, the former experiences a body force $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$F_{iw}=k(r_{i}+r_{w}-d_{iw})$$\end{document}$ if $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$r_i + r_w> d_{iw}$$\end{document}$ , where $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$r_i$$\end{document}$ and $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$r_w$$\end{document}$ are the radii of the particles, and $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$d_{iw}$$\end{document}$ is the Euclidean distance between their centres.

To observe mixing, the passive particles are initially segregated along the diameter of the circular domain, while the active particles are uniformly distributed (see Fig. 1(a)). In the present study, unless otherwise specified, three active particles ( $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$N_{a}=3$$\end{document}$ ) are used to agitate a relatively dense binary passive aggregation ( $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\phi _{P}=0.65$$\end{document}$ ). All particles are assumed to be of unit radius, and any length dimensions presented are scaled against it. To keep the system concise and manageable, the radius ratio $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\rho$$\end{document}$ is fixed at 15, $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\mu k=10$$\end{document}$ , and $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$v=1$$\end{document}$ . The results presented in the next section pertain to $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\tau =5\times 10^{6}$$\end{document}$ time steps or longer (to observe typical long-time behaviour). The upcoming subsection explains the dynamics of run-and-tumble particles (RTPs) and their efficacy in mixing the two passive species.

Run-and-tumble particles

Run-and-tumble (RT) is one of the prevalent mechanisms of bacterial locomotion, often observed in species such as Escherichia coli, Bacillus subtilis, and Salmonella enterica. RT motion is characterised by ballistic “runs”, interspersed with sudden directional changes (tumbles). Empirical evidence of the locomotion of these microbes suggests an exponential distribution of run duration, with tumble angles uniformly distributed within a certain range. However, for artificial systems, the tumbling range can be $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$[0,2\pi )$$\end{document}$ , which induces complete randomness. The active particles, having been modelled as slow-moving robots, can be thought to behave as RT particles, with run durations sampled from an exponential distribution and tumbling angles sampled from a uniform distribution. The tumbling is also assumed to be instantaneous (the time scale of the tumbling event is much smaller than that of the run event and can be neglected). This enhances the exploration probability over the entire domain due to the synergy between persistent runs and random tumbles. The dynamics of these particles are governed by Eqs.1 and 2, and the run duration is governed by Eq.3, where $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\tau _{r}$$\end{document}$ is the sampled run duration and $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\tau _{m}$$\end{document}$ is the mean run duration.

\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\begin{aligned} \boldsymbol{\tau _{r}}=\frac{1}{\tau _{m}}e^{\frac{-t}{\tau _{m}}} \end{aligned}$$\end{document}

In the current work, run-and-tumble particles are employed to behave as a randomised mixer of the passive species, with the mean run duration serving as a primary control parameter. Such an analysis provides a base case for defining programmed mixing functionality with the help of Reinforcement Learning (RL) later on.

Mixing index (\documentclass[12pt]{minimal}

			\usepackage{amsmath}
			\usepackage{wasysym} 
			\usepackage{amsfonts} 
			\usepackage{amssymb} 
			\usepackage{amsbsy}
			\usepackage{mathrsfs}
			\usepackage{upgreek}
			\setlength{\oddsidemargin}{-69pt}
			\begin{document}$$\chi$$\end{document})

As previously discussed, the active particles serve to agitate the passive species in the system, thereby promoting their mixing. Therefore, the mixing in the system has to be properly quantified. From the literature, various methods exist to quantify the mixing of a binary particle system^59,60^, and the choice of method must be both simple and effective. Due to the inherent large fluctuations observed in quantifying mixing when dealing with grid-based methods, this approach was ruled out. Mixing can also be computed based on Principal Component Analysis (PCA)^59^; however, it is computationally expensive. In granular mixing, several methods have been developed to assess the extent of mixing in a binary particle system. In the current work, a relatively straightforward and computationally efficient method has been used to quantify the mixing index $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\chi$$\end{document}$ , as elucidated in Eq. 4.

\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\begin{aligned} \chi =\frac{2}{N_{p}}\sum _i^{N_p} (\zeta _i) \end{aligned}$$\end{document}

Here, $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\zeta _i=\frac{n_{o}}{n}$$\end{document}$ , where $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$n_{o}$$\end{document}$ is the number of passive particles of the opposite species and n is the total number of passive particles surrounding the particle i, (both $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$n_o$$\end{document}$ and n are counted exclusive of the particle i), within a fixed radius $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$r_{c}$$\end{document}$ from the centre of i (see Fig. 1(d)). Only passive particle species are considered to compute the mixing index. In an ideal homogeneous mixture, each passive particle is surrounded by an equal number of neighbours of the same and the opposing species, resulting in $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\zeta =0.5$$\end{document}$ for each particle. Therefore, a normalisation factor of 1/2 has been applied such that the mixing index $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\chi =1$$\end{document}$ for an ideal homogeneous mixture. Accordingly, the value of $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\chi$$\end{document}$ can vary from 0 for a completely unmixed system (where a considerable space separates the two species) to a value close to 1 in a well-mixed system. In the current work, the initial positional configuration consists of certain particles at the interface of the two species with non-zero $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\zeta$$\end{document}$ , therefore, resulting in a non-zero $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\chi$$\end{document}$ at $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$t=0$$\end{document}$ . Due to the inherent dynamic nature of the system, fluctuations of the mixing index can occur, even in well-mixed systems. In the current study, the passive particles form a relatively dense system ( $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\phi _{p}=0.65$$\end{document}$ ), and Eq. 4 is applied within a neighbourhood defined by a radius of $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$r_{c}=4r$$\end{document}$ . Choosing a large $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$r_{c}$$\end{document}$ could lead to erroneous reporting of a well-mixed state, even in the presence of small particle clusters of the same species. Conversely, a small $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$r_c$$\end{document}$ could result in too few or no neighbours in low-density regions, weakening the statistical accuracy of the mixing analysis.

Reinforcement learning framework

The Reinforcement Learning (RL) framework is employed to find an optimal mixing strategy to efficiently mix the passive system by guiding active particles. Here, optimality refers to the shortest path (lowest simulation time) to a high value of mixing index $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\chi$$\end{document}$ . As discussed earlier, active particles can be controlled by adjusting their run duration $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\delta$$\end{document}$ and direction of motion $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\theta _{i}$$\end{document}$ in steps of $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\Gamma$$\end{document}$ . Figure 2(a) illustrates the RL setup consisting of two key components: the agent and the environment, interacting through three quantities: the state (observation), the action, and the reward. The agent is the decision maker, and the environment is the system whose state the agent attempts to modulate. This modulation is made possible by communicating action variables $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$a_{t}$$\end{document}$ to the environment, based on the current state of the environment at any time t (also called the observation $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$s_{t}$$\end{document}$ ). Due to the implementation of the action, the environment transitions to a new state $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$s_{t+1}$$\end{document}$ . This transition to the new state concurrently results in a reward value $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$r_{t+1}$$\end{document}$ , which is usually based on $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$s_{t}$$\end{document}$ and $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$s_{t+1}$$\end{document}$ . The reward value is quintessentially a quantification of the effect of the action $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$a_t$$\end{document}$ on the state $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$s_t$$\end{document}$ . The result is the formation of a tuple $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\left( s_{t}, a_{t}, r_{t+1}, s_{t+1} \right)$$\end{document}$ . In the current study, the mixing index $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\chi$$\end{document}$ (refer to the previous subsection) constitutes the reward function. The coordinates for the active as well as the passive particles, form the observation space in the current RL framework - the input is in the form of a flattened 1D array { $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$x_{1},y_{1},x_{2},y_{2},\cdots x_{N_p+N_a},y_{N_p+N_a}$$\end{document}$ }. Cyclical interaction between the agent and environment gives rise to a collection of state-action-reward tuples. A large set of possible actions at any state corresponds to a large number of state-action combinations for the active particles, increasing the difficulty of training the agent to guide the environment to an optimal state.

Fig. 2. Panel (a) illustrates the interaction between the reinforcement learning agent and the environment. Here, the agent refers to an Artificial Neural Network (ANN). Action $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$a_{t}$$\end{document}$ transforms the state of the environment from $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$s_{t}$$\end{document}$ to $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$s_{t+1}$$\end{document}$ along with a feedback in the form of a reward $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$r_{t+1}$$\end{document}$ . Panel (b) demonstrates the different steps $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\Gamma$$\end{document}$ by which the RL agent can modulate the direction of motion for the active particles. The orientation of the active particles can be modulated in steps of (i) $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\Gamma =\pi /2$$\end{document}$ (4 possible directions of motion), (ii) $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\Gamma =\pi /4$$\end{document}$ (8 possible directions of motion), and (iii) $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\Gamma =\pi /8$$\end{document}$ (16 possible directions of motion).

In the current RL training module, the orientations of the active particles ( $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\theta _{1},\theta _{2},\theta _{3}$$\end{document}$ ) are set to be the action variables communicated by the agent (see Fig. 2(a)). Although the directional orientations can ideally be set as continuous variables in the range $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$[0, 2\pi )$$\end{document}$ , a discrete action space is chosen for the ease of implementation (significantly faster training due to a smaller action space and negligible difference in the final state of the environment). As a result, when a certain action input is provided to an active particle, the particle continues to move in that direction until it receives another action input from the RL agent. In the purview of the current work, an RL agent-controlled active particle is termed a Smart Active Particle (SAP), and the terms “environment” and “system” are used interchangeably. To allow adequate time for the SAPs to interact and mix the passive particles, the agent transmits action variables to the SAPs every $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\delta$$\end{document}$ time steps (also known as the run duration for the SAPs, inspired by the RT dynamics). The runs are assumed to be ballistic, without any rotational diffusivity, similar to the run-and-tumble particles. After every run duration, each SAP tumbles instantaneously to a new orientation. The tumbled (new) orientations of the active particles are the action variables transmitted from the RL agent, assuming no randomness, for simplicity and ease of training the neural network. As the action space is discrete, the tumbling can occur in steps of $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\Gamma$$\end{document}$ , bringing the number of possible actions for an SAP at any state to $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$2\pi /\Gamma$$\end{document}$ . Therefore, the number of combinations of the possible actions for $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$N_a$$\end{document}$ SAPs in any state amounts to $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$(2\pi /\Gamma )^{N_a}$$\end{document}$ . The lower the value of $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\Gamma$$\end{document}$ or the higher the number of SAPs, the larger the action space, with $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$N_{a}$$\end{document}$ being the greater influence of the two. Figure 2(b) illustrates the three values of $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\Gamma =\{\pi /2, \pi /4, \pi /8\}$$\end{document}$ tested in the current work, corresponding to 4, 8, and 16 possible action directions for each SAP, respectively. Additionally, taking into account a fairly dense passive aggregation, the observation space turns out to be large enough to warrant the use of an ANN with multiple hidden layers for representing the RL agent (with a shared network for both policy and value functions). The parameters of the ANN are randomly initialised and are updated throughout the training process. A detailed description of the RL implementation is discussed in Sec. SI-1 and SI-2 of Supplementary Information, and visualised in Fig. S1 of Supplementary Information.

A MultiLayer Perceptron (MLP) policy with a ReLU activation function is selected, from a wide variety of ANNs, to represent the agent in the RL implementation. ReLU provides nonlinearity to the neural network, enabling it to learn complex mappings between action probabilities and state inputs. Proximal Policy Optimisation (PPO) is used to optimise the parameters of the MLP policy due to its stability in updating network parameters from a clipped surrogate objective function, its capability to manage discrete action spaces, the sample efficiency^61^, and the effectiveness in addressing physical problems related to active matter and optimal navigation. The policy update is executed using the PPO algorithm with the primary aim of maximising the cumulative reward (or minimising a loss function). The approach aggregates a sequence of tuples prior to policy update throughout the experience collection process. The whole reinforcement learning framework is constructed within the OpenAI Gym interface with the stable-baselines3 package in Python. Among the several hyperparameters in PPO, the learning rate is one of the most crucial in influencing the training efficacy in terms of convergence and speed. It controls the extent to which the policy’s weights and parameters are adjusted in response to the computed policy gradient. Preliminary simulations suggest using a learning rate lower than the default value of $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$3 \times 10^{-4}$$\end{document}$ to prevent unstable parameter updates in our policy (see Sec. SI-3, Fig. S2, and Tables S1 and S2 of the Supplementary Information for details on the preliminary simulations and the selection of hyperparameters). Following a detailed investigation of the impact of $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\Gamma$$\end{document}$ and $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\delta$$\end{document}$ in conjunction with the chosen learning rate, and taking into account several hidden layer configurations for the MLP policy, a neural network with hidden layer sizes of (512, 256, 64) has been selected to represent the agent (refer to Sec. SI-3 and Figs. S3 through S5 of Supplementary Information for details).

Results

Prior to examining the dynamics of SAPs and their efficacy in mixing the segregated passive system, it is pertinent to consider a baseline case without learning, where mixing arises exclusively from the stochastic driving of the active particles. Run-and-tumble particles serve as a useful model for examining self-propelled motion, particularly in the context of translational applications employing microrobots. Therefore, the RT dynamics serve as a paradigm for the development of SAPs, which can be programmed to perform certain tasks in designated environments. The upcoming subsection analyses the influence of active particles following RT dynamics on the passive species in a confined circular domain, which is subsequently contrasted against SAPs.

Mixing by run-and-tumble particles

Inspired by several microscopic organisms, run-and-tumble (RT) dynamics is one of the most widely accepted models describing the motion of active particles and can serve as a benchmark for highlighting the mixing performance of active particles. Figure 3(a) provides a visual depiction of a typical mixing scenario where the active particles follow RT dynamics interacting with the stratified binary passive system over a period of time ( $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$5\times 10^{6}$$\end{document}$ time steps). The presented case involves a mean run duration of $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\tau _m=2\times 10^{3}$$\end{document}$ time steps and the angle of tumble uniformly distributed in $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$[0, 2\pi )$$\end{document}$ . A gradual mixing can be observed in the series of time progression snapshots of the system, arising from interactions among the active particles and the two passive species. Figure 3(b) showcases the spatial mapping of the locations of the RT particles, coloured by time stamp. It is clear that all the active particles explore the entire domain, sans the region immediately adjacent to the wall. Simulating random tumbles, as is the case with RT particles, the active particles barely perturb the passive particles along the wall, which explains the absence of tumbling events adjacent to the wall. Although Fig. 3 demonstrates the features of a representative case, qualitatively similar behaviour is observed across multiple realisations and different simulation parameters. To understand the effect of the RT particles on the behaviour of the passive particles, the trajectories of three passive particles from different locations (centre, off-centre, and next to the wall) are showcased in Figs. 4(a(i–iii)) over a sufficiently long simulation time ( $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\approx 10^{8}$$\end{document}$ time steps). Following the time signature in the form of the colour gradient, it is observed that the passive particles are driven through the domain in a random fashion. To bolster these observations, Fig. 4(b) illustrates the kernel density estimation (KDE) plot using a Gaussian kernel to estimate a smooth function of the probability of finding passive particles at any location given a long time window ( $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\tau =[0,10^{8}]$$\end{document}$ time steps). The plot takes into consideration the positions of the passive particles sampled over a long timescale, and the Gaussian kernel then estimates a smooth probability density from the discrete particle data by placing Gaussian kernels centred at each sample point. The resulting value at any location is proportional to the probability of finding any passive particle in that region over the given time window. The highest probability is observed close to the domain boundary, primarily due to the increased residence time of particles near the wall (see Sec. SI-4 and Fig. S6 of the Supplementary Information for more details), as any passive particles next to the boundary require a strong inward push to re-enter the bulk. The only condition that permits an inward push is when active particles are wedged between peripheral passive particles and the wall particles, an event that is rarely observed in the current work. For the most part, active particles are observed to move these passive particles along the wall. The probability data also reaffirms an overall stochastic motion for the passive particles in the majority of the domain (similar probability values pointing towards a uniform distribution). The displacement of the passive particles is more pronounced away from the boundaries; hence, the ones travelling along the confinement exhibit minimal radial shifts. An extensive quantitative measurement of the mixing performance of the RT particles has been reported for a range of mean run durations in the upcoming subsection.Fig. 3. Panel (a) displays snapshots of the system at different time instances $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\tau$$\end{document}$ . The active particles are coloured red, while the two passive species are coloured orange and blue, respectively. Panel (b) showcases the locations (coloured by time stamp) of the active particles following run-and-tumble (RT) dynamics. (Note: Both the panels involve depiction of a representative simulation, and qualitatively similar features are observed across multiple realisations.).Fig. 4. The trajectories of three representative passive particles are displayed to showcase their long-term behaviour, on interaction with run-and-tumble active particles (time evolution is represented through a colour gradient). The passive particles are chosen based on their initial positions in the domain: (a-i) located somewhat off-centre, (a-ii) centrally located, and (a-iii) located next to the wall. The starting point of these passive particles is marked with red disks, while the final position (i.e., $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\tau =10^8$$\end{document}$ time steps) is marked with cyan disks. The arrows represent the direction of motion of the particles at any point. Panel (b) presents the kernel density estimate (KDE) plot of the spatial probability distribution of positions occupied by the passive particles over a long time window ( $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\tau =[0, 10^8]$$\end{document}$ ), demonstrating a higher residence time near the wall (see Sec. SI-4 and Fig. S6 of the Supplementary Information).

Mixing using trained active particles

Although conventional RT-based simulations can achieve mixing in the segregated passive system, there is scope for improvement, particularly with stochastic inputs that induce directional changes. Therefore, an RL framework, as explained in the Methods section, is employed to train an agent (ANN) to make informed orientational decisions for the SAPs’ movements, thereby promoting mixing.

Fig. 5. Time progression of the mixing of the two passive species enabled by smart active particles (SAPs) controlled by a non-trained agent (top panel) and a trained agent (bottom panel) is presented through snapshots of the domain. The SAPs (coloured red) can only move in the four cardinal directions ( $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\Gamma =\pi /2$$\end{document}$ ), and are set to tumble instantaneously every $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\delta =2\times 10^{3}$$\end{document}$ time steps. (Note: For the bottom panel, the RL agent has been trained for $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$10^{6}$$\end{document}$ agent-environment interactions. Both the panels are representative of fifty test episodes.).

The RL training architecture (refer to the Methods section and Sec. SI-1 and SI-2 of Supplementary Information) requires two primary inputs for the smart active particles (SAPs): the tumble step $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\Gamma$$\end{document}$ , and the run duration $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\delta$$\end{document}$ . In the current section, the training of the agent is carried out with $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\Gamma =\pi /2$$\end{document}$ and $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\delta =2\times 10^{3}$$\end{document}$ time steps. The reasoning behind the selection of these specific values has been described in Sec. SI-3, and Figs. S3 through S5 of Supplementary Information. If a SAP is controlled by a trained agent, it is referred to as a trained SAP (TSAP). If controlled by a non-trained agent, the SAP is referred to as a non-trained SAP (NTSAP). Figure 5 qualitatively compares the progression in mixing among the passive particles on interaction with the NTSAPs (top panel) and the TSAPs (bottom panel). In the case of the NTSAPs, the mixing is random and doesn’t follow any pattern. However, in the case of TSAPs, the snapshots reveal the presence of a global clockwise swirl among the passive particles (more details are provided in the upcoming sections). Another point to note is the enhanced positional shift of passive particles near the wall in the presence of TSAPs, indicating improved mixing capabilities of TSAPs compared to those of NTSAPs. Irrespective of training, the mixing also results in the formation of empty spaces (voids) in the system (see $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\tau =5\times 10^6$$\end{document}$ of Fig. 5).

To glean further insights into the mixing phenomena, the locations of the tumbling events of the SAPs have been mapped as shown in Fig. 6(a). It is evident that a non-trained RL agent prescribes actions which tend to move the active particles randomly across the domain. On the contrary, TSAPs move strategically, focusing on a part of the domain to enhance mixing. The trajectories for the TSAPs are presented over a short time window ( $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$5\times 10^4$$\end{document}$ time steps) in Fig.6(b), which supports the hypothesis about the constrained motion of the SAPs, characterised by sharp turns ( $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\Gamma =\pi /2$$\end{document}$ ). The perturbations on the path arise due to collisions with the passive particles and other TSAPs. However, once they reach the periphery, they follow the wall for a duration due to the boundedness and pertaining values of $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\Gamma$$\end{document}$ and $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\delta$$\end{document}$ . A quantitative measure of mixing performance is computed using a mixing index defined a priori in the Methods section. Figure 6(c) compares the temporal variation in the mixing index $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\chi$$\end{document}$ (ensemble average over fifty test episodes), when mixing is carried out using NTSAPs and TSAPs for a tumble step $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\Gamma =\pi /2$$\end{document}$ and a run duration $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\delta =2\times 10^{3}$$\end{document}$ . In both cases, the mixing index increases with time until it reaches a steady-state value. However, a trained agent demonstrates an enhanced mixing of the binary passive system, based on the absolute mixing index value and the time taken to reach it. The steady-state mixing index for the trained RL agent attains a value around $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\chi =0.96$$\end{document}$ , whereas that for the non-trained case stands at $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\chi =0.90$$\end{document}$ . The inset of Fig. 6(c) illustrates the temporal variation in $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\chi$$\end{document}$ for mixing using RT particles with different mean run durations $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\tau _{m}$$\end{document}$ , each curve averaged over a hundred realisations. The inset clearly depicts a saturation value close to $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\chi =0.90$$\end{document}$ for the RT particles with mean run durations beyond $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\tau _m=2\times 10^{3}$$\end{document}$ , thereby defining an upper limit to the ability of the RT particles to induce mixing in the binary passive system. The use of a trained RL agent can augment the mixing of the passive system beyond these RTP-based limits, with a substantially simplified approach (restricting the motion of the SAPs to discrete steps in the four cardinal directions).Fig. 6. Panel (a) illustrates the locations of the tumbling events of the SAPs in the case of (i) a non-trained SAP (NTSAP), and (ii) a trained SAP (TSAP) for a representative test episode. Panel (b) demonstrates the trajectories of three TSAPs for a period of $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$5\times 10^{4}$$\end{document}$ time steps, to further highlight the motion of the TSAPs being constrained to only a section of the domain, unlike the NTSAPs. Panel (c) delineates the temporal evolution of the mixing index $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\chi$$\end{document}$ averaged over fifty test episodes, comparing mixing induced by NTSAPs and that by TSAPs. The inset in panel (c) represents the change in $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\chi$$\end{document}$ with time $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\tau$$\end{document}$ for RT particles with different mean run durations $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\tau _{m}$$\end{document}$ (refer to Eq. 3). All the panels are depicted for parameters $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\Gamma =\pi /2$$\end{document}$ and $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\delta =2\times 10^{3}$$\end{document}$ . (Note: NTSAP refers to a SAP receiving inputs from a non-trained RL agent, while TSAP is a SAP controlled by a trained RL agent.).

Probability of finding dissimilar neighbours

The previous sections have been concerned with the “microscopic” behaviour of the passive particles; however, it is equally important to understand the ”macroscopic” implications of the actions undertaken by the TSAPs. A well-mixed system, in the context of this passive system, can be postulated to have an equal number of particles of similar and opposite species/type surrounding each passive particle (within the radius $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$r_{c}$$\end{document}$ ). To test this hypothesis, the probability distribution of $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\zeta$$\end{document}$ , represented as $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$p(\zeta )$$\end{document}$ , is computed where the SAPs are not taken into account. In a perfectly mixed dense system with a large number of particles, the histogram should peak at $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\zeta =0.5$$\end{document}$ with $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$p(\zeta )=1$$\end{document}$ , and $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$p(\zeta ) \quad \forall \quad \zeta \ne 0.5$$\end{document}$ assuming a value of 0.

Figure 7 compares the temporal evolution in the probability density function of $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\zeta$$\end{document}$ , starting from the same initial distribution (Fig. 7(a)), when using NTSAPs (Fig. 7(b)) and TSAPs (Fig. 7(c)) for mixing the passive system. Due to the initially stratified state of the system, all passive particles apart from those at the interface of the two passive species have neighbours of a similar type. Therefore, $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$p(\zeta )$$\end{document}$ has a peak ( $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\approx 0.6$$\end{document}$ ) at $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\zeta = 0$$\end{document}$ at $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\tau =0$$\end{document}$ . As time progresses, the SAPs start agitating and mixing the system, causing the peak of $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$p(\zeta )$$\end{document}$ to shift towards higher $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\zeta$$\end{document}$ values. However, a clear distinction can be made between the performance of the NTSAPs and the TSAPs, at $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\tau =1.5\times 10^6$$\end{document}$ time steps. Figure 7(c), illustrating mixing effected by TSAPs, shows a higher probability of finding more dissimilar neighbours around each passive particle, compared to a system where mixing is actuated by NTSAPs. At $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\tau =5\times 10^6$$\end{document}$ time steps, when the mixing index $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\chi$$\end{document}$ has saturated (see Fig. 6(c)), the system involving TSAPs outperforms its counterpart, as evidenced by a higher probability of finding more dissimilar particles (the peaks in Figs. 7(b-ii) and 7(c-ii) occur at $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$p(\zeta =0.50) = 0.09$$\end{document}$ and $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$p(\zeta =0.50) = 0.12$$\end{document}$ , respectively). The distribution of $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\zeta$$\end{document}$ in the system with the TSAPs closely resembles a Gaussian distribution with a mean at $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\zeta =0.47$$\end{document}$ . Furthermore, on fitting the observed histograms at $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\tau =5\times 10^6$$\end{document}$ to a Gaussian distribution, the kurtosis points at platykurtic distributions (negative excess kurtosis). A significantly higher negative excess kurtosis is observed in the case of the system with NTSAPs (excess kurtosis of $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$-0.46$$\end{document}$ ) compared to that of TSAPs (excess kurtosis of $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$-0.05$$\end{document}$ ), indicating a flatter distribution in the former.Fig. 7. The temporal evolution of the probability distribution of $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\zeta =\frac{n_{o}}{n}$$\end{document}$ (i.e., the ratio of the passive neighbours of dissimilar type/species $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$n_o$$\end{document}$ to the total of passive neighbours n surrounding any passive particle within the radius $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$r_c$$\end{document}$ ) with (a) initial probability distribution at $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\tau =0$$\end{document}$ is compared for (b) a system involving NTSAPs, and (c) a system involving TSAPs. The time stamps during evolution are at $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\tau =0$$\end{document}$ , $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\tau =1.5\times 10^{6}$$\end{document}$ , and $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\tau =5\times 10^{6}$$\end{document}$ , respectively. The systems using TSAPs for mixing perform better than their counterpart, as shown by the higher probability of finding more dissimilar neighbours at $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\tau =1.5\times10^6$$\end{document}$ and $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\tau =5\times 10^6$$\end{document}$ , and by the sharper nature of the curve for the bottom panel at $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\tau = 5 \times 10^6$$\end{document}$ .

Trajectories for the passive particles with TSAPs

Visualisation and critical analysis of the dynamics of the passive particles is crucial to understanding the mixing performance of SAPs. Figure 5 provides some insights into the motion of the passive particles as a whole. Visual inspections of the passive system displayed a clockwise swirl about the domain centre in all passive particles except those near the wall (which undergo motion in the anti-clockwise direction). To highlight the aforementioned rotational motion, three passive particles were chosen from three different locations (similar to the selection used in Fig. 4) to study their trajectories. Episodic simulations with long time scales are carried out employing TSAPs for mixing until $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\tau = 2\times 10^{8}$$\end{document}$ time steps. The trajectories of the three representative passive particles are plotted in Fig. 8(a). These particles are observed to move in roughly circular trajectories around the domain centre, until they reach the area where the TSAPs are active (see Fig. 5(a-ii)). In the episode presented in Fig. 8, the passive particles adjacent to the wall move in a counter-clockwise (CCW) fashion, whereas those in the interior regions of the domain perform clockwise (CW) motion. It is also observed that the particles in the path adjacent to the outermost path (band 2 in Fig. 8(b)) have a propensity to execute motion in either of the two directions (CW or CCW). Such behaviour can be explained by envisioning the motion of these particles as mimicking that of particles trapped between shear layers (moving in opposite directions). To corroborate our findings, basic computational fluid dynamics simulations are carried out to recreate a similar mixing behaviour. The Eulerian simulations utilise two identical fluids, differentiated only by colour, with similar initial conditions to the Lagrangian system. Fluid motion (and thereby, mixing) was induced by imposing a constant surface speed on the periphery of an elliptical disk with dimensions close to the operating region of the SAPs. The computational results exhibit analogous mixing dynamics to those observed in the particle system and are further discussed in Sec. SI-5 and Fig. S7 of Supplementary Information.

The behaviour of the passive particles reported in Fig. 8(a) is representative of the passive particles in all the episodes. This indicates an optimal mixing strategy where the active particles induce a circular motion around the domain centre to promote mixing among initially segregated passive particles. Moreover, the passive particles also have a transverse (radial) component of motion as they move along the circular path. Switching between the concentric paths is predominant in the region where the TSAPs are active (see Fig. 5(a-ii)) due to frequent collisions with the moving SAPs. To analyse the directionality of the circular motion of the passive particles over a complete trajectory, the domain has been divided into several bands (see Fig. 8(b); bands are numbered from 1 to 6, 1 being the outermost). Following the trajectory data from numerous passive particles over multiple episodes (7 distinct concentric trajectories are observed on superposing all the passive positional data), each band is assumed to have a radial width of 2r. Figure 8(b) illustrates the trajectory of a typical particle starting from the periphery (marked by the red disk). The density distribution of the angular velocities within these bands is elucidated in Fig. 8(c). In the inner bands (bands 3–6), the particle velocities are predominantly in the CW direction (assumed to be negative angular velocity $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\omega$$\end{document}$ ; majority of the distribution has $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\omega < 0$$\end{document}$ ). In band 2, the passive particles are almost equally likely to have a CW or CCW bias in their motion, while in band 1, there is a clear bias towards CCW motion ( $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\omega>0$$\end{document}$ ). The combined CW and CCW motion in the different regions of the domain culminates in an efficient mixing of the two passive species. Altering the initial distribution of particles produces similar clockwise (CW) and counter-clockwise (CCW) motion patterns, although the dominant direction of motion (CW or CCW) for particles in the interior and exterior regions can change depending on these initial conditions. Additionally, changes in initial conditions can influence whether a band favours CW or CCW motion.Fig. 8. Panel (a) illustrates the motion trajectories for three representative passive particles selected based on their initial positions: (a-i) off-centre, (a-ii) centre, and (a-iii) adjacent to the wall. The colour bar indicates the time until a maximum of $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\tau =10^{8}$$\end{document}$ time steps. The red and the cyan disks represent the initial and final positions of the passive particle in each panel. (Note: Each trajectory is expressed from the positions of the passive particle in steps of $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$10^{5}$$\end{document}$ time steps.) Panel (b) features the trajectory (for $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$10^{8}$$\end{document}$ time steps) for a passive particle originating from the periphery (red disk). The entire domain is divided into six concentric regions (excluding the central area), numbered $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$1-6$$\end{document}$ , starting from the outermost band, with each concentric band having a radial width of 2r. Panel (c) showcases the distribution of average angular velocity $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\omega$$\end{document}$ of the passive particles in each of the bands presented in panel (b) in a system mixed using TSAPs. Each simulation is carried out until $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\tau =10^{8}$$\end{document}$ time steps, and the distribution takes into account data from fifty test episodes, considering all passive particles. The black dashed line represents zero angular velocity ( $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\omega =0$$\end{document}$ ).

Effect of \documentclass[12pt]{minimal}

			\usepackage{amsmath}
			\usepackage{wasysym} 
			\usepackage{amsfonts} 
			\usepackage{amssymb} 
			\usepackage{amsbsy}
			\usepackage{mathrsfs}
			\usepackage{upgreek}
			\setlength{\oddsidemargin}{-69pt}
			\begin{document}$$\Gamma$$\end{document} and \documentclass[12pt]{minimal}
			\usepackage{amsmath}
			\usepackage{wasysym} 
			\usepackage{amsfonts} 
			\usepackage{amssymb} 
			\usepackage{amsbsy}
			\usepackage{mathrsfs}
			\usepackage{upgreek}
			\setlength{\oddsidemargin}{-69pt}
			\begin{document}$$\delta$$\end{document} on mixing performance

The run duration $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\delta$$\end{document}$ and the angle of the tumble $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\theta = k\Gamma$$\end{document}$ ( $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$k=0,1,2\cdots$$\end{document}$ ) are deemed to be important input parameters governing the active particle dynamics and closely associated with the policy updates during the training of the RL agent. Figure 9 outlines the effect of $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\delta$$\end{document}$ and $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\Gamma$$\end{document}$ in the temporal variation of the mixing index $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\chi$$\end{document}$ , when the binary passive system is mixed using trained SAPs. The training of the RL agent for all the combinations of $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\delta$$\end{document}$ and $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\Gamma$$\end{document}$ is carried out using a neural network (NN) architecture of hidden layer size (512, 256, 64) and a learning rate of $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$10^{-6}$$\end{document}$ . Other hyperparameters used in the PPO algorithm are set to their default values as defined in the stable-baselines3 package (see Table S1 of Supplementary Information). It is evident from any of the panels (each panel represents a different $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\delta$$\end{document}$ ) that $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\Gamma =\pi /2$$\end{document}$ , despite allowing for minimal directional options for the movement of the active particles, is sufficient to induce mixing. Considering a lower $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\Gamma$$\end{document}$ , such as $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\Gamma =\pi /4$$\end{document}$ or $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\Gamma =\pi /8$$\end{document}$ , increases the angular action space for each active particle, thereby increasing the complexity of the training stage of the RL agent. Meanwhile, the corresponding improvement in the mixing index is nominal in most cases, with adverse effects being observed in certain cases (see Fig. 9(a)).

Fig. 9. The effect of the tumble step $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\Gamma$$\end{document}$ and run duration $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\delta$$\end{document}$ of the SAPs is demonstrated with respect to mixing actuated by a trained RL agent. The temporal variation in the mixing index $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\chi$$\end{document}$ is illustrated at different parametric combinations of $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\delta$$\end{document}$ and $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\Gamma$$\end{document}$ . Panels (a) through (d) present the mixing index variation for different run durations $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\delta =0.5\times 10^{3}$$\end{document}$ time steps, $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\delta =10^{3}$$\end{document}$ time steps, $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\delta =2\times 10^{3}$$\end{document}$ time steps, and $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$3\times 10^{3}$$\end{document}$ time steps, respectively. In each panel, $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\Gamma$$\end{document}$ is varied in steps of $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\pi /2$$\end{document}$ , $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\pi /4$$\end{document}$ , and $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\pi /8$$\end{document}$ (blue, orange, and green curves, respectively). (Note: Each curve is averaged over fifty test episodes. The learning rate for the training of the RL agent is set to $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$10^{-6}$$\end{document}$ , and other hyperparameters for the RL algorithm are set to default values defined in the stable-baselines3 library (refer to Table S1 of Supplementary Information).

On the other hand, the $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\delta$$\end{document}$ values used in the training of the RL agent are selected on the basis of the simulations involving the RT particles (see inset of Fig. 6(c)). It is apparent from Fig. 9(a) that too low a run duration can lead to sub-optimal mixing even using trained SAPs, if finer $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\Gamma$$\end{document}$ values are chosen. However, with $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\Gamma =\pi /2$$\end{document}$ , the mixing index $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\chi$$\end{document}$ saturates to similar peak values ( $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\chi \approx 0.96$$\end{document}$ ), irrespective of the run duration of the SAPs, except at $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\delta =0.50\times 10^{3}$$\end{document}$ for which $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\chi \approx 0.89$$\end{document}$ (value at $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\tau =5\times 10^6$$\end{document}$ ). At $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\delta =10^{3}$$\end{document}$ and $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\delta =2\times 10^{3}$$\end{document}$ , the influence of $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\Gamma$$\end{document}$ is trivial, as all the curves saturate to a similar value following a similar trend (see Fig. 9(b–c)). However, the variation in the mixing index is smoother in the latter one, and a higher mixing index is obtained at an early stage (hence, a marginal improvement in mixing). Further increase in $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\delta$$\end{document}$ is found to be detrimental to the mixing performance of the SAPs.

Discussion

The current study demonstrates the use of Reinforcement Learning (RL) to train and manage a collection of smart active particles (SAPs) in a high-dimensional state-action space to achieve an optimal mixing between two initially segregated passive species. The forces exerted by the active particles on collision drive the passive particles. The mixing among the two passive species is quantified through a mixing index $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\chi$$\end{document}$ , which is a function of the number fraction of passive particles of opposite species, calculated locally. Extensive simulations show that a discrete action space with SAP movement limited to only four directions is sufficient for efficiently mixing the passive species. Even with the intrinsic nonlinearity resulting from inter-particle collisions, the MLP policy, employed to capture the best state-action pairs, along with the PPO algorithm (which optimises the RL agent parameters), effectively mimics the coordination among the SAPs. The current work primarily highlights the motion of the SAPs leading to an optimally mixed passive mixture. The operating area of the SAPs in such cases is observed to be fairly restricted to a small area offset from the domain centre, rather than dispersing across the domain, promoting a circular motion among the passive particles about the domain centre. An analogous Eulerian model involving an elliptical-shaped mixer (an ellipse-shaped void with a constant surface speed) positioned eccentrically in the domain yields a similar area fraction distribution with two immiscible fluids as that observed among the passive particles (refer to Sec. SI-5 and Fig. S7 of the Supplementary Information). To demonstrate the efficacy of the active particles controlled by a trained agent, the mixing induced by a set of run-and-tumble (RT) particles with similar tumble angles and run durations has been analysed as a baseline study. From the analysis, the peak $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\chi$$\end{document}$ increased from 0.9 for an RT-based system to 0.96 for an RL-based system in a much shorter time frame than the former ( $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\approx 56\%$$\end{document}$ reduction in time to reach $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\chi =0.9$$\end{document}$ ), highlighting an improved mixing of the binary passive system. It is also noted that the dynamics of the RT particles provide findings identical to those generated by NTSAPs (see Sec. SI-6 and Fig. S8 of Supplementary Information). Apart from the mixing index, the mixing performance of the active particles has also been quantified through the probability distribution $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$p(\zeta )$$\end{document}$ , $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\zeta$$\end{document}$ being the ratio of the number of particles of opposite species to the total number of particles around each passive particle in a specified radius $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$r_{c}$$\end{document}$ . Using trained SAPs to mix the system yields a Gaussian probability distribution with a peak value close to $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\zeta =0.5$$\end{document}$ , which corresponds to an equal number of particles of opposite and similar types. At the same time, the distribution in the case of non-trained SAPs exhibits a flatter peak with more negative excess kurtosis, exhibiting an inability to optimise the mixing in the system. To generalise the findings of the work, training and testing of multiple systems with different initial positional distributions for the active and passive particles have been conducted. The obtained results strongly support the applicability of the same RL framework, regardless of the initial particle positions (refer to Sec. SI-7 and Fig. S9 of the Supplementary Information).

As the reward system defined in the current work focuses on maximising the mixing index, the RL agent finds an optimum at $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\zeta =0.5$$\end{document}$ . However, a more intricate reward system with additional or different goals can also be tried to assess the effectiveness of integration. A multi-objective reward with suitable weights that balances complementary objectives, such as spatial dispersion, rate of mixing, and penalties for same-species clustering, is a worthwhile future scope to examine. The SAPs used in the current work have constant self-propulsion speeds, which can be added as another parameter to be controlled through continuous or discrete inputs from the agent. It is important to note, however, that any such endeavour will require major improvements to the neural network architecture used in the current work. This is also true when a larger number of SAPs are used in tandem, which would otherwise lead to subpar improvement or inferior performance compared to non-trained SAPs. Moreover, all the particles interact through an inter-particle collision drive. The current approach of using SAPs can be integrated with existing macroscale techniques involving electric or magnetic fields^62–64^. In such cases, attractive and repulsive forces can also be incorporated in the particle dynamics, and the strength of the field can be controlled in tandem with the motion of the SAPs to hasten the mixing process. Such a setup can also be used to segregate a mixed system.

The system described in the current work can be experimentally realised by substituting the SAPs with either light-activated Janus particles (externally controlled) or micro-robots (with either on-board or off-board actuation mechanisms), which exhibit comparable motion characteristics. Fluorescent labelling can be used for real-time tagging/tracking of the motion of the passive species and to distinguish between two or more species. An advantage of the current RL framework is its ease of adaptation to the intended active-passive experimental realisations, owing to its modular design. However, experimental realisations of such RL frameworks often struggle with issues related to real-time processing of large volumes of data, which can result in delayed feedback and inaccurate state inferences. Additionally, data may be noisy or partial, which makes it challenging for RL algorithms to accurately determine the true state of the environment required to make the best decisions. Leveraging a trained RL model from a minimalist model, as implemented here, can serve as a robust initialiser, significantly accelerating optimisation in the real environment compared to training a policy entirely from scratch.

Furthermore, the findings reported in this work have significant implications for the study of controllable active matter systems. Due to the generality of the micro-robotic system, the same system can be applied to various fields by simply redefining the agent’s input parameters, reward function, and governing equations, while selecting the appropriate neural networks and hyperparameters. By adjusting the objective function, these smart active particles can be used in a variety of fields, such as targeted drug delivery^65^, microswimmer-based mixing^66^, smart navigation in colloidal and complex environments^67^, and granular mixing and segregation^68^, where active particles interact and regulate passive entities. To enhance the realism of the current work, a logical next step could be the incorporation of polydispersity in the particle properties. In this scenario, the RL agent must be aware of both the particle positions and their sizes, as well as their identities. Such modification, accompanied by the refinement of the reward mechanism and the agent architecture, can further the functionality and practicality of the SAPs. Augmenting the system to three dimensions could enhance its versatility; however, this would come at the cost of substantially increasing mobility, interactions, and trajectories, necessitating three-dimensional spatial information for training the agent, which in turn translates to much higher computational costs. Finally, by offering an adaptable framework that can be adjusted for active particles involved in multi-body interactions, the current work promotes the integration of conventional active matter theory with powerful reinforcement learning techniques.

Supplementary Information

Supplementary Information.

Bibliography7

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1Mezey, D. et al. Purely vision-based collective movement of robots. npj Robotics 3, 11 (2025).
2Mototake, Y.-i., Ishida, S., Maruyama, N. & Ikegami, T. Topological data analysis of large swarming dynamics. In 37th Conference on Neural Information Processing Systems (Neur IPS 2023) (2023).
3Antony, S., Roy, R. & Bi, Y. Q-learning: Solutions for grid world problem. In Artificial Intelligence XL: 43rd SGAI International Conference on Artificial Intelligence, AI 2023, Cambridge, UK, December 12–14, 2023, Proceedings, 14381, 266 (Springer Nature, 2023).
4Tijsma, A. D., Drugan, M. M. & Wiering, M. A. Comparing exploration strategies for q-learning in random stochastic mazes. In 2016 IEEE symposium series on computational intelligence (SSCI), 1–8 (IEEE, 2016).
5Lee, M., Szuttor, K. & Holm, C. A computational model for bacterial run-and-tumble motion. The J. Chem. Phys. 150 (2019).10.1063/1.508583631067902 · doi ↗ · pubmed ↗
6Junot, G. et al. Run-to-tumble variability controls the surface residence times of e. coli bacteria. Phys. Rev. Lett. 128, 248101 (2022).10.1103/Phys Rev Lett.128.24810135776449 · doi ↗ · pubmed ↗
7Schulman, J., Wolski, F., Dhariwal, P., Radford, A. & Klimov, O. Proximal policy optimization algorithms. ar Xiv preprintar Xiv:1707.06347 (2017).