Safe Continuous-time Multi-Agent Reinforcement Learning via Epigraph Form

Xuefeng Wang; Lei Zhang; Henglin Pu; Husheng Li; Ahmed H. Qureshi

arXiv:2602.17078 · cs.MA · February 20, 2026


Open Access · 3 Reviews

TL;DR

This paper introduces a continuous-time multi-agent reinforcement learning framework that enforces safety constraints through an epigraph reformulation and optimization based on physics-informed neural networks (PINNs), reporting more stable training and better performance than existing safe MARL methods.

Contribution

The paper proposes a continuous-time constrained MDP (CT-CMDP) formulation, an epigraph-based reformulation of it, and a PINN-based actor-critic method for safe continuous-time MARL.

Findings

01 Smoother value function approximations achieved.

02 More stable training observed in experiments.

03 Improved performance over existing safe MARL methods.

Abstract

Multi-agent reinforcement learning (MARL) has made significant progress in recent years, but most algorithms still rely on a discrete-time Markov Decision Process (MDP) with fixed decision intervals. This formulation is often ill-suited for complex multi-agent dynamics, particularly in high-frequency or irregular time-interval settings, leading to degraded performance and motivating the development of continuous-time MARL (CT-MARL). Existing CT-MARL methods are mainly built on Hamilton-Jacobi-Bellman (HJB) equations. However, they rarely account for safety constraints such as collision penalties, since these introduce discontinuities that make HJB-based learning difficult. To address this challenge, we propose a continuous-time constrained MDP (CT-CMDP) formulation and a novel MARL framework that transforms discrete MDPs into CT-CMDPs via an epigraph-based reformulation. We then solve…
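The epigraph idea named in the abstract can be illustrated on a toy static problem. The sketch below is not the paper's algorithm: it assumes a finite set of candidate actions with a scalar cost and a scalar constraint value each, and all names (`solve_direct`, `inner_value`, `solve_epigraph`) are hypothetical. It shows that minimizing the cost subject to `cons <= 0` gives the same optimum as finding the smallest auxiliary variable `z` for which the unconstrained inner problem `min_u max(cons(u), cost(u) - z)` is non-positive.

```python
import numpy as np


def solve_direct(cost, cons):
    """Constrained form: minimize cost over candidates with cons <= 0."""
    feasible = cons <= 0.0
    if not feasible.any():
        return None  # no candidate satisfies the constraint
    return float(cost[feasible].min())


def inner_value(z, cost, cons):
    """Inner problem of the epigraph form: min_u max(cons(u), cost(u) - z)."""
    return float(np.maximum(cons, cost - z).min())


def solve_epigraph(cost, cons, z_lo, z_hi, tol=1e-8):
    """Outer problem: smallest z in [z_lo, z_hi] with inner_value(z) <= 0.

    inner_value is non-increasing in z, so bisection on z is valid.
    """
    if inner_value(z_hi, cost, cons) > 0.0:
        return None  # even the largest z is infeasible
    lo, hi = z_lo, z_hi
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if inner_value(mid, cost, cons) <= 0.0:
            hi = mid  # mid is feasible, tighten from above
        else:
            lo = mid  # mid is infeasible, tighten from below
    return hi


# Toy problem (an assumption for illustration): scalar action u on a grid,
# cost (u - 1.5)^2, and constraint u^2 - 1 <= 0, i.e. |u| <= 1.
u = np.arange(-200, 201) / 100.0
cost = (u - 1.5) ** 2
cons = u ** 2 - 1.0

direct = solve_direct(cost, cons)
z_star = solve_epigraph(cost, cons, z_lo=0.0, z_hi=20.0)
print(direct, z_star)

# If every candidate violates the constraint, both forms report infeasibility.
cons_bad = cons + 5.0
print(solve_direct(cost, cons_bad), solve_epigraph(cost, cons_bad, 0.0, 20.0))
```

The design point is that the inner problem carries no explicit constraint: the constraint value and the shifted cost are merged through a `max`, and the constraint is recovered by the scalar search over `z`.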

Peer Reviews

Decision · ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

1. Addresses a real gap: explicit continuous-time safe MARL (state constraints) with a principled epigraph–HJB route and viscosity-solution framing.
2. Clean derivations: Lem. 3.1/3.2 and Thm. 3.3 tightly connect DPP to the PDE with the $\ln\gamma$ term.
3. Practical training design: $z^*$ computed during training; $z$-independent critics simplify deployment.
4. Broad evaluation & ablations: consistent gains on CT-MPE/MA-MuJoCo; ablations clarify each loss term’s effect (target/VGI $>$

Weaknesses

1. Feasibility of the outer problem: Eq. (8) requires $\max\{V_{\text{cons}}(x), V_{\text{ret}}(x)-z\}\le 0$. If $V_{\text{cons}}(x)>0$, no $z$ is feasible; the paper clips $z^*$ to $[z_{\min}, z_{\max}]$ (Sec. 3.2.2), which does not restore feasibility or preserve the theory.
2. Model-bias vs. safety: residual/advantage (Eqs. 10, 17) use learned $f_\xi, l_\phi$; PDE guarantees hold for true dynamics but not for the surrogate, and no robustness link to violation rates is provided.
3. “Contin
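The feasibility concern in item 1 can be checked numerically. The following is a minimal sketch using the reviewer's notation, assuming $V_{\text{cons}}$ and $V_{\text{ret}}$ are plain scalars that do not depend on $z$; the function `outer_z` and the example numbers are hypothetical and are not taken from the paper.

```python
def outer_z(v_cons, v_ret, z_min, z_max):
    """Pick z for the condition max(v_cons, v_ret - z) <= 0, then clip it.

    Without clipping, the smallest z with v_ret - z <= 0 is z = v_ret.
    Returns the clipped z and whether the condition holds at that z.
    """
    z_clipped = min(max(v_ret, z_min), z_max)
    feasible = max(v_cons, v_ret - z_clipped) <= 0.0
    return z_clipped, feasible


# Constraint satisfied and v_ret inside the clip range: feasible.
safe_case = outer_z(v_cons=-0.1, v_ret=2.0, z_min=0.0, z_max=5.0)

# Constraint violated: the max is at least v_cons > 0 for every z,
# so clipping z cannot make the condition hold.
unsafe_case = outer_z(v_cons=0.3, v_ret=2.0, z_min=0.0, z_max=5.0)
no_z_works = all(max(0.3, 2.0 - z) > 0.0 for z in (-10.0, 0.0, 2.0, 5.0, 100.0))

# Constraint satisfied but v_ret above z_max: clipping itself breaks feasibility.
clipped_case = outer_z(v_cons=-0.1, v_ret=7.0, z_min=0.0, z_max=5.0)

print(safe_case, unsafe_case, no_z_works, clipped_case)
```

Under these assumptions, the second and third cases show two distinct ways the clipped $z$ can fail the outer condition: a positive constraint value that no $z$ can offset, and a return value outside the clip range.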

Reviewer 02 · Rating 4 · Confidence 3

Strengths

1. The motivation behind this work is interesting and necessary.
2. The theoretical proofs provided in this work are comprehensive.

Weaknesses

1. In the “Related work” section, what are the differences between this work and [1]? I feel that this work merely extends the ideas of [1] to the continuous-time scenario. I suggest that the authors restate the contributions of this paper.
2. In the “Methodology” section: ① Fig. 1 is incorrect. According to the description of Algorithm 1, at least three steps are not independent but form an integrated loop. An arrow from Learning to Rollout should be added. ② I don't quite understan

Reviewer 03 · Rating 4 · Confidence 4

Strengths

- To the best of my knowledge, the use of the epigraph form for continuous-time safe MARL is novel.
- The experimental results on MPE show that EPI has strong performance compared to existing methods. On MuJoCo, it is not yet clear how EPI's performance in terms of cost and constraint compares to existing methods.

Weaknesses

- Many details in the experiments section are not very clear (see questions).
- It is not clear which parts of the proposed method are claimed as novel contributions and which ones are taken from existing methods. For example, are the concepts of the target loss and value gradients claimed to be novel, or were they proposed in previous works and then adapted to the current setting?
- The experiments on multi-agent MuJoCo seem to only compare the reward (Figure 2). It is not clear how the differen


Taxonomy

Topics: Reinforcement Learning in Robotics · Adaptive Dynamic Programming Control · Adversarial Robustness in Machine Learning