TL;DR
CoMaTrack introduces a multi-agent reinforcement learning framework for embodied visual tracking, enhancing robustness and adaptability through competitive scenarios and providing a new benchmark for evaluation.
Contribution
The paper presents a novel game-theoretic multi-agent training framework (CoMaTrack) and an open-source benchmark (CoMaTrack-Bench) for language-conditioned embodied visual tracking.
Findings
CoMaTrack achieves state-of-the-art results on standard benchmarks and on CoMaTrack-Bench.
A 3B VLM trained with CoMaTrack surpasses previous models on EVT-Bench.
The benchmark enables standardized robustness evaluation under adversarial interactions.
Abstract
Embodied Visual Tracking (EVT), a core dynamic task in embodied intelligence, requires an agent to precisely follow a language-specified target. Yet most existing methods rely on single-agent imitation learning, suffering from costly expert data and limited generalization due to static training environments. Inspired by competition-driven capability evolution, we propose CoMaTrack, a competitive game-theoretic multi-agent reinforcement learning framework that trains agents in a dynamic adversarial setting with competitive subtasks, yielding stronger adaptive planning and interference-resilient strategies. We further introduce CoMaTrack-Bench, the first open-source Habitat-based benchmark protocol and episode set for language-conditioned competitive EVT, featuring dynamic dueling game scenarios between a tracker and adaptive opponents across diverse environments and…
