Macro-Action Based Multi-Agent Instruction Following through Value Cancellation
Wo Wei Lin, Ethan Rathbun, Enrico Marchesini, Xiang Zhi Tan

TL;DR
This paper introduces MAVIC, a method for improving multi-agent reinforcement learning by correcting value estimates at instruction boundaries, ensuring better compliance with natural language instructions in complex environments.
Contribution
MAVIC is a novel value correction technique that maintains consistent value estimates during instruction interruptions, enhancing instruction compliance in multi-agent RL.
Findings
MAVIC achieves high instruction compliance in multi-agent environments.
MAVIC preserves base task performance despite instruction switching.
Theoretical analysis supports MAVIC's effectiveness.
Abstract
Multi-agent reinforcement learning (MARL) in real-world use cases may need to adapt to external natural language instructions that interrupt ongoing behavior and conflict with long-horizon objectives. However, conditioning rewards on instructions introduces a fundamental failure mode as Bellman updates couple value estimates across instruction contexts, leading to inconsistent values when instructions interrupt macro-actions. We propose Macro-Action Value Correction for Instruction Compliance (MAVIC), which corrects Bellman backups at instruction boundaries by correcting the incoming instruction objective and restoring the continuation value under the current objective. Unlike reward shaping, MAVIC modifies the bootstrapping target itself, enabling consistent value estimation under stochastic instruction switching within a unified policy. We provide theoretical analysis and an…
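To make the abstract's distinction between reward shaping and changing the bootstrapping target concrete, here is a minimal tabular sketch. It is not the paper's update rule: it assumes one reading of "restoring the continuation value under the current objective", namely that a transition crossing an instruction boundary bootstraps from the value table of the instruction that was active during the macro-action rather than from the incoming one. The table layout `q[instruction, state, action]`, the function names, and the macro-action duration `tau` are all hypothetical.

```python
import numpy as np


def naive_target(q, reward, next_state, next_instr, gamma, tau, done):
    """Standard macro-action (SMDP-style) target.

    Bootstraps from whichever instruction is active in the next state, so a
    switch couples the value estimates of two instruction contexts.
    """
    if done:
        return reward
    return reward + gamma ** tau * q[next_instr, next_state].max()


def boundary_corrected_target(q, reward, next_state, cur_instr, gamma, tau, done):
    """Assumed boundary correction (illustrative, not the authors' formula).

    Bootstraps from the table of the instruction that was active while the
    macro-action ran, ignoring the incoming instruction.
    """
    if done:
        return reward
    return reward + gamma ** tau * q[cur_instr, next_state].max()


# Toy table: 2 instructions, 3 states, 2 macro-actions.
q = np.zeros((2, 3, 2))
q[0, 1] = [1.0, 2.0]   # values in state 1 under instruction 0
q[1, 1] = [10.0, 0.0]  # values in state 1 under instruction 1

# A macro-action of duration 2 runs under instruction 0; instruction 1 arrives
# at the boundary.
naive = naive_target(q, reward=0.5, next_state=1, next_instr=1,
                     gamma=0.9, tau=2, done=False)
corrected = boundary_corrected_target(q, reward=0.5, next_state=1, cur_instr=0,
                                      gamma=0.9, tau=2, done=False)
print(naive, corrected)
```

In this toy case the naive target pulls the large instruction-1 value into the instruction-0 estimate, while the corrected target stays within instruction 0's own values. The reward is identical in both calls; only the bootstrap term differs.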
