Think Twice, Act Once: Verifier-Guided Action Selection For Embodied Agents
Nishad Singhi, Christian Bialas, Snehal Jauhri, Vignesh Prasad, Georgia Chalvatzaki, Marcus Rohrbach, Anna Rohrbach

TL;DR
This paper introduces VeGAS, a verifier-guided framework that makes multimodal large language model (MLLM)-based embodied agents more robust by explicitly verifying and selecting actions at inference time.
Contribution
VeGAS is a test-time approach that uses a generative verifier, trained with synthesized data, to improve agent robustness without modifying the underlying policy.
Findings
VeGAS achieves up to 36% relative performance improvement on challenging tasks.
Using an off-the-shelf MLLM as the verifier yields no improvement, which motivates data synthesis for the verifier.
VeGAS consistently improves generalization across the Habitat and ALFRED benchmarks.
Abstract
Building generalist embodied agents capable of solving complex real-world tasks remains a fundamental challenge in AI. Multimodal Large Language Models (MLLMs) have significantly advanced the reasoning capabilities of such agents through strong vision-language knowledge and chain-of-thought (CoT) reasoning, yet they remain brittle when faced with challenging out-of-distribution scenarios. To address this, we propose Verifier-Guided Action Selection (VeGAS), a test-time framework designed to improve the robustness of MLLM-based embodied agents through an explicit verification step. At inference time, rather than committing to a single decoded action, VeGAS samples an ensemble of candidate actions and uses a generative verifier to identify the most reliable choice, without modifying the underlying policy. Crucially, we find that using an off-the-shelf MLLM as a verifier yields no improvement,…
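The selection step described in the abstract (sample several candidate actions, score each with a verifier, act on the best one) can be illustrated with a minimal Python sketch. This is not the authors' implementation: `select_action`, `toy_policy`, `toy_verifier`, the example actions, and the scores are all hypothetical stand-ins, and the sketch assumes the verifier can be reduced to a function returning one scalar score per candidate.

```python
import itertools
from typing import Callable, List


def select_action(
    observation: str,
    sample_action: Callable[[str], str],
    score_action: Callable[[str, str], float],
    num_candidates: int = 5,
) -> str:
    """Sample candidates from a policy and return the one the verifier scores highest.

    Both callables are assumptions made for illustration: `sample_action` stands in
    for stochastic decoding from the policy, `score_action` for the verifier.
    """
    # Draw an ensemble of candidate actions instead of committing to one decode.
    candidates: List[str] = [sample_action(observation) for _ in range(num_candidates)]
    # Deduplicate while preserving order so each distinct action is scored once.
    unique = list(dict.fromkeys(candidates))
    # Score every distinct candidate; the policy itself is never modified.
    scores = [score_action(observation, action) for action in unique]
    best_index = max(range(len(unique)), key=lambda i: scores[i])
    return unique[best_index]


# Toy stand-ins (hypothetical): a "policy" that cycles through fixed actions
# and a "verifier" that looks up a hand-written score.
_toy_actions = itertools.cycle(["open fridge", "pick up mug", "go to sink"])
_toy_scores = {"open fridge": 0.2, "pick up mug": 0.9, "go to sink": 0.4}


def toy_policy(observation: str) -> str:
    return next(_toy_actions)


def toy_verifier(observation: str, action: str) -> float:
    return _toy_scores[action]


chosen = select_action("task: bring me a mug", toy_policy, toy_verifier, num_candidates=3)
print(chosen)
```

Because the policy is only ever called through `sample_action`, this kind of wrapper leaves its weights untouched, which matches the test-time framing of the abstract. How the verifier is trained and prompted is outside what this sketch covers.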
