Think Twice, Act Once: Verifier-Guided Action Selection For Embodied Agents

Nishad Singhi; Christian Bialas; Snehal Jauhri; Vignesh Prasad; Georgia Chalvatzaki; Marcus Rohrbach; Anna Rohrbach

arXiv:2605.12620·cs.AI·May 14, 2026

TL;DR

This paper introduces VegAS, a verifier-guided framework that enhances the robustness of multimodal large language model-based embodied agents by explicitly verifying and selecting actions during inference.

Contribution

VegAS is a novel test-time approach that uses a generative verifier and data synthesis to improve agent robustness without altering the core policy.

Findings

01 VegAS achieves up to a 36% relative performance improvement on challenging tasks.

02 Using an off-the-shelf MLLM as the verifier yields no improvement, motivating data synthesis.

03 VegAS consistently improves generalization across the Habitat and ALFRED benchmarks.

Abstract

Building generalist embodied agents capable of solving complex real-world tasks remains a fundamental challenge in AI. Multimodal Large Language Models (MLLMs) have significantly advanced the reasoning capabilities of such agents through strong vision-language knowledge and chain-of-thought (CoT) reasoning, yet remain brittle when faced with challenging out-of-distribution scenarios. To address this, we propose Verifier-Guided Action Selection (VegAS), a test-time framework designed to improve the robustness of MLLM-based embodied agents through an explicit verification step. At inference time, rather than committing to a single decoded action, VegAS samples an ensemble of candidate actions and uses a generative verifier to identify the most reliable choice, without modifying the underlying policy. Crucially, we find that using an off-the-shelf MLLM as a verifier yields no improvement,…
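The selection step described in the abstract (sample several candidate actions, then let a verifier pick one while the policy stays frozen) can be illustrated with a minimal sketch. This is not the authors' implementation: the names `select_action`, `sample_action`, `score_action`, and `num_candidates` are hypothetical, and the verifier is assumed to reduce its judgment to one scalar score per candidate, which the abstract does not specify.

```python
from typing import Callable


def select_action(
    sample_action: Callable[[], str],
    score_action: Callable[[str], float],
    num_candidates: int = 5,
) -> str:
    """Sample candidate actions from a frozen policy and return the one
    the verifier scores highest.

    sample_action: hypothetical stand-in for one stochastic decode of the policy.
    score_action: hypothetical stand-in for the verifier (assumed scalar score).
    """
    if num_candidates < 1:
        raise ValueError("num_candidates must be at least 1")

    # Draw an ensemble of candidates instead of committing to one decode.
    candidates = [sample_action() for _ in range(num_candidates)]

    # Deduplicate while preserving sampling order, so that each distinct
    # action is sent to the verifier only once.
    unique_candidates = list(dict.fromkeys(candidates))

    # Score every distinct candidate; the policy itself is never modified.
    scores = {action: score_action(action) for action in unique_candidates}

    # max() returns the first maximal item, so ties go to the action
    # that was sampled earliest.
    return max(unique_candidates, key=lambda action: scores[action])
```

In this sketch the policy and the verifier are passed in as plain callables, so either one can be swapped out without touching the selection logic.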
