PRISM: Perception Reasoning Interleaved for Sequential Decision Making
Mohamed Salim Aissi, Clemence Grislain, Clement Romac, Laure Soulier, Mohamed Chetouani, Olivier Sigaud, Nicolas Thome

TL;DR
PRISM is a dynamic perception-reasoning framework in which an LLM critiques and probes a vision-language model, improving multimodal decision-making and yielding significant performance gains on embodied-agent benchmarks.
Contribution
PRISM presents a novel closed-loop interaction between vision and language models that improves scene understanding without handcrafted questions or answers.
Findings
PRISM outperforms state-of-the-art image-based models on the ALFWorld and Room-to-Room (R2R) benchmarks.
The interactive, goal-oriented perception pipeline yields systematic performance gains.
PRISM operates fully automatically, without manual question design.
Abstract
Scaling LLM-based embodied agents from text-only environments to complex multimodal settings remains a major challenge. Recent work identifies a perception-reasoning-decision gap in standalone Vision-Language Models (VLMs), which often overlook task-critical information. In this paper, we introduce PRISM, a framework that tightly couples perception (VLM) and decision (LLM) through a dynamic question-answer (DQA) pipeline. Instead of passively accepting the VLM's description, the LLM critiques it, probes the VLM with goal-oriented questions, and synthesizes a compact image description. This closed-loop interaction yields a sharp, task-driven understanding of the scene. We evaluate PRISM on the ALFWorld and Room-to-Room (R2R) benchmarks. We show that: (1) PRISM significantly outperforms state-of-the-art image-based models, (2) our interactive goal-oriented perception pipeline yields…
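The critique, probe, and synthesize steps described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `dqa_describe`, the `VLM` and `LLM` callable signatures, the prompt wording, the single-round structure, and the `max_questions` cap are all assumptions made for illustration.

```python
from typing import Callable, List, Tuple

# Hypothetical interfaces (assumptions, not taken from the paper):
# a VLM maps (image, prompt) to text, an LLM maps a prompt to text.
VLM = Callable[[str, str], str]
LLM = Callable[[str], str]


def dqa_describe(image: str, goal: str, vlm: VLM, llm: LLM,
                 max_questions: int = 3) -> str:
    """Run one illustrative critique-probe-synthesize round."""
    # Step 0: the VLM produces an initial, goal-agnostic description.
    draft = vlm(image, "Describe the scene.")

    # Step 1 (critique): the LLM reviews the draft against the goal and
    # returns follow-up questions, assumed here to be one per line.
    critique = llm(
        "CRITIQUE\n"
        f"Goal: {goal}\nDescription: {draft}\n"
        "List missing task-relevant details as questions, one per line."
    )
    questions: List[str] = [
        q.strip() for q in critique.splitlines() if q.strip()
    ][:max_questions]

    # Step 2 (probe): the VLM answers each goal-oriented question.
    qa_pairs: List[Tuple[str, str]] = [(q, vlm(image, q)) for q in questions]

    # Step 3 (synthesize): the LLM merges draft and answers into a
    # compact description for the downstream decision step.
    qa_text = "\n".join(f"Q: {q}\nA: {a}" for q, a in qa_pairs)
    return llm(
        "SYNTHESIZE\n"
        f"Goal: {goal}\nDescription: {draft}\n{qa_text}\n"
        "Write a compact, goal-relevant description."
    )
```

Passing the two models as plain callables keeps the sketch independent of any particular model API; any VLM or LLM client can be wrapped to match these signatures.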
