PRISM: Perception Reasoning Interleaved for Sequential Decision Making

Mohamed Salim Aissi; Clemence Grislain; Clement Romac; Laure Soulier; Mohamed Chetouani; Olivier Sigaud; Nicolas Thome

arXiv:2605.05407·cs.AI·May 8, 2026

TL;DR

PRISM is a dynamic perception-reasoning framework for multimodal decision-making: the LLM critiques and probes the vision model instead of passively accepting its output, which yields significant performance improvements on embodied-agent benchmarks.

Contribution

It presents a novel closed-loop interaction between vision and language models, improving scene understanding without handcrafted questions or answers.

Findings

1. PRISM outperforms state-of-the-art image-based models on the ALFWorld and Room-to-Room (R2R) benchmarks.

2. The interactive, goal-oriented perception pipeline yields systematic performance gains.

3. PRISM operates fully automatically, without manual question design.

Abstract

Scaling LLM-based embodied agents from text-only environments to complex multimodal settings remains a major challenge. Recent work identifies a perception-reasoning-decision gap in standalone Vision-Language Models (VLMs), which often overlook task-critical information. In this paper, we introduce PRISM, a framework that tightly couples perception (VLM) and decision (LLM) through a dynamic question-answer (DQA) pipeline. Instead of passively accepting the VLM's description, the LLM critiques it, probes the VLM with goal-oriented questions, and synthesizes a compact image description. This closed-loop interaction yields a sharp, task-driven understanding of the scene. We evaluate PRISM on the ALFWorld and Room-to-Room (R2R) benchmarks. We show that: (1) PRISM significantly outperforms state-of-the-art image-based models, (2) our interactive goal-oriented perception pipeline yields…
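The critique, probe, and synthesize loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `dqa_describe`, the prompt wording, the `max_rounds` cap, and the `NONE` stop signal are all assumptions made for illustration, and `vlm` and `llm` are placeholder callables standing in for whatever models are used.

```python
from typing import Callable, List, Tuple


def dqa_describe(
    image: object,
    goal: str,
    vlm: Callable[[object, str], str],  # hypothetical: (image, prompt) -> answer
    llm: Callable[[str], str],          # hypothetical: prompt -> completion
    max_rounds: int = 2,                # assumed cap on critique rounds
) -> str:
    """Illustrative closed-loop perception: describe, critique, probe, synthesize."""
    # Step 1: the VLM gives an initial, goal-agnostic description.
    description = vlm(image, "Describe the scene.")
    qa_pairs: List[Tuple[str, str]] = []

    for _ in range(max_rounds):
        # Step 2: the LLM critiques what is known so far and proposes
        # goal-oriented questions (assumed format: one per line, or NONE).
        known = "\n".join(f"Q: {q}\nA: {a}" for q, a in qa_pairs)
        critique = llm(
            f"Goal: {goal}\nInitial description: {description}\n{known}\n"
            "List questions about missing task-critical details, "
            "one per line, or reply NONE."
        )
        if critique.strip().upper() == "NONE":
            break
        questions = [q.strip() for q in critique.splitlines() if q.strip()]

        # Step 3: each question is sent back to the VLM as a probe.
        for question in questions:
            qa_pairs.append((question, vlm(image, question)))

    # Step 4: the LLM condenses everything into a compact description.
    notes = "\n".join(f"Q: {q}\nA: {a}" for q, a in qa_pairs)
    return llm(
        f"Goal: {goal}\nInitial description: {description}\n{notes}\n"
        "Write a compact, task-focused scene description."
    )
```

Passing the models in as plain callables keeps the sketch independent of any particular VLM or LLM API; the returned description would then be given to the decision-making LLM in place of the raw VLM caption.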
