Multimodal Latent Reasoning via Predictive Embeddings
Ashutosh Adhikari, Mirella Lapata

TL;DR
Pearl is a latent-space reasoning framework for visual language models (VLMs) that learns from expert tool-use trajectories and requires no explicit tool invocation at inference time, reducing inference overhead while supporting multi-step reasoning.
Contribution
The paper presents Pearl, a JEPA-inspired method that learns predictive embeddings in latent space, removing the need for explicit tool calls at inference time while supporting multi-step reasoning in multimodal tasks.
Findings
Pearl matches or outperforms supervised fine-tuning and reconstruction-based methods.
Reconstruction-based methods mainly learn embeddings in latent space rather than image edits.
Pearl supports multiple tool calls in reasoning trajectories.
Abstract
Tool-augmented multimodal reasoning enables visual language models (VLMs) to improve perception by interacting with external tools (e.g., cropping, depth estimation). However, such approaches incur substantial inference overhead, require specialized supervision, and are prone to erroneous tool calls. We propose Pearl (Predictive Embedding Alignment for Reasoning in Latent space), a JEPA-inspired framework that learns from expert tool-use trajectories entirely in the latent space, eliminating the need for explicit tool invocation at inference time. Unlike reconstruction-based latent reasoning methods, which autoregressively generate latent tokens and suffer from training-inference mismatch and limited support for multi-step tool use, Pearl directly learns predictive embeddings from multimodal trajectories while preserving the standard vision-language generation pipeline: it is…
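The idea of aligning predicted embeddings with target embeddings, common to JEPA-style objectives, can be illustrated with a toy sketch. This is not Pearl's actual loss or architecture, which the truncated abstract does not specify. The linear predictor, the cosine-distance loss, and every name below (`linear_predictor`, `predictive_alignment_loss`, the toy vectors) are assumptions made purely for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def linear_predictor(weights, context_emb):
    """Hypothetical predictor: a plain matrix-vector product."""
    return [sum(w * c for w, c in zip(row, context_emb)) for row in weights]


def predictive_alignment_loss(weights, context_emb, target_emb):
    """Cosine distance between the predicted and the target embedding.

    Assumption: the target embedding is treated as a fixed constant,
    the way a stop-gradient target would be in a JEPA-style setup.
    """
    predicted = linear_predictor(weights, context_emb)
    return 1.0 - cosine_similarity(predicted, target_emb)


# Toy data: the context embedding stands in for the model's state before a
# tool step; the target stands in for the embedding of the tool's output.
context = [1.0, 0.0]
target = [0.0, 1.0]

identity = [[1.0, 0.0], [0.0, 1.0]]   # predictor that leaves the input unchanged
rotation = [[0.0, -1.0], [1.0, 0.0]]  # predictor that maps context onto target

loss_identity = predictive_alignment_loss(identity, context, target)
loss_rotation = predictive_alignment_loss(rotation, context, target)
print(loss_identity, loss_rotation)
```

In this toy setting the loss is high when the predictor leaves the context embedding orthogonal to the target and drops to zero when the predictor maps it onto the target. No external tool is executed at any point: only embeddings are compared.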
