Multimodal Latent Reasoning via Predictive Embeddings

Ashutosh Adhikari; Mirella Lapata

arXiv:2604.08065·cs.LG·April 10, 2026

TL;DR

Pearl is a latent-space reasoning framework for visual language models that learns from tool-use trajectories without invoking tools explicitly at inference time, improving efficiency and multi-step reasoning.

Contribution

The paper presents Pearl, a latent-space predictive embedding method that removes the need for explicit tool calls at inference time and supports multi-step reasoning in multimodal tasks.

Findings

1. Pearl matches or outperforms supervised fine-tuning and reconstruction-based methods.

2. Reconstruction-based methods mainly learn embeddings, not image edits, in latent space.

3. Pearl supports multiple tool calls in reasoning trajectories.

Abstract

Tool-augmented multimodal reasoning enables visual language models (VLMs) to improve perception by interacting with external tools (e.g., cropping, depth estimation). However, such approaches incur substantial inference overhead, require specialized supervision, and are prone to erroneous tool calls. We propose Pearl (Predictive Embedding Alignment for Reasoning in Latent space), a JEPA-inspired framework that learns from expert tool-use trajectories entirely in the latent space, eliminating the need for explicit tool invocation at inference time. Unlike reconstruction-based latent reasoning methods, which autoregressively generate latent tokens and suffer from training-inference mismatch and limited support for multi-step tool use, Pearl directly learns predictive embeddings from multimodal trajectories while preserving the standard vision-language generation pipeline: it is…
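
To make the idea of a JEPA-style predictive embedding objective more concrete, the following is a minimal sketch in Python with NumPy. It is not Pearl's actual training objective, which the truncated abstract does not specify. It only illustrates the general pattern of predicting a target embedding from a context embedding and scoring the prediction in latent space, with no pixels or tokens reconstructed. The linear predictor `W`, the cosine-distance loss, the function names, and all shapes are assumptions chosen for illustration.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def predictive_embedding_loss(context_emb, target_emb, W):
    """Mean cosine distance between predicted and target embeddings.

    context_emb: (batch, d) hypothetical embeddings of the image and question.
    target_emb:  (batch, d) hypothetical embeddings of the tool-augmented input,
                 e.g. the same question paired with a cropped image.
    W:           (d, d) hypothetical linear predictor (an assumption; a real
                 predictor would typically be a small network).
    """
    pred = l2_normalize(context_emb @ W)
    # In JEPA-style training the target is usually treated as fixed
    # (no gradient flows through it); here it is simply a constant array.
    tgt = l2_normalize(target_emb)
    return float(np.mean(1.0 - np.sum(pred * tgt, axis=-1)))


# Toy data: targets are an exact linear function of the context,
# so the true map W_true should reach a loss of roughly zero.
rng = np.random.default_rng(0)
context = rng.normal(size=(4, 8))
W_true = rng.normal(size=(8, 8))
target = context @ W_true

loss_perfect = predictive_embedding_loss(context, target, W_true)
loss_random = predictive_embedding_loss(context, target, rng.normal(size=(8, 8)))
print(f"perfect predictor: {loss_perfect:.6f}")
print(f"random predictor:  {loss_random:.6f}")
```

In this toy setup the loss is computed entirely between embeddings, which is the property the abstract contrasts with reconstruction-based methods that autoregressively generate latent tokens.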
