LanteRn: Latent Visual Structured Reasoning
André G. Viveiros, Nuno Gonçalves, Matthias Lindemann, André Martins

TL;DR
LanteRn introduces a novel framework enabling large multimodal models to perform visual reasoning directly in a compact latent space, improving efficiency and accuracy in perception-centric tasks.
Contribution
It proposes a new latent space reasoning approach that integrates visual embeddings into language models, reducing reliance on external modules and pixel-space computation.
Findings
Improves visual grounding and reasoning on perception benchmarks.
Demonstrates the effectiveness of latent representations for multimodal reasoning.
Achieves consistent performance gains across multiple perception-centric tasks.
Abstract
While language reasoning models excel in many tasks, visual reasoning remains challenging for current large multimodal models (LMMs). As a result, most LMMs default to verbalizing perceptual content into text, a strong limitation for tasks requiring fine-grained spatial and visual understanding. While recent approaches take steps toward thinking with images by invoking tools or generating intermediate images, they either rely on external modules, or incur unnecessary computation by reasoning directly in pixel space. In this paper, we introduce LanteRn, a framework that enables LMMs to interleave language with compact latent visual representations, allowing visual reasoning to occur directly in latent space. LanteRn augments a vision-language transformer with the ability to generate and attend to continuous visual thought embeddings during inference. We train the model in two stages:…
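The interleaving idea in the abstract, where a decoder emits either a text token or a continuous "visual thought" embedding that is fed back into its own context, can be illustrated with a toy loop. This is a minimal sketch under stated assumptions, not the authors' implementation: the names `decode`, `embed_token`, `mean_pool`, and the `LATENT` marker are hypothetical, the scripted `plan` stands in for the model's own decisions, and mean pooling stands in for whatever hidden state a real transformer would produce.

```python
from typing import List, Tuple

Vector = List[float]
LATENT = "<latent>"  # hypothetical marker: "emit a latent visual thought here"


def embed_token(token: str) -> Vector:
    """Toy 2-d token embedding (assumption, not a real embedding table)."""
    return [float(len(token)), float(ord(token[0]) % 5)]


def mean_pool(context: List[Vector]) -> Vector:
    """Stand-in for a transformer hidden state: average of all context vectors."""
    dim = len(context[0])
    return [sum(v[i] for v in context) / len(context) for i in range(dim)]


def decode(image_embeddings: List[Vector],
           plan: List[str]) -> Tuple[List[str], List[Vector], List[Vector]]:
    """Interleave text tokens and latent embeddings in one context.

    Returns (visible text tokens, latent thoughts, full context).
    """
    context = [list(v) for v in image_embeddings]  # image features come first
    text: List[str] = []
    thoughts: List[Vector] = []
    for item in plan:
        if item == LATENT:
            # The latent step is never decoded to text: the continuous
            # vector is appended directly so later steps can attend to it.
            thought = mean_pool(context)
            context.append(thought)
            thoughts.append(thought)
        else:
            # Ordinary text step: the token is emitted and its embedding
            # is appended to the context.
            context.append(embed_token(item))
            text.append(item)
    return text, thoughts, context


text, thoughts, context = decode(
    image_embeddings=[[1.0, 0.0], [0.0, 1.0]],
    plan=["look", LATENT, "cup"],
)
print(text)          # visible output contains only the text tokens
print(len(context))  # image vectors + text embeddings + latent thoughts
```

The point of the sketch is only the control flow: latent steps extend the context with vectors rather than tokens, so they cost one position each and never pass through the vocabulary.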
Taxonomy
Topics: Multimodal Machine Learning Applications · Visual Attention and Saliency Detection · Explainable Artificial Intelligence (XAI)
