Semantic-Enriched Latent Visual Reasoning

Tianrun Xu; Yue Sun; Qixun Wang; Jingyi Lu; Yuan Wang; Tianren Zhang; Longteng Guo; Fengyun Rao; Jing Lyu; Feng Chen; Jing Liu

arXiv:2605.19342·cs.CV·May 20, 2026

TL;DR

This paper introduces SLVR, a two-stage framework that enriches latent visual representations with semantic attributes and aligns them with reasoning tasks, enhancing robustness and semantic consistency.

Contribution

According to the authors, SLVR is the first framework to incorporate attribute-level semantics into latent visual reasoning and to align latent representations across multiple queries grounded in the same region.

Findings

01

SLVR improves robustness of latent visual reasoning.

02

Enriched semantics lead to better reasoning consistency.

03

The authors construct large-scale attribute and QA datasets for training and evaluation.

Abstract

Multimodal latent-space reasoning aims to replace explicit thinking with images by performing visual reasoning directly in a compact latent space. However, existing approaches largely rely on visual supervision and produce latent representations that lack sufficient semantic richness, limiting their ability to support diverse region-level reasoning tasks. In this work, we introduce Semantic-Enriched Latent Visual Reasoning (SLVR), a two-stage learning framework that enriches latent representations with attribute-level visual semantics and aligns them with diverse reasoning objectives. In the first stage, SLVR learns semantically enriched region-centric latents under fine-grained attribute supervision. In the second stage, we design Multi-query Group Relative Policy Optimization (M-GRPO) to align latent representations across multiple queries grounded in the same region. To support this…
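The abstract names M-GRPO but does not give its objective, so the following is only a minimal sketch of the group-relative advantage step that GRPO-style methods share, plus one hypothetical way to extend it to several queries grounded in the same region. The function names, the pooling rule, and the reward values are assumptions made for illustration and are not taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style normalization: center and scale rewards within one group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


def multi_query_advantages(rewards_per_query, eps=1e-8):
    """Hypothetical multi-query variant (an assumption, not the paper's rule).

    Rewards from all queries that refer to the same region are pooled into
    one group before normalization, so every sampled response is scored
    relative to the whole set of queries for that region.
    """
    pooled = [r for query_rewards in rewards_per_query for r in query_rewards]
    mu = mean(pooled)
    sigma = pstdev(pooled)
    return [
        [(r - mu) / (sigma + eps) for r in query_rewards]
        for query_rewards in rewards_per_query
    ]


# Illustrative rewards: two queries about one region, two samples each.
example_rewards = [[1.0, 0.0], [1.0, 1.0]]
example_advantages = multi_query_advantages(example_rewards)
```

Pooling across queries is one possible reading of "align latent representations across multiple queries"; normalizing within each query separately and then combining the losses would be an equally plausible design, and the paper itself should be consulted for the actual formulation.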
