Semantic-Enriched Latent Visual Reasoning
Tianrun Xu, Yue Sun, Qixun Wang, Jingyi Lu, Yuan Wang, Tianren Zhang, Longteng Guo, Fengyun Rao, Jing Lyu, Feng Chen, Jing Liu

TL;DR
This paper introduces SLVR, a two-stage framework that enriches latent visual representations with semantic attributes and aligns them with reasoning tasks, enhancing robustness and semantic consistency.
Contribution
SLVR is the first framework to incorporate attribute-level semantics into latent visual reasoning and align representations across multiple queries.
Findings
SLVR improves the robustness of latent visual reasoning.
Enriched semantics lead to better reasoning consistency.
The authors construct large-scale attribute and QA datasets for training and evaluation.
Abstract
Multimodal latent-space reasoning aims to replace explicit thinking with images by performing visual reasoning directly in a compact latent space. However, existing approaches largely rely on visual supervision and produce latent representations that lack sufficient semantic richness, limiting their ability to support diverse region-level reasoning tasks. In this work, we introduce Semantic-Enriched Latent Visual Reasoning (SLVR), a two-stage learning framework that enriches latent representations with attribute-level visual semantics and aligns them with diverse reasoning objectives. In the first stage, SLVR learns semantically enriched region-centric latents under fine-grained attribute supervision. In the second stage, we design Multi-query Group Relative Policy Optimization (M-GRPO) to align latent representations across multiple queries grounded in the same region. To support this…
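The abstract names Multi-query Group Relative Policy Optimization (M-GRPO) but does not give its details here. As a rough illustration only, the sketch below shows the group-relative advantage normalization commonly associated with GRPO, applied separately to each query that refers to the same region. The function names, the example queries, the reward values, and the choice to normalize per query are all assumptions made for illustration, not the authors' formulation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group: (r - mean) / (std + eps).

    Uses the population standard deviation; whether the paper uses
    population or sample std is not stated in this excerpt (assumption).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def multi_query_advantages(rewards_by_query):
    """Hypothetical multi-query variant: each query grounded in the same
    region gets its own group of sampled responses, and rewards are
    normalized within that group."""
    return {q: group_relative_advantages(rs) for q, rs in rewards_by_query.items()}


# Made-up rewards for four sampled responses per query (illustrative only).
rewards_by_query = {
    "what color is the object?": [1.0, 0.0, 1.0, 0.0],
    "what material is it made of?": [0.0, 0.0, 1.0, 0.0],
}
advantages = multi_query_advantages(rewards_by_query)
```

Within each query, responses that score above the group mean receive positive advantages and those below receive negative ones, so no separate value model is needed to produce a baseline. How M-GRPO actually combines signals across queries may differ from this per-query normalization.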
