TL;DR
HyLaR is a hybrid reasoning framework that interleaves discrete text generation with continuous visual latent representations. It is trained with Decoupled Policy Optimization (DePO) to improve multimodal understanding.
Contribution
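The paper presents HyLaR (Hybrid Latent Reasoning), a hybrid latent reasoning framework, together with DePO (Decoupled Policy Optimization), which is applied after a cold-start supervised fine-tuning (SFT) stage. The authors report that it surpasses existing methods on multimodal reasoning.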
Findings
HyLaR outperforms standard MLLMs on perception and understanding benchmarks.
DePO effectively optimizes hybrid discrete-continuous action spaces.
The authors report that their experiments support HyLaR's stronger reasoning capabilities.
Abstract
Chain-of-Thought (CoT) reasoning significantly elevates the complex problem-solving capabilities of multimodal large language models (MLLMs). However, adapting CoT to vision typically discretizes signals to fit LLM inputs, causing early semantic collapse and discarding fine-grained details. While external tools can mitigate this, they introduce a rigid bottleneck, confining reasoning to predefined operations. Although recent latent reasoning paradigms internalize visual states to overcome these limitations, optimizing the resulting hybrid discrete-continuous action space remains challenging. In this work, we propose HyLaR (Hybrid Latent Reasoning), a framework that seamlessly interleaves discrete text generation with continuous visual latent representations. Specifically, following an initial cold-start supervised fine-tuning (SFT), we introduce DePO (Decoupled Policy Optimization) to…
