Autoregressive Semantic Visual Reconstruction Helps VLMs Understand Better
Dianyi Wang, Wei Song, Yikun Wang, Siyuan Wang, Kaicheng Yu, Zhongyu Wei, Jiaqi Wang

TL;DR
This paper introduces Autoregressive Semantic Visual Reconstruction (ASVR), a method that adds autoregressive supervision on semantic visual representations so that vision-language models learn jointly from the visual and textual modalities, leading to improved multimodal understanding.
Contribution
The paper proposes ASVR, a novel autoregressive training approach that reconstructs semantic image representations, significantly improving multimodal understanding in vision-language models.
Findings
Autoregressive semantic reconstruction improves model performance.
Reconstructing semantic features outperforms raw visual reconstruction.
ASVR achieves a 5% gain on average across 14 benchmarks.
Abstract
Typical large vision-language models (LVLMs) apply autoregressive supervision solely to textual sequences, without fully incorporating the visual modality into the learning process. This results in three key limitations: (1) an inability to utilize images without accompanying captions, (2) the risk that captions omit critical visual details, and (3) the challenge that certain vision-centric content cannot be adequately conveyed through text. As a result, current LVLMs often prioritize vision-to-language alignment while potentially overlooking fine-grained visual information. While some prior works have explored autoregressive image generation, effectively leveraging autoregressive visual supervision to enhance image understanding remains an open challenge. In this paper, we introduce Autoregressive Semantic Visual Reconstruction (ASVR), which enables joint learning of visual and textual…
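The contrast the abstract draws, text-only autoregressive supervision versus joint supervision over text and visual positions, can be illustrated with a minimal sketch. It assumes the visual targets are discrete semantic token ids drawn from a tokenizer codebook and that the two next-token losses are simply added with a weight. The function names, the `visual_weight` parameter, and the additive form are hypothetical choices for illustration, not the authors' implementation.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def next_token_loss(logits_per_position, targets, mask):
    """Mean cross-entropy over the positions where mask is True.

    logits_per_position[t] holds the logits used to predict targets[t].
    """
    total, count = 0.0, 0
    for logits, target, keep in zip(logits_per_position, targets, mask):
        if keep:
            total -= log_softmax(logits)[target]
            count += 1
    return total / count if count else 0.0


def joint_loss(text_logits, text_targets, text_mask,
               vis_logits, vis_targets, vis_mask, visual_weight=1.0):
    """Text loss plus a weighted visual loss (assumed additive form).

    vis_targets are assumed to be discrete semantic token ids; the
    weighting scheme is an illustrative assumption.
    """
    l_text = next_token_loss(text_logits, text_targets, text_mask)
    l_vis = next_token_loss(vis_logits, vis_targets, vis_mask)
    return l_text + visual_weight * l_vis, l_text, l_vis


# Toy usage: uniform logits over a 4-word text vocabulary and an
# 8-entry visual codebook (both sizes are made up for the example).
text_logits = [[0.0] * 4 for _ in range(3)]
vis_logits = [[0.0] * 8 for _ in range(2)]
total, l_text, l_vis = joint_loss(
    text_logits, [0, 1, 2], [True, True, False],  # last text position masked out
    vis_logits, [5, 7], [True, True],
    visual_weight=0.5,
)
```

With uniform logits each loss reduces to the log of its vocabulary size, which gives a quick sanity check. Setting `visual_weight=0.0` recovers the text-only supervision that the abstract describes as typical for LVLMs.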
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
- The paper is for the most part accessible and well-written. The main design choices are clearly presented and motivated.
- The integration between visual and textual data in VLMs is a timely topic, and clearly an open research issue. Methods that allow better information flow from images to text are particularly valuable.
- **Relation with DualToken [a].** The method presented in this paper has a strong link with the DualToken model, and the authors do not provide sufficient information to correctly position it with respect to such previous work. Specifically, the DualToken paper proposes an approach to train a VLM by applying next-token prediction on image patches, both using semantic and pixel tokens as target labels. This seems extremely related to the ASVR recipe proposed in the present manuscript, hence a cl
• The paper is well-motivated and provides a clean and intuitive extension of autoregressive supervision from text to vision. ASVR is simple and integrates naturally into existing LVLM architectures.
• Experiments are relatively comprehensive, covering different visual tokenizers, visual encoders, LLM backbones, and a variety of multimodal benchmarks. Results on LLaVA-1.5 show clear improvements.
• The paper lacks intuitive qualitative examples beyond the attention maps shown in Appendix A.2. For example, visualizations from tasks like TextVQA could better illustrate how ASVR improves visual modeling and grounding.
• The experiments focus mainly on LLaVA-1.5. It would strengthen the paper to evaluate on additional LVLM architectures (e.g., LLaVA-OV or video-based models) to confirm generality. Demonstrating consistent improvements across other model families or modalities (e.g., video) w
1. Originality. ASVR proposes a new paradigm for visual supervision in LVLMs by training on discrete semantic tokens (e.g., via DualToken) within an autoregressive objective, departing from pixel-level or diffusion-based reconstruction and moving beyond text-only pretraining.
2. Quality. The evaluation is comprehensive: 14 benchmarks, multiple backbones (Vicuna-7B/13B, Mistral-7B), and data scales from 665K to 3.5M samples. Ablations are convincing; for example, dual-stage training yields ~6% gai
1. Baseline coverage. Comparisons beyond LLaVA and ROSS are needed. Include strong, recent LVLMs (e.g., Qwen-VL, InternVL) under matched settings to substantiate superiority with both zero-shot and fine-tuned results.
2. Operational costs. The compute overhead of semantic tokenization and autoregressive visual decoding is unreported. Quantify training and inference latency, memory footprint, throughput, and FLOPs across resolutions/sequence lengths, with and without tokenizer caching.
3. Tokeniz
- **Timely and relevant contribution.** The paper addresses a key open problem in multimodal learning: how to achieve genuine, integrated visual–language understanding in large models.
- **Practical plug-in for existing LVLMs.** ASVR can be added to standard LLaVA-style training pipelines with minimal engineering effort and yields consistent, if modest, gains across benchmarks and resolutions.
- The introduction emphasizes ASVR as a mechanism to leverage unlabeled or image-only data, arguing that autoregressive semantic reconstruction can replace textual supervision when captions are missing. Yet all reported experiments, including both pretraining and instruction-tuning, rely exclusively on captioned or instruction datasets (LLaVA-1.5, Bunny, LLaVA-OV, etc.), and no experiment incorporates genuinely unlabeled images. Consequently, the central motivation is untested: the method’s bene
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning
