Reconstructive Visual Instruction Tuning
Haochen Wang, Anlin Zheng, Yucheng Zhao, Tiancai Wang, Zheng Ge, Xiangyu Zhang, Zhaoxiang Zhang

TL;DR
This paper presents ROSS, a novel visual instruction tuning method that trains large multimodal models to reconstruct input images, improving their fine-grained understanding and reducing hallucinations by leveraging vision-centric supervision signals.
Contribution
ROSS introduces a reconstructive supervision approach for LMMs that supervises the model's visual outputs by reconstructing the input image, rather than supervising text outputs alone, to strengthen visual understanding.
Findings
ROSS improves fine-grained visual comprehension.
It reduces hallucinations in multimodal models.
It achieves competitive performance with fewer visual experts.
Abstract
This paper introduces reconstructive visual instruction tuning (ROSS), a family of Large Multimodal Models (LMMs) that exploit vision-centric supervision signals. In contrast to conventional visual instruction tuning approaches that exclusively supervise text outputs, ROSS prompts LMMs to supervise visual outputs via reconstructing input images. By doing so, it capitalizes on the inherent richness and detail present within input images themselves, which are often lost in pure text supervision. However, producing meaningful feedback from natural images is challenging due to the heavy spatial redundancy of visual signals. To address this issue, ROSS employs a denoising objective to reconstruct latent representations of input images, avoiding directly regressing exact raw RGB values. This intrinsic activation design inherently encourages LMMs to maintain image detail, thereby enhancing…
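The denoising objective described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the noise-prediction parameterization, the single `alpha_bar` noise level, the linear `toy_denoiser`, and all shapes are assumptions made for illustration, and random arrays stand in for the image latents and for the LMM's visual outputs.

```python
import numpy as np

def add_noise(z0, eps, alpha_bar):
    """Mix clean latents z0 with Gaussian noise eps at noise level alpha_bar."""
    return np.sqrt(alpha_bar) * z0 + np.sqrt(1.0 - alpha_bar) * eps

def toy_denoiser(z_t, cond, w):
    """Hypothetical linear denoiser: predicts the noise from the noisy
    latents concatenated with the conditioning features."""
    return np.concatenate([z_t, cond], axis=-1) @ w

def denoising_loss(z0, cond, w, alpha_bar, rng):
    """Mean squared error between predicted and true noise."""
    eps = rng.standard_normal(z0.shape)
    z_t = add_noise(z0, eps, alpha_bar)
    eps_hat = toy_denoiser(z_t, cond, w)
    return float(np.mean((eps_hat - eps) ** 2))

rng = np.random.default_rng(0)
num_tokens, latent_dim, hidden_dim = 16, 4, 8  # assumed toy sizes

# Stand-in for latent representations of the input image (assumption).
z0 = rng.standard_normal((num_tokens, latent_dim))
# Stand-in for the LMM's visual outputs used as conditioning (assumption).
cond = rng.standard_normal((num_tokens, hidden_dim))
# Untrained denoiser weights; in practice these would be learned.
w = np.zeros((latent_dim + hidden_dim, latent_dim))

loss = denoising_loss(z0, cond, w, alpha_bar=0.5, rng=rng)
print(f"denoising loss: {loss:.4f}")
```

The point of the sketch is that the regression target is noise added in a latent space, not raw RGB values, which is how the abstract describes avoiding the spatial redundancy of natural images.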
Peer Reviews
Decision: ICLR 2025 Poster
- **Novel Image-Based Supervision**: ROSS leverages image reconstruction as a supervisory signal, enabling the model to capture fine-grained visual features and semantics that significantly reduce hallucination artifacts compared to text-supervised approaches. The idea conceptually makes sense and is proven in the experiments.
- **Comprehensive Analysis of Model Variants**: The paper provides a thorough study of various architectural choices and configurations within the ROSS framework, offering in
- Potential Unfairness in Comparisons: While the paper includes an ablation study where variables like training data are controlled for fair comparison with other models, its main results table appears to use different datasets compared to competing methods. This inconsistency in data setup might lead to an unfair advantage for ROSS, making it difficult to assess the true comparative effectiveness of the approach against state-of-the-art methods.
- Computational Overhead: The denoising process
Whereas text-based LLMs have achieved amazing results with next-token prediction alone, for image + text VLMs it has always seemed that doing next-token prediction only on text could be improved upon. In that regard, the technique proposed in this paper, to use image denoising as a pretext task, seems like a step forward, as a way to add more supervision to the VLM and to improve results. The benefits to the metrics are actually significant in some cases, not just epsilon levels, whic
1. I wish the benchmarks the paper uses to measure the benefits of the method more closely matched recent popular work such as "The Llama 3 Herd of Models" or "Qwen2-VL", which include benchmarks like TextVQA, DocVQA, etc. It may not change the conclusion, but when we compare methods, it's important to look at a representative distribution of benchmarks. Table 4 has some of these common benchmarks, but not all of them. Furthermore, I wish Table 4 (or perhaps Tabl
This paper is very novel and addresses the very important topic of vision-centric learning in LMMs. Specifically, the paper introduces an innovative vision-centric supervision method that leverages the inherent richness of input images, addressing a clear gap in existing LMM training approaches. The use of denoising objectives for latent representation reconstruction is particularly clever as it handles the spatial redundancy problem. The authors conduct extensive experiments across multiple benc
I think the major weakness is the computational cost. While the paper emphasizes the efficiency of using a single visual encoder, it lacks a detailed analysis of training time, memory requirements, and computational costs compared to baseline methods. In addition, the paper doesn't thoroughly discuss the sensitivity of ROSS to various hyperparameters, such as the denoising schedule or architecture choices. It would be beneficial to add this analysis and show that ROSS's denoising component is robu
Taxonomy
Topics: Education and Technology Integration · Tactile and Sensory Interactions
