Reconstructive Visual Instruction Tuning

Haochen Wang; Anlin Zheng; Yucheng Zhao; Tiancai Wang; Zheng Ge; Xiangyu Zhang; Zhaoxiang Zhang

arXiv:2410.09575·cs.CV·January 3, 2025

TL;DR

This paper presents ROSS, a novel visual instruction tuning method that trains large multimodal models to reconstruct input images, improving their fine-grained understanding and reducing hallucinations by leveraging vision-centric supervision signals.

Contribution

ROSS introduces a reconstructive supervision approach for LMMs that enhances visual output quality and understanding by focusing on image reconstruction rather than text-only supervision.

Findings

01. ROSS improves fine-grained visual comprehension.
02. It reduces hallucinations in multimodal models.
03. It achieves competitive performance with fewer visual experts.

Abstract

This paper introduces reconstructive visual instruction tuning (ROSS), a family of Large Multimodal Models (LMMs) that exploit vision-centric supervision signals. In contrast to conventional visual instruction tuning approaches that exclusively supervise text outputs, ROSS prompts LMMs to supervise visual outputs via reconstructing input images. By doing so, it capitalizes on the inherent richness and detail present within input images themselves, which are often lost in pure text supervision. However, producing meaningful feedback from natural images is challenging due to the heavy spatial redundancy of visual signals. To address this issue, ROSS employs a denoising objective to reconstruct latent representations of input images, avoiding directly regressing exact raw RGB values. This intrinsic activation design inherently encourages LMMs to maintain image detail, thereby enhancing…
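The denoising objective described in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration, not the authors' implementation: the cosine noise schedule, the noise-prediction form of the loss, the function names, and in particular `toy_denoiser`, a hypothetical closed-form stand-in for a learned network that is conditioned on the LMM's visual outputs. The sketch only shows the shape of the idea: noise is added to a clean image latent, a denoiser conditioned on the model's outputs predicts that noise, and the loss is measured in latent space rather than on raw RGB values.

```python
import math
import random


def cosine_alpha_bar(t, num_steps):
    """Cumulative signal level at step t (assumed cosine schedule)."""
    return math.cos(0.5 * math.pi * t / num_steps) ** 2


def add_noise(latent, noise, alpha_bar):
    """Forward process: blend the clean latent with Gaussian noise."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [a * z + b * e for z, e in zip(latent, noise)]


def toy_denoiser(noisy_latent, condition, alpha_bar):
    """Hypothetical stand-in for a learned denoiser.

    It treats `condition` (the LMM's visual outputs in this sketch) as an
    estimate of the clean latent and solves the forward equation for the noise.
    """
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [(x - a * c) / b for x, c in zip(noisy_latent, condition)]


def denoising_loss(latent, condition, t, num_steps, rng):
    """Mean squared error between predicted and true noise, for 1 <= t <= num_steps."""
    alpha_bar = cosine_alpha_bar(t, num_steps)
    noise = [rng.gauss(0.0, 1.0) for _ in latent]
    noisy = add_noise(latent, noise, alpha_bar)
    predicted = toy_denoiser(noisy, condition, alpha_bar)
    return sum((p - e) ** 2 for p, e in zip(predicted, noise)) / len(noise)


rng = random.Random(0)
clean_latent = [0.5, -1.0, 2.0]    # latent of the input image (made-up values)
good_outputs = list(clean_latent)  # model outputs that preserve image detail
poor_outputs = [0.0, 0.0, 0.0]     # model outputs that discard image detail

loss_good = denoising_loss(clean_latent, good_outputs, 50, 100, rng)
loss_poor = denoising_loss(clean_latent, poor_outputs, 50, 100, rng)
print(loss_good < loss_poor)
```

In this toy setup the loss is near zero only when the conditioning carries the information in the clean latent, which is the intuition behind using reconstruction as a supervision signal. A real system would replace `toy_denoiser` with a trained network and backpropagate the loss into the LMM.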

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01·Rating 6·Confidence 4

Strengths

- **Novel Image-Based Supervision**: ROSS leverages image reconstruction as a supervisory signal, enabling the model to capture fine-grained visual features and semantics that significantly reduce hallucination artifacts compared to text-supervised approaches. The idea conceptually makes sense and is proven in the experiments.
- **Comprehensive Analysis of Model Variants**: The paper provides a thorough study of various architectural choices and configurations within the ROSS framework, offering in

Weaknesses

- Potential Unfairness in Comparisons: While the paper includes an ablation study where variables like training data are controlled for fair comparison with other models, its main results table appears to use different datasets compared to competing methods. This inconsistency in data setup might lead to an unfair advantage for ROSS, making it difficult to assess the true comparative effectiveness of the approach against state-of-the-art methods.
- Computational Overhead: The denoising process

Reviewer 02·Rating 6·Confidence 4

Strengths

Whereas text-based LLMs have achieved amazing results with next-token prediction alone, for image + text VLMs it has always seemed that doing next-token prediction only on text could be improved upon. In that regard, the technique proposed in this paper, using image denoising as a pretext task, seems like a step forward, as a way to add more supervision to the VLM and to improve results. The benefits to the metrics are actually significant in some cases, not just epsilon levels, whic

Weaknesses

1. I wish the benchmarks cited in the paper to measure the benefits of the method more closely matched recent popular work such as "The Llama 3 Herd of Models" or "Qwen2-VL", which include benchmarks like TextVQA, DocVQA, etc. It may not change the conclusion, but when we compare methods, it's important to look at a representative distribution of benchmarks. Table 4 has some of these common benchmarks, but not all of them. Furthermore, I wish Table 4 (or perhaps Tabl

Reviewer 03·Rating 6·Confidence 4

Strengths

This paper is very novel and addresses the very important topic of vision-centric learning in LMMs. Specifically, the paper introduces an innovative vision-centric supervision method that leverages the inherent richness of input images, addressing a clear gap in existing LMM training approaches. The use of denoising objectives for latent representation reconstruction is particularly clever as it handles the spatial redundancy problem. The authors conduct extensive experiments across multiple benc

Weaknesses

I think the major weakness concerns computational cost. While the paper emphasizes the efficiency of using a single visual encoder, it lacks detailed analysis of training time, memory requirements, and computational costs compared to baseline methods. Besides, the paper doesn't thoroughly discuss the sensitivity of ROSS to various hyperparameters, such as the denoising schedule or architecture choices. It would be beneficial to add this analysis and show ROSS's denoising part is robu

Code & Models

Repositories

haochen-wang409/ross
pytorch

Models

🤗
HaochenWang/ross-qwen2-7b
model· 2 dl· ♡ 3

Taxonomy

Topics: Education and Technology Integration · Tactile and Sensory Interactions