$\Delta$-AttnMask: Attention-Guided Masked Hidden States for Efficient Data Selection and Augmentation

Jucheng Hu; Suorong Yang; Dongzhan Zhou

arXiv:2508.09199·cs.CV·August 14, 2025

TL;DR

$\Delta$-AttnMask is a novel, efficient data selection method for Vision-Language Model fine-tuning that evaluates sample quality through attention-guided masking, reducing data needs and improving accuracy.

Contribution

It introduces a model-agnostic, data-agnostic framework that assesses sample quality via loss differences, enabling effective data selection without extra labels or training.

Findings

01

Achieves state-of-the-art performance with only 20% of the data.

02

Accelerates training 5x compared with training on the full dataset.

03

Surpasses full-dataset baselines by +10.1% in accuracy.

Abstract

Visual Instruction Finetuning (VIF) is pivotal for post-training Vision-Language Models (VLMs). Unlike unimodal instruction finetuning in plain-text large language models, which mainly requires instruction datasets to enable model instruction-following ability, VIF also requires multimodal data to enable joint visual and textual understanding; therefore, it typically requires more data. Consequently, VIF imposes stricter data selection challenges: the method must scale efficiently to handle larger data demands while ensuring the quality of both visual and textual content, as well as their alignment. Despite its critical impact on performance, data selection for VIF remains an understudied area. In this paper, we propose $\Delta$-AttnMask. This data-efficient framework quantifies sample quality through attention-guided masking of the model's hidden states, jointly evaluating image-text…
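The scoring idea described above (mask hidden states under attention guidance, then compare losses) can be illustrated with a toy NumPy sketch. This is not the authors' implementation. The function names `masked_loss_delta` and `select_top_fraction`, the `mask_ratio` parameter, the choice to zero out the highest-attention tokens, and the choice to keep the samples with the largest loss difference are all assumptions made for illustration; the summary here does not specify them.

```python
import numpy as np


def masked_loss_delta(hidden, attn, loss_fn, mask_ratio=0.3):
    """Toy per-sample score: loss(masked hidden states) - loss(original).

    hidden:  (tokens, dim) array of hidden states for one sample.
    attn:    (tokens,) array of attention weights for the same tokens.
    loss_fn: callable mapping a (tokens, dim) array to a scalar loss.

    ASSUMPTION: the tokens with the highest attention are the ones masked,
    and masking means zeroing the hidden state. The paper may differ.
    """
    n_mask = max(1, int(round(mask_ratio * hidden.shape[0])))
    top = np.argsort(attn)[-n_mask:]   # indices of the most-attended tokens
    masked = hidden.copy()
    masked[top] = 0.0                  # zero out those hidden states
    return loss_fn(masked) - loss_fn(hidden)


def select_top_fraction(scores, fraction=0.2):
    """Return indices of the highest-scoring samples, best first.

    ASSUMPTION: a larger loss difference marks a more useful sample.
    """
    k = max(1, int(round(fraction * len(scores))))
    return np.argsort(scores)[::-1][:k]
```

In use, one would compute `masked_loss_delta` once per training sample with the model's own loss in place of `loss_fn`, collect the scores into an array, and pass them to `select_top_fraction` with `fraction=0.2` to mirror the 20% budget reported in the findings.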
