$\Delta$-AttnMask: Attention-Guided Masked Hidden States for Efficient Data Selection and Augmentation
Jucheng Hu, Suorong Yang, Dongzhan Zhou

TL;DR
$\Delta$-AttnMask is a novel, efficient data selection method for Vision-Language Model fine-tuning that evaluates sample quality through attention-guided masking, reducing data needs and improving accuracy.
Contribution
It introduces a model-agnostic, data-agnostic framework that assesses sample quality via loss differences, enabling effective data selection without extra labels or training.
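As a rough illustration of scoring by loss differences, the minimal sketch below assumes each sample already has two loss values: one from an ordinary forward pass and one from a pass with attention-guided masking applied to the hidden states. How those losses are produced is not shown here. The function names, the sign convention (masked minus original), and the choice to keep the samples with the largest difference are assumptions made for illustration, not details confirmed by the paper.

```python
from typing import List, Sequence


def delta_scores(loss_original: Sequence[float],
                 loss_masked: Sequence[float]) -> List[float]:
    """Per-sample loss difference: masked loss minus original loss.

    Hypothetical helper; the sign convention is an assumption.
    """
    if len(loss_original) != len(loss_masked):
        raise ValueError("loss lists must have the same length")
    return [m - o for o, m in zip(loss_original, loss_masked)]


def select_top_fraction(scores: Sequence[float],
                        fraction: float = 0.2) -> List[int]:
    """Return the indices of the highest-scoring samples, in ascending order.

    Keeping the largest differences is an assumption; the actual
    selection rule may rank samples differently.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    k = max(1, round(len(scores) * fraction))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Toy losses for five samples (made-up numbers, for illustration only).
loss_original = [1.0, 1.0, 1.0, 1.0, 1.0]
loss_masked = [1.1, 2.5, 1.0, 1.7, 1.2]

deltas = delta_scores(loss_original, loss_masked)
kept = select_top_fraction(deltas, fraction=0.4)  # keeps samples 1 and 3
```

Because the score needs only forward-pass losses, a selection step of this shape requires no extra labels or additional training, which matches the framing above.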
Findings
Achieves state-of-the-art performance with only 20% of data.
Accelerates training by 5 times compared with training on the full dataset.
Surpasses full-dataset baselines by +10.1% accuracy.
Abstract
Visual Instruction Finetuning (VIF) is pivotal for post-training Vision-Language Models (VLMs). Unlike unimodal instruction finetuning in plain-text large language models, which mainly requires instruction datasets to enable model instruction-following ability, VIF also requires multimodal data to enable joint visual and textual understanding; therefore, it typically requires more data. Consequently, VIF imposes stricter data selection challenges: the method must scale efficiently to handle larger data demands while ensuring the quality of both visual and textual content, as well as their alignment. Despite its critical impact on performance, data selection for VIF remains an understudied area. In this paper, we propose $\Delta$-AttnMask. This data-efficient framework quantifies sample quality through attention-guided masking of the model's hidden states, jointly evaluating image-text…
