Better Reasoning with Less Data: Enhancing VLMs Through Unified Modality Scoring

Mingjie Xu; Andrew Estornell; Hongzheng Yang; Yuzhi Zhao; Zhaowei Zhu; Qi Xuan; Jiaheng Wei

arXiv:2506.08429·cs.CV·June 11, 2025

TL;DR

This paper introduces SCALE, a data quality assessment and selection pipeline that improves vision-language models by evaluating and selecting high-quality, well-aligned image-text data, reducing data noise and enhancing reasoning capabilities.

Contribution

We propose SCALE, a unified modality scoring framework that assesses and filters multimodal data based on alignment, clarity, and task relevance, improving VLM instruction tuning.
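
The idea of scoring samples and keeping only the best ones can be illustrated with a minimal sketch. This is not the authors' implementation: the `Sample` class, the three per-sample scores in [0, 1], the equal weighting, and the `keep_fraction` parameter are all hypothetical names and assumptions chosen for illustration. The sketch only shows how scores for alignment, clarity, and task relevance might be combined into one number and used to rank and filter a dataset.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One image-text example with hypothetical quality scores in [0, 1]."""
    sample_id: str
    alignment: float   # assumed: how well the text matches the image
    clarity: float     # assumed: how unambiguous the text is
    relevance: float   # assumed: how relevant the sample is to the task


def combined_score(s, weights=(1.0, 1.0, 1.0)):
    """Weighted mean of the three scores; equal weights are an assumption."""
    w_a, w_c, w_r = weights
    total = w_a + w_c + w_r
    return (w_a * s.alignment + w_c * s.clarity + w_r * s.relevance) / total


def select_top_fraction(samples, keep_fraction=0.5):
    """Rank samples by combined score and keep the top fraction (at least one)."""
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    ranked = sorted(samples, key=combined_score, reverse=True)
    n_keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[:n_keep]


# Toy data with made-up scores, purely for illustration.
data = [
    Sample("a", 0.9, 0.8, 0.7),
    Sample("b", 0.2, 0.9, 0.4),
    Sample("c", 0.6, 0.5, 0.9),
    Sample("d", 0.1, 0.2, 0.3),
]
kept = select_top_fraction(data, keep_fraction=0.5)
print([s.sample_id for s in kept])
```

In practice the scores would come from some model-based assessment rather than being hard-coded, and the weighting and cutoff would need tuning; the paper itself should be consulted for how SCALE actually computes and combines its criteria.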

Findings

01. SCALE effectively filters high-quality data for VLM training.

02. Generated captions help unify multimodal data into a single text modality.

03. Proper data selection enhances model reasoning and robustness.

Abstract

The application of visual instruction tuning and other post-training techniques has significantly enhanced the capabilities of Large Language Models (LLMs) in visual understanding, enriching Vision-Language Models (VLMs) with more comprehensive visual language datasets. However, the effectiveness of VLMs is highly dependent on large-scale, high-quality datasets that ensure precise recognition and accurate reasoning. Two key challenges hinder progress: (1) noisy alignments between images and the corresponding text, which lead to misinterpretation, and (2) ambiguous or misleading text, which obscures visual content. To address these challenges, we propose SCALE (Single modality data quality and Cross modality Alignment Evaluation), a novel quality-driven data selection pipeline for VLM instruction tuning datasets. Specifically, SCALE integrates a cross-modality assessment framework that…

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling