S2H-DPO: Hardness-Aware Preference Optimization for Vision-Language Models
Nitish Shukla, Surgan Jandial, Arun Ross

TL;DR
This paper introduces S2H-DPO, a framework that systematically enhances vision-language models' multi-image reasoning abilities through a hierarchical, prompt-driven preference data construction approach.
Contribution
It presents a novel Simple-to-Hard learning framework that improves multi-image reasoning in vision-language models without sacrificing single-image performance.
Findings
Significant performance improvements for LLaVA and Qwen-VL models.
Enhanced multi-image reasoning capabilities while maintaining single-image accuracy.
Effective use of prompt-driven complexity for data generation.
Abstract
Vision-Language Models (VLMs) have demonstrated remarkable progress in single-image understanding, yet effective reasoning across multiple images remains challenging. We identify a critical capability gap in existing multi-image alignment approaches: current methods focus primarily on localized reasoning with pre-specified image indices ("Look at Image 3 and..."), bypassing the essential skills of global visual search and autonomous cross-image comparison. To address this limitation, we introduce a Simple-to-Hard (S2H) learning framework that systematically constructs multi-image preference data across three hierarchical reasoning levels requiring an increasing level of capabilities: (1) single-image localized reasoning, (2) multi-image localized comparison, and (3) global visual search. Unlike prior work that relies on model-specific attributes, such as hallucinations or attention…
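The three reasoning levels named in the abstract can be pictured as a small tagged data structure. The sketch below is illustrative only and is not the authors' pipeline: the prompt templates, the `PreferencePair` fields, and the idea of sorting pairs from level 1 to level 3 before training are all assumptions made for the example, since the abstract does not say how prompts are worded or how the data is scheduled.

```python
from dataclasses import dataclass
from enum import IntEnum


class Level(IntEnum):
    """The three hierarchical reasoning levels listed in the abstract."""
    SINGLE_IMAGE_LOCALIZED = 1
    MULTI_IMAGE_LOCALIZED = 2
    GLOBAL_VISUAL_SEARCH = 3


@dataclass(frozen=True)
class PreferencePair:
    """One DPO-style preference example (field names are hypothetical)."""
    prompt: str
    chosen: str
    rejected: str
    level: Level


# Hypothetical prompt templates, one per level. Levels 1 and 2 name the
# image indices explicitly; level 3 leaves the search to the model.
TEMPLATES = {
    Level.SINGLE_IMAGE_LOCALIZED: "Look at Image {i}. {question}",
    Level.MULTI_IMAGE_LOCALIZED: "Compare Image {i} and Image {j}. {question}",
    Level.GLOBAL_VISUAL_SEARCH: "Across all the images, {question}",
}


def build_prompt(level: Level, question: str, i: int = 1, j: int = 2) -> str:
    """Fill the template for the given level; unused indices are ignored."""
    return TEMPLATES[level].format(i=i, j=j, question=question)


def simple_to_hard(pairs: list[PreferencePair]) -> list[PreferencePair]:
    """Order pairs from level 1 to level 3 (an assumed curriculum order)."""
    return sorted(pairs, key=lambda p: p.level)
```

For instance, `build_prompt(Level.SINGLE_IMAGE_LOCALIZED, "What color is the car?", i=3)` yields a prompt with a pre-specified index, while the level-3 template omits any index so that the model must locate the relevant image itself.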
