S2H-DPO: Hardness-Aware Preference Optimization for Vision-Language Models

Nitish Shukla; Surgan Jandial; Arun Ross

arXiv:2604.18512·cs.CV·April 21, 2026

TL;DR

This paper introduces S2H-DPO, a framework that systematically enhances vision-language models' multi-image reasoning abilities through a hierarchical, prompt-driven preference data construction approach.

Contribution

It presents a novel Simple-to-Hard learning framework that improves multi-image reasoning in vision-language models without sacrificing single-image performance.

Findings

01

Significant performance improvements when applied to LLaVA and Qwen-VL models.

02

Enhanced multi-image reasoning capabilities while maintaining single-image accuracy.

03

Preference data is generated by varying prompt complexity rather than relying on model-specific attributes.

Abstract

Vision-Language Models (VLMs) have demonstrated remarkable progress in single-image understanding, yet effective reasoning across multiple images remains challenging. We identify a critical capability gap in existing multi-image alignment approaches: current methods focus primarily on localized reasoning with pre-specified image indices ("Look at Image 3 and..."), bypassing the essential skills of global visual search and autonomous cross-image comparison. To address this limitation, we introduce a Simple-to-Hard (S2H) learning framework that systematically constructs multi-image preference data across three hierarchical reasoning levels requiring an increasing level of capabilities: (1) single-image localized reasoning, (2) multi-image localized comparison, and (3) global visual search. Unlike prior work that relies on model-specific attributes, such as hallucinations or attention…
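The three reasoning levels named in the abstract can be pictured as tags on preference pairs, which are then scored with a DPO-style objective. The sketch below is illustrative only and is not the authors' implementation. The record layout (`PreferencePair`), the helper `order_simple_to_hard`, and the default `beta` are assumptions, and the loss shown is the standard DPO loss for a single pair, not a formulation confirmed by this page.

```python
import math
from dataclasses import dataclass
from enum import IntEnum


class Level(IntEnum):
    """The three reasoning levels listed in the abstract, simple to hard."""
    SINGLE_IMAGE_LOCALIZED = 1
    MULTI_IMAGE_COMPARISON = 2
    GLOBAL_VISUAL_SEARCH = 3


@dataclass
class PreferencePair:
    """Hypothetical record layout; all field names are assumptions."""
    level: Level
    prompt: str
    chosen: str    # preferred response
    rejected: str  # dispreferred response


def order_simple_to_hard(pairs):
    """Sort pairs by reasoning level (an assumed ordering, for illustration)."""
    return sorted(pairs, key=lambda p: p.level)


def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one pair, given sequence log-probabilities
    under the policy (logp_*) and a frozen reference model (ref_*)."""
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    # -log(sigmoid(margin)), computed in a numerically stable way
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy and the reference assign identical log-probabilities, the margin is zero and the loss equals log 2. The loss falls below that value as the policy favors the chosen response more strongly than the reference does.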
