VisualDeltas: Learning Preferences from Visual Quality Perturbations
Hailiang Huang, Yihao Liu, Shengyue Guan, Haoze Li, Sujian Li

TL;DR
VisualDeltas is a preference-learning framework that uses visual quality variations in multimodal data to extract supervision signals, improving model generalization without needing human annotations.
Contribution
It introduces a novel, lightweight method for learning preferences from visual quality perturbations, applicable in both label-free and label-based settings.
Findings
Outperforms rejection-sampling fine-tuning across benchmarks
Enhances model generalization to visual degradations
Works across diverse multimodal datasets and model scales
Abstract
We present VisualDeltas, a lightweight preference-learning framework that extracts supervision from visual quality variations in multimodal data. By leveraging the systematic impact of image quality on visual perception and reasoning, VisualDeltas induces informative preference signals without relying on human annotations or external teachers. The framework supports both label-free and label-based regimes, enabling flexible use of available supervision when present. Across diverse multimodal benchmarks and model scales, VisualDeltas consistently outperforms rejection-sampling fine-tuning and improves generalization. It also extends naturally to a range of visual degradations.
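To make the idea of a preference signal induced by image quality concrete, here is a minimal sketch of how one such pair might be assembled. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the pairing rule, so treating the response to the original image as "chosen" and the response to a degraded copy as "rejected" is an assumption, as is the label-based filter. All names (`downsample_upsample`, `PreferencePair`, `build_pair`, `toy_respond`) are hypothetical, and the image is a toy grid of grayscale values rather than real image data.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

Image = List[List[int]]  # toy grayscale image, values 0-255


def downsample_upsample(image: Image, factor: int = 2) -> Image:
    """Degrade quality: keep one pixel per factor x factor block and repeat it."""
    height, width = len(image), len(image[0])
    return [
        [image[(r // factor) * factor][(c // factor) * factor] for c in range(width)]
        for r in range(height)
    ]


@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # response produced from the original image (assumed preferred)
    rejected: str  # response produced from the degraded image


def build_pair(
    image: Image,
    prompt: str,
    respond: Callable[[Image, str], str],
    label: Optional[str] = None,
) -> Optional[PreferencePair]:
    """Build one preference pair, or return None when there is no usable signal.

    label=None is the label-free regime. With a label, the pair is kept only
    if the original-image response matches it (an assumed filtering rule).
    """
    chosen = respond(image, prompt)
    rejected = respond(downsample_upsample(image), prompt)
    if chosen == rejected:
        return None  # degradation did not change the answer
    if label is not None and chosen != label:
        return None  # label-based regime: discard pairs with a wrong "chosen"
    return PreferencePair(prompt, chosen, rejected)


def toy_respond(image: Image, prompt: str) -> str:
    """Hypothetical stand-in for a multimodal model: counts distinct pixel values."""
    return str(len({pixel for row in image for pixel in row}))


checkerboard = [[0, 255], [255, 0]]
pair = build_pair(checkerboard, "How many distinct shades?", toy_respond)
```

In practice, `respond` would wrap a real multimodal model, and the resulting pairs would feed a preference-optimization objective. The abstract does not name a specific objective, so that choice is left open here.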
Taxonomy
Topics: Multimodal Machine Learning Applications · Image and Video Quality Assessment · Data Visualization and Analytics
