Feedback-Driven Vision-Language Alignment with Minimal Human Supervision

Giorgio Giannone; Ruoteng Li; Qianli Feng; Evgeny Perevodchikov; Rui Chen; Aleix Martinez

arXiv:2501.04568·cs.CV·May 20, 2025

Open Access

TL;DR

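This paper presents SVP (Sampling-based Visual Projection), a framework that improves vision-language model alignment with minimal human supervision by combining self-captioning with feedback from a pre-trained grounding model. Reported gains include a 14% average improvement in captioning, up to a 12% increase in object recall, and reduced hallucination.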

Contribution

Introduces SVP, a novel sampling-based framework that enhances vision-language alignment without extensive curated data or preference annotations.

Findings

01. 14% average improvement in captioning tasks
02. Up to 12% increase in object recall
03. Significant hallucination reduction

Abstract

Vision-language models (VLMs) have demonstrated remarkable potential in integrating visual and linguistic information, but their performance is often constrained by the need for extensive, high-quality image-text training data. Curation of these image-text pairs is both time-consuming and computationally expensive. To address this challenge, we introduce SVP (Sampling-based Visual Projection), a novel framework that enhances vision-language alignment without relying on manually curated text-image pairs or preference annotation. SVP leverages a small set of manually selected images, self-captioning and a pre-trained grounding model as a feedback mechanism to elicit latent information in VLMs. We evaluate our approach across six key areas: captioning, referring, visual question answering, multitasking, hallucination control, and object recall. Results demonstrate significant improvements,…
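
The abstract names the ingredients (sampled self-captions and a grounding model used as feedback) but does not spell out the procedure. As a purely illustrative sketch, and not the authors' implementation, one way such feedback could rank sampled captions is to score each caption by how many of the objects it mentions are confirmed by the grounder. Everything here is a hypothetical stand-in: `grounding_score`, `select_best`, the `vocabulary` of object words, and the `grounded` set representing a grounding model's detections.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoredCaption:
    caption: str
    score: float


def grounding_score(caption: str, grounded: set[str], vocabulary: set[str]) -> float:
    """Fraction of object words in the caption that the (mock) grounder confirmed.

    `vocabulary` is an assumed list of object words; `grounded` stands in for
    the objects a real grounding model would localize in the image.
    """
    mentioned = {w.strip(".,").lower() for w in caption.split()} & vocabulary
    if not mentioned:
        return 0.0
    return len(mentioned & grounded) / len(mentioned)


def select_best(captions: list[str], grounded: set[str], vocabulary: set[str]) -> ScoredCaption:
    """Score every sampled caption and keep the best-grounded one."""
    scored = [ScoredCaption(c, grounding_score(c, grounded, vocabulary)) for c in captions]
    return max(scored, key=lambda s: s.score)


# Toy usage: three sampled captions for one image, with mock detections.
vocabulary = {"dog", "ball", "cat", "frisbee"}
grounded = {"dog", "ball"}
samples = [
    "A dog chases a cat.",
    "A dog plays with a ball.",
    "A frisbee on grass.",
]
best = select_best(samples, grounded, vocabulary)
print(best.caption, best.score)
```

In this toy setup, a caption that mentions unconfirmed objects scores lower, which is one simple way feedback of this kind could discourage hallucinated content. How SVP actually uses the grounding signal is not stated in the truncated abstract.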


Taxonomy

Topics: Asymmetric Hydrogenation and Catalysis

Methods: Sparse Evolutionary Training