VladVA: Discriminative Fine-tuning of LVLMs

Yassine Ouali; Adrian Bulat; Alexandros Xenos; Anestis Zaganidis; Ioannis Maniadis Metaxas; Brais Martinez; Georgios Tzimiropoulos

arXiv:2412.04378 · cs.CV · May 12, 2025 · 2 cites

TL;DR

VladVA introduces a training method that turns generative Large Vision-Language Models (LVLMs) into discriminative models, combining contrastive learning with the language understanding of an LLM for stronger image-text discrimination.

Contribution

The paper presents a new training framework and adaptation method that enhance LVLMs' discriminative and compositional capabilities, outperforming existing models like CLIP.

Findings

01. Significant improvements in image-text retrieval benchmarks.

02. Enhanced compositional reasoning abilities.

03. Effective parameter-efficient fine-tuning approach.

Abstract

Contrastively-trained Vision-Language Models (VLMs) like CLIP have become the de facto approach for discriminative vision-language representation learning. However, these models have limited language understanding, often exhibiting a "bag of words" behavior. At the same time, Large Vision-Language Models (LVLMs), which combine vision encoders with LLMs, have been shown to be capable of detailed vision-language reasoning, yet their autoregressive nature renders them less suitable for discriminative tasks. In this work, we propose to combine "the best of both worlds": a new training approach for discriminative fine-tuning of LVLMs that results in strong discriminative and compositional capabilities. Essentially, our approach converts a generative LVLM into a discriminative one, unlocking its capability for powerful image-text discrimination combined with enhanced language understanding.…
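For readers unfamiliar with the contrastive objective the abstract refers to, the following is a minimal sketch of a CLIP-style symmetric contrastive (InfoNCE) loss over a batch of paired image and text embeddings. It is not the paper's training code: the function name `symmetric_contrastive_loss`, the `temperature` default, and the use of NumPy are assumptions made for illustration, and the truncated abstract does not specify how VladVA itself computes or combines its losses.

```python
import numpy as np

def symmetric_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Illustrative CLIP-style symmetric InfoNCE loss (hypothetical helper).

    image_emb, text_emb: arrays of shape (batch, dim); row i of each
    is assumed to come from the same image-text pair.
    """
    # L2-normalise so that dot products are cosine similarities.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # Pairwise similarity matrix, scaled by the temperature.
    logits = image_emb @ text_emb.T / temperature  # shape (batch, batch)

    def cross_entropy_diagonal(scores):
        # For row i, the correct "class" is column i (the matching pair).
        scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
        log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits.T))
```

The loss is near zero when each image embedding is most similar to its own caption and grows when matching pairs are misaligned, which is the behaviour a discriminative image-text model is trained toward.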

Taxonomy

Topics: Control Systems in Engineering · Power Systems and Technologies · Industrial Technology and Control Systems

Methods: Discriminative Fine-Tuning · Contrastive Language-Image Pre-training