VladVA: Discriminative Fine-tuning of LVLMs
Yassine Ouali, Adrian Bulat, Alexandros Xenos, Anestis Zaganidis, Ioannis Maniadis Metaxas, Brais Martinez, Georgios Tzimiropoulos

TL;DR
VladVA introduces a training method that converts generative LVLMs into discriminative models, pairing contrastive-style image-text discrimination with the stronger language understanding of LLMs.
Contribution
The paper presents a new training framework and adaptation method that enhance LVLMs' discriminative and compositional capabilities, outperforming existing models like CLIP.
Findings
Significant improvements in image-text retrieval benchmarks.
Enhanced compositional reasoning abilities.
Effective parameter-efficient fine-tuning approach.
Abstract
Contrastively-trained Vision-Language Models (VLMs) like CLIP have become the de facto approach for discriminative vision-language representation learning. However, these models have limited language understanding, often exhibiting a "bag of words" behavior. At the same time, Large Vision-Language Models (LVLMs), which combine vision encoders with LLMs, have been shown to be capable of detailed vision-language reasoning, yet their autoregressive nature renders them less suitable for discriminative tasks. In this work, we propose to combine "the best of both worlds": a new training approach for discriminative fine-tuning of LVLMs that results in strong discriminative and compositional capabilities. Essentially, our approach converts a generative LVLM into a discriminative one, unlocking its capability for powerful image-text discrimination combined with enhanced language understanding.…
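The abstract refers to contrastive training as the basis of CLIP-style discriminative models. As a rough illustration of what that means, the sketch below computes a standard symmetric image-text contrastive (InfoNCE) loss over a batch of paired embeddings. It is not taken from the paper and is not VladVA's training objective, which this excerpt does not specify. The function names, the plain-list embedding format, and the temperature of 0.07 are all assumptions made for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def symmetric_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """CLIP-style loss: pair k of images and texts is the positive,
    every other pairing in the batch is a negative.
    The temperature value is an illustrative assumption."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)

    # Cosine similarity between every image and every text, scaled by temperature.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]

    # Image-to-text direction: each row should peak on its diagonal entry.
    i2t = -sum(log_softmax(logits[k])[k] for k in range(n)) / n

    # Text-to-image direction: same computation on the transposed matrix.
    columns = [[logits[r][c] for r in range(n)] for c in range(n)]
    t2i = -sum(log_softmax(columns[k])[k] for k in range(n)) / n

    return 0.5 * (i2t + t2i)


# Toy batch: correctly paired embeddings versus deliberately swapped ones.
aligned = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                     [[1.0, 0.0], [0.0, 1.0]])
shuffled = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                      [[0.0, 1.0], [1.0, 0.0]])
```

In this toy batch the correctly paired embeddings yield a loss near zero, while the swapped pairing is heavily penalized. That gap is the signal a discriminative model is trained on.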
Taxonomy
Topics: Control Systems in Engineering · Power Systems and Technologies · Industrial Technology and Control Systems
Methods: Discriminative Fine-Tuning · Contrastive Language-Image Pre-training
