Preserving Multi-Modal Capabilities of Pre-trained VLMs for Improving Vision-Linguistic Compositionality

Youngtaek Oh; Jae Won Cho; Dong-Jin Kim; In So Kweon; Junmo Kim

arXiv:2410.05210·cs.CV·October 8, 2024

TL;DR

This paper introduces FSC-CLIP, a method that enhances compositional understanding in pre-trained vision-language models while maintaining their multi-modal capabilities, addressing limitations of traditional fine-tuning approaches.

Contribution

FSC-CLIP integrates a local hard negative loss and selective calibrated regularization to improve compositionality without degrading multi-modal performance.

Findings

01. Achieves state-of-the-art compositionality performance
02. Retains strong multi-modal capabilities
03. Effective across diverse benchmarks

Abstract

In this paper, we propose a new method to enhance compositional understanding in pre-trained vision and language models (VLMs) without sacrificing performance in zero-shot multi-modal tasks. Traditional fine-tuning approaches often improve compositional reasoning at the cost of degrading multi-modal capabilities, primarily due to the use of global hard negative (HN) loss, which contrasts global representations of images and texts. This global HN loss pushes away HN texts that are highly similar to the original texts, damaging the model's multi-modal representations. To overcome this limitation, we propose Fine-grained Selective Calibrated CLIP (FSC-CLIP), which integrates local hard negative loss and selective calibrated regularization. These innovations provide fine-grained negative supervision while preserving the model's representational integrity. Our extensive evaluations across diverse…
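
To make the abstract's notion of a global HN loss concrete, the following is a minimal sketch of an image-to-text contrastive loss in which each image's candidate set is extended with hard negative captions. It is an illustration under stated assumptions, not the authors' implementation and not the local HN loss that FSC-CLIP proposes: the function name `global_hn_loss`, the one-hard-negative-per-image layout, and the temperature value are all hypothetical, and plain Python lists stand in for encoder outputs.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit L2 norm so that dot() gives cosine similarity."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def global_hn_loss(image_embs, text_embs, hn_text_embs, temperature=0.07):
    """Image-to-text InfoNCE over global embeddings with hard negatives appended.

    Assumed layout (hypothetical): text_embs[i] is the caption paired with
    image_embs[i], and hn_text_embs[i] is a hard negative caption for it.
    """
    image_embs = [normalize(v) for v in image_embs]
    text_embs = [normalize(v) for v in text_embs]
    hn_text_embs = [normalize(v) for v in hn_text_embs]

    # Every image is contrasted against all batch captions plus all hard negatives.
    candidates = text_embs + hn_text_embs

    total = 0.0
    for i, img in enumerate(image_embs):
        logits = [dot(img, t) / temperature for t in candidates]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        # Cross-entropy with the paired caption (index i) as the target.
        total += log_denom - logits[i]
    return total / len(image_embs)


# Toy usage: a hard negative that nearly coincides with the true caption
# is penalized more strongly than one that is far away.
images = [[1.0, 0.0], [0.0, 1.0]]
texts = [[1.0, 0.0], [0.0, 1.0]]
near_hn = [[0.99, 0.10], [0.10, 0.99]]
far_hn = [[-1.0, 0.0], [0.0, -1.0]]
print(global_hn_loss(images, texts, near_hn) > global_hn_loss(images, texts, far_hn))
```

In this sketch, a hard negative whose embedding is nearly identical to the true caption's yields a larger loss, so gradient descent would push the two text embeddings apart. That is the behavior the abstract identifies as harmful to the model's multi-modal representations.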

Code & Models

Repositories

ytaek-oh/fsc-clip
PyTorch · Official

Models

🤗 ytaek-oh/fsc-clip
model · ♡ 1

Videos

Preserving Multi-Modal Capabilities of Pre-trained VLMs for Improving Vision-Linguistic Compositionality

Taxonomy

Topics: Tactile and Sensory Interactions · Infrastructure Maintenance and Monitoring

Methods: Contrastive Language-Image Pre-training