LMM-Regularized CLIP Embeddings for Image Classification

Maria Tzelepi; Vasileios Mezaris

arXiv:2412.11663 · cs.CV · December 17, 2024

Open Access

TL;DR

This paper introduces a novel regularization method for CLIP image embeddings using a Large Multimodal Model to generate semantic descriptions, enhancing image classification accuracy.

Contribution

It proposes a new LMM-based regularization technique that improves CLIP's image classification performance by aligning image embeddings with semantic descriptions.

Findings

01 Enhanced classification accuracy across three datasets

02 Improved embedding discrimination ability

03 Effective regularization method validated experimentally

Abstract

In this paper, we address image classification using the powerful CLIP vision-language model. Our goal is to improve the classification performance of CLIP's image encoder by proposing a novel regularization method based on a Large Multimodal Model (LMM). The proposed method uses an LMM to extract semantic descriptions for the images of the dataset. It then uses CLIP's text encoder, kept frozen, to obtain the corresponding text embeddings and compute the mean semantic class descriptions. Subsequently, we adapt CLIP's image encoder by adding a classification head, and we train it, along with the image encoder output, using an auxiliary objective in addition to the main classification objective. The auxiliary objective forces the embeddings at the image encoder's output to become similar to their corresponding LMM-generated mean semantic class descriptions.…
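The two objectives described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract is truncated and does not state the similarity measure or how the two objectives are weighted, so the squared Euclidean distance, the `weight` parameter, and all function names below are assumptions chosen for illustration. Plain Python lists stand in for the outputs of the CLIP text encoder, the image encoder, and the classification head.

```python
import math


def class_means(text_embeddings, labels, num_classes):
    """Average the per-image text embeddings of each class.

    Stands in for the "mean semantic class descriptions"; each row of
    text_embeddings is assumed to come from a frozen text encoder.
    """
    dim = len(text_embeddings[0])
    sums = [[0.0] * dim for _ in range(num_classes)]
    counts = [0] * num_classes
    for emb, y in zip(text_embeddings, labels):
        counts[y] += 1
        for j, v in enumerate(emb):
            sums[y][j] += v
    # max(..., 1) avoids division by zero for a class with no samples.
    return [[s / max(counts[c], 1) for s in sums[c]] for c in range(num_classes)]


def cross_entropy(logits, label):
    """Numerically stable softmax cross-entropy for one sample."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def squared_distance(a, b):
    """Squared Euclidean distance (assumed similarity measure)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def total_loss(logits_batch, image_embeddings, labels, means, weight=1.0):
    """Main classification loss plus the auxiliary regularization term.

    `weight` is a hypothetical balancing factor, not taken from the paper.
    """
    n = len(labels)
    ce = sum(cross_entropy(l, y) for l, y in zip(logits_batch, labels)) / n
    reg = sum(
        squared_distance(e, means[y]) for e, y in zip(image_embeddings, labels)
    ) / n
    return ce + weight * reg
```

In this sketch the class means are computed once from the text side and treated as fixed targets, which mirrors the abstract's statement that the text encoder is frozen. The auxiliary term vanishes when each image embedding coincides with its class mean, leaving only the classification loss.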

Taxonomy

Topics: Neural Networks and Applications

Methods: Contrastive Language-Image Pre-training