LMM-Regularized CLIP Embeddings for Image Classification
Maria Tzelepi, Vasileios Mezaris

TL;DR
This paper introduces a regularization method for CLIP image embeddings that uses a Large Multimodal Model (LMM) to generate semantic descriptions of the dataset images, improving image classification accuracy.
Contribution
It proposes a new LMM-based regularization technique that improves the classification performance of CLIP's image encoder by pulling image embeddings toward the mean text embeddings of LMM-generated semantic class descriptions.
Findings
Enhanced classification accuracy across three datasets
Improved embedding discrimination ability
Effective regularization method validated experimentally
Abstract
In this paper we address image classification with the CLIP vision-language model. Our goal is to improve the classification performance of CLIP's image encoder by proposing a novel regularization method based on a Large Multimodal Model (LMM). The proposed method uses an LMM to extract semantic descriptions for the images of the dataset. It then uses CLIP's frozen text encoder to obtain the corresponding text embeddings and compute the mean semantic class descriptions. Subsequently, we adapt CLIP's image encoder by adding a classification head, and we train it, along with the image encoder output, using an auxiliary objective in addition to the main classification objective. The auxiliary objective forces the embeddings at the image encoder's output to become similar to their corresponding LMM-generated mean semantic class descriptions.…
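The two-part objective described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not state which similarity measure or loss weighting is used, so the squared Euclidean distance, the `reg_weight` parameter, and all function names below are assumptions made for illustration. The arrays stand in for embeddings that would come from CLIP's text and image encoders.

```python
import numpy as np

def mean_class_descriptions(text_emb, labels, num_classes):
    """Average the text embeddings of the LMM descriptions per class.

    text_emb: (N, D) embeddings from the frozen text encoder.
    labels:   (N,) integer class labels.
    Assumes every class has at least one sample.
    """
    means = np.zeros((num_classes, text_emb.shape[1]))
    for c in range(num_classes):
        means[c] = text_emb[labels == c].mean(axis=0)
    return means

def cross_entropy(logits, labels):
    """Mean softmax cross-entropy: the main classification objective."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def regularized_loss(logits, image_emb, labels, class_means, reg_weight=1.0):
    """Classification loss plus an auxiliary term pulling each image
    embedding toward the mean description embedding of its class.

    The squared Euclidean distance and reg_weight are assumptions;
    the abstract only says the embeddings should "become similar".
    """
    ce = cross_entropy(logits, labels)
    targets = class_means[labels]                      # (N, D) per-sample target
    reg = ((image_emb - targets) ** 2).sum(axis=1).mean()
    return ce + reg_weight * reg, ce, reg
```

In a training loop, `logits` would be produced by the classification head and `image_emb` by the image encoder, while `class_means` would be computed once up front, since the text encoder stays frozen.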
Taxonomy
Topics: Neural Networks and Applications
Methods: Contrastive Language-Image Pre-training
