CountCLIP -- [Re] Teaching CLIP to Count to Ten

Harshvardhan Mestha; Tejas Agrawal; Karan Bania; Shreyas V; Yash Bhisikar

arXiv:2406.03586·cs.CV·June 11, 2024

TL;DR

This paper reproduces and evaluates CountCLIP, a method that fine-tunes CLIP models to enhance zero-shot counting accuracy without sacrificing classification performance, demonstrating improved quantitative understanding of objects.

Contribution

It provides a reproducibility study of CountCLIP, confirming its effectiveness and offering an accessible implementation for further research.

Findings

01. Improved zero-shot counting accuracy on a subset of data

02. Maintained zero-shot classification performance

03. Reproducibility of CountCLIP's results confirmed

Abstract

Large vision-language models (VLMs) have been shown to learn rich joint image-text representations that enable strong performance on relevant downstream tasks. However, they fail to demonstrate a quantitative understanding of objects and lack good counting-aware representations. This paper presents a reproducibility study of 'Teaching CLIP to Count to Ten' (Paiss et al., 2023), which proposes a method for fine-tuning a CLIP model (Radford et al., 2021) to improve zero-shot counting accuracy in an image while maintaining zero-shot classification performance, by introducing a counting-contrastive loss term. We improve the model's performance using a smaller subset of their training data and fewer computational resources. We verify the original paper's claims by reproducing the study with our own code. The implementation can be found at https://github.com/SforAiDl/CountCLIP.
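To make the "counting-contrastive loss term" concrete, here is a minimal sketch in plain Python. It rests on an assumption not spelled out in the abstract: each image is scored against its true caption and against a counterfactual caption in which only the number word is changed, and the loss is the negative log-probability of the true caption in that two-way comparison. The function names (`dot`, `counting_loss`, `total_loss`) and the weight `lam` are hypothetical and do not come from the paper or the repository.

```python
import math


def dot(u, v):
    """Dot product of two equal-length embedding vectors."""
    return sum(a * b for a, b in zip(u, v))


def counting_loss(image_embs, caption_embs, counterfactual_embs):
    """Mean two-way contrastive loss over a batch (assumed form).

    For each image, computes
        -log( exp(s_true) / (exp(s_true) + exp(s_cf)) )
    where s_true is the image/true-caption similarity and s_cf is the
    image/counterfactual-caption similarity.
    """
    total = 0.0
    for img, cap, cf in zip(image_embs, caption_embs, counterfactual_embs):
        diff = dot(img, cf) - dot(img, cap)
        # The expression above equals log(1 + exp(diff)); this form
        # avoids overflow when diff is large.
        total += max(diff, 0.0) + math.log1p(math.exp(-abs(diff)))
    return total / len(image_embs)


def total_loss(clip_loss, count_loss, lam=1.0):
    """Standard contrastive loss plus the weighted counting term (lam is hypothetical)."""
    return clip_loss + lam * count_loss
```

With this form, an image that is equally similar to both captions contributes log 2 to the loss, and the contribution shrinks toward zero as the true caption scores higher than the counterfactual. Adding the term to the usual CLIP objective, rather than replacing it, matches the stated goal of keeping zero-shot classification performance intact.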

Code & Models

Repositories

SforAiDl/CountCLIP (PyTorch, official)

Taxonomy

Topics: Natural Language Processing Techniques · Algorithms and Data Compression

Methods: Contrastive Language-Image Pre-training