# Teaching CLIP to Count to Ten

**Authors:** Roni Paiss, Ariel Ephrat, Omer Tov, Shiran Zada, Inbar Mosseri, Michal Irani, Tali Dekel

arXiv: 2302.12066 · 2023-02-24

## TL;DR

This paper finetunes CLIP with a counting-contrastive loss to improve its understanding of object counts, introduces the CountBench benchmark, and shows gains in counting, image retrieval, and text-to-image generation.

## Contribution

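The paper proposes a counting-contrastive loss for finetuning a pre-trained CLIP model alongside its original objective, introduces CountBench for evaluating a model's understanding of object counting, and applies the resulting count-aware model to image retrieval and text-conditioned image generation.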
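To make the idea of a counting-contrastive loss concrete, here is a minimal sketch in plain Python. It assumes the loss is a two-way softmax cross-entropy over the image's similarity to the correct caption and to the counterfactual caption, added to the original CLIP objective with a weight. That exact form, the `temperature` and `weight` parameters, and all function names are illustrative assumptions, not details confirmed by this summary.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit length (CLIP-style cosine similarity)."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def counting_contrastive_loss(image_emb, caption_emb, counterfactual_emb,
                              temperature=1.0):
    """Assumed form: -log(exp(s_pos) / (exp(s_pos) + exp(s_neg))).

    s_pos is the image/correct-caption similarity and s_neg is the
    image/counterfactual-caption similarity (the hard negative).
    """
    img = normalize(image_emb)
    s_pos = dot(img, normalize(caption_emb)) / temperature
    s_neg = dot(img, normalize(counterfactual_emb)) / temperature
    # The loss equals softplus(s_neg - s_pos); compute it stably.
    diff = s_neg - s_pos
    return max(diff, 0.0) + math.log1p(math.exp(-abs(diff)))


def total_loss(clip_loss, count_loss, weight=1.0):
    """Hypothetical combination: original objective plus weighted counting term."""
    return clip_loss + weight * count_loss
```

Under this sketch the loss is `log(2)` when the model cannot tell the two captions apart and shrinks toward zero as the correct caption scores higher than its counterfactual.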

## Key findings

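- Significant improvement over state-of-the-art baseline models on the CountBench benchmark.
- Better quantitative understanding while maintaining overall performance on common benchmarks.
- More reliable production of specific object counts in image retrieval and text-conditioned image generation than existing models.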

## Abstract

Large vision-language models (VLMs), such as CLIP, learn rich joint image-text representations, facilitating advances in numerous downstream tasks, including zero-shot classification and text-to-image generation. Nevertheless, existing VLMs exhibit a prominent well-documented limitation - they fail to encapsulate compositional concepts such as counting. We introduce a simple yet effective method to improve the quantitative understanding of VLMs, while maintaining their overall performance on common benchmarks. Specifically, we propose a new counting-contrastive loss used to finetune a pre-trained VLM in tandem with its original objective. Our counting loss is deployed over automatically-created counterfactual examples, each consisting of an image and a caption containing an incorrect object count. For example, an image depicting three dogs is paired with the caption "Six dogs playing in the yard". Our loss encourages discrimination between the correct caption and its counterfactual variant which serves as a hard negative example. To the best of our knowledge, this work is the first to extend CLIP's capabilities to object counting. Furthermore, we introduce "CountBench" - a new image-text counting benchmark for evaluating a model's understanding of object counting. We demonstrate a significant improvement over state-of-the-art baseline models on this task. Finally, we leverage our count-aware CLIP model for image retrieval and text-conditioned image generation, demonstrating that our model can produce specific counts of objects more reliably than existing ones.
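The counterfactual captions described in the abstract can be illustrated with a short sketch: find the number word in a caption and swap it for a different one, so that "three dogs" becomes a caption with an incorrect count. The word list (two through ten), the rule of skipping captions without exactly one number word, and the function name `make_counterfactual` are assumptions made for this example, not the authors' pipeline.

```python
import random
import re

# Assumed vocabulary of count words; the summary does not specify the range.
NUMBER_WORDS = ["two", "three", "four", "five", "six",
                "seven", "eight", "nine", "ten"]
_PATTERN = re.compile(r"\b(" + "|".join(NUMBER_WORDS) + r")\b", re.IGNORECASE)


def make_counterfactual(caption, rng=random):
    """Return the caption with its number word swapped for a wrong one.

    Returns None unless the caption contains exactly one number word,
    since otherwise it is ambiguous which count to alter.
    """
    matches = list(_PATTERN.finditer(caption))
    if len(matches) != 1:
        return None
    match = matches[0]
    original = match.group(1)
    # Pick any count word other than the one already in the caption.
    choices = [w for w in NUMBER_WORDS if w != original.lower()]
    replacement = rng.choice(choices)
    if original[0].isupper():
        replacement = replacement.capitalize()
    return caption[:match.start()] + replacement + caption[match.end():]


if __name__ == "__main__":
    print(make_counterfactual("Three dogs playing in the yard"))
```

The image stays paired with both captions: the original acts as the positive and the swapped version as the hard negative.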

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/2302.12066/full.md

## Figures

22 figures with captions in the complete paper: https://tomesphere.com/paper/2302.12066/full.md

## References

61 references — full list in the complete paper: https://tomesphere.com/paper/2302.12066/full.md

---
Source: https://tomesphere.com/paper/2302.12066