Object Counting with GPT-4o and GPT-5: A Comparative Study
Richard Füzesséry, Kaziwa Saleh, Sándor Szénási, Zoltán Vámossy

TL;DR
This paper explores using GPT-4o and GPT-5, multimodal large language models, for zero-shot object counting, demonstrating competitive performance on standard datasets without supervision.
Contribution
It introduces a novel application of multimodal LLMs for zero-shot object counting and provides a comparative analysis of their effectiveness.
Findings
GPT-4o and GPT-5 achieve competitive zero-shot counting performance.
Models sometimes outperform existing state-of-the-art methods.
Evaluation on FSC-147 and CARPK datasets confirms effectiveness.
Abstract
Zero-shot object counting aims to estimate the number of object instances belonging to novel categories that the counting model has never encountered during training. Existing methods typically require large amounts of annotated data and often rely on visual exemplars to guide the counting process. However, large language models (LLMs) are powerful tools with remarkable reasoning and data understanding abilities, which suggests the possibility of using them for counting tasks without any supervision. In this work, we aim to leverage the visual capabilities of two multimodal LLMs, GPT-4o and GPT-5, to perform object counting in a zero-shot manner using only textual prompts. We evaluate both models on the FSC-147 and CARPK datasets and provide a comparative analysis. Our findings show that the models achieve performance comparable to the state-of-the-art…
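As a rough illustration of the prompt-only setup described in the abstract, the sketch below builds a textual counting prompt, parses an integer out of a free-text model reply, and scores predictions against ground truth. Everything here is an assumption for illustration: the prompt wording, the helper names (`build_prompt`, `parse_count`, `mae`, `rmse`), and the choice of MAE and RMSE as metrics (they are commonly used on counting benchmarks, but the summary above does not say which metrics the paper reports). No model API is called; the replies are stand-in strings.

```python
import math
import re


def build_prompt(category: str) -> str:
    # Hypothetical prompt wording; the paper's actual prompts are not given here.
    return (
        f"Count the number of {category} in the image. "
        "Reply with a single integer."
    )


def parse_count(reply: str) -> int:
    # Take the first run of digits (allowing thousands separators) in the reply.
    match = re.search(r"\d[\d,]*", reply)
    if match is None:
        raise ValueError(f"no count found in reply: {reply!r}")
    return int(match.group().replace(",", ""))


def mae(predicted: list[int], actual: list[int]) -> float:
    # Mean absolute error between predicted and ground-truth counts.
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def rmse(predicted: list[int], actual: list[int]) -> float:
    # Root mean squared error; penalizes large miscounts more than MAE does.
    return math.sqrt(
        sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)
    )


if __name__ == "__main__":
    # Stand-in replies, not real model output.
    replies = ["There are 10 cars.", "20"]
    ground_truth = [12, 17]
    predictions = [parse_count(r) for r in replies]
    print(build_prompt("cars"))
    print(mae(predictions, ground_truth), rmse(predictions, ground_truth))
```

Parsing the first integer is a deliberately simple choice; a real pipeline would likely need to handle replies that contain ranges, spelled-out numbers, or refusals.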
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications
