LVLM-COUNT: Enhancing the Counting Ability of Large Vision-Language Models
Muhammad Fetrat Qharabagh, Mohammadreza Ghofrani, Kimon Fountoulakis

TL;DR
This paper identifies the counting limitations of large vision-language models and introduces a divide-and-conquer method to significantly improve their ability to count large numbers of objects accurately.
Contribution
It proposes a simple, effective baseline that enhances LVLMs' counting performance on large object counts using a divide-and-conquer approach with object-preserving mechanisms.
Findings
LVLMs struggle with counting large numbers of objects.
The proposed method improves counting accuracy across datasets.
The approach serves as a benchmark for future counting solutions.
Abstract
Counting is a fundamental operation for various real-world visual tasks, requiring both object recognition and robust counting capabilities. Despite their advanced visual perception, large vision-language models (LVLMs) are known to struggle with counting tasks. In this work, we evaluate the performance of several LVLMs on visual counting tasks across multiple counting and vision datasets. We observe that while their performance may be less prone to error for small numbers of objects, they exhibit significant weaknesses as the number of objects increases. To alleviate this issue, we propose a simple yet effective baseline method that enhances LVLMs' counting ability for large numbers of objects using a divide-and-conquer approach. Our method decomposes counting problems into sub-tasks. Moreover, it incorporates a mechanism to prevent objects from being split during division, which could…
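To make the divide-and-conquer idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes a binary object mask is already available (for example, from a segmentation step), restricts itself to straight vertical cuts, and uses a toy connected-component counter, `count_components`, as a stand-in for querying an LVLM on each sub-image. All function names are hypothetical.

```python
from typing import Callable, List

Mask = List[List[int]]  # 1 = object pixel, 0 = background


def free_columns(mask: Mask) -> List[int]:
    """Columns with no object pixels; a vertical cut there splits no object."""
    width = len(mask[0])
    return [x for x in range(width) if all(row[x] == 0 for row in mask)]


def choose_cuts(mask: Mask, parts: int) -> List[int]:
    """Pick up to parts-1 object-free columns, each nearest to an even split point."""
    width = len(mask[0])
    free = free_columns(mask)
    cuts: List[int] = []
    for k in range(1, parts):
        target = k * width // parts
        candidates = [x for x in free if x not in cuts]
        if not candidates:
            break  # fewer safe cuts than requested; use what we have
        cuts.append(min(candidates, key=lambda x: abs(x - target)))
    return sorted(cuts)


def split(mask: Mask, cuts: List[int]) -> List[Mask]:
    """Slice the mask into vertical strips at the given columns."""
    bounds = [0] + cuts + [len(mask[0])]
    return [[row[a:b] for row in mask] for a, b in zip(bounds, bounds[1:]) if b > a]


def count_components(region: Mask) -> int:
    """Toy stand-in for an LVLM query: counts 4-connected blobs of 1s."""
    if not region or not region[0]:
        return 0
    h, w = len(region), len(region[0])
    seen = [[False] * w for _ in range(h)]
    count = 0
    for y in range(h):
        for x in range(w):
            if region[y][x] == 1 and not seen[y][x]:
                count += 1
                seen[y][x] = True
                stack = [(y, x)]
                while stack:
                    cy, cx = stack.pop()
                    for ny, nx in ((cy + 1, cx), (cy - 1, cx), (cy, cx + 1), (cy, cx - 1)):
                        if 0 <= ny < h and 0 <= nx < w and region[ny][nx] == 1 and not seen[ny][nx]:
                            seen[ny][nx] = True
                            stack.append((ny, nx))
    return count


def divide_and_count(mask: Mask, parts: int,
                     count_fn: Callable[[Mask], int] = count_components) -> int:
    """Count per strip and sum, cutting only where no object is split."""
    return sum(count_fn(strip) for strip in split(mask, choose_cuts(mask, parts)))
```

In this sketch, a cut placed at the exact midpoint of an image could slice an object in two and cause it to be counted twice, whereas `choose_cuts` moves each cut to the nearest object-free column. A real pipeline would replace `count_fn` with a call to a vision-language model on the cropped sub-image and would likely need cut paths more flexible than straight columns.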
Peer Reviews
Decision: Submitted to ICLR 2025
This paper explores a relatively novel approach by focusing on enhancing counting capabilities in large vision-language models (LVLMs) using a training-free methodology. By leveraging the power of LVLMs, the authors propose an effective paradigm that does not rely on additional training or fine-tuning, which is particularly advantageous in scenarios where labeled data is limited or unavailable. The method demonstrates a creative approach to addressing challenges in object counting, especially in
This paper also presents some limitations, as acknowledged in the final section. The proposed method heavily relies on the accuracy of the initial stages—specifically, object detection and instance segmentation. If either of these stages is inaccurate, it could significantly affect the downstream steps, potentially compromising the overall performance. This dependency raises questions about the robustness of the method on more challenging datasets, especially those with high levels of occlusion
The results obtained and demonstrated in the paper seem strong.
The main weakness of the paper is insufficient support for its claims. In particular, the authors should address the following questions and comments in their responses and revisions: 1. In several places in the paper (e.g., lines 61-62), the authors mention that the pipeline detects "the objects of interest". Is there more than one type of object to be counted in these datasets? If so, how are objects of different categories handled? All the visual examples in the paper involv
A simple and intuitive pipeline for counting with LVLMs. Good presentation with clearly drawn figures. A new Emoji-Count benchmark is introduced; although generating this data is not complex, it is still useful as a testbed. A good performance margin is achieved.
W1: At the very beginning, the authors should define more clearly what is meant by a large number of objects: 10s, 100s, or 1000s per image, as this defines the scope of the work in terms of crowdedness. W2: The key idea of this work, divide-and-conquer, can hardly be considered novel, for two reasons: 1) in this context, counting is by definition a process of adding up the number of objects from region to region; it is essentially a process of summing across regions; 2) Such an idea has appeared in
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling
