LVLM-COUNT: Enhancing the Counting Ability of Large Vision-Language Models

Muhammad Fetrat Qharabagh; Mohammadreza Ghofrani; Kimon Fountoulakis

arXiv:2412.00686 · cs.CV · February 17, 2026 · 2 cites

TL;DR

This paper identifies the counting limitations of large vision-language models and introduces a divide-and-conquer method to significantly improve their ability to count large numbers of objects accurately.

Contribution

It proposes a simple, effective baseline that enhances LVLMs' counting performance on large object counts using a divide-and-conquer approach with object-preserving mechanisms.

Findings

01 LVLMs struggle with counting large numbers of objects.

02 The proposed method improves counting accuracy across datasets.

03 The approach serves as a benchmark for future counting solutions.

Abstract

Counting is a fundamental operation for various real-world visual tasks, requiring both object recognition and robust counting capabilities. Despite their advanced visual perception, large vision-language models (LVLMs) are known to struggle with counting tasks. In this work, we evaluate the performance of several LVLMs on visual counting tasks across multiple counting and vision datasets. We observe that while their performance may be less prone to error for small numbers of objects, they exhibit significant weaknesses as the number of objects increases. To alleviate this issue, we propose a simple yet effective baseline method that enhances LVLMs' counting ability for large numbers of objects using a divide-and-conquer approach. Our method decomposes counting problems into sub-tasks. Moreover, it incorporates a mechanism to prevent objects from being split during division, which could…
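The divide-and-conquer idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes objects are given as axis-aligned bounding boxes from some upstream detector, it only places vertical cuts, and every name in it (`safe_cuts`, `count_divided`, `make_stub_counter`) is hypothetical. The per-region counter is a stub standing in for a call to an LVLM on a cropped sub-image.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# (x0, y0, x1, y1) in pixels; a box covers columns x0 .. x1-1.
# Assumed output format of a hypothetical upstream detector.
Box = Tuple[int, int, int, int]


def nearest_free(target: int, blocked: Sequence[bool], taken: Sequence[int]) -> Optional[int]:
    """Return the column closest to `target` that crosses no box, or None."""
    width = len(blocked) - 1
    for d in range(width):
        for x in (target - d, target + d):
            if 0 < x < width and not blocked[x] and x not in taken:
                return x
    return None


def safe_cuts(boxes: Sequence[Box], width: int, n_parts: int) -> List[int]:
    """Pick up to n_parts - 1 vertical cut columns that do not split any box."""
    blocked = [False] * (width + 1)
    for x0, _, x1, _ in boxes:
        # A cut at column x splits a box only if x lies strictly inside it.
        for x in range(max(x0 + 1, 0), min(x1, width + 1)):
            blocked[x] = True
    cuts: List[int] = []
    for k in range(1, n_parts):
        x = nearest_free(round(k * width / n_parts), blocked, cuts)
        if x is not None:
            cuts.append(x)
    return sorted(cuts)


def count_divided(boxes: Sequence[Box], width: int, n_parts: int,
                  count_region: Callable[[int, int], int]) -> int:
    """Sum the per-region counts over the sub-images defined by the safe cuts."""
    edges = [0] + safe_cuts(boxes, width, n_parts) + [width]
    return sum(count_region(left, right) for left, right in zip(edges, edges[1:]))


def make_stub_counter(boxes: Sequence[Box]) -> Callable[[int, int], int]:
    """Stand-in for an LVLM query: counts boxes fully inside columns [left, right)."""
    def count_region(left: int, right: int) -> int:
        return sum(1 for x0, _, x1, _ in boxes if left <= x0 and x1 <= right)
    return count_region


if __name__ == "__main__":
    demo = [(0, 0, 4, 4), (3, 0, 7, 4), (8, 0, 10, 4), (12, 0, 15, 4)]
    print(safe_cuts(demo, 16, 2))
    print(count_divided(demo, 16, 2, make_stub_counter(demo)))
```

Because each cut is moved to the nearest column that crosses no box, every object falls entirely inside one sub-image, so the sub-counts can be added without double counting or missing a split object. In a real pipeline the stub would be replaced by a model call on each crop, and the total would only be as reliable as the boxes that guide the cuts.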

Peer Reviews

Decision · Submitted to ICLR 2025

Reviewer 01 · Rating 5 · Confidence 4

Strengths

This paper explores a relatively novel approach by focusing on enhancing counting capabilities in large vision-language models (LVLMs) using a training-free methodology. By leveraging the power of LVLMs, the authors propose an effective paradigm that does not rely on additional training or fine-tuning, which is particularly advantageous in scenarios where labeled data is limited or unavailable. The method demonstrates a creative approach to addressing challenges in object counting, especially in

Weaknesses

This paper also presents some limitations, as acknowledged in the final section. The proposed method heavily relies on the accuracy of the initial stages—specifically, object detection and instance segmentation. If either of these stages is inaccurate, it could significantly affect the downstream steps, potentially compromising the overall performance. This dependency raises questions about the robustness of the method on more challenging datasets, especially those with high levels of occlusion

Reviewer 02 · Rating 6 · Confidence 5

Strengths

The results obtained and demonstrated in the paper seem strong.

Weaknesses

The main weakness of the paper is that its claims are not sufficiently supported. In particular, the authors should address the following questions/comments in their responses and revisions:
1. In several places in the paper (e.g. lines 61-62), the authors mention that the pipeline detects "the objects of interest". Is there more than one type of object to be counted in these datasets? If yes, how are objects of different categories handled? All the visual examples in the paper involv

Reviewer 03 · Rating 3 · Confidence 5

Strengths

A simple and intuitive pipeline for counting with LVLMs. Good presentation, with clearly drawn figures. A new Emoji-Count benchmark is introduced; although generating this data is not complex, it is still useful as a testbed. A good performance margin is achieved.

Weaknesses

W1: At the very beginning, the authors should define more clearly what is meant by a large number of objects (10s, 100s, or 1000s per image), as this defines the scope of the work in terms of crowdedness.
W2: The key idea of this work, divide-and-conquer, can hardly be considered novel, for two reasons: 1) in this context, counting is by definition a process of adding up the number of objects from region to region; it is essentially a summation across regions; 2) such an idea has appeared in

Code & Models

Repositories

mrghofrani/lvlm-count
PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling