GroundCount: Grounding Vision-Language Models with Object Detection for Mitigating Counting Hallucinations

Boyuan Chen; Minghao Shao; Siddharth Garg; Ramesh Karri; Muhammad Shafique

arXiv:2603.10978·cs.CV·March 12, 2026

Open Access

TL;DR

GroundCount improves vision-language models' counting accuracy by supplying explicit spatial grounding from an object detector, reaching 81.3% accuracy on Ovis2.5-2B (a 6.6pp gain) and cutting inference time by 22% for stronger models.

Contribution

The paper introduces GroundCount, a framework that augments VLMs with explicit spatial grounding from object detection models (ODMs) such as YOLO, using prompt-based augmentation to mitigate counting hallucinations.

Findings

01 Achieves up to 81.3% counting accuracy on the best-performing model (Ovis2.5-2B), a 6.6pp improvement.

02 Reduces inference time by 22% for stronger models by eliminating hallucination-driven reasoning loops.

03 Explicit grounding outperforms implicit feature fusion methods.
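To put the headline numbers in context, the short sketch below works out what they imply. It assumes the 6.6pp gain is measured against the same model's un-augmented accuracy; the function names are illustrative and do not come from the paper.

```python
def implied_baseline(acc_after_pct: float, gain_pp: float) -> float:
    """Accuracy before augmentation, given the final accuracy and the gain in percentage points."""
    return round(acc_after_pct - gain_pp, 1)


def relative_gain_pct(acc_after_pct: float, gain_pp: float) -> float:
    """Gain expressed relative to the baseline accuracy (not in percentage points)."""
    baseline = acc_after_pct - gain_pp
    return round(100 * gain_pp / baseline, 1)


def speedup_factor(time_reduction_pct: float) -> float:
    """Throughput multiplier implied by a given percentage reduction in inference time."""
    return round(1 / (1 - time_reduction_pct / 100), 2)


print(implied_baseline(81.3, 6.6))   # baseline accuracy in percent
print(relative_gain_pct(81.3, 6.6))  # relative improvement in percent
print(speedup_factor(22))            # speedup multiplier
```

Under that assumption, the reported figures correspond to a baseline of about 74.7%, a relative improvement of roughly 8.8%, and a speedup of about 1.28x.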

Abstract

Vision-Language Models (VLMs) exhibit persistent hallucinations in counting tasks, with accuracy substantially lower than in other visual reasoning tasks (excluding sentiment). This phenomenon persists even in state-of-the-art reasoning-capable VLMs. Conversely, CNN-based object detection models (ODMs) such as YOLO excel at spatial localization and instance counting with minimal computational overhead. We propose GroundCount, a framework that augments VLMs with explicit spatial grounding from ODMs to mitigate counting hallucinations. In the best case, our prompt-based augmentation strategy achieves 81.3% counting accuracy on the best-performing model (Ovis2.5-2B), a 6.6pp improvement, while reducing inference time by 22% for stronger models through the elimination of hallucination-driven reasoning loops. We conduct comprehensive ablation studies demonstrating that positional encoding is a…
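The prompt-based augmentation idea can be illustrated with a minimal sketch: detector output is summarized as per-class counts and coarse positions, then prepended to the question before it is sent to the VLM. Everything here is an assumption made for illustration: the `Detection` type, the 3x3 region naming, the confidence threshold, and the prompt wording are hypothetical and are not taken from the paper. In practice the detections would come from an ODM such as YOLO.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Detection:
    """One detected object (hypothetical structure, not the paper's format)."""
    label: str
    x_center: float    # normalized to [0, 1], left to right
    y_center: float    # normalized to [0, 1], top to bottom
    confidence: float


def region_name(x: float, y: float) -> str:
    """Map a normalized center point to a coarse 3x3 grid cell name."""
    col = "left" if x < 1 / 3 else "center" if x < 2 / 3 else "right"
    row = "top" if y < 1 / 3 else "middle" if y < 2 / 3 else "bottom"
    return f"{row}-{col}"


def build_grounded_prompt(question: str, detections: list[Detection],
                          min_confidence: float = 0.5) -> str:
    """Prepend per-class counts and coarse positions to the user question."""
    kept = [d for d in detections if d.confidence >= min_confidence]
    counts = Counter(d.label for d in kept)
    lines = ["Object detector output (may contain errors):"]
    for label, n in sorted(counts.items()):
        regions = ", ".join(
            region_name(d.x_center, d.y_center) for d in kept if d.label == label
        )
        lines.append(f"- {label}: {n} detected at {regions}")
    if not counts:
        lines.append("- no objects detected")
    lines.append(f"Question: {question}")
    return "\n".join(lines)


detections = [
    Detection("dog", 0.1, 0.1, 0.9),
    Detection("dog", 0.8, 0.5, 0.8),
    Detection("cat", 0.5, 0.9, 0.3),  # below threshold, dropped
]
prompt = build_grounded_prompt("How many dogs are in the image?", detections)
print(prompt)
```

The resulting string would be passed to the VLM together with the image. Keeping the grounding in plain text means no change to model weights or architecture, which is the appeal of a prompt-based approach over feature-level fusion.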

Taxonomy

Topics: Multimodal Machine Learning Applications · Adversarial Robustness in Machine Learning · Face Recognition and Perception