Getting to the Point: Why Pointing Improves LVLMs
Simone Alghisi, Massimo Rizzoli, Seyed Mahed Mousavi, Giuseppe Riccardi

TL;DR
This paper investigates why pointing improves Large Vision-Language Models (LVLMs): the model first grounds the queried objects by predicting their coordinates and then reasons over those points. In zero-shot counting tasks, coordinate-based pointing improves out-of-distribution generalization and spatial reasoning.
Contribution
The study introduces a Point-then-Count approach that leverages explicit coordinate grounding, showing improved generalization and understanding in LVLMs compared to direct counting methods.
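To make the contrast between the two approaches concrete, the sketch below parses a Direct Counting output (a number only) and a Point-then-Count output (coordinates followed by a number). The text format `(x, y), ... Count: N` and the helper names `parse_points`, `parse_count`, and `is_consistent` are assumptions made for illustration; the summary does not specify the models' actual output format.

```python
import re
from typing import List, Optional, Tuple

Point = Tuple[float, float]

# Hypothetical output format, assumed for illustration only:
#   Direct Counting:   "Count: 3"
#   Point-then-Count:  "(10.5, 20.0), (33, 41.2), (70, 15) Count: 3"
_POINT_RE = re.compile(r"\(\s*(\d+(?:\.\d+)?)\s*,\s*(\d+(?:\.\d+)?)\s*\)")
_COUNT_RE = re.compile(r"Count:\s*(\d+)")


def parse_points(text: str) -> List[Point]:
    """Extract every (x, y) pair found in the generated text."""
    return [(float(x), float(y)) for x, y in _POINT_RE.findall(text)]


def parse_count(text: str) -> Optional[int]:
    """Extract the final count, or None if the text contains no count."""
    match = _COUNT_RE.search(text)
    return int(match.group(1)) if match else None


def is_consistent(text: str) -> bool:
    """Check that the stated count equals the number of predicted points."""
    count = parse_count(text)
    return count is not None and count == len(parse_points(text))


direct_output = "Count: 3"
pointing_output = "(10.5, 20.0), (33, 41.2), (70, 15) Count: 3"

print(parse_count(direct_output))      # count only, no points to inspect
print(parse_points(pointing_output))   # intermediate points that can be checked
print(is_consistent(pointing_output))  # does the count match the points?
```

The Direct Counting output offers nothing to inspect beyond the number itself, whereas the Point-then-Count output exposes intermediate points that can be checked against the image and against the final count.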
Findings
Point-then-Count outperforms direct counting in out-of-distribution scenarios.
Predicted points are accurately grounded in over 89% of cases.
Spatial biases affect performance across different image regions.
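The last two findings can be illustrated with a small sketch. It assumes, hypothetically, that a predicted point counts as grounded when it falls inside a ground-truth bounding box of a target object, and that spatial bias is examined by binning points into a regular grid of image regions. Neither the criterion nor the grid is stated in this summary, and `is_grounded`, `grounding_rate`, and `region_of` are illustrative names.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def is_grounded(point: Point, boxes: List[Box]) -> bool:
    """Assumed criterion: the point lies inside at least one target box."""
    x, y = point
    return any(x0 <= x <= x1 and y0 <= y <= y1 for x0, y0, x1, y1 in boxes)


def grounding_rate(points: List[Point], boxes: List[Box]) -> float:
    """Fraction of predicted points that satisfy the assumed criterion."""
    if not points:
        return 0.0
    return sum(is_grounded(p, boxes) for p in points) / len(points)


def region_of(point: Point, width: float, height: float, grid: int = 3) -> Tuple[int, int]:
    """Map a point to a (row, col) cell of a grid x grid partition of the image."""
    x, y = point
    col = min(int(x / width * grid), grid - 1)
    row = min(int(y / height * grid), grid - 1)
    return row, col


# Toy data, invented for this example.
boxes = [(0, 0, 10, 10), (50, 50, 60, 60)]
points = [(5, 5), (55, 52), (30, 30), (9, 1)]

print(grounding_rate(points, boxes))              # share of grounded points
print([region_of(p, 90, 90) for p in points])     # image region of each point
```

Grouping per-point results by the cell returned from `region_of` is one simple way to see whether grounding quality differs across image regions.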
Abstract
Pointing increases the accuracy and explainability of Large Vision-Language Models (LVLMs) by modeling grounding and reasoning as explicit sequential steps. The model grounds the objects mentioned in the natural-language query by predicting their coordinates, and then generates an answer conditioned on these points. While pointing has been shown to increase LVLMs' accuracy, it is unclear which mechanism supports these gains and its relevance in cognitive tasks. In addition, the reliability of the intermediate points remains understudied, limiting their use as visual explanations. In this work, we study the role of pointing in a cognitive task: zero-shot counting from a visual scene. We fine-tune state-of-the-art LVLMs following two approaches: Direct Counting, where models only predict the total number of objects, and Point-then-Count, where LVLMs generate the target objects'…
Taxonomy
Topics: Multimodal Machine Learning Applications · Ferroelectric and Negative Capacitance Devices · Neurobiology of Language and Bilingualism
