PointArena: Probing Multimodal Grounding Through Language-Guided Pointing
Long Cheng, Jiafei Duan, Yi Ru Wang, Haoquan Fang, Boyang Li, Yushan Huang, Elvis Wang, Ainaz Eftekhar, Jason Lee, Wentao Yuan, Rose Hendrix, Noah A. Smith, Fei Xia, Dieter Fox, Ranjay Krishna

TL;DR
PointArena introduces a comprehensive platform with datasets, benchmarks, and real-world evaluation tools to advance multimodal grounding through language-guided pointing, highlighting the importance of specialized training for improved performance.
Contribution
We present PointArena, a new multi-component platform for evaluating and comparing multimodal pointing abilities across reasoning and real-world tasks.
Findings
Molmo-72B outperforms other models in pointing tasks
Supervised training on pointing tasks improves model performance
Strong correlations found across evaluation stages
Abstract
Pointing serves as a fundamental and intuitive mechanism for grounding language within visual contexts, with applications spanning robotics, assistive technologies, and interactive AI systems. While recent multimodal models have started to support pointing capabilities, existing benchmarks typically focus only on referential object localization tasks. We introduce PointArena, a comprehensive platform for evaluating multimodal pointing across diverse reasoning scenarios. PointArena comprises three components: (1) Point-Bench, a curated dataset containing approximately 1,000 pointing tasks across five reasoning categories; (2) Point-Battle, an interactive, web-based arena facilitating blind, pairwise model comparisons, which has already gathered over 4,500 anonymized votes; and (3) Point-Act, a real-world robotic manipulation system allowing users to directly evaluate multimodal model…
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. The paper is well written and easy to follow. 2. Treating pointing as a precise grounding interface is timely for robotics, assistive tech, and interactive systems. 3. Point-Bench covers five complementary categories (Spatial, Affordance, Counting, Steerable, Reasoning), enabling diagnostic analyses beyond simple referring localization. Point-Battle brings scalable preference data with blinded pairwise comparisons and anonymized voting. Point-Act grounds claims in physical manipulation on an
1. For non-counting tasks only the first returned point is scored, which may bias against models returning ranked or structured outputs. Consider top-k or confidence-weighted evaluation. 2. For counting or multi-target cases, the success criterion is set coverage without precision/recall trade-offs. A bipartite matching metric (Hungarian assignment with distance thresholds) plus F1 would better reflect partial correctness and over-/under-pointing behavior. 3. The Arena uses Elo with K=2 for stab
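The matching-based metric suggested in this review can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's or the reviewer's implementation: targets are assumed to be single (x, y) points rather than masks, `max_dist` is a hypothetical pixel threshold, and the function names are invented for the example. It uses maximum-cardinality matching via augmenting paths instead of Hungarian assignment, which is enough to count how many predicted points can be paired one-to-one with targets inside the threshold.

```python
import math

def match_points(pred, gold, max_dist):
    """Count one-to-one pairs (predicted point, target point) within max_dist.

    Uses augmenting-path bipartite matching (Kuhn's algorithm), swapped in
    here for the Hungarian assignment mentioned in the review.
    """
    # adj[i] lists the targets that predicted point i is close enough to claim.
    adj = [
        [j for j, g in enumerate(gold) if math.dist(p, g) <= max_dist]
        for p in pred
    ]
    owner = [-1] * len(gold)  # owner[j] = index of the prediction matched to target j

    def try_assign(i, seen):
        for j in adj[i]:
            if j in seen:
                continue
            seen.add(j)
            # Take a free target, or re-route its current owner elsewhere.
            if owner[j] == -1 or try_assign(owner[j], seen):
                owner[j] = i
                return True
        return False

    return sum(try_assign(i, set()) for i in range(len(pred)))

def point_f1(pred, gold, max_dist):
    """Precision, recall, and F1 over matched points (hypothetical metric)."""
    tp = match_points(pred, gold, max_dist)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Two targets found, one spurious extra point (over-pointing).
precision, recall, f1 = point_f1(
    pred=[(10, 10), (50, 50), (200, 200)],
    gold=[(12, 11), (48, 52)],
    max_dist=5.0,
)
```

In this example the extra point lowers precision while recall stays at 1.0, which is the over-pointing behavior that a pure set-coverage criterion would not penalize.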
The dataset consists of 5 representative and challenging reasoning/localization tasks, so it can evaluate various aspects of multimodal point comprehension. Point-Battle is also a good approach to address the lack of referral image-text pairs and accurate evaluators.
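Another review on this page notes that the arena uses Elo with K=2. For readers unfamiliar with the rating scheme, a minimal sketch of a standard Elo update from one pairwise vote is given here. The starting rating of 1000 and the function name are assumptions for illustration; how Point-Battle handles ties or initial ratings is not stated in this summary.

```python
def elo_update(rating_a, rating_b, score_a, k=2.0):
    """Return updated (rating_a, rating_b) after one pairwise comparison.

    score_a is 1.0 if model A wins the vote, 0.0 if it loses, 0.5 for a tie.
    k=2.0 follows the K value mentioned in the review; everything else is the
    textbook Elo formula, not a confirmed detail of the arena.
    """
    expected_a = 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))
    delta = k * (score_a - expected_a)
    # Ratings are zero-sum: what A gains, B loses.
    return rating_a + delta, rating_b - delta

# Two models start level (assumed 1000); model A wins one vote.
new_a, new_b = elo_update(1000.0, 1000.0, score_a=1.0)
```

A small K means each vote moves ratings only slightly, so rankings change slowly and need many votes to separate models.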
1. Point-Act is under-explained. There's too little information about how the robot and experiments are set up. Evaluation metrics, examples, results, and how to access this platform, are missing. As robot pick-and-place is not only related to pointing itself, it's essential to clarify the rest of the setting, such as how the grasp motion and controller are implemented, and therefore how this platform contributes to the entire PointArena system. 2. Sec 5 seems unfinished. Discussion, limitation,
The overall writing is clear and easy to follow. The paper presents a well-motivated study that focuses on benchmarking the pointing ability of VLMs, with a clearly articulated purpose. In addition to reporting standard success rates, it introduces Point-Battle and Point-Act as complementary evaluation metrics, providing a more comprehensive assessment.
The paper lacks detailed statistics of the benchmark, such as the types of scenarios and object categories included, which are essential for understanding its coverage and diversity. Although Point-Act is presented as a key contribution, the paper provides insufficient details about the specific tasks performed by the robot and how the pointing ability translates into or supports manipulation performance. The inference time of different models is not reported. Moreover, as a benchmarking paper
The paper proposes a benchmark for language-image pointing with a clear definition. The benchmark extensively evaluates various LLMs. The authors also found that chain-of-thought decreases performance on the pointing task.
My main concern is that the language-image pointing task seems to be a simplification of the language-image segmentation task, where the LLM is prompted to segment out the object(s) of interest. One can subsample a pixel from the segmentation mask as the final output “point”. If my understanding is correct, then why do we need the pointing benchmark? Furthermore, why not include Segment Anything as a baseline? The authors could prompt it with the language question and subsample a pixel as the point.
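The reduction this reviewer describes, turning a segmentation mask into a single point, can be sketched as below. This is an illustrative assumption rather than a baseline from the paper: the mask is taken to be a 2D grid of 0/1 values, and instead of sampling at random the sketch picks the foreground pixel nearest the mask centroid so that the result is deterministic and always lies inside the mask. The function name is hypothetical.

```python
def mask_to_point(mask):
    """Pick one (x, y) pixel from a binary mask, or None if the mask is empty.

    The raw centroid can fall outside a concave or ring-shaped mask, so the
    foreground pixel closest to the centroid is returned instead.
    """
    pixels = [
        (x, y)
        for y, row in enumerate(mask)
        for x, value in enumerate(row)
        if value
    ]
    if not pixels:
        return None
    cx = sum(x for x, _ in pixels) / len(pixels)
    cy = sum(y for _, y in pixels) / len(pixels)
    return min(pixels, key=lambda p: (p[0] - cx) ** 2 + (p[1] - cy) ** 2)

# A 3x3 foreground square in the lower-right of a 4x4 image.
square = [
    [0, 0, 0, 0],
    [0, 1, 1, 1],
    [0, 1, 1, 1],
    [0, 1, 1, 1],
]
point = mask_to_point(square)
```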
Taxonomy
Topics: Multimodal Machine Learning Applications · Social Robot Interaction and HRI · Robot Manipulation and Learning
Methods: Focus
