VRPTEST: Evaluating Visual Referring Prompting in Large Multimodal Models
Zongjie Li, Chaozheng Wang, Chaowei Liu, Pingchuan Ma, Daoyuan Wu, Shuai Wang, Cuiyun Gao

TL;DR
This paper introduces VRPTEST, a benchmark dataset for evaluating visual referring prompting strategies in large multimodal models, revealing significant impacts of prompt choice on model accuracy and understanding.
Contribution
It provides the first comprehensive analysis of visual referring prompting in LMMs, along with a new benchmark dataset and an automated evaluation framework.
Findings
Proprietary models outperform open-source models by an average of 22.70% in accuracy.
The choice of prompt strategy significantly affects model accuracy, with changes ranging from -17.5% to +7.3%.
Appropriate prompting improves context understanding, while poor prompts can cause answer rejection.
Abstract
With recent advancements in Large Multimodal Models (LMMs) across various domains, a novel prompting method called visual referring prompting has emerged, showing significant potential in enhancing human-computer interaction within multimodal systems. This method offers a more natural and flexible approach to human interaction with these systems compared to traditional text descriptions or coordinates. However, the categorization of visual referring prompting remains undefined, and its impact on the performance of LMMs has yet to be formally examined. In this study, we conduct the first comprehensive analysis of LMMs using a variety of visual referring prompting strategies. We introduce a benchmark dataset called VRPTEST, comprising 3 different visual tasks and 2,275 images, spanning diverse combinations of prompt strategies. Using VRPTEST, we conduct a comprehensive evaluation of eight…
Taxonomy
Topics: Multimodal Machine Learning Applications · Speech and dialogue systems · Natural Language Processing Techniques
