TL;DR
This paper introduces a new fine-grained referring expression comprehension dataset with controllable difficulty and negative samples, and proposes collaborative methods combining specialist models and MLLMs to improve accuracy and efficiency.
Contribution
It presents a novel dataset with multi-level reasoning and negative samples, and introduces two collaborative methods integrating specialist models with MLLMs for enhanced REC performance.
Findings
Significant performance improvements on the new dataset and benchmarks.
Effective balancing of accuracy and efficiency through adaptive model assignment.
Enhanced model reasoning capabilities with specialist-MLLM collaboration.
Abstract
Referring Expression Comprehension (REC) is a foundational cross-modal task that evaluates the interplay of language understanding, image comprehension, and language-to-image grounding. It serves as an essential testing ground for Multimodal Large Language Models (MLLMs). To advance this field, we introduced a new REC dataset in our previous conference paper, characterized by two key features. First, it is designed with controllable difficulty levels, requiring multi-level fine-grained reasoning across object categories, attributes, and multi-hop relationships. Second, it incorporates negative text and images generated through fine-grained editing and augmentation, explicitly testing a model's ability to reject scenarios where the target object is absent, an often overlooked yet critical challenge in existing datasets. In this extended work, we propose two new methods to tackle the…
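To make the role of negative samples concrete, here is a minimal sketch of how a REC prediction might be scored when the target can be absent. It is an illustration under stated assumptions, not the paper's evaluation protocol: the excerpt above does not give the metric, so the IoU threshold of 0.5, the `(x1, y1, x2, y2)` box format, and the names `iou`, `is_correct`, and `accuracy` are all hypothetical. A prediction of `None` stands for the model rejecting the expression.

```python
from typing import Optional, Sequence, Tuple

# Assumed box format: (x1, y1, x2, y2) in pixels, with x2 >= x1 and y2 >= y1.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_correct(pred: Optional[Box], target: Optional[Box],
               iou_threshold: float = 0.5) -> bool:
    """Score one sample; None means 'no target' (truth) or 'rejected' (prediction)."""
    if target is None:
        # Negative sample: the only correct answer is to reject.
        return pred is None
    if pred is None:
        # Positive sample that the model wrongly rejected.
        return False
    return iou(pred, target) >= iou_threshold


def accuracy(preds: Sequence[Optional[Box]],
             targets: Sequence[Optional[Box]]) -> float:
    """Fraction of samples scored correct over positives and negatives together."""
    if len(preds) != len(targets):
        raise ValueError("preds and targets must have the same length")
    if not preds:
        return 0.0
    hits = sum(is_correct(p, t) for p, t in zip(preds, targets))
    return hits / len(preds)
```

Under this scoring, a model that always outputs some box is counted wrong on every negative sample, which is the failure mode that a dataset with absent targets is meant to expose.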
