New Dataset and Methods for Fine-Grained Compositional Referring Expression Comprehension via Specialist-MLLM Collaboration

Xuzheng Yang; Junzhuo Liu; Peng Wang; Guoqing Wang; Yang Yang; Heng Tao Shen

arXiv:2502.20104 · cs.CV · June 16, 2025

TL;DR

This paper introduces a new fine-grained referring expression comprehension dataset with controllable difficulty and negative samples, and proposes collaborative methods combining specialist models and MLLMs to improve accuracy and efficiency.

Contribution

The paper presents a new REC dataset with multi-level reasoning and negative samples, and introduces two collaborative methods that integrate specialist models with MLLMs to improve REC performance.

Findings

01. Significant performance improvements on the new dataset and benchmarks.

02. Effective balancing of accuracy and efficiency through adaptive model assignment.

03. Enhanced model reasoning capabilities with specialist-MLLM collaboration.

Abstract

Referring Expression Comprehension (REC) is a foundational cross-modal task that evaluates the interplay of language understanding, image comprehension, and language-to-image grounding. It serves as an essential testing ground for Multimodal Large Language Models (MLLMs). To advance this field, we introduced a new REC dataset in our previous conference paper, characterized by two key features. First, it is designed with controllable difficulty levels, requiring multi-level fine-grained reasoning across object categories, attributes, and multi-hop relationships. Second, it incorporates negative text and images generated through fine-grained editing and augmentation, explicitly testing a model's ability to reject scenarios where the target object is absent, an often overlooked yet critical challenge in existing datasets. In this extended work, we propose two new methods to tackle the…
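To make the role of negative samples concrete, the following is a minimal sketch of how a single REC prediction might be scored when the target can be absent. It is not the paper's evaluation protocol: the `Box` layout, the function names `iou`, `is_correct`, and `accuracy`, the 0.5 IoU threshold, and the convention that a model signals rejection by returning `None` are all assumptions made for illustration.

```python
from typing import Optional, Sequence, Tuple

# Assumed box layout: (x1, y1, x2, y2) in pixels, with x2 > x1 and y2 > y1.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_correct(pred: Optional[Box], target: Optional[Box], thr: float = 0.5) -> bool:
    """Score one sample.

    target is None for a negative sample (the referred object is absent),
    and pred is None when the model rejects the expression. Both conventions
    are assumptions of this sketch.
    """
    if target is None:
        # Negative sample: the only correct behavior is to reject.
        return pred is None
    if pred is None:
        # Positive sample that the model wrongly rejected.
        return False
    return iou(pred, target) >= thr


def accuracy(preds: Sequence[Optional[Box]], targets: Sequence[Optional[Box]],
             thr: float = 0.5) -> float:
    """Fraction of samples scored correct, over positives and negatives together."""
    if len(preds) != len(targets):
        raise ValueError("preds and targets must have the same length")
    if not preds:
        return 0.0
    hits = sum(is_correct(p, t, thr) for p, t in zip(preds, targets))
    return hits / len(preds)
```

Under this convention, a model that always outputs a box can score well on positive samples while failing every negative one, which is the gap the dataset's negative text and images are meant to expose.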

Code & Models

Repositories

liujunzhuo/FineCops-Ref
pytorch
