FineCops-Ref: A new Dataset and Task for Fine-Grained Compositional Referring Expression Comprehension
Junzhuo Liu, Xuzheng Yang, Weiwei Li, Peng Wang

TL;DR
This paper introduces FineCops-Ref, a challenging new dataset for Referring Expression Comprehension that includes multi-level reasoning and negative samples, aiming to improve multi-modal understanding and grounding in AI models.
Contribution
The paper presents a novel dataset with controllable difficulty levels and negative samples, specifically designed to evaluate and enhance fine-grained multi-modal reasoning in REC tasks.
Findings
Current models show a significant performance gap in grounding ability
The dataset enables testing of multi-hop and attribute-based reasoning
Negative samples challenge models to reject incorrect references
Abstract
Referring Expression Comprehension (REC) is a crucial cross-modal task that objectively evaluates the capabilities of language understanding, image comprehension, and language-to-image grounding. Consequently, it serves as an ideal testing ground for Multi-modal Large Language Models (MLLMs). In pursuit of this goal, we have established a new REC dataset characterized by two key features: Firstly, it is designed with controllable varying levels of difficulty, necessitating multi-level fine-grained reasoning across object categories, attributes, and multi-hop relationships. Secondly, it includes negative text and images created through fine-grained editing and generation based on existing data, thereby testing the model's ability to correctly reject scenarios where the target object is not visible in the image--an essential aspect often overlooked in existing datasets and approaches.…
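To make the role of negative samples concrete, here is a minimal Python sketch of how a REC prediction might be scored when some samples have no valid target. It is an illustration under stated assumptions, not the paper's evaluation protocol. The box format `(x1, y1, x2, y2)`, the use of `None` to mean "target not present" or "model rejects", the IoU threshold of 0.5 (a common REC convention), and the function names `iou`, `is_correct`, and `accuracy` are all assumptions made for this example.

```python
from typing import Optional, Sequence, Tuple

# Assumed box format: (x1, y1, x2, y2) in pixel coordinates.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_correct(pred: Optional[Box], target: Optional[Box],
               thresh: float = 0.5) -> bool:
    """Score one sample.

    Assumption: target is None for a negative sample (the described
    object is not in the image), and pred is None when the model
    rejects the expression.
    """
    if target is None:
        # Negative sample: the only correct answer is a rejection.
        return pred is None
    if pred is None:
        # Positive sample wrongly rejected.
        return False
    return iou(pred, target) >= thresh


def accuracy(preds: Sequence[Optional[Box]],
             targets: Sequence[Optional[Box]],
             thresh: float = 0.5) -> float:
    """Fraction of samples scored correct over positives and negatives."""
    if len(preds) != len(targets):
        raise ValueError("preds and targets must have the same length")
    if not preds:
        return 0.0
    hits = sum(is_correct(p, t, thresh) for p, t in zip(preds, targets))
    return hits / len(preds)
```

Under this scoring, a model that always outputs a box can never be correct on a negative sample, which is the failure mode the abstract says existing datasets tend to overlook.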
Taxonomy
Topics: Topic Modeling · Machine Learning in Materials Science
