Towards Complex-query Referring Image Segmentation: A Novel Benchmark

Wei Ji; Li Li; Hao Fei; Xiangyan Liu; Xun Yang; Juncheng Li; Roger Zimmermann

arXiv:2309.17205 · cs.CV · October 2, 2023 · 1 citation

Open Access · 4 Reviews

TL;DR

This paper introduces RIS-CQ, a new benchmark dataset with complex language queries for Referring Image Segmentation, and proposes DuMoGa, a novel model that outperforms existing methods on this challenging benchmark.

Contribution

The paper creates RIS-CQ, a large-scale benchmark with complex queries, and develops DuMoGa, a dual-modality graph alignment model that advances RIS performance.
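As a rough illustration of what aligning two graphs across modalities can look like, the sketch below matches each node of a text graph to its most similar node in a scene graph by cosine similarity. This is not DuMoGa itself: the summary does not describe the model's internals, so the function names (`cosine`, `align_nodes`), the toy embeddings, and the greedy nearest-neighbor matching are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def align_nodes(text_nodes, scene_nodes):
    """Greedily map each text-graph node to its most similar scene-graph node.

    Hypothetical helper: a real model would learn the embeddings and the
    matching jointly; here both are fixed toy values.
    """
    alignment = {}
    for t_name, t_vec in text_nodes.items():
        best = max(scene_nodes, key=lambda s: cosine(t_vec, scene_nodes[s]))
        alignment[t_name] = best
    return alignment


# Toy embeddings (assumed, not taken from the paper).
text_nodes = {"car": [1.0, 0.0, 0.1], "bus": [0.0, 1.0, 0.1]}
scene_nodes = {
    "region_0": [0.9, 0.1, 0.0],
    "region_1": [0.1, 0.8, 0.2],
    "region_2": [0.0, 0.0, 1.0],
}
alignment = align_nodes(text_nodes, scene_nodes)
print(alignment)
```

With these toy vectors, "car" lands on `region_0` and "bus" on `region_1`. Greedy matching allows two text nodes to pick the same region; a one-to-one assignment would need a different matching step.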

Findings

01

DuMoGa outperforms existing RIS methods on RIS-CQ.

02

RIS-CQ enables more realistic and challenging RIS research.

03

The benchmark facilitates evaluation of models with complex language queries.
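Segmentation benchmarks are commonly scored by intersection-over-union (IoU) between the predicted and ground-truth masks. The summary does not state which metrics RIS-CQ reports, so the sketch below only assumes a plain per-sample mask IoU; the function name `mask_iou` and the convention of returning 1.0 when both masks are empty are illustrative choices.

```python
def mask_iou(pred, gt):
    """IoU between two binary masks given as equal-shaped nested lists of 0/1."""
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Assumed convention: two empty masks count as a perfect match.
    return inter / union if union else 1.0


# Toy 2x3 masks (hypothetical data).
pred = [[1, 1, 0],
        [0, 1, 0]]
gt = [[1, 0, 0],
      [0, 1, 1]]
score = mask_iou(pred, gt)
print(score)  # 2 overlapping pixels out of 4 in the union -> 0.5
```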

Abstract

Referring Image Segmentation (RIS) has been extensively studied over the past decade, leading to the development of advanced algorithms. However, there has been a lack of research investigating how existing algorithms should be benchmarked with complex language queries, which include more informative descriptions of surrounding objects and backgrounds (e.g., "the black car" vs. "the black car is parking on the road and beside the bus"). Given the significant improvement in the semantic understanding capability of large pre-trained models, it is crucial to take a step further in RIS by incorporating complex language that resembles real-world applications. To close this gap, building upon the existing RefCOCO and Visual Genome datasets, we propose a new RIS benchmark with complex queries, namely RIS-CQ. The RIS-CQ dataset is of high quality and large scale,…

Peer Reviews

Decision · ICLR 2024 Conference Withdrawn Submission

Reviewer 01 · Rating 3 (reject, not good enough) · Confidence 2

Strengths

The paper proposes a scalable method of annotation generation that leverages existing foundation models. The proposed benchmark dataset has more complex queries than existing benchmarks.

Weaknesses

The results for the proposed DuMoGa method are not clear from Table 2. It is not clear exactly which datasets the model was evaluated on. Ideally, it should be evaluated on the proposed benchmark as well as on existing benchmarks to help calibrate the improvements.

Reviewer 02 · Rating 3 (reject, not good enough) · Confidence 5

Strengths

- The dataset itself is a good contribution to the referring expression community. In addition, constructing it with ChatGPT to enrich the expressions, which reduces the human effort needed to annotate this complex task, is clearly a new trend and a good approach.
- The dataset is larger than the previous dataset in both the number of images and the length of the queries.

Weaknesses

- Not sure about the statement in Sec. 3: "Unlike classic dense predictions models with complex design in RIS, which take expensive computational power to make inference, the graph learning-based architecture provides efficient training process and promising results". A graph learning-based architecture needs several other complex methods to obtain the scene graph and the dependency trees. The computational power to make inferences per image requires three methods to run over the imag

Reviewer 03 · Rating 3 (reject, not good enough) · Confidence 5

Strengths

- They provide a new dataset with complex text queries. The queries are generated with GPT-3.5, which takes triplets of object relationships as input.
- The baseline is intuitive and works well. It fuses multimodal information through graph alignment between the scene graph and the text sentence.

Weaknesses

1. Only a simple comparison with RefCOCOg is given. What are the advantages compared with RefCOCOg? Is there an essential difference between query lengths of 8.43 and 13.18?
2. The comparison with PhraseCut, which is also a large-scale dataset with complex queries, is missing.
3. The key to the proposed method is scene-graph-based cross-modal alignment. However, this approach is widely used in cross-modal retrieval works.
4. How are the mask annotations obtained?
5. VCTree is a method to generate bounding boxes.

Reviewer 04 · Rating 8 (accept, good paper) · Confidence 5

Strengths

1. The paper addresses a pivotal challenge in referring image segmentation, specifically targeting intricate linguistic queries. By launching the RIS-CQ benchmark, the authors compellingly substantiate the essence of their exploration. This benchmark holds the potential to be an instrumental anchor for subsequent investigations in this realm.
2. A key contribution in the paper is the introduction of the RIS-CQ dataset. This is a commendable gift to the academic domain. The amalgamation of this

Weaknesses

The paper might benefit from further improvements by addressing some of the minor issues I found:
1. While the paper presents a strong case for the effectiveness of the proposed method, it would be beneficial to include some negative or challenging cases in the experimental section. Providing examples where the approach may have limitations or difficulties would add depth to the evaluation.
2. The related work section could be improved by incorporating more recent works in the field. This woul

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning