Context Does Matter: End-to-end Panoptic Narrative Grounding with Deformable Attention Refined Matching Network

Yiming Lin; Xiao-Bo Jin; Qiufeng Wang; Kaizhu Huang

arXiv:2310.16616·cs.CV·October 26, 2023·1 cites

Open Access

TL;DR

This paper introduces DRMN, a novel framework that uses deformable attention to incorporate contextual information for improved panoptic narrative grounding, significantly enhancing phrase-to-pixel matching accuracy.

Contribution

The paper proposes a deformable attention-based iterative learning framework that refines pixel representations for better phrase-to-pixel segmentation in panoptic narrative grounding.

Findings

01

Achieves state-of-the-art performance on PNG benchmark

02

Improves average recall by 3.5%

03

Effectively incorporates context to reduce phrase-to-pixel mismatch

Abstract

Panoptic Narrative Grounding (PNG) is an emerging visual grounding task that aims to segment visual objects in images based on dense narrative captions. The current state-of-the-art methods first refine the representation of each phrase by aggregating the $k$ image pixels most similar to it, and then match the refined text representations with the pixels of the image feature map to generate segmentation results. However, simply aggregating sampled image features ignores contextual information, which can lead to phrase-to-pixel mismatch. In this paper, we propose a novel learning framework called Deformable Attention Refined Matching Network (DRMN), whose main idea is to bring deformable attention into the iterative process of feature learning to incorporate essential context information from pixels at different scales. DRMN iteratively re-encodes pixels with the deformable attention network…
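
The baseline pipeline the abstract criticizes (refine each phrase with its top-$k$ most similar pixels, then match the refined phrase against every pixel) can be illustrated with a minimal sketch. This is not the authors' implementation and does not include DRMN's deformable attention. The function names, the mean pooling, the residual addition, and the sigmoid threshold are all assumptions chosen for illustration, and `k` is assumed to be at least 1.

```python
import math


def dot(u, v):
    """Dot product of two equal-length feature vectors."""
    return sum(a * b for a, b in zip(u, v))


def refine_phrase(phrase, pixels, k):
    """Hypothetical refinement: add the mean of the k most similar pixels.

    Mean pooling and the residual addition are assumptions; the abstract
    only says the top-k pixels are aggregated. Requires k >= 1.
    """
    scores = [dot(phrase, p) for p in pixels]
    # Indices of the k pixels with the highest similarity to the phrase.
    top = sorted(range(len(pixels)), key=lambda i: scores[i], reverse=True)[:k]
    dim = len(phrase)
    pooled = [sum(pixels[i][d] for i in top) / len(top) for d in range(dim)]
    return [phrase[d] + pooled[d] for d in range(dim)]


def match_mask(phrase, pixels, threshold=0.5):
    """Hypothetical matching: sigmoid of the dot product, then a threshold."""
    mask = []
    for p in pixels:
        prob = 1.0 / (1.0 + math.exp(-dot(phrase, p)))
        mask.append(1 if prob > threshold else 0)
    return mask


# Toy flattened feature map with four pixels and 2-dimensional features.
pixels = [[1.0, 0.0], [0.9, 0.1], [-0.2, 1.0], [-1.0, 0.2]]
phrase = [1.0, 0.0]

refined = refine_phrase(phrase, pixels, k=2)
mask = match_mask(refined, pixels)
print(mask)
```

Note that each pixel is scored on its own in this sketch, with no information from neighboring pixels. That is the gap the abstract points to when it says plain aggregation ignores context.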

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization