Towards Real-Time Panoptic Narrative Grounding by an End-to-End Grounding Network

Haowei Wang; Jiayi Ji; Yiyi Zhou; Yongjian Wu; Xiaoshuai Sun

arXiv:2301.03160 · cs.CV · January 10, 2023 · 1 citation

TL;DR

This paper introduces EPNG, a real-time, end-to-end network for Panoptic Narrative Grounding (PNG). It improves accuracy and speed over existing two-stage methods by combining Locality-Perceptive Attention (LPA) with a bidirectional Semantic Alignment Loss (SAL).

Contribution

The paper presents a novel one-stage network with Locality-Perceptive Attention and Semantic Alignment Loss for efficient and accurate PNG, enabling real-time performance.

Findings

01 Achieves up to 9.4% higher accuracy than the baseline

02 Runs inference 10 times faster than two-stage models

03 Demonstrates strong zero-shot generalization on other grounding tasks

Abstract

Panoptic Narrative Grounding (PNG) is an emerging cross-modal grounding task, which locates the target regions of an image corresponding to the text description. Existing approaches for PNG are mainly based on a two-stage paradigm, which is computationally expensive. In this paper, we propose a one-stage network for real-time PNG, termed End-to-End Panoptic Narrative Grounding network (EPNG), which directly generates masks for referents. Specifically, we propose two innovative designs, i.e., Locality-Perceptive Attention (LPA) and a bidirectional Semantic Alignment Loss (SAL), to properly handle the many-to-many relationship between textual expressions and visual objects. LPA embeds the local spatial priors into attention modeling, i.e., a pixel may belong to multiple masks at different scales, thereby improving segmentation. To help understand the complex semantic relationships, SAL…
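The idea of embedding local spatial priors into attention can be illustrated with a small sketch. The code below is not the authors' LPA; it is a minimal NumPy example of one plausible reading, in which each pixel attends only to a square neighborhood and several neighborhood sizes are combined. The function names, the square (Chebyshev) window, the `radii` parameter, and the averaging step are all assumptions made for illustration.

```python
import numpy as np


def local_window_mask(height, width, radius):
    """Boolean (H*W, H*W) mask, True where the key pixel lies within
    `radius` (Chebyshev distance) of the query pixel. Hypothetical helper."""
    ys, xs = np.meshgrid(np.arange(height), np.arange(width), indexing="ij")
    coords = np.stack([ys.ravel(), xs.ravel()], axis=1)      # (N, 2)
    diff = np.abs(coords[:, None, :] - coords[None, :, :])   # (N, N, 2)
    return diff.max(axis=-1) <= radius


def locality_masked_attention(queries, keys, values, mask):
    """Scaled dot-product attention restricted to positions where mask is True."""
    scale = 1.0 / np.sqrt(queries.shape[-1])
    scores = (queries @ keys.T) * scale
    # Positions outside the local window get zero weight after the softmax.
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ values, weights


def multi_scale_local_attention(features, height, width, radii):
    """Run one masked self-attention pass per radius and average the outputs.
    Averaging is an assumption of this sketch, not taken from the paper."""
    outputs = []
    for radius in radii:
        mask = local_window_mask(height, width, radius)
        out, _ = locality_masked_attention(features, features, features, mask)
        outputs.append(out)
    return np.mean(outputs, axis=0)


rng = np.random.default_rng(0)
H, W, C = 4, 4, 8
features = rng.standard_normal((H * W, C))  # flattened feature map, one row per pixel
fused = multi_scale_local_attention(features, H, W, radii=(1, 2))
```

Here `fused` has one row per pixel, and each row mixes information from neighborhoods of two different sizes. A real model would use learned query, key, and value projections instead of the raw features.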

Code & Models

Repositories

mr-neko/epng (PyTorch, official)

Videos

Towards Real-Time Panoptic Narrative Grounding by an End-to-End Grounding Network · underline

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling