Self-paced Multi-grained Cross-modal Interaction Modeling for Referring Expression Comprehension

Peihan Miao; Wei Su; Gaoang Wang; Xuewei Li; Xi Li

arXiv:2204.09957 · cs.CV · March 13, 2024 · 1 citation

TL;DR

This paper introduces a self-paced, multi-grained cross-modal interaction framework for referring expression comprehension, leveraging transformer-based attention and adaptive learning to improve localization accuracy across diverse visual and linguistic data.

Contribution

It proposes a novel self-paced learning approach combined with multi-grained cross-modal attention for enhanced reasoning in REC tasks.

Findings

01 Outperforms state-of-the-art methods on multiple datasets

02 Effectively utilizes multi-grained information in the visual and linguistic modalities

03 Adaptive learning improves performance on hard examples

Abstract

As an important and challenging vision-language task, referring expression comprehension (REC) generally requires a large amount of multi-grained information from the visual and linguistic modalities to reason accurately. In addition, because visual scenes are diverse and linguistic expressions vary, some hard examples carry much more multi-grained information than others. How to aggregate multi-grained information from different modalities and extract abundant knowledge from hard examples is therefore crucial in the REC task. To address these challenges, we propose a Self-paced Multi-grained Cross-modal Interaction Modeling framework, which improves language-to-vision localization through innovations in both network structure and learning mechanism. Concretely, we design a transformer-based multi-grained cross-modal attention,…
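
The abstract is cut off before it describes the learning mechanism, so the following is only a minimal sketch of the generic idea behind the term "self-paced": the classic hard-threshold weighting scheme, in which samples with small loss are trained on first and harder samples are admitted as a threshold grows. It is not the paper's mechanism, which may differ substantially. All function names, the geometric schedule, and the loss values are hypothetical and chosen for illustration.

```python
def self_paced_weights(losses, threshold):
    """Classic hard self-paced weights: 1.0 if a sample's loss is below the threshold, else 0.0."""
    return [1.0 if loss < threshold else 0.0 for loss in losses]


def weighted_mean_loss(losses, weights):
    """Average the losses of the currently selected samples (0.0 if none are selected)."""
    total_weight = sum(weights)
    if total_weight == 0:
        return 0.0
    return sum(w * l for w, l in zip(weights, losses)) / total_weight


def pace_schedule(initial_threshold, growth, num_stages):
    """Grow the threshold geometrically so that harder samples are admitted at later stages."""
    return [initial_threshold * growth ** stage for stage in range(num_stages)]


# Hypothetical per-sample losses, ordered from easy to hard (illustrative values only).
sample_losses = [0.2, 0.5, 1.5, 3.0]

# Assumed schedule: the threshold doubles at every stage.
thresholds = pace_schedule(initial_threshold=0.4, growth=2.0, num_stages=4)

# Number of samples that take part in training at each stage.
selected_per_stage = [sum(self_paced_weights(sample_losses, t)) for t in thresholds]
```

In this toy run one more sample becomes active at each stage, so the hardest example only contributes to the loss once the threshold has grown past it.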

Taxonomy

Topics: Multimodal Machine Learning Applications · Subtitles and Audiovisual Media · Hand Gesture Recognition Systems