A Proposal-Free Query-Guided Network for Grounded Multimodal Named Entity Recognition

Hongbing Li; Jiamin Liu; Shuo Zhang; Bo Xiao

arXiv:2603.17314·cs.CV·March 20, 2026

Open Access

TL;DR

This paper introduces a proposal-free Query-Guided Network (QGN) for grounded multimodal named entity recognition. QGN unifies multimodal reasoning and decoding under text guidance instead of relying on a separate pre-trained object detector, and the authors report that it outperforms existing models.

Contribution

The proposed QGN model eliminates the need for pre-trained object detectors, enabling more precise grounding of entities in images through integrated multimodal reasoning.

Findings

01. QGN achieves top performance on benchmark datasets.

02. It outperforms models using pre-trained detectors.

03. The approach enhances grounding accuracy in open-domain scenarios.

Abstract

Grounded Multimodal Named Entity Recognition (GMNER) identifies named entities, including their spans and types, in natural language text and grounds them to the corresponding regions in associated images. Most existing approaches split this task into two steps: they first detect objects using a pre-trained general-purpose detector and then match named entities to the detected objects. However, these methods face a major limitation. Because pre-trained general-purpose object detectors operate independently of textual entities, they tend to detect common objects and frequently overlook specific fine-grained regions required by named entities. This misalignment between object detectors and entities introduces imprecision and can impair overall system performance. In this paper, we propose a proposal-free Query-Guided Network (QGN) that unifies multimodal reasoning and decoding…
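
To make the task definition and the stated limitation concrete, here is a minimal sketch of what a GMNER output could look like and why matching against detector proposals can cap grounding quality. It does not come from the paper and is not QGN. The `GroundedEntity` record, the `(x1, y1, x2, y2)` box convention, the `PER`/`LOC` labels, the use of `None` for an entity with no image region, the IoU overlap measure, and all example values are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Assumed box convention: (x1, y1, x2, y2) in pixels.
Box = Tuple[float, float, float, float]


@dataclass(frozen=True)
class GroundedEntity:
    """Hypothetical GMNER record: text span, entity type, optional image region."""
    start: int           # index of the first token in the span (inclusive)
    end: int             # index one past the last token (exclusive)
    entity_type: str     # illustrative label such as "PER" or "LOC"
    box: Optional[Box]   # None when the entity has no region in the image


def span_text(tokens: List[str], entity: GroundedEntity) -> str:
    """Return the surface string covered by the entity span."""
    return " ".join(tokens[entity.start:entity.end])


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def best_proposal_iou(target: Box, proposals: List[Box]) -> float:
    """Best overlap any proposal achieves with the target region.

    In a detect-then-match pipeline, the matcher can only pick from the
    proposals, so this value bounds how well the entity can be grounded.
    """
    return max((iou(target, p) for p in proposals), default=0.0)


# Made-up example: one groundable entity and one without an image region.
tokens = ["Alice", "Smith", "visits", "Paris"]
entities = [
    GroundedEntity(0, 2, "PER", (40.0, 30.0, 120.0, 200.0)),
    GroundedEntity(3, 4, "LOC", None),
]

# A text-agnostic detector that only proposes one large, generic region.
detector_proposals: List[Box] = [(0.0, 0.0, 300.0, 300.0)]
ceiling = best_proposal_iou(entities[0].box, detector_proposals)
```

In this toy setup the only proposal overlaps the fine-grained region poorly, so no matching step downstream can recover a tight box. That is the kind of detector-entity misalignment the abstract describes as the motivation for a proposal-free design.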

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques