E2E-GMNER: End-to-End Generative Grounded Multimodal Named Entity Recognition

Meng Zhang; Jinzhong Ning; Xiaolong Wu; Hongfei Lin; Yijia Zhang

arXiv:2604.17319 · cs.CV · April 21, 2026

TL;DR

E2E-GMNER introduces a fully end-to-end generative multimodal NER framework that unifies recognition, grounding, and reasoning, improving robustness and performance over pipeline methods.

Contribution

The paper proposes an instruction-tuned generative model that combines chain-of-thought reasoning with Gaussian risk-aware box perturbation for robust grounded multimodal entity recognition.
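To make the idea of Gaussian box perturbation concrete, here is a minimal sketch of jittering a ground-truth box with Gaussian noise. It is not the paper's implementation: the function name `perturb_box`, the `sigma_ratio` parameter, the `(x1, y1, x2, y2)` pixel format, and the choice to scale the noise by box size are all assumptions made for illustration, and the sketch does not model the "risk-aware" part, which this summary does not describe.

```python
import random

def perturb_box(box, img_w, img_h, sigma_ratio=0.05, rng=None):
    """Jitter a box (x1, y1, x2, y2) with zero-mean Gaussian noise.

    Hypothetical helper: sigma_ratio (noise std as a fraction of box
    width/height) is an assumed knob, not a value from the paper.
    """
    rng = rng or random.Random()
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1

    # Sample independent noise per coordinate, scaled by box size.
    nx1 = x1 + rng.gauss(0.0, sigma_ratio * w)
    ny1 = y1 + rng.gauss(0.0, sigma_ratio * h)
    nx2 = x2 + rng.gauss(0.0, sigma_ratio * w)
    ny2 = y2 + rng.gauss(0.0, sigma_ratio * h)

    # Clip each coordinate to the image bounds.
    nx1 = min(max(nx1, 0.0), float(img_w))
    nx2 = min(max(nx2, 0.0), float(img_w))
    ny1 = min(max(ny1, 0.0), float(img_h))
    ny2 = min(max(ny2, 0.0), float(img_h))

    # Restore corner ordering in case the noise flipped the box.
    left, right = sorted((nx1, nx2))
    top, bottom = sorted((ny1, ny2))
    return (left, top, right, bottom)
```

With `sigma_ratio=0` the function returns the input box unchanged, so the perturbation strength can be dialed from no noise upward during training.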

Findings

01. Achieves competitive results on the Twitter-GMNER and Twitter-FMNERG benchmarks.

02. Demonstrates the effectiveness of end-to-end training and noise-aware grounding supervision.

03. Validates improved robustness against annotation noise and discretization errors.

Abstract

Grounded Multimodal Named Entity Recognition (GMNER) aims to jointly identify named entity mentions in text, predict their semantic types, and ground each entity to a corresponding visual region in an associated image. Existing approaches predominantly adopt pipeline-based architectures that decouple textual entity recognition and visual grounding, leading to error accumulation and suboptimal joint optimization. In this paper, we propose E2E-GMNER, a fully end-to-end generative framework that unifies entity recognition, semantic typing, visual grounding, and implicit knowledge reasoning within a single multimodal large language model. We formulate GMNER as an instruction-tuned conditional generation task and incorporate chain-of-thought reasoning to enable the model to adaptively determine when visual evidence or background knowledge is informative, reducing reliance on noisy cues. To…
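As a rough illustration of the task output described in the abstract, the sketch below represents each prediction as a mention, a semantic type, and an optional image region, then flattens the predictions into one target string of the kind a conditional generation model could be trained to emit. The class `GroundedEntity`, the function `serialize`, the ` | ` and ` ; ` separators, the `none` marker for entities without a visual region, and the example labels are all hypothetical; the paper's actual output format is not given in this summary.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class GroundedEntity:
    mention: str                               # entity span from the text
    entity_type: str                           # semantic type (illustrative labels)
    box: Optional[Tuple[int, int, int, int]]   # (x1, y1, x2, y2), or None if not groundable

def serialize(entities: List[GroundedEntity]) -> str:
    """Flatten entities into a single generation target (assumed format)."""
    parts = []
    for e in entities:
        # Entities with no matching image region get an explicit marker.
        region = "none" if e.box is None else ",".join(str(v) for v in e.box)
        parts.append(f"{e.mention} | {e.entity_type} | {region}")
    return " ; ".join(parts)

example = [
    GroundedEntity("Taylor Swift", "PER", (12, 30, 200, 310)),
    GroundedEntity("Nashville", "LOC", None),
]
print(serialize(example))
```

Emitting recognition, typing, and grounding in one sequence is what lets a single model be optimized jointly, in contrast to the pipeline designs the abstract criticizes.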

Code & Models

Repositories

Finch-coder/E2E-GMNER (GitHub)
