TL;DR
E2E-GMNER introduces a fully end-to-end generative multimodal NER framework that unifies recognition, grounding, and reasoning, improving robustness and performance over pipeline methods.
Contribution
It proposes a novel instruction-tuned generative model with chain-of-thought reasoning and Gaussian risk-aware box perturbation for robust multimodal entity recognition.
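As a rough illustration of what Gaussian box perturbation could look like, the sketch below jitters each corner of a ground-truth box with zero-mean Gaussian noise scaled to the box size, then clips the result to the image. This page does not specify the paper's actual procedure, so the function name `perturb_box`, the `sigma_frac` parameter, and the scaling rule are all assumptions. The "risk-aware" part of the method is not modeled here.

```python
import random


def perturb_box(box, image_size, sigma_frac=0.05, rng=None):
    """Jitter an (x1, y1, x2, y2) box with zero-mean Gaussian noise.

    Hypothetical sketch: the noise standard deviation is sigma_frac times
    the box width (for x) or height (for y). This scaling is an assumption,
    not the rule used by E2E-GMNER.
    """
    rng = rng or random.Random()
    x1, y1, x2, y2 = box
    width, height = image_size

    # Standard deviations proportional to the box extent along each axis.
    sigma_x = sigma_frac * (x2 - x1)
    sigma_y = sigma_frac * (y2 - y1)

    def clip(value, upper):
        # Keep the coordinate inside the image.
        return min(max(value, 0.0), float(upper))

    nx1 = clip(x1 + rng.gauss(0.0, sigma_x), width)
    nx2 = clip(x2 + rng.gauss(0.0, sigma_x), width)
    ny1 = clip(y1 + rng.gauss(0.0, sigma_y), height)
    ny2 = clip(y2 + rng.gauss(0.0, sigma_y), height)

    # Re-order corners in case the noise flipped them.
    return (min(nx1, nx2), min(ny1, ny2), max(nx1, nx2), max(ny1, ny2))
```

Training on boxes perturbed this way is one plausible reading of noise-aware grounding supervision: the model sees slightly different targets for the same region and is less likely to overfit to imprecise annotations.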
Findings
Achieves competitive results on Twitter-GMNER and Twitter-FMNERG benchmarks.
Demonstrates the effectiveness of end-to-end training and noise-aware grounding supervision.
Validates improved robustness against annotation noise and discretization errors.
Abstract
Grounded Multimodal Named Entity Recognition (GMNER) aims to jointly identify named entity mentions in text, predict their semantic types, and ground each entity to a corresponding visual region in an associated image. Existing approaches predominantly adopt pipeline-based architectures that decouple textual entity recognition and visual grounding, leading to error accumulation and suboptimal joint optimization. In this paper, we propose E2E-GMNER, a fully end-to-end generative framework that unifies entity recognition, semantic typing, visual grounding, and implicit knowledge reasoning within a single multimodal large language model. We formulate GMNER as an instruction-tuned conditional generation task and incorporate chain-of-thought reasoning to enable the model to adaptively determine when visual evidence or background knowledge is informative, reducing reliance on noisy cues. To…
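To make the generative formulation concrete, the sketch below shows one way a set of (mention, type, region) triples might be serialized into a single target string for a language model to generate. The output format, the `GroundedEntity` class, the `to_target` and `quantize` helpers, and the 1000-bin coordinate grid are all hypothetical; the abstract does not give the paper's actual template. Mapping coordinates to integer bins also shows where discretization error can enter.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class GroundedEntity:
    mention: str
    entity_type: str
    # Pixel-space (x1, y1, x2, y2); None if the entity has no visible region.
    box: Optional[Tuple[float, float, float, float]]


def quantize(value: float, extent: float, bins: int = 1000) -> int:
    """Map a pixel coordinate to an integer bin in [0, bins - 1]."""
    idx = int(value * bins / extent)
    return min(max(idx, 0), bins - 1)


def to_target(entities: List[GroundedEntity],
              image_size: Tuple[int, int],
              bins: int = 1000) -> str:
    """Serialize entities into one generation target (hypothetical format)."""
    width, height = image_size
    parts = []
    for entity in entities:
        if entity.box is None:
            region = "none"
        else:
            x1, y1, x2, y2 = entity.box
            coords = [
                quantize(x1, width, bins),
                quantize(y1, height, bins),
                quantize(x2, width, bins),
                quantize(y2, height, bins),
            ]
            region = "[" + ", ".join(str(c) for c in coords) + "]"
        parts.append(f"{entity.mention} | {entity.entity_type} | {region}")
    return "; ".join(parts)


example = [
    GroundedEntity("Taylor Swift", "PER", (50, 40, 250, 360)),
    GroundedEntity("Grammys", "MISC", None),
]
print(to_target(example, image_size=(500, 400)))
```

Because all three outputs live in one string, a single sequence loss covers recognition, typing, and grounding together, which is the joint optimization that pipeline methods lack.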
