Advancing Grounded Multimodal Named Entity Recognition via LLM-Based Reformulation and Box-Based Segmentation

Jinyuan Li; Ziyan Li; Han Li; Jianfei Yu; Rui Xia; Di Sun; Gang Pan

arXiv:2406.07268·cs.MM·September 3, 2025·1 cites

TL;DR

This paper introduces RiVEG, a novel framework that reformulates grounded multimodal NER using LLMs and segmentation models, improving performance and scalability across multiple datasets and tasks.

Contribution

The paper proposes a unified LLM-based reformulation for GMNER, eliminating the need for pre-extracted features and unifying visual and entity grounding with scalable modules.

Findings

01. RiVEG outperforms state-of-the-art methods on four datasets.

02. The framework effectively addresses ungroundable entities and ambiguity issues.

03. The SMNER task and dataset demonstrate the utility of segmentation in fine-grained NER.

Abstract

The Grounded Multimodal Named Entity Recognition (GMNER) task aims to identify named entities, their entity types, and their corresponding visual regions. The task poses two challenges: 1) The weak correlation between images and text on social media leaves a notable proportion of named entities ungroundable. 2) The fine-grained named entities in GMNER differ from the coarse-grained noun phrases used in similar tasks (e.g., phrase localization). In this paper, we propose RiVEG, a unified framework that reformulates GMNER into a joint MNER-VE-VG task by leveraging large language models (LLMs) as connecting bridges. This reformulation brings two benefits: 1) It enables us to tune the MNER module for the best MNER performance and eliminates the need to pre-extract region features using object detection methods, thus naturally addressing the two major…
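The staged reformulation described in the abstract can be pictured as a small pipeline. The following is only a minimal sketch and not the authors' implementation: it reads VE as a visual-entailment check (is the entity visible at all?) and VG as visual grounding (where is it?), which the truncated abstract does not spell out. Every name in it (`Entity`, `GroundedEntity`, `gmner_pipeline`, and the `toy_*` stand-ins) is hypothetical, and the "image" is a plain dictionary so the example runs without any model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

Box = Tuple[int, int, int, int]  # assumed (x1, y1, x2, y2) convention
Image = Dict[str, Box]           # toy stand-in: coarse label -> box


@dataclass
class Entity:
    text: str
    type: str


@dataclass
class GroundedEntity:
    text: str
    type: str
    box: Optional[Box]  # None marks an entity judged ungroundable


def gmner_pipeline(
    sentence: str,
    image: Image,
    mner: Callable[[str, Image], List[Entity]],
    reformulate: Callable[[Entity, str], str],
    entails: Callable[[str, Image], bool],
    ground: Callable[[str, Image], Box],
) -> List[GroundedEntity]:
    """Chain the three stages: MNER, then VE, then VG."""
    results = []
    for ent in mner(sentence, image):
        # Bridge step: turn the fine-grained name into a coarser
        # expression (an LLM plays this role in the paper's framing).
        expression = reformulate(ent, sentence)
        # VE decides whether grounding should be attempted at all.
        box = ground(expression, image) if entails(expression, image) else None
        results.append(GroundedEntity(ent.text, ent.type, box))
    return results


# Hypothetical toy components, for illustration only.
def toy_mner(sentence: str, image: Image) -> List[Entity]:
    lexicon = {"Messi": "PER", "Barcelona": "ORG"}
    return [Entity(w, t) for w, t in lexicon.items() if w in sentence]


def toy_reformulate(ent: Entity, sentence: str) -> str:
    return {"PER": "person", "ORG": "organization"}[ent.type]


def toy_entails(expression: str, image: Image) -> bool:
    return expression in image


def toy_ground(expression: str, image: Image) -> Box:
    return image[expression]


image = {"person": (10, 20, 110, 220)}
out = gmner_pipeline(
    "Messi scores for Barcelona",
    image, toy_mner, toy_reformulate, toy_entails, toy_ground,
)
for item in out:
    print(item)
```

Keeping the entailment check separate from grounding is what lets an entity with no visual counterpart come back with `box=None` instead of a forced, spurious region.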

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Web Data Mining and Analysis