SAKE: Self-aware Knowledge Exploitation-Exploration for Grounded Multimodal Named Entity Recognition

Jielong Tang; Xujie Yuan; Jiayang Liu; Jianxing Yu; Xiao Dong; Lin Chen; Yunlai Teng; Shimin Di; Jian Yin

arXiv:2604.20146 · cs.IR · April 23, 2026

TL;DR

SAKE is an agentic framework for grounded multimodal named entity recognition (GMNER) that balances the model's internal knowledge against external search, with reported accuracy gains on social media data.

Contribution

The paper introduces a self-aware agentic approach trained in two stages, combining explicit uncertainty signals with reinforcement learning.

Findings

01. SAKE outperforms existing methods on social media benchmarks.

02. The Difficulty-aware Search Tag Generation improves entity uncertainty estimation.

03. Agentic reinforcement learning enhances retrieval decision-making.
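
The idea behind the second and third findings, deciding per entity whether to trust the model's own prediction or to consult external search, can be illustrated with a minimal sketch. This is not the authors' implementation: the page does not show the actual tag format, model, or training procedure, so `EntityCandidate`, `needs_search`, `resolve`, and `fake_search` are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class EntityCandidate:
    mention: str
    entity_type: str
    # Hypothetical stand-in for a per-entity "search tag":
    # True means the model is unsure and wants external evidence.
    needs_search: bool


def resolve(
    candidates: List[EntityCandidate],
    search_fn: Callable[[str], str],
) -> List[Tuple[str, str, Optional[str]]]:
    """Keep untagged predictions as-is; query search only for tagged ones."""
    resolved = []
    for cand in candidates:
        if cand.needs_search:
            # Exploration: fetch external evidence for the uncertain entity.
            evidence = search_fn(cand.mention)
        else:
            # Exploitation: rely on internal knowledge, no retrieval call.
            evidence = None
        resolved.append((cand.mention, cand.entity_type, evidence))
    return resolved


search_calls: List[str] = []


def fake_search(query: str) -> str:
    """Toy search backend (assumption) that records which queries were issued."""
    search_calls.append(query)
    return f"evidence for {query}"


candidates = [
    EntityCandidate("Paris", "LOC", needs_search=False),
    EntityCandidate("Zyx Band", "ORG", needs_search=True),
]
results = resolve(candidates, fake_search)
```

In this toy setup only the tagged entity triggers a search call, which is the behavior the gating is meant to produce: retrieval noise is avoided for entities the model already handles, while unfamiliar entities still get external evidence.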

Abstract

Grounded Multimodal Named Entity Recognition (GMNER) aims to extract named entities and localize their visual regions within image-text pairs, serving as a pivotal capability for various downstream applications. In open-world social media platforms, GMNER remains challenging due to the prevalence of long-tailed, rapidly evolving, and unseen entities. To tackle this, existing approaches typically rely on either external knowledge exploration through heuristic retrieval or internal knowledge exploitation via iterative refinement in Multimodal Large Language Models (MLLMs). However, heuristic retrieval often introduces noisy or conflicting evidence that degrades precision on known entities, while solely internal exploitation is constrained by the knowledge boundaries of MLLMs and prone to hallucinations. To address this, we propose SAKE, an end-to-end agentic framework that harmonizes…
