I2CR: Intra- and Inter-modal Collaborative Reflections for Multimodal Entity Linking
Ziyan Liu, Junwen Li, Kaiwen Li, Tong Ruan, Chao Wang, Xinyan He, Zongyu Wang, Xuezhi Cao, Jingping Liu

TL;DR
This paper introduces I2CR, a novel multimodal entity linking framework that emphasizes iterative reasoning over visual clues to improve accuracy, outperforming existing methods on multiple datasets.
Contribution
The paper presents a new LLM-based framework that uses intra- and inter-modal reflections with multi-round reasoning, addressing the limitation of previous methods that extract visual features only once.
Findings
Outperforms state-of-the-art methods on three datasets, with improvements of 3.2%, 5.1%, and 1.6%, respectively
Demonstrates effectiveness of iterative visual reasoning in multimodal linking
Abstract
Multimodal entity linking plays a crucial role in a wide range of applications. Large language model-based methods have recently become the dominant paradigm for this task, effectively leveraging both textual and visual modalities to enhance performance. Despite their success, these methods still face two challenges: the unnecessary incorporation of image data in certain scenarios, and reliance on only a one-time extraction of visual features, both of which can undermine their effectiveness and accuracy. To address these challenges, we propose a novel LLM-based framework for the multimodal entity linking task, called Intra- and Inter-modal Collaborative Reflections. This framework prioritizes leveraging text information to address the task. When text alone is insufficient to link the correct entity through intra- and inter-modality evaluations, it employs a multi-round…
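The text-first control flow described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function `link_entity`, its parameters, the word-overlap linker, and the consistency check are all hypothetical stand-ins for the LLM-based linking and the intra- and inter-modality evaluations. The abstract is truncated before it details the multi-round step, so the loop below simply assumes that one textual visual clue is added per round.

```python
from typing import Callable, Sequence


def link_entity(
    mention_text: str,
    candidates: Sequence[str],
    linker: Callable[[str, Sequence[str]], str],
    is_consistent: Callable[[str, str], bool],
    visual_clues: Sequence[str],
    max_rounds: int = 3,
) -> str:
    """Hypothetical text-first linking with a multi-round visual fallback."""
    context = mention_text
    choice = linker(context, candidates)
    # Text alone is tried first; images are only consulted if the check fails.
    if is_consistent(choice, context):
        return choice
    # Assumed multi-round step: add one visual clue per round, then re-link.
    for clue in list(visual_clues)[:max_rounds]:
        context = f"{context} {clue}"
        choice = linker(context, candidates)
        if is_consistent(choice, context):
            return choice
    return choice  # best effort after all rounds


def _overlap(a: str, b: str) -> int:
    """Number of distinct lowercase words shared by two strings."""
    return len(set(a.lower().split()) & set(b.lower().split()))


def toy_linker(context: str, candidates: Sequence[str]) -> str:
    """Toy stand-in for an LLM linker: pick the candidate with most shared words."""
    return max(candidates, key=lambda c: _overlap(c, context))


def toy_check(choice: str, context: str) -> bool:
    """Toy stand-in for the evaluations: require at least two shared words."""
    return _overlap(choice, context) >= 2


CANDIDATES = ["Michael Jordan basketball player", "Jordan river valley"]
result = link_entity(
    "Visit to Jordan",
    CANDIDATES,
    toy_linker,
    toy_check,
    visual_clues=["a river flowing through a valley"],
)
```

In this toy run the text alone is ambiguous, so the sketch falls back to the first visual clue before settling on a candidate. In a real system the linker and the check would be model calls, and the clues would come from the image rather than from a fixed list.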
