DIM: Dynamic Integration of Multimodal Entity Linking with Large Language Model
Shezheng Song, Shasha Li, Jie Yu, Shan Zhao, Xiaopeng Li, Jun Ma, Xiaodong Liu, Zhuo Li, Xiaoguang Mao

TL;DR
This paper introduces DIM, an approach to multimodal entity linking that uses ChatGPT for dynamic entity extraction and BLIP-2 for visual understanding, and that outperforms the majority of existing methods on three datasets.
Contribution
The paper presents a new method combining dynamic entity extraction with LLM-based visual understanding for enhanced multimodal entity linking.
Findings
Outperforms the majority of existing methods on the three original datasets
Achieves state-of-the-art results on enhanced datasets
Demonstrates effective integration of multimodal information
Abstract
Our study addresses Multimodal Entity Linking, the task of aligning mentions in multimodal information with entities in a knowledge base. Existing methods still face challenges such as ambiguous entity representations and limited use of image information. We therefore propose dynamic entity extraction using ChatGPT, which dynamically extracts entities and enhances the datasets. We also propose a method, Dynamically Integrate Multimodal information with knowledge base (DIM), which employs the visual understanding capability of a Large Language Model (LLM). The LLM, such as BLIP-2, extracts entity-relevant information from the image, which facilitates extracting entity features and linking them with the dynamic entity representations provided by ChatGPT. Experiments demonstrate that the proposed DIM method outperforms the majority of existing methods on the three original…
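The linking step described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names (`bow`, `cosine`, `link_mention`) are hypothetical, a bag-of-words cosine similarity stands in for learned encoders, the `image_description` string stands in for whatever a vision-language model such as BLIP-2 would produce, and the hand-written entity descriptions stand in for entity representations generated by ChatGPT.

```python
import math
from collections import Counter


def bow(text):
    """Lowercased bag-of-words vector (toy stand-in for a learned encoder)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two Counter vectors; 0.0 if either is empty."""
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def link_mention(mention_text, image_description, entity_descriptions):
    """Score each candidate entity against the mention text plus image text.

    image_description is assumed to come from an image-to-text model;
    entity_descriptions maps entity name -> textual entity representation.
    Returns the best-scoring entity name and the full score dictionary.
    """
    query = bow(mention_text + " " + image_description)
    scores = {
        name: cosine(query, bow(description))
        for name, description in entity_descriptions.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical candidate entities with hand-written descriptions.
entities = {
    "Apple Inc.": "technology company known for the iphone",
    "Apple (fruit)": "edible fruit grown in orchards",
}

best, scores = link_mention(
    "Apple announced new products",
    "a man presenting an iphone on a stage at a technology keynote",
    entities,
)

# Same mention without any image-derived text, for comparison.
_, text_only_scores = link_mention("Apple announced new products", "", entities)
```

In this toy setup the mention text alone shares no words with either description, so both candidates score zero; only the image-derived text separates them. That is the intuition behind using image information for disambiguation, shown here with a deliberately simple similarity measure.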
Taxonomy
Topics: Topic Modeling
