TL;DR
UniMEL introduces a unified LLM-based framework for multimodal entity linking that effectively integrates textual and visual information, achieving state-of-the-art results while requiring minimal fine-tuning.
Contribution
The paper presents UniMEL, a framework that uses Large Language Models for multimodal entity linking, simplifying the task and improving performance on three benchmark datasets.
Findings
Achieves state-of-the-art performance on three datasets.
Effectively integrates multimodal information with minimal fine-tuning.
Verifies the importance of each module through ablation studies.
Abstract
Multimodal Entity Linking (MEL) is a crucial task that aims to link ambiguous mentions within multimodal contexts to their referent entities in a multimodal knowledge base, such as Wikipedia. Existing methods rely heavily on complex mechanisms and extensive model tuning to model multimodal interaction on specific datasets. However, these methods overcomplicate the MEL task and overlook visual semantic information, which makes them costly and hard to scale. Moreover, they cannot resolve issues such as textual ambiguity, redundancy, and noisy images, which severely degrade their performance. Fortunately, the advent of Large Language Models (LLMs) with robust capabilities in text understanding and reasoning, particularly Multimodal Large Language Models (MLLMs) that can process multimodal inputs, provides new insights into addressing this challenge.…
