LION : Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge

Gongwei Chen; Leyang Shen; Rui Shao; Xiang Deng; Liqiang Nie

arXiv:2311.11860·cs.CV·November 28, 2023·1 cites

TL;DR

LION enhances multimodal large language models by injecting visual knowledge at two levels, fine-grained spatial detail and high-level semantic evidence, using progressive training and soft prompting to improve multimodal understanding.

Contribution

The paper introduces a method for injecting dual-level visual knowledge into MLLMs, combining spatial-aware visual knowledge with semantic visual evidence supplied via soft prompting.

Findings

01 Improves accuracy on VSR by 5%
02 Enhances TextCaps CIDEr score by 3%
03 Boosts RefCOCOg accuracy by 5%

Abstract

Multimodal Large Language Models (MLLMs) have endowed LLMs with the ability to perceive and understand multi-modal signals. However, most of the existing MLLMs mainly adopt vision encoders pretrained on coarsely aligned image-text pairs, leading to insufficient extraction and reasoning of visual knowledge. To address this issue, we devise a dual-Level vIsual knOwledge eNhanced Multimodal Large Language Model (LION), which empowers the MLLM by injecting visual knowledge in two levels. 1) Progressive incorporation of fine-grained spatial-aware visual knowledge. We design a vision aggregator cooperated with region-level vision-language (VL) tasks to incorporate fine-grained spatial-aware visual knowledge into the MLLM. To alleviate the conflict between image-level and region-level VL tasks during incorporation, we devise a dedicated stage-wise instruction-tuning strategy with…

Code & Models

Repositories

rshaojimmy/jiutian (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques