HyperLLaVA: Dynamic Visual and Language Expert Tuning for Multimodal Large Language Models

Wenqiao Zhang; Tianwei Lin; Jiang Liu; Fangxun Shu; Haoyuan Li; Lei Zhang; He Wanggui; Hao Zhou; Zheqi Lv; Hao Jiang; Juncheng Li; Siliang Tang; Yueting Zhuang

arXiv:2403.13447·cs.AI·March 21, 2024·2 cites

Open Access · 1 Repo

TL;DR

HyperLLaVA introduces a dynamic tuning approach for multimodal large language models, leveraging HyperNetworks to adapt visual and language experts, significantly improving performance on multiple benchmarks over static models.

Contribution

It proposes a novel dynamic tuning method using HyperNetworks for visual and language experts, surpassing static tuning strategies in multimodal large language models.

Findings

01. Outperforms LLaVA on multiple benchmarks

02. Demonstrates the effectiveness of adaptive expert tuning

03. Achieves significant performance improvements

Abstract

Recent advancements indicate that scaling up Multimodal Large Language Models (MLLMs) effectively enhances performance on downstream multimodal tasks. The prevailing MLLM paradigm, e.g., LLaVA, transforms visual features into text-like tokens using a static vision-language mapper, thereby enabling static LLMs to develop the capability to comprehend visual information through visual instruction tuning. Although promising, the static tuning strategy (static tuning refers to the trained model with static parameters) that shares the same parameters may constrain performance across different downstream multimodal tasks. In light of this, we introduce HyperLLaVA, which involves adaptive tuning of the projector and LLM parameters, in conjunction with a dynamic visual expert and language expert, respectively. These experts are derived from…
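
The contrast the abstract draws between a static mapper and input-adaptive tuning can be illustrated with a small sketch. This is not the authors' implementation: the class names, the low-rank adapter shape, the tanh nonlinearity, and the use of a single linear map as the hypernetwork are all assumptions made for illustration. It only shows the general HyperNetwork idea, where a small network generates adapter weights from a guidance vector so that the projection applied to a feature depends on the input.

```python
import math
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def rand_matrix(rows, cols, rng, scale=0.1):
    """Small random matrix; stands in for trained weights."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class StaticProjector:
    """Static mapper: one weight matrix shared by every input."""

    def __init__(self, d_in, d_out, rng):
        self.W = rand_matrix(d_out, d_in, rng)

    def __call__(self, x):
        return matvec(self.W, x)


class HyperProjector:
    """Hypothetical dynamic mapper: static projection plus a residual
    adapter whose weights are generated from a guidance vector."""

    def __init__(self, d_in, d_out, d_guide, rank, rng):
        self.base = StaticProjector(d_in, d_out, rng)
        self.d_in, self.d_out, self.rank = d_in, d_out, rank
        # Hypernetwork (assumed linear): guidance -> flattened adapter weights.
        self.H_down = rand_matrix(rank * d_in, d_guide, rng)
        self.H_up = rand_matrix(d_out * rank, d_guide, rng)

    def generate_adapter(self, guidance):
        """Produce down- and up-projection matrices for this guidance."""
        flat_down = matvec(self.H_down, guidance)
        flat_up = matvec(self.H_up, guidance)
        down = [flat_down[i * self.d_in:(i + 1) * self.d_in]
                for i in range(self.rank)]
        up = [flat_up[i * self.rank:(i + 1) * self.rank]
              for i in range(self.d_out)]
        return down, up

    def __call__(self, x, guidance):
        down, up = self.generate_adapter(guidance)
        hidden = [math.tanh(h) for h in matvec(down, x)]
        delta = matvec(up, hidden)
        # Residual form: static output plus input-conditioned correction.
        return [b + d for b, d in zip(self.base(x), delta)]


rng = random.Random(0)
proj = HyperProjector(d_in=4, d_out=3, d_guide=2, rank=2, rng=rng)
x = [1.0, -0.5, 0.25, 2.0]

# The same feature is projected differently under different guidance.
out_a = proj(x, guidance=[1.0, 0.0])
out_b = proj(x, guidance=[0.0, 1.0])
# With zero guidance the generated adapter vanishes and only the static path remains.
out_zero = proj(x, guidance=[0.0, 0.0])
```

In this sketch the residual form means the dynamic path can only add to the static projection, so a zero guidance vector recovers the static mapper exactly. How the paper actually derives guidance and where the experts sit in the projector and LLM is not recoverable from the truncated abstract above.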

Code & Models

Repositories

dcdmllm/hyperllava
PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques