HyperLLaVA: Dynamic Visual and Language Expert Tuning for Multimodal Large Language Models
Wenqiao Zhang, Tianwei Lin, Jiang Liu, Fangxun Shu, Haoyuan Li, Lei Zhang, He Wanggui, Hao Zhou, Zheqi Lv, Hao Jiang, Juncheng Li, Siliang Tang, Yueting Zhuang

TL;DR
HyperLLaVA introduces a dynamic tuning approach for multimodal large language models, leveraging HyperNetworks to adapt visual and language experts, significantly improving performance on multiple benchmarks over static models.
Contribution
It proposes a novel dynamic tuning method using HyperNetworks for visual and language experts, surpassing static tuning strategies in multimodal large language models.
Findings
Outperforms LLaVA on multiple benchmarks
Demonstrates the effectiveness of adaptive expert tuning
Achieves significant performance improvements
Abstract
Recent advancements indicate that scaling up Multimodal Large Language Models (MLLMs) effectively enhances performance on downstream multimodal tasks. The prevailing MLLM paradigm, e.g., LLaVA, transforms visual features into text-like tokens using a static vision-language mapper, thereby enabling static LLMs to develop the capability to comprehend visual information through visual instruction tuning. Although promising, the static tuning strategy (static tuning refers to the trained model with static parameters) that shares the same parameters may constrain performance across different downstream multimodal tasks. In light of this, we introduce HyperLLaVA, which involves adaptive tuning of the projector and LLM parameters, in conjunction with a dynamic visual expert and language expert, respectively. These experts are derived from…
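The contrast between a static mapper and an input-conditioned one can be illustrated with a minimal sketch. The abstract is truncated here, so this does not reproduce the paper's architecture: it only shows the general hypernetwork idea of generating a weight update from a guidance vector. All names (`StaticProjector`, `HyperProjector`, `guidance`) and the single-linear-layer hypernetwork are assumptions made for illustration.

```python
import random


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def rand_matrix(rows, cols, rng, scale=0.1):
    """Small random matrix; stands in for trained weights."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class StaticProjector:
    """Applies the same weights to every input (the 'static' mapper)."""

    def __init__(self, d_in, d_out, rng):
        self.W = rand_matrix(d_out, d_in, rng)

    def __call__(self, x):
        return matvec(self.W, x)


class HyperProjector:
    """Hypothetical dynamic mapper: static weights plus a generated residual.

    A hypernetwork (assumed here to be one linear map) turns a guidance
    vector into a weight delta, so the effective projection depends on
    the input-specific guidance.
    """

    def __init__(self, d_in, d_out, d_guid, rng):
        self.base = StaticProjector(d_in, d_out, rng)
        self.d_in, self.d_out = d_in, d_out
        # Hypernetwork weights: guidance -> flattened (d_out x d_in) delta.
        self.H = rand_matrix(d_out * d_in, d_guid, rng)

    def generate_delta(self, guidance):
        flat = matvec(self.H, guidance)
        return [flat[r * self.d_in:(r + 1) * self.d_in] for r in range(self.d_out)]

    def __call__(self, x, guidance):
        delta = self.generate_delta(guidance)
        base_out = self.base(x)
        dyn_out = matvec(delta, x)
        return [b + d for b, d in zip(base_out, dyn_out)]


rng = random.Random(0)
proj = HyperProjector(d_in=4, d_out=3, d_guid=2, rng=rng)
x = [1.0, 0.5, -0.5, 2.0]
out_a = proj(x, guidance=[1.0, 0.0])     # one guidance signal
out_b = proj(x, guidance=[0.0, 1.0])     # a different guidance signal
out_zero = proj(x, guidance=[0.0, 0.0])  # zero guidance: falls back to static
```

In this sketch the same input `x` is projected differently under different guidance vectors, while zero guidance reduces to the static projector. How the guidance is computed and where the generated parameters are injected are design choices this page does not specify.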
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques
