Elevating Visual Perception in Multimodal LLMs with Visual Embedding Distillation
Jitesh Jain, Zhengyuan Yang, Humphrey Shi, Jianfeng Gao, Jianwei Yang

TL;DR
This paper introduces VisPer-LM, a novel method that enhances multimodal large language models by distilling visual perception knowledge from expert encoders into the LLM's hidden states, improving spatial reasoning and visual understanding.
Contribution
The paper proposes a coupled-optimization pretraining approach that infuses visual perception knowledge into the LLM alongside natural language supervision; the resulting models outperform existing baselines on multimodal tasks.
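To make the idea of coupled optimization concrete, the following is a minimal sketch of a loss that adds an embedding-distillation term to the usual next-token loss. It is an illustration under stated assumptions, not the authors' implementation: the function names, the linear probe, the cosine-distance term, and the `embedding_weight` value are all hypothetical, since this summary does not specify the paper's exact objective.

```python
import math

def cross_entropy(logits, target_index):
    """Next-token loss for one position: -log softmax(logits)[target_index]."""
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return log_sum_exp - logits[target_index]

def cosine_distance(predicted, target):
    """1 - cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(p * t for p, t in zip(predicted, target))
    norm_p = math.sqrt(sum(p * p for p in predicted))
    norm_t = math.sqrt(sum(t * t for t in target))
    return 1.0 - dot / (norm_p * norm_t)

def project(hidden_state, weight):
    """Hypothetical linear probe mapping an LLM hidden state into the expert encoder's space."""
    return [sum(w * h for w, h in zip(row, hidden_state)) for row in weight]

def coupled_loss(logits, target_index, hidden_state, probe_weight,
                 expert_embedding, embedding_weight=0.5):
    """Language loss plus a weighted embedding-distillation term.

    The cosine-distance term and the default weight are assumptions made for illustration.
    """
    language_loss = cross_entropy(logits, target_index)
    predicted = project(hidden_state, probe_weight)
    embedding_loss = cosine_distance(predicted, expert_embedding)
    return language_loss + embedding_weight * embedding_loss
```

In this sketch, a hidden state that already matches the expert embedding contributes no extra loss, while a mismatched one is penalized in proportion to `embedding_weight`; minimizing the sum pushes the model toward both objectives at once.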
Findings
VisPer-LM outperforms baselines on various benchmarks.
Embedding optimization improves the quality of the model's visual representations.
VisPer-LM achieves up to an 8.7% improvement on the Depth task in CV-Bench.
Abstract
In recent times, the standard practice for developing MLLMs is to feed features from vision encoder(s) into the LLM and train with natural language supervision. This approach often causes models to lean towards language comprehension and undermine the rich visual perception signals present in the data, which are critical for tasks involving spatial reasoning in the domain of embodied AI and robotics. Is it possible to optimize both at the same time? In this work, we propose VisPer-LM, the first approach that infuses visual perception knowledge from expert vision encoders into the LLM's (of an MLLM) hidden representations. We start by investigating MLLMs trained solely with natural language supervision and identify a positive correlation between the quality of visual representations within these models and their downstream performance. Given this insight, we formulate the objective…
Taxonomy
Topics: Image Retrieval and Classification Techniques · Semantic Web and Ontologies · Video Analysis and Summarization
Methods: Sparse Evolutionary Training
