Elevating Visual Perception in Multimodal LLMs with Visual Embedding Distillation

Jitesh Jain; Zhengyuan Yang; Humphrey Shi; Jianfeng Gao; Jianwei Yang

arXiv:2412.09585 · cs.CV · October 20, 2025

TL;DR

This paper introduces VisPer-LM, a novel method that enhances multimodal large language models by distilling visual perception knowledge from expert encoders into the LLM's hidden states, improving spatial reasoning and visual understanding.

Contribution

The paper proposes a coupled-optimization pretraining approach that infuses visual perception knowledge into the LLM and reports better results than existing baselines on multimodal tasks.

Findings

01. VisPer-LM outperforms baselines on various benchmarks.

02. Embedding optimization improves visual representation quality.

03. VisPer-LM achieves up to an 8.7% improvement on the Depth task in CV-Bench.

Abstract

In recent times, the standard practice for developing MLLMs is to feed features from vision encoder(s) into the LLM and train with natural language supervision. This approach often causes models to lean towards language comprehension and undermine the rich visual perception signals present in the data, which are critical for tasks involving spatial reasoning in the domain of embodied AI and robotics. Is it possible to optimize both at the same time? In this work, we propose VisPer-LM, the first approach that infuses visual perception knowledge from expert vision encoders into the LLM's (of an MLLM) hidden representations. We start by investigating MLLMs trained solely with natural language supervision and identify a positive correlation between the quality of visual representations within these models and their downstream performance. Given this insight, we formulate the objective…
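
The idea of training with language supervision while also pulling the LLM's hidden representations toward an expert vision encoder can be illustrated with a toy coupled loss. This is only a sketch under stated assumptions, not the authors' formulation: the abstract above is truncated before it states the objective, so the cosine-distance term, the weighting scheme, and every name here (`coupled_loss`, `distill_weight`, `predicted_embeddings`, `expert_embeddings`) are hypothetical. It also assumes the predicted and expert embeddings already share a dimension.

```python
import math


def cross_entropy(logits, target_id):
    """Negative log-likelihood of target_id under softmax(logits)."""
    peak = max(logits)  # subtract the max for numerical stability
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum_exp - logits[target_id]


def cosine_distance(u, v, eps=1e-12):
    """1 - cosine similarity; eps guards against zero-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / max(norm_u * norm_v, eps)


def coupled_loss(token_logits, target_ids, predicted_embeddings,
                 expert_embeddings, distill_weight=0.5):
    """Assumed form: language loss + distill_weight * embedding loss.

    token_logits / target_ids: per-position logits and gold token ids.
    predicted_embeddings: vectors read out of the LLM's hidden states
        (hypothetical; any projection head is omitted here).
    expert_embeddings: target vectors from an expert vision encoder.
    """
    language_loss = sum(
        cross_entropy(logits, target)
        for logits, target in zip(token_logits, target_ids)
    ) / len(target_ids)
    distill_loss = sum(
        cosine_distance(pred, expert)
        for pred, expert in zip(predicted_embeddings, expert_embeddings)
    ) / len(expert_embeddings)
    total = language_loss + distill_weight * distill_loss
    return total, language_loss, distill_loss


# Toy usage: one token position, one visual embedding of dimension 2.
total, language_loss, distill_loss = coupled_loss(
    token_logits=[[0.0, 0.0]],
    target_ids=[0],
    predicted_embeddings=[[1.0, 0.0]],
    expert_embeddings=[[0.0, 1.0]],
)
```

In this sketch, setting `distill_weight` to zero recovers training with language supervision alone, which is the baseline practice the abstract describes.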

Code & Models

Repositories

shi-labs/ola-vlm (PyTorch, official)

Taxonomy

Topics: Image Retrieval and Classification Techniques · Semantic Web and Ontologies · Video Analysis and Summarization

Methods: Sparse Evolutionary Training