LinguDistill: Recovering Linguistic Ability in Vision-Language Models via Selective Cross-Modal Distillation

Patrick Amadeus Irawan; Erland Hilman Fuadi; Shanu Kumar; Alham Fikri Aji; Yova Kementchedjhieva

arXiv:2604.00829·cs.CV·April 28, 2026

TL;DR

LinguDistill restores linguistic ability in vision-language models by distilling from the original frozen language model, without adding extra modules, recovering part of the performance lost during multimodal adaptation.

Contribution

The paper introduces an adapter-free distillation approach that uses layer-wise KV-cache sharing to enable vision-conditioned supervision from the frozen LM teacher, recovering language capability without modifying the model architecture.

Findings

01. Recovers approximately 10% of lost performance on language and knowledge benchmarks.

02. Maintains comparable performance on vision-heavy tasks.

03. Does not require additional modules or architectural changes.

Abstract

Adapting pretrained language models (LMs) into vision-language models (VLMs) can degrade their native linguistic capability due to representation shift and cross-modal interference introduced during multimodal adaptation. Such loss is difficult to recover, even with targeted task-specific fine-tuning using standard objectives. Prior recovery approaches typically introduce additional modules that act as intermediate alignment layers to maintain or isolate modality-specific subspaces, which increases architectural complexity, adds parameters at inference time, and limits flexibility across models and settings. We propose LinguDistill, an adapter-free distillation method that restores linguistic capability by utilizing the original frozen LM as a teacher. We overcome the key challenge of enabling vision-conditioned teacher supervision by introducing layer-wise KV-cache sharing, which…
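
To make the teacher-student idea concrete, here is a minimal sketch of a masked distillation loss in plain Python. It is illustrative only: the abstract does not state the exact objective, the selection rule, or how layer-wise KV-cache sharing is wired in, so the KL-divergence form, the `mask` argument, the `temperature` parameter, and every function name below are assumptions rather than the authors' implementation. The sketch only shows how a student's next-token distribution could be pulled toward a frozen teacher's distribution at selected positions.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over one vector of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(
        pi * (math.log(pi + eps) - math.log(qi + eps))
        for pi, qi in zip(p, q)
    )


def masked_distillation_loss(teacher_logits, student_logits, mask, temperature=1.0):
    """Mean KL(teacher || student) over the positions selected by `mask`.

    teacher_logits, student_logits: one list of vocabulary logits per token position.
    mask: one 0/1 flag per position; 1 means "distill here".
    The selection rule that produces `mask` is hypothetical and left to the caller.
    """
    total, count = 0.0, 0
    for t_row, s_row, keep in zip(teacher_logits, student_logits, mask):
        if not keep:
            continue
        p = softmax(t_row, temperature)  # frozen teacher distribution
        q = softmax(s_row, temperature)  # trainable student distribution
        total += kl_divergence(p, q)
        count += 1
    return total / count if count else 0.0
```

In a real training loop the teacher logits would come from the frozen LM and the student logits from the VLM being trained. This sketch deliberately leaves out the vision-conditioning step, since the abstract is truncated before describing it.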
