LinguDistill: Recovering Linguistic Ability in Vision-Language Models via Selective Cross-Modal Distillation
Patrick Amadeus Irawan, Erland Hilman Fuadi, Shanu Kumar, Alham Fikri Aji, Yova Kementchedjhieva

TL;DR
LinguDistill restores the linguistic ability that vision-language models lose during multimodal adaptation. It distills knowledge from the original frozen language model and adds no extra modules.
Contribution
It introduces an adapter-free distillation approach that uses layer-wise KV-cache sharing to recover language capability without modifying the model architecture.
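To make the idea of attending over a shared cache concrete, here is a minimal single-head sketch in plain Python. It shows only the generic mechanism of one model's queries attending over keys and values supplied from elsewhere, indexed per layer. The names `attend`, `student_cache`, and `teacher_queries`, the direction of sharing, and the toy numbers are all assumptions made for illustration; the summary does not specify how LinguDistill wires the cache, so this should not be read as the authors' implementation.

```python
import math

def attend(queries, keys, values):
    """Scaled dot-product attention for a single head over lists of vectors."""
    d = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every cached key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        # Numerically stable softmax over the scores.
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        z = sum(weights)
        weights = [w / z for w in weights]
        # Weighted sum of the cached values.
        dim = len(values[0])
        outputs.append([sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)])
    return outputs

# Hypothetical per-layer cache (assumption: one entry per layer index).
student_cache = {
    0: {"keys": [[1.0, 0.0], [0.0, 1.0]], "values": [[1.0, 2.0], [3.0, 4.0]]},
}

# Hypothetical queries from the other model at the same layer.
teacher_queries = [[1.0, 0.0]]

layer = 0
out = attend(teacher_queries, student_cache[layer]["keys"], student_cache[layer]["values"])
```

Because the keys and values are passed in rather than computed inside `attend`, the attending model needs no new parameters or modules, which is the property the adapter-free framing relies on.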
Findings
Recovers approximately 10% of lost performance on language and knowledge benchmarks.
Maintains comparable performance on vision-heavy tasks.
Does not require additional modules or architectural changes.
Abstract
Adapting pretrained language models (LMs) into vision-language models (VLMs) can degrade their native linguistic capability due to representation shift and cross-modal interference introduced during multimodal adaptation. Such loss is difficult to recover, even with targeted task-specific fine-tuning using standard objectives. Prior recovery approaches typically introduce additional modules that act as intermediate alignment layers to maintain or isolate modality-specific subspaces, which increases architectural complexity, adds parameters at inference time, and limits flexibility across models and settings. We propose LinguDistill, an adapter-free distillation method that restores linguistic capability by utilizing the original frozen LM as a teacher. We overcome the key challenge of enabling vision-conditioned teacher supervision by introducing layer-wise KV-cache sharing, which…
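As a rough illustration of what using a frozen LM as a teacher usually means in practice, the sketch below computes a standard temperature-scaled KL distillation loss between teacher and student next-token logits. This is the generic knowledge-distillation objective, not necessarily the loss used in the paper; the function names, the temperature value, and the toy logits are assumptions.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with temperature scaling."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Mean over token positions of T^2 * KL(teacher || student)."""
    losses = []
    for t_row, s_row in zip(teacher_logits, student_logits):
        p = softmax(t_row, temperature)  # frozen teacher: treated as a fixed target
        q = softmax(s_row, temperature)  # student: the distribution being trained
        losses.append(temperature ** 2 * kl_divergence(p, q))
    return sum(losses) / len(losses)

# Toy logits: two token positions, vocabulary of three (made-up numbers).
teacher = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]
student = [[1.5, 0.7, -0.5], [0.0, 0.5, 2.0]]

loss = distillation_loss(teacher, student)
```

In a real training loop the teacher's logits would be computed without gradients and only the student's parameters would be updated; the loss is zero exactly when the two distributions match at every position.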
