From Images to Words: Efficient Cross-Modal Knowledge Distillation to Language Models from Black-box Teachers

Ayan Sengupta; Shantanu Dixit; Md Shad Akhtar; Tanmoy Chakraborty

arXiv:2603.10877 · cs.CL · March 12, 2026

Open Access

TL;DR

This paper introduces ARMADA, a scalable and efficient cross-modal knowledge distillation framework that transfers knowledge from large vision-language models, including black-box models, to language-only models without extensive pre-training.

Contribution

ARMADA provides a novel alignment-based method for distilling knowledge from multimodal teachers to language models without modifying the teacher or requiring multimodal pre-training.

Findings

01. Achieves up to 3.4% improvement on language understanding tasks

02. Boosts generative reasoning performance by 2.6%

03. Works effectively with large models like DeBERTa, OPT, and LLaMA

Abstract

Knowledge distillation (KD) methods are pivotal in compressing large pre-trained language models into smaller models, ensuring computational efficiency without significantly dropping performance. Traditional KD techniques assume homogeneity in modalities between the teacher (source) and the student (target) models. On the other hand, existing multimodal knowledge distillation methods require modality-specific pre-training of the teacher model, which is computationally infeasible in most cases. In this paper, we introduce ARMADA, an efficient cross-modal knowledge distillation framework designed to transfer knowledge from large vision-language models, including black-box models, to language-only models. Unlike existing KD techniques that rely on the internal structures of multimodal teachers or require computationally expensive pre-training, ARMADA leverages novel alignment techniques to…
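For context on the "homogeneity" assumption the abstract mentions, here is a minimal sketch of classical logit-based distillation, in which the student is trained to match the teacher's temperature-softened output distribution. This is not ARMADA's method, whose details are cut off in this excerpt; the function names `softmax` and `kd_loss` and the default temperature are illustrative assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    Hypothetical helper for illustration: it requires both models to
    score the same set of classes, i.e. a shared output space.
    """
    if len(teacher_logits) != len(student_logits):
        raise ValueError("teacher and student must share an output space")
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * (math.log(pi) - math.log(qi))
             for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl


# A student that matches the teacher incurs no loss; a mismatched one does.
same = kd_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
different = kd_loss([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
```

The length check makes the limitation concrete: this loss needs teacher logits defined over the student's own output space, which a black-box teacher in a different modality does not expose.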

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques