Efficient Speech Translation through Model Compression and Knowledge Distillation

Yasmin Moslem

arXiv:2505.20237 · cs.CL · August 14, 2025

TL;DR

This paper presents a system that combines layer pruning, quantization, and knowledge distillation to cut the size of speech translation models by up to 50% while retaining 97-100% of the teacher models' translation quality.

Contribution

It introduces a novel combination of layer pruning, low-rank adaptation with quantization, and knowledge distillation for efficient speech translation models.

Findings

01

Up to 50% reduction in both model parameters and storage footprint.

02

Pruned (student) models retain 97-100% of the translation quality of the in-domain (teacher) models.

03

Effective for speech translation into German and Chinese.
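
To make the first finding concrete, here is a back-of-the-envelope estimate of how removing half of the parameters affects weight storage at a fixed precision. The parameter count and the 16-bit precision are illustrative assumptions, not figures reported in the paper, and `storage_gb` is a hypothetical helper.

```python
def storage_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate weight storage in gigabytes (1 GB = 10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Illustrative assumption: a model with 8 billion parameters stored in 16 bits.
full_params = 8e9
# Assumption: pruning removes 50% of the parameters.
pruned_params = full_params * 0.5

full_gb = storage_gb(full_params, 16)
pruned_gb = storage_gb(pruned_params, 16)
reduction = 1 - pruned_gb / full_gb

print(f"full: {full_gb:.1f} GB, pruned: {pruned_gb:.1f} GB, reduction: {reduction:.0%}")
```

At a fixed precision, storage scales linearly with the parameter count, which is why a 50% parameter reduction translates into a 50% smaller footprint in this estimate.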

Abstract

Efficient deployment of large audio-language models for speech translation remains challenging due to their significant computational requirements. In this paper, we address this challenge through our system submissions to the "Model Compression" track at the International Conference on Spoken Language Translation (IWSLT 2025). We experiment with a combination of approaches including iterative layer pruning based on layer importance evaluation, low-rank adaptation with 4-bit quantization (QLoRA), and knowledge distillation. In our experiments, we use Qwen2-Audio-7B-Instruct for speech translation into German and Chinese. Our pruned (student) models achieve up to a 50% reduction in both model parameters and storage footprint, while retaining 97-100% of the translation quality of the in-domain (teacher) models.
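
The idea of iterative layer pruning based on layer importance can be sketched as a greedy loop: in each round, evaluate the model with each remaining layer removed, then drop the layer whose removal hurts quality the least. This is a minimal sketch under stated assumptions, not the paper's implementation. The `evaluate` callback, the toy `importance` scores, and the function name are all hypothetical; in practice the callback would score translations produced by the pruned model on a development set.

```python
from typing import Callable, List, Sequence


def iterative_layer_pruning(
    layers: Sequence[int],
    evaluate: Callable[[List[int]], float],
    target_num_layers: int,
) -> List[int]:
    """Greedily remove one layer per round until target_num_layers remain.

    layers: indices of the layers in the original model.
    evaluate: hypothetical callback returning a quality score (higher is
        better) for a model that keeps only the given layer indices.
    """
    if not 0 < target_num_layers <= len(layers):
        raise ValueError("target_num_layers must be in 1..len(layers)")

    kept = list(layers)
    while len(kept) > target_num_layers:
        best_pos = 0
        best_score = float("-inf")
        # Try removing each remaining layer and keep track of the removal
        # that leaves the highest quality (ties go to the earliest layer).
        for pos in range(len(kept)):
            candidate = kept[:pos] + kept[pos + 1:]
            score = evaluate(candidate)
            if score > best_score:
                best_pos, best_score = pos, score
        del kept[best_pos]
    return kept


# Toy stand-in for a real evaluation: each layer contributes a fixed,
# made-up amount to the quality score.
importance = {0: 5.0, 1: 0.5, 2: 3.0, 3: 0.1, 4: 4.0, 5: 1.0}


def toy_evaluate(kept_layers: List[int]) -> float:
    return sum(importance[i] for i in kept_layers)


kept_layers = iterative_layer_pruning(list(importance), toy_evaluate, 3)
print(kept_layers)
```

Re-evaluating after every removal, rather than ranking all layers once, lets the importance estimate account for the layers that have already been dropped. The cost is one evaluation per remaining layer in each round.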

Code & Models

Repositories

ymoslem/model-compression (PyTorch, official)

Taxonomy

Methods: Pruning