Exploring the Limits of Model Compression in LLMs: A Knowledge Distillation Study on QA Tasks
Joyeeta Datta, Niclas Doll, Qusai Ramadan, Zeyd Boukhers

TL;DR
This study examines how far large language models can be compressed via knowledge distillation for question answering, with student models retaining over 90% of teacher performance at up to 57.1% fewer parameters, which matters for resource-constrained deployment.
Contribution
It shows that knowledge distillation can reduce LLM parameter counts by up to 57.1% while retaining over 90% of the teacher's QA performance, and that one-shot prompting yields further gains over zero-shot.
Findings
Student models retain over 90% of teacher performance.
Parameter reduction of up to 57.1%.
One-shot prompting improves results over zero-shot.
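
The two headline figures above are simple ratios. The sketch below shows how they might be computed; the function names are made up for illustration, and the parameter counts and scores are hypothetical values chosen only so that the reduction works out to the quoted 57.1%. The page does not say which teacher/student pair or which scores produced the reported numbers.

```python
def parameter_reduction(teacher_params: float, student_params: float) -> float:
    """Fraction of parameters removed when going from teacher to student."""
    return 1.0 - student_params / teacher_params


def performance_retention(teacher_score: float, student_score: float) -> float:
    """Student score expressed as a fraction of the teacher score."""
    return student_score / teacher_score


# Hypothetical values (assumptions, not taken from the paper).
reduction = parameter_reduction(teacher_params=7e9, student_params=3e9)
retention = performance_retention(teacher_score=80.0, student_score=74.0)

print(f"parameter reduction: {reduction:.1%}")
print(f"performance retained: {retention:.1%}")
```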
Abstract
Large Language Models (LLMs) have demonstrated outstanding performance across a range of NLP tasks; however, their computational demands hinder their deployment in real-world, resource-constrained environments. This work investigates the extent to which LLMs can be compressed using Knowledge Distillation (KD) while maintaining strong performance on Question Answering (QA) tasks. We evaluate student models distilled from the Pythia and Qwen2.5 families on two QA benchmarks, SQuAD and MLQA, under zero-shot and one-shot prompting conditions. Results show that student models retain over 90% of their teacher models' performance while reducing parameter counts by up to 57.1%. Furthermore, one-shot prompting yields additional performance gains over zero-shot setups for both model families. These findings underscore the trade-off between model efficiency and task performance, demonstrating that…
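
The abstract does not state which distillation objective was used. As a rough illustration of what knowledge distillation typically involves, here is a minimal sketch of the classic soft-target loss: the KL divergence between temperature-softened teacher and student distributions, scaled by the squared temperature. This is an assumed, commonly used formulation, not necessarily the authors' method; the function names, logits, and temperature are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) over softened distributions, times T^2."""
    p = softmax(teacher_logits, temperature)  # soft targets from the teacher
    q = softmax(student_logits, temperature)  # student predictions
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl


# Hypothetical logits over a three-token vocabulary.
teacher = [4.0, 1.0, 0.5]
student = [2.5, 1.5, 0.5]

loss_mismatch = kd_loss(teacher, student)
loss_match = kd_loss(teacher, teacher)
print(loss_mismatch, loss_match)
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two diverge. In practice this term is often combined with a standard cross-entropy loss on the ground-truth answers.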
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
