Knowledge-Augmented Reasoning Distillation for Small Language Models in Knowledge-Intensive Tasks
Minki Kang, Seanie Lee, Jinheon Baek, Kenji Kawaguchi, Sung Ju Hwang

TL;DR
This paper introduces KARD, a method that improves small language models' reasoning by fine-tuning them to generate rationales obtained from large models, with knowledge retrieved from an external knowledge base added as context. It significantly improves performance on knowledge-intensive tasks.
Contribution
The paper proposes KARD, a novel knowledge-augmented distillation approach that lets small LMs use retrieved external knowledge when reasoning, compensating for their limited capacity to memorize it.
Findings
KARD improves small T5 and GPT models on reasoning datasets.
With KARD, 250M T5 models outperform larger, fine-tuned 3B models.
Performance gains are significant on MedQA-USMLE, StrategyQA, and OpenbookQA.
Abstract
Large Language Models (LLMs) have shown promising performance in knowledge-intensive reasoning tasks that require a compound understanding of knowledge. However, deployment of the LLMs in real-world applications can be challenging due to their high computational requirements and concerns on data privacy. Previous studies have focused on building task-specific small Language Models (LMs) by fine-tuning them with labeled data or distilling LLMs. However, these approaches are ill-suited for knowledge-intensive reasoning tasks due to the limited capacity of small LMs in memorizing the knowledge required. Motivated by our theoretical analysis on memorization, we propose Knowledge-Augmented Reasoning Distillation (KARD), a novel method that fine-tunes small LMs to generate rationales obtained from LLMs with augmented knowledge retrieved from an external knowledge base. Moreover, we further…
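The core idea in the abstract (fine-tune a small LM to produce an LLM-written rationale, given knowledge retrieved from an external knowledge base) can be illustrated with a minimal data-preparation sketch. Everything named here is an assumption for illustration: the word-overlap retriever is a toy stand-in rather than the retriever used in the paper, the question is used as the retrieval query for simplicity, and the `Knowledge:` / `Question:` prompt layout, the function names, and the sample passages are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    # Lowercase and keep alphanumeric runs only (toy tokenizer, an assumption).
    return re.findall(r"[a-z0-9]+", text.lower())


def overlap_score(query, passage):
    # Count shared tokens between query and passage (stand-in for a real retriever).
    q, p = Counter(tokenize(query)), Counter(tokenize(passage))
    return sum(min(q[t], p[t]) for t in q)


def retrieve(query, corpus, k=2):
    # Return the k passages with the highest overlap score; ties keep corpus order.
    ranked = sorted(corpus, key=lambda passage: overlap_score(query, passage), reverse=True)
    return ranked[:k]


def build_training_example(question, rationale, corpus, k=2):
    # Input: retrieved passages followed by the question (hypothetical layout).
    # Target: the rationale obtained from the large model, which the small LM learns to generate.
    passages = retrieve(question, corpus, k=k)
    knowledge = "\n".join(f"Knowledge: {p}" for p in passages)
    return {"input": f"{knowledge}\nQuestion: {question}", "target": rationale}


if __name__ == "__main__":
    corpus = [
        "Aspirin inhibits cyclooxygenase and reduces prostaglandin synthesis.",
        "The Eiffel Tower is located in Paris.",
        "Prostaglandin synthesis is involved in inflammation and pain.",
    ]
    example = build_training_example(
        question="Which enzyme does aspirin inhibit?",
        rationale="Aspirin blocks cyclooxygenase, so the answer is cyclooxygenase.",
        corpus=corpus,
        k=1,
    )
    print(example["input"])
    print(example["target"])
```

Pairs built this way could then be passed to any standard sequence-to-sequence fine-tuning loop; the sketch stops at data preparation because the abstract does not specify training details.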
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Cosine Annealing · Adafactor · Adam · Inverse Square Root Schedule · Discriminative Fine-Tuning · Weight Decay
