Training Task Experts through Retrieval Based Distillation
Jiaxin Ge, Xueying Jia, Vijay Viswanathan, Hongyin Luo, Graham Neubig

TL;DR
ReBase is a retrieval-based distillation method that enhances domain-specific data quality and reasoning capabilities of models by leveraging online sources, leading to significant performance improvements on multiple benchmarks.
Contribution
The paper introduces ReBase, a novel retrieval-based distillation approach that improves data diversity and reasoning in models for specialized tasks.
Findings
Up to 7.8% performance improvement on SQuAD
Enhanced reasoning with Chain-of-Thought generation
Gains of 1.37% on MNLI and 1.94% on BigBench Hard
Abstract
One of the most reliable ways to create deployable models for specialized tasks is to obtain an adequate amount of high-quality task-specific data. However, for specialized tasks, often such datasets do not exist. Existing methods address this by creating such data from large language models (LLMs) and then distilling such knowledge into smaller models. However, these methods are limited by the quality of the LLMs' output, and tend to generate repetitive or incorrect data. In this work, we present Retrieval Based Distillation (ReBase), a method that first retrieves data from rich online sources and then transforms them into domain-specific data. This method greatly enhances data diversity. Moreover, ReBase generates Chain-of-Thought reasoning and distills the reasoning capacity of LLMs. We test our method on 4 benchmarks and results show that our method significantly improves performance…
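The retrieve-then-transform flow described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the bag-of-words `embed` function stands in for a real text encoder, `placeholder_transform` stands in for the LLM call that rewrites a retrieved row into a task-specific example, and all function names are hypothetical.

```python
import math
from collections import Counter
from typing import Callable


def embed(text: str) -> Counter:
    """Toy bag-of-words vector (assumption: a real system would use a neural encoder)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, datastore: list[str], k: int = 2) -> list[str]:
    """Return the k datastore rows most similar to the query."""
    q = embed(query)
    ranked = sorted(datastore, key=lambda row: cosine(q, embed(row)), reverse=True)
    return ranked[:k]


def placeholder_transform(instruction: str, row: str) -> dict:
    """Hypothetical stand-in for the LLM step that rewrites a retrieved row
    into a task-formatted example with Chain-of-Thought reasoning."""
    return {"instruction": instruction, "source": row}


def rebase_sketch(
    instruction: str,
    datastore: list[str],
    transform: Callable[[str, str], dict] = placeholder_transform,
    k: int = 2,
) -> list[dict]:
    """Retrieve the k most relevant rows, then transform each into a training example."""
    return [transform(instruction, row) for row in retrieve(instruction, datastore, k)]
```

In this sketch the training set for the smaller model would be the list returned by `rebase_sketch`; how the retrieved rows are actually transformed depends on the teacher LLM and prompt, which are not specified here.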
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
S1. The paper presents the efficacy of retrieval augmentation for new data generation.
S2. The proposed method demonstrates improvement over baseline methods.
W1. Missing important related work "Retrieval-Augmented Data Augmentation for Low-Resource Domain Tasks" [1].
W2. Usage of the HuggingFace datasets may cause contamination if not explicitly taken care of.
W3. The choice of teacher model is unclear as the ability varies for tasks such as coding or math. How is coding or math correctness assessed for the synthetic samples?
W4. It is vague how the data sources are selected.
[1] Seo, Minju, et al. "Retrieval-augmented data augmentation for low-r
- Data Diversity: ReBase enhances data diversity by retrieving data from multiple online sources.
- Reasoning Distillation: ReBase introduces a Chain-of-Thought transformation phase, allowing smaller models to be trained using the reasoning processes generated by larger models, which is particularly useful for reasoning tasks.
- Performance Improvement: In multiple benchmark tests, ReBase has demonstrated significant performance improvements compared to traditional data generation methods.
- The LLM used in the text is Llama3-8B. Have you tried any other models (different parameter counts, different model families)?
- Processing large amounts of data into a datastore and retrieving from it can be computationally costly, especially computing cosine similarities. Have you made any effort to reduce this cost?
- Have there been any assessments of the data quality after the transformation step, for example checking for hallucinations and whether the CoT is complete and correct?
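On the reviewer's question about the cost of cosine similarity, one common way to cut the per-query work is to normalize every datastore vector once at indexing time, so that each query needs only dot products, and to select the top k with a heap rather than a full sort. The sketch below illustrates that idea under stated assumptions; it is not taken from the paper, the function names are hypothetical, and at large scale an approximate nearest-neighbor index would be another option.

```python
import heapq
import math


def normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def build_index(vectors: list[list[float]]) -> list[list[float]]:
    """Normalize all datastore vectors once, up front."""
    return [normalize(v) for v in vectors]


def top_k(query: list[float], index: list[list[float]], k: int) -> list[int]:
    """Return indices of the k rows with the highest cosine similarity to the query.

    Because rows are pre-normalized, cosine similarity is a plain dot product,
    and heapq.nlargest avoids sorting the whole datastore.
    """
    q = normalize(query)
    scores = ((sum(a * b for a, b in zip(q, row)), i) for i, row in enumerate(index))
    return [i for _, i in heapq.nlargest(k, scores)]
```

For a datastore of n rows this keeps selection at roughly O(n log k) per query instead of O(n log n), and the normalization cost is paid once rather than on every comparison.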
The motivation is practical and significant, offering a cost-effective way to obtain high-quality domain-specific data from online data. The paper demonstrates the performance improvement of the ReBase method on multiple benchmarks, achieving 7.8%, 1.37%, and 1.94% improvements on the SQuAD, MNLI, and BigBench Hard datasets, respectively. These results demonstrate the effectiveness of the method.
Although ReBase performs well on specific benchmark tests, its generalization ability on other types of tasks has not been fully validated. Because the three datasets used in the paper are simple and relatively old, there is a risk of label exposure. Meanwhile, these three datasets are all generic datasets and do not match the domain-specific setting the paper describes. For specific domains and unseen types of data, online sources may not contain similar data, and enhancing through retrieval may not b
Taxonomy
Topics: AI in Service Interactions · Intelligent Tutoring Systems and Adaptive Learning · Industrial Automation and Control Systems
