Training Task Experts through Retrieval Based Distillation

Jiaxin Ge; Xueying Jia; Vijay Viswanathan; Hongyin Luo; Graham Neubig

arXiv:2407.05463·cs.CL·July 9, 2024

Open Access · 3 Reviews

TL;DR

ReBase is a retrieval-based distillation method that enhances domain-specific data quality and reasoning capabilities of models by leveraging online sources, leading to significant performance improvements on multiple benchmarks.

Contribution

The paper introduces ReBase, a novel retrieval-based distillation approach that improves data diversity and reasoning in models for specialized tasks.

Findings

01. Up to 7.8% performance improvement on SQuAD

02. Enhanced reasoning with Chain-of-Thought generation

03. Significant gains on multiple benchmarks

Abstract

One of the most reliable ways to create deployable models for specialized tasks is to obtain an adequate amount of high-quality task-specific data. However, for specialized tasks, such datasets often do not exist. Existing methods address this by creating such data from large language models (LLMs) and then distilling such knowledge into smaller models. However, these methods are limited by the quality of the LLMs' output, and tend to generate repetitive or incorrect data. In this work, we present Retrieval Based Distillation (ReBase), a method that first retrieves data from rich online sources and then transforms them into domain-specific data. This method greatly enhances data diversity. Moreover, ReBase generates Chain-of-Thought reasoning and distills the reasoning capacity of LLMs. We test our method on 4 benchmarks and results show that our method significantly improves performance…
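
The retrieve-then-transform flow described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the bag-of-words `embed` function, the `retrieve` and `transform` helpers, the prompt wording, and the toy datastore are all hypothetical stand-ins for the encoder, datastore, and LLM prompts that the paper actually uses.

```python
import math
from collections import Counter


def embed(text):
    """Hypothetical stand-in encoder: bag-of-words term counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, datastore, k=2):
    """Return the k datastore rows most similar to the query."""
    q = embed(query)
    ranked = sorted(datastore, key=lambda row: cosine(q, embed(row)), reverse=True)
    return ranked[:k]


def transform(row, task_instruction, llm):
    """Ask an LLM (any callable str -> str) to rewrite a retrieved row
    as a task-specific example with step-by-step reasoning.
    The prompt text is an assumption made for illustration."""
    prompt = (
        f"{task_instruction}\n\n"
        f"Source: {row}\n\n"
        "Rewrite this as a task example and include step-by-step reasoning."
    )
    return llm(prompt)


# Toy datastore standing in for rows gathered from online datasets.
datastore = [
    "the capital of france is paris",
    "a recipe for tomato soup with basil",
    "paris is located on the river seine",
]

top = retrieve("what is the capital of france", datastore, k=2)

# A fake LLM keeps the sketch runnable without any model access.
fake_llm = lambda prompt: "echo: " + prompt
example = transform(top[0], "Answer geography questions.", fake_llm)
```

In practice the fake LLM would be replaced by a call to a teacher model, and the transformed examples would become the training set for the smaller task expert.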

Peer Reviews

Decision·ICLR 2025 Conference Withdrawn Submission

Reviewer 01 · Rating 3 · Confidence 4

Strengths

S1. The paper presents the efficacy of retrieval augmentation for new data generation.
S2. The proposed method demonstrates improvement over baseline methods.

Weaknesses

W1. Missing important related work "Retrieval-Augmented Data Augmentation for Low-Resource Domain Tasks"[1].
W2. Usage of the HuggingFace datasets may cause contamination if not explicitly taken care of.
W3. The choice of teacher model is unclear as the ability varies for tasks such as coding or math. How is coding or math correctness assessed for the synthetic samples?
W4. It is vague how the data sources are selected.
[1] Seo, Minju, et al. "Retrieval-augmented data augmentation for low-r
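
The contamination concern in W2 is commonly handled by dropping retrieved rows that share long word n-grams with benchmark test examples. The sketch below shows one such filter. It is a generic technique offered as an assumption, not something the paper or the reviewer describes; the function names and the default window of 8 tokens are hypothetical choices.

```python
def ngrams(text, n=8):
    """Set of lowercase word n-grams in the text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def is_contaminated(candidate, test_examples, n=8):
    """True if the candidate shares at least one n-gram with any test example."""
    cand = ngrams(candidate, n)
    return any(cand & ngrams(t, n) for t in test_examples)


def filter_datastore(rows, test_examples, n=8):
    """Keep only rows with no n-gram overlap against the test set."""
    return [r for r in rows if not is_contaminated(r, test_examples, n)]


# Toy usage with a short window so the overlap is visible.
test_set = ["who wrote the novel pride and prejudice"]
rows = [
    "Question: who wrote the novel pride and prejudice? Answer: Jane Austen",
    "the river seine flows through paris",
]
clean = filter_datastore(rows, test_set, n=4)
```

Exact n-gram matching misses paraphrased leaks, so a filter like this gives a lower bound on contamination rather than a guarantee of its absence.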

Reviewer 02 · Rating 5 · Confidence 4

Strengths

- Data Diversity: ReBase enhances data diversity by retrieving data from multiple online sources.
- Reasoning Distillation: ReBase introduces a Chain-of-Thought transformation phase, allowing smaller models to be trained using the reasoning processes generated by larger models, which is particularly useful for reasoning tasks.
- Performance Improvement: In multiple benchmark tests, ReBase has demonstrated significant performance improvements compared to traditional data generation methods.

Weaknesses

- The LLM used in the text is Llama3-8B. Have you tried any other models (different parameter counts, different model families)?
- Processing large amounts of data into a datastore and retrieving from it can be computationally costly, especially computing cosine similarities. Have any efforts been made to reduce this cost?
- Have there been any assessments of the data quality after TRANSFORMATION, such as checking for hallucinations or whether the CoT is complete and correct?
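
One standard way to reduce the cosine-similarity cost raised above is to normalize every datastore vector once at indexing time, so that each query needs only dot products, and to select the top k with a heap instead of a full sort. The sketch below illustrates that idea in plain Python. It is a generic optimization assumed for illustration, not a technique the paper reports; `normalize`, `build_index`, and `top_k` are hypothetical names.

```python
import heapq
import math


def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def build_index(vectors):
    """Normalize all datastore vectors once, up front."""
    return [normalize(v) for v in vectors]


def top_k(query, index, k):
    """Return the k best (score, row_index) pairs.
    With unit vectors, the dot product equals cosine similarity."""
    q = normalize(query)
    scores = (
        (sum(a * b for a, b in zip(q, row)), i)
        for i, row in enumerate(index)
    )
    return heapq.nlargest(k, scores)


index = build_index([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
hits = top_k([2.0, 0.0], index, k=2)
```

For datastores too large for exhaustive scoring, approximate nearest-neighbor indexes trade a small amount of recall for much lower query time.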

Reviewer 03 · Rating 5 · Confidence 4

Strengths

The motivation is practical and significant, offering a cost-effective solution for obtaining high-quality domain-specific data from online data. The paper demonstrates the performance improvement of the ReBase method on multiple benchmarks, achieving 7.8%, 1.37%, and 1.94% performance improvements on the SQuAD, MNLI, and BigBench Hard datasets, respectively. These results demonstrate the effectiveness of the method.

Weaknesses

Although ReBase performs well on specific benchmark tests, its generalization ability on other types of tasks has not been fully validated. Because the three datasets used in the article are simple and dated, there is a risk of label exposure. Moreover, all three are generic datasets and do not match the domain-specific setting described in the article. For specific domains and unseen types of data, similar data may not exist online, and enhancing through retrieval may not b

Taxonomy

Topics: AI in Service Interactions · Intelligent Tutoring Systems and Adaptive Learning · Industrial Automation and Control Systems