Superfiltering: Weak-to-Strong Data Filtering for Fast Instruction-Tuning

Ming Li; Yong Zhang; Shwai He; Zhitao Li; Hongyu Zhao; Jianzong Wang; Ning Cheng; Tianyi Zhou

arXiv:2402.00530·cs.CL·June 11, 2024·2 cites

TL;DR

Superfiltering introduces a cost-effective data filtering method using weaker models to select high-quality instruction data, resulting in faster filtering and improved performance of larger language models.

Contribution

The paper proposes Superfiltering, a novel approach that leverages smaller, weaker models for data filtering, reducing costs and maintaining effectiveness in instruction tuning.

Findings

1. Superfiltering speeds up data filtering significantly.

2. Filtered data leads to better performance on benchmarks.

3. Weak models can reliably perceive instruction difficulty.

Abstract

Instruction tuning is critical to improve LLMs but usually suffers from low-quality and redundant data. Data filtering for instruction tuning has proved important in improving both the efficiency and performance of the tuning process. But it also leads to extra cost and computation due to the involvement of LLMs in this process. To reduce the filtering cost, we study Superfiltering: Can we use a smaller and weaker model to select data for finetuning a larger and stronger model? Despite the performance gap between weak and strong language models, we find their highly consistent capability to perceive instruction difficulty and data selection results. This enables us to use a much smaller and more efficient model to filter the instruction data used to train a larger language model. Not only does it largely speed up the data filtering, but the filtered-data-finetuned LLM achieves even…
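The weak-to-strong selection idea in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation. The abstract does not say how difficulty is scored, so the sketch assumes a score equal to the ratio of the response's perplexity with and without the instruction, and assumes the highest-scoring samples are kept. The field names `cond_loss` and `uncond_loss`, the helper names, and the toy numbers are all hypothetical. In practice the losses would come from a small language model.

```python
import math
from typing import Dict, List


def difficulty_score(cond_loss: float, uncond_loss: float) -> float:
    """Assumed score: perplexity of the response given the instruction,
    divided by perplexity of the response alone.
    Losses are mean per-token negative log-likelihoods."""
    return math.exp(cond_loss - uncond_loss)


def select_top_fraction(samples: List[Dict], fraction: float) -> List[Dict]:
    """Keep the highest-scoring fraction of samples (at least one)."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    ranked = sorted(
        samples,
        key=lambda s: difficulty_score(s["cond_loss"], s["uncond_loss"]),
        reverse=True,
    )
    keep = max(1, math.ceil(len(samples) * fraction))
    return ranked[:keep]


# Toy losses standing in for the output of a small (weak) scoring model.
toy_samples = [
    {"id": "a", "cond_loss": 1.0, "uncond_loss": 2.0},
    {"id": "b", "cond_loss": 1.8, "uncond_loss": 2.0},
    {"id": "c", "cond_loss": 0.5, "uncond_loss": 2.5},
    {"id": "d", "cond_loss": 2.0, "uncond_loss": 2.1},
]

selected = select_top_fraction(toy_samples, fraction=0.5)
print([s["id"] for s in selected])
```

The selected subset would then be used to fine-tune the larger model. Only the scoring step needs the small model, which is where the cost saving described above would come from.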

Code & Models

Repositories

tianyi-lab/superfiltering (PyTorch, official)
