Superfiltering: Weak-to-Strong Data Filtering for Fast Instruction-Tuning
Ming Li, Yong Zhang, Shwai He, Zhitao Li, Hongyu Zhao, Jianzong Wang, Ning Cheng, Tianyi Zhou

TL;DR
Superfiltering introduces a cost-effective data filtering method using weaker models to select high-quality instruction data, resulting in faster filtering and improved performance of larger language models.
Contribution
The paper proposes Superfiltering, a novel approach that leverages smaller, weaker models for data filtering, reducing costs and maintaining effectiveness in instruction tuning.
Findings
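Superfiltering substantially speeds up data filtering, because a much smaller model does the scoring.
Finetuning on the filtered data improves the benchmark performance of larger LLMs.
Weak and strong models perceive instruction difficulty consistently, so a weak model can stand in for a strong one when selecting data.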
Abstract
Instruction tuning is critical to improve LLMs but usually suffers from low-quality and redundant data. Data filtering for instruction tuning has proved important in improving both the efficiency and performance of the tuning process. But it also leads to extra cost and computation due to the involvement of LLMs in this process. To reduce the filtering cost, we study Superfiltering: Can we use a smaller and weaker model to select data for finetuning a larger and stronger model? Despite the performance gap between weak and strong language models, we find their highly consistent capability to perceive instruction difficulty and data selection results. This enables us to use a much smaller and more efficient model to filter the instruction data used to train a larger language model. Not only does it largely speed up the data filtering, but the filtered-data-finetuned LLM achieves even…
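The selection step described in the abstract can be illustrated with a minimal sketch. It assumes each instruction sample already has a scalar difficulty score from a weak model and from a strong model, that higher scores mean harder samples, and that the hardest fraction is kept; none of these details are stated in the text above. The function names, the sample ids, and all score values are hypothetical and chosen only for illustration.

```python
from typing import Dict, Set


def select_top_fraction(scores: Dict[str, float], fraction: float) -> Set[str]:
    """Return the ids of the highest-scoring samples (always at least one)."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    k = max(1, int(len(scores) * fraction))
    # Sort sample ids by score, hardest first (assumption: higher = harder).
    ranked = sorted(scores, key=lambda sample_id: scores[sample_id], reverse=True)
    return set(ranked[:k])


def selection_overlap(
    weak: Dict[str, float], strong: Dict[str, float], fraction: float
) -> float:
    """Share of the strong model's selection that the weak model also selects."""
    weak_pick = select_top_fraction(weak, fraction)
    strong_pick = select_top_fraction(strong, fraction)
    return len(weak_pick & strong_pick) / len(strong_pick)


# Hypothetical difficulty scores for five instruction samples.
weak_scores = {"a": 0.91, "b": 0.35, "c": 0.78, "d": 0.52, "e": 0.66}
strong_scores = {"a": 0.80, "b": 0.20, "c": 0.85, "d": 0.40, "e": 0.45}

selected = select_top_fraction(weak_scores, 0.4)
overlap = selection_overlap(weak_scores, strong_scores, 0.4)
print(sorted(selected), overlap)
```

In this toy data the two models rank the samples differently but still agree on which samples fall in the top 40%, which is the kind of agreement that would let the cheaper model do the filtering.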
Taxonomy
Topics: Image Enhancement Techniques · Machine Learning and ELM · Machine Learning and Data Classification
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
