Large Language Models are Demonstration Pre-Selectors for Themselves
Jiarui Jin, Yuwei Wu, Haoxuan Li, Xiaoting He, Weinan Zhang, Yiming Yang, Yong Yu, Jun Wang, Mengyue Yang

TL;DR
This paper introduces FEEDER, a pre-selection framework that efficiently identifies a representative subset of training data for large language models, reducing data size by over 20% while maintaining performance in in-context learning and fine-tuning.
Contribution
FEEDER is a pre-selection method that uses sufficiency and necessity metrics, together with a tree-based algorithm, to select representative training examples tailored to a specific LLM, reducing the cost of repeatedly retrieving demonstrations from the full training set.
Findings
Reduces training data size by over 20% without performance loss.
Maintains comparable in-context learning performance with fewer demonstrations.
Enhances fine-tuning efficiency through bi-level optimization.
Abstract
In-context learning (ICL) with large language models (LLMs) delivers strong few-shot performance by choosing few-shot demonstrations from the entire training data. However, existing ICL methods, which rely on similarity or diversity scores to choose demonstrations, incur high computational costs due to repeated retrieval from large-scale datasets for each query. To this end, we propose FEEDER (FEw yet Essential Demonstration prE-selectoR), a novel pre-selection framework that identifies a representative subset of demonstrations containing the most representative examples in the training data, tailored to specific LLMs. To construct this subset, we introduce the "sufficiency" and "necessity" metrics in the pre-selection stage and design a tree-based algorithm to identify representative examples efficiently. Once pre-selected, this representative subset can effectively replace the full…
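To make the pre-selection idea concrete, here is a minimal Python sketch. It is not the paper's tree-based algorithm, and the abstract does not give formal definitions of the two metrics, so everything below is an assumption for illustration: the hypothetical predicate `is_sufficient(a, b)` stands in for "showing example `a` as a demonstration is enough for the LLM to answer `b` correctly", and an example is treated as unnecessary if the remaining kept examples still cover the whole training set. The toy predicate `same_label` replaces a real LLM call.

```python
from typing import Callable, List, Sequence, Tuple

Example = Tuple[str, str]  # (input text, label)


def preselect(
    examples: Sequence[Example],
    is_sufficient: Callable[[Example, Example], bool],
) -> List[Example]:
    """Greedy pre-selection sketch (an assumption, not FEEDER's actual algorithm)."""
    kept: List[int] = []

    # Sufficiency pass: keep an example only if no already-kept example
    # is sufficient for it.
    for i, candidate in enumerate(examples):
        if not any(is_sufficient(examples[j], candidate) for j in kept):
            kept.append(i)

    # Necessity pass: drop a kept example if the others still cover
    # every example in the training set without it.
    for i in list(kept):
        rest = [j for j in kept if j != i]
        still_covered = all(
            any(is_sufficient(examples[j], e) for j in rest) for e in examples
        )
        if still_covered:
            kept = rest

    return [examples[j] for j in kept]


# Hypothetical stand-in for an LLM check: a demonstration is "sufficient"
# for another example when both share the same label.
def same_label(demo: Example, target: Example) -> bool:
    return demo[1] == target[1]


toy_examples: List[Example] = [
    ("great movie", "pos"),
    ("loved it", "pos"),
    ("terrible plot", "neg"),
    ("awful acting", "neg"),
    ("quite good", "pos"),
]

subset = preselect(toy_examples, same_label)
```

With a real model, `is_sufficient` would have to query the LLM, which is why the selected subset is specific to that model. The sketch calls the predicate a quadratic number of times; per the abstract, the tree-based algorithm is the paper's way of identifying representative examples efficiently.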
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Topic Modeling · Machine Learning and Data Classification
