SelectLLM: Can LLMs Select Important Instructions to Annotate?
Ritik Sachin Parkar, Jaehyung Kim, Jong Inn Park, Dongyeop Kang

TL;DR
SelectLLM is a framework that uses clustering and LLM prompting to select high-quality instructions for annotation, improving dataset quality efficiently.
Contribution
It introduces a novel method combining coreset clustering and LLM prompting for instruction selection, outperforming existing approaches.
Findings
Outperforms state-of-the-art methods like Alpagasus
Effective across multiple LLMs such as ChatGPT and LLaMA-3.1-70B
Maintains high performance on both human and synthetic datasets
Abstract
Instruction tuning benefits from large and diverse datasets; however, creating such datasets involves a high cost of human labeling. While synthetic datasets generated by large language models (LLMs) have partly solved this issue, they often contain low-quality data. One effective solution is to selectively annotate unlabeled instructions, especially given the relative ease of acquiring unlabeled instructions or texts from various sources. However, how to select unlabeled instructions is not well explored, especially in the context of LLMs. Therefore, we introduce SelectLLM, an alternative framework that leverages the capabilities of LLMs to select unlabeled instructions more effectively. Specifically, SelectLLM consists of two key steps: coreset-based clustering of unlabeled instructions to increase diversity, and prompting an LLM to identify the most beneficial instructions within…
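The two steps named in the abstract (cluster for diversity, then ask an LLM to pick within each cluster) can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: k-center greedy is used here as one common coreset-style heuristic, the embeddings are supplied by the caller, the prompt wording is invented, and `ask_llm` is a hypothetical callable standing in for whatever LLM client is used.

```python
import math
import re
from typing import Callable, List, Sequence

Vector = Sequence[float]


def k_center_greedy(embeddings: List[Vector], k: int) -> List[int]:
    """Pick up to k center indices; each new center is the point farthest
    from all centers chosen so far (a common coreset heuristic)."""
    centers = [0]  # assumption: start from the first point for determinism
    min_dist = [math.dist(e, embeddings[0]) for e in embeddings]
    while len(centers) < min(k, len(embeddings)):
        nxt = max(range(len(embeddings)), key=min_dist.__getitem__)
        centers.append(nxt)
        min_dist = [min(d, math.dist(e, embeddings[nxt]))
                    for d, e in zip(min_dist, embeddings)]
    return centers


def cluster_by_center(embeddings: List[Vector], centers: List[int]) -> List[List[int]]:
    """Assign every point to its nearest center; returns index lists per cluster."""
    clusters: List[List[int]] = [[] for _ in centers]
    for i, e in enumerate(embeddings):
        nearest = min(range(len(centers)),
                      key=lambda c: math.dist(e, embeddings[centers[c]]))
        clusters[nearest].append(i)
    return clusters


def build_prompt(candidates: List[str], n_select: int) -> str:
    """Hypothetical prompt wording; the paper's actual prompt is not shown here."""
    numbered = "\n".join(f"{i + 1}. {text}" for i, text in enumerate(candidates))
    return (f"Pick the {n_select} instruction(s) most useful to annotate.\n"
            f"{numbered}\nAnswer with the numbers only.")


def select_instructions(instructions: List[str], embeddings: List[Vector],
                        k: int, n_per_cluster: int,
                        ask_llm: Callable[[str], str]) -> List[str]:
    """Cluster the instructions, then let `ask_llm` choose within each cluster."""
    centers = k_center_greedy(embeddings, k)
    selected: List[str] = []
    for members in cluster_by_center(embeddings, centers):
        candidates = [instructions[i] for i in members]
        reply = ask_llm(build_prompt(candidates, n_per_cluster))
        # Parse 1-based numbers from the reply, dropping invalid or repeated ones.
        picks: List[int] = []
        for token in re.findall(r"\d+", reply):
            pos = int(token) - 1
            if 0 <= pos < len(candidates) and pos not in picks:
                picks.append(pos)
        selected.extend(candidates[p] for p in picks[:n_per_cluster])
    return selected


if __name__ == "__main__":
    demo_instructions = ["Add 2 and 3.", "Add 4 and 5.",
                         "Write a haiku.", "Write a limerick."]
    demo_embeddings = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
    # Stub LLM that always answers "1"; a real client would go here.
    print(select_instructions(demo_instructions, demo_embeddings, 2, 1, lambda p: "1"))
```

Passing the LLM as a plain callable keeps the sketch runnable offline and makes the selection step easy to swap or mock; the reply parser tolerates extra text around the numbers, since free-form model output is not guaranteed to follow the requested format.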
Taxonomy
Topics: Natural Language Processing Techniques · Text Readability and Simplification · Mathematics, Computing, and Information Processing
