Automatic Labelling with Open-source LLMs using Dynamic Label Schema Integration
Thomas Walshe, Sae Young Moon, Chunyang Xiao, Yawwani Gunawardana, Fran Silavong

TL;DR
This paper presents Retrieval Augmented Classification (RAC), a novel method that leverages open-source LLMs for automatic data labelling by dynamically integrating label schemas, improving performance on high-cardinality tasks while addressing privacy and cost issues.
Contribution
The paper introduces RAC, a new approach that enhances open-source LLM-based data labelling through dynamic label schema integration and iterative inference, outperforming naive methods.
Findings
RAC improves labelling accuracy on high-cardinality tasks.
Dynamic label schema integration enhances performance over static descriptions.
RAC enables automatic labelling with a trade-off between quality and coverage.
Abstract
Acquiring labelled training data that meets quantity and quality requirements remains a costly task in real-world machine learning projects. Recently, Large Language Models (LLMs), notably GPT-4, have shown great promise in labelling data with high accuracy. However, privacy and cost concerns prevent the ubiquitous use of GPT-4. In this work, we explore effectively leveraging open-source models for automatic labelling. We identify integrating label schema as a promising technology but find that naively using the label description for classification leads to poor performance on high-cardinality tasks. To address this, we propose Retrieval Augmented Classification (RAC), in which the LLM performs inference for one label at a time using the corresponding label schema; we start with the most related label and iterate until a label is chosen by the LLM. We show that our method, which dynamically…
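The iterative procedure described in the abstract (rank labels by relatedness, then query the LLM about one label at a time until it accepts one) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `rank_labels`, `rac_label`, `judge`, and `max_candidates` are hypothetical, the Jaccard word-overlap ranking is a stand-in for whatever retriever the paper uses, and `judge` is a placeholder for a yes/no call to an open-source LLM. Returning `None` once the candidate budget is exhausted is one assumed way to express the quality versus coverage trade-off listed in the findings.

```python
import re
from typing import Callable, Dict, List, Optional, Set

# Hypothetical signature for the per-label LLM call: (text, label, description) -> accept?
Judge = Callable[[str, str, str], bool]


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def rank_labels(text: str, schema: Dict[str, str]) -> List[str]:
    """Order labels from most to least related to the text.

    Assumption: relatedness is Jaccard overlap between the text and each
    label description. A real system would likely use an embedding retriever.
    """
    query = tokenize(text)

    def score(label: str) -> float:
        description = tokenize(schema[label])
        union = query | description
        return len(query & description) / len(union) if union else 0.0

    return sorted(schema, key=score, reverse=True)


def rac_label(
    text: str,
    schema: Dict[str, str],
    judge: Judge,
    max_candidates: int = 3,
) -> Optional[str]:
    """Ask the judge about one label at a time, most related first.

    Returns the first label the judge accepts, or None (abstain) if none of
    the top `max_candidates` labels is accepted.
    """
    for label in rank_labels(text, schema)[:max_candidates]:
        # Only this label's schema entry is shown to the judge on each call.
        if judge(text, label, schema[label]):
            return label
    return None
```

In this sketch, each call to `judge` sees a single label description rather than the whole schema, which is the point of the one-label-at-a-time design for high-cardinality tasks. A smaller `max_candidates` makes the labeller abstain more often, and a larger one labels more examples at the cost of more LLM calls.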
Taxonomy
Topics: Natural Language Processing Techniques · Semantic Web and Ontologies · Mathematics, Computing, and Information Processing
Methods: Attention Is All You Need · Adam · Softmax · Absolute Position Encodings · Residual Connection · Dropout · Byte Pair Encoding · Linear Layer · Multi-Head Attention · Position-Wise Feed-Forward Layer
