The ALCHEmist: Automated Labeling 500x CHEaper Than LLM Data Annotators

Tzu-Heng Huang; Catherine Cao; Vaishnavi Bhargava; Frederic Sala

arXiv:2407.11004 · cs.CL · February 4, 2025 · 3 citations

TL;DR

Alchemist prompts pretrained models to generate programs that produce labels, rather than querying the models for each label directly. This significantly reduces labeling costs while maintaining or improving annotation quality.

Contribution

We introduce a cost-effective method that prompts pretrained models to generate reusable label-producing programs, matching or outperforming direct LLM annotation in quality at a fraction of the cost.

Findings

01. Achieves an average 12.9% performance improvement over direct LLM annotation.

02. Reduces total labeling costs by a factor of approximately 500.

03. Produces reusable, extendable label-generating programs.

Abstract

Large pretrained models can be used as annotators, helping replace or augment crowdworkers and enabling distilling generalist models into smaller specialist models. Unfortunately, this comes at a cost: employing top-of-the-line models often requires paying thousands of dollars for API calls, while the resulting datasets are static and challenging to audit. To address these challenges, we propose a simple alternative: rather than directly querying labels from pretrained models, we task models to generate programs that can produce labels. These programs can be stored and applied locally, re-used and extended, and cost orders of magnitude less. Our system, Alchemist, obtains comparable to or better performance than large language model-based annotation in a range of tasks for a fraction of the cost: on average, improvements amount to a 12.9% enhancement while the total labeling costs…
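
To make the idea of label-producing programs concrete, here is a minimal Python sketch. It assumes a hypothetical spam-detection task, and the three small functions stand in for programs a model might be prompted to write; none of them come from the paper or its repository. The majority-vote step is also an assumption made for illustration, not a description of how Alchemist combines program outputs.

```python
from collections import Counter
from typing import Callable, List, Optional

# A program may decline to label an example by returning ABSTAIN.
ABSTAIN = None

# Hypothetical generated programs for an illustrative spam task.
# In practice these would be written once by a model, stored, and reused.
def program_keywords(text: str) -> Optional[int]:
    """Vote spam (1) when an obvious promotional phrase appears."""
    lowered = text.lower()
    if any(k in lowered for k in ("free money", "click here", "winner")):
        return 1
    return ABSTAIN

def program_short_reply(text: str) -> Optional[int]:
    """Vote not-spam (0) for short messages that contain no link."""
    if len(text.split()) <= 8 and "http" not in text.lower():
        return 0
    return ABSTAIN

def program_many_links(text: str) -> Optional[int]:
    """Vote spam (1) when the message contains two or more links."""
    if text.lower().count("http") >= 2:
        return 1
    return ABSTAIN

PROGRAMS: List[Callable[[str], Optional[int]]] = [
    program_keywords,
    program_short_reply,
    program_many_links,
]

def apply_programs(
    programs: List[Callable[[str], Optional[int]]], texts: List[str]
) -> List[Optional[int]]:
    """Run every program locally on every text and majority-vote the results.

    Examples on which all programs abstain stay unlabeled (ABSTAIN).
    Ties are broken in favor of the label that was voted first.
    """
    labels: List[Optional[int]] = []
    for text in texts:
        votes = [v for v in (p(text) for p in programs) if v is not ABSTAIN]
        if votes:
            labels.append(Counter(votes).most_common(1)[0][0])
        else:
            labels.append(ABSTAIN)
    return labels

if __name__ == "__main__":
    examples = [
        "Click here now to claim your free money, you are a lucky winner",
        "See you at lunch?",
        "Quarterly report attached for review by the finance team before Friday",
    ]
    print(apply_programs(PROGRAMS, examples))
```

Once such programs exist, labeling further examples needs no additional model calls, which is consistent with the abstract's point that the programs can be stored, applied locally, and extended. Because each program is plain code, it can also be read and audited directly.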

Code & Models

Repositories

sprocketlab/alchemist · Official

Videos

The ALCHEmist: Automated Labeling 500x CHEaper Than LLM Data Annotators · SlidesLive

Taxonomy

Topics: Natural Language Processing Techniques · Biomedical Text Mining and Ontologies · Topic Modeling