Cost-aware LLM-based Online Dataset Annotation

Eray Can Elumar; Cem Tekin; Osman Yagan

arXiv:2505.15101 · cs.LG · December 16, 2025

TL;DR

This paper introduces CaMVo, an adaptive, cost-aware framework for LLM-based dataset annotation that reduces computational costs while maintaining high accuracy by selectively querying models based on confidence and context.

Contribution

The paper presents a novel online selection mechanism for LLMs that balances cost and confidence, improving efficiency without sacrificing annotation quality.

Findings

01. CaMVo achieves accuracy comparable to or better than full majority voting.

02. CaMVo significantly reduces the computational cost of dataset annotation.

03. Empirical results on the MMLU and IMDB datasets support both claims.

Abstract

Recent advances in large language models (LLMs) have enabled automated dataset labeling with minimal human supervision. While majority voting across multiple LLMs can improve label reliability by mitigating individual model biases, it incurs high computational costs due to repeated querying. In this work, we propose a novel online framework, Cost-aware Majority Voting (CaMVo), for efficient and accurate LLM-based dataset annotation. CaMVo adaptively selects a subset of LLMs for each data instance based on contextual embeddings, balancing confidence and cost without requiring pre-training or ground-truth labels. Leveraging a LinUCB-based selection mechanism and a Bayesian estimator over confidence scores, CaMVo estimates a lower bound on labeling accuracy for each LLM and aggregates responses through weighted majority voting. Our empirical evaluation on the MMLU and IMDB Movie Review…
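The weighted majority voting step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not state the weighting rule, so the log-odds weights below are an assumption (a classical choice for combining independent voters), and the function name `weighted_majority_vote` and its arguments are hypothetical, not taken from the paper.

```python
import math
from collections import defaultdict


def weighted_majority_vote(labels, accuracies):
    """Aggregate labels from several LLMs into one label.

    labels:     one predicted label per queried LLM.
    accuracies: estimated accuracy (or a lower bound on it) for each LLM,
                in the same order as `labels`.

    Assumption: each vote is weighted by the log-odds of its estimated
    accuracy. This is an illustrative choice, not the paper's stated rule.
    """
    if not labels or len(labels) != len(accuracies):
        raise ValueError("labels and accuracies must be non-empty and equal in length")

    scores = defaultdict(float)
    for label, acc in zip(labels, accuracies):
        # Clamp to avoid infinite weights at exactly 0 or 1.
        acc = min(max(acc, 1e-6), 1.0 - 1e-6)
        scores[label] += math.log(acc / (1.0 - acc))

    # Return the label with the highest total weight.
    return max(scores, key=scores.get)


# One high-confidence model can outweigh two weaker models that disagree with it.
print(weighted_majority_vote(["pos", "neg", "neg"], [0.95, 0.60, 0.60]))
```

With equal accuracy estimates this reduces to plain majority voting, which is why querying fewer but more reliable models can match the accuracy of querying all of them.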

Videos

Cost-aware LLM-based Online Dataset Annotation · SlidesLive

Taxonomy

Topics: Semantic Web and Ontologies · Advanced Computational Techniques and Applications · Neural Networks and Applications