Cost-aware LLM-based Online Dataset Annotation
Eray Can Elumar, Cem Tekin, Osman Yagan

TL;DR
This paper introduces CaMVo, an adaptive, cost-aware framework for LLM-based dataset annotation that reduces computational costs while maintaining high accuracy by selectively querying models based on confidence and context.
Contribution
The paper presents a novel online selection mechanism for LLMs that balances cost and confidence, improving efficiency without sacrificing annotation quality.
Findings
CaMVo matches or exceeds the accuracy of full majority voting.
It significantly reduces the computational cost of dataset annotation.
Empirical results on the MMLU and IMDB Movie Review datasets support both findings.
Abstract
Recent advances in large language models (LLMs) have enabled automated dataset labeling with minimal human supervision. While majority voting across multiple LLMs can improve label reliability by mitigating individual model biases, it incurs high computational costs due to repeated querying. In this work, we propose a novel online framework, Cost-aware Majority Voting (CaMVo), for efficient and accurate LLM-based dataset annotation. CaMVo adaptively selects a subset of LLMs for each data instance based on contextual embeddings, balancing confidence and cost without requiring pre-training or ground-truth labels. Leveraging a LinUCB-based selection mechanism and a Bayesian estimator over confidence scores, CaMVo estimates a lower bound on labeling accuracy for each LLM and aggregates responses through weighted majority voting. Our empirical evaluation on the MMLU and IMDB Movie Review…
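The aggregation step mentioned in the abstract, weighted majority voting over the responses of the selected LLMs, can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the log-odds weighting rule, and the example numbers are assumptions chosen for illustration, and the per-model accuracy lower bounds are simply given as inputs rather than estimated with LinUCB or a Bayesian estimator.

```python
import math
from collections import defaultdict


def log_odds_weight(accuracy_lower_bound, eps=1e-6):
    """Turn an estimated accuracy into a vote weight.

    Assumption: log-odds weighting, a common choice for weighted
    majority voting; the paper may use a different rule.
    """
    # Clamp away from 0 and 1 so the logarithm stays finite.
    p = min(max(accuracy_lower_bound, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def weighted_majority_vote(labels, accuracy_lower_bounds):
    """Return the label with the largest total weight.

    labels: one predicted label per queried model (hypothetical input).
    accuracy_lower_bounds: one estimated accuracy lower bound per model.
    """
    totals = defaultdict(float)
    for label, bound in zip(labels, accuracy_lower_bounds):
        totals[label] += log_odds_weight(bound)
    return max(totals, key=totals.get)


# Illustrative values only: one confident model disagrees with two weak ones.
votes = ["positive", "negative", "negative"]
bounds = [0.95, 0.60, 0.65]
winner = weighted_majority_vote(votes, bounds)
print(winner)
```

Under these assumed weights, a single model with a high accuracy bound can outvote two models whose bounds are close to chance, which is the behavior that separates weighted voting from a plain head count.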
Taxonomy
Topics: Semantic Web and Ontologies · Advanced Computational Techniques and Applications · Neural Networks and Applications
