Evaluating Large Language Models as Expert Annotators

Yu-Min Tseng; Wei-Lin Chen; Chung-Chi Chen; Hsin-Hsi Chen

arXiv:2508.07827 · cs.CL · August 12, 2025

TL;DR

This study evaluates whether large language models (LLMs) can replace human experts on annotation tasks in specialized domains, finding that reasoning techniques bring limited benefits and that multi-agent discussions expose persistent model behaviors.

Contribution

It introduces a multi-agent discussion framework for LLM-based annotation and assesses how effective LLMs are as expert annotators in the finance, biomedicine, and law domains, highlighting the limited gains from reasoning models.

Findings

01

Reasoning techniques offer minimal performance improvements.

02

Multi-agent discussions reveal persistent model behaviors.

03

LLMs show limited ability to adapt annotations in expert domains.

Abstract

Textual data annotation, the process of labeling or tagging text with relevant information, is typically costly, time-consuming, and labor-intensive. While large language models (LLMs) have demonstrated their potential as direct alternatives to human annotators for general-domain natural language processing (NLP) tasks, their effectiveness on annotation tasks in domains requiring expert knowledge remains underexplored. In this paper, we investigate whether top-performing LLMs, which might be perceived as having expert-level proficiency on academic and professional benchmarks, can serve as direct alternatives to human expert annotators. To this end, we evaluate both individual LLMs and multi-agent approaches across three highly specialized domains: finance, biomedicine, and law. Specifically, we propose a multi-agent discussion framework to simulate a group of human annotators, where…
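The abstract breaks off before describing the discussion protocol, so the sketch below is not the authors' framework. It is a minimal Python illustration, assuming one common design for simulated annotator groups: each agent labels an item independently, agents that disagree then see their peers' labels and may revise for up to `max_rounds` rounds, and a majority vote decides if no consensus forms. The function `discuss_and_annotate`, the agent signature, and the stub agents `stubborn` and `conformist` (stand-ins for real LLM calls) are all hypothetical.

```python
from collections import Counter
from typing import Callable, Dict, List, Optional

# Hypothetical agent interface: given the item text and the peers' labels from
# the previous round (None in the first, independent round), return a label.
# In practice each agent would wrap an LLM prompt; here they are stubs.
Agent = Callable[[str, Optional[List[str]]], str]


def discuss_and_annotate(text: str, agents: Dict[str, Agent], max_rounds: int = 3) -> dict:
    """Illustrative multi-agent discussion loop (assumed design, not the paper's)."""
    # Round 0: every agent annotates independently, without seeing peers.
    labels = {name: agent(text, None) for name, agent in agents.items()}
    rounds = 0
    # Keep discussing while agents disagree and the round budget remains.
    while len(set(labels.values())) > 1 and rounds < max_rounds:
        rounds += 1
        # Each agent sees the other agents' labels from the previous round.
        labels = {
            name: agent(text, [lab for other, lab in labels.items() if other != name])
            for name, agent in agents.items()
        }
    consensus = len(set(labels.values())) == 1
    # Fallback aggregation: majority vote (ties go to the label seen first).
    final = Counter(labels.values()).most_common(1)[0][0]
    return {"final": final, "labels": labels, "rounds": rounds, "consensus": consensus}


def stubborn(label: str) -> Agent:
    """Stub agent that never changes its label."""
    return lambda text, peers: label


def conformist(initial: str) -> Agent:
    """Stub agent that starts with `initial`, then adopts the peers' majority label."""
    def agent(text: str, peers: Optional[List[str]]) -> str:
        if peers is None:
            return initial
        return Counter(peers).most_common(1)[0][0]
    return agent


if __name__ == "__main__":
    group = {"a": stubborn("positive"), "b": stubborn("positive"), "c": conformist("negative")}
    print(discuss_and_annotate("Quarterly revenue beat guidance.", group))
```

With two agents fixed on `positive` and one `conformist` starting at `negative`, the conformist adopts the majority after one round, so the call returns `final="positive"` with `rounds=1`. Two stubborn agents that disagree never converge; the loop stops at `max_rounds`, and the tie is broken by first occurrence in `Counter.most_common`.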

Taxonomy

Topics: Topic Modeling · Artificial Intelligence in Healthcare and Education · Text Readability and Simplification