Refining and Reusing Annotation Guidelines for LLM Annotation
Kon Woo Kim, Jin-Dong Kim, Akiko Aizawa

TL;DR
This paper presents an iterative moderation framework for refining annotation guidelines to improve LLM performance on specialized biomedical NER tasks, and evaluates it across three LLM families (GPT, Gemini, DeepSeek) and three datasets (NCBI Disease, BC5CDR, BioRED).
Contribution
It introduces a systematic reuse and refinement process for annotation guidelines as an alignment mechanism for LLMs, tested through an iterative moderation framework.
Findings
Guideline integration improves LLM annotation quality.
Reasoning-optimized models outperform standard models when annotating with guidelines.
Moderation under minimal supervision is feasible.
Abstract
While Large Language Models (LLMs) demonstrate remarkable performance on zero-shot annotation tasks, they often struggle with the specialized conventions of gold-standard benchmarks. We propose the systematic reuse and refinement of annotation guidelines as an alignment mechanism, introducing an iterative moderation framework that simulates the early phases of annotation projects. We evaluate three hypotheses: (1) the efficacy of guideline integration, (2) the advantage of reasoning-optimized models, and (3) the viability of moderation under minimal supervision. Testing across biomedical NER tasks (NCBI Disease, BC5CDR, BioRED) with three LLM families (GPT, Gemini, DeepSeek), our results empirically confirm all three hypotheses. While the iterative moderation framework shows promise for refining guidelines, our analysis also reveals substantial room for improvement.
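The abstract describes the moderation framework only at a high level, so the following is a minimal sketch of what one such refinement loop could look like, not the authors' implementation. Every name here (`refine_guidelines`, `span_f1`, the `annotate` and `moderate` callables, and the toy stand-ins) is hypothetical, and exact-match span F1 is an assumed scoring choice. In practice `annotate` and `moderate` would wrap LLM calls; here they are plain functions so the loop can run as written.

```python
from typing import Callable, List, Set, Tuple

Span = Tuple[int, int, str]              # (start, end, entity type); assumed format
Example = Tuple[str, Set[Span]]          # (text, gold spans)
Disagreement = Tuple[str, Set[Span], Set[Span]]  # (text, predicted, gold)


def span_f1(pred: Set[Span], gold: Set[Span]) -> float:
    """Exact-match span F1 for one example (assumed metric)."""
    if not pred and not gold:
        return 1.0
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)


def refine_guidelines(
    guidelines: str,
    sample: List[Example],
    annotate: Callable[[str, str], Set[Span]],
    moderate: Callable[[str, List[Disagreement]], str],
    max_rounds: int = 3,
) -> Tuple[str, List[float]]:
    """Annotate a small gold sample, collect disagreements, revise, repeat."""
    if not sample:
        raise ValueError("sample must contain at least one example")
    history: List[float] = []
    for _ in range(max_rounds):
        disagreements: List[Disagreement] = []
        scores: List[float] = []
        for text, gold in sample:
            pred = annotate(guidelines, text)
            scores.append(span_f1(pred, gold))
            if pred != gold:
                disagreements.append((text, pred, gold))
        history.append(sum(scores) / len(scores))
        if not disagreements:
            break  # guidelines already reproduce the gold sample
        guidelines = moderate(guidelines, disagreements)
    return guidelines, history


# Toy stand-ins for the two LLM roles (hypothetical behavior, for illustration).
RULE = "Include anatomical modifiers in disease mentions."


def toy_annotate(guidelines: str, text: str) -> Set[Span]:
    # With the rule present, tag "breast cancer"; otherwise only "cancer".
    if RULE in guidelines:
        return {(14, 27, "Disease")}
    return {(21, 27, "Disease")}


def toy_moderate(guidelines: str, disagreements: List[Disagreement]) -> str:
    # A real moderator would derive the rule from the disagreements.
    return guidelines + "\n" + RULE


sample = [("patients with breast cancer", {(14, 27, "Disease")})]
final_guidelines, history = refine_guidelines(
    "Annotate disease mentions.", sample, toy_annotate, toy_moderate
)
print(history)
```

The small gold sample stands in for the "minimal supervision" setting mentioned in the abstract: only a handful of labeled examples drive the revisions, and the per-round scores in `history` show whether a revision actually helped.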
