Domain-Adapted Retrieval for In-Context Annotation of Pedagogical Dialogue Acts

Jinsook Lee; Kirk Vanacore; Zhuqian Zhou; Bakhtawar Ahtisham; Rene F. Kizilcec

arXiv:2604.03127·cs.CL·April 6, 2026

TL;DR

This paper introduces a domain-adapted retrieval-augmented generation (RAG) pipeline for annotating pedagogical dialogue acts. It fine-tunes a lightweight embedding model on tutoring corpora and indexes dialogues at the utterance level to retrieve labeled few-shot demonstrations, outperforming no-retrieval baselines on two tutoring datasets and three LLM backbones.

Contribution

The paper shows that adapting the retriever, rather than fine-tuning the large language model, substantially improves agreement (Cohen's κ) on dialogue act annotation.

Findings

01. The best configuration achieves Cohen's κ of 0.526-0.580 on TalkMoves and 0.659-0.743 on Eedi.

02. Utterance-level indexing, rather than embedding quality alone, is the main factor driving the performance gains.

03. Retrieval corrects systematic label biases and improves detection of rare labels.
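Since every result above is reported as Cohen's κ, a small worked example may help in reading the numbers. The sketch below computes κ as (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected by chance from each annotator's label frequencies. The label names and the two label sequences are invented for illustration and are not taken from the paper's data.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items that received the same label.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: sum over labels of the product of marginal proportions.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if p_e == 1.0:
        # Both sequences use one identical label; kappa is undefined here,
        # and returning 1.0 is a convention chosen for this sketch.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Hypothetical labels, for illustration only.
human = ["probing", "probing", "revoicing", "none", "none", "none"]
model = ["probing", "revoicing", "revoicing", "none", "none", "probing"]
kappa = cohens_kappa(human, model)  # p_o = 4/6, p_e = 12/36, so kappa is about 0.5
```

Here the two sequences agree on four of six items, yet κ is only about 0.5 because a third of that agreement would be expected by chance.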

Abstract

Automated annotation of pedagogical dialogue is a high-stakes task where LLMs often fail without sufficient domain grounding. We present a domain-adapted RAG pipeline for tutoring move annotation. Rather than fine-tuning the generative model, we adapt retrieval by fine-tuning a lightweight embedding model on tutoring corpora and indexing dialogues at the utterance level to retrieve labeled few-shot demonstrations. Evaluated across two real tutoring dialogue datasets (TalkMoves and Eedi) and three LLM backbones (GPT-5.2, Claude Sonnet 4.6, Qwen3-32b), our best configuration achieves Cohen's $\kappa$ of 0.526-0.580 on TalkMoves and 0.659-0.743 on Eedi, substantially outperforming no-retrieval baselines ($\kappa = 0.275$-$0.413$ and $0.160$-$0.410$, respectively). An ablation study reveals that utterance-level indexing, rather than embedding quality alone, is the primary driver of these gains, with…
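The retrieval step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `embed` is a bag-of-words placeholder standing in for the fine-tuned embedding model, and the function names, label names, example utterances, and prompt wording are all assumptions made for this example. The point it shows is the indexing granularity: each labeled utterance is its own retrievable entry, and the top-k most similar entries become few-shot demonstrations for the LLM.

```python
import math
import re
from collections import Counter


def embed(text):
    """Placeholder embedding: word counts (stand-in for a fine-tuned encoder)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def build_utterance_index(dialogues):
    """Flatten labeled dialogues so each utterance is a separate index entry."""
    return [(utt, label, embed(utt)) for dialogue in dialogues for utt, label in dialogue]


def retrieve(index, query, k=2):
    """Return the k labeled utterances most similar to the query utterance."""
    q = embed(query)
    ranked = sorted(index, key=lambda entry: cosine(q, entry[2]), reverse=True)
    return [(utt, label) for utt, label, _ in ranked[:k]]


def build_prompt(demos, query):
    """Format retrieved demonstrations as a few-shot annotation prompt."""
    parts = ["Label the tutor utterance with a dialogue act."]
    parts += [f"Utterance: {utt}\nLabel: {label}" for utt, label in demos]
    parts.append(f"Utterance: {query}\nLabel:")
    return "\n\n".join(parts)


# Hypothetical labeled dialogues, for illustration only.
dialogues = [
    [("Can you explain why you chose that method?", "press_for_reasoning"),
     ("So you are saying the answer is twelve?", "revoicing")],
    [("Let's open the textbook to page five.", "none"),
     ("Why do you think that works?", "press_for_reasoning")],
]
index = build_utterance_index(dialogues)
query = "Can you explain why that works?"
demos = retrieve(index, query, k=2)
prompt = build_prompt(demos, query)
```

In this toy index the two retrieved demonstrations come from different dialogues, which a dialogue-level index could not return without also pulling in their unrelated neighboring turns.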
