Amulet: Putting Complex Multi-Turn Conversations on the Stand with LLM Juries
Sahana Ramnath, Anurag Mudgil, Brihi Joshi, Skyler Hallinan, Xiang Ren

TL;DR
Amulet is a framework that enhances large language model judges by incorporating dialog acts and maxims, significantly improving their accuracy in evaluating complex multi-turn conversations across diverse datasets.
Contribution
Amulet applies the linguistic concepts of dialog acts and maxims to improve the accuracy of LLM judges on multi-turn dialogues. It can be used either as a standalone judge or as part of a jury.
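To make the judge-versus-jury distinction concrete, here is a minimal sketch of how several judges' verdicts might be combined. It assumes each judge outputs a preference label ("A" or "B") and that the jury takes a simple majority vote; the function name `jury_verdict` and the tie handling are hypothetical, and the paper's actual aggregation rule may differ.

```python
from collections import Counter
from typing import List


def jury_verdict(votes: List[str]) -> str:
    """Combine per-judge preference labels by majority vote.

    Assumption: every vote is a label such as "A" or "B".
    Returns "tie" when the top two labels have equal counts.
    """
    if not votes:
        raise ValueError("at least one vote is required")
    ranked = Counter(votes).most_common()
    # A tie occurs when the two most frequent labels have the same count.
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return "tie"
    return ranked[0][0]


# Hypothetical votes from three judges on one preference pair.
verdict = jury_verdict(["A", "B", "A"])
```

A jury with a single member reduces to the standalone judge case, since the one vote is always the majority.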
Findings
Humans frequently change their intent from one turn of a conversation to the next (60-70%).
Dialog acts and maxims differentiate preference responses in 75% of cases.
Amulet improves judgment accuracy across four challenging datasets.
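One plausible reading of the intent-change finding is as the share of consecutive human turns whose dialog acts differ. The sketch below computes that share under stated assumptions: each turn has already been annotated with a set of dialog-act labels, and the label names ("greeting", "request", "feedback") and the function `intent_change_rate` are hypothetical, not taken from the paper, whose exact measurement may differ.

```python
from typing import List, Set


def intent_change_rate(turn_acts: List[Set[str]]) -> float:
    """Fraction of consecutive turn pairs whose dialog-act sets differ.

    Assumption: turn_acts[i] holds the dialog-act labels for human turn i.
    Returns 0.0 for conversations with fewer than two turns.
    """
    if len(turn_acts) < 2:
        return 0.0
    pairs = zip(turn_acts, turn_acts[1:])
    changes = sum(1 for prev, cur in pairs if prev != cur)
    return changes / (len(turn_acts) - 1)


# Hypothetical four-turn conversation: two of the three transitions change intent.
acts = [{"greeting"}, {"request"}, {"request"}, {"feedback"}]
rate = intent_change_rate(acts)
```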
Abstract
Today, large language models are widely used as judges to evaluate responses from other language models. Hence, it is imperative to benchmark and improve these LLM-judges on real-world language model usage: a typical human-assistant conversation is lengthy, and shows significant diversity in topics, intents, and requirements across turns, e.g. social interactions, task requests, feedback. We present Amulet, a framework that leverages pertinent linguistic concepts of dialog-acts and maxims to improve the accuracy of LLM-judges on preference data with complex, multi-turn conversational context. Amulet presents valuable insights about (a) the communicative structures and intents present in the conversation (dialog acts), and (b) the satisfaction of conversational principles (maxims) by the preference responses, and uses them to make judgments. On four challenging datasets, Amulet shows…
Taxonomy
Topics: Legal Systems and Judicial Processes · Legal Education and Practice Innovations · Criminal Law and Evidence
