When AI Agents Disagree Like Humans: Reasoning Trace Analysis for Human-AI Collaborative Moderation
Michał Wawer, Jarosław A. Chudziak

TL;DR
This paper explores how analyzing reasoning traces of AI agents can reveal meaningful disagreement signals in hate speech moderation, suggesting a shift from consensus to uncertainty-surfacing in multi-agent systems.
Contribution
It introduces a taxonomy-based analysis of agent disagreement patterns that correlate with human annotator disagreement, and proposes a new approach to human-AI collaboration.
Findings
Agent reasoning similarity weakly predicts human disagreement.
Agreement among agents correlates with lower human disagreement.
Disagreement structure provides a stronger signal than magnitude.
Abstract
When LLM-based multi-agent systems disagree, current practice treats this as noise to be resolved through consensus. We propose it can be signal. We focus on hate speech moderation, a domain where judgments depend on cultural context and individual value weightings, producing high legitimate disagreement among human annotators. We hypothesize that convergent disagreement, where agents reason similarly but conclude differently, indicates genuine value pluralism that humans also struggle to resolve. Using the Measuring Hate Speech corpus, we embed reasoning traces from five perspective-differentiated agents and classify disagreement patterns using a four-category taxonomy based on reasoning similarity and conclusion agreement. We find that raw reasoning divergence weakly predicts human annotator conflict, but the structure of agent discord carries additional signal: cases where agents…
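The four-category taxonomy described in the abstract crosses two binary questions: do the agents' reasoning traces resemble each other, and do their conclusions agree? The sketch below illustrates that idea only; it is not the authors' implementation. Several pieces are assumptions made for illustration: the similarity measure (mean pairwise cosine similarity of trace embeddings), the `sim_threshold` value, the definition of agreement as unanimity, and every category name other than "convergent disagreement", which is the only one the abstract names.

```python
from itertools import combinations
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_pairwise_similarity(embeddings):
    """Average cosine similarity over all pairs of agent trace embeddings."""
    pairs = list(combinations(embeddings, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


def classify_case(embeddings, verdicts, sim_threshold=0.8):
    """Place one item into a 2x2 taxonomy.

    sim_threshold and the unanimity rule are hypothetical choices,
    not values taken from the paper.
    """
    similar = mean_pairwise_similarity(embeddings) >= sim_threshold
    agree = len(set(verdicts)) == 1  # assumption: agreement means unanimity
    if similar and agree:
        return "convergent_agreement"      # hypothetical label
    if similar and not agree:
        return "convergent_disagreement"   # term used in the abstract
    if not similar and agree:
        return "divergent_agreement"       # hypothetical label
    return "divergent_disagreement"        # hypothetical label


# Toy data: five agents with near-parallel embeddings but split verdicts.
toy_embeddings = [[1.0, 0.1], [0.9, 0.2], [1.0, 0.0], [0.95, 0.15], [0.9, 0.1]]
toy_verdicts = ["hate", "not_hate", "hate", "hate", "not_hate"]
label = classify_case(toy_embeddings, toy_verdicts)
print(label)
```

In this toy case the embeddings point in nearly the same direction while the verdicts split, so the item falls into the convergent-disagreement cell, the pattern the abstract hypothesizes reflects genuine value pluralism. In practice the embeddings would come from a text-embedding model applied to each agent's reasoning trace, and the threshold would need to be chosen empirically.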
