FECT: Factuality Evaluation of Interpretive AI-Generated Claims in Contact Center Conversation Transcripts

Hagyeong Shin; Binoy Robin Dalal; Iwona Bialynicka-Birula; Navjot Matharu; Ryan Muir; Xingwei Yang; Samuel W. K. Wong

arXiv:2508.00889·cs.CL·August 5, 2025

Open Access

TL;DR

This paper introduces FECT, a new benchmark dataset and a 3D annotation paradigm for evaluating the factuality of interpretive AI-generated claims in contact center transcripts, addressing hallucination issues in enterprise NLP applications.

Contribution

The paper proposes a novel 3D paradigm for factuality annotation, creates the FECT benchmark dataset, and evaluates LLM-judges' alignment with this paradigm for contact center conversation analysis.

Findings

01. LLMs often hallucinate in contact center analysis tasks.

02. The 3D paradigm improves grounding of factuality labels.

03. LLM-judges can be aligned with the 3D evaluation criteria.

Abstract

Large language models (LLMs) are known to hallucinate, producing natural language outputs that are not grounded in the input, reference materials, or real-world knowledge. In enterprise applications where AI features support business decisions, such hallucinations can be particularly detrimental. LLMs that analyze and summarize contact center conversations introduce a unique set of challenges for factuality evaluation, because ground-truth labels often do not exist for analytical interpretations about sentiments captured in the conversation and root causes of the business problems. To remedy this, we first introduce a 3D (Decompose, Decouple, Detach) paradigm in the human annotation guideline and the LLM-judges' prompt to ground the factuality labels in linguistically-informed evaluation criteria. We then introduce FECT, a novel benchmark dataset for…

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Ethics and Social Impacts of AI · Topic Modeling