EvasionBench: A Large-Scale Benchmark for Detecting Managerial Evasion in Earnings Call Q&A

Shijian Ma (1), Yan Lin (2), Yi Yang (1) ((1) The Hong Kong University of Science and Technology, Hong Kong SAR, China; (2) University of Macau, Macau SAR, China)

arXiv:2601.09142 · cs.LG · February 5, 2026

TL;DR

EvasionBench is a large-scale benchmark dataset and evaluation framework for detecting evasive responses in earnings call Q&A sessions. It is built with a multi-model consensus annotation pipeline and is released alongside a fine-tuned classifier.

Contribution

The paper introduces EvasionBench, the first large-scale benchmark for detecting managerial evasion, with a comprehensive dataset, a multi-model consensus annotation pipeline, and a high-performing classifier.

Findings

01. The Multi-Model Consensus (MMC) annotation pipeline achieves a Cohen's Kappa of 0.835.

02. The Eva-4B classifier reaches 84.9% Macro-F1.

03. Multi-model consensus labeling outperforms single-model annotation.
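For readers unfamiliar with the two metrics quoted in the findings, the sketch below computes Cohen's Kappa and Macro-F1 from their standard definitions. It is not the authors' evaluation code. The label strings and the four-item toy lists are made up for illustration; only the three-level taxonomy itself (direct, intermediate, fully evasive) comes from the paper.

```python
from collections import Counter

# Hypothetical label strings for the paper's three-level taxonomy.
LABELS = ["direct", "intermediate", "fully_evasive"]


def cohen_kappa(a, b):
    """Cohen's Kappa: (p_o - p_e) / (1 - p_e) for two label sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("label sequences must be non-empty and equal length")
    n = len(a)
    # Observed agreement: fraction of items labelled identically.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of the two raters' marginal label rates.
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[l] * count_b[l] for l in set(count_a) | set(count_b)) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


def macro_f1(gold, pred, labels=LABELS):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy data, invented for this example only.
gold = ["direct", "direct", "intermediate", "fully_evasive"]
pred = ["direct", "intermediate", "intermediate", "fully_evasive"]
print(cohen_kappa(gold, pred), macro_f1(gold, pred))
```

Macro-F1 weights each of the three classes equally regardless of how often it occurs, so it does not let the most frequent class dominate the score.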

Abstract

We present EvasionBench, a comprehensive benchmark for detecting evasive responses in corporate earnings call question-and-answer sessions. Drawing from 22.7 million Q&A pairs extracted from S&P Capital IQ transcripts, we construct a rigorously filtered dataset and introduce a three-level evasion taxonomy: direct, intermediate, and fully evasive. Our annotation pipeline employs a Multi-Model Consensus (MMC) framework, combining dual frontier LLM annotation with a three-judge majority voting mechanism for ambiguous cases, achieving a Cohen's Kappa of 0.835 on human inter-annotator agreement. We release: (1) a balanced 84K training set, (2) a 1K gold-standard evaluation set with expert human labels, and (3) Eva-4B, a 4-billion-parameter classifier fine-tuned from Qwen3-4B that achieves 84.9% Macro-F1, outperforming Claude 4.5, GPT-5.2, and Gemini 3 Flash. Our ablation studies…
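The consensus step described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function name `consensus_label`, the callable interface for annotators and judges, the choice to have judges label independently, and the choice to return `None` when the judges produce no majority are all hypothetical. The abstract states only that two frontier LLMs annotate and that three judges vote by majority on ambiguous cases.

```python
from collections import Counter
from typing import Callable, Optional, Sequence

# Hypothetical interface: a labeler maps (question, answer) to a label string.
Labeler = Callable[[str, str], str]


def consensus_label(
    question: str,
    answer: str,
    annotators: Sequence[Labeler],
    judges: Sequence[Labeler],
) -> Optional[str]:
    """Return an agreed label, or None if no consensus is reached."""
    if len(annotators) != 2:
        raise ValueError("this sketch assumes exactly two primary annotators")
    if not judges:
        raise ValueError("at least one judge is required")

    first, second = (annotate(question, answer) for annotate in annotators)
    if first == second:
        return first  # the two annotators agree, so no judges are consulted

    # Ambiguous case: each judge labels the pair and a strict majority wins.
    votes = Counter(judge(question, answer) for judge in judges)
    label, count = votes.most_common(1)[0]
    # Assumption: with no strict majority the item is left unresolved.
    return label if count > len(judges) / 2 else None
```

In practice each labeler would wrap a call to a language model. With three judges and three possible labels, a three-way split is possible, which is why the sketch needs an explicit unresolved outcome.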

Taxonomy

Topics: Auditing, Earnings Management, Governance · Explainable Artificial Intelligence (XAI) · Expert finding and Q&A systems