BeDiscovER: The Benchmark of Discourse Understanding in the Era of Reasoning Language Models
Chuyuan Li, Giuseppe Carenini

TL;DR
BeDiscovER is a benchmark suite for evaluating discourse understanding in modern language models. It covers tasks ranging from discourse parsing to discourse particle disambiguation and exposes both strengths and weaknesses of current models.
Contribution
The paper introduces BeDiscovER, a new benchmark that aggregates 52 datasets from 5 discourse tasks at the lexical, (multi-)sentential, and document levels, including novel challenges such as discourse particle disambiguation, for evaluating reasoning language models.
Findings
State-of-the-art models excel in temporal reasoning.
Models struggle with full document reasoning and subtle discourse phenomena.
GPT-5-mini shows strong arithmetic reasoning but limited discourse understanding.
Abstract
We introduce BeDiscovER (Benchmark of Discourse Understanding in the Era of Reasoning Language Models), an up-to-date, comprehensive suite for evaluating the discourse-level knowledge of modern LLMs. BeDiscovER compiles 5 publicly available discourse tasks across the discourse lexicon, (multi-)sentential, and document levels, comprising 52 individual datasets in total. It covers both extensively studied tasks, such as discourse parsing and temporal relation extraction, and novel challenges, such as discourse particle disambiguation (e.g., "just"), and also aggregates a shared task on Discourse Relation Parsing and Treebanking for multilingual and multi-framework discourse relation classification. We evaluate open-source LLMs (the Qwen3 series and DeepSeek-R1) and frontier models such as GPT-5-mini on BeDiscovER, and find that state-of-the-art models exhibit strong performance in arithmetic…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
