Do Chatbot LLMs Talk Too Much? The YapBench Benchmark
Vadim Borisov, Michael Gröger, Mina Mikhael, Richard H. Schreiber

TL;DR
This paper introduces YapBench, a benchmark that measures how much large language models over-generate when a brief answer is expected, and evaluates 76 models to characterize their verbosity patterns.
Contribution
YapBench provides a novel, tokenizer-independent metric for quantifying over-generation in LLMs and offers a comprehensive evaluation across diverse prompts and models.
Findings
Significant variation in over-generation among models.
Identification of category-specific verbosity failure modes.
Benchmark and leaderboard for tracking verbosity over time.
Abstract
Large Language Models (LLMs) such as ChatGPT, Claude, and Gemini increasingly act as general-purpose copilots, yet they often respond with unnecessary length on simple requests, adding redundant explanations, hedging, or boilerplate that increases cognitive load and inflates token-based inference cost. Prior work suggests that preference-based post-training and LLM-judged evaluations can induce systematic length bias, where longer answers are rewarded even at comparable quality. We introduce YapBench, a lightweight benchmark for quantifying user-visible over-generation on brevity-ideal prompts. Each item consists of a single-turn prompt, a curated minimal-sufficient baseline answer, and a category label. Our primary metric, YapScore, measures excess response length beyond the baseline in characters, enabling comparisons across models without relying on any specific tokenizer. We…
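The abstract describes YapScore as excess response length beyond the baseline, counted in characters. The following is a minimal sketch of that idea, not the paper's implementation. It assumes the per-item score is the character difference floored at zero and that items are summarized per category by the median. The flooring, the median, the item field names, and the function names `excess_chars` and `median_excess_by_category` are all illustrative assumptions.

```python
from collections import defaultdict
from statistics import median


def excess_chars(response: str, baseline: str) -> int:
    """Characters in `response` beyond the baseline length.

    Assumption: responses shorter than the baseline score zero
    rather than a negative value.
    """
    return max(0, len(response) - len(baseline))


def median_excess_by_category(items):
    """Summarize per-item excess by category label.

    `items` is an iterable of dicts with the hypothetical keys
    "category", "baseline", and "response". Using the median as
    the summary statistic is an assumption made for this sketch.
    """
    scores = defaultdict(list)
    for item in items:
        scores[item["category"]].append(
            excess_chars(item["response"], item["baseline"])
        )
    return {category: median(values) for category, values in scores.items()}


if __name__ == "__main__":
    sample = [
        {"category": "factual", "baseline": "Paris",
         "response": "The capital of France is Paris."},
        {"category": "factual", "baseline": "4", "response": "4"},
    ]
    print(median_excess_by_category(sample))
```

Because Python's `len` on a `str` counts Unicode code points, the measurement does not depend on any model's tokenizer, which matches the tokenizer-independence the abstract emphasizes.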
Taxonomy
Topics: Artificial Intelligence in Healthcare and Education · AI in Service Interactions · Topic Modeling
