Do Chatbot LLMs Talk Too Much? The YapBench Benchmark

Vadim Borisov; Michael Gröger; Mina Mikhael; Richard H. Schreiber

arXiv:2601.00624·cs.LG·January 5, 2026

Open Access

TL;DR

This paper introduces YapBench, a benchmark that measures how much large language models over-generate when a brief answer is expected, and evaluates 76 models to characterize their verbosity patterns.

Contribution

YapBench provides a novel, tokenizer-independent metric for quantifying over-generation in LLMs and offers a comprehensive evaluation across diverse prompts and models.

Findings

01. Over-generation varies significantly across models.

02. Verbosity failure modes are category-specific.

03. A benchmark and leaderboard track verbosity over time.

Abstract

Large Language Models (LLMs) such as ChatGPT, Claude, and Gemini increasingly act as general-purpose copilots, yet they often respond with unnecessary length on simple requests, adding redundant explanations, hedging, or boilerplate that increases cognitive load and inflates token-based inference cost. Prior work suggests that preference-based post-training and LLM-judged evaluations can induce systematic length bias, where longer answers are rewarded even at comparable quality. We introduce YapBench, a lightweight benchmark for quantifying user-visible over-generation on brevity-ideal prompts. Each item consists of a single-turn prompt, a curated minimal-sufficient baseline answer, and a category label. Our primary metric, YapScore, measures excess response length beyond the baseline in characters, enabling comparisons across models without relying on any specific tokenizer. We…
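
The item structure and character-based metric described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The names `BenchItem` and `excess_chars`, the whitespace stripping, the clamping at zero, and the example prompt and category label are all assumptions made for illustration; the abstract only states that YapScore measures excess response length beyond the baseline in characters.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchItem:
    """One benchmark item as described in the abstract (field names are assumed)."""
    prompt: str    # single-turn prompt
    baseline: str  # curated minimal-sufficient answer
    category: str  # category label


def excess_chars(response: str, baseline: str) -> int:
    """Characters in the response beyond the baseline length.

    Assumptions: surrounding whitespace is ignored, and a response shorter
    than the baseline scores 0 rather than a negative value.
    """
    return max(0, len(response.strip()) - len(baseline.strip()))


# Hypothetical item, not taken from the benchmark itself.
item = BenchItem(prompt="What is 2 + 2?", baseline="4", category="example-category")

terse = "4"
verbose = "The answer is 4."

print(excess_chars(terse, item.baseline))    # no excess over the baseline
print(excess_chars(verbose, item.baseline))  # extra characters beyond the baseline
```

Because the count is over characters rather than tokens, the same response gets the same score regardless of which model's tokenizer produced it, which is the property the abstract highlights.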

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education · AI in Service Interactions · Topic Modeling