BenchOverflow: Measuring Overflow in Large Language Models via Plain-Text Prompts
Erin Feiglin, Nir Hutnik, Raz Lapid

TL;DR
This paper introduces BenchOverflow, a benchmark for measuring and analyzing overflow in large language models: excessive output length elicited by plain-text prompts, which raises serving cost and environmental impact. It also proposes a simple mitigation strategy.
Contribution
The paper presents a standardized benchmark for measuring overflow in LLMs, evaluates multiple models and prompting strategies, and demonstrates a lightweight mitigation method that reduces tail risk in output length.
Findings
Overflow is widespread across models and prompts.
Length control significantly impacts cost and sustainability.
A simple conciseness reminder mitigates overflow effectively.
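As an illustration of the last finding, the following is a minimal sketch of how one might compare output-length tails with and without a conciseness reminder. It is not the authors' implementation: the reminder wording, the whitespace token count, the `generate` callable, and the `cap` threshold are all assumptions introduced here for illustration.

```python
import math
from typing import Callable, Dict, Sequence, Tuple

# Hypothetical reminder text; the paper's exact wording is not given in this summary.
CONCISENESS_REMINDER = "Please keep your answer concise."


def with_reminder(prompt: str, reminder: str = CONCISENESS_REMINDER) -> str:
    """Append the reminder to a plain-text prompt."""
    return f"{prompt}\n\n{reminder}"


def count_tokens(text: str) -> int:
    """Crude whitespace proxy for token count (assumption, not a real tokenizer)."""
    return len(text.split())


def tail_stats(lengths: Sequence[int], cap: int) -> Dict[str, float]:
    """Summarize a length distribution: mean, nearest-rank p95, share above cap."""
    if not lengths:
        raise ValueError("lengths must be non-empty")
    ordered = sorted(lengths)
    n = len(ordered)
    rank = max(1, math.ceil(0.95 * n))  # nearest-rank 95th percentile
    return {
        "mean": sum(ordered) / n,
        "p95": float(ordered[rank - 1]),
        "frac_over_cap": sum(1 for x in ordered if x > cap) / n,
    }


def compare(
    generate: Callable[[str], str],
    prompts: Sequence[str],
    cap: int,
) -> Tuple[Dict[str, float], Dict[str, float]]:
    """Return (baseline, mitigated) tail statistics for the same prompts."""
    baseline = [count_tokens(generate(p)) for p in prompts]
    mitigated = [count_tokens(generate(with_reminder(p))) for p in prompts]
    return tail_stats(baseline, cap), tail_stats(mitigated, cap)
```

Here `generate` stands in for any function that sends a prompt to a model and returns its text output. Reporting the 95th percentile and the share of outputs above a cap, rather than the mean alone, keeps the focus on the tail of the length distribution, which is where overflow shows up.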
Abstract
We investigate a failure mode of large language models (LLMs) in which plain-text prompts elicit excessive outputs, a phenomenon we term Overflow. Unlike jailbreaks or prompt injection, Overflow arises under ordinary interaction settings and can lead to elevated serving cost, latency, and cross-user performance degradation, particularly when scaled across many requests. Beyond usability, the stakes are economic and environmental: unnecessary tokens increase per-request cost and energy consumption, compounding into substantial operational spend and carbon footprint at scale. Moreover, Overflow represents a practical vector for compute amplification and service degradation in shared environments. We introduce BenchOverflow, a model-agnostic benchmark of nine plain-text prompting strategies that amplify output volume without adversarial suffixes or policy circumvention. Using a…
Taxonomy
Topics: Software System Performance and Reliability · Security and Verification in Computing · Adversarial Robustness in Machine Learning
