BenchOverflow: Measuring Overflow in Large Language Models via Plain-Text Prompts

Erin Feiglin; Nir Hutnik; Raz Lapid

arXiv:2601.08490 · cs.CL · January 14, 2026

Open Access

TL;DR

This paper introduces BenchOverflow, a benchmark for measuring and analyzing Overflow in large language models: a failure mode in which ordinary plain-text prompts elicit excessively long outputs, raising serving cost and environmental impact. It also proposes a simple mitigation strategy.

Contribution

The paper presents a standardized benchmark for measuring overflow in LLMs, evaluates multiple models and prompting strategies, and demonstrates a lightweight mitigation that reduces tail risk in output length.
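As a rough illustration of what "tail risk in output length" could mean in practice, the sketch below summarizes a list of per-response token counts by median, upper percentiles, and the share of responses that reach a generation cap. The function names, the choice of nearest-rank percentiles, and the `cap` parameter are assumptions made for this example; the summary above does not state which statistics the paper reports.

```python
import math


def nearest_rank_percentile(values, pct):
    """Nearest-rank percentile of a non-empty sequence (pct in (0, 100])."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    # Nearest-rank method: the smallest value with at least pct% of data at or below it.
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def length_tail_summary(token_counts, cap):
    """Summarize output lengths; `cap` is a hypothetical max-token limit."""
    return {
        "median": nearest_rank_percentile(token_counts, 50),
        "p95": nearest_rank_percentile(token_counts, 95),
        "p99": nearest_rank_percentile(token_counts, 99),
        # Fraction of responses that ran all the way to the cap.
        "cap_hit_rate": sum(1 for n in token_counts if n >= cap) / len(token_counts),
    }


# Illustrative, made-up token counts (not data from the paper).
example_counts = [100, 120, 90, 4000, 110, 95, 105, 130, 115, 4096]
summary = length_tail_summary(example_counts, cap=4096)
```

Reporting upper percentiles alongside the median is one way to make a small number of very long responses visible, since a mean or median alone can hide them.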

Findings

01 Overflow is widespread across models and prompts.

02 Length control significantly impacts cost and sustainability.

03 A simple conciseness reminder mitigates overflow effectively.
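A conciseness reminder of the kind mentioned in the findings could be applied as a thin wrapper around the user prompt. The following is a minimal sketch under stated assumptions: the reminder wording and the helper name `with_conciseness_reminder` are hypothetical, since this summary does not give the paper's exact text or where in the prompt it is placed.

```python
# Hypothetical reminder text; the paper's exact wording is not given in this summary.
CONCISENESS_REMINDER = (
    "Answer concisely and include only what is needed to address the request."
)


def with_conciseness_reminder(prompt: str, reminder: str = CONCISENESS_REMINDER) -> str:
    """Return the prompt with a brevity reminder appended as a separate paragraph."""
    # Strip trailing whitespace so the separator is always exactly one blank line.
    return f"{prompt.rstrip()}\n\n{reminder}"


wrapped = with_conciseness_reminder("List every prime number you can think of.  ")
```

Appending the reminder rather than rewriting the prompt keeps the original request intact, which makes it straightforward to compare output lengths with and without the reminder on the same inputs.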

Abstract

We investigate a failure mode of large language models (LLMs) in which plain-text prompts elicit excessive outputs, a phenomenon we term Overflow. Unlike jailbreaks or prompt injection, Overflow arises under ordinary interaction settings and can lead to elevated serving cost, latency, and cross-user performance degradation, particularly when scaled across many requests. Beyond usability, the stakes are economic and environmental: unnecessary tokens increase per-request cost and energy consumption, compounding into substantial operational spend and carbon footprint at scale. Moreover, Overflow represents a practical vector for compute amplification and service degradation in shared environments. We introduce BenchOverflow, a model-agnostic benchmark of nine plain-text prompting strategies that amplify output volume without adversarial suffixes or policy circumvention. Using a…

Taxonomy

Topics: Software System Performance and Reliability · Security and Verification in Computing · Adversarial Robustness in Machine Learning