ToolHaystack: Stress-Testing Tool-Augmented Language Models in Realistic Long-Term Interactions

Beong-woo Kwak; Minju Kim; Dongha Lim; Hyungjoo Chae; Dongjin Kang; Sunghwan Kim; Dongil Yang; Jinyoung Yeo

arXiv:2505.23662·cs.CL·November 24, 2025

TL;DR

ToolHaystack is a benchmark designed to evaluate large language models' ability to maintain effective tool use and context over long-term, realistic interactions, revealing significant robustness gaps in current models.

Contribution

We introduce ToolHaystack, a novel benchmark for assessing long-term tool use in language models during realistic, noisy conversations, highlighting limitations of current models.

Findings

01

Current models perform well in short-term multi-turn tasks.

02

Models struggle with long-term context maintenance and noise robustness.

03

Significant gaps in long-term robustness are revealed by the benchmark.

Abstract

Large language models (LLMs) have demonstrated strong capabilities in using external tools to address user inquiries. However, most existing evaluations assume tool use in short contexts, offering limited insight into model behavior during realistic long-term interactions. To fill this gap, we introduce ToolHaystack, a benchmark for testing tool-use capabilities in long-term interactions. Each test instance in ToolHaystack includes multiple task execution contexts and realistic noise within a continuous conversation, enabling assessment of how well models maintain context and handle various disruptions. By applying this benchmark to 14 state-of-the-art LLMs, we find that while current models perform well in standard multi-turn settings, they often struggle significantly in ToolHaystack, highlighting critical gaps in their long-term robustness not revealed by previous tool…
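
As a rough illustration of the idea described in the abstract (task-relevant turns buried among noise within one continuous conversation), the sketch below assembles a toy test instance. It is not the benchmark's actual data format or construction procedure: the names `Turn`, `TestInstance`, `build_instance`, and the `expected_tool_call` field are all hypothetical and chosen only for this example.

```python
from dataclasses import dataclass, field
import random


@dataclass
class Turn:
    role: str               # "user", "assistant", or "tool"
    content: str
    is_noise: bool = False  # True for distractor turns unrelated to the target task


@dataclass
class TestInstance:
    # Hypothetical container: one long conversation plus the tool call
    # the model is expected to produce at the end.
    turns: list = field(default_factory=list)
    expected_tool_call: dict = field(default_factory=dict)


def build_instance(task_turns, noise_turns, expected_tool_call, seed=0):
    """Scatter noise turns among the task turns, preserving task-turn order."""
    rng = random.Random(seed)
    merged = list(task_turns)
    for noise in noise_turns:
        # Inserting into the list never reorders the turns already in it,
        # so the task turns keep their original relative order.
        merged.insert(rng.randint(0, len(merged)), noise)
    return TestInstance(turns=merged, expected_tool_call=expected_tool_call)


task = [
    Turn("user", "Book a table for two at 7pm on Friday."),
    Turn("assistant", "Which restaurant?"),
    Turn("user", "The Italian place we discussed earlier."),
]
noise = [
    Turn("user", "By the way, what's the weather like today?", is_noise=True),
    Turn("assistant", "It looks sunny.", is_noise=True),
]
instance = build_instance(
    task, noise, {"name": "book_table", "arguments": {"party_size": 2}}
)
```

A model under test would then be given `instance.turns` as context and scored on whether its final tool call matches `instance.expected_tool_call`; the matching criterion itself is left open here.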

Code & Models

Repositories

bwookwak/toolhaystack (Official)

Taxonomy

Topics: Topic Modeling · Artificial Intelligence in Healthcare and Education · Multimodal Machine Learning Applications