StreamBench: Towards Benchmarking Continuous Improvement of Language Agents
Cheng-Kuang Wu, Zhi Rui Tam, Chieh-Yen Lin, Yun-Nung Chen, Hung-yi Lee

TL;DR
StreamBench is a new benchmark designed to evaluate and facilitate the development of continuous improvement strategies for large language model agents in streaming, online learning environments.
Contribution
The paper introduces StreamBench, the first benchmark for assessing LLMs' ability to improve over time with feedback, along with baseline methods and analysis of key components for streaming strategies.
Findings
StreamBench effectively evaluates LLMs' improvement over input-feedback sequences.
The baseline strategies vary in how much they improve, and the differences between them point to the components that matter most for streaming improvement.
The accompanying analysis identifies the key factors that influence continuous-learning performance.
Abstract
Recent work has shown that large language model (LLM) agents are able to improve themselves from experience, which is an important ability for continuous enhancement post-deployment. However, existing benchmarks primarily evaluate their innate capabilities and do not assess their ability to improve over time. To address this gap, we introduce StreamBench, a pioneering benchmark designed to evaluate the continuous improvement of LLM agents over an input-feedback sequence. StreamBench simulates an online learning environment where LLMs receive a continuous stream of feedback and iteratively enhance their performance. In addition, we propose several simple yet effective baselines for improving LLMs on StreamBench, and provide a comprehensive analysis to identify critical components that contribute to successful streaming strategies. Our work serves as a stepping stone towards…
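The input-feedback loop described in the abstract can be illustrated with a minimal sketch. This is not StreamBench's actual API or any of the paper's baselines: the class EliminationAgent, the functions run_stream and running_accuracy, and the choice of binary correctness as the feedback signal are all assumptions made for illustration. The toy agent stands in for an LLM and simply rules out answers that earlier feedback marked as wrong.

```python
from typing import Callable, Dict, Iterable, List, Set, Tuple


class EliminationAgent:
    """Toy stand-in for an LLM agent (hypothetical, not a StreamBench class)."""

    def __init__(self, labels: List[str]) -> None:
        self.labels = labels
        self.confirmed: Dict[str, str] = {}      # inputs with a known-correct answer
        self.rejected: Dict[str, Set[str]] = {}  # answers that feedback marked wrong

    def predict(self, x: str) -> str:
        # Reuse an answer that earlier feedback confirmed as correct.
        if x in self.confirmed:
            return self.confirmed[x]
        # Otherwise pick the first label not yet rejected for this input.
        tried = self.rejected.get(x, set())
        for label in self.labels:
            if label not in tried:
                return label
        return self.labels[0]

    def update(self, x: str, y_hat: str, correct: bool) -> None:
        # Assumed feedback signal: a single bit saying whether y_hat was correct.
        if correct:
            self.confirmed[x] = y_hat
        else:
            self.rejected.setdefault(x, set()).add(y_hat)


def run_stream(
    agent: EliminationAgent,
    stream: Iterable[Tuple[str, str]],
    is_correct: Callable[[str, str], bool],
) -> List[bool]:
    """Process inputs one at a time: predict, receive feedback, then update."""
    history: List[bool] = []
    for x, y in stream:
        y_hat = agent.predict(x)
        correct = is_correct(y_hat, y)
        agent.update(x, y_hat, correct)  # the agent only ever sees past feedback
        history.append(correct)
    return history


def running_accuracy(history: List[bool]) -> List[float]:
    """Cumulative accuracy after each time step of the stream."""
    hits = 0
    curve: List[float] = []
    for t, correct in enumerate(history, start=1):
        hits += int(correct)
        curve.append(hits / t)
    return curve


if __name__ == "__main__":
    demo_stream = [("a", "pos"), ("b", "neg"), ("a", "pos"), ("b", "neg")]
    demo_agent = EliminationAgent(labels=["neg", "pos"])
    demo_history = run_stream(demo_agent, demo_stream, lambda p, y: p == y)
    print(running_accuracy(demo_history))
```

The point of the sketch is the evaluation order: each prediction is scored before the agent updates on its feedback, so the accuracy curve reflects improvement over the sequence and not performance on a fixed test set.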
Taxonomy
Topics: Multi-Agent Systems and Negotiation · Semantic Web and Ontologies · Natural Language Processing Techniques
