StreamBench: Towards Benchmarking Continuous Improvement of Language Agents

Cheng-Kuang Wu; Zhi Rui Tam; Chieh-Yen Lin; Yun-Nung Chen; Hung-yi Lee

arXiv:2406.08747 · cs.CL · November 1, 2024

TL;DR

StreamBench is a new benchmark designed to evaluate and facilitate the development of continuous improvement strategies for large language model agents in streaming, online learning environments.

Contribution

The paper introduces StreamBench, the first benchmark for assessing LLMs' ability to improve over time with feedback, along with baseline methods and analysis of key components for streaming strategies.

Findings

01 · StreamBench effectively evaluates LLMs' improvement over input-feedback sequences.

02 · Baseline strategies show varying success, highlighting critical components for streaming improvement.

03 · Analysis identifies key factors influencing continuous learning performance.

Abstract

Recent work has shown that large language model (LLM) agents are able to improve themselves from experience, which is an important ability for continuous enhancement post-deployment. However, existing benchmarks primarily evaluate their innate capabilities and do not assess their ability to improve over time. To address this gap, we introduce StreamBench, a pioneering benchmark designed to evaluate the continuous improvement of LLM agents over an input-feedback sequence. StreamBench simulates an online learning environment where LLMs receive a continuous stream of feedback and iteratively enhance their performance. In addition, we propose several simple yet effective baselines for improving LLMs on StreamBench, and provide a comprehensive analysis to identify critical components that contribute to successful streaming strategies. Our work serves as a stepping stone towards…
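
The input-feedback loop described in the abstract can be illustrated with a minimal sketch. This is not the StreamBench implementation or its API: the class EliminationAgent, the function run_stream, the toy stream, and the binary correctness feedback are all assumptions made for illustration. The toy agent simply remembers which labels feedback has rejected or confirmed for each input, so its running accuracy rises as inputs recur.

```python
from typing import Callable, Dict, List, Sequence, Set, Tuple


class EliminationAgent:
    """Hypothetical toy agent: tracks labels that feedback rejected or confirmed."""

    def __init__(self, labels: Sequence[str]) -> None:
        self.labels = list(labels)
        self.rejected: Dict[str, Set[str]] = {}   # input -> labels marked wrong
        self.confirmed: Dict[str, str] = {}       # input -> label marked right

    def predict(self, x: str) -> str:
        # Reuse an answer that earlier feedback confirmed.
        if x in self.confirmed:
            return self.confirmed[x]
        # Otherwise pick the first label not yet rejected for this input.
        tried = self.rejected.get(x, set())
        for label in self.labels:
            if label not in tried:
                return label
        return self.labels[0]  # every label rejected: fall back to the first

    def update(self, x: str, y_hat: str, correct: bool) -> None:
        # Store the feedback so later predictions on the same input can use it.
        if correct:
            self.confirmed[x] = y_hat
        else:
            self.rejected.setdefault(x, set()).add(y_hat)


def run_stream(
    agent: EliminationAgent,
    stream: Sequence[Tuple[str, str]],
    feedback_fn: Callable[[str, str], bool],
) -> List[float]:
    """Process (input, reference) pairs in order; return running accuracy per step."""
    hits = 0
    curve: List[float] = []
    for t, (x, y) in enumerate(stream, start=1):
        y_hat = agent.predict(x)          # 1. agent answers the current input
        correct = feedback_fn(y_hat, y)   # 2. environment returns feedback (assumed binary)
        agent.update(x, y_hat, correct)   # 3. agent updates before the next input
        hits += int(correct)
        curve.append(hits / t)
    return curve


# Toy stream of (input, reference label) pairs; purely illustrative data.
stream = [("q1", "b"), ("q2", "a"), ("q1", "b"), ("q2", "a"), ("q1", "b")]
agent = EliminationAgent(labels=["a", "b"])
curve = run_stream(agent, stream, feedback_fn=lambda y_hat, y: y_hat == y)
print(curve)
```

The point of the sketch is the ordering of the three steps inside the loop: the agent is scored on each input before it sees the feedback for that input, so any gain in the accuracy curve comes from earlier time steps only.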

Code & Models

Repositories

stream-bench/stream-bench · Official

Datasets

appier-ai-research/StreamBench
dataset · 756 downloads

Videos

StreamBench: Towards Benchmarking Continuous Improvement of Language Agents · SlidesLive

Taxonomy

Topics: Multi-Agent Systems and Negotiation · Semantic Web and Ontologies · Natural Language Processing Techniques