Beyond Prompts: Dynamic Conversational Benchmarking of Large Language Models

David Castillo-Bolado; Joseph Davidson; Finlay Gray; Marek Rosa

arXiv:2409.20222 · cs.CL · October 14, 2024

TL;DR

This paper presents a dynamic benchmarking system for conversational AI that evaluates long-term memory, continual learning, and information integration through simulated multi-task interactions, revealing limitations of current large language models.

Contribution

It introduces a novel dynamic benchmarking framework that simulates realistic multi-task conversations to assess LLMs' long-term capabilities and highlights their challenges in interleaved task scenarios.

Findings

01

LLMs perform well on single tasks but struggle with interleaved tasks.

02

Short-context LLMs supplemented with a long-term memory (LTM) system can match or outperform larger-context models.

03

Current benchmarks do not fully capture challenges in natural, multi-task interactions.

Abstract

We introduce a dynamic benchmarking system for conversational agents that evaluates their performance through a single, simulated, and lengthy user $\leftrightarrow$ agent interaction. The interaction is a conversation between the user and agent, where multiple tasks are introduced and then undertaken concurrently. We context switch regularly to interleave the tasks, which constructs a realistic testing scenario in which we assess the Long-Term Memory, Continual Learning, and Information Integration capabilities of the agents. Results from both proprietary and open-source Large-Language Models show that LLMs in general perform well on single-task interactions, but they struggle on the same tasks when they are interleaved. Notably, short-context LLMs supplemented with an LTM system perform as well as or better than those with larger contexts. Our benchmark suggests that there are other…
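The interleaving idea in the abstract can be illustrated with a minimal sketch. This is not the benchmark's actual scheduler: the function name `interleave_tasks`, the task names, and the random switching policy are all assumptions made for illustration. It merges several per-task message scripts into one conversation, switching tasks between turns where possible while keeping each task's own messages in order.

```python
import random
from collections import deque


def interleave_tasks(tasks, seed=0):
    """Merge per-task scripts into a single ordered conversation.

    `tasks` maps a (hypothetical) task name to its list of user messages.
    Each task's internal order is preserved; the task chosen for each turn
    differs from the previous turn whenever another task still has messages.
    """
    rng = random.Random(seed)  # seeded so the schedule is reproducible
    queues = {name: deque(steps) for name, steps in tasks.items() if steps}
    schedule = []
    last = None
    while queues:
        # Prefer a context switch; fall back to the only remaining task.
        candidates = [name for name in queues if name != last] or list(queues)
        name = rng.choice(candidates)
        schedule.append((name, queues[name].popleft()))
        if not queues[name]:
            del queues[name]
        last = name
    return schedule


if __name__ == "__main__":
    # Hypothetical tasks, not taken from the paper.
    demo = {
        "shopping_list": ["Add milk to my list.", "Add eggs.", "What is on my list?"],
        "name_recall": ["My name is Ada.", "What is my name?"],
    }
    for task, message in interleave_tasks(demo, seed=1):
        print(f"[{task}] {message}")
```

In a schedule like this, the question that closes a task can arrive many turns after the information it depends on, which is the kind of situation the abstract describes as testing long-term memory and information integration.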

Code & Models

Repositories

GoodAI/goodai-ltm-benchmark (official)

Videos

Beyond Prompts: Dynamic Conversational Benchmarking of Large Language Models (SlidesLive)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques