From Tools to Teammates: Evaluating LLMs in Multi-Session Coding Interactions

Nathanaël Carraz Rakotonirina; Mohammed Hamdy; Jon Ander Campos; Lucas Weber; Alberto Testoni; Marzieh Fadaee; Sandro Pezzelle; Marco Del Tredici

arXiv:2502.13791·cs.CL·June 10, 2025

TL;DR

This paper evaluates the ability of large language models to collaborate over multiple sessions in coding tasks, revealing current limitations in long-term information retention and integration.

Contribution

Introduces MemoryCode, a synthetic dataset for testing LLMs' multi-session collaboration, and analyzes models' performance, highlighting a key limitation in long-term interaction capabilities.

Findings

01. Models perform well on isolated instructions

02. Performance drops when instructions are spread across sessions

03. Current LLMs struggle with long-term information retrieval

Abstract

Large Language Models (LLMs) are increasingly used in working environments for a wide range of tasks, excelling at solving individual problems in isolation. However, are they also able to effectively collaborate over long-term interactions? To investigate this, we introduce MemoryCode, a synthetic multi-session dataset designed to test LLMs' ability to track and execute simple coding instructions amid irrelevant information, simulating a realistic setting. While all the models we tested handle isolated instructions well, even the performance of state-of-the-art models like GPT-4o deteriorates when instructions are spread across sessions. Our analysis suggests this is due to their failure to retrieve and integrate information over long instruction chains. Our results highlight a fundamental limitation of current LLMs, restricting their ability to collaborate effectively in long…
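To make the evaluation setting in the abstract concrete, here is a minimal sketch of how instructions scattered across sessions might be collected and checked against generated code. It is an illustration under stated assumptions, not the MemoryCode implementation: the `Session` class, the two example rules, and the `score` function are all hypothetical names introduced here, and the actual dataset format and metrics may differ.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Session:
    """One conversation session (hypothetical structure, not the dataset schema)."""
    filler: str                                   # irrelevant chatter around the instruction
    rule: Optional[Callable[[str], bool]] = None  # coding instruction as a checker, if any


def function_names(code: str) -> List[str]:
    """Return the names of all functions defined in `code`."""
    return re.findall(r"^\s*def\s+(\w+)", code, flags=re.MULTILINE)


def prefix_rule(prefix: str) -> Callable[[str], bool]:
    """Illustrative instruction: every function name must start with `prefix`."""
    def check(code: str) -> bool:
        names = function_names(code)
        return bool(names) and all(name.startswith(prefix) for name in names)
    return check


def forbids_print(code: str) -> bool:
    """Illustrative instruction: the code must not call print()."""
    return re.search(r"\bprint\s*\(", code) is None


def score(sessions: List[Session], code: str) -> float:
    """Fraction of instructions, gathered over all sessions, that `code` satisfies."""
    rules = [s.rule for s in sessions if s.rule is not None]
    if not rules:
        return 1.0
    return sum(1 for rule in rules if rule(code)) / len(rules)


# Instructions are separated by sessions that carry no instruction at all.
sessions = [
    Session("Chat about the team calendar."),
    Session("Naming convention is introduced.", rule=prefix_rule("x_")),
    Session("Small talk about lunch."),
    Session("Logging policy is introduced.", rule=forbids_print),
]

compliant = "def x_add(a, b):\n    return a + b\n"
partial = "def x_add(a, b):\n    print(a)\n    return a + b\n"
ignored = "def add(a, b):\n    print(a)\n    return a + b\n"
```

In this toy setup, a model that handles each instruction in isolation but loses track of them across sessions would produce code closer to `partial` or `ignored`, which is the kind of degradation the abstract describes.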

Code & Models

Repositories

for-ai/memorycode (official)

Datasets

CohereLabsCommunity/memorycode
dataset · 15 downloads

Videos

From Tools to Teammates: Evaluating LLMs in Multi-Session Coding Interactions

Taxonomy

Topics: Digital Rights Management and Security · Advanced Data Storage Technologies