Minerva: A Programmable Memory Test Benchmark for Language Models

Menglin Xia; Victor Ruehle; Saravan Rajmohan; Reza Shokri

arXiv:2502.03358·cs.CL·June 10, 2025

TL;DR

This paper introduces Minerva, a new programmable benchmark for evaluating language models' memory capabilities through automated, comprehensive, and interpretable tests covering simple to complex tasks.

Contribution

The paper presents a framework for automatically generating diverse memory tests, extending beyond traditional benchmarks to include complex, composite tasks for detailed model assessment.

Findings

01. Models are evaluated on memory recall, editing, matching, and comparison tasks.

02. The benchmark reveals specific strengths and weaknesses in models' memory usage.

03. It provides actionable insights into models' memory-related capabilities.

Abstract

How effectively can LLM-based AI assistants utilize their memory (context) to perform various tasks? Traditional data benchmarks, which are often manually crafted, suffer from several limitations: they are static, susceptible to overfitting, difficult to interpret, and lack actionable insights, failing to pinpoint the specific capabilities a model lacks when it does not pass a test. In this paper, we present a framework for automatically generating a comprehensive set of tests to evaluate models' abilities to use their memory effectively. Our framework extends the range of capability tests beyond the commonly explored (passkey, key-value, needle in the haystack) search, a dominant focus in the literature. Specifically, we evaluate models on atomic tasks such as searching, recalling, editing, matching, comparing information in context memory, performing basic operations when inputs are…
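To make the idea of a programmatically generated memory test concrete, here is a minimal sketch of a key-value search test, the kind of task the abstract names as the commonly explored baseline. It is not the authors' implementation: the function names `make_key_value_test` and `score`, the prompt wording, and the exact-match scoring rule are all assumptions chosen for illustration.

```python
import random
import string


def make_key_value_test(num_pairs: int, seed: int) -> dict:
    """Build one synthetic key-value recall test (hypothetical format)."""
    rng = random.Random(seed)  # seeded so the same arguments give the same test
    # Draw distinct numeric ids so no key appears twice in the context.
    ids = rng.sample(range(10_000, 100_000), num_pairs)
    pairs = {
        f"k{i}": "".join(rng.choices(string.ascii_lowercase, k=6))
        for i in ids
    }
    target = rng.choice(list(pairs))  # the key the model must look up
    context = "\n".join(f"{key}: {value}" for key, value in pairs.items())
    prompt = f"{context}\n\nWhat is the value stored under {target}?"
    return {"prompt": prompt, "target_key": target, "answer": pairs[target]}


def score(model_output: str, test: dict) -> bool:
    """Exact-match scoring after trimming whitespace (an assumed rule)."""
    return model_output.strip() == test["answer"]


# Generate a fresh test instance; changing the seed yields a new test.
test = make_key_value_test(num_pairs=50, seed=0)
```

Because the test is produced from a seed and a size parameter rather than written by hand, new instances can be generated on demand, which is one way a benchmark can avoid being static.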

Videos

Minerva: A Programmable Memory Test Benchmark for Language Models · SlidesLive

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques

Methods: Sparse Evolutionary Training · Focus