TL;DR
This paper introduces WMF-AM, a diagnostic probe of how well large language models maintain and update intermediate state across K sequential operations within a single query, without a scratchpad.
Contribution
The authors present WMF-AM, an adaptable benchmark that isolates within-pass cumulative load in LLMs by parameterizing the depth K, so difficulty can scale as models improve.
Findings
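Arithmetic accumulation probes on 28 models from 12 families (0.5B to frontier) reveal varying working memory capabilities.
A matched non-arithmetic extension (permissions, schedules, inventories) confirms that the design generalizes beyond arithmetic.
Three construct-isolation ablations show that cumulative load, not arithmetic skill or entity tracking, drives difficulty.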
Abstract
Existing large language model (LLM) evaluations use fixed-difficulty benchmarks that cannot adapt as models improve, and they rarely isolate specific cognitive processes. We introduce Working Memory Fidelity-Active Manipulation (WMF-AM), a probe of cumulative state tracking: the ability to maintain and update intermediate results across K sequential operations within a single query, without a scratchpad. Unlike multi-step agent benchmarks that stress task orchestration, WMF-AM isolates within-pass cumulative load by parameterizing depth K. The core probe uses arithmetic accumulation on 28 models from 12 families (0.5B to frontier); a matched non-arithmetic extension (permissions, schedules, inventories) confirms the design generalizes beyond arithmetic. Three construct-isolation ablations confirm that cumulative load, not arithmetic skill or entity tracking, drives difficulty. We release…
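To make the idea of a depth-parameterized accumulation probe concrete, here is a minimal sketch of how such an item could be generated and scored. It is an illustration only, not the authors' released code: the function names `make_probe` and `score`, the prompt wording, the value ranges, and the exact-match scoring rule are all assumptions.

```python
import random


def make_probe(k, seed=0, start_range=(10, 50), delta_range=(1, 9)):
    """Build one hypothetical accumulation item with K sequential updates.

    Returns (prompt, expected_final_value). Wording and ranges are
    illustrative assumptions, not the paper's actual protocol.
    """
    rng = random.Random(seed)  # local RNG so items are reproducible per seed
    value = rng.randint(*start_range)
    lines = [f"Start with {value}."]
    for _ in range(k):
        delta = rng.randint(*delta_range)
        if rng.random() < 0.5:
            value += delta
            lines.append(f"Add {delta}.")
        else:
            value -= delta
            lines.append(f"Subtract {delta}.")
    # Ask only for the final state, so intermediate results are never written out.
    lines.append("What is the final value? Answer with a single integer.")
    return "\n".join(lines), value


def score(answer_text, expected):
    """Exact-match scoring: True only if the reply parses to the expected integer."""
    try:
        return int(answer_text.strip()) == expected
    except ValueError:
        return False


if __name__ == "__main__":
    # Sweep the depth K to vary cumulative load while operations stay trivial.
    for k in (2, 5, 10):
        prompt, expected = make_probe(k, seed=k)
        print(f"K={k}, expected={expected}")
        print(prompt)
        print()
```

Because each individual operation stays a small addition or subtraction, raising K increases only the number of state updates the model must carry forward, which is the property the abstract attributes to the probe.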
