ENGRAM: Effective, Lightweight Memory Orchestration for Conversational Agents

Daivik Patel; Shrenik Patel

arXiv:2511.12960·cs.MA·February 4, 2026

Open Access · 3 Reviews

TL;DR

ENGRAM is a simple, efficient memory system for conversational agents that organizes and retrieves different memory types to improve long-term consistency and performance, outperforming complex existing systems.

Contribution

Introduces ENGRAM, a lightweight memory architecture using typed memory and dense retrieval, achieving state-of-the-art results with minimal complexity.

Findings

01 · State-of-the-art on the LoCoMo benchmark

02 · 15-point improvement on LongMemEval

03 · Uses only about 1% of the tokens compared to full context

Abstract

Large language models (LLMs) deployed in user-facing applications require long-horizon consistency: the ability to remember prior interactions, respect user preferences, and ground reasoning in past events. However, contemporary memory systems often adopt complex architectures such as knowledge graphs, multi-stage retrieval pipelines, and OS-style schedulers, which introduce engineering complexity and reproducibility challenges. We present ENGRAM, a lightweight memory system that organizes conversation into three canonical memory types (episodic, semantic, and procedural) through a single router and retriever. Each user turn is converted into typed memory records with normalized schemas and embeddings and stored in a database. At query time, the system retrieves top-k dense neighbors for each type, merges results with simple set operations, and provides the most relevant evidence as…
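
The write and read paths described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the keyword-based `route` function, the bag-of-words `embed` function, the `Record` and `MemoryStore` classes, and the `k` and `budget` parameters are all hypothetical stand-ins. A real system would presumably use an LLM or classifier as the router and a learned dense encoder for embeddings.

```python
import math
from collections import Counter
from dataclasses import dataclass

MEMORY_TYPES = ("episodic", "semantic", "procedural")


@dataclass(frozen=True)
class Record:
    record_id: int
    memory_type: str
    text: str
    embedding: tuple  # sorted (token, weight) pairs, L2-normalized


def embed(text):
    """Toy bag-of-words embedding (assumption: stands in for a dense encoder)."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return tuple(sorted((tok, c / norm) for tok, c in counts.items()))


def cosine(a, b):
    """Cosine similarity of two normalized sparse vectors."""
    b_weights = dict(b)
    return sum(w * b_weights.get(tok, 0.0) for tok, w in a)


def route(text):
    """Hypothetical keyword router that assigns a turn to one or more types."""
    lowered = text.lower()
    types = set()
    if any(cue in lowered for cue in ("always", "how to", "steps", "whenever")):
        types.add("procedural")
    if any(cue in lowered for cue in ("yesterday", "last", "went", "today")):
        types.add("episodic")
    return types or {"semantic"}


class MemoryStore:
    def __init__(self):
        self.banks = {t: [] for t in MEMORY_TYPES}
        self._next_id = 0

    def write(self, text):
        """Convert one user turn into typed records and store them."""
        for memory_type in sorted(route(text)):
            record = Record(self._next_id, memory_type, text, embed(text))
            self.banks[memory_type].append(record)
            self._next_id += 1

    def retrieve(self, query, k=2, budget=3):
        """Take top-k neighbors per type, union them, and keep the best few."""
        q = embed(query)
        best = {}  # text -> best score; the dict acts as a deduplicating union
        for memory_type in MEMORY_TYPES:
            scored = sorted(
                ((cosine(q, r.embedding), r.text) for r in self.banks[memory_type]),
                reverse=True,
            )
            for score, text in scored[:k]:
                if score > 0:
                    best[text] = max(score, best.get(text, 0.0))
        ranked = sorted(best.items(), key=lambda item: (-item[1], item[0]))
        return [text for text, _ in ranked[:budget]]


store = MemoryStore()
store.write("I went hiking in Utah last weekend")
store.write("My favorite food is ramen")
store.write("Always reply in bullet points")
print(store.retrieve("What is my favorite food"))  # -> ['My favorite food is ramen']
```

Keeping one bank per memory type means each type contributes its own top-k candidates before the merge, so a large episodic bank cannot crowd out a small procedural one at retrieval time.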

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 2 · Confidence 3

Strengths

* The proposed method shows even better performance than full-context while achieving lower latency.
* The proposed method is simple.
* The ablation study shows that separating episodic, semantic and procedural memory reduces retrieval competition and improves reasoning diversity.

Weaknesses

* While ENGRAM demonstrates impressive results on LoCoMo and LongMemEval, the evaluation scope remains relatively narrow and may overstate its general effectiveness. Both benchmarks are synthetic and constrained to conversational QA settings, which do not fully represent the complexity of long-horizon reasoning in interactive agents. The large performance gap compared to baselines might partly stem from dataset alignment with ENGRAM's design rather than true general improvements in long-term memor

Reviewer 02 · Rating 6 · Confidence 4

Strengths

The design is clear and small: three typed stores, one router, one dense retriever, and a fixed template. This reduces orchestration knobs and makes analyses easier. The formulation spells out record schemas and the retrieval/aggregation steps, including speaker-aware banks and a deterministic template, which helps reproducibility. Empirically, the method delivers strong semantic correctness with a strict token budget. Reporting both judge-based and lexical metrics, plus retrieval and end-to-e

Weaknesses

1. The main metric is LLM-as-a-judge with GPT-4o-mini; while the authors report mean ± sd, judge bias is a risk. A human-rated subset or cross-judge agreement study would raise confidence.
2. Because GPT-4o-mini is the main and only judge, a larger or stronger LLM should be considered.
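
One way the cross-judge agreement study suggested above could be quantified is Cohen's kappa over the binary correct/incorrect verdicts of two judges. The sketch below is illustrative only: the function name and the example verdict lists are hypothetical and are not data from the paper or the reviews.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two raters over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Fraction of items on which the two judges agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance from each judge's label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both judges always emit the same single label
    return (observed - expected) / (1.0 - expected)


# Made-up verdicts (1 = answer judged correct, 0 = judged incorrect).
judge_a = [1, 1, 0, 0, 1, 0, 1, 1]
judge_b = [1, 0, 0, 0, 1, 0, 1, 1]
print(cohens_kappa(judge_a, judge_b))  # -> 0.75
```

A kappa near 1 would indicate that the second judge (human or LLM) largely reproduces the primary judge's verdicts, while a value near 0 would indicate agreement no better than chance.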

Reviewer 03 · Rating 2 · Confidence 4

Strengths

1. The writing of the paper is easy to follow. The authors also provide detailed prompts for each module.
2. Benchmarks on the LoCoMo dataset show better retrieval performance.
3. The system is straightforward to implement and can potentially be extended to different memory types as well.

Weaknesses

1. The memory system treats each input utterance independently of the existing conversation history. When this information is stored in the vector database, each chunk lacks the context of the conversation. During a conversation with an LLM, users often refer back to different components of the chat. The proposed system seems unable to handle these references.
2. The chunks stored in the vector store seem to be independent over time. Recency dependency and the order of the utterance are n

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques