ShardMemo: Masked MoE Routing for Sharded Agentic LLM Memory
Yang Zhao, Chengxiao Dai, Yue Xiu, Mengying Kou, Yuliang Zheng, Dusit Niyato

TL;DR
ShardMemo introduces a masked MoE routing approach for sharded agentic LLM memory, improving retrieval efficiency and accuracy in multi-agent and long-horizon tasks through structured eligibility constraints and cost-aware gating.
Contribution
ShardMemo proposes a tiered memory system that uses masked MoE routing for shard selection, and it reports gains over baseline methods on benchmarks including LoCoMo.
Findings
Improves F1 scores on LoCoMo by +5.11 to +6.82
Reduces retrieval work by 20.5% and latency by 20 ms
Achieves high precision and step reduction on ToolBench
Abstract
Agentic large language model (LLM) systems rely on external memory for long-horizon state and concurrent multi-agent execution, but centralized indexes and heuristic partitions become bottlenecks as memory volume and parallel access grow. We present ShardMemo, a budgeted tiered memory service with Tier A per-agent working state, Tier B sharded evidence with shard-local approximate nearest neighbor (ANN) indexes, and Tier C, a versioned skill library. Tier B enforces scope-before-routing: structured eligibility constraints mask ineligible shards before routing or ANN search. We cast shard probing as masked mixture-of-experts (MoE) routing over eligible shards, probing up to a budgeted number of shards chosen by either a fixed-size or an adaptive top-ranked cutoff, and use cost-aware gating over profile/observation/session shard families; the router is trained from evidence-to-shard supervision. On…
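The scope-before-routing idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names, the `budget`, `mass`, `costs`, and `cost_weight` parameters, and the use of a softmax over router scores are all assumptions made for illustration. The sketch masks ineligible shards first, then selects shards to probe either by a fixed-size cutoff or by an adaptive cutoff that stops once enough probability mass is covered.

```python
import math
from typing import List, Optional, Sequence


def masked_softmax(scores: Sequence[float], eligible: Sequence[bool]) -> List[float]:
    """Softmax over eligible shards only; ineligible shards get probability 0."""
    if len(scores) != len(eligible):
        raise ValueError("scores and eligible must have the same length")
    live = [s for s, ok in zip(scores, eligible) if ok]
    if not live:
        return [0.0] * len(scores)
    peak = max(live)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) if ok else 0.0 for s, ok in zip(scores, eligible)]
    total = sum(exps)
    return [e / total for e in exps]


def _ranked_eligible(probs: Sequence[float], eligible: Sequence[bool]) -> List[int]:
    """Indices of eligible shards, highest routing probability first."""
    candidates = [i for i, ok in enumerate(eligible) if ok]
    return sorted(candidates, key=lambda i: probs[i], reverse=True)


def route_fixed(
    scores: Sequence[float],
    eligible: Sequence[bool],
    budget: int,
    costs: Optional[Sequence[float]] = None,
    cost_weight: float = 0.0,
) -> List[int]:
    """Fixed-size cutoff: probe at most `budget` eligible shards.

    Hypothetical cost-aware variant: each score is reduced by
    cost_weight * cost before the masked softmax.
    """
    if costs is not None:
        scores = [s - cost_weight * c for s, c in zip(scores, costs)]
    probs = masked_softmax(scores, eligible)
    return _ranked_eligible(probs, eligible)[:budget]


def route_adaptive(
    scores: Sequence[float],
    eligible: Sequence[bool],
    budget: int,
    mass: float,
) -> List[int]:
    """Adaptive cutoff: add shards until cumulative probability reaches
    `mass`, never exceeding `budget` shards."""
    probs = masked_softmax(scores, eligible)
    chosen: List[int] = []
    covered = 0.0
    for i in _ranked_eligible(probs, eligible):
        if len(chosen) >= budget:
            break
        chosen.append(i)
        covered += probs[i]
        if covered >= mass:
            break
    return chosen


if __name__ == "__main__":
    scores = [2.0, 5.0, 1.0, 0.5]
    eligible = [True, False, True, True]  # shard 1 is out of scope
    print(route_fixed(scores, eligible, budget=2))
    print(route_adaptive(scores, eligible, budget=3, mass=0.6))
```

In this toy example, shard 1 has the highest raw score but is never probed because the eligibility mask is applied before ranking. The adaptive variant probes fewer shards when the router is confident, which is one plausible way a routing budget could translate into less retrieval work.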
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques
