ABBEL: LLM Agents Acting through Belief Bottlenecks Expressed in Language
Aly Lidayan, Jakob Bjorner, Satvik Golechha, Kartik Goyal, and Alane Suhr

TL;DR
This paper introduces ABBEL, a framework enabling LLM agents to maintain concise, interpretable belief states in language, reducing memory use in multi-step tasks and improving performance through reinforcement learning.
Contribution
The paper proposes a novel belief bottleneck framework for LLM agents, combining language-based belief states with RL training to enhance efficiency and interpretability in sequential decision-making.
Findings
ABBEL maintains near-constant memory over steps.
RL improves belief quality and task performance.
Belief-based agents are more memory-efficient than full-context agents.
Abstract
As the length of sequential decision-making tasks increases, it becomes computationally impractical to keep full interaction histories in context. We introduce a general framework for LLM agents to maintain concise contexts through multi-step interaction: Acting through Belief Bottlenecks Expressed in Language (ABBEL), and methods to further improve ABBEL agents with RL post-training. ABBEL replaces long multi-step interaction history by a belief state, i.e., a natural language summary of what has been discovered about task-relevant unknowns. Under ABBEL, at each step the agent first updates a prior belief with the most recent observation from the environment to form a posterior belief, then uses only the posterior to select an action. We systematically evaluate frontier models under ABBEL across six diverse multi-step environments, finding that ABBEL supports generating interpretable…
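The update-then-act loop described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the number-guessing environment, the belief wording, and the functions `update_belief`, `select_action`, `env_step`, and `run_episode` are all hypothetical, and simple rule-based functions stand in for what would be LLM calls in an actual ABBEL agent. The point it shows is that the action is chosen from the posterior belief alone, so the text carried between steps stays roughly the same size instead of growing with the history.

```python
from typing import List, Optional, Tuple


def parse_belief(belief: str) -> Tuple[int, int]:
    """Read the bounds back out of a belief like 'The secret is between 1 and 100.'"""
    words = belief.rstrip(".").split()
    return int(words[4]), int(words[6])


def update_belief(prior: str, action: Optional[int], observation: str) -> str:
    """Stand-in for an LLM call: prior belief + latest observation -> posterior belief."""
    lo, hi = parse_belief(prior)
    if action is not None:
        if observation == "higher":
            lo = max(lo, action + 1)
        elif observation == "lower":
            hi = min(hi, action - 1)
        elif observation == "correct":
            lo = hi = action
    return f"The secret is between {lo} and {hi}."


def select_action(posterior: str) -> int:
    """Stand-in for an LLM call: sees ONLY the posterior belief, never the history."""
    lo, hi = parse_belief(posterior)
    return (lo + hi) // 2


def env_step(secret: int, guess: int) -> str:
    """Toy environment (assumption): reports whether the secret is higher or lower."""
    if guess == secret:
        return "correct"
    return "higher" if secret > guess else "lower"


def run_episode(secret: int, lo: int = 1, hi: int = 100,
                max_steps: int = 20) -> Tuple[int, List[int]]:
    """Run the loop and record the belief length (in characters) at every step."""
    belief = f"The secret is between {lo} and {hi}."
    action, observation = None, "start"
    belief_lengths = []
    for step in range(1, max_steps + 1):
        # 1) Update the prior with the most recent observation to form the posterior.
        belief = update_belief(belief, action, observation)
        belief_lengths.append(len(belief))
        # 2) Act using only the posterior; earlier observations are discarded.
        action = select_action(belief)
        observation = env_step(secret, action)
        if observation == "correct":
            return step, belief_lengths
    return max_steps, belief_lengths


if __name__ == "__main__":
    steps, lengths = run_episode(secret=37)
    print(f"solved in {steps} steps; belief lengths per step: {lengths}")
```

In this toy setting the belief is a lossless summary of the history, which is the favorable case. With an LLM producing the belief, a mistaken update would carry forward into later steps, which is the propagated-belief-error failure mode raised in the reviews.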
Peer Reviews
Decision: Submitted to ICLR 2026
1. Uses a belief state to compress the interaction history and help the LLM sample effective actions, making the internal state compact and inspectable. 2. RL with belief grading improves a small language model on multi-objective QA tasks. Using a model to parse the belief into characters and then comparing it with the ground-truth posterior is a simple yet effective solution.
1. RL with belief grading relies heavily on the ground-truth posterior. It’s unclear how robust grading signals will be in complex, non-synthetic settings where ground-truth posteriors aren’t computable (as mentioned in the "Limitations" section). 2. Only one benchmark appears in the main text. I would expect more benchmarks for RL + belief grading, such as WebShop [1], as the Mem1 [3] authors used, and perhaps also ALFWorld [2], a text-based environment for agents to reason and interact with. 3. It see
* Clear, interpretable bottleneck: separating stored “belief” from transient reasoning is simple, model-agnostic, and yields near-constant memory across steps while often reducing tokens and action-side reasoning. * Solid empirical sweep and diagnostics: six environments, ablations (vanilla / belief-prompting / ABBEL), and candid analysis of failure modes (propagated belief errors, hallucinated past steps). * RL contributions are practical: outcome-based RL recovers most performance; belie
* Novelty/positioning: very close to prior “learned memory” agents (MEM1/VeRL/rLLM); the belief–reasoning split reads as incremental rather than fundamentally new. Missing/under-cited contemporaries (e.g., MemAgent) weaken SOTA claims. * Baselines/fairness: QA compares ABBEL-RL (7B) to an untrained 14B full-history model; no apples-to-apples 7B full-history RL baseline reported. Combination-Lock gains hinge on a toy setting and ground-truth belief grading; generalization to realistic tas
- Improving the performance of LLMs in multi-turn interactions is an interesting problem, but this reviewer is not fully convinced of the novelty or significance of this work due to limited empirical demonstrations (see Weaknesses). - Clarity: The writing is clear, and Figure 1 clearly illustrates the difference between ABBEL and the existing approaches (vanilla and belief prompting).
1: Questions about the effectiveness of belief-bottlenecked policies: - Figure 2 shows that belief-bottlenecked models perform significantly worse than full interaction-based models or models that incorporate both belief and past history. Given the efforts and advances in increasing context lengths for newer models, it is unclear what advantages belief-bottlenecked models offer that long-context models cannot handle. The performance improvement achieved through post-training (when comparing amo
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Multimodal Machine Learning Applications · Topic Modeling
