ABBEL: LLM Agents Acting through Belief Bottlenecks Expressed in Language

Aly Lidayan, Jakob Bjorner, Satvik Golechha, Kartik Goyal, and Alane Suhr

arXiv:2512.20111·cs.CL·December 24, 2025

Open Access · 3 Reviews

TL;DR

This paper introduces ABBEL, a framework enabling LLM agents to maintain concise, interpretable belief states in language, reducing memory use in multi-step tasks and improving performance through reinforcement learning.

Contribution

The paper proposes a novel belief bottleneck framework for LLM agents, combining language-based belief states with RL training to enhance efficiency and interpretability in sequential decision-making.

Findings

1. ABBEL maintains near-constant memory over steps.
2. RL improves belief quality and task performance.
3. Belief-based agents outperform full-context agents in memory efficiency.

Abstract

As the length of sequential decision-making tasks increases, it becomes computationally impractical to keep full interaction histories in context. We introduce a general framework for LLM agents to maintain concise contexts through multi-step interaction: Acting through Belief Bottlenecks Expressed in Language (ABBEL), and methods to further improve ABBEL agents with RL post-training. ABBEL replaces long multi-step interaction history by a belief state, i.e., a natural language summary of what has been discovered about task-relevant unknowns. Under ABBEL, at each step the agent first updates a prior belief with the most recent observation from the environment to form a posterior belief, then uses only the posterior to select an action. We systematically evaluate frontier models under ABBEL across six diverse multi-step environments, finding that ABBEL supports generating interpretable…
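The update-then-act loop described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the authors' implementation: the toy `GuessEnv` number-guessing environment, the belief string format `"secret in LO..HI"`, and the rule-based `update_belief` and `select_action` functions, which stand in for what would be two LLM calls in an actual ABBEL agent. The point is only the information flow: the posterior belief is the sole state carried between steps, and the action is chosen from the posterior alone.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class GuessEnv:
    """Hypothetical toy environment: guess a secret integer in [1, 100]."""
    secret: int

    def step(self, guess: int) -> Tuple[str, bool]:
        # The observation is plain text, as it would be for an LLM agent.
        if guess == self.secret:
            return f"{guess} is correct", True
        hint = "too low" if guess < self.secret else "too high"
        return f"{guess} is {hint}", False


def parse_belief(belief: str) -> Tuple[int, int]:
    # Assumed belief format: "secret in LO..HI".
    lo, hi = belief.split()[-1].split("..")
    return int(lo), int(hi)


def update_belief(prior: str, observation: str) -> str:
    """Stand-in for the LLM belief update: prior + latest observation -> posterior."""
    lo, hi = parse_belief(prior)
    guess = int(observation.split()[0])
    if observation.endswith("too low"):
        lo = max(lo, guess + 1)
    elif observation.endswith("too high"):
        hi = min(hi, guess - 1)
    else:  # "correct"
        lo = hi = guess
    return f"secret in {lo}..{hi}"


def select_action(posterior: str) -> int:
    """Stand-in for the LLM action call: it sees only the posterior belief."""
    lo, hi = parse_belief(posterior)
    return (lo + hi) // 2


def run_abbel(env: GuessEnv, max_steps: int = 20) -> Tuple[Optional[int], int, List[int]]:
    belief = "secret in 1..100"   # initial prior
    observation: Optional[str] = None
    context_sizes: List[int] = []  # characters of state carried at each step
    for step in range(1, max_steps + 1):
        if observation is not None:
            belief = update_belief(belief, observation)  # prior -> posterior
        context_sizes.append(len(belief))
        action = select_action(belief)                   # act from posterior only
        observation, done = env.step(action)
        if done:
            return action, step, context_sizes
    return None, max_steps, context_sizes


if __name__ == "__main__":
    guess, steps, sizes = run_abbel(GuessEnv(secret=37))
    print(guess, steps, sizes)
```

Because the interaction history is never appended to the context, the carried state in this sketch stays bounded by the length of one belief string no matter how many steps are taken. A full-history agent would instead accumulate every past observation.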

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 3

Strengths

1. Uses a belief state to compress the interaction history so that the LLM can sample effective actions, making the internal state compact and inspectable.
2. RL with belief grading improves a small language model on multi-objective QA tasks. Using a model to parse the belief into characters and then comparing it with the ground-truth posterior is a simple yet effective solution.

Weaknesses

1. RL with belief grading relies heavily on the ground-truth posterior. It is unclear how robust the grading signal will be in complex, non-synthetic settings where ground-truth posteriors are not computable (as mentioned in the "Limitations" section).
2. Only one benchmark appears in the main text. I would expect more benchmarks for RL + belief grading, such as WebShop [1], as the Mem1 [3] authors used, and perhaps also ALFWorld [2], a text-based environment for agents to reason and interact with.
3. It see

Reviewer 02 · Rating 4 · Confidence 3

Strengths

* Clear, interpretable bottleneck: separating stored “belief” from transient reasoning is simple, model-agnostic, and yields near-constant memory across steps while often reducing tokens and action-side reasoning.
* Solid empirical sweep and diagnostics: six environments, ablations (vanilla / belief-prompting / ABBEL), and candid analysis of failure modes (propagated belief errors, hallucinated past steps).
* RL contributions are practical: outcome-based RL recovers most performance; belie

Weaknesses

* Novelty/positioning: very close to prior “learned memory” agents (MEM1/VeRL/rLLM); the belief–reasoning split reads as incremental rather than fundamentally new. Missing/under-cited contemporaries (e.g., MemAgent) weaken SOTA claims.
* Baselines/fairness: QA compares ABBEL-RL (7B) to an untrained 14B full-history model; no apples-to-apples 7B full-history RL baseline reported. Combination-Lock gains hinge on a toy setting and ground-truth belief grading; generalization to realistic tas

Reviewer 03 · Rating 4 · Confidence 4

Strengths

- Improving the performance of LLMs in multi-turn interactions is an interesting problem, but this reviewer is not fully convinced of the novelty or significance of this work due to limited empirical demonstrations (see Weaknesses).
- Clarity: The writing is clear, and Figure 1 clearly illustrates the difference between ABBEL and the existing approaches (vanilla and belief prompting).

Weaknesses

1. Questions about the effectiveness of belief-bottlenecked policies:
   - Figure 2 shows that belief-bottlenecked models perform significantly worse than models that use the full interaction history or that incorporate both belief and past history. Given the efforts and advances in increasing context lengths for newer models, it is unclear what advantages belief-bottlenecked models offer that long-context models cannot provide. The performance improvement achieved through post-training (when comparing amo

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Multimodal Machine Learning Applications · Topic Modeling