The Mouth is Not the Brain: Bridging Energy-Based World Models and Language Generation
Junichiro Niimi

TL;DR
This paper introduces a modular architecture that separates world modeling from language generation, demonstrating improved controllability and coherence in text output by connecting a domain-specific energy-based world model with a frozen language model.
Contribution
It proposes a novel framework that explicitly decouples world understanding from language modeling, enabling better control and coherence in generated text.
Findings
World-model conditioning lowers cross-entropy loss and raises semantic similarity relative to architectural baselines, including direct projection and full fine-tuning.
The energy function distinguishes plausible from implausible configurations.
Causal interventions on attributes shift the generated text in a statistically consistent manner.
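To make the second finding concrete, here is a minimal sketch of how an energy function can separate plausible from implausible configurations. It uses the standard energy of a Boltzmann machine with two hidden layers; the weights, layer sizes, and attribute vectors are hand-picked toy values for illustration and are not taken from the paper.

```python
import numpy as np

def dbm_energy(v, h1, h2, W1, W2, b, c1, c2):
    """Energy of a Boltzmann machine with two hidden layers.

    E(v, h1, h2) = -(v^T W1 h1 + h1^T W2 h2 + b^T v + c1^T h1 + c2^T h2)
    Lower energy means the joint configuration is more plausible.
    """
    return -(v @ W1 @ h1 + h1 @ W2 @ h2 + b @ v + c1 @ h1 + c2 @ h2)

# Toy weights (assumption): hidden unit 0 favours attributes 0 and 1 together,
# hidden unit 1 favours attributes 2 and 3 together.
W1 = np.array([[2.0, -2.0],
               [2.0, -2.0],
               [-2.0, 2.0],
               [-2.0, 2.0]])
W2 = np.array([[1.0],
               [1.0]])
b, c1, c2 = np.zeros(4), np.zeros(2), np.zeros(1)

h1 = np.array([1.0, 0.0])
h2 = np.array([1.0])

# Attributes 0 and 1 co-occur, matching what hidden unit 0 encodes.
v_plausible = np.array([1.0, 1.0, 0.0, 0.0])
# Attributes 0 and 2 co-occur, which conflicts with the same hidden state.
v_implausible = np.array([1.0, 0.0, 1.0, 0.0])

e_plausible = dbm_energy(v_plausible, h1, h2, W1, W2, b, c1, c2)
e_implausible = dbm_energy(v_implausible, h1, h2, W1, W2, b, c1, c2)
```

In this toy setting the consistent attribute pattern receives the lower energy, which is the kind of separation the finding describes. A trained model would learn the weights from data rather than have them set by hand.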
Abstract
Large Language Models (LLMs) generate fluent text, yet whether they truly understand the world or merely produce plausible text about it remains contested. We propose an architectural principle, the mouth is not the brain, that explicitly separates world models from language models. Our architecture comprises three components: a Deep Boltzmann Machine (DBM) that captures domain structure as an energy-based world model, an adapter that projects latent belief states into embedding space, and a frozen GPT-2 that provides linguistic competence without domain knowledge. We instantiate this framework in the consumer review domain using Amazon smartphone reviews. Experiments demonstrate that (1) world model conditioning achieves lower cross-entropy loss and higher semantic similarity than architectural baselines including direct projection and full fine-tuning, while qualitative analysis reveals that soft prompt…
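The three-component design in the abstract can be sketched as a data flow: a latent belief state from the world model passes through an adapter and is prepended, as a soft prompt, to the token embeddings that a frozen language model consumes. The sketch below is an assumption-laden illustration, not the paper's implementation: the linear adapter, the prompt length `N_PROMPT`, the latent size `LATENT_DIM`, and the random stand-in for token embeddings are all hypothetical. `D_MODEL = 768` matches the hidden size of the smallest GPT-2 checkpoint; which GPT-2 size the paper uses is not stated here.

```python
import numpy as np

D_MODEL = 768      # hidden size of the smallest GPT-2 checkpoint
N_PROMPT = 4       # number of soft-prompt vectors (hypothetical)
LATENT_DIM = 16    # size of the world model's belief state (hypothetical)

rng = np.random.default_rng(0)

# Adapter parameters: the only trainable part in this sketch.
# The language model itself would stay frozen.
A = rng.normal(scale=0.02, size=(LATENT_DIM, N_PROMPT * D_MODEL))
a_bias = np.zeros(N_PROMPT * D_MODEL)

def belief_to_soft_prompt(z):
    """Project a latent belief state into N_PROMPT embedding-space vectors."""
    return (z @ A + a_bias).reshape(N_PROMPT, D_MODEL)

def build_inputs(z, token_embeddings):
    """Prepend the soft prompt to the token embeddings along the sequence axis."""
    return np.concatenate([belief_to_soft_prompt(z), token_embeddings], axis=0)

# Stand-ins: a belief state (e.g. hidden-unit activations of the world model)
# and the embeddings of a 10-token review prefix.
z = rng.random(LATENT_DIM)
token_embeddings = rng.normal(size=(10, D_MODEL))

inputs = build_inputs(z, token_embeddings)
```

With a real model, a sequence like `inputs` would be passed to the language model as input embeddings rather than token ids (for example through the `inputs_embeds` argument of Hugging Face `transformers` models), and gradients would flow only into the adapter.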
