DREAMSTATE: Diffusing States and Parameters for Recurrent Large Language Models

Liu Xiao

arXiv:2601.19221·cs.CL·January 28, 2026

Open Access · 3 Reviews

TL;DR

This paper introduces DREAMSTATE, a framework that models and edits RNN states using diffusion techniques, and proposes a hybrid architecture combining RNNs with global context-aware parameter generation, advancing RNN interpretability and adaptability.

Contribution

It presents the DREAMSTATE framework for modeling RNN states with diffusion models and introduces a hybrid RNN architecture with dynamic, context-aware parameters.

Findings

01

DREAMSTATE effectively models RNN states and enables editing.

02

The hybrid architecture achieves stable training with multi-objective loss.

03

Experimental results validate the model's ability to adapt to global context.

Abstract

Modern Recurrent Neural Networks (RNNs), such as RWKV, are distinguished by their powerful short-range modeling capabilities and efficient fixed-size states, which constitute a core advantage over standard Transformers. However, there is a significant lack of research into their internal state as an editable knowledge representation. To fill this gap, we first explore the representational properties of the RWKV state by proposing the DREAMSTATE framework. This framework utilizes a conditional Diffusion Transformer (DiT) to directly model the probability manifold of the state, enabling its generation and editing. The structural nature of this representation is validated through t-SNE visualizations and controlled generation experiments. After successfully uncovering and modeling the state's representational potential, we further propose a novel hybrid architecture that combines the local…
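To make "diffusing a state" concrete, here is a minimal sketch of the forward (noising) half of a standard DDPM-style diffusion process applied to a fixed-size state. It is not the paper's implementation: the abstract does not give the noise schedule, the state shape, or the conditioning scheme, so the linear schedule, the 64-element flattened state, and the names `linear_betas`, `cumulative_alphas`, and `noise_state` are all assumptions for illustration. The denoising network (the conditional DiT) is omitted; in the usual setup it would be trained to predict `eps` from `noisy`, the step `t`, and the condition.

```python
import math
import random

def linear_betas(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances (a common DDPM default, assumed here)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def cumulative_alphas(betas):
    """alpha_bar[t] = product over s <= t of (1 - beta_s)."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out

def noise_state(state, t, alpha_bar, rng):
    """Sample x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps."""
    eps = [rng.gauss(0.0, 1.0) for _ in state]
    a = math.sqrt(alpha_bar[t])
    s = math.sqrt(1.0 - alpha_bar[t])
    noisy = [a * x + s * e for x, e in zip(state, eps)]
    return noisy, eps

rng = random.Random(0)
alpha_bar = cumulative_alphas(linear_betas(1000))
# Hypothetical stand-in for a flattened recurrent state; a real state would
# be read out of the RNN after it has processed a prompt.
state = [rng.gauss(0.0, 1.0) for _ in range(64)]
noisy, eps = noise_state(state, 500, alpha_bar, rng)
```

Because the state has a fixed size regardless of how many tokens produced it, the same noising step applies to every sample without padding or truncation.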

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 5

Strengths

Novel application of diffusion to RNN states. The paper introduces diffusion models for generating RWKV hidden states (Section 3.1), which differs from prior work on variational state modeling (VRNN, Chung et al. 2015) or weight generation (Neural Network Diffusion, Wang et al. 2024). The conditional generation framework in Equation 3 enables direct manipulation of the state manifold, validated through clustering in Figure 2 where different personas (code, creative writing, planning) form distin

Weaknesses

Experimental validation lacks quantitative results and proper baselines. Section 4.3 claims "training stability" based solely on Figure 3 showing decreasing loss curves, but provides no numerical results for language modeling performance. Table 1 presents only qualitative text samples without perplexity measurements or accuracy on state-tracking tasks (parity, modular arithmetic) mentioned in Section 1. The paper cites Grazzi et al. (2024) regarding RWKV-7's ability to solve parity problems, but

Reviewer 02 · Rating 2 · Confidence 2

Strengths

1) The author provided steps to replicate the results, which is commendable. 2) Using a DiT model to model the RNN parameters is unique and interesting, and the author showed that the model can be trained stably.

Weaknesses

1) The experiments are incomplete. A new model needs to be compared with at least the standard RWKV model. Showing that a model can be trained stably is not enough; its performance must also be shown to be at least comparable with that of other models. 2) Because the new hybrid model can model long context directly, it would be valuable to add benchmarks that showcase this strength.

Reviewer 03 · Rating 2 · Confidence 4

Strengths

The work poses interesting questions regarding the structural noise of fixed parameters. The proposed solution seems feasible and some efforts were made to give evidence for its potential.

Weaknesses

The work lacks empirical evaluation of the performance of the method. Language modeling evaluations are needed. The theoretical contributions are limited to architecture design, but the design is not validated with empirical results. The quantitative evidence for the proposed approach is weak, amounting to a few examples of incoherent text and a low-dimensional visualization of state clusters from similar prompts. No empirical measure of similarity is used to validate the clustering.
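As an illustration of the kind of similarity measure the last point asks for, the sketch below computes a mean silhouette coefficient over labeled points. The reviewer does not name a metric, so the silhouette score is an assumption chosen for illustration, and the toy 2-D points stand in for real states. The labels reuse the persona names that Reviewer 01 mentions for Figure 2. In practice the score would ideally be computed on the raw state vectors rather than on the 2-D t-SNE coordinates, since t-SNE does not preserve global distances.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points; values lie in [-1, 1]."""
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    if len(clusters) < 2:
        raise ValueError("need at least two clusters")
    total = 0.0
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            continue  # a singleton cluster contributes 0 by convention
        # a: mean distance to the other members of the point's own cluster
        a = sum(euclidean(points[i], points[j]) for j in own) / len(own)
        # b: mean distance to the nearest other cluster
        b = min(
            sum(euclidean(points[i], points[j]) for j in idx) / len(idx)
            for other, idx in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(points)

# Toy data: two tight, well-separated groups (hypothetical, not from the paper).
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
good = silhouette_score(points, ["code", "code", "planning", "planning"])
bad = silhouette_score(points, ["code", "planning", "code", "planning"])
```

A score near 1 indicates that states sharing a label sit much closer to each other than to other groups. A score near or below 0, as with the mismatched labels in `bad`, indicates that the labels do not correspond to real structure.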

Taxonomy

Topics: Topic Modeling · Advanced Graph Neural Networks · Generative Adversarial Networks and Image Synthesis