Flow of Spans: Generalizing Language Models to Dynamic Span-Vocabulary via GFlowNets
Bo Xue, Yunchong Song, Fanghao Shao, Xuekai Zhu, Lin Chen, Luoyi Fu, Xinbing Wang, Zhouhan Lin

TL;DR
This paper introduces FoSS, a GFlowNets-based framework for dynamic span generation in language models, enabling exploration of diverse compositional paths and improving text quality and task performance.
Contribution
It proposes a novel GFlowNets approach for span-based language modeling with a DAG-structured state space, enhancing diversity and generalization over traditional token-level models.
Findings
Up to 12.5% improvement in MAUVE scores on text generation
3.5% gains on knowledge-intensive tasks
Better scalability with larger models and richer data
Abstract
Standard autoregressive language models generate text token-by-token from a fixed vocabulary, inducing a tree-structured state space when viewing token sampling as an action, which limits flexibility and expressiveness. Recent work introduces dynamic vocabulary by sampling retrieved text spans but overlooks that the same sentence can be composed of spans of varying lengths, and so lacks explicit modeling of the directed acyclic graph (DAG) state space. This restricts exploration of compositional paths and biases the model toward the chosen path. Generative Flow Networks (GFlowNets) are powerful for efficiently exploring and generalizing over state spaces, particularly those with a DAG structure. However, prior GFlowNets-based language models operate at the token level and remain confined to tree-structured spaces, limiting their potential. In this work, we propose Flow of SpanS (FoSS), a…
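The tree-versus-DAG distinction in the abstract can be made concrete with a small counting exercise. The sketch below is only an illustration, not the FoSS method: the function name `count_span_paths`, the toy sentence, and the toy span vocabulary are all assumptions made for this example. It counts how many distinct span sequences build each prefix of a sentence. With a token-only vocabulary every prefix is reached by exactly one path (a tree); once multi-token spans are allowed, the same prefix is reachable along several paths (a DAG).

```python
def count_span_paths(tokens, span_vocab, max_span_len=3):
    """Return paths[i] = number of span sequences that concatenate to tokens[:i].

    Hypothetical helper for illustration; `span_vocab` is a set of token tuples.
    """
    n = len(tokens)
    paths = [0] * (n + 1)
    paths[0] = 1  # the empty prefix is the single root state
    for end in range(1, n + 1):
        # Try every span that could end at position `end`.
        for start in range(max(0, end - max_span_len), end):
            if tuple(tokens[start:end]) in span_vocab:
                # Each way of reaching `start` extends to a way of reaching `end`.
                paths[end] += paths[start]
    return paths


tokens = ["the", "cat", "sat"]

# Token-level vocabulary: only single-token actions (assumed toy data).
token_vocab = {("the",), ("cat",), ("sat",)}

# Span vocabulary: single tokens plus some multi-token spans (assumed toy data).
span_vocab = token_vocab | {
    ("the", "cat"),
    ("cat", "sat"),
    ("the", "cat", "sat"),
}

print(count_span_paths(tokens, token_vocab))  # [1, 1, 1, 1]
print(count_span_paths(tokens, span_vocab))   # [1, 1, 2, 4]
```

In this toy case the full sentence has one token-level path but four span-level paths, and the intermediate state "the cat" already has two parents. A model trained on only one of those segmentations sees a single path through the DAG, which is the bias the abstract describes.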
Peer Reviews
Decision: ICLR 2026 Poster
- Casting span generation as an explicit DAG and optimizing it with a GFlowNet is a clean way to expose multiple compositional paths, addressing a known bias of tree-structured token decoders. I find the core idea both interesting and promising.
- The experiments are comprehensive and the results are highly positive. FoSS improves MAUVE both in-domain and out-of-domain (Table 1), shows positive GPT-4 preferences (Table 2), and is competitive on QA tasks (Table 3). The ablation study showing a d
- Section 4.3 (“Scaling behavior”) is somewhat misleading. In Figures 2 and 3, the x-axis is rendered with equal spacing while the labels are logarithmically spaced (0.001 → 1.0 with tripling and 0.47 → 15 with doubling). Please use a logarithmic x-axis and plot against numeric proportions to avoid misrepresenting the trends. Figure 4 should use the number of parameters as the x-axis instead of model names. The sizes for GPT-2 Small, Medium, Large, and XL are 124M, 355M, 774M, and 1.5B, respecti
* Overall, I think this is a great paper. The idea is a very neat one, the sort that comes from recognizing the structure of data and realizing what this implies for methods to apply.
* The experiments are quite comprehensive, and show the effectiveness of FoSS, with e.g. quite large improvements in MAUVE scores (Table 1) and strong preferences for FoSS text (Table 2). Beyond the values of the metrics, the case study in Fig 5 in the appendix gives a feel for how much better the text generated
Nothing at all major.
* There are a few places where some additional remarks on setup could help. For instance, there’s no discussion of greedy vs. nucleus decoding in terms of setup, what they might tell us, etc.; the numbers are just presented in Table 1. Table captions in general could be a bit more informative.
1. The paper connects well with relevant literature. Each component of the proposed FoSS model is discussed in relation to previous studies, making the distinctions from existing work clear.
2. While the method combines ideas from GFlowNets and dynamic span vocabularies (neither originally introduced by the authors), it tackles non-trivial challenges in integrating these concepts. The authors address these challenges in a thoughtful and technically sound manner.
3. The paper provides detailed exp
1. Lack of comparison with more recent and important baselines. As mentioned by the authors in the related work, two recent and important studies on dynamic vocabularies [1][2] were not included in the main experimental comparisons.
2. For scalability, the largest model evaluated in the scalability experiments is GPT2-XL (~1.5B parameters). It would be valuable to test the approach on larger and more recent models such as LLaMA 3 or Qwen 3. If direct training is infeasible due to resource constra
Taxonomy
Topics: Topic Modeling · Text Readability and Simplification · Natural Language Processing Techniques
