Eliminating Agentic Workflow for Introduction Generation with Parametric Stage Tokens
Meicong Zhang, Tiancheng Su, Guoxiu He

TL;DR
This paper introduces STIG, a novel method that encodes the logical structure of introduction generation directly into LLMs using stage tokens, enabling single-inference multi-stage writing without external workflows.
Contribution
The paper proposes STIG, a parametric approach that replaces external agentic workflows with stage tokens embedded in the model, improving coherence and efficiency in introduction generation.
Findings
STIG outperforms traditional workflows on semantic similarity.
STIG produces more structurally rational introductions.
Single-inference generation reduces complexity and errors.
Abstract
In recent years, using predefined agentic workflows to guide large language models (LLMs) for literature classification and review has become a research focus. However, writing research introductions is more challenging. It requires rigorous logic, coherent structure, and abstract summarization. Existing workflows often suffer from long reasoning chains, error accumulation, and reduced textual coherence. To address these limitations, we propose eliminating external agentic workflows. Instead, we directly parameterize their logical structure into the LLM. This allows the generation of a complete introduction in a single inference. To this end, we introduce the Stage Token for Introduction Generation (STIG). STIG converts the multiple stages of the original workflow into explicit stage signals. These signals guide the model to follow different logical roles and functions during…
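To make the idea of explicit stage signals concrete, here is a minimal sketch of how per-stage text could be serialized into one target string and split apart again after a single generation pass. It is an illustration only, not the authors' implementation: the stage names, the token spellings, the open/close pairing, and the helpers `build_target` and `split_stages` are all hypothetical. The pairing is one possible reading of the reviewers' remark that eight stage tokens cover four subsections.

```python
import re
from typing import Dict, List

# ASSUMPTION: four stages, loosely following the subsection names mentioned
# in the reviews. The paper's actual stage names are not given here.
STAGES: List[str] = ["background", "problem", "method", "results"]


def open_tok(stage: str) -> str:
    """Hypothetical opening marker for a stage."""
    return f"<{stage}>"


def close_tok(stage: str) -> str:
    """Hypothetical closing marker for a stage."""
    return f"</{stage}>"


# ASSUMPTION: one open and one close marker per stage, giving eight tokens.
STAGE_TOKENS: List[str] = [
    tok for s in STAGES for tok in (open_tok(s), close_tok(s))
]


def build_target(sections: Dict[str, str]) -> str:
    """Serialize per-stage text into a single stage-tagged training target."""
    parts = []
    for stage in STAGES:
        text = sections[stage].strip()
        parts.append(f"{open_tok(stage)}{text}{close_tok(stage)}")
    return "".join(parts)


def split_stages(generated: str) -> Dict[str, str]:
    """Recover per-stage text from one generated, stage-tagged string."""
    out: Dict[str, str] = {}
    for stage in STAGES:
        pattern = (
            re.escape(open_tok(stage)) + r"(.*?)" + re.escape(close_tok(stage))
        )
        match = re.search(pattern, generated, flags=re.DOTALL)
        if match:
            out[stage] = match.group(1).strip()
    return out
```

In a real fine-tuning setup, markers like these would typically be registered as special tokens in the tokenizer so that each one maps to a single learned embedding. That step is omitted here to keep the sketch dependency-free.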
Peer Reviews
Decision · Submitted to ICLR 2026
- The paper proposes STIG, a method that collapses the multi-stage agentic workflow for research writing into a single inference pass.
- The paper constructs a high-quality dataset from over 3,800 scientific papers from ACL main conferences, utilizing MinerU, GPT-4o, and the Semantic Scholar API.
The evaluation metrics are insufficient. The reliance on automated metrics without human validation means we don't actually know if STIG produces good introductions. We only know that it produces text that scores well on these specific metrics. Furthermore, for scientific writing, one of the most important things to evaluate is factual accuracy (whether the claims in the introduction are accurate, whether there is fabricated content, etc.). Furthermore, I am not sur
1. STIG outperforms several training-free agentic baselines (AutoSurvey, Outline-Writing) in structural rationality and content coverage while using fewer tokens.
2. First work to parameterise an entire writing workflow into stage tokens.
3. Contributes a customized dataset tailored for training and testing introduction generation, derived from over 3,800 ACL main conference papers.
1. Trained only on ACL NLP papers; no CV, Theory or other domains. Claims “research introductions” but evidence is NLP-only (ACL).
2. The eight stage tokens are defined for the four subsections that appear tailored to research-style papers; however, ACL also contains many dataset papers whose introductions do not necessarily follow the Background–Problem & Limitations–Method & Experimental Results & Contributions structure.
3. The 'SR' metric is aligned with STIG’s own staged structure, mak
S1. The paper introduces an original idea that embeds workflow logic directly into model parameters via parametric stage tokens.
S2. This approach reduces multi-agent dependency and improves inference efficiency in structured text generation.
S3. The dataset of 3,800 ACL papers is large and well-structured. It represents a meaningful contribution for future research in the field.
S4. Five multi-dimensional evaluation metrics comprehensively capture semantic, structural, and narrative quality.
W1. The paper lacks clarity in distinguishing between the STIG framework and the STIG fine-tuned model. While the conclusion claims STIG eliminates agentic workflows, the framework still depends on them for data construction and stage definition.
W2. Fine-tuning details are incomplete. No hyperparameter settings, training configurations, or sensitivity analyses are reported.
W3. A hyperparameter study is essential to confirm the stability of stage-token fine-tuning. All five evaluation metric
Taxonomy
Topics: Topic Modeling · Machine Learning in Materials Science · Scientific Computing and Data Management
