TermiGen: High-Fidelity Environment and Robust Trajectory Synthesis for Terminal Agents

Kaijie Zhu; Yuzhou Nie; Yijiang Li; Yiming Huang; Jialian Wu; Jiang Liu; Ximeng Sun; Zhenfei Yin; Lun Wang; Zicheng Liu; Emad Barsoum; William Yang Wang; Wenbo Guo

arXiv:2602.07274·cs.AI·February 10, 2026

Open Access

TL;DR

TermiGen is an end-to-end pipeline for synthesizing high-fidelity, verifiable environments and resilient expert trajectories. It improves open-weight LLMs' ability to execute complex terminal tasks by reducing hallucinations and strengthening error recovery.

Contribution

The paper presents an end-to-end method for synthesizing environments and trajectories that include error correction, and reports state-of-the-art performance on terminal task benchmarks.

Findings

01. Achieved 31.3% pass rate on TerminalBench with TermiGen-Qwen2.5-Coder-32B.

02. Outperformed existing baselines and proprietary models like o4-mini.

03. Generated diverse, verifiable environments and error-rich trajectories for training.

Abstract

Executing complex terminal tasks remains a significant challenge for open-weight LLMs, constrained by two fundamental limitations. First, high-fidelity, executable training environments are scarce: environments synthesized from real-world repositories are neither diverse nor scalable, while trajectories synthesized by LLMs suffer from hallucinations. Second, standard instruction tuning uses expert trajectories that rarely exhibit the simple mistakes common to smaller models. This creates a distributional mismatch, leaving student models ill-equipped to recover from their own runtime failures. To bridge these gaps, we introduce TermiGen, an end-to-end pipeline for synthesizing verifiable environments and resilient expert trajectories. TermiGen first generates functionally valid tasks and Docker containers via an iterative multi-agent refinement loop. Subsequently, we employ a Generator-Critic…
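The abstract mentions an iterative refinement loop for producing functionally valid environments. The sketch below illustrates only the general validate-then-revise pattern, not TermiGen's actual implementation. Every name in it (`Candidate`, `refine_until_valid`, `toy_validate`, `toy_revise`) is hypothetical, and the toy string checks stand in for whatever agents and container builds the real pipeline uses.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Candidate:
    """A hypothetical environment draft (here: Dockerfile-like text)."""
    spec: str
    revision: int = 0
    feedback: List[str] = field(default_factory=list)


def refine_until_valid(
    initial: Candidate,
    validate: Callable[[Candidate], List[str]],
    revise: Callable[[Candidate, List[str]], Candidate],
    max_rounds: int = 5,
) -> Optional[Candidate]:
    """Alternate validation and revision until no problems remain.

    Returns the first candidate that validates cleanly, or None if the
    revision budget is exhausted first.
    """
    candidate = initial
    for _ in range(max_rounds):
        problems = validate(candidate)
        if not problems:
            return candidate
        candidate = revise(candidate, problems)
    # One last check on the final revision.
    return candidate if not validate(candidate) else None


def toy_validate(c: Candidate) -> List[str]:
    """Toy stand-in for a real build/verification step (assumption)."""
    lines = c.spec.splitlines()
    problems = []
    if not any(line.startswith("FROM ") for line in lines):
        problems.append("missing FROM")
    if not any(line.startswith("CMD ") for line in lines):
        problems.append("missing CMD")
    return problems


def toy_revise(c: Candidate, problems: List[str]) -> Candidate:
    """Toy stand-in for a revising agent: fixes one problem per round."""
    first = problems[0]
    if first == "missing FROM":
        spec = "FROM ubuntu:22.04\n" + c.spec
    else:
        spec = c.spec + '\nCMD ["bash"]'
    return Candidate(spec=spec, revision=c.revision + 1,
                     feedback=c.feedback + [first])


draft = Candidate(spec="RUN echo hi")
result = refine_until_valid(draft, toy_validate, toy_revise)
```

Capping the number of rounds keeps a draft that never converges from looping forever; a caller can discard any task for which the function returns `None`.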

Code & Models

Models

UCSB-SURFI/TermiGen-32B (Hugging Face model; 284 downloads, 4 likes)

Taxonomy

Topics: Artificial Intelligence in Games · Multimodal Machine Learning Applications · Topic Modeling