On Data Engineering for Scaling LLM Terminal Capabilities
Renjie Pi, Grace Lam, Mohammad Shoeybi, Pooya Jannaty, Bryan Catanzaro, Wei Ping

TL;DR
This paper systematically studies data engineering practices for the terminal capabilities of large language models, introduces a synthetic task generation pipeline, and trains models that substantially improve terminal task performance. The dataset and model checkpoints are released for further research.
Contribution
It presents a novel synthetic task generation pipeline and a comprehensive analysis of data strategies, leading to improved terminal models and an open-source dataset.
Findings
Models trained on Terminal-Corpus improve substantially on Terminal-Bench 2.0; for example, Nemotron-Terminal-8B rises from 2.5% to 13.0%.
Scaling model size improves terminal task accuracy.
Open-sourcing datasets and checkpoints facilitates future research.
Abstract
Despite rapid recent progress in the terminal capabilities of large language models, the training data strategies behind state-of-the-art terminal agents remain largely undisclosed. We address this gap through a systematic study of data engineering practices for terminal agents, making two key contributions: (1) Terminal-Task-Gen, a lightweight synthetic task generation pipeline that supports seed-based and skill-based task construction, and (2) a comprehensive analysis of data and training strategies, including filtering, curriculum learning, long context training, and scaling behavior. Our pipeline yields Terminal-Corpus, a large-scale open-source dataset for terminal tasks. Using this dataset, we train Nemotron-Terminal, a family of models initialized from Qwen3 (8B, 14B, 32B) that achieve substantial gains on Terminal-Bench 2.0: Nemotron-Terminal-8B improves from 2.5% to 13.0%…
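To put the reported 8B result in perspective, the short sketch below works out the absolute and relative improvement implied by the two scores quoted in the abstract (2.5% and 13.0%). The helper name `improvement` is hypothetical and only illustrates the arithmetic; it is not part of the paper's code.

```python
def improvement(before_pct: float, after_pct: float) -> tuple[float, float]:
    """Return (absolute gain in percentage points, relative factor)."""
    absolute = after_pct - before_pct   # difference in percentage points
    relative = after_pct / before_pct   # how many times higher the new score is
    return absolute, relative


# Scores quoted in the abstract for Nemotron-Terminal-8B on Terminal-Bench 2.0.
abs_gain, rel_factor = improvement(2.5, 13.0)
print(f"+{abs_gain:.1f} points, {rel_factor:.1f}x")  # prints "+10.5 points, 5.2x"
```

Reporting both numbers is useful here because the starting score is low: a gain of 10.5 percentage points corresponds to a score roughly five times higher.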
Code & Models
- nvidia/Nemotron-Terminal-Corpus (dataset · 3.1k downloads)
- AmanPriyanshu/tool-reasoning-sft-CODING-Nemotron-Terminal-Corpus-data-cleaned-rectified (dataset · 59 downloads)
- CathleenTico/Nemotron-Terminal-Corpus (dataset · 67 downloads)
- txchmechanicus/Nemotron-Terminal-Corpus (dataset · 59 downloads)
- CathleenTico/Nemotron-Terminal-Corpus2 (dataset · 53 downloads)
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques
