TokenSmith: Streamlining Data Editing, Search, and Inspection for Large-Scale Language Model Training and Interpretability
Mohammad Aflah Khan, Ameya Godbole, Johnny Tian-Zheng Wei, Ryan Wang, James Flemings, Krishna P. Gummadi, Willie Neiswanger, Robin Jia

TL;DR
TokenSmith is an open-source tool that simplifies editing, inspecting, and analyzing large-scale training datasets for language models, enhancing workflow efficiency and accessibility.
Contribution
It introduces a modular, user-friendly library for dataset management that integrates seamlessly with existing pretraining frameworks, enabling structured data editing without changing training code.
Findings
Supports a wide range of dataset operations
Enables structured editing without code modifications
Facilitates dataset debugging and validation
Abstract
Understanding the relationship between training data and model behavior during pretraining is crucial, but existing workflows make this process cumbersome, fragmented, and often inaccessible to researchers. We present TokenSmith, an open-source library for interactive editing, inspection, and analysis of datasets used in Megatron-style pretraining frameworks such as GPT-NeoX, Megatron, and NVIDIA NeMo. TokenSmith supports a wide range of operations including searching, viewing, ingesting, exporting, inspecting, and sampling data, all accessible through a simple user interface and a modular backend. It also enables structured editing of pretraining data without requiring changes to training code, simplifying dataset debugging, validation, and experimentation. TokenSmith is designed as a plug-and-play addition to existing large language model pretraining workflows, thereby democratizing…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
Taxonomy
TopicsMachine Learning in Materials Science · Model-Driven Software Engineering Techniques · Explainable Artificial Intelligence (XAI)
