TrainVerify: Equivalence-Based Verification for Distributed LLM Training
Yunchi Lu, Youshan Miao, Cheng Tan, Peng Huang, Yi Zhu, Xian Zhang, Fan Yang

TL;DR
TrainVerify is a system that formally verifies the correctness of distributed large language model training plans, ensuring they are mathematically equivalent to their specifications and reducing verification complexity for models with billions of parameters.
Contribution
It introduces shape-reduction and stage-wise verification techniques enabling scalable formal verification of large-scale distributed LLM training plans.
Findings
Successfully verified training plans for Llama3 (405B) and DeepSeek-V3 (671B) models.
Significantly reduces verification complexity for billion-parameter models.
Provides a formal correctness guarantee for distributed LLM training.
Abstract
Training large language models (LLMs) at scale requires parallel execution across thousands of devices, incurring enormous computational costs. Yet, these costly distributed trainings are rarely verified, leaving them prone to silent errors and potentially wasting millions of GPU hours. We introduce TrainVerify, a system for verifiable distributed training of LLMs. Given a deep learning model's logical specification as the ground truth, TrainVerify formally verifies that a distributed parallel execution plan is mathematically equivalent to it. Direct verification is notoriously difficult due to the sheer scale of LLMs which often involves billions of variables and highly intricate computation graphs. Therefore, TrainVerify introduces shape-reduction techniques and a stage-wise parallel verification algorithm that significantly reduces complexity while preserving formal correctness.…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsNatural Language Processing Techniques · Semantic Web and Ontologies · Multi-Agent Systems and Negotiation
