Redox: Improving I/O Efficiency of Model Training Through File Redirection
Yuhao Li, Xuanhua Shi, Yunfei Zhao, Yongluan Zhou, Yusheng Hua, Xuehai Qian

TL;DR
Redox is a system that improves I/O efficiency in model training through file redirection, which enables batched reads and prefetching and significantly shortens training time.
Contribution
Redox introduces a novel file redirection technique and a batch read protocol to improve I/O efficiency in distributed model training.
Findings
Achieves up to 4.57x faster training compared to PyTorch.
Redox's file redirection has minimal impact on training randomness.
Efficient local and distributed read protocols reduce wasted data reads.
Abstract
This paper proposes Redox, a training data management system designed to achieve high I/O efficiency. The key insight is a new observation about file redirection: for model training, when training data in one file is requested, the system has the flexibility to return the data of another file. Based on this property, Redox starts with a bold design principle: chunks of data files are always read from disk in batch, and once loaded, all files in the chunk will be consumed without being loaded again. We propose efficient local and distributed file read protocols based on this principle that both minimize wasted data reads and enable opportunistic prefetching from remote nodes. Moreover, we analyze file redirection's impact on randomness, and show that it has little effect on training efficiency. Experimental results indicate that Redox significantly accelerates data fetching in…
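The redirection idea in the abstract can be illustrated with a small toy loader. This is only a sketch under stated assumptions, not Redox's actual protocol: the class name `RedirectingLoader`, the helper `run_epoch`, the in-memory `buffer`, and the policy of serving a random buffered file on a miss are all hypothetical choices made for illustration. The sketch shows the two properties the abstract describes: files come off disk one whole chunk at a time, and a loaded chunk is fully consumed before another chunk is read, even if that means answering a request with a different file.

```python
import random


class RedirectingLoader:
    """Toy model of file redirection over chunked storage (hypothetical, not Redox's code)."""

    def __init__(self, chunks, seed=0):
        # chunks: list of non-empty lists of file names; each inner list is one on-disk chunk.
        self.chunks = [list(c) for c in chunks]
        self.file_to_chunk = {
            name: i for i, chunk in enumerate(self.chunks) for name in chunk
        }
        self.unloaded = set(range(len(self.chunks)))  # chunks not yet read this epoch
        self.buffer = []       # files loaded into memory but not yet consumed
        self.chunk_reads = 0   # number of batched disk reads performed
        self.rng = random.Random(seed)

    def _load_chunk(self, idx):
        # One batched read brings every file of the chunk into memory.
        self.unloaded.remove(idx)
        self.buffer.extend(self.chunks[idx])
        self.chunk_reads += 1

    def read(self, requested):
        """Return one not-yet-consumed file, possibly different from `requested`."""
        if not self.buffer:
            # Nothing in memory: prefer the requested file's chunk, otherwise
            # fall back to any chunk that has not been loaded yet.
            idx = self.file_to_chunk[requested]
            if idx not in self.unloaded:
                if not self.unloaded:
                    raise RuntimeError("every file was already consumed this epoch")
                idx = min(self.unloaded)
            self._load_chunk(idx)
        if requested in self.buffer:
            # Hit: the requested file is in memory and still unused.
            self.buffer.remove(requested)
            return requested
        # Miss: redirect the request to a random file that is already in memory.
        return self.buffer.pop(self.rng.randrange(len(self.buffer)))


def run_epoch(loader, requests):
    """Serve one epoch of requests and return the files actually delivered."""
    return [loader.read(name) for name in requests]
```

If `requests` is a permutation of all file names, every file is delivered exactly once per epoch and `chunk_reads` equals the number of chunks, so no loaded data is wasted. The delivered order differs from the requested order, which is the change in randomness the paper analyzes.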
Taxonomy
Topics: Advanced Data Storage Technologies · Cloud Computing and Resource Management · Cloud Data Security Solutions
