Preparation Meets Opportunity: Enhancing Data Preprocessing for ML Training With Seneca

Omkar Desai (Syracuse University); Ziyang Jiao (Syracuse University); Shuyi Pei (Samsung Semiconductor Inc.); Janki Bhimani (Florida International University); Bryan S. Kim (Syracuse University)

arXiv:2511.13724 · cs.OS · November 19, 2025

TL;DR

Seneca is a data loading system that optimizes cache partitioning and data sampling to significantly reduce training time and improve throughput for multimedia ML models.

Contribution

Seneca introduces a cache partitioning and opportunistic sampling approach that improves data pipeline efficiency when ML training jobs run concurrently.

Findings

01 Reduces training makespan by 45.23% compared to PyTorch.
02 Increases data processing throughput by up to 3.45x.
03 Outperforms state-of-the-art caching systems in experiments.

Abstract

Input data preprocessing is a common bottleneck when concurrently training multimedia machine learning (ML) models in modern systems. To alleviate these bottlenecks and reduce the training time for concurrent jobs, we present Seneca, a data loading system that optimizes cache partitioning and data sampling for the data storage and ingestion (DSI) pipeline. The design of Seneca contains two key techniques. First, Seneca uses a performance model for the data pipeline to optimally partition the cache for three different forms of data (encoded, decoded, and augmented). Second, Seneca opportunistically serves cached data over uncached ones during random batch sampling so that concurrent jobs benefit from each other. We implement Seneca by modifying PyTorch and demonstrate its effectiveness by comparing it against several state-of-the-art caching systems for DNN training. Seneca reduces the…
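
The second technique in the abstract, preferring cached samples during random batch sampling, can be illustrated with a small sketch. This is not Seneca's implementation: the function name `opportunistic_batch`, the `unseen` and `cached` sets, and the rule of taking every available cache hit before any miss are all assumptions made for illustration. The sketch only shows how a sampler could favor cached items while still visiting each sample exactly once per epoch.

```python
import random


def opportunistic_batch(unseen, cached, batch_size, rng):
    """Pick up to batch_size ids from `unseen`, preferring ids in `cached`.

    Hypothetical helper, not part of Seneca or PyTorch.
    `unseen` is the set of ids not yet served this epoch; it is updated in place.
    """
    pool = sorted(unseen)  # fixed order so a seeded rng gives repeatable results
    hits = [i for i in pool if i in cached]
    misses = [i for i in pool if i not in cached]
    # Randomize within each group, then serve cache hits first (assumed policy).
    rng.shuffle(hits)
    rng.shuffle(misses)
    batch = (hits + misses)[:batch_size]
    unseen.difference_update(batch)  # each id is served once per epoch
    return batch


# Usage: 10 samples, 3 of them already cached (for example by another job).
rng = random.Random(0)
unseen = set(range(10))
cached = {2, 5, 7}
while unseen:
    print(opportunistic_batch(unseen, cached, 4, rng))
```

Under these assumptions the first batch contains all three cached ids plus one uncached id, and later batches fall back to uncached samples. A real system would also have to handle the cache contents changing between batches, which this sketch ignores.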

Taxonomy

Topics: Caching and Content Delivery · Advanced Neural Network Applications · Cloud Computing and Resource Management