A Two-Stage Data Selection Framework for Data-Efficient Model Training on Edge Devices

Chen Gong; Rui Xing; Zhenzhe Zheng; Fan Wu

arXiv:2505.16563·cs.LG·June 11, 2025

TL;DR

This paper introduces Titan, a two-stage data selection framework that enhances data utilization and training efficiency on edge devices, leading to faster training and improved accuracy.

Contribution

The paper presents a two-stage data selection method for on-device model training: a coarse-grained filter followed by a fine-grained stage that uses a theoretically optimal selection strategy, improving both training efficiency and accuracy.

Findings

01

Up to 43% reduction in training time

02

6.2% increase in final accuracy

03

Only minor system overhead

Abstract

The demand for machine learning (ML) model training on edge devices is escalating due to data privacy and personalized service needs. However, we observe that current on-device model training is hampered by the under-utilization of on-device data, due to low training throughput, limited storage and diverse data importance. To improve data resource utilization, we propose a two-stage data selection framework, Titan, to select the most important data batch from streaming data for model training with guaranteed efficiency and effectiveness. Specifically, in the first stage, Titan selects a candidate dataset with potentially high importance in a coarse-grained manner. In the second stage of fine-grained selection, we propose a theoretically optimal data selection strategy to identify the data batch with the highest model performance improvement for the current training round. To…
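
The two-stage flow described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The function names (`coarse_filter`, `fine_select`, `two_stage_select`), the two scoring callables, and the parameters `buffer_size` and `batch_size` are all hypothetical, and the abstract does not specify the importance metrics Titan uses. In particular, the simple top-k ranking in the second stage is a stand-in for the paper's theoretically optimal strategy, which is not reproduced here.

```python
import heapq
from typing import Callable, Iterable, List, Tuple, TypeVar

T = TypeVar("T")


def coarse_filter(
    stream: Iterable[T],
    coarse_score: Callable[[T], float],  # hypothetical cheap importance proxy
    buffer_size: int,
) -> List[T]:
    """Stage 1 (assumed form): keep the buffer_size highest-scoring samples
    from a stream while holding at most buffer_size samples in memory."""
    heap: List[Tuple[float, int, T]] = []  # min-heap keyed by (score, arrival index)
    for idx, sample in enumerate(stream):
        item = (coarse_score(sample), idx, sample)
        if len(heap) < buffer_size:
            heapq.heappush(heap, item)
        elif item[0] > heap[0][0]:
            # New sample beats the weakest buffered candidate: swap it in.
            heapq.heapreplace(heap, item)
    # Return candidates ordered from highest to lowest coarse score.
    return [sample for _, _, sample in sorted(heap, key=lambda t: (t[0], t[1]), reverse=True)]


def fine_select(
    candidates: List[T],
    fine_score: Callable[[T], float],  # hypothetical costlier importance estimate
    batch_size: int,
) -> List[T]:
    """Stage 2 (placeholder): pick the batch_size candidates with the highest
    fine-grained score. The paper's actual selection rule is not shown."""
    return heapq.nlargest(batch_size, candidates, key=fine_score)


def two_stage_select(
    stream: Iterable[T],
    coarse_score: Callable[[T], float],
    fine_score: Callable[[T], float],
    buffer_size: int,
    batch_size: int,
) -> List[T]:
    """Run the coarse filter over the stream, then the fine-grained selection."""
    candidates = coarse_filter(stream, coarse_score, buffer_size)
    return fine_select(candidates, fine_score, batch_size)


# Toy usage: samples are plain floats and both scores are made-up stand-ins.
toy_stream = [0.1, -5.0, 3.0, -0.2, 4.0, 2.0]
toy_candidates = coarse_filter(toy_stream, abs, buffer_size=3)
toy_batch = two_stage_select(toy_stream, abs, lambda x: x, buffer_size=3, batch_size=2)
```

The design point the sketch tries to capture is the cost split: the first stage touches every streamed sample, so its score must be cheap and its memory bounded, while the second stage runs a more expensive criterion over only the small candidate set.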
