Atom: Efficient On-Device Video-Language Pipelines Through Modular Reuse

Kunjal Panchal; Saayan Mitra; Somdeb Sarkhel; Haoliang Wang; Ishita Dasgupta; Gang Wu; Hui Guan

arXiv:2512.17108·cs.LG·December 22, 2025

Open Access

TL;DR

Atom is a modular on-device video-language system that reuses model components across subtasks, cutting end-to-end latency by 27-33% on commodity smartphones with only a marginal drop in retrieval and captioning quality.

Contribution

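The paper introduces Atom, a system that decomposes a billion-parameter video-language model into reusable modules, such as the visual encoder and language decoder, so that subtasks can share them and run in parallel on mobile devices. This reduces latency with only a marginal loss in task performance.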

Findings

01. Achieves 27-33% faster execution on commodity smartphones than non-reuse baselines

02. Keeps retrieval quality within 2.3 Recall@1 points of the baseline

03. Keeps captioning quality within 1.5 CIDEr points of the baseline

Abstract

Recent advances in video-language models have enabled powerful applications like video retrieval, captioning, and assembly. However, executing such multi-stage pipelines efficiently on mobile devices remains challenging due to redundant model loads and fragmented execution. We introduce Atom, an on-device system that restructures video-language pipelines for fast and efficient execution. Atom decomposes a billion-parameter model into reusable modules, such as the visual encoder and language decoder, and reuses them across subtasks like captioning, reasoning, and indexing. This reuse-centric design eliminates repeated model loading and enables parallel execution, reducing end-to-end latency without sacrificing performance. On commodity smartphones, Atom achieves 27-33% faster execution compared to non-reuse baselines, with only marginal performance drop (≤ 2.3 Recall@1 in…
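The reuse idea in the abstract can be illustrated with a small sketch. This is not Atom's implementation: the `ModuleRegistry` class, the toy loader functions, and the subtask functions below are all hypothetical stand-ins, and the "modules" are plain Python callables rather than real model weights. The sketch only shows the load-once, share-everywhere pattern; it does not model parallel execution.

```python
from typing import Callable, Dict, List


class ModuleRegistry:
    """Hypothetical registry: loads each named module at most once, then shares it."""

    def __init__(self, loaders: Dict[str, Callable[[], Callable[..., object]]]):
        self._loaders = loaders
        self._cache: Dict[str, Callable[..., object]] = {}
        # Counts how many times each loader actually ran.
        self.load_counts: Dict[str, int] = {name: 0 for name in loaders}

    def get(self, name: str) -> Callable[..., object]:
        if name not in self._cache:
            self._cache[name] = self._loaders[name]()
            self.load_counts[name] += 1
        return self._cache[name]


def load_visual_encoder() -> Callable[[List[str]], List[int]]:
    # Stand-in for an expensive weight load; the toy "encoder" maps each
    # frame (a string here) to its length.
    return lambda frames: [len(f) for f in frames]


def load_language_decoder() -> Callable[[List[int], str], str]:
    # Stand-in for the language decoder; combines features and a prompt.
    return lambda features, prompt: f"{prompt}:{sum(features)}"


def caption(reg: ModuleRegistry, frames: List[str]) -> str:
    feats = reg.get("visual_encoder")(frames)
    return reg.get("language_decoder")(feats, "caption")


def index(reg: ModuleRegistry, frames: List[str]) -> List[int]:
    # Indexing needs only the visual encoder, which is already loaded.
    return reg.get("visual_encoder")(frames)


def reason(reg: ModuleRegistry, frames: List[str], question: str) -> str:
    feats = reg.get("visual_encoder")(frames)
    return reg.get("language_decoder")(feats, question)


registry = ModuleRegistry({
    "visual_encoder": load_visual_encoder,
    "language_decoder": load_language_decoder,
})
frames = ["abc", "de"]
caption_out = caption(registry, frames)
index_out = index(registry, frames)
reason_out = reason(registry, frames, "why")
print(registry.load_counts)
```

Three subtasks run, but each loader is invoked only once. A non-reuse baseline in this toy setting would call the loaders inside every subtask and pay the load cost each time.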

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Video Analysis and Summarization