Atom: Efficient On-Device Video-Language Pipelines Through Modular Reuse
Kunjal Panchal, Saayan Mitra, Somdeb Sarkhel, Haoliang Wang, Ishita Dasgupta, Gang Wu, Hui Guan

TL;DR
Atom is a modular on-device video-language system that reuses model components to significantly reduce latency and maintain high performance in mobile applications.
Contribution
The paper introduces Atom, a system that decomposes a billion-parameter video-language model into reusable modules for efficient, parallel execution on mobile devices, reducing end-to-end latency with only a marginal performance drop.
Findings
Achieves 27-33% faster execution on commodity smartphones than non-reuse baselines
Maintains retrieval Recall@1 within 2.3 points of the baseline
Maintains captioning quality within 1.5 CIDEr points
Abstract
Recent advances in video-language models have enabled powerful applications like video retrieval, captioning, and assembly. However, executing such multi-stage pipelines efficiently on mobile devices remains challenging due to redundant model loads and fragmented execution. We introduce Atom, an on-device system that restructures video-language pipelines for fast and efficient execution. Atom decomposes a billion-parameter model into reusable modules, such as the visual encoder and language decoder, and reuses them across subtasks like captioning, reasoning, and indexing. This reuse-centric design eliminates repeated model loading and enables parallel execution, reducing end-to-end latency without sacrificing performance. On commodity smartphones, Atom achieves 27-33% faster execution compared to non-reuse baselines, with only marginal performance drop (2.3 Recall@1 in…
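The reuse idea described in the abstract can be illustrated with a minimal sketch. This is not Atom's implementation: the `ModuleRegistry` class, the loader functions, and the toy "encoder" and "decoder" below are all hypothetical stand-ins, chosen only to show how several subtasks might share one loaded copy of each module instead of reloading a model per stage.

```python
from typing import Callable, Dict, List


class ModuleRegistry:
    """Hypothetical helper: loads each named module at most once and shares it."""

    def __init__(self, loaders: Dict[str, Callable[[], Callable]]):
        self._loaders = loaders
        self._loaded: Dict[str, Callable] = {}
        # Track how many times each loader actually ran.
        self.load_counts: Dict[str, int] = {name: 0 for name in loaders}

    def get(self, name: str) -> Callable:
        # Load lazily on first use; later calls reuse the cached instance.
        if name not in self._loaded:
            self._loaded[name] = self._loaders[name]()
            self.load_counts[name] += 1
        return self._loaded[name]


# Toy stand-ins for real model components (assumptions, not Atom's modules).
def load_visual_encoder() -> Callable[[List[str]], List[float]]:
    return lambda frames: [float(len(f)) for f in frames]


def load_language_decoder() -> Callable[[List[float], str], str]:
    return lambda features, prompt: f"{prompt}: {len(features)} frame features"


def caption(registry: ModuleRegistry, frames: List[str]) -> str:
    features = registry.get("visual_encoder")(frames)
    return registry.get("language_decoder")(features, "caption")


def index(registry: ModuleRegistry, frames: List[str]) -> List[float]:
    # Indexing only needs the visual encoder, which captioning already loaded.
    return registry.get("visual_encoder")(frames)


def reason(registry: ModuleRegistry, frames: List[str]) -> str:
    features = registry.get("visual_encoder")(frames)
    return registry.get("language_decoder")(features, "reason")


registry = ModuleRegistry({
    "visual_encoder": load_visual_encoder,
    "language_decoder": load_language_decoder,
})
frames = ["frame-a", "frame-bb"]
caption_text = caption(registry, frames)
index_vec = index(registry, frames)
reason_text = reason(registry, frames)
```

In this sketch three subtasks run, yet `registry.load_counts` records a single load per module, which is the kind of redundant loading a non-reuse pipeline would repeat at every stage. The parallel execution mentioned in the abstract is not modeled here.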
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Video Analysis and Summarization
