S2M3: Split-and-Share Multi-Modal Models for Distributed Multi-Task Inference on the Edge
JinYi Yoon, JiHo Lee, Ting He, Nakjung Choi, Bo Ji

TL;DR
S2M3 is a split-and-share architecture for multi-modal, multi-task inference on edge devices. In multi-task settings it cuts memory usage by up to 62% and inference latency by up to 56.9%, while keeping accuracy comparable to cloud AI.
Contribution
The paper proposes a split-and-share multi-modal model architecture, paired with greedy module placement, that makes multi-task inference practical on resource-constrained edge devices.
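The idea of sharing modules across tasks and placing them greedily can be illustrated with a small sketch. This is not the paper's algorithm: the largest-first rule, the `Device` class, the function names, the module names, and all sizes below are hypothetical, chosen only to show why loading a common module once saves memory and how a simple greedy pass might spread modules over devices.

```python
from dataclasses import dataclass, field


@dataclass
class Device:
    """Hypothetical edge device with a fixed memory budget in MB."""
    name: str
    capacity_mb: int
    placed: dict = field(default_factory=dict)  # module name -> size in MB

    @property
    def free_mb(self) -> int:
        return self.capacity_mb - sum(self.placed.values())


def shared_modules(tasks: dict, sizes_mb: dict) -> dict:
    """Collect each module once, even if several tasks use it."""
    unique = {}
    for modules in tasks.values():
        for module in modules:
            unique[module] = sizes_mb[module]
    return unique


def greedy_place(modules: dict, devices: list) -> dict:
    """Assumed heuristic: largest module first, onto the device with most free memory."""
    placement = {}
    for name, size in sorted(modules.items(), key=lambda kv: (-kv[1], kv[0])):
        target = max(devices, key=lambda d: d.free_mb)
        if target.free_mb < size:
            raise MemoryError(f"no device can host {name} ({size} MB)")
        target.placed[name] = size
        placement[name] = target.name
    return placement


# Illustrative numbers only (not taken from the paper).
tasks = {
    "vqa": ["image_encoder", "text_encoder", "vqa_head"],
    "retrieval": ["image_encoder", "text_encoder", "retrieval_head"],
}
sizes_mb = {"image_encoder": 300, "text_encoder": 200, "vqa_head": 50, "retrieval_head": 40}

# Memory if every task loads its own copy of every module.
unshared_mb = sum(sizes_mb[m] for mods in tasks.values() for m in mods)
# Memory if modules common to several tasks are loaded once.
modules = shared_modules(tasks, sizes_mb)
shared_mb = sum(modules.values())

devices = [Device("edge-a", 400), Device("edge-b", 400)]
placement = greedy_place(modules, devices)
print(unshared_mb, shared_mb, placement)
```

In this toy setup the two tasks share both encoders, so only the task-specific heads are duplicated across tasks. A real placement policy would also need to account for compute and network latency between devices, which this sketch ignores.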
Findings
Reduces memory usage by up to 62% in multi-task settings.
Achieves up to 56.9% reduction in inference latency.
Maintains accuracy comparable to cloud AI across multiple benchmarks.
Abstract
With the advancement of Artificial Intelligence (AI) towards multiple modalities (language, vision, speech, etc.), multi-modal models have increasingly been used across various applications (e.g., visual question answering or image generation/captioning). Despite the success of AI as a service for multi-modal applications, it relies heavily on clouds, which are constrained by bandwidth, latency, privacy concerns, and unavailability under network or server failures. Although on-device AI is becoming popular, supporting multiple tasks on edge devices imposes significant resource challenges. To address this, we introduce S2M3, a split-and-share multi-modal architecture for multi-task inference on edge devices. Inspired by the general-purpose nature of multi-modal models, which are composed of multiple modules (encoder, decoder, classifier, etc.), we propose to split multi-modal models at…
