TL;DR
MMEdge is an on-device multimodal inference framework that pipelines sensing and encoding to reduce latency while maintaining accuracy on resource-constrained edge devices.
Contribution
It introduces a pipelined sensing and encoding approach with temporal aggregation, adaptive configuration, and speculative skipping for efficient multimodal inference.
Findings
Significantly reduces end-to-end latency on a UAV testbed.
Maintains high task accuracy across system and data variations.
Demonstrates effectiveness on public multimodal datasets.
Abstract
Real-time multimodal inference on resource-constrained edge devices is essential for applications such as autonomous driving, human-computer interaction, and mobile health. However, prior work often overlooks the tight coupling between sensing dynamics and model execution, as well as the complex inter-modality dependencies. In this paper, we propose MMEdge, a new on-device multimodal inference framework based on pipelined sensing and encoding. Instead of waiting for complete sensor inputs, MMEdge decomposes the entire inference process into a sequence of fine-grained sensing and encoding units, allowing computation to proceed incrementally as data arrive. MMEdge also introduces a lightweight but effective temporal aggregation module that captures rich temporal dynamics across different pipelined units to maintain accuracy. Such a pipelined design also opens up opportunities…
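The idea of encoding each sensing unit as it arrives, rather than waiting for the complete input, can be illustrated with a toy latency model. This is only a sketch under stated assumptions and is not the authors' implementation: the function names, the fixed per-unit sensing and encoding times, and the mean-based aggregation are all hypothetical stand-ins (in particular, MMEdge's temporal aggregation module is a learned component, which the element-wise mean below merely gestures at).

```python
def sequential_latency(n_units, sense_t, encode_t):
    """Baseline: sense all units first, then encode them one after another."""
    return n_units * sense_t + n_units * encode_t


def pipelined_latency(n_units, sense_t, encode_t):
    """Pipelined: start encoding unit i once it is sensed and the encoder is free."""
    encoder_free_at = 0
    for i in range(n_units):
        arrival = (i + 1) * sense_t            # time at which unit i is fully sensed
        start = max(arrival, encoder_free_at)  # wait for the data and for the encoder
        encoder_free_at = start + encode_t
    return encoder_free_at


def temporal_aggregate(unit_features):
    """Hypothetical stand-in for temporal aggregation: element-wise mean
    over the per-unit feature vectors (assumed to have equal length)."""
    n = len(unit_features)
    return [sum(column) / n for column in zip(*unit_features)]


if __name__ == "__main__":
    # Assumed timings: 4 units, 10 ms to sense and 5 ms to encode each unit.
    print(sequential_latency(4, 10, 5))
    print(pipelined_latency(4, 10, 5))
    print(temporal_aggregate([[1.0, 2.0], [3.0, 4.0]]))
```

In this model, when encoding a unit is faster than sensing it, only the final unit's encoding time remains after sensing ends; when encoding is slower, the encoder becomes the bottleneck and the saving shrinks to the overlap with the sensing period.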
