DAT: Dual-Aware Adaptive Transmission for Efficient Multimodal LLM Inference in Edge-Cloud Systems
Qi Guo, Zheming Yang, Yunqing Hu, Chang Zhao, and Wen Ji

TL;DR
DAT introduces a collaborative cascade and an adaptive transmission strategy for efficient multimodal LLM inference in edge-cloud systems, significantly reducing latency and bandwidth usage while maintaining high accuracy.
Contribution
The paper proposes a novel dual-aware adaptive transmission framework with a small-large model cascade and bandwidth-aware optimization for multimodal LLMs in constrained environments.
Findings
Achieves 98.83% recognition accuracy and 100% output consistency.
Reduces weighted semantic alert delay by up to 77.5% under congestion.
Delivers 98.33% of visual evidence within 0.5 seconds.
Abstract
Multimodal large language models (MLLMs) have shown strong capability in semantic understanding and visual reasoning, yet their use on continuous video streams in bandwidth-constrained edge-cloud systems incurs prohibitive computation and communication overhead and hinders low-latency alerting and effective visual evidence delivery. To address this challenge, we propose DAT to achieve high-quality semantic generation, low-latency event alerting, and effective visual evidence supplementation. To reduce unnecessary deep reasoning costs, we propose a collaborative small-large model cascade. A lightweight edge-side small model acts as a gating module to filter non-target-event frames and perform object detection, triggering MLLM inference only for suspicious frames. Building on this, we introduce an efficient fine-tuning strategy with visual guidance and semantic prompting, which improves…
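The small-large model cascade in the abstract can be sketched as a per-frame gate. This is a minimal illustration under stated assumptions, not the authors' implementation. The `small_model` and `mllm` callables, the `threshold` value, the `Detection` and `EdgeResult` types, and the prompt format in `build_prompt` are all hypothetical stand-ins. The paper's actual edge detector, fine-tuned MLLM, and prompting scheme are not given in this summary. The sketch shows only the control flow: every frame goes through the cheap edge model, and only frames it flags as suspicious reach the expensive MLLM, with the edge detections passed along as a text hint.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Detection:
    label: str
    confidence: float
    box: tuple  # (x1, y1, x2, y2) in pixels


@dataclass
class EdgeResult:
    event_score: float  # small model's estimate that the frame contains a target event
    detections: List[Detection] = field(default_factory=list)


def build_prompt(detections: List[Detection]) -> str:
    """Turn edge-side detections into a text hint for the MLLM (hypothetical format)."""
    hints = ", ".join(f"{d.label} at {d.box}" for d in detections)
    return f"Describe the event in this frame. Detected objects: {hints or 'none'}."


def cascade(frames, small_model: Callable[[object], EdgeResult],
            mllm: Callable[[object, str], str], threshold: float = 0.5):
    """Run the small model on every frame; call the MLLM only on suspicious frames.

    Returns a list of (frame_index, description) for frames that passed the gate.
    """
    outputs = []
    for i, frame in enumerate(frames):
        result = small_model(frame)
        if result.event_score < threshold:
            continue  # non-target frame: filtered at the edge, no MLLM call
        prompt = build_prompt(result.detections)
        outputs.append((i, mllm(frame, prompt)))
    return outputs


# Toy usage: each "frame" is just a float suspicion level (hypothetical stubs).
def fake_small_model(frame):
    dets = [Detection("person", 0.9, (10, 20, 50, 120))] if frame > 0.5 else []
    return EdgeResult(event_score=frame, detections=dets)


calls = []


def fake_mllm(frame, prompt):
    calls.append(prompt)  # record each (costly) MLLM invocation
    return "alert: " + prompt


out = cascade([0.1, 0.7, 0.3, 0.9], fake_small_model, fake_mllm, threshold=0.5)
print([i for i, _ in out])  # [1, 3]
print(len(calls))           # 2
```

In this toy run only two of the four frames trigger the MLLM. The compute saving in a real deployment depends on how often target events occur and on how well the edge model separates them, which this sketch does not model.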
