Collaborative Edge-to-Server Inference for Vision-Language Models
Soochang Song, Yongjune Kim

TL;DR
This paper introduces a collaborative edge-to-server inference framework for vision-language models that reduces communication cost by selectively retransmitting high-detail regions, preserving accuracy while cutting data transfer.
Contribution
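The paper presents a two-stage inference approach: the server uses the VLM's internal attention to identify a region of interest and the min-entropy of the output tokens as a confidence measure, then requests retransmission only when confidence is too low, improving efficiency in vision-language model deployment.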
Findings
Significantly reduces communication cost in VLM inference.
Maintains high inference accuracy with selective retransmission.
Effective across multiple VLM architectures.
Abstract
We propose a collaborative edge-to-server inference framework for vision-language models (VLMs) that reduces the communication cost while maintaining inference accuracy. In typical deployments, visual data captured at edge devices (clients) is transmitted to the server for VLM inference. However, resizing the original image (global image) to match the vision encoder's input resolution often discards fine-grained details, leading to accuracy degradation. To overcome this limitation, we design a two-stage framework. In the first stage, the server performs inference on the global image and identifies a region of interest (RoI) using the VLM's internal attention. The min-entropy of the output tokens is then computed as a confidence measure to determine whether retransmission is required. If the min-entropy exceeds a predefined threshold, the server requests the edge device to send a…
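The confidence check described in the abstract can be sketched as follows. This is a minimal Python illustration, not the authors' implementation. The function names `min_entropy` and `needs_retransmission`, the base-2 logarithm, and the averaging of per-token values are all assumptions, since the abstract does not state how the per-token min-entropies are aggregated or how the threshold is chosen.

```python
import math
from typing import Sequence


def min_entropy(probs: Sequence[float]) -> float:
    """Min-entropy in bits of one token distribution: -log2(max_i p_i)."""
    p_max = max(probs)
    if p_max <= 0.0:
        raise ValueError("distribution must contain a positive probability")
    return -math.log2(p_max)


def needs_retransmission(
    token_probs: Sequence[Sequence[float]], threshold: float
) -> bool:
    """Return True when the aggregated min-entropy exceeds the threshold.

    Assumption: per-token min-entropies are averaged over the output
    tokens. The source only says the min-entropy of the output tokens
    is compared with a predefined threshold.
    """
    if not token_probs:
        raise ValueError("need at least one output token distribution")
    values = [min_entropy(p) for p in token_probs]
    return sum(values) / len(values) > threshold


# Hypothetical per-token distributions for two generated tokens.
confident = [[0.90, 0.10], [0.95, 0.05]]
uncertain = [[0.50, 0.50], [0.25, 0.25, 0.25, 0.25]]

print(needs_retransmission(confident, threshold=0.5))
print(needs_retransmission(uncertain, threshold=0.5))
```

A peaked distribution gives a min-entropy near zero, so the first-stage answer is kept. A flat distribution gives a high value, which in the described framework triggers a request to the edge device.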
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning
