Venus: An Efficient Edge Memory-and-Retrieval System for VLM-based Online Video Understanding
Shengyuan Ye, Bei Ouyang, Tianyi Qian, Liekang Zeng, Mu Yuan, Xiaowen Chu, Weijie Hong, Xu Chen

TL;DR
Venus is an edge-cloud system that enables efficient, real-time video understanding through hierarchical memory management and adaptive keyframe retrieval, significantly reducing response latency while maintaining high reasoning accuracy.
Contribution
Venus introduces a novel edge-cloud architecture with hierarchical memory and adaptive retrieval for efficient online video understanding on resource-constrained devices.
Findings
Achieves 15x-131x speedup in response latency
Maintains comparable or superior reasoning accuracy
Enables real-time video understanding on edge devices
Abstract
Vision-language models (VLMs) have demonstrated impressive multimodal comprehension capabilities and are being deployed in an increasing number of online video understanding applications. While recent efforts extensively explore advancing VLMs' reasoning power in these cases, deployment constraints are overlooked, leading to overwhelming system overhead in real-world deployments. To address this, we propose Venus, an on-device memory-and-retrieval system for efficient online video understanding. Venus adopts an edge-cloud disaggregated architecture that sinks memory construction and keyframe retrieval from cloud to edge, operating in two stages. In the ingestion stage, Venus continuously processes streaming edge videos via scene segmentation and clustering, where the selected keyframes are embedded with a multimodal embedding model to build a hierarchical memory for efficient storage…
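The ingestion stage described in the abstract (scene segmentation, clustering, keyframe embedding, hierarchical memory) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the paper's method: frames are represented as small feature vectors, scenes are cut where consecutive frames differ by more than a hypothetical `cut_threshold`, keyframes are chosen by a simple greedy clustering with a hypothetical `merge_radius`, and `embed` is a placeholder standing in for a multimodal embedding model. The two-level layout (scene, then keyframes) is just one plausible reading of "hierarchical memory".

```python
from math import sqrt


def l2(a, b):
    """Euclidean distance between two feature vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def segment_scenes(frames, cut_threshold):
    """Split frame indices into scenes at large consecutive-frame changes.

    Assumption: a scene boundary is any jump above `cut_threshold`.
    """
    scenes, current = [], []
    for idx, feat in enumerate(frames):
        if current and l2(frames[idx - 1], feat) > cut_threshold:
            scenes.append(current)
            current = []
        current.append(idx)
    if current:
        scenes.append(current)
    return scenes


def pick_keyframes(frames, scene, merge_radius):
    """Greedy clustering: keep a frame only if it is farther than
    `merge_radius` from every keyframe already kept in this scene."""
    keys = []
    for idx in scene:
        if all(l2(frames[idx], frames[k]) > merge_radius for k in keys):
            keys.append(idx)
    return keys


def build_memory(frames, cut_threshold, merge_radius, embed=lambda f: list(f)):
    """Build a two-level memory: one entry per scene, each holding the
    embeddings of its keyframes. `embed` is a hypothetical placeholder."""
    memory = []
    for scene in segment_scenes(frames, cut_threshold):
        keys = pick_keyframes(frames, scene, merge_radius)
        memory.append({
            "span": (scene[0], scene[-1]),
            "keyframes": [
                {"index": k, "embedding": embed(frames[k])} for k in keys
            ],
        })
    return memory


# Toy stream: three visually distinct segments of 1-D "features".
frames = [[0.0], [0.1], [0.2], [5.0], [5.1], [9.0]]
memory = build_memory(frames, cut_threshold=1.0, merge_radius=0.15)
for scene in memory:
    print(scene["span"], [kf["index"] for kf in scene["keyframes"]])
```

Storing only keyframe embeddings per scene keeps the memory small relative to the raw stream, which is the property the abstract attributes to the hierarchical design; the thresholds here are arbitrary and would need tuning on real video features.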
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques
