V-Rex: Real-Time Streaming Video LLM Acceleration via Dynamic KV Cache Retrieval
Donghyuk Kim, Sejeong Yang, Wonjin Shin, and Joo-Young Kim

TL;DR
V-Rex is a novel hardware-software co-designed system that accelerates real-time streaming video LLM inference on edge devices by reducing KV cache memory and computation through a dynamic retrieval algorithm and specialized hardware, achieving significant speed and energy efficiency gains.
Contribution
This work introduces ReSV, a training-free dynamic KV cache retrieval algorithm, and a dedicated hardware accelerator, V-Rex, to enable real-time streaming video LLM inference on edge devices, addressing memory and computation bottlenecks.
Findings
Achieves 3.9-8.3 FPS in real-time streaming video tasks.
Provides 1.9-19.7x speedup over GPU implementations.
Offers 3.1-18.5x energy efficiency improvements.
Abstract
Streaming video large language models (LLMs) are increasingly used for real-time multimodal tasks such as video captioning, question answering, conversational agents, and augmented reality. However, these models face fundamental memory and computational challenges because their key-value (KV) caches grow substantially with continuous streaming video input. Processing this input requires an iterative prefill stage, a feature unique to streaming video LLMs, and this stage brings significant limitations: extensive computation, substantial data transfer, and degraded accuracy. The problem is exacerbated in edge deployment, which is the primary target for these models. In this work, we propose V-Rex, the first software-hardware co-designed accelerator that comprehensively addresses both algorithmic and hardware bottlenecks in…
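To make the cache-growth claim in the abstract concrete, the following is a minimal back-of-the-envelope sketch, not taken from the paper. It uses the standard KV cache size estimate (keys plus values, per layer, per KV head, per token) and assumes that no cached entries are evicted or compressed. The model shape in MODEL, the 196 visual tokens per frame, and the function names are all hypothetical values chosen for illustration.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes needed to hold keys and values for num_tokens cached tokens."""
    # Factor of 2: one key tensor and one value tensor per layer.
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * bytes_per_elem


def cache_growth(num_frames, tokens_per_frame, **model):
    """Cache size in bytes after each frame, assuming nothing is evicted."""
    return [
        kv_cache_bytes(frame * tokens_per_frame, **model)
        for frame in range(1, num_frames + 1)
    ]


# Hypothetical model shape (assumption, not from the paper): 32 layers,
# 8 KV heads of dimension 128, 16-bit elements.
MODEL = dict(num_layers=32, num_kv_heads=8, head_dim=128, bytes_per_elem=2)

# Hypothetical tokenization: 196 visual tokens per video frame.
sizes = cache_growth(num_frames=4, tokens_per_frame=196, **MODEL)
for frame, size in enumerate(sizes, start=1):
    print(f"frame {frame}: {size / 2**20:.1f} MiB")
```

Under these assumed numbers each cached token costs 128 KiB, so every additional frame adds about 24.5 MiB and the cache grows linearly for as long as the stream continues.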
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Video Analysis and Summarization
