CodecSight: Leveraging Video Codec Signals for Efficient Streaming VLM Inference

Yulin Zou; Yan Chen; Wenyan Chen; JooYoung Park; Shivaraman Nitin; Luo Tao; Francisco Romero; Dmitrii Ustiugov

arXiv:2604.06036 · cs.DC · April 10, 2026

TL;DR

CodecSight uses video codec metadata to make vision-language model inference on streaming video efficient in real time, reducing computation and transmission costs (up to 87% less GPU compute) with an F1 drop of 0-8%.

Contribution

CodecSight introduces a codec-guided approach that uses metadata the video codec already produces as a signal for online optimization. This removes the need for offline training and enables scalable streaming analytics.

Findings

01. Up to 3× throughput improvement over baselines.

02. Up to 87% reduction in GPU compute while maintaining accuracy.

03. F1 score drops by only 0-8%.

Abstract

Video streaming analytics is a crucial workload for vision-language model serving, but the high cost of multimodal inference limits scalability. Prior systems reduce inference cost by exploiting temporal and spatial redundancy in video streams, but they target either the vision transformer (ViT) or the LLM with a limited view, leaving end-to-end opportunities untapped. Moreover, existing methods incur significant overhead to identify redundancy, either through offline profiling and training or costly online computation, making them ill-suited for dynamic real-time streams. We present CodecSight, a codec-guided streaming video analytics system, built on a key observation that video codecs already extract the temporal and spatial structure of each stream as a byproduct of compression. CodecSight treats this codec metadata as a low-cost runtime signal to unify optimization across video…
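
To make the key observation concrete, here is a minimal sketch of how codec metadata could serve as a cheap redundancy signal. It is an illustration under stated assumptions, not CodecSight's actual algorithm: the `BlockMeta` fields, the `select_blocks_to_recompute` helper, and both thresholds are hypothetical, and the metadata values are made up rather than parsed from a real bitstream. The idea is that blocks with near-zero motion vectors and small residuals probably look the same as in the previous frame, so work for them could be reused.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BlockMeta:
    """Hypothetical per-block codec metadata for one inter-coded frame."""
    mv_x: float      # motion vector x component, in pixels
    mv_y: float      # motion vector y component, in pixels
    residual: float  # residual energy of the block, arbitrary units


def select_blocks_to_recompute(
    blocks: List[BlockMeta],
    mv_threshold: float = 1.0,         # assumed value, not from the paper
    residual_threshold: float = 10.0,  # assumed value, not from the paper
) -> List[int]:
    """Return indices of blocks whose metadata suggests the content changed."""
    changed = []
    for i, block in enumerate(blocks):
        motion = (block.mv_x ** 2 + block.mv_y ** 2) ** 0.5
        if motion > mv_threshold or block.residual > residual_threshold:
            changed.append(i)
    return changed


# Made-up metadata for a four-block frame.
frame = [
    BlockMeta(0.0, 0.0, 0.5),   # static background
    BlockMeta(3.0, 4.0, 2.0),   # large motion
    BlockMeta(0.2, 0.1, 25.0),  # little motion, but large residual
    BlockMeta(0.0, 0.5, 1.0),   # nearly static
]

to_recompute = select_blocks_to_recompute(frame)
reuse_fraction = 1 - len(to_recompute) / len(frame)
print(to_recompute, reuse_fraction)
```

In this toy frame, half of the blocks are flagged as changed and the other half could reuse earlier results. The decision costs only a few arithmetic operations per block because the motion and residual values come from the decoder for free, which is the property the abstract highlights.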
