V-CAST: Video Curvature-Aware Spatio-Temporal Pruning for Efficient Video Large Language Models
Xinying Lin, Xuyang Liu, Yiyu Wang, Teng Ma, Wenqi Ren

TL;DR
V-CAST is a training-free token pruning method for VideoLLMs that makes long-context video inference more efficient, reducing memory and latency while preserving spatio-temporal information coverage.
Contribution
V-CAST introduces two mechanisms for compressing visual tokens without training: curvature-guided temporal allocation and dual-anchor spatial selection.
Findings
Retains 98.6% of the original model's performance.
Reduces peak memory and latency to approximately 86% of baseline.
Outperforms existing methods by 1.1% on average.
Abstract
Video large language models (VideoLLMs) show strong capability in video understanding, yet long-context inference is still dominated by massive redundant visual tokens in the prefill stage. We revisit token compression for VideoLLMs under a tight budget and identify a key bottleneck, namely insufficient spatio-temporal information coverage. Existing methods often introduce discontinuous coverage through coarse per-frame allocation or scene segmentation, and token merging can further misalign spatio-temporal coordinates under MRoPE-style discrete (t,h,w) bindings. To address these issues, we propose V-CAST (Video Curvature-Aware Spatio-Temporal Pruning), a training-free, plug-and-play pruning policy for long-context video inference. V-CAST casts token compression as a trajectory approximation problem and introduces a curvature-guided temporal allocation module that routes per-frame token…
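The curvature-guided temporal allocation described above can be illustrated with a small sketch. The abstract is truncated before it defines the curvature measure or the allocation rule, so everything here is an assumption made for illustration: curvature is taken as the norm of the second difference of per-frame feature vectors, and the token budget is split in proportion to that score with a largest-remainder rounding step. The names `curvature_scores`, `allocate_budget`, and `min_per_frame` are hypothetical and are not taken from the paper.

```python
import math


def curvature_scores(frame_feats):
    """Assumed curvature proxy: norm of the second difference of the
    frame-feature trajectory. Endpoints copy their nearest interior score."""
    n = len(frame_feats)
    if n < 3:
        # Too few frames to form a second difference; treat all equally.
        return [1.0] * n
    scores = [0.0] * n
    for t in range(1, n - 1):
        prev, cur, nxt = frame_feats[t - 1], frame_feats[t], frame_feats[t + 1]
        scores[t] = math.sqrt(
            sum((a - 2 * b + c) ** 2 for a, b, c in zip(prev, cur, nxt))
        )
    scores[0], scores[-1] = scores[1], scores[-2]
    return scores


def allocate_budget(scores, total_budget, min_per_frame=1):
    """Split total_budget tokens across frames in proportion to scores.
    Every frame keeps at least min_per_frame tokens (hypothetical floor,
    used here so that no frame is dropped entirely)."""
    n = len(scores)
    if total_budget < n * min_per_frame:
        raise ValueError("budget too small for the per-frame minimum")
    spare = total_budget - n * min_per_frame
    total = sum(scores)
    if total == 0:
        # Flat trajectory: fall back to a uniform split.
        scores, total = [1.0] * n, float(n)
    shares = [spare * s / total for s in scores]
    alloc = [min_per_frame + math.floor(x) for x in shares]
    # Largest-remainder rounding so the allocation sums to total_budget.
    leftover = total_budget - sum(alloc)
    order = sorted(
        range(n), key=lambda i: shares[i] - math.floor(shares[i]), reverse=True
    )
    for i in order[:leftover]:
        alloc[i] += 1
    return alloc


# Toy trajectory: straight motion, one sharp turn at frame 2, straight again.
feats = [[0, 0], [1, 0], [2, 0], [2, 1], [2, 2]]
budget = allocate_budget(curvature_scores(feats), total_budget=15)
```

In this toy case the frame at the turn receives most of the budget while the straight segments keep only the per-frame minimum, which is the behaviour a trajectory-approximation view would suggest. The method's actual scoring and routing may differ.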
