V-CAST: Video Curvature-Aware Spatio-Temporal Pruning for Efficient Video Large Language Models

Xinying Lin; Xuyang Liu; Yiyu Wang; Teng Ma; Wenqi Ren

arXiv:2603.27650 · cs.CV · March 31, 2026

TL;DR

V-CAST is a training-free token pruning method for VideoLLMs. It makes long-context video inference more efficient by cutting peak memory and latency while preserving spatio-temporal information coverage.

Contribution

The paper introduces a curvature-guided temporal allocation module and a dual-anchor spatial selection mechanism, which together compress visual tokens effectively without any training.

Findings

01

Retains 98.6% of the original VideoLLM performance.

02

Reduces peak memory and latency to approximately 86% of baseline.

03

Outperforms existing methods by 1.1% on average.

Abstract

Video large language models (VideoLLMs) show strong capability in video understanding, yet long-context inference is still dominated by massive redundant visual tokens in the prefill stage. We revisit token compression for VideoLLMs under a tight budget and identify a key bottleneck, namely insufficient spatio-temporal information coverage. Existing methods often introduce discontinuous coverage through coarse per-frame allocation or scene segmentation, and token merging can further misalign spatio-temporal coordinates under MRoPE-style discrete (t,h,w) bindings. To address these issues, we propose V-CAST (Video Curvature-Aware Spatio-Temporal Pruning), a training-free, plug-and-play pruning policy for long-context video inference. V-CAST casts token compression as a trajectory approximation problem and introduces a curvature-guided temporal allocation module that routes per-frame token…
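The abstract is cut off before it describes the allocation rule, so the following is only a rough illustration of the general idea of curvature-guided budget allocation, not the authors' algorithm. It assumes each frame is summarized by one feature vector, measures curvature as the norm of the second difference along the frame trajectory, and splits a token budget in proportion to that curvature. The function names, the per-frame floor, and the saliency scores are all hypothetical. The last helper shows why pruning, unlike merging, leaves each kept token with its original discrete (t, h, w) coordinate.

```python
import math


def frame_curvature(frame_feats):
    """Discrete curvature per frame: norm of the second difference
    f[t-1] - 2*f[t] + f[t+1]. Endpoints get 0 (an assumption)."""
    n = len(frame_feats)
    curv = [0.0] * n
    for t in range(1, n - 1):
        second_diff = [
            a - 2 * b + c
            for a, b, c in zip(frame_feats[t - 1], frame_feats[t], frame_feats[t + 1])
        ]
        curv[t] = math.sqrt(sum(d * d for d in second_diff))
    return curv


def allocate_budget(curv, total_budget, floor=1):
    """Give every frame `floor` tokens, then split the rest in proportion
    to curvature, using largest-remainder rounding so the sum is exact."""
    n = len(curv)
    if total_budget < floor * n:
        raise ValueError("budget too small for the per-frame floor")
    remaining = total_budget - floor * n
    total = sum(curv)
    # Fall back to a uniform split when the trajectory is perfectly straight.
    weights = [c / total for c in curv] if total > 0 else [1.0 / n] * n
    raw = [w * remaining for w in weights]
    alloc = [floor + int(r) for r in raw]
    leftover = total_budget - sum(alloc)
    by_fraction = sorted(range(n), key=lambda i: raw[i] - int(raw[i]), reverse=True)
    for i in by_fraction[:leftover]:
        alloc[i] += 1
    return alloc


def prune_frame(scores, t, grid_w, k):
    """Keep the k highest-scoring tokens of frame t and return their
    original (t, h, w) coordinates; nothing is merged or re-indexed."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return [(t, i // grid_w, i % grid_w) for i in sorted(top)]


# Toy trajectory: moves straight, turns sharply at frame 2, moves straight again.
feats = [[0, 0], [1, 0], [2, 0], [2, 1], [2, 2]]
budget = allocate_budget(frame_curvature(feats), total_budget=10)
kept = prune_frame([0.1, 0.9, 0.3, 0.8], t=2, grid_w=2, k=2)
```

In this toy case the frame at the turn receives most of the budget while straight segments keep only the floor. A real implementation would also need to cap each frame's allocation at the number of tokens it actually has, which this sketch does not do.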
