VLMaxxing through FrameMogging: Training-Free Anti-Recomputation for Video Vision-Language Models
JF Bastien, Sam D'Amico

TL;DR
This paper introduces a training-free method that lets video vision-language models reuse validated scene state across follow-up questions on the same video, cutting follow-up latency while preserving paired choices and correctness in the reported setting.
Contribution
It frames redundant visual processing as training-free anti-recomputation: reuse state when validation says it survives, and compute fresh evidence when the scene, query, or cache topology requires it.
Findings
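Reduces follow-up latency by 14.90-35.92x on frozen Qwen2.5-VL-7B-Instruct-4bit in a 93-query VideoMME breadth setting; the first query is still cold.
Preserves paired choices and correctness under adaptive same-video follow-up reuse, with repeated-question schedules holding through 50 turns.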
Demonstrates real speedups with minimal drift or parse failures.
Abstract
Video vision-language models (VLMs) keep paying for visual state the stream already told us was stable. The factory wall did not move, but most VLM pipelines still hand the model dense RGB frames or a fresh prefix again. We study that waste as training-free anti-recomputation: reuse state when validation says it survives, and buy fresh evidence when the scene, query, or cache topology requires it. The largest measured win is after ingest. On frozen Qwen2.5-VL-7B-Instruct-4bit, adaptive same-video follow-up reuse preserves paired choices and correctness on a 93-query VideoMME breadth setting while reducing follow-up latency by 14.90-35.92x. The first query is still cold; the win starts when later questions reuse the same video state. Stress tests bound the result: repeated-question schedules hold through 50 turns, while dense-answer-anchored prompt variation separates conservative…
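The reuse-or-recompute decision described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the class `SceneStateCache`, the `encode` callback, and the exact-content fingerprint used as the validation check are all hypothetical, chosen only to show the pattern of a cold first query followed by cheap follow-ups on the same video. A real system would presumably validate against scene, query, and cache topology rather than a byte-level hash.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Callable, Dict, Sequence


@dataclass
class SceneStateCache:
    """Toy cache: reuse encoded video state when the frames are unchanged.

    Everything here is an illustrative assumption, not the paper's method.
    """

    encode: Callable[[Sequence[bytes]], object]  # stand-in for the costly visual prefill
    encode_calls: int = 0                        # counts how often we paid for encoding
    _store: Dict[str, object] = field(default_factory=dict)

    @staticmethod
    def fingerprint(frames: Sequence[bytes]) -> str:
        # Conservative validation: state is reused only on an exact content match.
        h = hashlib.sha256()
        for frame in frames:
            h.update(len(frame).to_bytes(8, "big"))  # length prefix avoids ambiguity
            h.update(frame)
        return h.hexdigest()

    def state_for(self, frames: Sequence[bytes]) -> object:
        key = self.fingerprint(frames)
        if key not in self._store:
            # Validation failed (new or changed video): buy fresh evidence.
            self.encode_calls += 1
            self._store[key] = self.encode(frames)
        # Validation passed: reuse the stored state.
        return self._store[key]


def toy_encode(frames: Sequence[bytes]) -> int:
    # Placeholder for an expensive encoder; returns total byte count.
    return sum(len(f) for f in frames)


cache = SceneStateCache(encode=toy_encode)
video = [b"frame-0", b"frame-1"]

first = cache.state_for(video)                        # cold: encoder runs
second = cache.state_for(video)                       # follow-up: state reused
changed = cache.state_for([b"frame-0", b"frame-2"])   # scene changed: encoder runs again
```

In this sketch the first query on a video always pays the full encoding cost, which mirrors the abstract's point that the win starts only when later questions reuse the same video state.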
