Beyond Training: Dynamic Token Merging for Zero-Shot Video Understanding
Yiming Zhang, Zhuokai Zhao, Zhaorun Chen, Zenghui Ding, Xianjun Yang, Yining Sun

TL;DR
DYTO is a dynamic token merging framework that adaptively compresses video tokens, improving zero-shot video understanding without fine-tuning by balancing efficiency against semantic preservation.
Contribution
The paper presents a training-free approach that combines hierarchical frame selection with bipartite token merging to improve zero-shot video understanding.
Findings
Outperforms existing methods on multiple benchmarks.
Sets new state-of-the-art for zero-shot video understanding.
Efficiently balances computational cost with semantic detail.
Abstract
Recent advancements in multimodal large language models (MLLMs) have opened new avenues for video understanding. However, achieving high fidelity in zero-shot video tasks remains challenging. Traditional video processing methods rely heavily on fine-tuning to capture nuanced spatial-temporal details, which incurs significant data and computation costs. In contrast, training-free approaches, though efficient, often lack robustness in preserving context-rich features across complex video content. To this end, we propose DYTO, a novel dynamic token merging framework for zero-shot video understanding that adaptively optimizes token efficiency while preserving crucial scene details. DYTO integrates a hierarchical frame selection and a bipartite token merging strategy to dynamically cluster key frames and selectively compress token sequences, striking a balance between computational…
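To make the bipartite token merging idea concrete, here is a minimal sketch in plain Python. It is not DYTO's implementation: it follows the general bipartite matching pattern (split tokens into two alternating sets, link each token in one set to its most similar token in the other, merge the strongest links by averaging). The function names `cosine` and `bipartite_merge` and the fixed merge budget `r` are assumptions made for illustration. The abstract describes the compression as adaptive, whereas this sketch uses a constant budget.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def bipartite_merge(tokens, r):
    """Merge up to r tokens into their most similar partner.

    tokens: list of feature vectors (lists of floats).
    r: number of tokens to remove (hypothetical, fixed budget).
    Returns a new list of vectors in original positional order.
    """
    set_a = tokens[0::2]  # even positions
    set_b = tokens[1::2]  # odd positions
    if not set_b or r <= 0:
        return [list(t) for t in tokens]

    # One edge per A token, pointing at its most similar B token.
    edges = []
    for i, a in enumerate(set_a):
        j = max(range(len(set_b)), key=lambda k: cosine(a, set_b[k]))
        edges.append((cosine(a, set_b[j]), i, j))

    # Keep only the r strongest edges; those A tokens get merged away.
    edges.sort(key=lambda e: e[0], reverse=True)
    merged = {i: j for _, i, j in edges[:r]}

    # Accumulate merged A tokens into their B partner, then average.
    sums = [list(b) for b in set_b]
    counts = [1] * len(set_b)
    for i, j in merged.items():
        sums[j] = [s + x for s, x in zip(sums[j], set_a[i])]
        counts[j] += 1

    # Reassemble survivors, restoring the original ordering.
    out = [(2 * i, list(a)) for i, a in enumerate(set_a) if i not in merged]
    out += [(2 * j + 1, [s / counts[j] for s in sums[j]])
            for j in range(len(set_b))]
    out.sort(key=lambda p: p[0])
    return [vec for _, vec in out]
```

Calling `bipartite_merge(tokens, r)` on a list of per-patch feature vectors returns at most `r` fewer tokens. Averaging is one simple merge rule; weighting by how many tokens each vector already represents is a common alternative when merging is applied repeatedly.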
Taxonomy
Topics: Digital Media Forensic Detection · Adversarial Robustness in Machine Learning · Digital and Cyber Forensics
