Beyond Training: Dynamic Token Merging for Zero-Shot Video Understanding

Yiming Zhang; Zhuokai Zhao; Zhaorun Chen; Zenghui Ding; Xianjun Yang; Yining Sun

arXiv:2411.14401·cs.CV·March 25, 2025

Open Access

TL;DR

DYTO introduces a dynamic token merging framework that adaptively compresses video tokens, enhancing zero-shot video understanding performance without fine-tuning, by balancing efficiency and semantic preservation.

Contribution

The paper presents a training-free approach that combines hierarchical frame selection with bipartite token merging to improve zero-shot video understanding.
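This summary does not spell out how the hierarchical frame selection works, so the following is only a rough sketch of one plausible reading: adjacent frames are merged bottom-up into temporal segments by feature similarity, and one representative key frame is kept per segment. The function names (`select_key_frames`, `mean_vec`, `cosine`), the use of cosine similarity, and the fixed `num_segments` parameter are all assumptions made for illustration, not details taken from the paper.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def mean_vec(vectors):
    """Element-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def select_key_frames(frame_feats, num_segments):
    """Hypothetical helper: group frames into contiguous segments and
    pick one key frame per segment.

    frame_feats  -- one feature vector per frame, in temporal order
    num_segments -- target number of segments (assumed fixed here)
    Returns (segments, key_frame_indices).
    """
    # Start with every frame in its own segment.
    segments = [[i] for i in range(len(frame_feats))]

    # Repeatedly merge the most similar pair of neighbouring segments.
    while len(segments) > max(1, num_segments):
        sims = [
            cosine(
                mean_vec([frame_feats[i] for i in segments[k]]),
                mean_vec([frame_feats[i] for i in segments[k + 1]]),
            )
            for k in range(len(segments) - 1)
        ]
        k = sims.index(max(sims))
        segments[k:k + 2] = [segments[k] + segments[k + 1]]

    # Key frame = the frame closest to its segment's mean feature.
    keys = []
    for seg in segments:
        centre = mean_vec([frame_feats[i] for i in seg])
        keys.append(max(seg, key=lambda i: cosine(frame_feats[i], centre)))
    return segments, keys
```

Calling `select_key_frames(feats, 2)` on four frame vectors returns two contiguous index groups and one index from each. A dynamic variant would presumably choose the number of segments from the video content rather than take it as an argument.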

Findings

01. Outperforms existing methods on multiple benchmarks.

02. Sets new state-of-the-art for zero-shot video understanding.

03. Efficiently balances computational cost with semantic detail.

Abstract

Recent advancements in multimodal large language models (MLLMs) have opened new avenues for video understanding. However, achieving high fidelity in zero-shot video tasks remains challenging. Traditional video processing methods rely heavily on fine-tuning to capture nuanced spatial-temporal details, which incurs significant data and computation costs. In contrast, training-free approaches, though efficient, often lack robustness in preserving context-rich features across complex video content. To this end, we propose DYTO, a novel dynamic token merging framework for zero-shot video understanding that adaptively optimizes token efficiency while preserving crucial scene details. DYTO integrates a hierarchical frame selection and a bipartite token merging strategy to dynamically cluster key frames and selectively compress token sequences, striking a balance between computational…
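To make the idea of bipartite token merging concrete, here is a minimal sketch in the style of bipartite soft matching: tokens are split into two alternating sets, each token in the first set is paired with its most similar token in the second, and the `r` most similar pairs are merged by averaging. The function name `bipartite_merge`, the alternating split, cosine similarity, and unweighted mean merging are assumptions for illustration; the abstract does not specify how DYTO chooses the number of merged tokens or how it combines them.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def bipartite_merge(tokens, r):
    """Hypothetical helper: merge up to r tokens from set A into set B.

    tokens -- list of feature vectors (lists of floats), in sequence order
    r      -- number of tokens to remove by merging
    Returns a new, shorter token list that keeps the original ordering
    of the surviving tokens.
    """
    a_idx = list(range(0, len(tokens), 2))  # even positions form set A
    b_idx = list(range(1, len(tokens), 2))  # odd positions form set B
    r = min(r, len(a_idx)) if b_idx else 0
    if r <= 0:
        return [list(t) for t in tokens]

    # For every A token, find its single best match in B.
    best = []
    for i in a_idx:
        score, j = max((cosine(tokens[i], tokens[j]), j) for j in b_idx)
        best.append((score, i, j))

    # Keep only the r highest-similarity edges.
    best.sort(reverse=True)
    groups = {j: [tokens[j]] for j in b_idx}
    merged_a = set()
    for _, i, j in best[:r]:
        groups[j].append(tokens[i])
        merged_a.add(i)

    # Rebuild the sequence: merged A tokens disappear, and each B token
    # becomes the plain mean of itself and everything merged into it.
    out = []
    for idx, tok in enumerate(tokens):
        if idx in merged_a:
            continue
        if idx in groups:
            members = groups[idx]
            out.append([sum(col) / len(members) for col in zip(*members)])
        else:
            out.append(list(tok))
    return out
```

With four tokens and `r=2`, the result has two tokens, each the average of one matched pair. A fuller implementation would typically weight the average by how many original tokens each merged token already represents; that bookkeeping is omitted here for brevity.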

Taxonomy

Topics: Digital Media Forensic Detection · Adversarial Robustness in Machine Learning · Digital and Cyber Forensics