Video Token Merging for Long-form Video Understanding

Seon-Ho Lee; Jue Wang; Zhikang Zhang; David Fan; Xinyu Li

arXiv:2410.23782·cs.CV·November 1, 2024

TL;DR

This paper introduces a learnable video token merging method that enhances long-form video understanding by reducing memory usage and increasing processing speed while maintaining high accuracy.

Contribution

It proposes a novel saliency-aware, learnable token merging algorithm for long-form videos, improving efficiency and performance over existing methods.

Findings

01. Achieves comparable or better accuracy on multiple datasets.

02. Reduces memory costs by 84%.

03. Increases throughput by approximately 6.89 times.

Abstract

As the scale of data and models for video understanding rapidly expands, handling long-form video input in transformer-based models presents a practical challenge. Rather than resorting to input sampling or token dropping, which may result in information loss, token merging shows promising results when combined with transformers. However, applying token merging to long-form video processing is not trivial. We begin with the premise that token merging should not rely solely on the similarity of video tokens; the saliency of tokens should also be considered. To address this, we explore various video token merging strategies for long-form video classification, starting with a simple extension of image token merging, moving to region-concentrated merging, and finally proposing a learnable video token merging (VTM) algorithm that dynamically merges tokens based on…
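
To make the idea of similarity-based token merging concrete, the following is a minimal sketch of the kind of baseline the abstract starts from: tokens are split into two alternating sets, each token in the first set is matched to its most similar token in the second, and the `r` most similar pairs are averaged together. This is not the paper's learnable VTM algorithm and it ignores saliency entirely; the function name `merge_tokens`, the alternating split, and the plain averaging are assumptions chosen for illustration.

```python
import numpy as np

def merge_tokens(tokens, r):
    """Similarity-only token merging (illustrative sketch, not VTM).

    tokens: array of shape (N, D), one row per video token.
    r: number of tokens to remove by merging.
    Returns an array of shape (N - r, D); token order is not preserved.
    """
    tokens = np.asarray(tokens, dtype=np.float64)
    n = tokens.shape[0]
    a_idx = np.arange(0, n, 2)  # set A: even positions (assumed split)
    b_idx = np.arange(1, n, 2)  # set B: odd positions
    r = min(r, len(a_idx))
    if r <= 0 or len(b_idx) == 0:
        return tokens.copy()

    a, b = tokens[a_idx], tokens[b_idx]

    # Cosine similarity between every A token and every B token.
    a_n = a / (np.linalg.norm(a, axis=1, keepdims=True) + 1e-8)
    b_n = b / (np.linalg.norm(b, axis=1, keepdims=True) + 1e-8)
    sim = a_n @ b_n.T  # shape (|A|, |B|)

    best_b = sim.argmax(axis=1)   # closest B token for each A token
    best_sim = sim.max(axis=1)    # similarity of that match

    # Merge the r A tokens whose best match is most similar.
    merge_a = np.argsort(-best_sim)[:r]
    keep_a = np.setdiff1d(np.arange(len(a_idx)), merge_a)

    # Average each B token with the A tokens merged into it.
    sums = b.copy()
    counts = np.ones(len(b_idx))
    np.add.at(sums, best_b[merge_a], a[merge_a])
    np.add.at(counts, best_b[merge_a], 1.0)
    merged_b = sums / counts[:, None]

    return np.concatenate([a[keep_a], merged_b], axis=0)


rng = np.random.default_rng(0)
video_tokens = rng.normal(size=(16, 8))  # 16 hypothetical tokens, dim 8
reduced = merge_tokens(video_tokens, r=4)
print(reduced.shape)
```

Each call removes `r` tokens, so applying it between transformer layers steadily shrinks the sequence. Because self-attention cost grows quadratically with the number of tokens, this kind of reduction is where memory and throughput savings come from. A saliency-aware variant, as the abstract argues for, would additionally decide which tokens are protected from merging rather than relying on similarity alone.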

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Video Analysis and Summarization · Human Pose and Action Recognition