Principles of Visual Tokens for Efficient Video Understanding
Xinyue Hao, Gen Li, Shreyank N Gowda, Robert B Fisher, Jonathan Huang, Anurag Arnab, Laura Sevilla-Lara

TL;DR
This paper investigates the nature of visual tokens in video transformers, revealing that most tokens carry minimal information, and proposes a lightweight model, LITE, that efficiently selects tokens to outperform existing methods in accuracy and efficiency.
Contribution
The paper uncovers five principles of visual tokens and introduces LITE, a model that effectively selects tokens based on these principles, improving efficiency and accuracy in video understanding.
Findings
LITE outperforms state-of-the-art models on Kinetics-400 and Something-Something-V2.
Token value follows a Pareto distribution: most tokens have very low value, and a few carry most of the perceptual information.
LITE generalizes well across datasets and tasks without retraining.
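The Pareto finding above can be illustrated with a small synthetic sketch. This is not the paper's method or data: the per-token "value" scores are drawn from a Pareto distribution with an arbitrarily chosen shape parameter, the token count is a hypothetical example, and `top_share` and `random_share` are illustrative helper names. The sketch only compares how much of the total value is retained when keeping the highest-value tokens versus a uniformly random subset of the same size.

```python
import random


def top_share(values, keep_fraction):
    """Fraction of total value held by the top `keep_fraction` of tokens."""
    k = max(1, int(len(values) * keep_fraction))
    ranked = sorted(values, reverse=True)
    return sum(ranked[:k]) / sum(values)


def random_share(values, keep_fraction, rng):
    """Fraction of total value kept when tokens are retained uniformly at random."""
    k = max(1, int(len(values) * keep_fraction))
    kept = rng.sample(values, k)
    return sum(kept) / sum(values)


rng = random.Random(0)

# Synthetic per-token value scores (assumption: shape parameter 1.2 is arbitrary).
# 1568 tokens is a hypothetical count, e.g. 8 x 14 x 14 spatio-temporal patches.
token_values = [rng.paretovariate(1.2) for _ in range(1568)]

print(f"top 20% of tokens hold    {top_share(token_values, 0.2):.2f} of total value")
print(f"random 20% of tokens hold {random_share(token_values, 0.2, rng):.2f} of total value")
```

In practice the true value of a token is not known in advance, so the top-k selection here acts as an oracle. The sketch shows the concentration of value under a heavy-tailed distribution, not how a selector would estimate it.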
Abstract
Video understanding has made huge strides in recent years, relying largely on the power of transformers. As this architecture is notoriously expensive and video data is highly redundant, research into improving efficiency has become particularly relevant. Some creative solutions include token selection and merging. While most methods succeed in reducing the cost of the model and maintaining accuracy, an interesting pattern arises: most methods do not outperform the baseline of randomly discarding tokens. In this paper, we take a closer look at this phenomenon and observe five principles of the nature of visual tokens. For example, we observe that the value of tokens follows a clear Pareto distribution where most tokens have remarkably low value, and just a few carry most of the perceptual information. We build on these and further insights to propose a lightweight video model, LITE, that…
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization · Image Retrieval and Classification Techniques
