SEAL: Semantic Attention Learning for Long Video Representation
Lan Wang, Yujia Chen, Du Tran, Vishnu Naresh Boddeti, Wen-Sheng Chu

TL;DR
SEAL is a semantic attention learning framework that represents long videos efficiently by decomposing them into scenes, objects, and actions, which reduces redundancy and computational cost and improves performance on a range of long video understanding tasks.
Contribution
The paper proposes a unified long video representation built on semantic entities (scenes, objects, and actions), together with an attention learning module that selects tokens by balancing relevance against diversity.
Findings
Outperforms state-of-the-art methods in video question answering
Achieves superior results in temporal grounding tasks
Demonstrates versatility across multiple benchmarks
Abstract
Long video understanding presents challenges due to the inherent high computational complexity and redundant temporal information. An effective representation for long videos must efficiently process such redundancy while preserving essential contents for downstream tasks. This paper introduces SEmantic Attention Learning (SEAL), a novel unified representation for long videos. To reduce computational complexity, long videos are decomposed into three distinct types of semantic entities: scenes, objects, and actions, allowing models to operate on a compact set of entities rather than a large number of frames or pixels. To further address redundancy, we propose an attention learning module that balances token relevance with diversity, formulated as a subset selection optimization problem. Our representation is versatile and applicable across various long video understanding tasks.…
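The abstract describes the attention learning module only at a high level: a subset selection problem that trades token relevance against diversity. As a rough illustration of that kind of objective, the sketch below uses a greedy heuristic in the style of maximal marginal relevance (MMR). This is a swapped-in, well-known technique and not the paper's optimizer; the function names `cosine` and `select_tokens`, the `trade_off` parameter, and the use of cosine similarity against a query vector are all assumptions made for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_tokens(tokens, query, k, trade_off=0.5):
    """Greedily pick up to k token indices (hypothetical MMR-style selector).

    Each step picks the unselected token i that maximizes
        trade_off * sim(token_i, query)
        - (1 - trade_off) * max over selected j of sim(token_i, token_j)
    so trade_off=1.0 ranks purely by relevance and lower values
    penalize tokens that duplicate something already chosen.
    """
    if not 0.0 <= trade_off <= 1.0:
        raise ValueError("trade_off must lie in [0, 1]")
    relevance = [cosine(t, query) for t in tokens]
    selected = []
    remaining = list(range(len(tokens)))
    while remaining and len(selected) < k:
        best, best_score = None, -math.inf
        for i in remaining:
            # Redundancy is the similarity to the closest already-selected token.
            redundancy = max(
                (cosine(tokens[i], tokens[j]) for j in selected), default=0.0
            )
            score = trade_off * relevance[i] - (1.0 - trade_off) * redundancy
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data (made up): tokens 0 and 1 are near-duplicates, token 2 is distinct.
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
query = [1.0, 0.3]
print(select_tokens(tokens, query, k=2, trade_off=0.5))  # → [1, 2]
print(select_tokens(tokens, query, k=2, trade_off=1.0))  # → [1, 0]
```

With the diversity penalty active, the second pick skips the near-duplicate token 0 in favor of the less relevant but distinct token 2; with `trade_off=1.0` the selection falls back to plain relevance ranking. A greedy loop like this is only one common way to approximate such an objective, chosen here for brevity.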
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning
Methods: Softmax · Attention Is All You Need · Sparse Evolutionary Training
