SEAL: Semantic Attention Learning for Long Video Representation

Lan Wang; Yujia Chen; Du Tran; Vishnu Naresh Boddeti; Wen-Sheng Chu

arXiv:2412.01798·cs.CV·April 18, 2025

Open Access

TL;DR

SEAL represents long videos as a compact set of semantic entities (scenes, objects, and actions) and selects relevant yet diverse tokens from them, reducing redundancy and computational cost across long video understanding tasks.

Contribution

The paper proposes a unified long video representation built on semantic entities, together with an attention learning module that balances token relevance with diversity, formulated as a subset selection optimization problem.

Findings

01. Outperforms the state of the art in video question answering

02. Achieves superior results on temporal grounding tasks

03. Demonstrates versatility across multiple benchmarks

Abstract

Long video understanding presents challenges due to the inherent high computational complexity and redundant temporal information. An effective representation for long videos must efficiently process such redundancy while preserving essential contents for downstream tasks. This paper introduces SEmantic Attention Learning (SEAL), a novel unified representation for long videos. To reduce computational complexity, long videos are decomposed into three distinct types of semantic entities: scenes, objects, and actions, allowing models to operate on a compact set of entities rather than a large number of frames or pixels. To further address redundancy, we propose an attention learning module that balances token relevance with diversity, formulated as a subset selection optimization problem. Our representation is versatile and applicable across various long video understanding tasks.…
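To make the "relevance versus diversity" idea concrete, here is a minimal sketch of subset selection over token embeddings. It is not the paper's attention learning module: it substitutes a simple greedy, maximal-marginal-relevance style heuristic, and the names `select_tokens`, `cosine`, `trade_off`, and the toy vectors are all hypothetical, chosen only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_tokens(tokens, query, k, trade_off=0.5):
    """Greedily pick up to k token indices (illustrative heuristic only).

    Each step scores a candidate i as
        trade_off * sim(token_i, query)
        - (1 - trade_off) * max_j sim(token_i, token_j)  for j already selected,
    so a token that is relevant but nearly identical to a chosen one is penalized.
    """
    relevance = [cosine(t, query) for t in tokens]
    selected = []
    remaining = list(range(len(tokens)))
    while remaining and len(selected) < k:
        best_idx, best_score = None, -math.inf
        for i in remaining:
            # Redundancy is 0 while nothing has been selected yet.
            redundancy = max(
                (cosine(tokens[i], tokens[j]) for j in selected), default=0.0
            )
            score = trade_off * relevance[i] - (1.0 - trade_off) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        remaining.remove(best_idx)
    return selected


# Toy example: tokens 0 and 1 are near-duplicates, token 2 is different.
tokens = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]
query = [1.0, 0.2]
print(select_tokens(tokens, query, k=2))                 # [1, 2]
print(select_tokens(tokens, query, k=2, trade_off=1.0))  # [1, 0]
```

With the default trade-off, the second pick skips the near-duplicate token 0 in favor of the less relevant but non-redundant token 2; with `trade_off=1.0` the selection reduces to plain top-k relevance and keeps both near-duplicates.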

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning

Methods: Softmax · Attention Is All You Need · Sparse Evolutionary Training