Temporal Perceiver: A General Architecture for Arbitrary Boundary Detection
Jing Tan, Yuhong Wang, Gangshan Wu, Limin Wang

TL;DR
The paper introduces Temporal Perceiver, a Transformer-based architecture that unifies the detection of various generic video boundaries, achieving state-of-the-art results across multiple benchmarks with reduced computational complexity.
Contribution
The paper proposes a unified Transformer architecture that uses latent feature queries and an alignment loss for efficient, boundary-agnostic video segmentation.
Findings
Achieves state-of-the-art results on multiple Generic Boundary Detection (GBD) benchmarks.
Reduces attention complexity from quadratic to linear in the number of input units.
Effectively handles boundaries at shot, event, and scene levels.
Abstract
Generic Boundary Detection (GBD) aims at locating the general boundaries that divide videos into semantically coherent and taxonomy-free units, and could serve as an important pre-processing step for long-form video understanding. Previous works often handle these different types of generic boundaries separately, with specific designs of deep networks ranging from simple CNNs to LSTMs. Instead, in this paper, we present Temporal Perceiver, a general Transformer-based architecture that offers a unified solution to the detection of arbitrary generic boundaries, ranging from shot-level and event-level to scene-level GBDs. The core design is to introduce a small set of latent feature queries as anchors to compress the redundant video input into a fixed dimension via cross-attention blocks. Thanks to this fixed number of latent units, it greatly reduces the quadratic complexity of the attention operation to a…
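The compression step described in the abstract can be illustrated with a minimal, dependency-free sketch of single-head cross-attention in which a small set of latent queries attends over a long feature sequence. This is not the authors' implementation: the function names, the sizes `N`, `K`, and `d`, and the random weights are all assumptions chosen for illustration, and the full model also involves components (such as the alignment loss) that are not shown here. The point is only that the score matrix has shape K x N, so with K fixed the cost grows linearly with the input length N rather than quadratically.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def softmax(row):
    """Numerically stable softmax over a single list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def latent_cross_attention(latents, features, w_q, w_k, w_v):
    """K latent queries attend over N input features (single head).

    Hypothetical helper for illustration; returns the compressed
    K x d output and the K x N attention weights.
    """
    q = matmul(latents, w_q)    # K x d queries from the latent units
    k = matmul(features, w_k)   # N x d keys from the video features
    v = matmul(features, w_v)   # N x d values from the video features
    scale = 1.0 / math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]  # d x N
    scores = matmul(q, k_t)               # K x N, not N x N
    weights = [softmax([s * scale for s in row]) for row in scores]
    return matmul(weights, v), weights


def random_matrix(rows, cols, rng, scale=1.0):
    """Gaussian random matrix, used here as stand-in data and weights."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
N, K, d = 200, 8, 16  # assumed sizes: N input units, K latent queries
features = random_matrix(N, d, rng)
latents = random_matrix(K, d, rng)
w_q = random_matrix(d, d, rng, 1.0 / math.sqrt(d))
w_k = random_matrix(d, d, rng, 1.0 / math.sqrt(d))
w_v = random_matrix(d, d, rng, 1.0 / math.sqrt(d))

compressed, weights = latent_cross_attention(latents, features, w_q, w_k, w_v)
print(len(compressed), len(compressed[0]))  # 8 16
print(len(weights), len(weights[0]))        # 8 200
```

Doubling `N` in this sketch doubles the number of attention scores (K x N), whereas full self-attention over the input would quadruple it (N x N).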
Taxonomy
Topics: Human Pose and Action Recognition · Video Analysis and Summarization · Generative Adversarial Networks and Image Synthesis
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Residual Connection · Label Smoothing · Softmax · Byte Pair Encoding · Adam · Dense Connections · Absolute Position Encodings
