Revisiting Kernel Temporal Segmentation as an Adaptive Tokenizer for Long-form Video Understanding

Mohamed Afham; Satya Narayan Shukla; Omid Poursaeed; Pengchuan Zhang; Ashish Shah; Sernam Lim

arXiv:2309.11569·cs.CV·September 22, 2023

TL;DR

This paper introduces an adaptive, unsupervised Kernel Temporal Segmentation method for sampling long videos, improving upon uniform sampling and achieving state-of-the-art results in long-form video understanding tasks.

Contribution

The paper proposes a novel, task-agnostic KTS-based sampling approach for long videos, replacing uniform sampling with semantically meaningful segments.

Findings

01 Consistent performance improvements over existing methods

02 State-of-the-art results in long-form video classification

03 Effective in temporal action localization

Abstract

While most modern video understanding models operate on short-range clips, real-world videos are often several minutes long with semantically consistent segments of variable length. A common approach to process long videos is applying a short-form video model over uniformly sampled clips of fixed temporal length and aggregating the outputs. This approach neglects the underlying nature of long videos since fixed-length clips are often redundant or uninformative. In this paper, we aim to provide a generic and adaptive sampling approach for long-form videos in lieu of the de facto uniform sampling. Viewing videos as semantically consistent segments, we formulate a task-agnostic, unsupervised, and scalable approach based on Kernel Temporal Segmentation (KTS) for sampling and tokenizing long videos. We evaluate our method on long-form video understanding tasks such as video classification…
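To make the idea of segment-based sampling concrete, here is a minimal sketch of kernel change-point detection in the spirit of KTS. It is not the authors' implementation: the linear kernel, the fixed number of change points, and the function name `kts_change_points` are all assumptions made for illustration, and the per-frame features are taken as given. The dynamic program picks boundaries that minimize the total within-segment scatter in kernel space.

```python
import numpy as np


def kts_change_points(features, num_change_points):
    """Hypothetical helper: return the sorted frame indices where new segments start.

    Assumptions (not from the paper text): a linear kernel on the frame
    features and a fixed, user-chosen number of change points.
    Requires at least num_change_points + 1 frames.
    """
    X = np.asarray(features, dtype=float)
    n = X.shape[0]
    K = X @ X.T  # linear-kernel Gram matrix, shape (n, n)

    # Prefix sums so that any segment cost is available in O(1).
    diag_cum = np.concatenate(([0.0], np.cumsum(np.diag(K))))
    block_cum = np.zeros((n + 1, n + 1))
    block_cum[1:, 1:] = K.cumsum(axis=0).cumsum(axis=1)

    def cost(i, j):
        # Within-segment scatter of frames i..j-1 in kernel space.
        block = (block_cum[j, j] - block_cum[i, j]
                 - block_cum[j, i] + block_cum[i, i])
        return (diag_cum[j] - diag_cum[i]) - block / (j - i)

    m = num_change_points
    # best[k, j]: minimum cost of splitting frames 0..j-1 into k + 1 segments.
    best = np.full((m + 1, n + 1), np.inf)
    back = np.zeros((m + 1, n + 1), dtype=int)
    for j in range(1, n + 1):
        best[0, j] = cost(0, j)
    for k in range(1, m + 1):
        for j in range(k + 1, n + 1):
            for t in range(k, j):  # t is the start of the last segment
                c = best[k - 1, t] + cost(t, j)
                if c < best[k, j]:
                    best[k, j] = c
                    back[k, j] = t

    # Walk the back-pointers from the end of the video to recover boundaries.
    change_points = []
    j = n
    for k in range(m, 0, -1):
        j = int(back[k, j])
        change_points.append(j)
    return sorted(change_points)


# Toy example: three "scenes" of different lengths with constant features.
toy = np.array([[1, 0, 0]] * 4 + [[0, 1, 0]] * 3 + [[0, 0, 1]] * 5)
boundaries = kts_change_points(toy, num_change_points=2)
```

On the toy input the recovered segments have lengths 4, 3, and 5, so a sampler that draws from each segment would cover every scene, whereas fixed-length uniform clips could straddle the boundaries. This sketch runs in cubic time in the number of frames for a fixed number of change points, so it is only meant for short feature sequences.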

Taxonomy

Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Video Analysis and Summarization