Exploring High-Order Self-Similarity for Video Understanding
Manjin Kim, Heeseung Kwon, Karteek Alahari, Minsu Cho

TL;DR
This paper introduces a Multi-Order Self-Similarity (MOSS) module that captures different levels of space-time self-similarity in videos, improving various video understanding tasks with minimal additional computational cost.
Contribution
The paper proposes a novel MOSS module that learns and integrates multi-order space-time self-similarity features for enhanced video modeling.
Findings
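MOSS improves performance on video action recognition, motion-centric video VQA, and real-world robotic tasks.
STSS at different orders reveals distinct aspects of temporal dynamics.
The module is lightweight, adding only marginal computational cost and memory usage, and applies to diverse video tasks.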
Abstract
Space-time self-similarity (STSS), which captures visual correspondences across frames, provides an effective way to represent temporal dynamics for video understanding. In this work, we explore higher-order STSS and demonstrate how STSSs at different orders reveal distinct aspects of these dynamics. We then introduce the Multi-Order Self-Similarity (MOSS) module, a lightweight neural module designed to learn and integrate multi-order STSS features. It can be applied to diverse video tasks to enhance motion modeling capabilities while consuming only marginal computational cost and memory usage. Extensive experiments on video action recognition, motion-centric video VQA, and real-world robotic tasks consistently demonstrate substantial improvements, validating the broad applicability of MOSS as a general temporal modeling module. The source code and checkpoints will be publicly available.
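To make the idea of STSS concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes that first-order STSS is the cosine similarity between each position's feature and the features in a small space-time neighborhood, which matches the description of "visual correspondences across frames". The function name `stss`, the `radius` parameter, and the zero padding at the borders are illustrative choices. The second-order step, which recomputes self-similarity on the flattened first-order volume, is only one plausible reading of "higher-order" and is an assumption; the abstract does not define it.

```python
import numpy as np

def stss(features, radius=1):
    """Sketch of space-time self-similarity (hypothetical helper).

    features: array of shape (T, H, W, C).
    Returns an array of shape (T, H, W, K, K, K) with K = 2 * radius + 1,
    holding the cosine similarity between each position and its neighbors
    at temporal/vertical/horizontal offsets in [-radius, radius].
    Neighbors that fall outside the volume get similarity 0.
    """
    T, H, W, _ = features.shape
    # L2-normalize channels so that dot products are cosine similarities.
    norm = np.linalg.norm(features, axis=-1, keepdims=True)
    f = features / np.maximum(norm, 1e-8)
    k = 2 * radius + 1
    # Zero-pad the three space-time axes, leave channels untouched.
    padded = np.pad(f, ((radius, radius),) * 3 + ((0, 0),))
    out = np.zeros((T, H, W, k, k, k), dtype=f.dtype)
    for l in range(k):
        for u in range(k):
            for v in range(k):
                # Neighbor features at offset (l - radius, u - radius, v - radius).
                shifted = padded[l:l + T, u:u + H, v:v + W]
                out[..., l, u, v] = (f * shifted).sum(axis=-1)
    return out

def second_order_stss(features, radius=1):
    """Assumed reading of 'higher-order': STSS of the first-order STSS volume."""
    first = stss(features, radius)
    T, H, W = first.shape[:3]
    # Treat each position's flattened similarity pattern as its new feature.
    return stss(first.reshape(T, H, W, -1), radius)

# Usage on random features standing in for a backbone's output.
video_features = np.random.default_rng(0).standard_normal((3, 4, 4, 8))
first = stss(video_features)
second = second_order_stss(video_features)
print(first.shape, second.shape)
```

Under this reading, the first-order volume describes where each feature reappears nearby in space and time, while the second-order volume compares those similarity patterns with each other. How MOSS actually learns and fuses the orders is not specified in the abstract and is not modeled here.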
