TL;DR
ViSiL introduces a novel CNN-based architecture for fine-grained spatio-temporal video similarity learning, capturing detailed intra- and inter-frame relations to improve video retrieval accuracy.
Contribution
The paper presents a method that computes video-to-video similarity directly from frame-to-frame similarity matrices, without first aggregating regional frame features into a single frame-level or video-level descriptor, which improves retrieval performance.
Findings
Achieves large improvements over state-of-the-art on five benchmark datasets.
Effectively captures temporal similarity patterns between matching frame sequences.
Demonstrates robustness across four different video retrieval tasks.
Abstract
In this paper we introduce ViSiL, a Video Similarity Learning architecture that considers fine-grained Spatio-Temporal relations between pairs of videos -- such relations are typically lost in previous video retrieval approaches that embed the whole frame or even the whole video into a vector descriptor before the similarity estimation. By contrast, our Convolutional Neural Network (CNN)-based approach is trained to calculate video-to-video similarity from refined frame-to-frame similarity matrices, so as to consider both intra- and inter-frame relations. In the proposed method, pairwise frame similarity is estimated by applying Tensor Dot (TD) followed by Chamfer Similarity (CS) on regional CNN frame features - this avoids feature aggregation before the similarity calculation between frames. Subsequently, the similarity matrix between all video frames is fed to a four-layer CNN, and…
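The frame-to-frame step described in the abstract (Tensor Dot followed by Chamfer Similarity on regional features) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `frame_similarity_matrix`, the array layout `(frames, regions, channels)`, the assumption that region vectors are L2-normalized, and the choice to take the maximum over the second video's regions before averaging over the first video's regions are all assumptions made here for clarity.

```python
import numpy as np


def frame_similarity_matrix(x: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Sketch of pairwise frame similarity from regional features.

    x: (T1, R, D) array, T1 frames, R regions per frame, D channels (assumed layout).
    y: (T2, R, D) array for the second video.
    Region vectors are assumed to be L2-normalized, so dot products are cosines.
    Returns a (T1, T2) frame-to-frame similarity matrix.
    """
    # Tensor Dot over the channel axis: every region of every frame in x
    # is compared with every region of every frame in y -> (T1, R, T2, R).
    region_sim = np.tensordot(x, y, axes=([2], [2]))
    # Chamfer Similarity: for each region of the x frame take its best
    # matching region in the y frame, then average over the x regions.
    return region_sim.max(axis=3).mean(axis=1)


def l2_normalize(a: np.ndarray) -> np.ndarray:
    """Normalize the last axis to unit length (hypothetical helper)."""
    return a / np.linalg.norm(a, axis=-1, keepdims=True)


rng = np.random.default_rng(0)
video_a = l2_normalize(rng.normal(size=(4, 9, 16)))  # 4 frames, 9 regions, 16 channels
video_b = l2_normalize(rng.normal(size=(6, 9, 16)))  # 6 frames

sim_ab = frame_similarity_matrix(video_a, video_b)   # shape (4, 6)
sim_aa = frame_similarity_matrix(video_a, video_a)   # diagonal is 1 for identical frames
```

Because no pooling happens before the dot product, each entry of the resulting matrix reflects region-level matches between two frames rather than a comparison of two pre-aggregated frame vectors, which is the property the abstract emphasizes.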
Taxonomy
Methods: Triplet Loss
