TL;DR
This paper introduces AuSiL, an audio similarity learning method for near-duplicate video retrieval. It combines CNN-based audio descriptors with temporal pattern analysis of the similarity between video pairs, and it is robust to speed transformations while remaining competitive with state-of-the-art methods.
Contribution
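The paper presents a new audio similarity learning approach. Audio descriptors are extracted with a CNN trained on a large-scale dataset of audio events, and a second CNN captures temporal patterns in the similarity matrix computed from those descriptors, improving robustness to speed transformations.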
Findings
Achieves competitive results against state-of-the-art methods.
Robust to speed transformations in audio duplicates.
Effectively captures temporal audio patterns.
Abstract
In this work, we address the problem of audio-based near-duplicate video retrieval. We propose the Audio Similarity Learning (AuSiL) approach, which effectively captures temporal patterns of audio similarity between video pairs. For robust similarity calculation between two videos, we first extract representative audio-based video descriptors by leveraging transfer learning based on a Convolutional Neural Network (CNN) trained on a large-scale dataset of audio events, and then we calculate the similarity matrix derived from the pairwise similarity of these descriptors. The similarity matrix is subsequently fed to a CNN that captures the temporal structures existing within its content. We train our network following a triplet generation process and optimize the triplet loss function. To evaluate the effectiveness of the proposed approach, we have manually annotated two…
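The pipeline described in the abstract (descriptors, pairwise similarity matrix, video-level score, triplet loss) can be illustrated with a minimal NumPy sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: cosine similarity as the pairwise measure, the descriptor shapes, the margin value, and in particular the hand-written `video_similarity` function, which is only a stand-in for the learned CNN that AuSiL applies to the similarity matrix.

```python
import numpy as np


def similarity_matrix(desc_a, desc_b):
    """Pairwise similarity between two sequences of audio descriptors.

    desc_a has shape (n_a, d) and desc_b has shape (n_b, d).
    Cosine similarity is an assumption; the abstract only says
    "pairwise similarity".
    """
    a = desc_a / np.linalg.norm(desc_a, axis=1, keepdims=True)
    b = desc_b / np.linalg.norm(desc_b, axis=1, keepdims=True)
    return a @ b.T  # shape (n_a, n_b)


def video_similarity(sim):
    """Hypothetical stand-in for the CNN over the similarity matrix.

    Takes the best match for each row and averages them. The real
    method learns this mapping; this heuristic is illustrative only.
    """
    return float(sim.max(axis=1).mean())


def triplet_loss(score_pos, score_neg, margin=1.0):
    """Hinge-style triplet loss on video-level similarity scores.

    Zero once the positive pair outscores the negative pair by at
    least `margin` (the margin value is an assumption).
    """
    return max(0.0, score_neg - score_pos + margin)


# Synthetic descriptors: the positive is a slightly perturbed copy of
# the anchor (a near-duplicate), the negative is unrelated noise.
rng = np.random.default_rng(0)
anchor = rng.normal(size=(6, 16))
positive = anchor + 0.01 * rng.normal(size=(6, 16))
negative = rng.normal(size=(8, 16))

s_pos = similarity_matrix(anchor, positive)
s_neg = similarity_matrix(anchor, negative)

score_pos = video_similarity(s_pos)
score_neg = video_similarity(s_neg)
loss = triplet_loss(score_pos, score_neg, margin=1.0)
```

In this toy setup the near-duplicate pair produces a strong diagonal in its similarity matrix, which is the kind of temporal structure the abstract says the CNN is trained to pick up.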
Taxonomy
Methods: Triplet Loss
