Variable-Length Audio Fingerprinting
Hongjie Chen, Hanyu Meng, Huimin Zeng, Ryan A. Rossi, Lie Lu, Josh Kimball

TL;DR
This paper introduces VLAFP, a novel deep learning method for audio fingerprinting that processes variable-length audio clips, improving recognition accuracy in real-world scenarios.
Contribution
VLAFP is the first deep audio fingerprinting model supporting variable-length audio, addressing the limitations of fixed-length segmentation in prior approaches.
Findings
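VLAFP outperforms existing state-of-the-art methods in live audio identification across three real-world datasets.
VLAFP also achieves higher accuracy than these methods in audio retrieval on the same datasets.
The model handles variable-length audio inputs during both training and testing.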
Abstract
Audio fingerprinting converts audio to much lower-dimensional representations, allowing distorted recordings to still be recognized as their originals through similar fingerprints. Existing deep learning approaches rigidly fingerprint fixed-length audio segments, thereby neglecting temporal dynamics during segmentation. To address limitations due to this rigidity, we propose Variable-Length Audio FingerPrinting (VLAFP), a novel method that supports variable-length fingerprinting. To the best of our knowledge, VLAFP is the first deep audio fingerprinting model capable of processing audio of variable length, for both training and testing. Our experiments show that VLAFP outperforms existing state-of-the-art methods in live audio identification and audio retrieval across three real-world datasets.
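As a rough illustration of the matching idea in the abstract (a distorted recording is recognized because its fingerprint stays close to the original's), here is a minimal sketch. The fingerprint vectors, the `identify` helper, and the choice of cosine similarity are assumptions made for illustration; they are not taken from the paper.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def identify(query_fp, database):
    """Return (name, similarity) of the stored fingerprint closest to the query.

    `database` maps a recording name to its fingerprint vector.
    This brute-force scan is a hypothetical stand-in for a real index.
    """
    scored = ((name, cosine_similarity(query_fp, fp)) for name, fp in database.items())
    return max(scored, key=lambda pair: pair[1])


# Toy 4-dimensional fingerprints (made-up numbers, not model outputs).
database = {
    "song_a": [0.9, 0.1, 0.0, 0.4],
    "song_b": [0.1, 0.8, 0.5, 0.0],
}

# A "distorted" version of song_a: similar direction, slightly perturbed.
distorted_query = [0.8, 0.2, 0.1, 0.5]

best_name, best_score = identify(distorted_query, database)
print(best_name, round(best_score, 3))
```

In practice the fingerprints would come from a trained encoder and the lookup would use an approximate nearest-neighbor index rather than a linear scan.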
Peer Reviews
Decision · Submitted to ICLR 2026
### Originality
- The paper provides a clean extension of neural audio fingerprinting by replacing the convolutional encoder with a transformer encoder and by proposing a segment-level embedding refinement strategy.
- The model is trained using a standard contrastive loss as used in the baseline Neural Audio FingerPrinting (NAFP) [Chen et al.].
- The idea of segmenting database recordings into variable-length segments based on spectral entropy is conceptually interesting and practically motiv
### Conceptual
- The paper is primarily empirical, with limited theoretical analysis.
- The empirical analysis could be enhanced by including other tasks that rely on audio representation learning. Currently, the scope is narrowly focused on audio retrieval; broader implications for general-purpose audio representation learning are unexplored.
- The method restricts segment lengths to a predefined range \([T_{\min}, T_{\max}]\), raising a question about how queries shorter than \(T_{\min}\)
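
The spectral-entropy segmentation and the \([T_{\min}, T_{\max}]\) length range raised in these reviews can be illustrated with a small sketch. The paper's actual procedure is not reproduced on this page, so everything below is an assumption: the greedy splitting rule, the `jump` threshold, and the function names are hypothetical and only show how an entropy signal could drive variable-length cuts within a length range.

```python
import math


def spectral_entropy(power_spectrum):
    """Shannon entropy (in bits) of one frame's normalized power spectrum."""
    total = sum(power_spectrum)
    if total == 0:
        return 0.0
    probs = [p / total for p in power_spectrum if p > 0]
    return -sum(p * math.log2(p) for p in probs)


def segment_by_entropy(entropies, t_min, t_max, jump=0.5):
    """Greedy split of frames into segments (assumed rule, not the paper's).

    A segment is closed before frame i when either
      * it already holds t_max frames, or
      * it holds at least t_min frames and the entropy changes by more
        than `jump` between frame i-1 and frame i.
    Returns half-open (start, end) frame index pairs. The final segment
    may be shorter than t_min if the recording ends early.
    """
    segments = []
    start = 0
    for i in range(1, len(entropies)):
        length = i - start
        jumped = abs(entropies[i] - entropies[i - 1]) > jump
        if length >= t_max or (length >= t_min and jumped):
            segments.append((start, i))
            start = i
    if start < len(entropies):
        segments.append((start, len(entropies)))
    return segments


# Toy frames: a pure tone (energy in one bin) vs. flat noise (equal energy).
tonal = [1.0, 0.0, 0.0, 0.0]
noisy = [1.0, 1.0, 1.0, 1.0]
frames = [tonal] * 3 + [noisy] * 3 + [tonal] * 4

entropies = [spectral_entropy(f) for f in frames]
segments = segment_by_entropy(entropies, t_min=2, t_max=4)
print(segments)
```

On this toy input the cuts fall where the signal switches between tonal and noisy frames, so segment lengths follow the content rather than a fixed grid.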
1. Novelty of Variable-Length Processing: To the best of the reviewers’ knowledge, VLAFP is the first deep audio fingerprinting model capable of handling variable-length inputs during both training and inference. This addresses a critical limitation in existing methods and enables more natural and semantically meaningful segmentation.
2. Well-Motivated and Effective Segmentation Strategy: The spectral entropy-based segmentation is well-justified and grounded in signal processing principles. The
1. Computational Overhead at Inference: The paper notes that VLAFP has a longer inference time due to the overhead of handling variable-length segments. This could be a practical limitation for real-time applications, and the trade-off between accuracy and efficiency is not deeply analyzed.
2. Evaluation on Synthetic Distortions: The audio augmentations (time-stretching, background noise, impulse response) are synthetically applied. While standard in the field, real-world broadcast distortions
The core idea leverages two attention mechanisms—inter-frame and frame-to-segment attention—to capture both local and global contextual information from audio frames, which is a reasonable design choice.
However, several significant issues undermine the paper’s contribution and rigor:
1. **Lack of Technical Precision**: The mathematical formulation is often imprecise. For instance, in Section 3.1, the use of an approximation symbol (≈) is misleading; the intent appears to be minimizing the distance between *z* and *z*′, which should be explicitly stated as an optimization objective. Moreover, symbols are frequently introduced without definition; e.g., *H* and *s* appear abruptly in the Frame-to
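
To make the frame-to-segment attention mentioned in this review more concrete, here is a minimal sketch of attention pooling: a variable number of frame embeddings is reduced to one fixed-size segment embedding. This is a generic single-head, projection-free form written for illustration; the `query` vector, the function names, and the toy frames are assumptions, and the paper's actual layers and notation are not reproduced here.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(frames, query):
    """Pool any number of d-dimensional frames into one d-dimensional vector.

    Each frame is scored by a scaled dot product with `query` (a stand-in
    for a learned segment-level query), the scores are normalized with
    softmax, and the output is the weighted sum of the frames.
    """
    d = len(query)
    scores = [
        sum(f_k * q_k for f_k, q_k in zip(frame, query)) / math.sqrt(d)
        for frame in frames
    ]
    weights = softmax(scores)
    return [sum(w * frame[k] for w, frame in zip(weights, frames)) for k in range(d)]


query = [1.0, 0.0]  # hypothetical query favoring the first feature

short_segment = [[1.0, 0.0], [0.0, 1.0]]
long_segment = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]

fp_short = attention_pool(short_segment, query)
fp_long = attention_pool(long_segment, query)
print(len(fp_short), len(fp_long))
```

Because the softmax weights sum to one regardless of how many frames are present, the pooled vector has the same dimensionality for a 2-frame and a 5-frame segment, which is the property a variable-length fingerprinting model needs from its pooling step.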
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Digital Media Forensic Detection
