TL;DR
JiTTER introduces a hierarchical temporal shuffle reconstruction pretraining method for sound event detection, improving temporal modeling and event boundary detection by forcing the model to learn correct temporal order and transient details.
Contribution
The paper proposes JiTTER, a novel self-supervised learning framework that uses hierarchical temporal shuffling and noise injection to enhance temporal reasoning in transformer-based sound event detection.
Findings
JiTTER outperforms MAT-SED, improving the polyphonic sound detection score (PSDS) by 5.89%.
Structured temporal reconstruction improves event boundary detection.
Explicit temporal reasoning enhances sound event representation learning.
Abstract
Sound event detection (SED) has significantly benefited from self-supervised learning (SSL) approaches, particularly masked audio transformer for SED (MAT-SED), which leverages masked block prediction to reconstruct missing audio segments. However, while effective in capturing global dependencies, masked block prediction disrupts transient sound events and lacks explicit enforcement of temporal order, making it less suitable for fine-grained event boundary detection. To address these limitations, we propose JiTTER (Jigsaw Temporal Transformer for Event Reconstruction), an SSL framework designed to enhance temporal modeling in transformer-based SED. JiTTER introduces a hierarchical temporal shuffle reconstruction strategy, where audio sequences are randomly shuffled at both the block level and the frame level, forcing the model to reconstruct the correct temporal order. This pretraining…
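The two-level shuffle described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `hierarchical_shuffle`, the fixed `block_size`, and the two rate parameters are assumptions introduced here for illustration, and the paper may select and shuffle blocks differently. The sketch permutes a subset of blocks among themselves, then permutes frames inside a subset of blocks, and returns the permutation so the original order can serve as the reconstruction target.

```python
import numpy as np


def hierarchical_shuffle(frames, block_size, block_rate, frame_rate, rng):
    """Shuffle a (T, F) feature sequence at block level, then at frame level.

    All parameter names are hypothetical. Returns (shuffled, perm) such that
    shuffled[i] == frames[perm[i]].
    """
    num_frames = frames.shape[0]
    if num_frames % block_size != 0:
        raise ValueError("number of frames must be a multiple of block_size")
    num_blocks = num_frames // block_size

    # Row b of the index table holds the original frame indices of block b.
    index = np.arange(num_frames).reshape(num_blocks, block_size)

    # Block level: choose a subset of blocks and permute their positions.
    n_blocks_to_move = int(round(block_rate * num_blocks))
    block_order = np.arange(num_blocks)
    if n_blocks_to_move > 1:
        chosen = rng.choice(num_blocks, size=n_blocks_to_move, replace=False)
        block_order[chosen] = rng.permutation(chosen)
    index = index[block_order]

    # Frame level: inside a subset of blocks, permute the frame order.
    n_blocks_to_jitter = int(round(frame_rate * num_blocks))
    if n_blocks_to_jitter > 0:
        for b in rng.choice(num_blocks, size=n_blocks_to_jitter, replace=False):
            index[b] = rng.permutation(index[b])

    perm = index.reshape(-1)
    return frames[perm], perm


# Usage: 20 frames of 4-dimensional features, blocks of 5 frames.
rng = np.random.default_rng(0)
features = rng.standard_normal((20, 4))
shuffled, perm = hierarchical_shuffle(features, 5, 0.5, 0.5, rng)

# The inverse permutation recovers the original order, which is what a
# reconstruction objective would ask the model to predict.
restored = shuffled[np.argsort(perm)]
```

In a pretraining loop under these assumptions, the model would receive `shuffled` as input and be trained to output `features`; the choice of loss and any added noise are left out of this sketch.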
Taxonomy
Methods: Attention Is All You Need · Byte Pair Encoding · Dense Connections · Residual Connection · Linear Layer · Absolute Position Encodings · Layer Normalization · Label Smoothing · Multi-Head Attention · Position-Wise Feed-Forward Layer
