JiTTER: Jigsaw Temporal Transformer for Event Reconstruction for Self-Supervised Sound Event Detection

Hyeonuk Nam; Yong-Hwa Park

arXiv:2502.20857 · eess.AS · March 3, 2025

TL;DR

JiTTER introduces a hierarchical temporal shuffle reconstruction pretraining method for sound event detection, improving temporal modeling and event boundary detection by forcing the model to learn correct temporal order and transient details.

Contribution

The paper proposes JiTTER, a novel self-supervised learning framework that uses hierarchical temporal shuffling and noise injection to enhance temporal reasoning in transformer-based sound event detection.

Findings

01. JiTTER outperforms MAT-SED with a 5.89% PSDS improvement.

02. Structured temporal reconstruction improves event boundary detection.

03. Explicit temporal reasoning enhances sound event representation learning.

Abstract

Sound event detection (SED) has significantly benefited from self-supervised learning (SSL) approaches, particularly masked audio transformer for SED (MAT-SED), which leverages masked block prediction to reconstruct missing audio segments. However, while effective in capturing global dependencies, masked block prediction disrupts transient sound events and lacks explicit enforcement of temporal order, making it less suitable for fine-grained event boundary detection. To address these limitations, we propose JiTTER (Jigsaw Temporal Transformer for Event Reconstruction), an SSL framework designed to enhance temporal modeling in transformer-based SED. JiTTER introduces a hierarchical temporal shuffle reconstruction strategy, where audio sequences are randomly shuffled at both the block-level and frame-level, forcing the model to reconstruct the correct temporal order. This pretraining…
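The block-level and frame-level shuffling described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`hierarchical_shuffle`, `apply_permutation`, `restore_order`), the fixed `block_size`, and the choice to shuffle every block and every frame are assumptions made for illustration only. The sketch only shows how a two-level permutation of frame indices can be built and how the original temporal order serves as the reconstruction target.

```python
import random


def hierarchical_shuffle(num_frames, block_size, rng):
    """Return a frame permutation shuffled at block level, then frame level.

    Hypothetical helper: block_size and the shuffle-everything policy are
    illustrative assumptions, not values taken from the paper.
    """
    if num_frames % block_size != 0:
        raise ValueError("num_frames must be divisible by block_size")

    # Group consecutive frame indices into blocks.
    blocks = [
        list(range(start, start + block_size))
        for start in range(0, num_frames, block_size)
    ]

    # Block-level shuffle: reorder whole blocks.
    rng.shuffle(blocks)

    # Frame-level shuffle: reorder frames inside each block.
    for block in blocks:
        rng.shuffle(block)

    # Flatten back to one permutation of frame indices.
    return [index for block in blocks for index in block]


def apply_permutation(frames, perm):
    """Build the shuffled input: position i holds original frame perm[i]."""
    return [frames[index] for index in perm]


def restore_order(shuffled, perm):
    """Invert the permutation to recover the original temporal order."""
    restored = [None] * len(perm)
    for position, source_index in enumerate(perm):
        restored[source_index] = shuffled[position]
    return restored


rng = random.Random(0)
frames = [f"frame{i}" for i in range(12)]  # stand-in for per-frame features
perm = hierarchical_shuffle(len(frames), block_size=4, rng=rng)
shuffled = apply_permutation(frames, perm)
restored = restore_order(shuffled, perm)  # the target the model must recover
```

In a pretraining setup along these lines, `shuffled` would be the model input and `frames` the reconstruction target; `restore_order` is included only to show that the permutation is invertible.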

Code & Models

Repositories

frednam93/JiTTER-SED (PyTorch, official)

Taxonomy

Methods: Attention Is All You Need · Byte Pair Encoding · Dense Connections · Residual Connection · Linear Layer · Absolute Position Encodings · Layer Normalization · Label Smoothing · Multi-Head Attention · Position-Wise Feed-Forward Layer