# Sound event detection with audio-text models and heterogeneous temporal annotations

**Authors:** Manu Harju, Annamaria Mesaros

arXiv: 2508.20703 · 2025-08-29

## TL;DR

This paper guides a sound event detection system with free-form text, using machine-generated captions as a complement to strong labels during training. PSDS-1 improves over a CRNN baseline both when all training data has strong labels and when half of it has only temporally weak labels.

## Contribution

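It proposes a method for guiding sound event detection with free-form text. The system is trained with machine-generated captions alongside strong labels, and is evaluated with different types of textual input and in a setting where part of the training data has only temporally weak labels.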

## Key findings

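- Synthetic captions improve performance over the CRNN architecture typically used for sound event detection, both with fully strong labels and with a mix of strong and weak labels.
- With strong labels, PSDS-1 rises from 0.223 to 0.277 on a dataset of 50 highly unbalanced classes.
- When half of the training data has only weak labels, PSDS-1 rises from 0.166 to 0.218.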
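To put the size of these gains in perspective, the short sketch below computes the absolute and relative PSDS-1 improvements from the four scores reported in the abstract. The helper name `gain` and the setting labels are illustrative only, and reading the lower score in each pair as the CRNN baseline is an assumption based on the abstract's wording.

```python
def gain(baseline: float, proposed: float) -> tuple[float, float]:
    """Return (absolute gain, relative gain in percent)."""
    absolute = proposed - baseline
    relative = 100.0 * absolute / baseline
    return absolute, relative


# PSDS-1 pairs as reported in the abstract: (lower score, higher score).
# Treating the lower score as the CRNN baseline is an assumption.
reported = {
    "strong labels": (0.223, 0.277),
    "half weak labels": (0.166, 0.218),
}

for setting, (baseline, proposed) in reported.items():
    absolute, relative = gain(baseline, proposed)
    print(f"{setting}: +{absolute:.3f} absolute, +{relative:.1f}% relative")
```

Under these numbers the relative improvement is roughly 24% with strong labels and roughly 31% in the half-weak setting, so the relative gain is somewhat larger when less strong supervision is available.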

## Abstract

Recent advances in generating synthetic captions based on audio and related metadata allow using the information contained in natural language as input for other audio tasks. In this paper, we propose a novel method to guide a sound event detection system with free-form text. We use machine-generated captions as complementary information to the strong labels for training, and evaluate the systems using different types of textual inputs. In addition, we study a scenario where only part of the training data has strong labels, and the rest of it only has temporally weak labels. Our findings show that synthetic captions improve the performance in both cases compared to the CRNN architecture typically used for sound event detection. On a dataset of 50 highly unbalanced classes, the PSDS-1 score increases from 0.223 to 0.277 when trained with strong labels, and from 0.166 to 0.218 when half of the training data has only weak labels.

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/2508.20703/full.md

## Figures

4 figures with captions in the complete paper: https://tomesphere.com/paper/2508.20703/full.md

## References

25 references — full list in the complete paper: https://tomesphere.com/paper/2508.20703/full.md

---
Source: https://tomesphere.com/paper/2508.20703