CASTELLA: Long Audio Dataset with Captions and Temporal Boundaries

Hokuto Munakata; Takehiro Imamura; Taichi Nishimura; Tatsuya Komatsu

arXiv:2511.15131·eess.AS·January 30, 2026

TL;DR

CASTELLA is a large-scale, human-annotated dataset for audio moment retrieval (AMR), built so that models can be trained and evaluated on real-world recordings rather than on synthetic data alone. It is 24 times larger than the previous annotated dataset.

Contribution

This paper introduces CASTELLA, the first large-scale, human-annotated dataset for audio moment retrieval, and establishes a baseline model showing that fine-tuning on CASTELLA after pre-training on synthetic data outperforms training on synthetic data alone.

Findings

01

Fine-tuning on CASTELLA improves model recall by 10.4 points.

02

CASTELLA is 24 times larger than the previous annotated dataset, with 1,862 recordings in total (1009 training, 213 validation, 640 test).

03

Models trained on CASTELLA outperform synthetic-only models.
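
The recall figure in the findings above is not defined on this page. As a rough illustration only, moment retrieval work commonly scores a model by Recall@1 at a temporal IoU threshold: the top predicted span counts as a hit if it overlaps a ground-truth span closely enough. The sketch below assumes that metric; the names `temporal_iou` and `recall_at_1`, the threshold value, and the example spans are hypothetical and not taken from the paper.

```python
from typing import List, Tuple

# A moment is a (start_sec, end_sec) span inside a long recording.
Span = Tuple[float, float]


def temporal_iou(a: Span, b: Span) -> float:
    """Intersection over union of two time spans, in [0, 1]."""
    intersection = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - intersection
    return intersection / union if union > 0 else 0.0


def recall_at_1(
    top_predictions: List[Span],
    ground_truths: List[List[Span]],
    iou_threshold: float = 0.7,  # assumed value, not stated on this page
) -> float:
    """Percentage of queries whose top-1 span matches any annotated span."""
    if not top_predictions or len(top_predictions) != len(ground_truths):
        raise ValueError("need one top prediction per query")
    hits = sum(
        any(temporal_iou(pred, gt) >= iou_threshold for gt in gts)
        for pred, gts in zip(top_predictions, ground_truths)
    )
    return 100.0 * hits / len(top_predictions)


# Two hypothetical queries: the first prediction overlaps its annotation
# closely, the second misses entirely.
predictions = [(0.0, 10.0), (20.0, 30.0)]
annotations = [[(0.0, 9.0)], [(40.0, 50.0)]]
score = recall_at_1(predictions, annotations, iou_threshold=0.7)
```

Under this definition, a gain of some number of "points" is a difference between two such percentages measured on the same test split.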

Abstract

We introduce CASTELLA, a human-annotated audio benchmark for the task of audio moment retrieval (AMR). Although AMR has various useful potential applications, there is still no established benchmark with real-world data. The initial study of AMR trained the models solely on synthetic datasets. Moreover, the evaluation is based on an annotated dataset of fewer than 100 samples. This resulted in less reliable reported performance. To ensure performance for applications in real-world environments, we present CASTELLA, a large-scale manually annotated AMR dataset. CASTELLA consists of 1009, 213, and 640 audio recordings for training, validation, and test splits, respectively, which is 24 times larger than the previous dataset. We also establish a baseline model for AMR using CASTELLA. Our experiments demonstrate that a model fine-tuned on CASTELLA after pre-training on the synthetic data…

Taxonomy

Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing