TW-Sound580K: A Regional Audio-Text Dataset with Verification-Guided Curation for Localized Audio-Language Modeling

Hao-Hui Xie; Ho-Lam Chung; Yi-Cheng Lin; Ke-Han Lu; Wenze Ren; Xie Chen; Hung-yi Lee

arXiv:2603.05094 · cs.SD · May 14, 2026

TL;DR

This paper introduces TW-Sound580K, a Taiwanese audio-text instruction dataset of 580K pairs built with a verification-guided curation pipeline, and shows that fine-tuning on it improves localized audio-language modeling through a new model, Tai-LALM.

Contribution

The paper presents a dataset curation method built on a Verify-Generate-Critique (VGC) protocol with Dual-ASR validation, along with Tai-LALM, a Large Audio-Language Model fine-tuned for Taiwanese regional dialects.

Findings

01

Tai-LALM achieves 49.1% accuracy on the TAU Benchmark.

02

Fine-tuning on the dataset improves accuracy by 6.5 percentage points over the zero-shot baseline (42.6% with ASR text conditioning).

03

Dynamic Dual-ASR arbitration enhances transcription quality.

Abstract

Large Audio-Language Models (LALMs) typically struggle with localized dialectal prosody due to the scarcity of specialized corpora. We present TW-Sound580K, a Taiwanese audio-text instruction dataset developed through a Verify-Generate-Critique (VGC) protocol. This pipeline leverages Dual-ASR validation to filter 522K raw clips, subsequently expanding them into 580,000 high-fidelity instruction pairs using a teacher model. The dataset's utility is demonstrated through Tai-LALM, which fine-tunes a DeSTA 2.5-Audio-initialized backbone and incorporates a dynamic Dual-ASR Arbitration strategy to optimize transcription selection during inference. On the TAU Benchmark, Tai-LALM reaches 49.1% accuracy, an absolute improvement of 6.5 percentage points over the zero-shot baseline (42.6% with ASR text conditioning). This confirms that integrating regional corpora with rigorous curation and dynamic arbitration…
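
The abstract does not say how Dual-ASR validation decides which clips to keep. As a rough illustration only, the sketch below assumes one plausible rule: transcribe each clip with two ASR systems and keep it only when the two transcripts agree closely at the character level. The field names `asr_a` and `asr_b`, the function names, and the `max_disagreement` threshold are all hypothetical and are not taken from the paper.

```python
def edit_distance(a: str, b: str) -> int:
    """Character-level Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def disagreement(hyp_a: str, hyp_b: str) -> float:
    """Normalized disagreement in [0, 1] between two ASR transcripts."""
    longest = max(len(hyp_a), len(hyp_b))
    if longest == 0:
        # Assumption: two empty transcripts give nothing to verify, so reject.
        return 1.0
    return edit_distance(hyp_a, hyp_b) / longest


def dual_asr_filter(clips, max_disagreement=0.25):
    """Keep clips whose two transcripts agree closely (hypothetical threshold)."""
    return [c for c in clips
            if disagreement(c["asr_a"], c["asr_b"]) <= max_disagreement]


# Toy data; the transcripts are made up for illustration.
clips = [
    {"id": "a", "asr_a": "今天天氣很好", "asr_b": "今天天氣很好"},
    {"id": "b", "asr_a": "我要去台北", "asr_b": "我要去台南"},
    {"id": "c", "asr_a": "你好", "asr_b": "再見了"},
]
print([c["id"] for c in dual_asr_filter(clips)])  # prints ['a', 'b']
```

Clip "c" is dropped because its two transcripts share no characters. A real pipeline would likely normalize punctuation and script variants before comparing, and the threshold would need tuning against held-out data.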
