ASK: Adaptive Self-improving Knowledge Framework for Audio Text Retrieval

Siyuan Fu; Xuchen Guo; Mingjun Liu; Hongxiang Li; Boyin Tan; Gongxi Zhu; Xianwei Zhuang; Jinghan Ru; Yuxin Xie; Yuguo Yin

arXiv:2512.19703·eess.AS·March 25, 2026

Open Access

TL;DR

The paper introduces ASK, a framework for audio-text retrieval that counters the limits of in-batch contrastive optimization by injecting external knowledge, and that keeps this knowledge aligned with the encoders as they evolve during training. The authors report state-of-the-art results.

Contribution

ASK combines multi-grained knowledge injection with a dynamic refinement strategy to address two problems in audio-text retrieval (ATR): the Gradient Locality Bottleneck (GLB) and Representation-Drift Mismatch (RDM).

Findings

01

Achieves new state-of-the-art performance on multiple benchmarks.

02

Mitigates acoustic ambiguities and improves the learning of rare long-tail concepts.

03

Demonstrates robustness across various backbone architectures.

Abstract

The dominant paradigm for Audio-Text Retrieval (ATR) relies on dual-encoder architectures optimized via mini-batch contrastive learning. However, restricting optimization to local in-batch samples creates a fundamental limitation we term the Gradient Locality Bottleneck (GLB), which prevents the resolution of acoustic ambiguities and hinders the learning of rare long-tail concepts. While external knowledge injection can break this bottleneck, it often triggers a problem called Representation-Drift Mismatch (RDM), where a static knowledge base becomes misaligned with evolving encoders, degrading guidance into noise. To address these intertwined challenges, we propose the Adaptive Self-improving Knowledge (ASK) framework. ASK breaks the GLB via multi-grained knowledge injection and mitigates RDM through a dynamic refinement strategy that synchronizes the knowledge base with the model.…
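
To make the mini-batch contrastive setup described in the abstract concrete, the following is a minimal sketch of a symmetric in-batch contrastive loss for a dual encoder. It is not the ASK method and is not taken from the paper. The function names, the toy 2-D embeddings, and the temperature value of 0.07 are all assumptions chosen for illustration. The sketch shows that the loss is computed only from pairs inside the current batch, which is the setting in which the abstract locates the Gradient Locality Bottleneck.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def similarity_matrix(audio_embs, text_embs, temperature):
    """Temperature-scaled cosine similarity between every audio and text item."""
    audio = [l2_normalize(a) for a in audio_embs]
    text = [l2_normalize(t) for t in text_embs]
    return [
        [sum(x * y for x, y in zip(a, t)) / temperature for t in text]
        for a in audio
    ]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct column for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(v - row_max) for v in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def in_batch_contrastive_loss(audio_embs, text_embs, temperature=0.07):
    """Symmetric (audio-to-text and text-to-audio) in-batch contrastive loss.

    The temperature default is an illustrative assumption, not a value
    reported in the source.
    """
    logits = similarity_matrix(audio_embs, text_embs, temperature)
    transposed = [list(col) for col in zip(*logits)]
    audio_to_text = cross_entropy_diagonal(logits)
    text_to_audio = cross_entropy_diagonal(transposed)
    return 0.5 * (audio_to_text + text_to_audio)


# Toy embeddings (hypothetical): pair i of audio and text should match.
audio_batch = [[1.0, 0.0], [0.0, 1.0]]
matched_text = [[1.0, 0.0], [0.0, 1.0]]
swapped_text = [[0.0, 1.0], [1.0, 0.0]]

loss_matched = in_batch_contrastive_loss(audio_batch, matched_text)
loss_swapped = in_batch_contrastive_loss(audio_batch, swapped_text)

# With a batch of one there are no in-batch negatives, so the loss is zero
# even though this audio and text embedding are orthogonal.
loss_single = in_batch_contrastive_loss([[1.0, 0.0]], [[0.0, 1.0]])
```

In this sketch the matched batch yields a loss close to zero and the swapped batch a large one, while the single-pair batch yields exactly zero regardless of how poorly the pair is aligned. The training signal therefore depends entirely on which other samples share the batch, which is one way to read the locality the abstract describes.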

Taxonomy

Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing