CORD: Bridging the Audio-Text Reasoning Gap via Weighted On-policy Cross-modal Distillation
Jing Hu, Danxiang Zhu, Xianlong Luo, Dan Zhang, Shuwei He, Yishu Lei, Haitao Zheng, Shikun Feng, Jingzhou He, Yu Sun, Hua Wu, Haifeng Wang

TL;DR
CORD is a cross-modal alignment framework that improves audio-conditioned reasoning in large audio language models (LALMs) through online self-distillation and multi-granularity alignment, narrowing the performance gap with text-conditioned reasoning.
Contribution
CORD is a unified, on-policy cross-modal self-distillation method that aligns audio-conditioned reasoning with its text-conditioned counterpart, improving reasoning capabilities with only 80k synthetic samples.
Findings
CORD enhances audio-conditioned reasoning across benchmarks.
It reduces the audio-text performance gap effectively.
The method achieves strong results with only 80k synthetic samples.
Abstract
Large Audio Language Models (LALMs) have garnered significant research interest. Despite being built upon text-based large language models (LLMs), LALMs frequently exhibit a degradation in knowledge and reasoning capabilities. We hypothesize that this limitation stems from the failure of current training paradigms to effectively bridge the acoustic-semantic gap within the feature representation space. To address this challenge, we propose CORD, a unified alignment framework that performs online cross-modal self-distillation. Specifically, it aligns audio-conditioned reasoning with its text-conditioned counterpart within a unified model. Leveraging the text modality as an internal teacher, CORD performs multi-granularity alignment throughout the audio rollout process. At the token level, it employs on-policy reverse KL divergence with importance-aware weighting to prioritize early and…
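The token-level objective described in the abstract can be illustrated with a minimal sketch of a weighted reverse KL divergence. This is not the authors' implementation. It assumes the student distribution comes from the audio-conditioned pass and the teacher distribution from the text-conditioned pass of the same model, evaluated at each position of the audio rollout. The abstract is truncated before the weighting scheme is fully specified, so `early_decay_weights` and its `decay` parameter are hypothetical placeholders that only favor earlier tokens; all function names and logit values here are illustrative.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) at a single token position."""
    s = log_softmax(student_logits)
    t = log_softmax(teacher_logits)
    return sum(math.exp(ls) * (ls - lt) for ls, lt in zip(s, t))


def early_decay_weights(n, decay=0.9):
    """Hypothetical weighting: geometrically larger weight on earlier tokens."""
    return [decay ** i for i in range(n)]


def weighted_reverse_kl(student_seq, teacher_seq, weights):
    """Weight-normalized mean of per-token reverse KL over one rollout."""
    per_token = [reverse_kl(s, t) for s, t in zip(student_seq, teacher_seq)]
    return sum(w * k for w, k in zip(weights, per_token)) / sum(weights)


# Toy rollout of three tokens over a vocabulary of three entries.
student_seq = [[2.0, 0.5, -1.0], [0.1, 0.2, 0.3], [1.0, 1.0, 1.0]]  # audio-conditioned
teacher_seq = [[1.5, 0.7, -0.5], [0.0, 0.5, 0.2], [1.0, 1.0, 1.0]]  # text-conditioned
loss = weighted_reverse_kl(student_seq, teacher_seq, early_decay_weights(3))
```

In an on-policy setup the positions would be taken from sequences sampled by the student itself, so the divergence is measured on the states the audio-conditioned model actually visits. Reverse KL penalizes the student for placing mass where the teacher places little, which is why it is often described as mode-seeking.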
Taxonomy
Topics: Music and Audio Processing · Speech Recognition and Synthesis · Multimodal Machine Learning Applications
