CORD: Bridging the Audio-Text Reasoning Gap via Weighted On-policy Cross-modal Distillation

Jing Hu; Danxiang Zhu; Xianlong Luo; Dan Zhang; Shuwei He; Yishu Lei; Haitao Zheng; Shikun Feng; Jingzhou He; Yu Sun; Hua Wu; Haifeng Wang

arXiv:2601.16547 · cs.SD · January 26, 2026

TL;DR

This paper introduces CORD, a cross-modal alignment framework for large audio language models. It uses online self-distillation and multi-granularity alignment to bring audio-conditioned reasoning closer to text-conditioned reasoning within the same model, narrowing the performance gap between the two.

Contribution

CORD is a unified, on-policy cross-modal self-distillation method that aligns audio-conditioned reasoning with its text-conditioned counterpart, improving reasoning capabilities using only 80k synthetic samples.

Findings

01. CORD enhances audio-conditioned reasoning across benchmarks.

02. It reduces the audio-text performance gap effectively.

03. The method achieves strong results with only 80k synthetic samples.

Abstract

Large Audio Language Models (LALMs) have garnered significant research interest. Despite being built upon text-based large language models (LLMs), LALMs frequently exhibit a degradation in knowledge and reasoning capabilities. We hypothesize that this limitation stems from the failure of current training paradigms to effectively bridge the acoustic-semantic gap within the feature representation space. To address this challenge, we propose CORD, a unified alignment framework that performs online cross-modal self-distillation. Specifically, it aligns audio-conditioned reasoning with its text-conditioned counterpart within a unified model. Leveraging the text modality as an internal teacher, CORD performs multi-granularity alignment throughout the audio rollout process. At the token level, it employs on-policy reverse KL divergence with importance-aware weighting to prioritize early and…
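The token-level objective named in the abstract can be illustrated with a minimal sketch. The abstract is truncated before it specifies the weighting, so everything below is an assumption rather than the authors' formulation: the function names, the exponential `decay` schedule used to favor early tokens, and the toy logits are all hypothetical. The sketch computes a per-position reverse KL, KL(student || teacher), where the student is the audio-conditioned distribution and the teacher is the text-conditioned distribution evaluated at the same positions of an audio rollout, then combines the positions with normalized weights.

```python
import math

def softmax(logits):
    """Convert a list of logits into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reverse_kl(student_probs, teacher_probs, eps=1e-12):
    """KL(student || teacher): the expectation is taken under the student."""
    return sum(
        p * (math.log(p + eps) - math.log(q + eps))
        for p, q in zip(student_probs, teacher_probs)
    )

def position_weights(num_tokens, decay=0.9):
    """Hypothetical weighting: exponentially favor early positions.

    This schedule is an illustrative assumption, not the paper's weighting.
    """
    raw = [decay ** t for t in range(num_tokens)]
    total = sum(raw)
    return [w / total for w in raw]

def weighted_reverse_kl(student_logits, teacher_logits, decay=0.9):
    """Weighted sum of per-token reverse KL over one rollout.

    student_logits: per-position logits from the audio-conditioned pass.
    teacher_logits: per-position logits from the text-conditioned pass,
                    evaluated on the same (student-sampled) tokens.
    """
    weights = position_weights(len(student_logits), decay)
    per_token = [
        reverse_kl(softmax(s), softmax(t))
        for s, t in zip(student_logits, teacher_logits)
    ]
    loss = sum(w * k for w, k in zip(weights, per_token))
    return loss, per_token

# Toy rollout of three positions over a four-token vocabulary.
audio_logits = [[2.0, 0.5, 0.1, 0.0], [0.2, 1.5, 0.3, 0.1], [0.0, 0.1, 0.2, 2.5]]
text_logits = [[0.5, 2.0, 0.1, 0.0], [0.2, 1.4, 0.3, 0.1], [0.0, 0.1, 0.2, 2.5]]
loss, per_token = weighted_reverse_kl(audio_logits, text_logits)
print(loss, per_token)
```

In an actual training loop the teacher pass would typically be treated as a fixed target and gradients would flow only through the audio-conditioned distribution; that detail is also an assumption here, since the sketch works on plain numbers rather than a model.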

Taxonomy

Topics: Music and Audio Processing · Speech Recognition and Synthesis · Multimodal Machine Learning Applications