X-OPD: Cross-Modal On-Policy Distillation for Capability Alignment in Speech LLMs

Di Cao; Dongjie Fu; Hai Yu; Siqi Zheng; Xu Tan; Tao Jin

arXiv:2603.24596 · eess.AS · March 31, 2026

TL;DR

X-OPD is a cross-modal on-policy distillation framework that aligns the capabilities of speech LLMs with those of their text-based counterparts, improving performance on complex tasks.

Contribution

The paper introduces a distillation method in which a speech LLM explores its own output distribution through on-policy rollouts and learns from token-level feedback provided by a text-based teacher.

Findings

1. X-OPD significantly narrows the performance gap between speech LLMs and their text-based counterparts on complex tasks.

2. The method preserves the inherent capabilities of speech LLMs.

3. Experiments across multiple benchmarks validate the effectiveness of X-OPD.

Abstract

While the shift from cascaded dialogue systems to end-to-end (E2E) speech Large Language Models (LLMs) improves latency and paralinguistic modeling, E2E models often exhibit significant performance degradation compared to their text-based counterparts. Standard Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) training methods fail to close this gap. To address this, we propose X-OPD, a novel Cross-Modal On-Policy Distillation framework designed to systematically align the capabilities of Speech LLMs with those of their text-based counterparts. X-OPD enables the Speech LLM to explore its own distribution via on-policy rollouts; a text-based teacher model evaluates these trajectories and provides token-level feedback, effectively distilling the teacher's capabilities into the student's multi-modal representations. Extensive experiments across multiple benchmarks demonstrate that…
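The on-policy loop described in the abstract can be illustrated with a toy sketch. The abstract does not state the loss, so this example assumes a per-token reverse KL divergence, KL(student || teacher), which is one common choice for on-policy distillation and may differ from what X-OPD actually uses. The `student` and `teacher` functions, the weight matrices, and the vocabulary size are all hypothetical stand-ins: in the real setting the student would condition on speech input and the teacher on the corresponding text.

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax over the last axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def sample_rollout(student_logits_fn, prompt, steps, rng):
    # On-policy: every token is sampled from the student's own distribution.
    tokens = list(prompt)
    for _ in range(steps):
        logp = log_softmax(student_logits_fn(tokens))
        tokens.append(int(rng.choice(len(logp), p=np.exp(logp))))
    return tokens[len(prompt):]

def token_level_reverse_kl(student_logits, teacher_logits):
    # Assumed feedback signal: KL(student || teacher) at each position.
    # Inputs have shape (T, V); the result has shape (T,).
    s = log_softmax(student_logits)
    t = log_softmax(teacher_logits)
    return (np.exp(s) * (s - t)).sum(axis=-1)

# Hypothetical toy models: next-token logits depend only on the last token.
rng = np.random.default_rng(0)
V = 5  # toy vocabulary size
W_student = rng.normal(size=(V, V))
W_teacher = rng.normal(size=(V, V))
student = lambda toks: W_student[toks[-1]]
teacher = lambda toks: W_teacher[toks[-1]]

prompt = [0]
rollout = sample_rollout(student, prompt, steps=4, rng=rng)

# The teacher scores the same prefixes that the student actually generated.
contexts = [prompt + rollout[:i] for i in range(len(rollout))]
s_logits = np.stack([student(c) for c in contexts])
t_logits = np.stack([teacher(c) for c in contexts])

per_token = token_level_reverse_kl(s_logits, t_logits)
loss = per_token.mean()  # scalar a training step would minimize
print(rollout, per_token, loss)
```

The point of the sketch is the data flow: the student generates the trajectory, and the teacher is only queried on prefixes the student produced, so the feedback covers states the student actually visits rather than states from a fixed reference dataset.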
