Speech-Worthy Alignment for Japanese SpeechLLMs via Direct Preference Optimization

Mengjie Zhao; Lianbo Liu; Yusuke Fujita; Hao Shi; Yuan Gao; Roman Koshkin; Yui Sudo

arXiv:2603.12565·cs.SD·March 16, 2026

TL;DR

This paper introduces a preference-based alignment method that adapts Japanese SpeechLLMs to produce speech-worthy outputs: text that is concise, conversational, and readily synthesized as natural speech. The method improves results on a spoken-style benchmark while largely preserving written-style performance.

Contribution

The paper presents a preference-based alignment approach tailored to Japanese SpeechLLMs and introduces SpokenElyza, a benchmark for Japanese speech-worthiness derived from ELYZA-tasks-100 with auditory verification by native experts.

Findings

01 Substantial improvement on the SpokenElyza benchmark

02 Largely preserves performance on the original written-style evaluation

03 Enhances naturalness and conversational quality

Abstract

SpeechLLMs typically combine ASR-trained encoders with text-based LLM backbones, leading them to inherit written-style output patterns unsuitable for text-to-speech synthesis. This mismatch is particularly pronounced in Japanese, where spoken and written registers differ substantially in politeness markers, sentence-final particles, and syntactic complexity. We propose a preference-based alignment approach to adapt Japanese SpeechLLMs for speech-worthy outputs: text that is concise, conversational, and readily synthesized as natural speech. To rigorously evaluate this task, we introduce SpokenElyza, a benchmark for Japanese speech-worthiness derived from ELYZA-tasks-100 with auditory verification by native experts. Experiments show that our approach achieves substantial improvement on SpokenElyza while largely preserving performance on the original written-style evaluation. We will…
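
The title names Direct Preference Optimization (DPO) as the alignment method, but the abstract does not give the training details. As a rough illustration only, the following is a minimal sketch of the standard DPO loss for a single preference pair, written in plain Python. The function names, the `beta` value, and the idea that the preferred response is the spoken-style one are assumptions made for this example, not details confirmed by the paper.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) response pair.

    Inputs are sequence-level log-probabilities of each response under the
    trainable policy and under a frozen reference model. In this sketch the
    "chosen" response is assumed to be the speech-worthy one (hypothetical
    setup); beta=0.1 is an illustrative default, not the paper's setting.
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) equals softplus(-margin)
    return softplus(-margin)


# Policy identical to the reference: margin is 0, loss is log(2).
baseline = dpo_loss(-10.0, -12.0, -10.0, -12.0)

# Policy that raises the chosen response and lowers the rejected one
# relative to the reference yields a smaller loss.
improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)
```

Minimizing this loss pushes the policy to assign relatively more probability to the preferred response than the reference model does, without training a separate reward model.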

Code & Models

Datasets

sbintuitions/voicebench-ja
dataset · 53 dl

Taxonomy

Topics: Topic Modeling · Speech and dialogue systems · Speech Recognition and Synthesis