Fluent Alignment with Disfluent Judges: Post-training for Lower-resource Languages

David Samuel; Lilja Øvrelid; Erik Velldal; Andrey Kutuzov

arXiv:2512.08777 · cs.CL · March 30, 2026

TL;DR

This paper introduces a post-training method for lower-resource languages that preserves language model fluency even when the model is aligned by disfluent reward models, using on-policy training without instruction-tuning data in the target language.

Contribution

It presents an on-policy training approach that preserves the fluency of language models for lower-resource languages without requiring instruction-tuning data in the target language.

Findings

01

On-policy training outperforms both supervised finetuning on machine-translated data and multilingual finetuning.

02

The method preserves fluency in Norwegian Bokmål as assessed by native speakers.

03

The approach does not rely on target-language instruction-tuning data or on datasets written by native speakers.

Abstract

We propose a post-training method for lower-resource languages that preserves the fluency of language models even when aligned by disfluent reward models. Preference optimization is now a well-researched topic, but previous work has mostly addressed models for English and Chinese. Lower-resource languages lack both datasets written by native speakers and instruction-tuned language models capable of generating fluent synthetic data. To address this, we focus on developing a fluent preference-aligned language model without any instruction-tuning data in the target language. Our approach uses an on-policy training method, which we compare with two common alternatives: supervised finetuning on machine-translated data and multilingual finetuning. We conduct a case study on Norwegian Bokmål and evaluate fluency through native-speaker assessments. The results show that the on-policy aspect…

Code & Models

Datasets

ltg/normistral-fluency-annotation
dataset · 29 downloads

Videos

Fluent Alignment with Disfluent Judges: Post-training for lower-resource languages · SlidesLive