Adapting Chat Language Models Using Only Target Unlabeled Language Data

Atsuki Yamaguchi; Terufumi Morishita; Aline Villavicencio; Nikolaos Aletras

arXiv:2412.11704 · cs.CL · October 21, 2025

Open Access · 1 Repo · 10 Models

TL;DR

ElChat is a novel method for adapting chat language models directly on unlabeled target data, avoiding the need for a base model and enhancing language, safety, and instruction-following capabilities.

Contribution

ElChat introduces a new approach that directly adapts chat models on unlabeled data, outperforming previous methods that rely on base models and weight differences.

Findings

01. ElChat achieves superior performance in language adaptation.

02. It maintains robust chat abilities and safety standards.

03. It outperforms previous conversion-based methods.

Abstract

Vocabulary expansion (VE) is the de-facto approach to language adaptation of large language models (LLMs) by adding new tokens and continuing pre-training on target data. While this is effective for base models trained on unlabeled data, it poses challenges for chat models trained to follow instructions through labeled conversation data. Directly adapting the latter with VE on target unlabeled data may result in forgetting chat abilities. While ideal, target chat data is often unavailable or costly to create for low-resource languages, and machine-translated alternatives are not always effective. To address this issue, previous work proposed using a base and chat model from the same family. This method first adapts the base LLM with VE on target unlabeled data and then converts it to a chat model by adding a chat vector (CV) derived from the weight difference between the source base and…
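The prior base-plus-chat-vector pipeline described in the abstract can be illustrated with a toy sketch. It assumes the usual chat vector definition (source chat weights minus source base weights, added to the adapted base weights); the abstract is cut off at this point, so the exact formulation used in the paper is not shown here. The flat-list weight format, the function names, and the rule of skipping parameters whose size changed after vocabulary expansion are hypothetical choices for illustration, not the authors' implementation.

```python
from typing import Dict, List

# Hypothetical toy format: parameter name -> flat list of weights.
Weights = Dict[str, List[float]]


def chat_vector(source_base: Weights, source_chat: Weights) -> Weights:
    """Per-parameter difference: source chat weights minus source base weights."""
    return {
        name: [c - b for c, b in zip(source_chat[name], source_base[name])]
        for name in source_base
    }


def apply_chat_vector(adapted_base: Weights, cv: Weights) -> Weights:
    """Add the chat vector to a base model adapted with vocabulary expansion.

    Assumption for this sketch: parameters whose size changed (for example,
    embeddings that gained rows for new tokens) are left untouched.
    """
    merged: Weights = {}
    for name, values in adapted_base.items():
        delta = cv.get(name)
        if delta is None or len(delta) != len(values):
            merged[name] = list(values)  # no matching chat vector entry
        else:
            merged[name] = [v + d for v, d in zip(values, delta)]
    return merged


# Toy weights, chosen only to make the arithmetic easy to follow.
source_base = {"layer.w": [1.0, 2.0], "embed": [0.0, 0.0]}
source_chat = {"layer.w": [1.5, 1.0], "embed": [0.5, 0.5]}
# The adapted base has a larger "embed" because new tokens were added.
adapted_base = {"layer.w": [3.0, 4.0], "embed": [1.0, 1.0, 1.0]}

cv = chat_vector(source_base, source_chat)
converted = apply_chat_vector(adapted_base, cv)
```

In this sketch, `layer.w` receives the chat vector while the resized `embed` parameter keeps its adapted values. ElChat, by contrast, is described above as adapting the chat model directly, without going through a base model.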

Code & Models

Repositories

gucci-j/chat-cve (PyTorch, official)

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems

Methods: Balanced Selection