Vuyko Mistral: Adapting LLMs for Low-Resource Dialectal Translation
Roman Kyslyi, Yuliia Maksymiuk, Ihor Pysmennyi

TL;DR
This paper adapts large language models to low-resource dialectal translation for the Hutsul dialect of Ukrainian. The authors build a parallel corpus, generate synthetic data to expand it, and fine-tune open-source LLMs with LoRA; the fine-tuned models outperform a GPT-4o baseline.
Contribution
The paper introduces the first dialect-specific adaptation of LLMs for Ukrainian Hutsul, including a new corpus, synthetic data generation pipeline, and evaluation methodology.
Findings
Small fine-tuned models outperform zero-shot GPT-4o baselines.
Synthetic data expansion improves translation quality.
Multi-metric evaluation effectively assesses low-resource dialect translation.
Abstract
In this paper we introduce the first effort to adapt large language models (LLMs) to a Ukrainian dialect (in our case Hutsul), a low-resource and morphologically complex dialect spoken in the Carpathian Highlands. We created a parallel corpus of 9852 dialect-to-standard Ukrainian sentence pairs and a dictionary of 7320 dialectal word mappings. To address data scarcity, we propose an advanced Retrieval-Augmented Generation (RAG) pipeline that generates synthetic parallel translation pairs, expanding the corpus with 52142 examples. We fine-tuned multiple open-source LLMs using LoRA and evaluated them on a standard-to-dialect translation task, comparing them with few-shot GPT-4o translation. In the absence of human annotators, we adopt a multi-metric evaluation strategy combining BLEU, chrF++, TER, and LLM-based judgment (GPT-4o). The results show that even small (7B) fine-tuned…
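To make the character-level part of the multi-metric evaluation concrete, the following is a minimal sketch of a chrF-style score in plain Python. It is an illustration only: the function names `char_ngrams` and `simple_chrf` are hypothetical, the sketch omits the word n-gram component that distinguishes chrF++ from chrF, and the source does not say which implementation the authors used. In practice an established library such as sacreBLEU would normally be preferred.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams after collapsing runs of whitespace."""
    chars = " ".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis: str, reference: str,
                max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified character n-gram F-beta score in [0, 1].

    Assumption: precision and recall are averaged over the n-gram
    orders that exist in both strings; word n-grams are not used.
    """
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # one string is too short for this n-gram order
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    # beta = 2 weights recall more heavily than precision
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)
```

Calling `simple_chrf(model_output, reference_translation)` on a system output and its reference returns a value between 0 and 1, where higher means more shared character n-grams. Character-level matching of this kind is often favored for morphologically rich languages, because a word with a different inflection still earns partial credit.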
Taxonomy
Topics: Authorship Attribution and Profiling · Natural Language Processing Techniques · Language and cultural evolution
