WARDEN: Endangered Indigenous Language Transcription and Translation with 6 Hours of Training Data

Ziheng Zhang; Yunzhong Hou; Naijing Liu; Liang Zheng

arXiv:2605.13846·cs.CL·May 14, 2026

TL;DR

This paper presents WARDEN, a two-stage system that transcribes and translates the endangered Wardaman language using only 6 hours of annotated audio, and that outperforms larger models in this low-resource setting.

Contribution

The paper introduces a two-stage approach to low-resource transcription and translation: a transcription model that benefits from phonemic initialization, followed by a translation model supported by a domain-specific dictionary.

Findings

01 WARDEN outperforms larger models with only 6 hours of data.

02 Phonemic initialization accelerates transcription model fine-tuning.

03 Domain-specific dictionary improves translation accuracy.
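
Finding 02 can be illustrated with a small sketch. The abstract describes the technique as initializing the Wardaman token from Sundanese; the code below assumes this means copying an existing token's embedding row into a newly added token, which is one common way to warm-start a new language token. The function name `init_new_token`, the token strings, and the toy 3-dimensional embeddings are all hypothetical, and the paper's actual procedure may differ.

```python
import random


def init_new_token(embeddings, vocab, new_token, donor_token, noise=0.0, seed=0):
    """Append an embedding row for new_token, copied from donor_token's row.

    embeddings: list of rows (each a list of floats), modified in place.
    vocab: dict mapping token string -> row index, modified in place.
    noise: optional Gaussian standard deviation added to the copied row.
    Returns the index of the new row.
    """
    if new_token in vocab:
        raise ValueError(f"{new_token!r} already exists in the vocabulary")
    if donor_token not in vocab:
        raise KeyError(f"donor token {donor_token!r} not found")

    rng = random.Random(seed)
    donor_row = embeddings[vocab[donor_token]]
    # Copy the donor row so later updates to one token do not alias the other.
    new_row = [v + rng.gauss(0.0, noise) if noise > 0 else v for v in donor_row]

    embeddings.append(new_row)
    vocab[new_token] = len(embeddings) - 1
    return vocab[new_token]


# Toy data: the token names and values are illustrative only.
vocab = {"<|en|>": 0, "<|su|>": 1}
embeddings = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]

idx = init_new_token(embeddings, vocab, "<|wardaman|>", "<|su|>")
```

Starting the new token from a related row rather than from random values gives fine-tuning a nearer starting point, which is consistent with the speed-up the finding reports.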

Abstract

This paper introduces WARDEN, an early language model system capable of transcribing and translating Wardaman, an endangered Australian Indigenous language, into English. The main challenge is the lack of large-scale training data: we have only 6 hours of annotated audio. While it is common practice to train a single model for transcription and translation on large datasets (as for English to French), this practice is not viable for Wardaman to English. To tackle the low-resource challenge, we design WARDEN with separate transcription and translation models: WARDEN first turns a Wardaman audio input into a phonemic transcription, and then turns the transcription into an English translation. Further, we propose two useful techniques to enhance performance. For transcription, we initialize the Wardaman token from Sundanese, a language that…
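
The two-stage design described in the abstract (audio to phonemic transcription, then transcription to English) can be sketched as a simple composition of two independent models. This is a minimal illustration, not the authors' implementation: `make_pipeline`, `toy_transcribe`, and `toy_translate` are hypothetical names, and the stubs return placeholder strings rather than real Wardaman output.

```python
from typing import Callable, Dict

# Each stage is modeled as a plain function so either one can be swapped out.
Transcriber = Callable[[bytes], str]  # audio bytes -> phonemic transcription
Translator = Callable[[str], str]     # phonemic transcription -> English text


def make_pipeline(transcribe: Transcriber, translate: Translator):
    """Chain a transcription model and a translation model into one callable."""

    def run(audio: bytes) -> Dict[str, str]:
        phonemic = transcribe(audio)   # stage 1: speech -> phonemic text
        english = translate(phonemic)  # stage 2: phonemic text -> English
        # Return the intermediate result too, so each stage can be checked alone.
        return {"transcription": phonemic, "translation": english}

    return run


# Stand-in stages (assumptions for illustration; real stages would be trained models).
def toy_transcribe(audio: bytes) -> str:
    return "placeholder-phonemes"


def toy_translate(text: str) -> str:
    return f"[English for: {text}]"


pipeline = make_pipeline(toy_transcribe, toy_translate)
result = pipeline(b"\x00\x01")
```

Keeping the stages separate means each can be trained and evaluated on its own small dataset, which matches the motivation the abstract gives for not training a single end-to-end model.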
