Improving Language and Modality Transfer in Translation by Character-level Modeling

Ioannis Tsiamas; David Dale; Marta R. Costa-jussà

arXiv:2505.24561·cs.CL·June 2, 2025

TL;DR

This paper introduces a character-level modeling approach for multilingual translation and speech translation, enhancing adaptability to low-resource and unseen languages by leveraging cross-modal knowledge transfer and a fixed embedding space.

Contribution

It proposes a novel character-based method utilizing SONAR embeddings and a teacher-student training scheme to improve language transfer and zero-shot generalization in translation systems.

Findings

01 Outperforms subword models in low-resource language transfer

02 Achieves state-of-the-art speech-to-text translation on the FLEURS benchmark

03 Demonstrates strong zero-shot generalization to unseen languages

Abstract

Current translation systems, despite being highly multilingual, cover only 5% of the world's languages. Expanding language coverage to the long tail of low-resource languages requires data-efficient methods that rely on cross-lingual and cross-modal knowledge transfer. To this end, we propose a character-based approach to improve adaptability to new languages and modalities. Our method leverages SONAR, a multilingual fixed-size embedding space with different modules for encoding and decoding. We use a teacher-student approach with parallel translation data to obtain a character-level encoder. Then, using ASR data, we train a lightweight adapter to connect a massively multilingual CTC ASR model (MMS) to the character-level encoder, potentially enabling speech translation from 1,000+ languages. Experimental results in text translation for 75 languages on FLORES+ demonstrate that our…
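
As a rough illustration of the teacher-student step mentioned in the abstract, the toy sketch below trains a character-level student to match a fixed-size sentence embedding produced by a frozen teacher. It is not the authors' implementation: the mean-pooled embedding table, the squared-error loss, the learning rate, and the random `teacher_vec` (a stand-in for a SONAR encoder output) are all assumptions chosen to keep the example self-contained.

```python
import numpy as np

def char_ids(text, vocab):
    """Map each character to an integer id, adding unseen characters to vocab."""
    return [vocab.setdefault(ch, len(vocab)) for ch in text]

def student_encode(ids, table):
    """Toy student: mean-pool character embeddings into one fixed-size vector."""
    return table[ids].mean(axis=0)

def distill_step(ids, teacher_vec, table, lr=0.5):
    """One gradient step on the squared error between student and teacher vectors."""
    diff = student_encode(ids, table) - teacher_vec
    loss = float(diff @ diff)
    # Each occurrence of a character receives gradient 2 * diff / len(ids);
    # np.add.at accumulates correctly when a character repeats.
    np.add.at(table, ids, -lr * 2.0 * diff / len(ids))
    return loss

rng = np.random.default_rng(0)
vocab = {}
ids = char_ids("hello world", vocab)

# Hypothetical sizes: 8-dimensional embeddings, one row per distinct character.
table = rng.normal(size=(len(vocab), 8))

# Placeholder for the frozen teacher's sentence embedding (assumption).
teacher_vec = rng.normal(size=8)

losses = [distill_step(ids, teacher_vec, table) for _ in range(50)]
print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.6f}")
```

In this sketch only the student's character embeddings are updated; the teacher vector stays fixed, which mirrors the idea of training a new encoder into an existing embedding space rather than changing that space.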

Taxonomy

Topics: Natural Language Processing Techniques

Methods: Adapter