Enhancing Low-Resource Language and Instruction Following Capabilities of Audio Language Models
Potsawee Manakul, Guangzhi Sun, Warit Sirichotedumrong, Kasima Tharnpipitchai, Kunat Pipatanakul

TL;DR
This paper investigates how to improve audio language models for low-resource languages such as Thai by optimizing the training data mixture and integrating instruction-following capabilities. The resulting model outperforms existing open-source models and rivals state-of-the-art systems.
Contribution
It introduces Typhoon-Audio, a model that improves low-resource language processing and instruction-following by balancing multilingual and language-specific training data.
Findings
Typhoon-Audio outperforms existing open-source models.
The model achieves performance comparable to Gemini-1.5-Pro.
Balanced data mixtures enhance instruction-following in low-resource languages.
Abstract
Audio language models process audio inputs using textual prompts for tasks like speech recognition and audio captioning. Although built on multilingual pre-trained components, most are trained primarily on English, limiting their usability for other languages. This paper evaluates audio language models on Thai, a low-resource language, and finds that they lack emergent cross-lingual abilities despite their multilingual foundations. To address this, we explore data mixtures that optimize audio language models for both a target language and English while integrating audio comprehension and speech instruction-following into a unified model. Our experiments provide insights into improving instruction-following in low-resource languages by balancing language-specific and multilingual training data. The proposed model, Typhoon-Audio, significantly outperforms existing open-source models and…
Taxonomy
Topics: Speech and dialogue systems · Speech Recognition and Synthesis
