Efficient Adaptation of Multilingual Models for Japanese ASR
Mark Bajo, Haruka Fukukawa, Ryuji Morita, and Yuma Ogasawara

TL;DR
This paper shows that fine-tuning the multilingual ASR model Whisper-Tiny on Japanese datasets, either with LoRA or end-to-end, substantially improves Japanese speech recognition accuracy; the end-to-end variant reaches a lower CER than the larger Whisper-Base.
Contribution
It adapts a multilingual model, Whisper-Tiny, to Japanese ASR using two fine-tuning strategies, LoRA and end-to-end training, and reports strong Japanese performance from the adapted model.
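As a rough illustration of why LoRA is the cheaper of the two strategies, the sketch below builds the standard low-rank update W + (alpha / r) · B · A in plain Python and compares trainable-parameter counts for one square projection layer. The rank (8), the scaling factor, and the 384-wide layer are illustrative assumptions; the summary above does not state the paper's LoRA settings, and the helper names are hypothetical.

```python
# Illustrative only: rank, alpha, and layer size are assumptions,
# not values reported in the paper.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w, a, b, alpha):
    """Return W + (alpha / r) * B @ A.

    W is d_out x d_in (frozen), B is d_out x r, A is r x d_in (both trainable).
    """
    r = len(a)                      # rank = number of rows in A
    scale = alpha / r
    delta = matmul(b, a)            # low-rank update, d_out x d_in
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


def trainable_params(d_out, d_in, r):
    """Count trainable parameters: full fine-tuning vs. a rank-r adapter."""
    return {"full": d_out * d_in, "lora": r * (d_out + d_in)}


# With B initialised to zero, the adapted weight equals the frozen weight.
w = [[1.0, 2.0], [3.0, 4.0]]
a = [[0.5, -0.5]]          # r = 1, shape 1 x 2
b_zero = [[0.0], [0.0]]    # shape 2 x 1
assert lora_effective_weight(w, a, b_zero, alpha=1.0) == w

# Hypothetical 384 x 384 projection with an assumed rank of 8.
print(trainable_params(d_out=384, d_in=384, r=8))
```

End-to-end fine-tuning updates every entry of W, whereas the adapter only trains A and B, which is the trade-off between the two reported CER results.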
Findings
CER reduced from 32.7 to 20.8 with LoRA and to 14.7 with end-to-end fine-tuning
End-to-end fine-tuned Whisper-Tiny surpasses Whisper-Base's CER of 20.2; the LoRA variant (20.8) does not
Method retains model flexibility for language-specific tasks
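To put the reported figures on a common scale, the short sketch below computes the relative CER reduction of each fine-tuning strategy against the 32.7 baseline. Only the three CER values come from the abstract; the function name is hypothetical.

```python
def relative_reduction(before, after):
    """Relative reduction from `before` to `after`, in percent."""
    return (before - after) / before * 100.0


baseline_cer = 32.7  # Whisper-Tiny before fine-tuning (from the abstract)
reported = [("LoRA", 20.8), ("end-to-end", 14.7)]

for name, cer_value in reported:
    drop = relative_reduction(baseline_cer, cer_value)
    print(f"{name}: {drop:.1f}% relative CER reduction")
```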
Abstract
This study explores fine-tuning multilingual ASR (Automatic Speech Recognition) models, specifically OpenAI's Whisper-Tiny, to improve performance in Japanese. While multilingual models like Whisper offer versatility, they often lack precision in specific languages. Conversely, monolingual models like ReazonSpeech excel in language-specific tasks but are less adaptable. Using Japanese-specific datasets and Low-Rank Adaptation (LoRA) along with end-to-end (E2E) training, we fine-tuned Whisper-Tiny to bridge this gap. Our results show that fine-tuning reduced Whisper-Tiny's Character Error Rate (CER) from 32.7 to 20.8 with LoRA and to 14.7 with end-to-end fine-tuning, surpassing Whisper-Base's CER of 20.2. However, challenges with domain-specific terms remain, highlighting the need for specialized datasets. These findings demonstrate that fine-tuning multilingual models can achieve strong…
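For readers unfamiliar with the metric, CER is conventionally the character-level edit distance between the reference transcript and the model output, divided by the reference length. The sketch below is a minimal version of that definition; it assumes no text normalization (punctuation, script variants, spacing), which real evaluations usually apply, and it is not the paper's evaluation code. The sample sentences are made up.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings, counted in characters."""
    prev = list(range(len(hyp) + 1))          # distance from "" to hyp[:j]
    for i, r in enumerate(ref, start=1):
        curr = [i]                            # distance from ref[:i] to ""
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character Error Rate in percent, relative to the reference length."""
    if not ref:
        raise ValueError("reference transcript must not be empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


# Made-up example: one substitution plus one deletion over 7 reference characters.
print(cer("今日は晴れです", "今日は雨です"))
```

Because Japanese text has no whitespace word boundaries, a character-level metric avoids depending on a tokenizer, which is one reason CER rather than word error rate is commonly reported for Japanese ASR.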
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques
