Transcription and translation of videos using fine-tuned XLSR Wav2Vec2 on custom dataset and mBART

Aniket Tathe; Anand Kamble; Suyash Kumbharkar; Atharva Bhandare; Anirban C. Mitra

arXiv:2403.00212 · cs.CL · March 4, 2024 · 1 citation

TL;DR

This paper presents a method for personalized video transcription and translation using a small dataset, combining fine-tuned XLSR Wav2Vec2 and mBART models to produce synchronized multilingual transcriptions.

Contribution

It introduces a novel approach that fine-tunes XLSR Wav2Vec2 on minimal data and integrates mBART for translation, enabling efficient personalized video transcription and translation.

Findings

01 Fine-tuned XLSR Wav2Vec2 for a personalized voice using only 14 minutes of custom audio.

02 Translated Hindi videos, with the translated text synchronized to the video timeline.

03 Developed a web-based GUI for multilingual video transcription and translation.

Abstract

This research addresses the challenge of training an ASR model for personalized voices with minimal data. Utilizing just 14 minutes of custom audio from a YouTube video, we employ Retrieval-Based Voice Conversion (RVC) to create a custom Common Voice 16.0 corpus. Subsequently, a Cross-lingual Self-supervised Representations (XLSR) Wav2Vec2 model is fine-tuned on this dataset. The developed web-based GUI efficiently transcribes and translates input Hindi videos. By integrating XLSR Wav2Vec2 and mBART, the system aligns the translated text with the video timeline, delivering an accessible solution for multilingual video content transcription and translation for personalized voice.
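The timeline alignment described in the abstract can be illustrated with a minimal sketch. It assumes the ASR stage yields segments with start and end times in seconds, and that translation is a plain text-to-text function; the `Segment` class, the `to_srt` helper, and the choice of SRT as the output format are hypothetical and are not taken from the paper. The idea shown is only that each translated segment reuses the timing of the segment it was transcribed from.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Segment:
    """Hypothetical container for one timed piece of ASR output."""
    start: float  # seconds from the start of the video
    end: float    # seconds from the start of the video
    text: str     # transcribed source-language text


def format_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments: List[Segment], translate: Callable[[str], str]) -> str:
    """Translate each segment's text while keeping its original timing."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n"
            f"{format_timestamp(seg.start)} --> {format_timestamp(seg.end)}\n"
            f"{translate(seg.text)}"
        )
    return "\n\n".join(blocks) + "\n"
```

In a full pipeline, `translate` would presumably wrap a call to an mBART model and the segments would come from the fine-tuned XLSR Wav2Vec2 output; here any callable that maps a string to a string can stand in for it.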

Code & Models

Models

🤗
Aniket-Tathe-08/XLSR-Wav2Vec2-Finetuned-14min-dataset
model · 2 dl

Datasets

Aniket-Tathe-08/Custom_Common_Voice_16.0_dataset_using_RVC_14min_data
dataset · 42 dl

Taxonomy

Topics: Cancer-related molecular mechanisms research · Natural Language Processing Techniques

Methods: XLSR · mBART