MAVL: A Multilingual Audio-Video Lyrics Dataset for Animated Song Translation

Woohyun Cho; Youngmin Kim; Sunghyun Lee; Youngjae Yu

arXiv:2505.18614 · cs.CL · September 19, 2025

TL;DR

This paper introduces MAVL, a multilingual, multimodal dataset for animated song translation, and proposes SylAVL-CoT, a model that uses audio, video, and syllabic constraints to improve translation quality.

Contribution

The paper presents the first multilingual multimodal benchmark for animated song translation and a novel syllable-constrained model that enhances singability and contextual accuracy.

Findings

1. SylAVL-CoT outperforms text-only models in singability.
2. Multimodal approaches improve translation quality.
3. MAVL enables richer, more expressive lyrics translation.

Abstract

Lyrics translation requires both accurate semantic transfer and preservation of musical rhythm, syllabic structure, and poetic style. In animated musicals, the challenge intensifies because the lyrics must also align with visual and auditory cues. We introduce the Multilingual Audio-Video Lyrics Benchmark for Animated Song Translation (MAVL), the first multilingual, multimodal benchmark for singable lyrics translation. By integrating text, audio, and video, MAVL enables richer and more expressive translations than text-only approaches. Building on this, we propose the Syllable-Constrained Audio-Video LLM with Chain-of-Thought (SylAVL-CoT), which leverages audio-video cues and enforces syllabic constraints to produce natural-sounding lyrics. Experimental results demonstrate that SylAVL-CoT significantly outperforms text-based models in singability and contextual accuracy, emphasizing the value of multimodal,…
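
To make the idea of a syllabic constraint concrete, the following is a minimal sketch of how candidate translations could be compared against a source line by syllable count. It is not the SylAVL-CoT method: the vowel-group heuristic is a rough, English-only approximation, and the function names (`count_syllables`, `syllable_gap`, `pick_closest`) are hypothetical names introduced here for illustration.

```python
import re

# Assumption: one syllable per run of vowels. This is a crude English-only
# heuristic and will miscount words with silent vowels (e.g. "release").
VOWEL_GROUP = re.compile(r"[aeiouy]+", re.IGNORECASE)
WORD = re.compile(r"[A-Za-z']+")


def count_syllables(line: str) -> int:
    """Approximate the syllable count of a lyric line."""
    words = WORD.findall(line)
    # Every word contributes at least one syllable.
    return sum(max(1, len(VOWEL_GROUP.findall(word))) for word in words)


def syllable_gap(source_line: str, candidate_line: str) -> int:
    """Signed difference in approximate syllables (candidate minus source)."""
    return count_syllables(candidate_line) - count_syllables(source_line)


def pick_closest(source_line: str, candidates: list[str]) -> str:
    """Return the candidate whose syllable count is closest to the source."""
    return min(candidates, key=lambda c: abs(syllable_gap(source_line, c)))


if __name__ == "__main__":
    source = "Let it go"
    options = ["Release it now", "Set it free"]
    print(pick_closest(source, options))
```

A real system would need a per-language syllable counter, since a vowel-group rule does not transfer across the languages a multilingual benchmark covers.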

Code & Models

Repositories

k1064190/MAVL (official)

Datasets

Noename/MAVL
dataset · 61 downloads

Videos

MAVL: A Multilingual Audio-Video Lyrics Dataset for Animated Song Translation

Taxonomy

Topics: Music and Audio Processing · Human Motion and Animation

Methods: Multiscale Attention ViT with Late fusion