MAVEN: A Multi-Agent Framework for Multicultural Text-to-Video Generation

Shuowei Li; Yuming Zhao; Parth Bhalerao; Oana Ignat

arXiv:2605.16716 · cs.CV · May 19, 2026

TL;DR

MAVEN is a multi-agent framework that enhances cultural fidelity in text-to-video generation by decomposing prompts and systematically evaluating across diverse cultural scenarios.

Contribution

Introduces MAVEN, a novel multi-agent prompt refinement framework, and a comprehensive benchmark for evaluating cultural fidelity in T2V generation.

Findings

01. Multi-agent refinement improves cultural relevance.
02. Parallel specialization enhances visual quality and consistency.
03. Benchmark enables systematic evaluation of cultural fidelity.

Abstract

Text-to-video (T2V) generation has rapidly progressed in visual fidelity, yet its ability to faithfully represent multiple cultures within a single prompt remains underexplored. We introduce MAVEN, a multi-agent prompt refinement framework designed to improve cultural fidelity in both mono-cultural and cross-cultural T2V generation. MAVEN decomposes prompts into person, action, and location dimensions, handled by specialized agents operating in parallel or sequentially. To support systematic evaluation, we contribute a new benchmark of 243 culturally grounded prompts and 972 corresponding videos, spanning three cultures (Chinese, American, Romanian), three action categories, and both mono-cultural and cross-cultural scenarios. Evaluations combining CLIP-based metrics, VLM-as-judge assessments, and video-quality measures show that multi-agent refinement, particularly parallel…
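The parallel and sequential agent arrangements mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (make_agent, refine_parallel, refine_sequential) are hypothetical, and each "agent" is a plain stub that returns a fixed annotation where a real system would presumably query a language model. Only the three dimensions (person, action, location) come from the abstract.

```python
from concurrent.futures import ThreadPoolExecutor

# The three dimensions named in the abstract.
DIMENSIONS = ("person", "action", "location")


def make_agent(dimension):
    """Build a stand-in agent for one dimension (hypothetical stub).

    A real agent would likely call a language model; this one only
    returns a fixed annotation so the control flow can be inspected.
    """
    def agent(context, culture):
        # `context` is whatever prompt text this agent is shown.
        return f"[{dimension}: {culture} details]"
    return agent


AGENTS = {d: make_agent(d) for d in DIMENSIONS}


def refine_parallel(prompt, culture):
    """Every agent sees the original prompt; notes are merged at the end."""
    with ThreadPoolExecutor(max_workers=len(DIMENSIONS)) as pool:
        futures = {d: pool.submit(AGENTS[d], prompt, culture) for d in DIMENSIONS}
        notes = [futures[d].result() for d in DIMENSIONS]
    return " ".join([prompt, *notes])


def refine_sequential(prompt, culture):
    """Each agent sees the prompt as already refined by the previous agents."""
    refined = prompt
    for d in DIMENSIONS:
        refined = f"{refined} {AGENTS[d](refined, culture)}"
    return refined


if __name__ == "__main__":
    print(refine_parallel("A woman dances in a town square", "Romanian"))
    print(refine_sequential("A woman dances in a town square", "Romanian"))
```

Because the stub agents ignore their input, both modes produce the same string here. With agents that actually read their context, the two would differ: parallel agents work independently from the original prompt, while sequential agents can build on (or be biased by) earlier refinements.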

Code & Models

Repositories

AIM-SCU/CRAFT (GitHub)

Datasets

AIM-SCU/MAVEN_Multicultura_Text-to-Video_Generation
dataset· 5 dl
