Can Hierarchical Cross-Modal Fusion Predict Human Perception of AI Dubbed Content?

Ashwini Dasare; Nirmesh Shah; Ashishkumar Gudmalwar; Pankaj Wasnik

arXiv:2603.28717·eess.AS·April 27, 2026

TL;DR

This paper introduces a hierarchical multimodal model for evaluating AI-dubbed content, combining audio, video, and text cues to predict human perception efficiently.

Contribution

The work presents a novel multimodal architecture with parameter-efficient fine-tuning and proxy MOS for scalable, perceptually aligned evaluation of AI dubbing.

Findings

01. Achieves a Pearson correlation coefficient (PCC) above 0.75 with human perceptual ratings

02. Trained on 12k Hindi-English dubbed clips

03. Uses proxy MOS derived from objective metrics
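The two quantitative ideas in these findings, a proxy MOS built from objective metrics and PCC as the agreement measure, can be illustrated with a small sketch. This is not the authors' pipeline: the metric names (`sync`, `intelligibility`, `speaker_sim`), the weights, the assumption that each metric is already normalised to [0, 1], and the linear mapping onto a 1 to 5 scale are all hypothetical choices made for illustration.

```python
import math


def proxy_mos(metrics, weights):
    """Weighted average of objective metrics, mapped onto a 1-5 MOS-like scale.

    Assumption: every metric value is already normalised to [0, 1],
    where 1 is best. The metric names and weights are hypothetical.
    """
    total_weight = sum(weights.values())
    score01 = sum(weights[name] * metrics[name] for name in weights) / total_weight
    return 1.0 + 4.0 * score01


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


# Hypothetical weights and per-clip metric values, for illustration only.
weights = {"sync": 0.5, "intelligibility": 0.3, "speaker_sim": 0.2}
clips = [
    {"sync": 0.9, "intelligibility": 0.8, "speaker_sim": 0.7},
    {"sync": 0.4, "intelligibility": 0.6, "speaker_sim": 0.5},
    {"sync": 0.2, "intelligibility": 0.3, "speaker_sim": 0.4},
]
human_mos = [4.5, 3.0, 2.0]  # made-up listener ratings for the same clips

proxy_scores = [proxy_mos(clip, weights) for clip in clips]
print("proxy MOS:", [round(s, 2) for s in proxy_scores])
print("PCC vs. human MOS:", round(pearson(proxy_scores, human_mos), 3))
```

A PCC near 1 means the proxy ranks and spaces clips much like human listeners do; the reported threshold of 0.75 refers to the paper's own data, not to these toy numbers.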

Abstract

Evaluating AI-generated dubbed content is inherently multi-dimensional, shaped by synchronization, intelligibility, speaker consistency, emotional alignment, and semantic context. Human Mean Opinion Scores (MOS) remain the gold standard but are costly and impractical at scale. We present a hierarchical multimodal architecture for perceptually meaningful dubbing evaluation, integrating complementary cues from audio, video, and text. The model captures fine-grained features such as speaker identity, prosody, and content from audio; facial expressions and scene-level cues from video; and semantic context from text. These features are progressively fused through intra- and inter-modal layers. Lightweight LoRA adapters enable parameter-efficient fine-tuning across modalities. To overcome limited subjective labels, we derive proxy MOS by aggregating objective metrics with weights optimized via active…
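To make the "lightweight LoRA adapters" idea concrete, here is a minimal, dependency-free sketch of a low-rank adapter wrapped around a frozen linear layer. The abstract does not say which layers are adapted or what rank is used, so the class name `LoRALinear`, the rank, the scaling factor `alpha`, and the toy weight matrix are all assumptions for illustration. The sketch follows the standard LoRA formulation: the output is `W x + (alpha / rank) * B (A x)`, with `B` initialised to zero so the adapted layer starts out identical to the frozen one.

```python
import random


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


class LoRALinear:
    """Frozen linear layer plus a trainable low-rank update (illustrative only)."""

    def __init__(self, weight, rank=2, alpha=4.0, seed=0):
        rng = random.Random(seed)
        self.weight = weight  # frozen pretrained weight, shape d_out x d_in
        d_out, d_in = len(weight), len(weight[0])
        self.scale = alpha / rank
        # A gets a small random init, B starts at zero, so the initial update is zero.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]

    def forward(self, x):
        base = matvec(self.weight, x)
        update = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * u for b, u in zip(base, update)]

    def trainable_params(self):
        # Only A and B would be updated during fine-tuning.
        return sum(len(row) for row in self.A) + sum(len(row) for row in self.B)


# Toy 2 x 3 frozen weight; a real encoder layer would be far larger.
frozen = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
layer = LoRALinear(frozen, rank=1)
print(layer.forward([1.0, 2.0, 3.0]))  # equals the frozen layer's output at init
print(layer.trainable_params(), "trainable vs", 2 * 3, "frozen parameters")
```

The saving only shows up at realistic sizes: for a `d_out x d_in` layer the adapter trains `rank * (d_in + d_out)` values instead of `d_in * d_out`, which is what makes fine-tuning several modality encoders at once affordable.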
