Reconstruction-Driven Multimodal Representation Learning for Automated Media Understanding

Yassir Benhammou; Suman Kalyan; Sujay Kumar

arXiv:2511.17596 · cs.CV · November 25, 2025

TL;DR

This paper introduces a reconstruction-driven multimodal autoencoder that learns unified representations across text, audio, and visual data to improve automated media understanding and metadata generation.

Contribution

The paper proposes a multimodal autoencoder (MMAE), trained on the LUMA dataset, that learns modality-invariant semantic structures without relying on large paired datasets.

Findings

01

Significant improvements in clustering and alignment metrics (Silhouette, ARI, NMI).

02

Effective cross-modal retrieval and semantic clustering.

03

Enhanced automation and searchability in broadcast media workflows.
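
The first finding names ARI and NMI without defining them. As a reading aid, here is a minimal pure-Python sketch of the standard definitions of both scores for two flat labelings of the same items. It is not the paper's evaluation code: the function names are illustrative, the NMI normalization (arithmetic mean of the two entropies) is one common convention that the paper may or may not use, and the toy labels are made up. Silhouette is omitted because it needs the embeddings themselves, not just labels.

```python
from collections import Counter
from math import comb, log


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index: pair-counting agreement, corrected for chance."""
    n = len(labels_true)
    if n < 2:
        return 1.0  # no pairs to disagree on
    pair_counts = Counter(zip(labels_true, labels_pred))
    sum_pairs = sum(comb(c, 2) for c in pair_counts.values())
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_true * sum_pred / comb(n, 2)
    max_index = 0.5 * (sum_true + sum_pred)
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both labelings use a single cluster
    return (sum_pairs - expected) / (max_index - expected)


def normalized_mutual_info(labels_true, labels_pred):
    """NMI with arithmetic-mean normalization (an assumed convention)."""
    n = len(labels_true)
    pair_counts = Counter(zip(labels_true, labels_pred))
    true_counts = Counter(labels_true)
    pred_counts = Counter(labels_pred)
    mutual_info = sum(
        (c / n) * log(c * n / (true_counts[t] * pred_counts[p]))
        for (t, p), c in pair_counts.items()
    )
    h_true = -sum((c / n) * log(c / n) for c in true_counts.values())
    h_pred = -sum((c / n) * log(c / n) for c in pred_counts.values())
    denom = 0.5 * (h_true + h_pred)
    if denom == 0:
        return 1.0  # both labelings are a single cluster
    return mutual_info / denom


# Toy example (made-up labels): cluster ids are permuted but the grouping matches.
classes = [0, 0, 1, 1, 2, 2]
clusters = [1, 1, 0, 0, 2, 2]
ari = adjusted_rand_index(classes, clusters)
nmi = normalized_mutual_info(classes, clusters)
```

Both scores ignore how cluster ids are named and compare only the groupings, which is why they suit comparing learned clusters against class labels.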

Abstract

Broadcast and media organizations increasingly rely on artificial intelligence to automate the labor-intensive processes of content indexing, tagging, and metadata generation. However, existing AI systems typically operate on a single modality, such as video, audio, or text, which limits their understanding of complex, cross-modal relationships in broadcast material. In this work, we propose a Multimodal Autoencoder (MMAE) that learns unified representations across text, audio, and visual data, enabling end-to-end automation of metadata extraction and semantic clustering. The model is trained on the recently introduced LUMA dataset, a fully aligned benchmark of multimodal triplets representative of real-world media content. By minimizing joint reconstruction losses across modalities, the MMAE discovers modality-invariant semantic structures without relying on large paired or contrastive…
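
To make the idea of a joint reconstruction loss concrete, the following toy sketch sums per-modality reconstruction errors computed from one shared latent code. Everything in it is an assumption made for illustration and not taken from the paper: the class name `TinyMultimodalAutoencoder`, the bias-free linear encoders and decoders, mean fusion of the per-modality codes, mean squared error, equal loss weights, and the feature dimensions. Training (gradient updates) is deliberately left out.

```python
import random


def linear(weights, vector):
    """Bias-free dense layer: one output value per weight row."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


def random_matrix(rows, cols, rng):
    """Small random weights; a stand-in for learned parameters."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def mse(predicted, target):
    """Mean squared error between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)


class TinyMultimodalAutoencoder:
    """Hypothetical toy model: linear encoders, mean fusion, linear decoders."""

    def __init__(self, input_dims, latent_dim, seed=0):
        rng = random.Random(seed)
        self.encoders = {m: random_matrix(latent_dim, d, rng) for m, d in input_dims.items()}
        self.decoders = {m: random_matrix(d, latent_dim, rng) for m, d in input_dims.items()}

    def encode(self, sample):
        """Map each modality to the latent space, then average into one shared code."""
        codes = [linear(self.encoders[m], x) for m, x in sample.items()]
        return [sum(values) / len(codes) for values in zip(*codes)]

    def joint_reconstruction_loss(self, sample, weights=None):
        """Weighted sum of per-modality reconstruction errors from the shared code."""
        latent = self.encode(sample)
        weights = weights or {m: 1.0 for m in sample}
        per_modality = {m: mse(linear(self.decoders[m], latent), x) for m, x in sample.items()}
        total = sum(weights[m] * loss for m, loss in per_modality.items())
        return total, per_modality


# Made-up feature sizes for one aligned (text, audio, image) triplet.
dims = {"text": 8, "audio": 6, "image": 10}
model = TinyMultimodalAutoencoder(dims, latent_dim=4)
data_rng = random.Random(1)
sample = {m: [data_rng.gauss(0, 1) for _ in range(d)] for m, d in dims.items()}
total_loss, per_modality_loss = model.joint_reconstruction_loss(sample)
```

Because every decoder reads the same latent vector, lowering the summed loss would push that vector to carry information useful for all three modalities. That is the intuition behind reconstruction-driven alignment in this sketch; the paper's actual architecture and loss may differ.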

Taxonomy

Topics: Video Analysis and Summarization · Multimodal Machine Learning Applications · Music and Audio Processing