Reconstruction-Driven Multimodal Representation Learning for Automated Media Understanding
Yassir Benhammou, Suman Kalyan, Sujay Kumar

TL;DR
This paper introduces a reconstruction-driven multimodal autoencoder that learns unified representations across text, audio, and visual data to improve automated media understanding and metadata generation.
Contribution
It proposes a novel multimodal autoencoder trained on the LUMA dataset to learn modality-invariant semantic structures without large paired datasets.
Findings
Significant improvements in clustering and alignment metrics (Silhouette, ARI, NMI).
Effective cross-modal retrieval and semantic clustering.
Enhanced automation and searchability in broadcast media workflows.
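To make one of the clustering metrics named above concrete, here is a minimal pure-Python sketch of the Adjusted Rand Index (ARI), which scores agreement between two labelings of the same items while correcting for chance. The function name `adjusted_rand_index` and the toy labels are illustrative assumptions; the summary does not describe the authors' evaluation code.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index between two flat labelings of the same items."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("labelings must have the same length")
    n = len(labels_true)

    # Contingency counts: how many items share each (true, predicted) label pair.
    pair_counts = Counter(zip(labels_true, labels_pred))
    true_counts = Counter(labels_true)
    pred_counts = Counter(labels_pred)

    # Number of item pairs that fall together under each grouping.
    sum_pairs = sum(comb(c, 2) for c in pair_counts.values())
    sum_true = sum(comb(c, 2) for c in true_counts.values())
    sum_pred = sum(comb(c, 2) for c in pred_counts.values())
    total_pairs = comb(n, 2)

    if total_pairs == 0:
        return 1.0  # fewer than two items: the labelings trivially agree

    expected = sum_true * sum_pred / total_pairs  # agreement expected by chance
    max_index = 0.5 * (sum_true + sum_pred)
    if max_index == expected:
        return 1.0  # degenerate case, e.g. one cluster in both labelings
    return (sum_pairs - expected) / (max_index - expected)


# Toy labels (hypothetical): same partition with swapped cluster ids,
# and a partition that cuts across the reference clusters.
perfect = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

ARI ignores the cluster ids themselves, so the relabelled partition scores 1.0, while the crossed partition scores below zero (about -0.5), meaning worse than chance agreement.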
Abstract
Broadcast and media organizations increasingly rely on artificial intelligence to automate the labor-intensive processes of content indexing, tagging, and metadata generation. However, existing AI systems typically operate on a single modality (video, audio, or text), which limits their understanding of complex, cross-modal relationships in broadcast material. In this work, we propose a Multimodal Autoencoder (MMAE) that learns unified representations across text, audio, and visual data, enabling end-to-end automation of metadata extraction and semantic clustering. The model is trained on the recently introduced LUMA dataset, a fully aligned benchmark of multimodal triplets representative of real-world media content. By minimizing joint reconstruction losses across modalities, the MMAE discovers modality-invariant semantic structures without relying on large paired or contrastive…
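The idea of a joint reconstruction loss can be sketched as follows. This is a minimal illustration, not the authors' architecture: it assumes linear encoders and decoders, a shared latent formed by averaging the per-modality encodings, and a mean-squared-error loss summed over modalities. The feature sizes, latent size, and names such as `init_params` and `joint_reconstruction_loss` are hypothetical.

```python
import numpy as np

# Hypothetical feature sizes per modality and shared latent size (not from the paper).
DIMS = {"text": 16, "audio": 12, "visual": 20}
LATENT = 8


def init_params(dims, latent, rng):
    """Create one linear encoder and one linear decoder per modality."""
    return {
        name: {
            "enc": rng.normal(0.0, 0.1, size=(d, latent)),
            "dec": rng.normal(0.0, 0.1, size=(latent, d)),
        }
        for name, d in dims.items()
    }


def joint_reconstruction_loss(params, batch):
    """Encode each modality, fuse into one latent, decode all, and sum the MSEs."""
    # Assumption: fusion by simple averaging of the per-modality encodings.
    latents = [batch[m] @ params[m]["enc"] for m in batch]
    z = np.mean(latents, axis=0)  # shared representation, shape (batch, latent)

    # Every modality is reconstructed from the same shared latent.
    per_modality = {
        m: float(np.mean((z @ params[m]["dec"] - batch[m]) ** 2)) for m in batch
    }
    return sum(per_modality.values()), per_modality, z


rng = np.random.default_rng(0)
params = init_params(DIMS, LATENT, rng)
# Random stand-ins for an aligned (text, audio, visual) batch of 4 triplets.
batch = {name: rng.normal(size=(4, d)) for name, d in DIMS.items()}
total, per_modality, z = joint_reconstruction_loss(params, batch)
```

Because all three decoders read from the same latent `z`, lowering the summed loss pushes `z` to carry information that is useful for every modality at once, which is the intuition behind learning a shared representation without a contrastive objective.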
Taxonomy
Topics: Video Analysis and Summarization · Multimodal Machine Learning Applications · Music and Audio Processing
