Hierarchical Semantic Correlation-Aware Masked Autoencoder for Unsupervised Audio-Visual Representation Learning
Donghuo Zeng, Hao Niu, Masato Taya

TL;DR
HSC-MAE is a novel unsupervised audio-visual learning framework that enforces semantic consistency across multiple levels of representation, improving alignment and discriminative features without labeled data.
Contribution
It introduces a hierarchical, multi-level correlation-aware autoencoder framework combining global, local, and sample-level semantic constraints for better multimodal embedding alignment.
Findings
Significant mAP improvements on the AVE and VEGAS datasets.
Experiments support the robustness of the learned audio-visual representations.
Enforcing semantic correlation at multiple levels proves effective.
Abstract
Learning aligned multimodal embeddings from weakly paired, label-free corpora is challenging: pipelines often provide only pre-extracted features, clips contain multiple events, and co-occurrences can be spurious. We propose HSC-MAE (Hierarchical Semantic Correlation-Aware Masked Autoencoder), a dual-path teacher-student framework that enforces semantic consistency across three complementary levels of representation, from coarse to fine: (i) global-level canonical-geometry correlation via DCCA, which aligns audio and visual embeddings within a shared modality-invariant subspace; (ii) local-level neighborhood-semantics correlation via teacher-mined soft top-k affinities, which preserves multi-positive relational structure among semantically similar instances; and (iii) sample-level conditional-sufficiency correlation via masked autoencoding, which ensures individual embeddings retain…
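To make the local-level idea in (ii) concrete, the following is a minimal NumPy sketch of mining soft top-k affinities from teacher embeddings. The abstract does not give the affinity formula, so everything specific here is an assumption: cosine similarity as the neighbor metric, a temperature-scaled softmax over the k nearest neighbors, and the names `soft_topk_affinity`, `k`, and `temperature` are all hypothetical and may differ from the authors' method.

```python
import numpy as np


def soft_topk_affinity(teacher_emb, k=3, temperature=0.1):
    """Return a row-stochastic affinity matrix over each sample's top-k neighbors.

    Assumptions (not from the paper): cosine similarity, softmax weighting,
    self-matches excluded, non-zero embeddings, and k < number of samples.
    """
    # L2-normalize so the dot product equals cosine similarity.
    z = teacher_emb / np.linalg.norm(teacher_emb, axis=1, keepdims=True)
    sim = z @ z.T
    np.fill_diagonal(sim, -np.inf)  # a sample is not its own neighbor

    # Column indices of the k most similar samples in each row (unordered).
    topk_idx = np.argpartition(-sim, k - 1, axis=1)[:, :k]
    rows = np.arange(sim.shape[0])[:, None]

    # Keep only top-k similarities; everything else gets zero weight.
    logits = np.full_like(sim, -np.inf)
    logits[rows, topk_idx] = sim[rows, topk_idx] / temperature
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)


# Toy usage: 8 random "teacher" embeddings of dimension 16.
rng = np.random.default_rng(0)
affinity = soft_topk_affinity(rng.normal(size=(8, 16)), k=3)
```

Each row of `affinity` spreads its weight over several neighbors rather than a single positive, which is one way to read the "multi-positive relational structure" the abstract mentions. A student model could then be trained to match these soft targets, though the actual loss is not stated in the text above.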
