Hierarchical Semantic Correlation-Aware Masked Autoencoder for Unsupervised Audio-Visual Representation Learning

Donghuo Zeng; Hao Niu; Masato Taya

arXiv:2604.04229 · cs.MM · April 7, 2026


TL;DR

HSC-MAE is an unsupervised audio-visual learning framework that enforces semantic consistency across multiple levels of representation, improving cross-modal alignment and feature discriminability without labeled data.

Contribution

The paper introduces a hierarchical correlation-aware masked autoencoder framework that combines global-, local-, and sample-level semantic constraints to better align multimodal embeddings.

Findings

01. Significant mAP improvements on AVE and VEGAS datasets.

02. Robust audio-visual representations validated by experiments.

03. Effective multi-level semantic correlation enforcement.

Abstract

Learning aligned multimodal embeddings from weakly paired, label-free corpora is challenging: pipelines often provide only pre-extracted features, clips contain multiple events, and spurious co-occurrences are common. We propose HSC-MAE (Hierarchical Semantic Correlation-Aware Masked Autoencoder), a dual-path teacher-student framework that enforces semantic consistency across three complementary levels of representation, from coarse to fine: (i) global-level canonical-geometry correlation via DCCA, which aligns audio and visual embeddings within a shared modality-invariant subspace; (ii) local-level neighborhood-semantics correlation via teacher-mined soft top-k affinities, which preserves multi-positive relational structure among semantically similar instances; and (iii) sample-level conditional-sufficiency correlation via masked autoencoding, which ensures individual embeddings retain…
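
To make level (ii) more concrete, the following is a minimal NumPy sketch of how teacher-mined soft top-k affinities could serve as multi-positive targets. It is not the authors' implementation: the function names `soft_topk_affinity` and `neighborhood_loss`, the use of cosine similarity, the temperature value, the exclusion of each sample's own index, and the cross-entropy form of the loss are all assumptions made for illustration.

```python
import numpy as np


def soft_topk_affinity(teacher_emb, k=3, temperature=0.1):
    """Hypothetical helper: row-stochastic affinities over each sample's
    top-k nearest neighbours in the teacher embedding space."""
    # L2-normalise so the dot product equals cosine similarity.
    z = teacher_emb / np.linalg.norm(teacher_emb, axis=1, keepdims=True)
    sim = z @ z.T
    np.fill_diagonal(sim, -np.inf)  # assumption: a sample is not its own neighbour
    n = sim.shape[0]
    # Column indices of the k most similar neighbours per row (unordered).
    topk_idx = np.argpartition(-sim, k - 1, axis=1)[:, :k]
    rows = np.arange(n)[:, None]
    logits = np.full_like(sim, -np.inf)
    logits[rows, topk_idx] = sim[rows, topk_idx] / temperature
    # Softmax over the kept entries; everything else receives weight 0.
    logits -= logits.max(axis=1, keepdims=True)
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)


def neighborhood_loss(student_audio, student_visual, targets, temperature=0.1):
    """Hypothetical loss: cross-entropy between the teacher's soft targets
    and the student's audio-to-visual similarity distribution."""
    a = student_audio / np.linalg.norm(student_audio, axis=1, keepdims=True)
    v = student_visual / np.linalg.norm(student_visual, axis=1, keepdims=True)
    logits = (a @ v.T) / temperature
    logits -= logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-(targets * log_prob).sum(axis=1).mean())


# Usage with random stand-ins for pre-extracted features.
rng = np.random.default_rng(0)
teacher = rng.normal(size=(6, 8))
targets = soft_topk_affinity(teacher, k=2)
loss = neighborhood_loss(rng.normal(size=(6, 8)), rng.normal(size=(6, 8)), targets)
```

Each row of `targets` spreads probability mass over several neighbours rather than a single positive, which is one way to read the "multi-positive relational structure" the abstract describes. The sketch requires `k` to be smaller than the batch size.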
