Asymmetric Hierarchical Anchoring for Audio-Visual Joint Representation: Resolving Information Allocation Ambiguity for Robust Cross-Modal Generalization
Bixing Wu, Yuhong Zhao, Zongli Ye, Jiachen Lian, Xiangyu Yue, Gopala Anumanchipalli

TL;DR
This paper introduces Asymmetric Hierarchical Anchoring (AHA), a novel framework for audio-visual joint representation learning that addresses information allocation ambiguity and improves cross-modal generalization by enforcing directional semantic anchoring.
Contribution
The paper proposes AHA, which uses hierarchical semantic anchors and adversarial decoupling to enhance cross-modal transfer and reduce semantic leakage in audio-visual representations.
Findings
AHA outperforms symmetric baselines on AVE and AVVP benchmarks.
The framework improves semantic consistency and disentanglement in learned representations.
AHA also applies to other tasks, such as talking-face disentanglement.
Abstract
Audio-visual joint representation learning under Cross-Modal Generalization (CMG) aims to transfer knowledge from a labeled source modality to an unlabeled target modality through a unified discrete representation space. Existing symmetric frameworks often suffer from information allocation ambiguity, where the absence of structural inductive bias leads to semantic-specific leakage across modalities. We propose Asymmetric Hierarchical Anchoring (AHA), which enforces directional information allocation by designating a structured semantic anchor within a shared hierarchy. In our instantiation, we exploit the hierarchical discrete representations induced by audio Residual Vector Quantization (RVQ) to guide video feature distillation into a shared semantic space. To ensure representational purity, we replace fragile mutual information estimators with a GRL-based adversarial decoupler that…
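The hierarchical discrete representation that the abstract attributes to Residual Vector Quantization can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`nearest`, `rvq_encode`, `rvq_decode`) and the tiny hand-picked codebooks are assumptions made for illustration, whereas a real audio codec learns its codebooks from data. The sketch only shows the general RVQ idea, in which each level quantizes the residual left by the previous level, so early codes carry coarse information and later codes carry finer detail.

```python
# Toy Residual Vector Quantization (RVQ) in pure Python.
# All names and codebook values are hypothetical, chosen for illustration only.

def nearest(codebook, x):
    """Return the index of the codeword closest to x (squared Euclidean distance)."""
    best_i, best_d = 0, float("inf")
    for i, c in enumerate(codebook):
        d = sum((a - b) ** 2 for a, b in zip(x, c))
        if d < best_d:
            best_i, best_d = i, d
    return best_i

def rvq_encode(x, codebooks):
    """Quantize x level by level; each level encodes the previous level's residual."""
    residual = list(x)
    codes = []
    for cb in codebooks:
        idx = nearest(cb, residual)
        codes.append(idx)
        # Subtract the chosen codeword so the next level sees only what is left.
        residual = [r - c for r, c in zip(residual, cb[idx])]
    return codes, residual

def rvq_decode(codes, codebooks):
    """Reconstruct the vector by summing the selected codeword from every level."""
    out = [0.0] * len(codebooks[0][0])
    for idx, cb in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, cb[idx])]
    return out

# Level 0 is coarse, level 1 refines the residual (assumed toy values).
codebooks = [
    [[0.0, 0.0], [4.0, 4.0]],
    [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]],
]

x = [4.9, 4.1]
codes, residual = rvq_encode(x, codebooks)
reconstruction = rvq_decode(codes, codebooks)
print(codes, reconstruction)
```

In this toy setup the first code alone already places the input near `[4.0, 4.0]`, and the second code refines that estimate. That coarse-to-fine ordering is the kind of hierarchy the abstract describes using as a semantic anchor; how AHA selects and uses specific levels is not specified in the text above.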
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Face recognition and analysis
