Asymmetric Hierarchical Anchoring for Audio-Visual Joint Representation: Resolving Information Allocation Ambiguity for Robust Cross-Modal Generalization

Bixing Wu; Yuhong Zhao; Zongli Ye; Jiachen Lian; Xiangyu Yue; Gopala Anumanchipalli

arXiv:2602.03570·cs.LG·February 4, 2026

Open Access

TL;DR

This paper introduces Asymmetric Hierarchical Anchoring (AHA), a novel framework for audio-visual joint representation learning that addresses information allocation ambiguity and improves cross-modal generalization by enforcing directional semantic anchoring.

Contribution

The paper proposes AHA, which uses hierarchical semantic anchors and adversarial decoupling to enhance cross-modal transfer and reduce semantic leakage in audio-visual representations.

Findings

01. AHA outperforms symmetric baselines on AVE and AVVP benchmarks.

02. The framework improves semantic consistency and disentanglement in learned representations.

03. AHA demonstrates broader applicability in tasks like talking-face disentanglement.

Abstract

Audio-visual joint representation learning under Cross-Modal Generalization (CMG) aims to transfer knowledge from a labeled source modality to an unlabeled target modality through a unified discrete representation space. Existing symmetric frameworks often suffer from information allocation ambiguity, where the absence of structural inductive bias leads to semantic-specific leakage across modalities. We propose Asymmetric Hierarchical Anchoring (AHA), which enforces directional information allocation by designating a structured semantic anchor within a shared hierarchy. In our instantiation, we exploit the hierarchical discrete representations induced by audio Residual Vector Quantization (RVQ) to guide video feature distillation into a shared semantic space. To ensure representational purity, we replace fragile mutual information estimators with a GRL-based adversarial decoupler that…
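The abstract relies on the hierarchy that Residual Vector Quantization (RVQ) induces: each stage quantizes whatever the previous stages left unexplained, so early codes are coarse and later codes are refinements. The following is a minimal sketch of that mechanism only, not the paper's implementation. The function names, the two hand-picked toy codebooks, and the 2-D input are all assumptions made for illustration; a real audio RVQ would use learned codebooks over encoder features, and this excerpt does not say which RVQ layers serve as the semantic anchor.

```python
def nearest(codebook, vec):
    """Return the index of the codebook entry closest to vec (squared Euclidean)."""
    best_idx, best_dist = 0, float("inf")
    for idx, code in enumerate(codebook):
        dist = sum((c - v) ** 2 for c, v in zip(code, vec))
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx


def rvq_encode(vec, codebooks):
    """Quantize vec stage by stage; each stage encodes the previous residual."""
    residual = list(vec)
    indices = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        indices.append(idx)
        # Subtract the chosen code so the next stage sees only what is left.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return indices, residual


def rvq_decode(indices, codebooks):
    """Reconstruct a vector by summing the selected code from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Hypothetical toy codebooks: stage 0 is coarse, stage 1 is a fine correction.
COARSE = [[0.0, 0.0], [1.0, 1.0], [-1.0, -1.0]]
FINE = [[0.0, 0.0], [0.25, 0.0], [0.0, 0.25], [-0.25, 0.0], [0.0, -0.25]]
CODEBOOKS = [COARSE, FINE]

indices, residual = rvq_encode([1.2, 0.9], CODEBOOKS)
reconstruction = rvq_decode(indices, CODEBOOKS)
```

In this toy setup the first index alone already places the input near `[1.0, 1.0]`, and the second index only nudges it. That coarse-to-fine ordering is the property the abstract refers to as a hierarchical discrete representation.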

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Face recognition and analysis