Delayed Commitment for Representation Readiness in Stage-wise Audio-Visual Learning

Xinmeng Xu; Haoran Xie; S. Joe Qin; Lin Li; Xiaohui Tao; Fu Lee Wang

arXiv:2605.01673 · cs.SD · May 5, 2026

TL;DR

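This paper introduces DPC-Net, a framework for stage-wise audio-visual learning that estimates how ready each intermediate fused representation is to guide later fusion and corrects it where readiness is deficient, improving performance on speech separation, event localization, and speech recognition.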

Contribution

The paper formulates premature perceptual commitment as a readiness-deficiency problem and proposes DPC-Net, which localizes and corrects bottlenecks in how fused representations propagate across encoder stages in audio-visual tasks.

Findings

01. DPC-Net improves performance in speech separation, event localization, and speech recognition.

02. The method effectively localizes intervention-sensitive bottlenecks.

03. Readiness-guided correction enhances the quality of fused representations.

Abstract

Stage-wise audio-visual encoders propagate fused intermediate states across layers, making the formation of later representations depend on the readiness of earlier fusion states. Strong local audio-visual agreement provides useful correspondence evidence, yet a fused state also needs sufficient cross-layer and cross-modal support before it can reliably guide later fusion. This paper studies this issue through propagation-aware representation readiness and formulates premature perceptual commitment as a readiness-deficiency problem, where local plausibility, propagation influence, and support insufficiency jointly appear at an intermediate stage. We propose the Delayed Perceptual Commitment Network (DPC-Net), an encoder-level framework that estimates an observable readiness-deficiency surrogate, localizes the intervention-sensitive bottleneck, and applies support-aware correction with…
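
The three steps named in the abstract (estimate a readiness-deficiency surrogate, localize the bottleneck stage, apply a correction) can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not taken from the paper: the per-stage scores are made-up numbers, the product form of the surrogate is just one simple way to make three conditions "jointly appear", and the function names `readiness_deficiency`, `localize_bottleneck`, and `correct_state` are hypothetical. It does not reproduce DPC-Net.

```python
from typing import List, Sequence


def readiness_deficiency(
    plausibility: Sequence[float],
    influence: Sequence[float],
    support: Sequence[float],
) -> List[float]:
    """Hypothetical surrogate: large only when a stage looks locally plausible,
    strongly influences later stages, AND lacks support (all scores in [0, 1])."""
    return [p * g * (1.0 - s) for p, g, s in zip(plausibility, influence, support)]


def localize_bottleneck(deficiency: Sequence[float]) -> int:
    """Return the index of the stage with the largest surrogate value."""
    return max(range(len(deficiency)), key=lambda i: deficiency[i])


def correct_state(
    state: Sequence[float],
    support_state: Sequence[float],
    weight: float,
) -> List[float]:
    """Assumed correction: blend the fused state toward a support vector,
    with the blend weight given by the deficiency at that stage."""
    return [(1.0 - weight) * h + weight * c for h, c in zip(state, support_state)]


# Made-up scores for a three-stage encoder (illustrative only).
plausibility = [0.9, 0.8, 0.7]
influence = [0.2, 0.9, 0.5]
support = [0.9, 0.3, 0.8]

deficiency = readiness_deficiency(plausibility, influence, support)
stage = localize_bottleneck(deficiency)

# Toy two-dimensional fused state and support vector at the selected stage.
corrected = correct_state([1.0, 0.0], [0.0, 1.0], deficiency[stage])
print(stage, deficiency, corrected)
```

In this toy setup the middle stage is selected because it is the only one where plausibility and influence are high while support is low; a stage that is merely plausible, or merely influential, scores near zero under the product form.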
