AG-REPA: Causal Layer Selection for Representation Alignment in Audio Flow Matching
Pengfei Zhang, Tianxin Xie, Minghao Yang, Li Liu

TL;DR
This paper introduces AG-REPA, a causal layer selection method for better representation alignment in audio flow matching, improving generative model training by focusing on causally influential layers.
Contribution
The paper proposes a novel causal layer selection strategy using attribution-guided analysis and a forward-only gate ablation, enhancing alignment effectiveness in audio flow models.
Findings
AG-REPA outperforms baseline methods across speech and audio datasets.
Causally dominant layers are more effective for alignment than representationally rich layers.
Alignment to causally influential layers improves generative performance.
Abstract
REPresentation Alignment (REPA) improves the training of generative flow models by aligning intermediate hidden states with pretrained teacher features, but its effectiveness in token-conditioned audio Flow Matching depends critically on the choice of supervised layers, which is typically made heuristically based on depth. In this work, we introduce Attribution-Guided REPresentation Alignment (AG-REPA), a novel causal layer selection strategy for representation alignment in audio Flow Matching. First, we find that the layers that best store semantic/acoustic information (high teacher-space similarity) are not necessarily the layers that contribute most to the velocity field that drives generation, a phenomenon we call Store-Contribute Dissociation (SCD). To turn this insight into actionable training guidance, we propose a forward-only gate ablation (FoG-A) that quantifies each layer's…
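The distinction between "storing" and "contributing" can be illustrated with a toy example. The sketch below is not the paper's FoG-A procedure (the abstract is truncated before it is specified); it assumes a hypothetical residual stack in which each layer's update is multiplied by a scalar gate, scores "store" as the cosine similarity between a layer's hidden state and a made-up teacher vector, and scores "contribute" as the L2 change in the final output when that layer's gate is set to zero in a forward pass. All function names, the toy layers, and the scoring choices are assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def forward(x, layers, gates):
    """Toy residual stack (assumed form): h <- h + gate * layer(h).

    Returns the final state and the hidden state after every layer.
    """
    h = list(x)
    hiddens = []
    for layer, gate in zip(layers, gates):
        update = layer(h)
        h = [hi + gate * ui for hi, ui in zip(h, update)]
        hiddens.append(list(h))
    return h, hiddens


def store_scores(x, layers, teacher):
    """'Store' proxy: similarity of each hidden state to the teacher vector."""
    _, hiddens = forward(x, layers, [1.0] * len(layers))
    return [cosine(h, teacher) for h in hiddens]


def gate_ablation_scores(x, layers):
    """'Contribute' proxy: output change when one gate is zeroed.

    Uses forward passes only; no gradients are computed.
    """
    n = len(layers)
    full, _ = forward(x, layers, [1.0] * n)
    scores = []
    for i in range(n):
        gates = [1.0] * n
        gates[i] = 0.0  # switch off layer i's residual update
        ablated, _ = forward(x, layers, gates)
        scores.append(math.dist(full, ablated))
    return scores


# Hypothetical 2-D layers chosen so the two rankings disagree.
layers = [
    lambda h: [0.1 * v for v in h],        # small update, stays teacher-like
    lambda h: [2.0 * h[1], -2.0 * h[0]],   # large update, drives the output
    lambda h: [0.0 for _ in h],            # no-op layer
]
x = [1.0, 0.0]
teacher = [1.0, 0.0]  # made-up teacher feature

store = store_scores(x, layers, teacher)
contribute = gate_ablation_scores(x, layers)
best_store = max(range(len(layers)), key=lambda i: store[i])
best_contribute = max(range(len(layers)), key=lambda i: contribute[i])
print("store:", store)
print("contribute:", contribute)
print("best store layer:", best_store, "best contribute layer:", best_contribute)
```

In this toy setup the layer whose hidden state looks most like the teacher is not the layer whose removal changes the output most, which is the kind of mismatch the abstract describes as Store-Contribute Dissociation. A real model would replace the toy layers with transformer blocks and the final state with the predicted velocity field.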
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing
