AG-REPA: Causal Layer Selection for Representation Alignment in Audio Flow Matching

Pengfei Zhang; Tianxin Xie; Minghao Yang; Li Liu

arXiv:2603.01006 · cs.SD · March 3, 2026

Open Access

TL;DR

This paper introduces AG-REPA, a method that chooses which layers to supervise for representation alignment in audio Flow Matching according to their causal contribution to the velocity field rather than by depth-based heuristics, which improves generative model training.

Contribution

The paper proposes a causal layer selection strategy that uses attribution-guided analysis and a forward-only gate ablation (FoG-A) to decide which layers receive alignment supervision in audio Flow Matching models.
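The general idea behind a forward-only gate ablation can be illustrated with a toy example. The sketch below is not the paper's FoG-A procedure, whose details are not given on this page. It assumes a hypothetical residual stack in which each layer's branch is multiplied by a scalar gate, closes one gate at a time, and scores the layer by the relative change in the final output (standing in for the velocity field). Only forward passes are used and no gradients are computed. The names `forward` and `gate_ablation_scores`, the tanh branch, and the relative-norm score are all assumptions made for illustration.

```python
import numpy as np


def forward(x, weights, gates):
    """Toy residual stack: h <- h + gate * tanh(h @ W) for each layer.

    This network is a hypothetical stand-in, not the paper's architecture.
    """
    h = x
    for W, g in zip(weights, gates):
        h = h + g * np.tanh(h @ W)
    return h


def gate_ablation_scores(x, weights):
    """Score each layer by how much the output changes when its gate is closed.

    Uses forward passes only. The score is the relative L2 change of the
    output, an assumed metric chosen for this illustration.
    """
    num_layers = len(weights)
    full = forward(x, weights, np.ones(num_layers))
    scores = np.zeros(num_layers)
    for layer in range(num_layers):
        gates = np.ones(num_layers)
        gates[layer] = 0.0  # ablate this layer's residual branch
        ablated = forward(x, weights, gates)
        scores[layer] = np.linalg.norm(full - ablated) / (np.linalg.norm(full) + 1e-8)
    return scores
```

Under these assumptions, a layer whose branch contributes nothing to the output receives a score of zero, and layers could be ranked by score to pick candidates for alignment supervision.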

Findings

01 AG-REPA outperforms baseline methods across speech and audio datasets.

02 Causally dominant layers are more effective for alignment than representationally rich layers.

03 Alignment to causally influential layers improves generative performance.

Abstract

REPresentation Alignment (REPA) improves the training of generative flow models by aligning intermediate hidden states with pretrained teacher features, but its effectiveness in token-conditioned audio Flow Matching depends critically on which layers are supervised, a choice typically made heuristically by depth. In this work, we introduce Attribution-Guided REPresentation Alignment (AG-REPA), a novel causal layer selection strategy for representation alignment in audio Flow Matching. First, we find that the layers that best store semantic/acoustic information (high teacher-space similarity) are not necessarily the layers that contribute most to the velocity field that drives generation; we call this phenomenon Store-Contribute Dissociation (SCD). To turn this insight into actionable training guidance, we propose a forward-only gate ablation (FoG-A) that quantifies each layer's…
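To make "aligning intermediate hidden states with pretrained teacher features" concrete, here is a minimal sketch of an alignment loss for a single supervised layer. It assumes a linear projection head and a negative mean cosine similarity objective. The function name `alignment_loss`, the `proj` matrix, and the frame-wise pairing of student and teacher features are assumptions for illustration and may differ from the formulation used in the paper.

```python
import numpy as np


def alignment_loss(hidden, teacher, proj):
    """Negative mean cosine similarity between projected hidden states and teacher features.

    hidden:  (T, d_student) hidden states from one supervised layer
    teacher: (T, d_teacher) frozen teacher features for the same T frames
    proj:    (d_student, d_teacher) hypothetical linear projection head
    """
    z = hidden @ proj
    # Normalize both sides so the dot product is a cosine similarity.
    z = z / (np.linalg.norm(z, axis=-1, keepdims=True) + 1e-8)
    t = teacher / (np.linalg.norm(teacher, axis=-1, keepdims=True) + 1e-8)
    return float(-(z * t).sum(axis=-1).mean())
```

The loss lies between -1 (projected states point in the same direction as the teacher features) and +1 (opposite directions). In a setup like the one the abstract describes, a term of this kind would be added to the Flow Matching objective only for the selected layers.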

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing