FlexiAST: Flexibility is What AST Needs
Jiu Feng, Mehmet Hamza Erol, Joon Son Chung, Arda Senocak

TL;DR
This paper introduces FlexiAST, a training method that enables Audio Spectrogram Transformers to operate effectively across various patch sizes during inference without architectural modifications.
Contribution
The paper proposes a simple training procedure using random patch size selection and embedding resizing to make AST models flexible at inference time.
Findings
FlexiAST maintains its performance when the patch size is changed at inference.
FlexiAST performs on par with standard AST models on multiple datasets.
The method requires no architectural changes.
Abstract
The objective of this work is to give patch-size flexibility to Audio Spectrogram Transformers (AST). Recent ASTs have shown superior performance in various audio-based tasks. However, the performance of standard ASTs degrades drastically when they are evaluated with patch sizes different from the one used during training. As a result, AST models are typically re-trained to accommodate changes in patch size. To overcome this limitation, this paper proposes FlexiAST, a training procedure that gives standard AST models this flexibility without architectural changes, allowing them to work with various patch sizes at the inference stage. The proposed training approach simply uses random patch size selection and resizing of the patch and positional embedding weights. Our experiments show that FlexiAST gives similar performance to standard AST models while maintaining its evaluation…
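The training recipe described in the abstract (pick a random patch size at each step, then resize the patch and positional embedding weights to match) can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration, not the authors' implementation: the candidate patch sizes, the base patch size, the spectrogram shape, the helper names, and the use of plain bilinear interpolation as the resizing operator (the summary does not say which resizing method the paper uses). For brevity the weights are single-channel 2D grids; a real model would resize each channel of the embedding tensors.

```python
import random

# Hypothetical settings, chosen only for illustration.
PATCH_SIZES = [8, 16, 32]   # candidate patch sizes sampled during training
BASE_PATCH = 16             # patch size at which the base weights are stored
SPEC_SHAPE = (128, 1024)    # (mel bins, time frames) of the input spectrogram


def bilinear_resize(grid, new_h, new_w):
    """Resize a 2D list of floats to (new_h, new_w) by bilinear interpolation.

    Corner values of the input map to corner values of the output.
    """
    old_h, old_w = len(grid), len(grid[0])
    out = []
    for i in range(new_h):
        # Fractional source row for output row i.
        y = i * (old_h - 1) / (new_h - 1) if new_h > 1 else 0.0
        y0 = int(y)
        y1 = min(y0 + 1, old_h - 1)
        fy = y - y0
        row = []
        for j in range(new_w):
            # Fractional source column for output column j.
            x = j * (old_w - 1) / (new_w - 1) if new_w > 1 else 0.0
            x0 = int(x)
            x1 = min(x0 + 1, old_w - 1)
            fx = x - x0
            top = grid[y0][x0] * (1 - fx) + grid[y0][x1] * fx
            bottom = grid[y1][x0] * (1 - fx) + grid[y1][x1] * fx
            row.append(top * (1 - fy) + bottom * fy)
        out.append(row)
    return out


def sample_step(rng, base_kernel, base_pos):
    """Pick a random patch size and resize both weight grids to fit it."""
    p = rng.choice(PATCH_SIZES)
    # Patch embedding kernel: BASE_PATCH x BASE_PATCH -> p x p.
    kernel = bilinear_resize(base_kernel, p, p)
    # Positional embedding grid: one entry per patch position.
    grid_h, grid_w = SPEC_SHAPE[0] // p, SPEC_SHAPE[1] // p
    pos = bilinear_resize(base_pos, grid_h, grid_w)
    return p, kernel, pos


if __name__ == "__main__":
    rng = random.Random(0)
    base_kernel = [[float(r * BASE_PATCH + c) for c in range(BASE_PATCH)]
                   for r in range(BASE_PATCH)]
    base_pos = [[float(r + c) for c in range(SPEC_SHAPE[1] // BASE_PATCH)]
                for r in range(SPEC_SHAPE[0] // BASE_PATCH)]
    for step in range(3):
        p, kernel, pos = sample_step(rng, base_kernel, base_pos)
        print(step, p, (len(kernel), len(kernel[0])), (len(pos), len(pos[0])))
```

In a training loop, the resized kernel and positional grid would be used for the forward pass of that step while gradients flow back to the shared base weights, so no layer is added or changed; only the shapes of two existing weight tensors vary from step to step.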
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis
