AMAuT: A Flexible and Efficient Multiview Audio Transformer Framework Trained from Scratch

Weichuang Shao, Iman Yi Liao, Tomas Henrique Bode Maul, and Tissa Chandesa

arXiv:2510.19368 · cs.SD · November 25, 2025

TL;DR

AMAuT is an audio transformer framework trained from scratch that accepts arbitrary input lengths and sample rates, reaching high accuracy efficiently without relying on pre-trained models.

Contribution

The paper introduces a multiview learning framework with a one-dimensional CNN bottleneck and dual CLS + TAL tokens, enabling flexible and efficient audio classification trained from scratch.

Findings

01. Achieves up to 99.8% accuracy on multiple benchmarks.

02. Uses less than 3% of the GPU hours required by pre-trained models.

03. Supports arbitrary sample rates and audio lengths.

Abstract

Recent foundation models such as SSAST, EAT, HuBERT, Qwen-Audio, and Audio Flamingo achieve top-tier results across standard audio benchmarks but are limited to fixed input sample rates and durations, which hinders their reusability. This paper introduces the Augmentation-driven Multiview Audio Transformer (AMAuT), a training-from-scratch framework that eliminates the dependency on pre-trained weights while supporting arbitrary sample rates and audio lengths. AMAuT integrates four key components: (1) augmentation-driven multiview learning for robustness, (2) a conv1 + conv7 + conv1 one-dimensional CNN bottleneck for stable temporal encoding, (3) dual CLS + TAL tokens for bidirectional context representation, and (4) test-time adaptation/augmentation (TTA^2) to improve inference reliability. Experiments on five public benchmarks, AudioMNIST, SpeechCommands V1 & V2, VocalSound, and CochlScene, show that…
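To make components (2) and (3) more concrete, here is a minimal pure-Python sketch of how a conv1 + conv7 + conv1 bottleneck followed by dual tokens might handle inputs of any length. It is not the authors' implementation. Everything beyond the kernel sizes 1, 7, 1 is an assumption made for illustration: stride 1, zero "same" padding, ReLU between layers, the channel widths, the random weights, placing CLS first and TAL last, and all function names (`conv1d_same`, `make_bottleneck`, `to_tokens`).

```python
import random


def conv1d_same(x, weight, bias):
    """1-D convolution (cross-correlation, as in deep learning libraries)
    with stride 1 and zero 'same' padding, so the length is preserved.

    x:      list of in_ch channels, each a list of T floats
    weight: nested lists shaped [out_ch][in_ch][k], with k odd
    bias:   list of out_ch floats
    """
    T = len(x[0])
    pad = len(weight[0][0]) // 2
    out = []
    for o, w_o in enumerate(weight):
        row = []
        for t in range(T):
            acc = bias[o]
            for c, w_oc in enumerate(w_o):
                for j, w in enumerate(w_oc):
                    s = t + j - pad
                    if 0 <= s < T:  # positions outside the signal count as zero
                        acc += w * x[c][s]
            row.append(acc)
        out.append(row)
    return out


def relu(x):
    return [[max(0.0, v) for v in ch] for ch in x]


def init_layer(out_ch, in_ch, k, rng):
    """Small random weights and zero bias (placeholder, not trained values)."""
    w = [[[rng.uniform(-0.1, 0.1) for _ in range(k)]
          for _ in range(in_ch)] for _ in range(out_ch)]
    return w, [0.0] * out_ch


def make_bottleneck(in_ch, mid_ch, out_ch, seed=0):
    """Kernel sizes 1, 7, 1; the channel widths are assumed for this sketch."""
    rng = random.Random(seed)
    return [init_layer(mid_ch, in_ch, 1, rng),
            init_layer(mid_ch, mid_ch, 7, rng),
            init_layer(out_ch, mid_ch, 1, rng)]


def bottleneck(x, layers):
    for i, (w, b) in enumerate(layers):
        x = conv1d_same(x, w, b)
        if i < len(layers) - 1:  # assumed: ReLU between layers, none at the end
            x = relu(x)
    return x


def to_tokens(features, cls_token, tal_token):
    """Turn [ch][T] features into T frame tokens, then add the two extra
    tokens. Assumed layout: CLS at the head, TAL at the tail."""
    frames = [list(col) for col in zip(*features)]
    return [cls_token] + frames + [tal_token]


layers = make_bottleneck(in_ch=4, mid_ch=2, out_ch=3)
for T in (5, 16, 37):  # three different input lengths, same weights
    x = [[float(c + t) for t in range(T)] for c in range(4)]
    tokens = to_tokens(bottleneck(x, layers), [0.0] * 3, [1.0] * 3)
    print(T, len(tokens), len(tokens[0]))
```

Under these assumptions the same weights apply to every input length, and a T-frame input yields T + 2 tokens, which is one way a convolutional front end can avoid a fixed input duration.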
