FAST: Fast Audio Spectrogram Transformer
Anugunj Naman, Gaibo Zhang

TL;DR
FAST is a lightweight audio spectrogram transformer that combines CNNs and transformers, achieving state-of-the-art results in real-time audio classification with up to 150x fewer parameters than existing models.
Contribution
Introduces FAST, a novel architecture combining CNNs and transformers with Lipschitz attention for efficient, stable, real-time audio classification.
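The summary does not say which Lipschitz formulation FAST uses. As a rough illustration only, the sketch below shows one ingredient found in published Lipschitz attention variants: logits computed from negative squared distances with a shared query/key projection, instead of unbounded dot products. The function name `l2_attention`, the tied projection `w_qk`, and the scaling are assumptions made for this example; the sketch carries no proven Lipschitz bound and should not be read as the paper's method.

```python
import math


def l2_attention(x, w_qk, w_v):
    """Single-head distance-based self-attention (illustrative sketch).

    x:    list of N token vectors, each of length D
    w_qk: D x d projection shared by queries and keys (assumption)
    w_v:  D x d value projection
    Returns (outputs, attention_weights).
    """
    def project(vec, w):
        # Row vector (length D) times matrix (D x d) -> length d
        return [sum(vec[i] * w[i][j] for i in range(len(vec)))
                for j in range(len(w[0]))]

    qk = [project(tok, w_qk) for tok in x]  # queries and keys coincide
    values = [project(tok, w_v) for tok in x]
    scale = math.sqrt(len(qk[0]))

    outputs, weights = [], []
    for q in qk:
        # Negative squared distances: every logit is <= 0, and the
        # self-logit is exactly 0, so logits cannot grow without bound.
        logits = [-sum((a - b) ** 2 for a, b in zip(q, k)) / scale for k in qk]
        peak = max(logits)
        exps = [math.exp(l - peak) for l in logits]
        total = sum(exps)
        row = [e / total for e in exps]
        weights.append(row)
        outputs.append([sum(row[j] * values[j][c] for j in range(len(values)))
                        for c in range(len(values[0]))])
    return outputs, weights


# Three 2-dimensional tokens with identity projections.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
out, attn = l2_attention(tokens, identity, identity)
```

Because the self-distance is zero, each token's largest attention weight falls on itself in this toy setup, and each row of `attn` sums to 1.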
Findings
Achieves state-of-the-art performance on ADIMA and AudioSet datasets.
Uses up to 150x fewer parameters than existing models.
Demonstrates effectiveness in multilingual profanity and abuse detection.
Abstract
In audio classification, developing efficient and robust models is critical for real-time applications. Inspired by the design principles of MobileViT, we present FAST (Fast Audio Spectrogram Transformer), a new architecture that combines convolutional neural networks (CNNs) and transformers to capitalize on the strengths of both. FAST integrates the efficient local feature extraction of CNNs with the global context modeling of transformers, resulting in a model that is powerful yet lightweight and well suited to real-time or mobile use cases. Additionally, we incorporate Lipschitz continuous attention mechanisms to improve training stability and accelerate convergence. We evaluate FAST on the ADIMA dataset, a multilingual corpus for real-time profanity and abuse detection, as well as on the more traditional AudioSet. Our results show that FAST achieves…
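One way to picture the hand-off between a convolutional stage and a transformer stage is as a rearrangement: a 2D feature map is cut into patches that serve as tokens, and the tokens are later placed back on the grid. The sketch below is a generic, simplified version of that step. The helper names `unfold` and `fold`, the single-channel grid, and the non-overlapping patch layout are all assumptions for illustration; neither MobileViT's nor FAST's exact layout is claimed.

```python
def unfold(feature_map, ph, pw):
    """Split an H x W grid into non-overlapping ph x pw patches.

    Each patch is flattened to a list of ph * pw values; patches are
    returned in row-major order. H and W must be divisible by ph and pw.
    """
    h, w = len(feature_map), len(feature_map[0])
    if h % ph or w % pw:
        raise ValueError("feature map size must be divisible by patch size")
    patches = []
    for top in range(0, h, ph):
        for left in range(0, w, pw):
            patches.append([feature_map[top + r][left + c]
                            for r in range(ph) for c in range(pw)])
    return patches


def fold(patches, h, w, ph, pw):
    """Inverse of unfold: place flattened patches back on an H x W grid."""
    grid = [[0.0] * w for _ in range(h)]
    patches_per_row = w // pw
    for idx, patch in enumerate(patches):
        top = (idx // patches_per_row) * ph
        left = (idx % patches_per_row) * pw
        for k, value in enumerate(patch):
            grid[top + k // pw][left + k % pw] = value
    return grid


# A toy 4 x 4 "spectrogram feature map" (time x frequency), one channel.
fmap = [[r * 4 + c for c in range(4)] for r in range(4)]
tokens = unfold(fmap, 2, 2)          # four tokens of four values each
restored = fold(tokens, 4, 4, 2, 2)  # round trip back to the grid
```

In a full hybrid block, a transformer would process `tokens` between the two calls; here the round trip only shows that the rearrangement loses no information.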
Taxonomy
Topics: Speech and Audio Processing · Blind Source Separation Techniques · Advanced Adaptive Filtering Techniques
