Transformer Based Machine Fault Detection From Audio Input

Kiran Voderhobli Holla

arXiv:2604.12733·cs.SD·April 15, 2026

TL;DR

This paper explores the use of transformer-based models for machine fault detection from audio data, showing their potential advantages over traditional CNN approaches.

Contribution

The paper demonstrates the effectiveness of transformer architectures in analyzing sound data for machine fault detection and compares their embeddings with those produced by CNNs.

Findings

01. Transformer models outperform CNNs in spectrogram analysis for fault detection.

02. Transformers generate more relevant embeddings for machine failure prediction.

03. Lower inductive biases in transformers lead to better performance with sufficient data.

Abstract

In recent years, Sound AI has been increasingly used to predict machine failures. By attaching a microphone to the machine of interest, one can get real-time data on machine behavior from the field. Traditionally, Convolutional Neural Network (CNN) architectures have been used to analyze spectrogram images generated from the captured sounds and predict whether the machine is functioning as expected. CNN architectures seem to work well empirically even though they have inductive biases, such as locality and parameter sharing, that may not be fully relevant to spectrogram analysis. With the successful application of transformer-based models in image processing, starting with the Vision Transformer (ViT) in 2020, there has been significant interest in applying them to Sound AI. Since transformer-based architectures have significantly lower inductive biases, they are expected to…
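
The pipeline the abstract describes (audio in, spectrogram image out, image handed to a ViT-style model) can be illustrated with a short sketch. This is not the paper's code: the function names `log_spectrogram` and `to_patches`, the FFT size, hop length, patch size, and the synthetic 440 Hz test tone are all assumptions chosen for illustration. It computes a log-magnitude spectrogram with NumPy and cuts it into flattened non-overlapping patches, which is how a Vision Transformer typically tokenizes an image before attention is applied.

```python
import numpy as np

def log_spectrogram(signal, n_fft=256, hop=128):
    """Log-magnitude STFT spectrogram, shape (n_fft // 2 + 1, n_frames).

    n_fft and hop are illustrative defaults, not values from the paper.
    """
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    # Slice the signal into overlapping, windowed frames.
    frames = np.stack([signal[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    magnitude = np.abs(np.fft.rfft(frames, axis=1))  # (n_frames, n_bins)
    return np.log1p(magnitude).T                     # (n_bins, n_frames)

def to_patches(spec, patch=16):
    """Split a spectrogram into flattened non-overlapping patch x patch tokens.

    Rows and columns that do not fill a whole patch are cropped away.
    """
    h = (spec.shape[0] // patch) * patch
    w = (spec.shape[1] // patch) * patch
    grid = spec[:h, :w].reshape(h // patch, patch, w // patch, patch)
    # Reorder to (patch_row, patch_col, row_in_patch, col_in_patch).
    grid = grid.transpose(0, 2, 1, 3)
    return grid.reshape(-1, patch * patch)

# Hypothetical input: one second of a 440 Hz tone standing in for machine audio.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
signal = np.sin(2 * np.pi * 440.0 * t)

spec = log_spectrogram(signal)   # frequency bins x time frames
tokens = to_patches(spec)        # one row per patch
print(spec.shape, tokens.shape)
```

In a ViT-style model each row of `tokens` would then be linearly projected and combined with a positional embedding, and self-attention would let every patch attend to every other patch. A CNN applied to the same spectrogram would instead slide one shared kernel over local neighborhoods, which is the locality and parameter-sharing bias the abstract refers to.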
