Multimodal Real-Time Anomaly Detection and Industrial Applications
Aman Verma, Keshav Samdani, Mohd. Samiuddin Shafi

TL;DR
This paper develops a multimodal, real-time anomaly detection system using synchronized video and audio, demonstrating significant accuracy and robustness improvements for industrial monitoring applications.
Contribution
The paper introduces an advanced multimodal system with multi-model audio ensembles, hybrid object detection, and cross-modal attention, enhancing real-time anomaly detection in industrial environments.
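The summary does not say how the outputs of the two detectors in the hybrid object-detection stage are combined. One common option is to pool the boxes from both detectors and run class-wise non-maximum suppression (NMS), so that overlapping boxes for the same object collapse to the highest-scoring one. The sketch below shows that option under stated assumptions and is not the authors' method. The `Detection` record, the `fuse_detections` helper, and the 0.5 IoU and 0.25 score thresholds are all hypothetical. The paper may instead use weighted box fusion or a learned scheme.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Detection:
    """Hypothetical detection record; coordinates are (x1, y1, x2, y2) in pixels."""
    x1: float
    y1: float
    x2: float
    y2: float
    score: float
    label: str
    source: str  # hypothetical tag naming the detector, e.g. "yolo" or "detr"


def iou(a: Detection, b: Detection) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a.x1, b.x1), max(a.y1, b.y1)
    ix2, iy2 = min(a.x2, b.x2), min(a.y2, b.y2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a.x2 - a.x1) * (a.y2 - a.y1)
    area_b = (b.x2 - b.x1) * (b.y2 - b.y1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def fuse_detections(yolo: List[Detection], detr: List[Detection],
                    iou_threshold: float = 0.5,
                    score_threshold: float = 0.25) -> List[Detection]:
    """Assumed fusion: pool both detectors' boxes, then apply class-wise greedy NMS."""
    pooled = sorted((d for d in yolo + detr if d.score >= score_threshold),
                    key=lambda d: d.score, reverse=True)
    kept: List[Detection] = []
    for det in pooled:
        # Keep a box unless a higher-scoring box of the same class overlaps it too much.
        if all(det.label != k.label or iou(det, k) < iou_threshold for k in kept):
            kept.append(det)
    return kept


# Toy example with made-up boxes.
yolo_dets = [Detection(10.0, 10.0, 50.0, 50.0, 0.9, "person", "yolo")]
detr_dets = [Detection(12.0, 12.0, 52.0, 52.0, 0.8, "person", "detr"),
             Detection(100.0, 100.0, 150.0, 150.0, 0.7, "forklift", "detr"),
             Detection(200.0, 200.0, 220.0, 220.0, 0.1, "box", "detr")]
fused = fuse_detections(yolo_dets, detr_dets)
print([(d.label, d.source) for d in fused])
```

In this toy input, the two "person" boxes overlap with IoU of about 0.82, so only the higher-scoring YOLO box survives. The forklift box is found only by DETR, and it is kept, which illustrates how a second detector can recover objects the first one misses. The 0.1-score box falls below the score threshold and is discarded.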
Findings
Achieves high accuracy in real-time industrial monitoring
Demonstrates robustness across various scenarios
Improves detection performance with multi-model fusion
Abstract
This paper presents the design, implementation, and evolution of a comprehensive multimodal room-monitoring system that integrates synchronized video and audio processing for real-time activity recognition and anomaly detection. We describe two iterations of the system: an initial lightweight implementation using YOLOv8, ByteTrack, and the Audio Spectrogram Transformer (AST), and an advanced version that incorporates multi-model audio ensembles, hybrid object detection, bidirectional cross-modal attention, and multi-method anomaly detection. The evolution demonstrates significant improvements in accuracy, robustness, and industrial applicability. The advanced system combines three audio models (AST, Wav2Vec2, and HuBERT) for comprehensive audio understanding, dual object detectors (YOLO and DETR) for improved accuracy, and sophisticated fusion mechanisms for enhanced cross-modal…
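The abstract names bidirectional cross-modal attention but does not specify it. The sketch below is a minimal illustration built on several assumptions. It assumes that video and audio have already been encoded into per-time-step embeddings of a shared dimension, for example as projected outputs of the detection/tracking stage and the audio ensemble. It then applies scaled dot-product attention in both directions: video tokens query audio tokens, and audio tokens query video tokens. Each stream adds its attended context back through a residual connection. A trained model would learn query, key, and value projections and would likely use multiple heads. Those components are omitted here. The function names are hypothetical.

```python
import math
from typing import List, Tuple

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def cross_attend(queries: List[Vector], keys_values: List[Vector]) -> List[Vector]:
    """Scaled dot-product attention with identity Q/K/V projections (a simplification)."""
    dim = len(keys_values[0])
    scale = math.sqrt(dim)
    context: List[Vector] = []
    for q in queries:
        weights = softmax([dot(q, k) / scale for k in keys_values])
        # Weighted average of the other modality's tokens.
        context.append([sum(w * kv[j] for w, kv in zip(weights, keys_values))
                        for j in range(dim)])
    return context


def bidirectional_fusion(video: List[Vector],
                         audio: List[Vector]) -> Tuple[List[Vector], List[Vector]]:
    """Hypothetical fusion step: each modality attends to the other, plus a residual."""
    video_ctx = cross_attend(video, audio)  # video tokens look at audio tokens
    audio_ctx = cross_attend(audio, video)  # audio tokens look at video tokens
    fused_video = [[v + c for v, c in zip(tok, ctx)] for tok, ctx in zip(video, video_ctx)]
    fused_audio = [[a + c for a, c in zip(tok, ctx)] for tok, ctx in zip(audio, audio_ctx)]
    return fused_video, fused_audio


# Toy 2-d embeddings (made-up values): two video tokens, one audio token.
video_tokens = [[0.5, 0.5], [1.0, 0.0]]
audio_tokens = [[1.0, 2.0]]
fused_video, fused_audio = bidirectional_fusion(video_tokens, audio_tokens)
print(fused_video)  # [[1.5, 2.5], [2.0, 2.0]]
```

With a single audio token, every video token places its full attention weight on that token, so each video token simply receives the audio embedding as added context. The audio token, by contrast, receives a softmax-weighted average of the two video tokens. The symmetric design lets sound events highlight relevant visual regions and lets visual context disambiguate sounds. That rationale is an assumption about the authors' intent.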
Taxonomy
Topics: Anomaly Detection Techniques and Applications · Music and Audio Processing · Seismology and Earthquake Studies
