Look and Listen: A Multi-modality Late Fusion Approach to Scene Classification for Autonomous Machines

Jordan J. Bird; Diego R. Faria; Cristiano Premebida; Anikó Ekárt; George Vogiatzis

arXiv:2007.10175·cs.CV·July 21, 2020

TL;DR

This paper presents a multi-modality scene classification method that combines image and audio data through deep late fusion, improving accuracy over either single-modality classifier on eight environments with varying degrees of similarity.

Contribution

It introduces a multi-modality late fusion approach in which a tertiary neural network combines an image classifier and an audio classifier, improving scene classification accuracy over either network alone.

Findings

01 Achieved 96.81% accuracy with multi-modality fusion, against 89.27% for images alone and 93.72% for audio alone.

02 Late fusion outperforms classical classifiers by around 3%.

03 Corrected misclassifications caused by single-modality anomalies.

Abstract

The novelty of this study consists in a multi-modality approach to scene classification, where image and audio complement each other in a process of deep late fusion. The approach is demonstrated on a difficult classification problem, consisting of two synchronised and balanced datasets of 16,000 data objects, encompassing 4.4 hours of video of 8 environments with varying degrees of similarity. We first extract video frames and accompanying audio at one-second intervals. The image and the audio datasets are then classified independently, using a fine-tuned VGG16 and an evolutionary optimised deep neural network, with accuracies of 89.27% and 93.72%, respectively. This is followed by late fusion of the two neural networks to enable a higher order function, leading to accuracy of 96.81% in this multi-modality classifier with synchronised video frames and audio clips. The tertiary neural…
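
The late fusion step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each primary network emits one vector per one-second sample (here an 8-way probability vector, matching the 8 environments), that fusion is a simple concatenation, and that the tertiary network is a single dense layer with softmax. The names `fuse`, `dense`, and `tertiary_forward` are hypothetical, and the weights are untrained placeholders that only show the shapes involved.

```python
import math
import random

NUM_CLASSES = 8  # the abstract describes 8 environments


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(image_out, audio_out):
    """Late fusion input: concatenate the two per-modality output vectors."""
    return list(image_out) + list(audio_out)


def dense(inputs, weights, biases):
    """One fully connected layer; weights holds one row per output unit."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def tertiary_forward(image_out, audio_out, weights, biases):
    """Hypothetical tertiary network: a single dense layer plus softmax."""
    return softmax(dense(fuse(image_out, audio_out), weights, biases))


# Untrained placeholder weights (assumption), fixed seed for repeatability.
rng = random.Random(0)
weights = [[rng.uniform(-0.1, 0.1) for _ in range(2 * NUM_CLASSES)]
           for _ in range(NUM_CLASSES)]
biases = [0.0] * NUM_CLASSES

# Stand-ins for the image and audio network outputs on one second of video.
image_out = softmax([rng.gauss(0, 1) for _ in range(NUM_CLASSES)])
audio_out = softmax([rng.gauss(0, 1) for _ in range(NUM_CLASSES)])

fused_probs = tertiary_forward(image_out, audio_out, weights, biases)
predicted_class = max(range(NUM_CLASSES), key=fused_probs.__getitem__)
```

In practice the tertiary weights would be learned on synchronised image and audio pairs, so that the fused decision can override one modality when the other is more reliable for a given sample.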
