A Toolchain for Comprehensive Audio/Video Analysis Using Deep Learning Based Multimodal Approach (A use case of riot or violent context detection)

Lam Pham; Phat Lam; Tin Nguyen; Hieu Tang; Alexander Schindler

arXiv:2407.03110·cs.SD·July 4, 2024

TL;DR

This paper introduces a deep learning-based toolchain that integrates six audio and video analysis tasks (speech to text, acoustic scene classification, acoustic event detection, visual object detection, image captioning, and video captioning) to support audio/video clustering, comprehensive summarization, and riot or violent context detection.

Contribution

It presents a comprehensive, adaptable toolchain combining various multimodal deep learning tasks for audio/video analysis, enabling new applications like riot detection.

Findings

01. Effective integration of multiple audio/video analysis tasks

02. Applications in event detection and summarization demonstrated

03. Flexible architecture for future model integration

Abstract

In this paper, we present a toolchain for comprehensive audio/video analysis that leverages a deep learning based multimodal approach. To this end, the specific tasks of Speech to Text (S2T), Acoustic Scene Classification (ASC), Acoustic Event Detection (AED), Visual Object Detection (VOD), Image Captioning (IC), and Video Captioning (VC) are conducted and integrated into the toolchain. By combining individual tasks and analyzing both audio and visual data extracted from the input video, the toolchain offers various audio/video-based applications: two general applications, audio/video clustering and comprehensive audio/video summary, and one specific application, riot or violent context detection. Furthermore, the toolchain has a flexible and adaptable architecture into which new models can be integrated for further audio/video-based applications.
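The abstract describes independent tasks whose outputs are combined into applications, with room to plug in new models. The following is a minimal sketch of one way such an architecture could look, not the authors' implementation. The `Toolchain` class, the `flag_violent_context` rule, the stub task functions, and the keyword list are all hypothetical names and values introduced only for illustration; the paper's actual models and decision logic are not given in this listing.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List

# Hypothetical task type: each task maps a video path to a list of text labels.
Task = Callable[[str], List[str]]


@dataclass
class Toolchain:
    """Illustrative registry of independent analysis tasks."""
    tasks: Dict[str, Task] = field(default_factory=dict)

    def register(self, name: str, task: Task) -> None:
        # Integrating a new model only requires registering another callable.
        self.tasks[name] = task

    def analyze(self, video_path: str) -> Dict[str, List[str]]:
        # Run every registered task and store its output under the task name.
        return {name: task(video_path) for name, task in self.tasks.items()}


def flag_violent_context(
    results: Dict[str, List[str]], keywords: Iterable[str]
) -> Dict[str, List[str]]:
    """Assumed rule: keep, per task, the labels that contain any keyword."""
    lowered = [k.lower() for k in keywords]
    hits: Dict[str, List[str]] = {}
    for name, labels in results.items():
        matched = [lab for lab in labels if any(k in lab.lower() for k in lowered)]
        if matched:
            hits[name] = matched
    return hits


# Stub tasks standing in for real models (outputs are made up for the demo).
def stub_asc(path: str) -> List[str]:
    return ["crowded street"]


def stub_aed(path: str) -> List[str]:
    return ["shouting", "glass breaking"]


def stub_vod(path: str) -> List[str]:
    return ["person", "car"]


toolchain = Toolchain()
toolchain.register("ASC", stub_asc)
toolchain.register("AED", stub_aed)
toolchain.register("VOD", stub_vod)

results = toolchain.analyze("example.mp4")  # hypothetical input path
hits = flag_violent_context(results, ["shouting", "glass breaking", "fire"])
print(hits)
```

Keeping each task behind the same callable interface is one simple way to get the adaptability the abstract mentions: a new model is added with one `register` call, and downstream applications read the combined results without changes.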

Taxonomy

Topics: Music and Audio Processing