AudioLog: LLMs-Powered Long Audio Logging with Hybrid Token-Semantic Contrastive Learning

Jisheng Bai; Han Yin; Mou Wang; Dongyuan Shi; Woon-Seng Gan; Jianfeng Chen; Susanto Rahardja

arXiv:2311.12371 · eess.AS · January 5, 2024 · ICME · 1 citation

TL;DR

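AudioLog is an LLM-powered audio logging system. It fine-tunes a pre-trained hierarchical token-semantic audio Transformer with hybrid contrastive learning, then uses LLMs to summarize long audio sequences. On two datasets it surpasses existing methods in acoustic scene classification and sound event detection.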

Contribution

This work is the first to leverage large language models for summarizing long audio sequences using a hybrid token-semantic contrastive learning approach.

Findings

01 Outperforms existing methods in scene classification

02 Achieves superior sound event detection accuracy

03 Effectively summarizes long audio sequences

Abstract

Previous studies in automated audio captioning have faced difficulties in accurately capturing the complete temporal details of acoustic scenes and events within long audio sequences. This paper presents AudioLog, a large language models (LLMs)-powered audio logging system with hybrid token-semantic contrastive learning. Specifically, we propose to fine-tune the pre-trained hierarchical token-semantic audio Transformer by incorporating contrastive learning between hybrid acoustic representations. We then leverage LLMs to generate audio logs that summarize textual descriptions of the acoustic environment. Finally, we evaluate the AudioLog system on two datasets with both scene and event annotations. Experiments show that the proposed system achieves exceptional performance in acoustic scene classification and sound event detection, surpassing existing methods in the field. Further…
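The contrastive learning step mentioned in the abstract can be illustrated with a generic InfoNCE-style loss. This is only a minimal sketch, not the paper's formulation: it assumes each audio clip yields two embeddings (for example, one token-level and one semantic-level view), that matching views of the same clip are positives, and that the other clips in the batch act as negatives. The function names, the temperature value, and the toy vectors are all hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss over a batch.

    anchors[i] and positives[i] are two views of the same clip (assumed);
    every positives[j] with j != i serves as a negative for anchors[i].
    """
    a = [l2_normalize(v) for v in anchors]
    p = [l2_normalize(v) for v in positives]
    total = 0.0
    for i, a_i in enumerate(a):
        # Cosine similarity of anchor i to every candidate, scaled by temperature.
        logits = [sum(x * y for x, y in zip(a_i, p_j)) / temperature for p_j in p]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        # Negative log-probability of picking the true positive.
        total += log_denom - logits[i]
    return total / len(a)


# Toy batch of two clips with 2-D embeddings (hypothetical values).
view_a = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]      # each view matches its own clip
mismatched = [[0.1, 0.9], [0.9, 0.1]]   # views swapped between clips

loss_aligned = info_nce(view_a, aligned)
loss_mismatched = info_nce(view_a, mismatched)
```

In this toy batch the aligned pairing yields a lower loss than the swapped pairing, which is the property a contrastive objective relies on when pulling paired representations together.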

Code & Models

Repositories

jishengbai/audiolog (PyTorch, official)

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Diverse Musicological Studies

Methods: Multi-Head Attention · Attention Is All You Need · Contrastive Learning · Linear Layer · Dense Connections · Dropout · Softmax · Absolute Position Encodings · Residual Connection · Position-Wise Feed-Forward Layer