AudioLog: LLMs-Powered Long Audio Logging with Hybrid Token-Semantic Contrastive Learning
Jisheng Bai, Han Yin, Mou Wang, Dongyuan Shi, Woon-Seng Gan, Jianfeng Chen, Susanto Rahardja

TL;DR
AudioLog introduces a novel LLM-powered system with hybrid contrastive learning for long audio logging, significantly improving scene and event detection while effectively summarizing lengthy audio sequences.
Contribution
This work is the first to leverage large language models for summarizing long audio sequences using a hybrid token-semantic contrastive learning approach.
Findings
Outperforms existing methods in scene classification
Achieves superior sound event detection accuracy
Effectively summarizes long audio sequences
Abstract
Previous studies in automated audio captioning have faced difficulties in accurately capturing the complete temporal details of acoustic scenes and events within long audio sequences. This paper presents AudioLog, a large language models (LLMs)-powered audio logging system with hybrid token-semantic contrastive learning. Specifically, we propose to fine-tune the pre-trained hierarchical token-semantic audio Transformer by incorporating contrastive learning between hybrid acoustic representations. We then leverage LLMs to generate audio logs that summarize textual descriptions of the acoustic environment. Finally, we evaluate the AudioLog system on two datasets with both scene and event annotations. Experiments show that the proposed system achieves exceptional performance in acoustic scene classification and sound event detection, surpassing existing methods in the field. Further…
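The abstract does not spell out the contrastive objective, so the following is only a minimal sketch of what contrastive learning between two sets of acoustic representations can look like. It assumes an InfoNCE-style loss, which is a common choice but is not confirmed by this summary as the paper's formulation. The function name `info_nce`, the arguments `view_a` and `view_b`, and the `temperature` value are all hypothetical.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def info_nce(view_a, view_b, temperature=0.1):
    """Mean InfoNCE loss over a batch.

    Assumption: view_a[i] and view_b[i] are two representations of the
    same audio clip (the positive pair); every other view_b[j] in the
    batch is treated as a negative for view_a[i].
    """
    a = [l2_normalize(v) for v in view_a]
    b = [l2_normalize(v) for v in view_b]
    total = 0.0
    for i, anchor in enumerate(a):
        # Cosine similarity to every candidate, sharpened by the temperature.
        logits = [dot(anchor, other) / temperature for other in b]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(z - m) for z in logits))
        # Cross-entropy with the matching index as the target class.
        total += log_denom - logits[i]
    return total / len(a)

# Toy batch: two clips with 2-D embeddings (illustrative values only).
aligned = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
loss_matched = info_nce(aligned, aligned)
loss_mismatched = info_nce(aligned, swapped)
```

In this toy batch, the loss is close to zero when each clip's two representations agree and large when they are swapped, which is the behavior a contrastive objective is meant to reward.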
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Diverse Musicological Studies
Methods: Multi-Head Attention · Attention Is All You Need · Contrastive Learning · Linear Layer · Dense Connections · Dropout · Softmax · Absolute Position Encodings · Residual Connection · Position-Wise Feed-Forward Layer
