Fusion Segment Transformer: Bi-Directional Attention Guided Fusion Network for AI-Generated Music Detection
Yumin Kim, Seonghyeon Go

TL;DR
This paper introduces the Fusion Segment Transformer, a novel model that effectively detects AI-generated full-length music by capturing long-term context through a bi-directional attention mechanism and a Gated Fusion Layer, outperforming existing methods.
Contribution
The paper presents an improved architecture for full-audio AI-generated music detection, incorporating a Gated Fusion Layer to better integrate content and structural information for long-term modeling.
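To make the idea of gated fusion concrete, here is a minimal sketch of one common formulation: a sigmoid gate computed from the concatenated content and structure vectors decides, per dimension, how much of each to keep. The summary does not give the paper's actual equations, so the function names (`gated_fusion`, `sigmoid`), the parameter shapes, and the convex-combination form are all assumptions made for illustration, not the authors' implementation.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def gated_fusion(content, structure, weights, bias):
    """Fuse two equal-length vectors with an element-wise gate.

    ASSUMED formulation (not taken from the paper):
        gate  = sigmoid(W @ [content; structure] + b)
        fused = gate * content + (1 - gate) * structure

    content, structure: lists of floats, length d
    weights: d rows, each of length 2 * d
    bias: list of floats, length d
    """
    if len(content) != len(structure):
        raise ValueError("content and structure must have the same dimension")
    joint = list(content) + list(structure)  # concatenation [content; structure]
    fused = []
    for i, (c, s) in enumerate(zip(content, structure)):
        logit = sum(w * x for w, x in zip(weights[i], joint)) + bias[i]
        g = sigmoid(logit)
        # Convex combination: g near 1 keeps content, g near 0 keeps structure.
        fused.append(g * c + (1.0 - g) * s)
    return fused


if __name__ == "__main__":
    d = 2
    zero_w = [[0.0] * (2 * d) for _ in range(d)]
    # With zero weights and zero bias the gate is 0.5 everywhere,
    # so the result is the plain average of the two inputs.
    print(gated_fusion([1.0, 2.0], [3.0, 6.0], zero_w, [0.0, 0.0]))
```

In a trained model the weights and bias would be learned, so the gate can lean on content embeddings for some dimensions and on structural information for others rather than averaging them uniformly.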
Findings
Achieves state-of-the-art results on the SONICS and AIME datasets.
Outperforms previous models and recent baselines.
Effectively captures long-term context in full-length music detection.
Abstract
With the rise of generative AI technology, anyone can now easily create and deploy AI-generated music, which has heightened the need for technical solutions to address copyright and ownership issues. While existing work has mainly focused on short audio, the challenge of full-audio detection, which requires modeling long-term structure and context, remains insufficiently explored. To address this, we propose an improved version of the Segment Transformer, termed the Fusion Segment Transformer. As in our previous work, we extract content embeddings from short music segments using diverse feature extractors. Furthermore, we enhance the architecture for full-audio AI-generated music detection by introducing a Gated Fusion Layer that effectively integrates content and structural information, enabling the capture of long-term context. Experiments on the SONICS and AIME datasets show that our…
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis
