The NPU-HWC System for the ISCSLP 2024 Inspirational and Convincing Audio Generation Challenge

Dake Guo; Jixun Yao; Xinfa Zhu; Kangxiang Xia; Zhao Guo; Ziyu Zhang; Yao Wang; Jie Liu; Lei Xie

arXiv:2410.23815·cs.SD·November 1, 2024

Open Access

TL;DR

This paper introduces the NPU-HWC system for the ISCSLP 2024 Inspirational and Convincing Audio Generation Challenge (ICAGC). It pairs a speech generator that performs zero-shot speaking style cloning with an LLM-based background audio generator, and it placed second in Track 1 and first in Track 2.

Contribution

The paper presents a two-module system: a token-based speech synthesizer with zero-shot speaking style cloning, and an LLM-driven background audio generator.

Findings

01. Achieved second place in Track 1

02. Won first place in Track 2

03. Effective decoupling of timbre and speaking style at the token level

Abstract

This paper presents the NPU-HWC system submitted to the ISCSLP 2024 Inspirational and Convincing Audio Generation Challenge (ICAGC). Our system consists of two modules: a speech generator for Track 1 and a background audio generator for Track 2. In Track 1, we employ Single-Codec to tokenize speech into discrete tokens and use a language-model-based approach to achieve zero-shot speaking style cloning. Single-Codec effectively decouples timbre and speaking style at the token level, reducing the acoustic modeling burden on the autoregressive language model. Additionally, we use DSPGAN to upsample 16 kHz mel-spectrograms to high-fidelity 48 kHz waveforms. In Track 2, we propose a background audio generator based on large language models (LLMs). This system produces scene-appropriate accompaniment descriptions, synthesizes background audio with Tango 2, and integrates it with…
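The 16 kHz to 48 kHz step in the abstract implies a fixed 3x expansion between the mel-spectrogram domain and the output waveform. The sketch below only works through that sample-rate arithmetic; it is not the DSPGAN implementation. The two sample rates come from the abstract, while the hop length `HOP_16K` and all function names are hypothetical values chosen for illustration.

```python
# Sample-rate bookkeeping for a mel-to-waveform vocoder stage.
# SRC_SR and DST_SR come from the abstract; HOP_16K is a hypothetical value.

SRC_SR = 16_000   # rate at which the mel-spectrogram is defined
DST_SR = 48_000   # rate of the generated waveform
HOP_16K = 200     # ASSUMED hop length in samples at 16 kHz (not from the paper)


def upsample_ratio(src_sr: int, dst_sr: int) -> int:
    """Integer ratio between output and input sample rates."""
    if dst_sr % src_sr != 0:
        raise ValueError("this sketch only handles integer ratios")
    return dst_sr // src_sr


def samples_per_frame(hop: int, src_sr: int, dst_sr: int) -> int:
    """Output samples the vocoder must emit for each mel frame."""
    return hop * upsample_ratio(src_sr, dst_sr)


def output_length(num_frames: int, hop: int, src_sr: int, dst_sr: int) -> int:
    """Total waveform length in samples for a mel-spectrogram of num_frames."""
    return num_frames * samples_per_frame(hop, src_sr, dst_sr)


if __name__ == "__main__":
    print(upsample_ratio(SRC_SR, DST_SR))
    print(samples_per_frame(HOP_16K, SRC_SR, DST_SR))
    print(output_length(80, HOP_16K, SRC_SR, DST_SR))
```

Under the assumed hop length, 80 mel frames cover one second of 16 kHz audio, so the vocoder stage would have to produce one second of audio at 48 kHz for the same input.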

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis