C3LLM: Conditional Multimodal Content Generation Using Large Language Models

Zixuan Wang; Qinkai Duan; Yu-Wing Tai; Chi-Keung Tang

arXiv:2405.16136·cs.AI·May 28, 2024


Open Access

TL;DR

C3LLM introduces a unified multimodal framework that uses a large language model to handle three tasks (video-to-audio, audio-to-text, and text-to-audio), aiming for higher audio fidelity and better semantic alignment across modalities.

Contribution

The paper presents a novel hierarchical, discrete-token-based approach to multimodal generation with LLMs, integrating the video-to-audio, audio-to-text, and text-to-audio tasks into a single model.

Findings

01. Improved semantic alignment over previous methods

02. Enhanced audio fidelity through hierarchical token generation

03. Unified multimodal generation in an end-to-end framework

Abstract

We introduce C3LLM (Conditioned-on-Three-Modalities Large Language Models), a novel framework that combines three tasks: video-to-audio, audio-to-text, and text-to-audio. C3LLM adapts the Large Language Model (LLM) structure as a bridge for aligning different modalities, synthesizing the given conditional information, and performing multimodal generation in a discrete manner. Our contributions are as follows. First, we adapt a hierarchical structure for audio generation tasks with pre-trained audio codebooks. Specifically, we train the LLM to generate audio semantic tokens from the given conditions, and then use a non-autoregressive transformer to generate the different levels of acoustic tokens layer by layer to enhance the fidelity of the generated audio. Second, based on the intuition that LLMs were originally designed for discrete tasks with the next-word prediction method, we…
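
The two-stage generation described in the abstract (an autoregressive pass that produces audio semantic tokens, followed by a non-autoregressive pass that fills in acoustic tokens one codebook level at a time) can be illustrated with a toy sketch. This is not the authors' implementation: the two `toy_*` functions are hypothetical stand-ins for the LLM and the non-autoregressive transformer, and the vocabulary sizes, number of levels, and integer condition ids are assumptions chosen only to make the control flow runnable.

```python
from typing import List

# All sizes below are hypothetical; the abstract does not state them.
SEMANTIC_VOCAB = 16   # assumed size of the semantic token vocabulary
ACOUSTIC_VOCAB = 32   # assumed size of each acoustic codebook
NUM_LEVELS = 3        # assumed number of acoustic codebook levels


def toy_next_semantic_token(condition: List[int], prefix: List[int]) -> int:
    """Stand-in for the LLM: pick the next token from condition and prefix."""
    weighted = sum((i + 1) * t for i, t in enumerate(prefix))
    return (31 * sum(condition) + weighted + len(prefix)) % SEMANTIC_VOCAB


def generate_semantic_tokens(condition: List[int], length: int) -> List[int]:
    """Stage 1 (autoregressive): each token depends on all earlier tokens."""
    tokens: List[int] = []
    for _ in range(length):
        tokens.append(toy_next_semantic_token(condition, tokens))
    return tokens


def toy_predict_level(semantic: List[int],
                      previous_levels: List[List[int]],
                      level: int) -> List[int]:
    """Stand-in for the non-autoregressive transformer.

    Every position of one level is predicted in a single pass, using the
    semantic tokens and all coarser levels produced so far.
    """
    return [
        (7 * s + 3 * level + sum(prev[i] for prev in previous_levels))
        % ACOUSTIC_VOCAB
        for i, s in enumerate(semantic)
    ]


def generate_acoustic_tokens(semantic: List[int],
                             num_levels: int = NUM_LEVELS) -> List[List[int]]:
    """Stage 2: sequential over levels, parallel over time steps."""
    levels: List[List[int]] = []
    for level in range(num_levels):
        levels.append(toy_predict_level(semantic, levels, level))
    return levels


condition = [5, 1, 9]  # hypothetical ids standing in for a video or text condition
semantic = generate_semantic_tokens(condition, length=8)
acoustic = generate_acoustic_tokens(semantic)
```

In this sketch the loop in stage 1 runs once per time step, while the loop in stage 2 runs once per codebook level, which is the structural difference between the two stages. Decoding the acoustic tokens back to a waveform with a pre-trained codec is left out.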


Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Advanced Text Analysis Techniques