C3LLM: Conditional Multimodal Content Generation Using Large Language Models
Zixuan Wang, Qinkai Duan, Yu-Wing Tai, Chi-Keung Tang

TL;DR
C3LLM is a unified multimodal framework that uses a large language model to handle three tasks (video-to-audio, audio-to-text, and text-to-audio), aiming for higher audio fidelity and better semantic alignment across modalities.
Contribution
The paper presents a novel hierarchical, discrete-token-based approach for multimodal audio generation using LLMs, integrating video, audio, and text tasks into a single model.
Findings
Improved semantic alignment over previous methods
Enhanced audio fidelity through hierarchical token generation
Unified multimodal generation in an end-to-end framework
Abstract
We introduce C3LLM (Conditioned-on-Three-Modalities Large Language Models), a novel framework combining three tasks: video-to-audio, audio-to-text, and text-to-audio. C3LLM adapts the Large Language Model (LLM) structure as a bridge for aligning different modalities, synthesizing the given conditional information, and performing multimodal generation in a discrete manner. Our contributions are as follows. First, we adapt a hierarchical structure for audio generation tasks with pre-trained audio codebooks. Specifically, we train the LLM to generate audio semantic tokens from the given conditions, and further use a non-autoregressive transformer to generate different levels of acoustic tokens in layers to enhance the fidelity of the generated audio. Second, based on the intuition that LLMs were originally designed for discrete tasks with the next-word prediction method, we…
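Sketch
The two-stage generation described in the abstract (an LLM producing semantic tokens, followed by a non-autoregressive transformer producing acoustic tokens level by level) can be illustrated with a toy example. This is a minimal sketch and not the authors' implementation: the vocabulary sizes, the number of levels, and the functions `toy_next_token` and `toy_level_predictor` are hypothetical stand-ins for trained models, and the choice to condition each acoustic level on the levels below it is an assumption made for illustration.

```python
from typing import List

# All sizes below are hypothetical, chosen only for illustration.
SEMANTIC_VOCAB = 16   # size of the semantic token codebook
ACOUSTIC_VOCAB = 32   # size of each acoustic codebook level
NUM_LEVELS = 3        # number of acoustic codebook levels


def toy_next_token(prefix: List[int]) -> int:
    """Stand-in for an LLM's next-token choice (a deterministic toy rule)."""
    return (sum(prefix) * 7 + len(prefix)) % SEMANTIC_VOCAB


def generate_semantic_tokens(condition: List[int], length: int) -> List[int]:
    """Stage 1: autoregressive decoding, one semantic token per step."""
    sequence = list(condition)
    for _ in range(length):
        # Each new token depends on the condition and all earlier tokens.
        sequence.append(toy_next_token(sequence))
    return sequence[len(condition):]


def toy_level_predictor(
    semantic: List[int], lower_levels: List[List[int]], level: int
) -> List[int]:
    """Stand-in for a non-autoregressive model: every time step of one
    level is computed independently of the other time steps."""
    tokens = []
    for t, s in enumerate(semantic):
        context = s + sum(lvl[t] for lvl in lower_levels)
        tokens.append((context * 5 + level) % ACOUSTIC_VOCAB)
    return tokens


def generate_acoustic_tokens(semantic: List[int]) -> List[List[int]]:
    """Stage 2: fill in acoustic token levels one level at a time."""
    levels: List[List[int]] = []
    for level in range(NUM_LEVELS):
        levels.append(toy_level_predictor(semantic, levels, level))
    return levels


if __name__ == "__main__":
    condition_tokens = [1, 2, 3]  # hypothetical discrete condition (e.g. text)
    semantic = generate_semantic_tokens(condition_tokens, length=5)
    acoustic = generate_acoustic_tokens(semantic)
    print("semantic tokens:", semantic)
    for i, level_tokens in enumerate(acoustic):
        print(f"acoustic level {i}:", level_tokens)
```

The point of the split is that only the first stage is sequential in time; each acoustic level in the second stage is produced for all time steps at once. In a real system the acoustic levels would then be decoded to a waveform by the pre-trained audio codec, which is omitted here.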
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Advanced Text Analysis Techniques
