JavisGPT: A Unified Multi-modal LLM for Sounding-Video Comprehension and Generation

Kai Liu; Jungang Li; Yuchong Sun; Shengqiong Wu; Jianzhang Gao; Daoan Zhang; Wei Zhang; Sheng Jin; Sicheng Yu; Geng Zhan; Jiayi Ji; Fan Zhou; Liang Zheng; Shuicheng Yan; Hao Fei; Tat-Seng Chua

arXiv:2512.22905·cs.CV·January 5, 2026



TL;DR

JavisGPT is a novel unified multimodal large language model that achieves coherent audio-video comprehension and generation through a specialized architecture and extensive instruction tuning, outperforming existing models in complex scenarios.

Contribution

The paper introduces JavisGPT, the first unified model for joint audio-video understanding and generation, with a new architecture and large-scale instruction dataset.

Findings

01. Outperforms existing MLLMs on JAV benchmarks.

02. Achieves temporally coherent video-audio understanding.

03. Effective three-stage training pipeline enhances multimodal capabilities.

Abstract

This paper presents JavisGPT, the first unified multimodal large language model (MLLM) for joint audio-video (JAV) comprehension and generation. JavisGPT has a concise encoder-LLM-decoder architecture, which has a SyncFusion module for spatio-temporal audio-video fusion and synchrony-aware learnable queries to bridge a pretrained JAV-DiT generator. This design enables temporally coherent video-audio understanding and generation from multimodal instructions. We design an effective three-stage training pipeline consisting of multimodal pretraining, audio-video fine-tuning, and large-scale instruction-tuning, to progressively build multimodal comprehension and generation from existing vision-language models. For instruction tuning, we construct JavisInst-Omni, a high-quality instruction dataset with over 200K GPT-4o-curated audio-video-text dialogues that cover diverse and multi-level…
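The abstract mentions learnable queries that bridge the LLM to a pretrained generator but does not say how they are realized. As a rough illustration only, the sketch below shows one common pattern for this kind of bridge: a fixed set of query vectors cross-attends over a variable-length sequence of features and pools it into a fixed-size conditioning tensor. Everything here is an assumption made for illustration (the function names `softmax` and `cross_attend`, the dimensions, the random values standing in for trained queries and LLM hidden states, and the omission of projection matrices); it is not the paper's implementation and does not model the synchrony-aware part.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, features):
    """Pool `features` into one vector per query.

    Each query scores every feature vector by scaled dot product, and the
    output is the softmax-weighted average of the features. Learned
    key/value/output projections are left out to keep the sketch short.
    """
    dim = len(features[0])
    scale = 1.0 / math.sqrt(dim)
    pooled = []
    for q in queries:
        scores = [scale * sum(qi * fi for qi, fi in zip(q, f)) for f in features]
        weights = softmax(scores)
        pooled.append(
            [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]
        )
    return pooled


# Hypothetical sizes; random values stand in for trained parameters and
# for the hidden states a language model would produce.
rng = random.Random(0)
DIM, NUM_QUERIES, NUM_TOKENS = 8, 4, 10
queries = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_QUERIES)]
features = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_TOKENS)]

condition = cross_attend(queries, features)
print(len(condition), len(condition[0]))  # prints "4 8"
```

The point of the pattern is that the output shape depends only on the number of queries, not on the input length, so a downstream generator can be conditioned on a fixed-size tensor regardless of how long the instruction or multimodal context is.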



Videos

JavisGPT: A Unified Multi-modal LLM for Sounding-Video Comprehension and Generation (SlidesLive)

Taxonomy

Topics: Multimodal Machine Learning Applications · Music and Audio Processing · Speech Recognition and Synthesis