OmniNFT: Modality-wise Omni Diffusion Reinforcement for Joint Audio-Video Generation

Guohui Zhang; XiaoXiao Ma; Jie Huang; Hang Xu; Hu Yu; Siming Fu; Yuming Li; Zeyue Xue; Lin Song; Haoyang Huang; Nan Duan; Feng Zhao

arXiv:2605.12480·cs.CV·May 13, 2026


TL;DR

OmniNFT is a modality-aware reinforcement learning framework for joint audio-video generation. It improves per-modality fidelity, cross-modal alignment, and synchronization through advantage routing, gradient surgery, and region-wise loss reweighting.

Contribution

The paper presents OmniNFT, an online diffusion RL method for joint audio-video generation built on three components: advantage routing, gradient surgery, and region-wise loss reweighting.

Findings

01 Achieves state-of-the-art results on the JavisBench and VBench benchmarks.

02 Significantly improves cross-modal alignment and synchronization.

03 Enhances perceptual quality of generated audio and video.

Abstract

Recent advances in joint audio-video generation have been remarkable, yet real-world applications demand strong per-modality fidelity, cross-modal alignment, and fine-grained synchronization. Reinforcement Learning (RL) offers a promising paradigm, but its extension to multi-objective, multi-modal joint audio-video generation remains unexplored. Our in-depth analysis reveals that the primary obstacles to applying RL in this setting are: (i) multi-objective advantage inconsistency, where the advantages of multimodal outputs are not always consistent within a group; (ii) multi-modal gradient imbalance, where video-branch gradients leak into the shallow audio layers responsible for intra-modal generation; and (iii) uniform credit assignment, where fine-grained cross-modal alignment regions are not explored efficiently. These shortcomings suggest that vanilla RL fine-tuning…
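Obstacle (i) can be illustrated with a small numeric sketch. The abstract does not say how advantages are computed, so the code below assumes a group-relative scheme in which each reward is standardized within a group of samples generated from the same prompt. The reward values and the function names `group_advantages` and `inconsistent_samples` are hypothetical and exist only for illustration; this is not the paper's implementation.

```python
from statistics import mean, pstdev


def group_advantages(rewards):
    """Standardize rewards within one group (zero mean, unit std).

    Assumption: a group-relative advantage; the abstract does not
    specify the estimator actually used.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:
        # All samples scored the same: no sample is preferred.
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]


def inconsistent_samples(video_rewards, audio_rewards):
    """Return indices whose video and audio advantages have opposite signs."""
    a_video = group_advantages(video_rewards)
    a_audio = group_advantages(audio_rewards)
    return [
        i
        for i, (av, aa) in enumerate(zip(a_video, a_audio))
        if av * aa < 0
    ]


# Hypothetical rewards for a group of four generations from one prompt.
video_rewards = [0.9, 0.2, 0.6, 0.4]
audio_rewards = [0.1, 0.8, 0.7, 0.3]

conflicts = inconsistent_samples(video_rewards, audio_rewards)
print(conflicts)
```

With these made-up numbers, samples 0 and 1 are flagged: sample 0 is above the group mean on the video reward but below it on the audio reward, and sample 1 is the reverse. A single scalar advantage per sample would push one modality in the wrong direction for both, which is the kind of inconsistency the abstract describes.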


Code & Models

Models

🤗 zghhui/OmniNFT (model, 59 downloads, 25 likes)
