TL;DR
OmniNFT introduces a modality-aware reinforcement learning framework for joint audio-video generation, enhancing fidelity, alignment, and synchronization through innovative advantage routing, gradient surgery, and region-wise loss reweighting.
Contribution
The paper presents OmniNFT, an online diffusion RL method that combines advantage routing, gradient surgery, and region-wise loss reweighting to improve the quality and alignment of joint audio-video generation.
Findings
Achieves state-of-the-art results on the JavisBench and VBench benchmarks.
Significantly improves cross-modal alignment and synchronization.
Enhances perceptual quality of generated audio and video.
Abstract
Recent advances in joint audio-video generation have been remarkable, yet real-world applications demand strong per-modality fidelity, cross-modal alignment, and fine-grained synchronization. Reinforcement Learning (RL) offers a promising paradigm, but its extension to multi-objective and multi-modal joint audio-video generation remains unexplored. Notably, our in-depth analysis first reveals that the primary obstacles to applying RL in this setting stem from: (i) multi-objective advantages inconsistency, where the advantages of multimodal outputs are not always consistent within a group; (ii) multi-modal gradients imbalance, where video-branch gradients leak into shallow audio layers responsible for intra-modal generation; (iii) uniform credit assignment, where fine-grained cross-modal alignment regions fail to get efficient exploration. These shortcomings suggest that vanilla RL fine-tuning…
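Obstacle (i) in the abstract can be illustrated with a toy example. The sketch below is not taken from the paper: it assumes group-relative advantages of the form (reward - group mean) / group std, a common choice in group-based RL fine-tuning, and the reward values and function names are hypothetical. It only shows how a single sample in a group can receive a positive advantage for one modality and a negative one for the other.

```python
from statistics import mean, pstdev


def group_advantages(rewards):
    """Group-normalized advantages: (r - mean) / std within one sampled group.

    Assumption: this normalization is illustrative; the source does not
    specify how advantages are computed.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:
        # All samples scored the same, so no sample is preferred.
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]


def conflicting_samples(video_rewards, audio_rewards):
    """Indices of samples whose video and audio advantages have opposite signs."""
    adv_v = group_advantages(video_rewards)
    adv_a = group_advantages(audio_rewards)
    return [i for i, (v, a) in enumerate(zip(adv_v, adv_a)) if v * a < 0]


# Hypothetical per-modality rewards for a group of four samples from one prompt.
video_rewards = [0.9, 0.2, 0.6, 0.3]
audio_rewards = [0.1, 0.8, 0.7, 0.4]

print(conflicting_samples(video_rewards, audio_rewards))  # prints [0, 1]
```

In this toy group, samples 0 and 1 are pushed in opposite directions by the two objectives, so a single scalar advantage per sample would reward one modality while penalizing the other.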
