Tora3: Trajectory-Guided Audio-Video Generation with Physical Coherence

Junchao Liao; Zhenghao Zhang; Xiangyu Meng; Litao Li; Ziying Zhang; Siyu Zhu; Long Qin; Weizhi Wang

arXiv:2604.09057 · cs.CV · April 17, 2026

TL;DR

Tora3 is a trajectory-guided framework for audio-video generation that improves physical coherence, motion realism, and sound-motion synchronization by using object trajectories as a shared kinematic prior.

Contribution

Tora3 proposes a trajectory-aligned motion representation and a kinematic-audio alignment module to improve multimodal coherence in audio-video (AV) generation.

Findings

01. Tora3 outperforms existing methods in motion realism.

02. It achieves better sound-motion synchronization.

03. The approach enhances overall AV generation quality.

Abstract

Audio-video (AV) generation has recently made strong progress in perceptual quality and multimodal coherence, yet generating content with plausible motion-sound relations remains challenging. Existing methods often produce object motions that are visually unstable and sounds that are only loosely aligned with salient motion or contact events, largely because they lack an explicit motion-aware structure shared by video and audio generation. We present Tora3, a trajectory-guided AV generation framework that improves physical coherence by using object trajectories as a shared kinematic prior. Rather than treating trajectories as a video-only control signal, Tora3 uses them to jointly guide visual motion and acoustic events. Specifically, we design a trajectory-aligned motion representation for video, a kinematic-audio alignment module driven by trajectory-derived second-order kinematic…
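
The abstract is cut off at "trajectory-derived second-order kinematic…", so the exact quantities Tora3 uses are not stated here. As a minimal sketch of the general idea only, and assuming that "second-order" refers to acceleration-like signals, the following Python computes velocity and acceleration from a sampled 2D object trajectory by forward finite differences. The function names, the toy trajectory, and the use of acceleration magnitude as a cue for a contact event are all illustrative assumptions, not the paper's method.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def finite_difference(points: List[Point], dt: float) -> List[Point]:
    """Forward differences between consecutive samples, divided by dt."""
    return [
        ((x1 - x0) / dt, (y1 - y0) / dt)
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    ]


def second_order_kinematics(
    trajectory: List[Point], fps: float
) -> Tuple[List[Point], List[Point]]:
    """Return (velocity, acceleration) estimated from per-frame positions.

    Hypothetical helper: velocity is the first difference of position and
    acceleration is the first difference of velocity.
    """
    dt = 1.0 / fps
    velocity = finite_difference(trajectory, dt)
    acceleration = finite_difference(velocity, dt)
    return velocity, acceleration


def acceleration_magnitudes(acceleration: List[Point]) -> List[float]:
    """Euclidean norm of each acceleration sample."""
    return [(ax * ax + ay * ay) ** 0.5 for ax, ay in acceleration]


# Toy trajectory (assumed data): a point speeds up along y, then reverses
# abruptly at the last step, as it might at a bounce or collision.
trajectory = [(0.0, 0.0), (0.0, 1.0), (0.0, 3.0), (0.0, 6.0), (0.0, 4.0)]
velocity, acceleration = second_order_kinematics(trajectory, fps=1.0)
magnitudes = acceleration_magnitudes(acceleration)

# Index of the largest acceleration magnitude: a crude stand-in for a
# salient motion event that a sound might need to align with.
peak_index = max(range(len(magnitudes)), key=magnitudes.__getitem__)
print(velocity)
print(acceleration)
print(peak_index)
```

In this toy case the abrupt reversal produces the largest acceleration magnitude, which illustrates why a second-order signal could plausibly mark contact events more sharply than position or velocity alone. Real trajectories are noisy, so a practical version would likely smooth the positions before differencing.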
