TL;DR
PAVAS is a physics-aware approach to video-to-audio synthesis: it feeds physical reasoning and estimated physical parameters into a diffusion-based model so that generated audio is more realistic and physically consistent.
Contribution
The paper presents PAVAS, a V2A model that conditions generation on object-level physical parameters: mass, inferred by a vision-language model, and velocity, computed from a motion trajectory recovered by 3D reconstruction. The aim is to make generated sounds more physically plausible.
Findings
PAVAS outperforms existing models in physical realism and perceptual quality.
The new benchmark VGG-Impact evaluates physical realism in V2A.
The Audio-Physics Correlation Coefficient (APCC) measures physical-auditory consistency.
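This summary does not give the definition of APCC. As a rough illustration of what a physical-auditory consistency score could look like, the sketch below computes a Pearson correlation between a per-clip physical quantity and a per-clip audio loudness. The choice of kinetic energy as the physical quantity, RMS amplitude as the loudness measure, and all sample values are assumptions made for illustration, not the paper's metric.

```python
import math

def rms(samples):
    """Root-mean-square amplitude of one audio clip (a list of floats)."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(vx * vy)

# Hypothetical data: one physical value and one short audio clip per video.
energies = [0.5, 2.0, 4.5]                        # assumed kinetic energies (J)
clips = [[0.1, -0.1], [0.2, -0.2], [0.3, -0.3]]   # assumed waveforms
loudness = [rms(c) for c in clips]

# A score near 1 means louder audio accompanies more energetic events.
score = pearson(energies, loudness)
```

A score like this only captures a linear relationship between the two quantities; the paper's actual metric may use different audio features or a different correlation measure.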
Abstract
Recent advances in Video-to-Audio (V2A) generation have achieved impressive perceptual quality and temporal synchronization, yet most models remain appearance-driven, capturing visual-acoustic correlations without considering the physical factors that shape real-world sounds. We present Physics-Aware Video-to-Audio Synthesis (PAVAS), a method that incorporates physical reasoning into latent diffusion-based V2A generation through the Physics-Driven Audio Adapter (Phy-Adapter). The adapter receives object-level physical parameters estimated by the Physical Parameter Estimator (PPE), which uses a Vision-Language Model (VLM) to infer the moving object's mass and a segmentation-based dynamic 3D reconstruction module to recover its motion trajectory for velocity computation. These physical cues enable the model to synthesize sounds that reflect underlying physical factors. To assess physical…
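The abstract says velocity is computed from a recovered motion trajectory but does not say how. A minimal sketch of one common way to do it, forward finite differences, is shown below. It assumes the trajectory is a list of (x, y, z) positions in meters sampled at a fixed frame rate; the function names, the sample values, and the kinetic-energy step that combines speed with an estimated mass are all hypothetical and are not taken from the paper.

```python
import math

def speeds_from_trajectory(points, fps):
    """Per-step speed (m/s) from 3D positions via forward finite differences."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    dt = 1.0 / fps
    # Distance between consecutive positions divided by the frame interval.
    return [math.dist(p, q) / dt for p, q in zip(points, points[1:])]

def kinetic_energy(mass_kg, speed_mps):
    """Classical kinetic energy, 0.5 * m * v^2, in joules."""
    return 0.5 * mass_kg * speed_mps ** 2

# Hypothetical example: an object accelerating along z, sampled at 10 fps.
trajectory = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.1), (0.0, 0.0, 0.3)]
speeds = speeds_from_trajectory(trajectory, fps=10)

# Speed over the last interval, combined with an assumed mass of 2 kg
# (standing in for the value a VLM-based estimator would supply).
impact_speed = speeds[-1]
energy = kinetic_energy(2.0, impact_speed)
```

Finite differences amplify noise in the reconstructed positions, so a real pipeline would likely smooth the trajectory first; that step is omitted here.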
