Schrodinger Audio-Visual Editor: Object-Level Audiovisual Removal

Weihan Xu; Kan Jen Cheng; Koichi Saito; Muhammad Jehanzeb Mirza; Tingle Li; Yisi Liu; Alexander H. Liu; Liming Wang; Masato Ishii; Takashi Shibuya; Yuki Mitsufuji; Gopala Anumanchipalli; Paul Pu Liang

arXiv:2512.12875 · cs.CV · December 16, 2025

Open Access

TL;DR

This paper introduces SAVEBench, a new paired audiovisual dataset, and the Schrodinger Audio-Visual Editor (SAVE), a model that jointly edits audio and visual content at the object level, enabling precise removal while maintaining synchronization and semantic alignment.

Contribution

The paper presents the SAVEBench dataset and SAVE, a novel end-to-end flow-matching model for joint audiovisual editing that uses a Schrodinger Bridge to learn a direct source-to-target transformation.

Findings

01 SAVE effectively removes target objects from audio and video.

02 SAVE achieves stronger temporal synchronization.

03 SAVE maintains audiovisual semantic correspondence.

Abstract

Joint editing of audio and visual content is crucial for precise and controllable content creation. This new task poses challenges due to the limitations of paired audio-visual data before and after targeted edits, and the heterogeneity across modalities. To address the data and modeling challenges in joint audio-visual editing, we introduce SAVEBench, a paired audiovisual dataset with text and mask conditions to enable object-grounded source-to-target learning. With SAVEBench, we train the Schrodinger Audio-Visual Editor (SAVE), an end-to-end flow-matching model that edits audio and video in parallel while keeping them aligned throughout processing. SAVE incorporates a Schrodinger Bridge that learns a direct transport from source to target audiovisual mixtures. Our evaluation demonstrates that the proposed SAVE model is able to remove the target objects in audio and visual content…
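
To make the idea of a "direct transport from source to target" concrete, the sketch below shows a generic Brownian-bridge interpolation between a source sample and a target sample, together with the drift that a network could be trained to regress. This is a textbook construction offered only as an illustration; it is not taken from the paper, and the function names (`bridge_sample`, `drift_target`), the noise scale `sigma`, and the toy vectors are all assumptions. SAVE's actual formulation, latents, and conditioning may differ.

```python
import math
import random


def bridge_sample(x0, x1, t, sigma=1.0, rng=None):
    """Sample x_t from a Brownian bridge pinned at x0 (t=0) and x1 (t=1).

    Hypothetical helper: x0 stands in for a source latent and x1 for the
    edited target latent. The mean interpolates linearly between the two
    endpoints and the variance sigma^2 * t * (1 - t) vanishes at both ends.
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    rng = rng or random.Random()
    std = sigma * math.sqrt(t * (1.0 - t))
    return [
        (1.0 - t) * a + t * b + std * rng.gauss(0.0, 1.0)
        for a, b in zip(x0, x1)
    ]


def drift_target(x_t, x1, t):
    """Drift of the bridge pinned at x1: (x1 - x_t) / (1 - t).

    In a bridge-matching setup a network would be trained to regress this
    quantity (an assumption here, not the paper's stated loss).
    """
    if not 0.0 <= t < 1.0:
        raise ValueError("t must lie in [0, 1)")
    return [(b - x) / (1.0 - t) for x, b in zip(x_t, x1)]


# Toy usage with made-up 2-D vectors; sigma=0 makes the path deterministic.
source = [0.0, 0.0]
target = [2.0, 4.0]
midpoint = bridge_sample(source, target, t=0.5, sigma=0.0)
velocity = drift_target(midpoint, target, t=0.5)
```

With `sigma=0` the bridge collapses to the straight line between the endpoints and the drift equals `target - source`, which is the constant velocity used in plain flow matching. Unlike generation from noise, the path starts at the source sample itself, which is the sense in which the transport is "direct".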

Taxonomy

Topics: Video Analysis and Summarization · Generative Adversarial Networks and Image Synthesis · Music and Audio Processing