AVControl: Efficient Framework for Training Audio-Visual Controls

Matan Ben-Yosef; Tavi Halperin; Naomi Ken Korem; Mohammad Salama; Harel Cain; Asaf Joseph; Anthony Chen; Urska Jelercic; and Ofir Bibi

arXiv:2603.24793 · cs.CV · March 27, 2026

Open Access · 5 Models

TL;DR

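AVControl is a modular, efficient framework that trains each audio-visual control as a separate LoRA on a joint audio-visual foundation model (LTX-2). It outperforms all evaluated baselines on depth- and pose-guided generation and is competitive on camera control and audio-visual benchmarks, with minimal data and computational resources.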

Contribution

Introduces AVControl, a lightweight, extendable framework that trains separate control modalities as LoRA adapters on a joint audio-visual foundation model, enabling diverse controls without architectural changes.
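As a rough illustration of the "one LoRA per control modality" idea, the sketch below keeps a single frozen base weight and attaches an independent low-rank update to it for each control. This is not the authors' code: the `LoRALinear` class, the rank and scale values, and the modality names are assumptions chosen for the example, and NumPy stands in for a real training framework.

```python
import numpy as np


class LoRALinear:
    """Frozen base weight W plus a low-rank update (alpha / rank) * B @ A."""

    def __init__(self, weight, rank=4, alpha=4.0, seed=0):
        rng = np.random.default_rng(seed)
        out_dim, in_dim = weight.shape
        self.weight = weight  # frozen base weight, shared by every adapter
        # Down-projection A starts small and random; up-projection B starts at
        # zero, so a fresh adapter leaves the base model's output unchanged.
        self.A = rng.normal(0.0, 0.01, size=(rank, in_dim))
        self.B = np.zeros((out_dim, rank))
        self.scale = alpha / rank

    def __call__(self, x):
        # x: (tokens, in_dim) -> (tokens, out_dim)
        return x @ self.weight.T + self.scale * (x @ self.A.T) @ self.B.T


rng = np.random.default_rng(1)
base_weight = rng.normal(size=(8, 8))

# One adapter per control modality (names are illustrative only).
adapters = {
    name: LoRALinear(base_weight, rank=2, seed=i)
    for i, name in enumerate(["depth", "pose", "camera"])
}

x = rng.normal(size=(5, 8))
base_out = x @ base_weight.T
depth_out = adapters["depth"](x)  # equals base_out until B is trained
```

Because only `A` and `B` would be trained, adding a new control in this sketch means adding a new pair of small matrices while the base weight stays untouched, which mirrors the "no architectural changes" claim above.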

Findings

01. Outperforms all evaluated baselines on depth- and pose-guided generation on the VACE Benchmark

02. Achieves competitive results on camera control and audio-visual benchmarks

03. Supports a wide range of independently trained control modalities

Abstract

Controlling video and audio generation requires diverse modalities, from depth and pose to camera trajectories and audio transformations, yet existing approaches either train a single monolithic model for a fixed set of controls or introduce costly architectural changes for each new modality. We introduce AVControl, a lightweight, extendable framework built on LTX-2, a joint audio-visual foundation model, where each control modality is trained as a separate LoRA on a parallel canvas that provides the reference signal as additional tokens in the attention layers, requiring no architectural changes beyond the LoRA adapters themselves. We show that simply extending image-based in-context methods to video fails for structural control, and that our parallel canvas approach resolves this. On the VACE Benchmark, we outperform all evaluated baselines on depth- and pose-guided generation,…
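The phrase "provides the reference signal as additional tokens in the attention layers" can be pictured with a minimal single-head attention sketch in which reference tokens are concatenated to the target tokens before attention. This is only an assumed reading of the abstract: the function name `attention_with_reference`, the plain concatenation, the choice to return only the target rows, and the omission of positional encoding and LoRA-adapted projections are all simplifications made for the example, not details confirmed by the paper.

```python
import numpy as np


def softmax(scores):
    # Subtract the row maximum for numerical stability before exponentiating.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)


def attention_with_reference(target, reference, w_q, w_k, w_v):
    """Single-head self-attention over target tokens plus reference tokens.

    target:    (n_target, dim) tokens being generated
    reference: (n_ref, dim) tokens encoding the control signal
    Returns only the target rows (an assumption of this sketch): the
    reference acts as conditioning and is not itself an output.
    """
    tokens = np.concatenate([target, reference], axis=0)
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return (weights @ v)[: target.shape[0]]


rng = np.random.default_rng(0)
dim = 16
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) for _ in range(3))

target = rng.normal(size=(6, dim))        # e.g. noisy video latents
reference = rng.normal(size=(6, dim))     # e.g. an encoded depth video
other_reference = rng.normal(size=(6, dim))

out = attention_with_reference(target, reference, w_q, w_k, w_v)
out_other = attention_with_reference(target, other_reference, w_q, w_k, w_v)
```

In this sketch the output has one row per target token, and swapping the reference changes that output even though the target tokens and weights are identical, which is the sense in which the extra tokens steer generation.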

Taxonomy

Topics: Speech and Audio Processing · Generative Adversarial Networks and Image Synthesis · Music Technology and Sound Studies