ID-LoRA: Identity-Driven Audio-Video Personalization with In-Context LoRA

Aviad Dahan; Moran Yanuka; Noa Kraicer; Lior Wolf; Raja Giryes

arXiv:2603.10256 · cs.SD · March 12, 2026

Open Access · 6 Models · 2 Datasets

TL;DR

ID-LoRA jointly personalizes a subject's visual appearance and voice in a single model, generating synchronized audio and video guided by a text prompt, a reference image, and a short audio clip, after training on only ~3K pairs.

Contribution

ID-LoRA introduces a unified approach to audio-visual personalization that combines In-Context LoRA with identity guidance, addressing two challenges: distinguishing reference tokens from generation tokens, and preserving speaker characteristics.

Findings

01. Preferred over Kling 2.6 Pro in human studies for voice and style similarity

02. Improves speaker similarity by 24% in cross-environment tests

03. Achieves effective personalization with only ~3K training pairs on a single GPU

Abstract

Existing video personalization methods preserve visual likeness but treat video and audio separately. Without access to the visual scene, audio models cannot synchronize sounds with on-screen actions; and because classical voice-cloning models condition only on a reference recording, a text prompt cannot redirect speaking style or acoustic environment. We propose ID-LoRA (Identity-Driven In-Context LoRA), which jointly generates a subject's appearance and voice in a single model, letting a text prompt, a reference image, and a short audio clip govern both modalities together. ID-LoRA adapts the LTX-2 joint audio-video diffusion backbone via parameter-efficient In-Context LoRA and, to our knowledge, is the first method to personalize visual appearance and voice in a single generative pass. Two challenges arise. Reference and generation tokens share the same positional-encoding space,…
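For readers unfamiliar with the "parameter-efficient" part of the abstract, the following is a minimal sketch of a generic low-rank adapter (LoRA) on a single linear layer. It is not the authors' implementation: the class name `LoRALinear`, the helper `matvec`, the toy weight matrix, and the choice of rank and scaling are all hypothetical, and neither the in-context conditioning on reference tokens nor the LTX-2 backbone is modeled here. It only illustrates why the adapter adds few trainable parameters and why the adapted layer starts out identical to the frozen one.

```python
import random

def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / r) * B @ A.

    Hypothetical illustration only; names and defaults are assumptions.
    """

    def __init__(self, weight, rank, alpha, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight  # frozen base weight, shape (d_out, d_in)
        # A gets a small random init, B starts at zero, so B @ A == 0 at first.
        self.A = [[rng.gauss(0.0, 0.02) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]
        self.scale = alpha / rank

    def forward(self, x):
        base = matvec(self.weight, x)
        update = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * u for b, u in zip(base, update)]

    def num_trainable(self):
        # Only A (rank x d_in) and B (d_out x rank) would be trained.
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])

# Toy 3x4 base weight; the values are arbitrary.
W = [[1.0, 2.0, 0.0, 0.0],
     [0.0, 1.0, 0.0, 3.0],
     [0.5, 0.0, 0.0, 1.0]]
layer = LoRALinear(W, rank=1, alpha=1.0)
x = [1.0, 1.0, 1.0, 1.0]

y_init = layer.forward(x)        # matches the frozen layer because B is zero
n_full = len(W) * len(W[0])      # 12 parameters in the full matrix
n_lora = layer.num_trainable()   # 7 parameters in the rank-1 adapter
```

In this toy case the adapter trains 7 values instead of 12; the saving grows quickly for realistic layer sizes, where the rank is much smaller than either dimension.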

Taxonomy

Topics: Face recognition and analysis · Generative Adversarial Networks and Image Synthesis · Speech and Audio Processing