ID-LoRA: Identity-Driven Audio-Video Personalization with In-Context LoRA
Aviad Dahan, Moran Yanuka, Noa Kraicer, Lior Wolf, Raja Giryes

TL;DR
ID-LoRA jointly personalizes a subject's visual appearance and voice in a single model, trained on only about 3K pairs, so that a text prompt, a reference image, and a short audio clip together guide synchronized audio-video generation.
Contribution
ID-LoRA introduces a unified approach to audio-visual personalization that combines In-Context LoRA with identity guidance. It addresses two challenges: distinguishing reference tokens from generation tokens, and preserving speaker features.
Findings
Preferred over Kling 2.6 Pro in human studies for voice and style similarity
Improves speaker similarity by 24% in cross-environment tests
Achieves effective personalization with only ~3K training pairs on a single GPU
Abstract
Existing video personalization methods preserve visual likeness but treat video and audio separately. Without access to the visual scene, audio models cannot synchronize sounds with on-screen actions; and because classical voice-cloning models condition only on a reference recording, a text prompt cannot redirect speaking style or acoustic environment. We propose ID-LoRA (Identity-Driven In-Context LoRA), which jointly generates a subject's appearance and voice in a single model, letting a text prompt, a reference image, and a short audio clip govern both modalities together. ID-LoRA adapts the LTX-2 joint audio-video diffusion backbone via parameter-efficient In-Context LoRA and, to our knowledge, is the first method to personalize visual appearance and voice in a single generative pass. Two challenges arise. Reference and generation tokens share the same positional-encoding space,…
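The "parameter-efficient" adaptation mentioned in the abstract can be illustrated with a generic LoRA layer. This is a minimal sketch of the standard low-rank update only, not the authors' implementation and not the LTX-2 architecture. The class name `LoRALinear`, the rank, the scaling `alpha`, and the 64-dimensional layer size are all hypothetical choices made for illustration.

```python
import numpy as np


class LoRALinear:
    """Frozen linear layer plus a trainable low-rank update (generic LoRA).

    Hypothetical illustration; sizes and hyperparameters are assumptions.
    """

    def __init__(self, weight, rank=4, alpha=8.0, seed=0):
        rng = np.random.default_rng(seed)
        out_dim, in_dim = weight.shape
        self.weight = weight  # frozen base weight, shape (out_dim, in_dim)
        # Trainable down-projection, small random initialisation.
        self.A = rng.normal(0.0, 0.01, size=(rank, in_dim))
        # Trainable up-projection, zero-initialised so the adapted layer
        # starts out identical to the frozen base layer.
        self.B = np.zeros((out_dim, rank))
        self.scale = alpha / rank

    def __call__(self, x):
        base = x @ self.weight.T            # frozen path
        update = (x @ self.A.T) @ self.B.T  # low-rank path: B @ A applied to x
        return base + self.scale * update

    def trainable_params(self):
        return self.A.size + self.B.size


rng = np.random.default_rng(1)
W = rng.normal(size=(64, 64))     # stand-in for one backbone projection
layer = LoRALinear(W, rank=4)
x = rng.normal(size=(10, 64))     # 10 hypothetical tokens
y = layer(x)
```

In this toy setting only `A` and `B` would be trained, which is 512 values against 4,096 in the frozen weight. That ratio is what makes this style of adaptation cheap; the actual rank and the layers adapted in ID-LoRA are not stated in the text above.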
Code & Models
- 🤗 AviadDahan/LTX-2.3-ID-LoRA-CelebVHQ-3K · model · 133 dl · ♡ 29
- 🤗 AviadDahan/LTX-2.3-ID-LoRA-TalkVid-3K · model · 159 dl · ♡ 29
- 🤗 AviadDahan/ID-LoRA-CelebVHQ · model · 39 dl · ♡ 3
- 🤗 AviadDahan/ID-LoRA-TalkVid · model · 54 dl · ♡ 7
- 🤗 qqceqqq/LTX-2.3-ID-LoRA-TalkVid-3K · model · 10 dl
- 🤗 qqceqqq/LTX-2.3-ID-LoRA-CelebVHQ-3K · model · 7 dl
Taxonomy
Topics: Face recognition and analysis · Generative Adversarial Networks and Image Synthesis · Speech and Audio Processing
