LTX-2: Efficient Joint Audio-Visual Foundation Model
Yoav HaCohen, Benny Brazowski, Nisan Chiprut, Yaki Bitterman, Andrew Kvochko, Avishai Berkowitz, Daniel Shalem, Daphna Lifschitz, Dudu Moshe, Eitan Porat, Eitan Richardson, Guy Shiran, Itay Chachy, Jonathan Chetboun, Michael Finkelson, Michael Kupchick, Nir Zabari, Nitzan Guetta

TL;DR
LTX-2 is a novel open-source audiovisual foundation model that generates high-quality, synchronized video and audio content from text prompts, advancing the integration of semantic, emotional, and atmospheric cues in AI-generated media.
Contribution
It introduces a dual-stream transformer architecture with cross-attention and a modality-aware guidance mechanism for efficient, high-quality audiovisual generation from text prompts.
Findings
Achieves state-of-the-art audiovisual quality among open-source models.
Produces coherent audio tracks matching scene context and emotion.
Operates with lower computational cost compared to proprietary systems.
Abstract
Recent text-to-video diffusion models can generate compelling video sequences, yet they remain silent -- missing the semantic, emotional, and atmospheric cues that audio provides. We introduce LTX-2, an open-source foundational model capable of generating high-quality, temporally synchronized audiovisual content in a unified manner. LTX-2 consists of an asymmetric dual-stream transformer with a 14B-parameter video stream and a 5B-parameter audio stream, coupled through bidirectional audio-video cross-attention layers with temporal positional embeddings and cross-modality AdaLN for shared timestep conditioning. This architecture enables efficient training and inference of a unified audiovisual model while allocating more capacity for video generation than audio generation. We employ a multilingual text encoder for broader prompt understanding and introduce a modality-aware…
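The bidirectional audio-video cross-attention described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it is a minimal single-head example in plain Python, assuming hypothetical toy widths (`D_VIDEO`, `D_AUDIO`, `D_ATTN`), random weights, and a hypothetical helper `cross_attend`. It leaves out the temporal positional embeddings and the cross-modality AdaLN conditioning, and shows only how two streams of different widths can each attend to the other.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def softmax_rows(a):
    """Numerically stable softmax applied to each row."""
    out = []
    for row in a:
        m = max(row)
        exps = [math.exp(v - m) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def rand_matrix(rows, cols, rng):
    """Random matrix standing in for learned weights or token features."""
    return [[rng.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]

def cross_attend(tokens, context, w_q, w_k, w_v, w_o):
    """Hypothetical helper: `tokens` attend to `context` (single head, residual)."""
    q = matmul(tokens, w_q)    # (n_tokens, d_attn)
    k = matmul(context, w_k)   # (n_context, d_attn)
    v = matmul(context, w_v)   # (n_context, d_attn)
    scale = math.sqrt(len(q[0]))
    scores = [[s / scale for s in row] for row in matmul(q, transpose(k))]
    weights = softmax_rows(scores)            # (n_tokens, n_context)
    update = matmul(matmul(weights, v), w_o)  # projected back to the token width
    out = [[x + u for x, u in zip(t_row, u_row)] for t_row, u_row in zip(tokens, update)]
    return out, weights

rng = random.Random(0)
# Assumed toy sizes: the video stream is wider than the audio stream,
# mirroring the asymmetric capacity split only in spirit.
D_VIDEO, D_AUDIO, D_ATTN = 8, 4, 4
N_VIDEO, N_AUDIO = 6, 3

video = rand_matrix(N_VIDEO, D_VIDEO, rng)
audio = rand_matrix(N_AUDIO, D_AUDIO, rng)

# Video tokens query the audio tokens.
video_out, v2a_weights = cross_attend(
    video, audio,
    rand_matrix(D_VIDEO, D_ATTN, rng), rand_matrix(D_AUDIO, D_ATTN, rng),
    rand_matrix(D_AUDIO, D_ATTN, rng), rand_matrix(D_ATTN, D_VIDEO, rng),
)
# Audio tokens query the video tokens (the other direction).
audio_out, a2v_weights = cross_attend(
    audio, video,
    rand_matrix(D_AUDIO, D_ATTN, rng), rand_matrix(D_VIDEO, D_ATTN, rng),
    rand_matrix(D_VIDEO, D_ATTN, rng), rand_matrix(D_ATTN, D_AUDIO, rng),
)

print(len(video_out), len(video_out[0]))  # video keeps its own token count and width
print(len(audio_out), len(audio_out[0]))  # audio keeps its own token count and width
```

Both directions read from the same pre-update inputs, so neither stream sees the other's updated tokens within the layer. Each stream keeps its own width because the output projection maps back to the querying stream's dimension, which is one simple way to couple streams of unequal capacity.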
Code & Models
- 🤗 Lightricks/LTX-2.3 (model) · 1.5M dl · ♡ 858
- 🤗 unsloth/LTX-2.3-GGUF (model) · 232k dl · ♡ 268
- 🤗 Lightricks/LTX-2.3-nvfp4 (model) · 17k dl · ♡ 55
- 🤗 Lightricks/LTX-2 (model) · 1.0M dl · ♡ 1659
- 🤗 Lightricks/LTX-2.3-fp8 (model) · 537k dl · ♡ 67
- 🤗 Lightricks/LTX-2.3-22b-IC-LoRA-Union-Control (model) · ♡ 33
- 🤗 Lightricks/LTX-2.3-22b-IC-LoRA-Motion-Track-Control (model) · ♡ 30
- 🤗 unsloth/LTX-2-GGUF (model) · 11k dl · ♡ 127
- 🤗 vantagewithai/LTX-2.3-Split (model) · ♡ 3
- 🤗 vantagewithai/LTX-2.3-GGUF (model) · 41k dl · ♡ 16
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Speech and Audio Processing · Music and Audio Processing
