JambaTalk: Speech-Driven 3D Talking Head Generation Based on Hybrid Transformer-Mamba Model

Farzaneh Jafari; Stefano Berretti; Anup Basu

arXiv:2408.01627 · cs.CV · December 9, 2025

Open Access

TL;DR

JambaTalk is a novel hybrid Transformer-Mamba model for 3D talking head generation that improves lip-sync, facial expressions, and head movements, achieving state-of-the-art performance across multiple metrics.

Contribution

The paper introduces Jamba, a hybrid Transformer-Mamba architecture, combining SSM and Transformer strengths for comprehensive 3D talking head animation.

Findings

01. Achieves comparable or superior performance to state-of-the-art models.

02. Enhances motion variety and lip sync through multimodal integration.

03. Effectively handles long sequences with the Mamba architecture.

Abstract

In recent years, talking head generation has become a focal point for researchers. Considerable effort is being made to refine lip-sync motion, capture expressive facial expressions, generate natural head poses, and achieve high-quality video. However, no single model yet performs equally well across all quantitative and qualitative metrics. We introduce Jamba, a hybrid Transformer-Mamba model, to animate a 3D face. Mamba, a pioneering Structured State Space Model (SSM) architecture, was developed to overcome the limitations of conventional Transformer architectures, particularly in handling long sequences, a challenge that has constrained traditional models. Jamba combines the advantages of both the Transformer and Mamba approaches, offering a comprehensive solution. Based on the foundational Jamba block, we present JambaTalk to enhance motion variety and lip sync through…
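To make the long-sequence argument concrete, here is a minimal pure-Python sketch of the two ingredients a hybrid block combines. It is an illustration under stated assumptions, not the paper's implementation: the function names (`ssm_scan`, `causal_self_attention`, `hybrid_block`), the scalar features, the fixed coefficients `a`, `b`, `c`, and the attention-then-SSM layer order are all hypothetical. In particular, Mamba uses input-dependent (selective) state space parameters, whereas this sketch uses a fixed linear recurrence only to show the cost difference: the SSM makes one pass over the sequence, while attention compares every frame with all earlier frames.

```python
import math


def ssm_scan(xs, a=0.9, b=0.1, c=1.0):
    """Fixed-parameter linear state space recurrence (a simplification,
    not Mamba's selective scan): h_t = a*h_{t-1} + b*x_t, y_t = c*h_t.
    Runs in a single pass, so cost grows linearly with sequence length."""
    h = 0.0
    ys = []
    for x in xs:
        h = a * h + b * x
        ys.append(c * h)
    return ys


def causal_self_attention(xs):
    """Single-head causal attention over scalar features with q = k = v = x
    (assumed for brevity). Each step attends to all earlier steps, so cost
    grows quadratically with sequence length."""
    ys = []
    for t, q in enumerate(xs):
        past = xs[: t + 1]
        scores = [q * k for k in past]
        m = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - m) for s in scores]
        z = sum(weights)
        ys.append(sum(w * v for w, v in zip(weights, past)) / z)
    return ys


def hybrid_block(xs):
    """Hypothetical hybrid block: a residual attention layer followed by a
    residual SSM layer. The order and ratio of layers are assumptions."""
    attn = causal_self_attention(xs)
    h = [x + y for x, y in zip(xs, attn)]
    ssm = ssm_scan(h)
    return [u + s for u, s in zip(h, ssm)]
```

Calling `hybrid_block` on a list of per-frame audio features (scalars here) returns a list of the same length. A real model would operate on vectors with learned projections, and would map the output to 3D face motion parameters.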

Taxonomy

Topics: Face recognition and analysis · Human Motion and Animation · Human Pose and Action Recognition

Methods: Attention Is All You Need · Linear Layer · Residual Connection · Multi-Head Attention · Position-Wise Feed-Forward Layer · Adam · Byte Pair Encoding · Softmax · Absolute Position Encodings · Dense Connections