UniTAF: A Modular Framework for Joint Text-to-Speech and Audio-to-Face Modeling

Qiangong Zhou; Nagasaka Tomohiro

arXiv:2602.15651 · cs.SD · March 4, 2026



TL;DR

This paper introduces UniTAF, a modular framework that unifies text-to-speech and audio-to-face models, enabling feature transfer and emotion control, with a focus on system design and reusability of intermediate representations.

Contribution

It presents a novel modular framework for joint TTS and A2F modeling, demonstrating the feasibility of reusing intermediate features for integrated speech and facial expression generation.

Findings

01 Feasibility of reusing TTS features for joint modeling

02 Extension of emotion control to combined models

03 Open-source implementation available

Abstract

This work considers merging two independent models, text-to-speech (TTS) and audio-to-face (A2F), into a unified model to enable internal feature transfer, thereby improving the consistency between the audio and the facial expressions generated from text. We also discuss extending the emotion control mechanism from TTS to the joint model. This work does not aim to showcase generation quality; instead, from a system design perspective, it validates the feasibility of reusing intermediate representations from TTS for joint modeling of speech and facial expressions, and provides an engineering reference for subsequent co-design of speech and facial expression models. The project code is open source at: https://github.com/GoldenFishes/UniTAF
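The feature-transfer idea in the abstract can be illustrated with a small toy sketch. It is not taken from the UniTAF repository: the class names (`ToyTTS`, `ToyA2F`, `JointModel`), the scalar `emotion` parameter, and the arithmetic used to produce features are all hypothetical placeholders. The only point it shows is the wiring, in which the A2F stage consumes the TTS stage's intermediate features directly instead of re-encoding the synthesized audio.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class TTSOutput:
    audio: List[float]         # toy "waveform", one sample per frame
    hidden: List[List[float]]  # per-frame intermediate features


class ToyTTS:
    """Hypothetical stand-in for a TTS model; not UniTAF's actual API."""

    def __init__(self, feat_dim: int = 4) -> None:
        self.feat_dim = feat_dim

    def synthesize(self, text: str, emotion: float = 0.0) -> TTSOutput:
        # One frame per character; the values are deterministic toy numbers.
        # The emotion scalar shifts every feature (an assumed control scheme).
        hidden = [[(ord(ch) % 7) / 7.0 + emotion] * self.feat_dim for ch in text]
        audio = [sum(frame) / self.feat_dim for frame in hidden]
        return TTSOutput(audio=audio, hidden=hidden)


class ToyA2F:
    """Hypothetical stand-in for an audio-to-face model."""

    def __init__(self, n_blendshapes: int = 3) -> None:
        self.n_blendshapes = n_blendshapes

    def animate(self, features: List[List[float]]) -> List[List[float]]:
        # Map each feature frame to blendshape weights clamped to [0, 1].
        frames = []
        for frame in features:
            weight = min(1.0, max(0.0, sum(frame) / len(frame)))
            frames.append([weight] * self.n_blendshapes)
        return frames


class JointModel:
    """Wires the two modules so A2F reuses TTS intermediate features."""

    def __init__(self, tts: ToyTTS, a2f: ToyA2F) -> None:
        self.tts = tts
        self.a2f = a2f

    def generate(
        self, text: str, emotion: float = 0.0
    ) -> Tuple[List[float], List[List[float]]]:
        out = self.tts.synthesize(text, emotion)
        # Feature transfer: pass hidden features, not the audio, to A2F.
        face = self.a2f.animate(out.hidden)
        return out.audio, face


if __name__ == "__main__":
    audio, face = JointModel(ToyTTS(), ToyA2F()).generate("hi", emotion=0.1)
    print(len(audio), len(face), len(face[0]))
```

Because both outputs derive from the same per-frame features, the audio and face sequences have the same number of frames by construction, and a single emotion input reaches both outputs. This is one simple way to read the consistency argument, under the assumptions stated above.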


Taxonomy

Topics: Emotion and Mood Recognition · Face recognition and analysis · Social Robot Interaction and HRI