UniTAF: A Modular Framework for Joint Text-to-Speech and Audio-to-Face Modeling
Qiangong Zhou, Nagasaka Tomohiro

TL;DR
This paper introduces UniTAF, a modular framework that unifies text-to-speech (TTS) and audio-to-face (A2F) models so that features can be transferred between them and emotion control extends to the joint model. The focus is on system design and the reuse of intermediate representations rather than on generation quality.
Contribution
It presents a novel modular framework for joint TTS and A2F modeling, demonstrating the feasibility of reusing intermediate features for integrated speech and facial expression generation.
Findings
Feasibility of reusing TTS features for joint modeling
Extension of emotion control to combined models
Open-source implementation available
Abstract
This work considers merging two independent models, TTS and A2F, into a unified model to enable internal feature transfer, thereby improving the consistency between audio and facial expressions generated from text. We also discuss extending the emotion control mechanism from TTS to the joint model. This work does not aim to showcase generation quality; instead, from a system design perspective, it validates the feasibility of reusing intermediate representations from TTS for joint modeling of speech and facial expressions, and offers a practical engineering reference for future speech-expression co-design. The project code is open source at: https://github.com/GoldenFishes/UniTAF
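The feature-transfer idea in the abstract can be illustrated with a toy sketch. This is not the UniTAF implementation: the class names (`ToyTTS`, `ToyA2FHead`), the feature dimensions, the fixed weights, and the way the emotion vector is mixed in are all assumptions made for illustration. The only point it tries to show is the data flow, where the face head reads the TTS model's intermediate per-frame features instead of re-encoding the synthesized audio, and a single emotion vector conditions both outputs.

```python
from dataclasses import dataclass
from typing import List, Tuple

Vector = List[float]


@dataclass
class TTSOutput:
    audio: List[float]    # one toy "sample" per frame
    hidden: List[Vector]  # intermediate per-frame features exposed for reuse


class ToyTTS:
    """Hypothetical stand-in for a TTS model that exposes its hidden states."""

    def __init__(self, dim: int = 4) -> None:
        self.dim = dim

    def synthesize(self, text: str, emotion: Vector) -> TTSOutput:
        hidden: List[Vector] = []
        for ch in text:
            base = (ord(ch) % 32) / 32.0
            # Assumption: emotion conditioning is simply added to each frame.
            frame = [base + emotion[i % len(emotion)] for i in range(self.dim)]
            hidden.append(frame)
        # Toy "vocoder": collapse each feature frame to a single value.
        audio = [sum(frame) / self.dim for frame in hidden]
        return TTSOutput(audio=audio, hidden=hidden)


class ToyA2FHead:
    """Hypothetical face head that consumes TTS hidden states, not raw audio."""

    def __init__(self, dim: int = 4, n_blendshapes: int = 3) -> None:
        # Fixed toy weights; a real head would be learned.
        self.weights = [
            [(i + j + 1) / 10.0 for j in range(dim)] for i in range(n_blendshapes)
        ]

    def predict(self, hidden: List[Vector]) -> List[Vector]:
        # One linear projection per frame: features -> blendshape coefficients.
        return [
            [sum(w * x for w, x in zip(row, frame)) for row in self.weights]
            for frame in hidden
        ]


def generate(text: str, emotion: Vector) -> Tuple[List[float], List[Vector]]:
    """Run TTS once and feed its hidden features straight to the face head."""
    tts = ToyTTS()
    head = ToyA2FHead()
    out = tts.synthesize(text, emotion)
    face = head.predict(out.hidden)
    return out.audio, face


if __name__ == "__main__":
    audio, face = generate("hi", [0.1, 0.2])
    print(len(audio), len(face), len(face[0]))
```

Because the audio and the face coefficients are derived from the same per-frame features, they stay frame-aligned by construction, and changing the emotion vector shifts both outputs together. A cascaded design, in which a separate A2F model re-encodes the synthesized waveform, would not share that intermediate state.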
Taxonomy
Topics: Emotion and Mood Recognition · Face recognition and analysis · Social Robot Interaction and HRI
