Comparative Analysis of Audio Feature Extraction for Real-Time Talking Portrait Synthesis
Pegah Salehi, Sajad Amouei Sheshkal, Vajira Thambawita, Sushant Gautam, Saeed S. Sabet, Dag Johansen, Michael A. Riegler, Pål Halvorsen

TL;DR
This paper presents a real-time talking-head system that replaces conventional audio feature extraction models with the encoder of OpenAI's Whisper, speeding up processing and improving specific aspects of rendering quality for interviewer training applications.
Contribution
It introduces a fully integrated system using Whisper for audio feature extraction, enhancing efficiency and realism in real-time talking-head synthesis.
Findings
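Whisper accelerates audio feature extraction compared with conventional AFE models.
Whisper improves specific aspects of rendering quality, making talking-head interactions more realistic and responsive.
The evaluation covers two open-source real-time models across three datasets.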
Abstract
This paper examines the integration of real-time talking-head generation for interviewer training, focusing on overcoming challenges in Audio Feature Extraction (AFE), which often introduces latency and limits responsiveness in real-time applications. To address these issues, we propose and implement a fully integrated system that replaces conventional AFE models with OpenAI's Whisper, leveraging its encoder to optimize processing and improve overall system efficiency. Our evaluation of two open-source real-time models across three different datasets shows that Whisper not only accelerates processing but also improves specific aspects of rendering quality, resulting in more realistic and responsive talking-head interactions. These advancements make the system a more effective tool for immersive, interactive training applications, expanding the potential of AI-driven avatars in…
Taxonomy
Topics: Video Analysis and Summarization
