Speech-driven facial animation using polynomial fusion of features
Triantafyllos Kefalas, Konstantinos Vougioukas, Yannis Panagakis, Stavros Petridis, Jean Kossaifi, Maja Pantic

TL;DR
This paper introduces a polynomial fusion layer for speech-driven facial animation, capturing higher-order feature interactions to improve video realism, synchronization, and natural blinking in generated talking face videos.
Contribution
The paper proposes a polynomial fusion layer, with parameters modelled by a tensor decomposition, that captures higher-order interactions between the encoded features used to animate a face from a speech signal.
Findings
Improved video quality metrics
Enhanced audiovisual synchronization
More natural blinking in generated videos
Abstract
Speech-driven facial animation involves using a speech signal to generate realistic videos of talking faces. Recent deep learning approaches to facial synthesis rely on extracting low-dimensional representations and concatenating them, followed by a decoding step of the concatenated vector. This accounts for only first-order interactions of the features and ignores higher-order interactions. In this paper we propose a polynomial fusion layer that models the joint representation of the encodings by a higher-order polynomial, with the parameters modelled by a tensor decomposition. We demonstrate the suitability of this approach through experiments on generated videos evaluated on a range of metrics on video quality, audiovisual synchronisation and generation of blinks.
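
To illustrate the idea of fusing encodings with a higher-order polynomial rather than a plain concatenation, here is a minimal NumPy sketch. It is not the authors' implementation: the abstract does not specify the polynomial order or the decomposition, so this assumes a second-order polynomial whose quadratic weight tensor is stored in CP (rank-R) factored form. The function names, the two example encodings (`audio`, `identity`), and all dimensions are hypothetical.

```python
import numpy as np

def polynomial_fusion(z, b, W1, U, V1, V2):
    """Second-order polynomial of z with a CP-factored quadratic term.

    Assumed form (illustrative, not taken from the paper):
        y = b + W1 z + sum_r U[:, r] * (V1[:, r] . z) * (V2[:, r] . z)
    """
    first_order = W1 @ z                          # what a linear layer on the concatenation gives
    second_order = U @ ((V1.T @ z) * (V2.T @ z))  # pairwise interactions, never forming the full tensor
    return b + first_order + second_order

def dense_second_order(z, U, V1, V2):
    """Reference computation: rebuild the full (out, d, d) tensor and contract it with z twice."""
    W2 = np.einsum("or,ir,jr->oij", U, V1, V2)
    return np.einsum("oij,i,j->o", W2, z, z)

rng = np.random.default_rng(0)
d_audio, d_identity, d_out, rank = 4, 3, 5, 2    # hypothetical sizes

audio = rng.standard_normal(d_audio)             # stand-in for a speech encoding
identity = rng.standard_normal(d_identity)       # stand-in for a second encoding
z = np.concatenate([audio, identity])            # the usual concatenated vector
d = z.size

b = rng.standard_normal(d_out)
W1 = rng.standard_normal((d_out, d))
U = rng.standard_normal((d_out, rank))
V1 = rng.standard_normal((d, rank))
V2 = rng.standard_normal((d, rank))

y = polynomial_fusion(z, b, W1, U, V1, V2)
```

Under these assumptions the factored form stores rank * (d_out + 2d) numbers for the quadratic term instead of d_out * d * d, which is the practical reason for modelling the parameters with a tensor decomposition.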
