Hierarchical Semantic Perceptual Listener Head Video Generation: A High-performance Pipeline

Zhigang Chang; Weitai Hu; Qing Yang; Shibao Zheng

arXiv:2307.09821·cs.CV·July 20, 2023

TL;DR

This paper presents a high-performance pipeline for generating listener head videos in dyadic interactions. It strengthens hierarchical semantic extraction in the audio encoder and takes first place on the official leaderboard for the listening head generation track.

Contribution

It introduces improvements to the hierarchical semantic extraction in the audio encoder and refines the decoder, renderer, and post-processing modules for listener head video synthesis.

Findings

01

Achieved first place on the official leaderboard for listening head generation.

02

Enhanced hierarchical semantic extraction improves video synthesis quality.

03

Proposed pipeline outperforms baseline methods in the challenge.

Abstract

In dyadic speaker-listener interactions, the listener's head reactions and the speaker's head movements together constitute an important form of non-verbal semantic expression. The listener head generation task aims to synthesize responsive listener head videos from the speaker's audio and reference images of the listener. Compared with talking-head generation, it is more challenging to capture correlation cues from the speaker's audio and visual information. Following the ViCo baseline scheme, we propose a high-performance solution that enhances the hierarchical semantic extraction capability of the audio encoder module and improves the decoder, renderer, and post-processing modules. Our solution takes first place on the official leaderboard for the listening head generation track. This paper is a technical report of ViCo@2023 Conversational Head Generation…

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Face recognition and analysis · Speech and Audio Processing