DiffListener: Discrete Diffusion Model for Listener Generation
Siyeol Jung, Taehwan Kim

TL;DR
DiffListener introduces a discrete diffusion model that generates natural, synchronized listener responses from multimodal speaker cues, improving over autoregressive methods by explicitly modeling facial dynamics.
Contribution
It presents a novel non-autoregressive diffusion-based approach incorporating facial differential information for listener head generation.
Findings
Achieves state-of-the-art quantitative performance.
Produces highly natural and context-aware reactions.
User study confirms high naturalness and synchronization.
Abstract
The listener head generation (LHG) task aims to generate natural nonverbal listener responses based on the speaker's multimodal cues. Prior work either relies on limited modalities (e.g., audio and facial information) or employs autoregressive approaches, which suffer from limitations such as accumulating prediction errors. To address these limitations, we propose DiffListener, a discrete diffusion-based approach for non-autoregressive listener head generation. Our model takes the speaker's facial information, audio, and text as inputs, and additionally incorporates facial differential information to represent the temporal dynamics of expressions and movements. With this explicit modeling of facial dynamics, DiffListener can generate coherent reaction sequences in a non-autoregressive manner. Through comprehensive experiments, DiffListener demonstrates state-of-the-art performance in both…
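The abstract does not spell out how the facial differential information is computed. As a minimal sketch, assuming it is a first-order frame-to-frame difference over per-frame facial coefficients, it might look like the following. The function name `facial_differential`, the zero-padding of the first frame, and the list-of-floats representation are illustrative assumptions, not the authors' implementation.

```python
from typing import List

Frame = List[float]  # hypothetical: one vector of facial coefficients per video frame


def facial_differential(frames: List[Frame]) -> List[Frame]:
    """Return the first-order temporal difference of a facial feature sequence.

    Assumption: the differential is plain subtraction of consecutive frames.
    The first output frame is all zeros so the output length matches the input.
    """
    if not frames:
        return []
    dim = len(frames[0])
    if any(len(frame) != dim for frame in frames):
        raise ValueError("all frames must have the same dimensionality")
    # No previous frame exists for the first frame, so its difference is zero.
    diffs: List[Frame] = [[0.0] * dim]
    for prev, curr in zip(frames, frames[1:]):
        diffs.append([c - p for p, c in zip(prev, curr)])
    return diffs


if __name__ == "__main__":
    sequence = [[0.0, 1.0], [0.5, 1.0], [0.5, 3.0]]
    print(facial_differential(sequence))
```

Under this assumption, a static face yields all-zero differentials, while large values mark frames where expression or head pose changes quickly, which is the kind of temporal cue the abstract says the model receives alongside the raw facial features.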
Taxonomy
Topics: Music and Audio Processing
Methods: Diffusion
