THEval. Evaluation Framework for Talking Head Video Generation
Nabyl Quignon, Baptiste Chopin, Yaohui Wang, Antitza Dantcheva

TL;DR
This paper introduces a comprehensive evaluation framework with 8 metrics for assessing talking head video generation, emphasizing quality, naturalness, and synchronization, and provides extensive experimental validation on a new dataset.
Contribution
The authors propose a new, efficient evaluation framework with 8 metrics for talking head videos, including a curated dataset and public benchmarks to track progress.
Findings
Many models excel at lip synchronization but struggle with expressiveness.
Generated videos often contain artifacts and lack natural expressiveness.
The new dataset helps mitigate training data bias.
Abstract
Video generation has achieved remarkable progress, with generated videos increasingly resembling real ones. However, the rapid advance in generation has outpaced the development of adequate evaluation metrics. Currently, the assessment of talking head generation relies primarily on a limited set of metrics that evaluate general video quality and lip synchronization, and on user studies. Motivated by this, we propose a new evaluation framework comprising 8 metrics related to three dimensions: (i) quality, (ii) naturalness, and (iii) synchronization. In selecting the metrics, we place emphasis on efficiency, as well as alignment with human preferences. Based on these considerations, we streamline the framework to analyze fine-grained dynamics of head, mouth, and eyebrows, as well as face quality. Our extensive experiments on 85,000 videos generated by 17 state-of-the-art models suggest that while many…
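To make the three-dimension structure concrete, the sketch below shows one way per-metric scores could be rolled up into quality, naturalness, and synchronization scores. It is an illustration only: this excerpt does not list the 8 metrics or say how they are combined, so the metric names, the example values, the assumption that scores are already scaled to [0, 1], and the use of a plain average are all hypothetical.

```python
from statistics import mean


def dimension_scores(metric_scores, grouping):
    """Average per-metric scores within each dimension.

    Assumes every score is already scaled to [0, 1] with higher = better.
    Plain averaging is an assumption of this sketch, not the paper's rule.
    """
    result = {}
    for dimension, names in grouping.items():
        missing = [name for name in names if name not in metric_scores]
        if missing:
            raise KeyError(f"missing metrics for {dimension}: {missing}")
        result[dimension] = mean(metric_scores[name] for name in names)
    return result


# Placeholder metric names and values, for illustration only.
example_grouping = {
    "quality": ["quality_a", "quality_b"],
    "naturalness": ["naturalness_a", "naturalness_b"],
    "synchronization": ["sync_a"],
}
example_scores = {
    "quality_a": 0.8,
    "quality_b": 0.6,
    "naturalness_a": 0.5,
    "naturalness_b": 0.7,
    "sync_a": 0.9,
}

scores = dimension_scores(example_scores, example_grouping)
for dimension, value in scores.items():
    print(f"{dimension}: {value:.2f}")
```

Keeping the grouping as an argument means the real metric-to-dimension assignment can be supplied without changing the function.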
