A Near-Raw Talking-Head Video Dataset for Various Computer Vision Tasks
Babak Naderi, Ross Cutler

TL;DR
This paper introduces a large, high-fidelity talking-head video dataset captured from diverse webcams, annotated for perceptual quality, and used to evaluate video compression methods, advancing research in real-time communication video processing.
Contribution
The authors release a near-raw, high-fidelity talking-head video dataset with extensive annotations and benchmarking, filling a critical resource gap in real-time communication research.
Findings
Codec efficiency varies significantly with content and background processing.
H.266 achieves VMAF BD-rate savings of up to 71.3% relative to H.264 (a BD-rate of -71.3%).
The dataset is five times larger than previous datasets with lossless signal fidelity.
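For readers unfamiliar with the BD-rate figure quoted above, the following is a minimal sketch of the classic Bjøntegaard calculation: fit a cubic to log-bitrate as a function of quality for each codec, average the gap between the two curves over the shared quality range, and convert it to a percentage. The function names and the rate/quality points are hypothetical illustrations, not data or code from the paper, and the authors may use a different BD-rate variant (for example, piecewise cubic interpolation).

```python
import math


def _lagrange(xs, ys, x):
    """Evaluate the polynomial that interpolates the points (xs, ys) at x."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total


def _simpson(f, a, b):
    """Simpson's rule on [a, b]; exact for polynomials up to degree 3."""
    return (b - a) / 6.0 * (f(a) + 4.0 * f((a + b) / 2.0) + f(b))


def bd_rate(anchor, test):
    """Return the BD-rate of `test` relative to `anchor`, in percent.

    Each argument is a list of four (bitrate, quality) pairs with distinct
    quality values (e.g. VMAF). Negative results mean `test` needs less
    bitrate than `anchor` for the same quality.
    """
    if len(anchor) != 4 or len(test) != 4:
        raise ValueError("this sketch expects exactly four points per curve")

    q_a = [q for _, q in anchor]
    q_t = [q for _, q in test]
    log_r_a = [math.log10(r) for r, _ in anchor]
    log_r_t = [math.log10(r) for r, _ in test]

    # Compare the curves only where their quality ranges overlap.
    lo = max(min(q_a), min(q_t))
    hi = min(max(q_a), max(q_t))
    if hi <= lo:
        raise ValueError("quality ranges do not overlap")

    # Four points determine the cubic exactly, so interpolation equals the fit.
    area_a = _simpson(lambda q: _lagrange(q_a, log_r_a, q), lo, hi)
    area_t = _simpson(lambda q: _lagrange(q_t, log_r_t, q), lo, hi)

    avg_log_diff = (area_t - area_a) / (hi - lo)
    return (10.0 ** avg_log_diff - 1.0) * 100.0


# Made-up example: the test codec reaches each quality level at half the bitrate.
anchor_points = [(1000, 70.0), (2000, 80.0), (4000, 88.0), (8000, 93.0)]
test_points = [(500, 70.0), (1000, 80.0), (2000, 88.0), (4000, 93.0)]
print(round(bd_rate(anchor_points, test_points), 2))  # close to -50.0
```

Under this definition, a BD-rate of -71.3% means the tested codec needs on average about 71% less bitrate than the anchor to reach the same quality score.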
Abstract
Talking-head videos constitute a predominant content type in real-time communication, yet publicly available datasets for video processing research in this domain remain scarce and limited in signal fidelity. In this paper, we open-source a near-raw dataset of 847 talking-head recordings (approximately 212 minutes), each 15 s in duration, captured from 805 participants using 446 unique consumer webcam devices in their natural environments. All recordings are stored using the FFV1 lossless codec, preserving the camera-native signal, either uncompressed (24.4%) or MJPEG-encoded (75.6%), without additional lossy processing. Each recording is annotated with a Mean Opinion Score (MOS) and ten perceptual quality tokens that jointly explain 64.4% of the MOS variance. From this corpus, we curate a stratified benchmarking subset of 120 clips in three content conditions: original, background…
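The statement that the quality tokens explain 64.4% of the MOS variance reads most naturally as a coefficient of determination (R²) of 0.644. The sketch below shows how such a figure is computed from observed scores and model predictions. It assumes a plain, unadjusted R²; the abstract does not say which model or variant the authors used, and the function name and sample values are hypothetical.

```python
def r_squared(observed, predicted):
    """Fraction of the variance in `observed` accounted for by `predicted`."""
    if len(observed) != len(predicted) or len(observed) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")

    mean_obs = sum(observed) / len(observed)
    # Total variation of the observed scores around their mean.
    ss_total = sum((y - mean_obs) ** 2 for y in observed)
    # Variation left unexplained by the predictions.
    ss_residual = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    if ss_total == 0:
        raise ValueError("observed values have zero variance")

    return 1.0 - ss_residual / ss_total


# Made-up scores on a 1-5 scale, for illustration only.
mos = [1.0, 2.0, 3.0, 4.0, 5.0]
model_output = [1.5, 2.0, 3.0, 4.0, 4.5]
print(r_squared(mos, model_output))
```

A value of 1.0 means the predictions reproduce every score exactly, and 0.0 means they do no better than always predicting the mean score.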
