THFM: A Unified Video Foundation Model for 4D Human Perception and Beyond
Letian Wang, Andrei Zanfir, Eduard Gabriel Bazavan, Misha Andriluka, Cristian Sminchisescu

TL;DR
THFM is a unified video foundation model that performs multiple human-centric perception tasks within a single architecture, trained solely on synthetic data, and exhibits emergent generalization properties.
Contribution
The paper introduces THFM, a novel unified perception model derived from a text-to-video diffusion model, capable of handling diverse dense and sparse human perception tasks.
Findings
THFM matches or surpasses state-of-the-art specialized models on various benchmarks.
It generalizes from single-human videos to multiple humans and other object classes.
The model is trained exclusively on synthetic data, with no real-world or benchmark-specific training data.
Abstract
We present THFM, a unified video foundation model for human-centric perception that jointly addresses dense tasks (depth, normals, segmentation, dense pose) and sparse tasks (2D/3D keypoint estimation) within a single architecture. THFM is derived from a pretrained text-to-video diffusion model, repurposed as a single-forward-pass perception model and augmented with learnable tokens for sparse predictions. Modulated by the text prompt, our single unified model is capable of performing various perception tasks. Crucially, our model is on par with or surpasses state-of-the-art specialized models on a variety of benchmarks despite being trained exclusively on synthetic data (i.e., without training on real-world or benchmark-specific data). We further highlight intriguing emergent properties of our model, which we attribute to the underlying diffusion-based video representation. For example, our…
