From Frames to Sequences: Temporally Consistent Human-Centric Dense Prediction
Xingyu Miao, Junting Dong, Qin Zhao, Yuhang Yang, Junhao Chen, Yang Long

TL;DR
This paper introduces a novel synthetic data pipeline and a ViT-based model for temporally consistent human-centric dense prediction in videos, addressing flickering issues and improving accuracy across sequences.
Contribution
The work presents a scalable synthetic data generation method and a unified ViT-based model with explicit human priors, enhancing temporal consistency in dense human video prediction.
Findings
Achieves state-of-the-art results on THuman2.1 and Hi4D datasets.
Effectively generalizes to in-the-wild videos.
Improves temporal stability and accuracy in dense human prediction.
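As a rough illustration of what "temporal stability" can mean for dense prediction, the sketch below computes a naive flicker score: the mean absolute change between consecutive per-frame predictions. This is not the evaluation metric used in the paper. The function name `temporal_flicker` and the nested-list frame layout are assumptions made for illustration, and the measure only makes sense for near-static content because it does not compensate for real motion.

```python
def temporal_flicker(frames):
    """Mean absolute change between consecutive per-frame predictions.

    frames: list of T frames, each a list of rows of floats (e.g. depth maps).
    Lower values mean a steadier prediction over time. This naive version
    ignores real motion, so it is only meaningful for near-static content.
    """
    if len(frames) < 2:
        raise ValueError("need at least two frames")
    total, count = 0.0, 0
    # Compare each frame with the one that follows it, value by value.
    for prev, curr in zip(frames, frames[1:]):
        for row_prev, row_curr in zip(prev, curr):
            for a, b in zip(row_prev, row_curr):
                total += abs(a - b)
                count += 1
    return total / count


# Hypothetical 2x2 "depth maps" over three frames.
steady = [[[1.0, 2.0], [3.0, 4.0]]] * 3
flickery = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[1.5, 2.5], [3.5, 4.5]],
    [[1.0, 2.0], [3.0, 4.0]],
]
print(temporal_flicker(steady))    # 0.0
print(temporal_flicker(flickery))  # 0.5
```

A measure intended for moving subjects would typically align consecutive frames first (for example with optical flow) before differencing.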
Abstract
In this work, we focus on the challenge of temporally consistent human-centric dense prediction across video sequences. Existing models achieve strong per-frame accuracy but often flicker under motion, occlusion, and lighting changes, and they rarely have paired human video supervision for multiple dense tasks. We address this gap with a scalable synthetic data pipeline that generates photorealistic human frames and motion-aligned sequences with pixel-accurate depth, normals, and masks. Unlike prior static synthetic data pipelines, our pipeline provides both frame-level labels for spatial learning and sequence-level supervision for temporal learning. Building on this, we train a unified ViT-based dense predictor that (i) injects an explicit human geometric prior via CSE embeddings and (ii) improves geometry-feature reliability with a lightweight channel reweighting module after feature…
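The abstract mentions a lightweight channel reweighting module but, as excerpted here, does not say how the weights are computed. The sketch below therefore shows a generic squeeze-and-excitation-style gate, swapped in as a stand-in rather than the authors' design. The function name `channel_reweight`, the single linear layer, and the `weights` and `biases` parameters are all hypothetical.

```python
import math


def channel_reweight(features, weights, biases):
    """Scale each channel of a feature map by a gate in (0, 1).

    features: list of C channels, each an H x W list of rows of floats.
    weights:  C x C matrix (hypothetical learned parameters).
    biases:   length-C list (hypothetical learned parameters).
    Returns (reweighted_features, gates).
    """
    num_channels = len(features)
    # Squeeze: summarize every channel by its global spatial average.
    means = [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in features
    ]
    # Excite: one linear layer followed by a sigmoid yields a gate per channel.
    gates = []
    for i in range(num_channels):
        logit = biases[i] + sum(weights[i][j] * means[j] for j in range(num_channels))
        gates.append(1.0 / (1.0 + math.exp(-logit)))
    # Scale: multiply every value in a channel by that channel's gate.
    reweighted = [
        [[gate * value for value in row] for row in channel]
        for gate, channel in zip(gates, features)
    ]
    return reweighted, gates


# Two 2x2 channels; zero weights and biases give every channel a gate of 0.5.
features = [[[2.0, 4.0], [6.0, 8.0]], [[1.0, 1.0], [1.0, 1.0]]]
zero_weights = [[0.0, 0.0], [0.0, 0.0]]
out, gates = channel_reweight(features, zero_weights, [0.0, 0.0])
print(gates)   # [0.5, 0.5]
print(out[0])  # [[1.0, 2.0], [3.0, 4.0]]
```

In a trained network the gate parameters would be learned, so channels that help the geometry tasks could receive gates near 1 and less useful channels gates near 0. Whether the paper's module works this way is not stated in the text above.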
Peer Reviews
Decision: Submitted to ICLR 2026
1. The proposed method demonstrates substantial quantitative gains over existing approaches across multiple benchmarks. The improvements are consistent and, in some cases, even comparable to large-scale models such as Sapiens, which highlights the effectiveness and efficiency of the proposed design.
2. The paper successfully builds upon the previous observation from DAViD that “a single high-fidelity dataset is sufficient to tackle multiple dense prediction tasks and achieve state-of-the-art a
1. The authors highlight a scalable data synthesis pipeline for human-centric frames and videos as one of their main contributions. However, although the pipeline is described in Section 3.1, it is not clear how novel or distinctive this approach is compared to existing dataset synthesis methods. The proposed pipeline appears fairly conventional and lacks clear justification or comparison to prior data generation frameworks.
2. Table 4 only includes ablation results for models using either CSE o
* Synthetic data pipeline: Builds a large human-centric data synthesis pipeline to generate diverse human-centric image and video data.
* Simple and straightforward design: Uses a ViT backbone for feature extraction with a temporal head (DPT-style) to enforce temporal consistency, while leveraging human priors and local geometry cues.
* Strong empirical results: Competitive or superior performance across multiple benchmarks compared to prior methods (Sapiens, DAViD)
* **Insufficient overview of the synthetic data pipeline**. Since the dataset is a key contribution, the paper should include a clear figure of the data generation process and provide additional sample visualizations (e.g., in the supplement) to make the pipeline understandable and auditable.
* **Missing training-data ablations**. The model is trained on a mixture of SynthHuman and the proposed dataset, but there is no study isolating data effects (e.g., only SynthHuman vs. proposed vs. mixture)
- Build a scalable data synthesis pipeline for human-centric frames and videos with pixel-accurate depth, normals, and segmentation.
- Going beyond static-image training with video supervision improves temporal stability and generalization in natural scenes.
1. Points Requiring Clarification
(1) Channel Weight Adaptation (CWA): The manuscript does not explain how the module distinguishes channels dominated by texture and illumination from those that are geometry-related, nor how the reweighting is computed to downweight the former and upweight the latter (thereby weakening the influence of appearance on geometry prediction and maintaining the consistency of the global representation).
(2) Human Geometric Prior fusion: There is no description of ho
Taxonomy
Topics: Human Pose and Action Recognition · Generative Adversarial Networks and Image Synthesis · Face recognition and analysis
