Dynamic Pattern Alignment Learning for Pretraining Lightweight Human-Centric Vision Models

Xuanhan Wang; Huimin Deng; Ke Liu; Jun Wang; Lianli Gao; Jingkuan Song

arXiv:2508.07144·cs.CV·August 12, 2025

TL;DR

This paper introduces DPAL, a distillation framework that pretrains lightweight human-centric vision models to generalize well across various tasks by learning from large models and aligning multiple visual patterns.

Contribution

The paper proposes a novel dynamic pattern alignment learning method with a dynamic Mixture of Experts decoder for effective distillation of human-centric visual patterns.

Findings

1. Lightweight models achieve performance comparable to large models.
2. DPAL outperforms previous distillation methods significantly.
3. Models generalize well across diverse datasets.

Abstract

Human-centric vision models (HVMs) have achieved remarkable generalization due to large-scale pretraining on massive person images. However, their dependence on large neural architectures and the restricted accessibility of pretraining data significantly limit their practicality in real-world applications. To address this limitation, we propose Dynamic Pattern Alignment Learning (DPAL), a novel distillation-based pretraining framework that efficiently trains lightweight HVMs to acquire strong generalization from large HVMs. In particular, human-centric visual perception is highly dependent on three typical visual patterns: the global identity pattern, the local shape pattern, and the multi-person interaction pattern. To achieve generalizable lightweight HVMs, we first design a dynamic pattern decoder (D-PaDe), acting as a dynamic Mixture of Experts (MoE) model. It incorporates three…
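To make the Mixture of Experts idea concrete, the following is a minimal sketch of generic soft MoE gating over three experts, one per visual pattern named in the abstract. It is not the paper's D-PaDe: the function `moe_decode`, the gate keys, and the toy scaling experts are all hypothetical and chosen only for illustration, and the abstract does not specify how D-PaDe routes or generates its experts.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]

# Pattern names taken from the abstract; everything else here is hypothetical.
PATTERNS = ("global_identity", "local_shape", "multi_person_interaction")


def softmax(scores: Sequence[float]) -> Vector:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def moe_decode(
    feature: Vector,
    query: Vector,
    gate_keys: Sequence[Vector],
    experts: Sequence[Callable[[Vector], Vector]],
) -> Tuple[Vector, Vector]:
    """Mix expert outputs with weights derived from a pattern query.

    Assumption: gating is a softmax over query-key dot products. This is a
    common generic choice, not something stated in the source.
    """
    weights = softmax([dot(query, key) for key in gate_keys])
    outputs = [expert(feature) for expert in experts]
    mixed = [
        sum(w * out[i] for w, out in zip(weights, outputs))
        for i in range(len(feature))
    ]
    return mixed, weights


# Toy experts: each simply rescales the feature (illustrative only).
TOY_EXPERTS = [
    lambda f: [1.0 * x for x in f],
    lambda f: [2.0 * x for x in f],
    lambda f: [3.0 * x for x in f],
]

# One hypothetical gate key per pattern.
TOY_KEYS = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
```

A query aligned with the first key shifts most of the weight onto the first expert, while an all-zero query weights the three experts equally. In a distillation setting, the mixed output would then be compared against the teacher's representation for the corresponding pattern.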

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 4 · Confidence 4

Strengths

- Clear factorization of supervision into global/local/relation levels, with explicit objectives and a unifying D-PaDe adapter; losses are simple and well-defined.
- Task-agnostic pretraining that aims for broad transfer across 15 human-centric benchmarks and even cross-species/style settings (HumanArt, AP-10K, Chimpact-Pose).
- Practical deployment story: D-PaDe is only used in pretraining; the student backbone is retained for fine-tuning/inference.
- Accessible data recipe (LUP1M) and a

Weaknesses

- Reliance on synthetic multi-person composition. Multi-person “interaction” supervision is created by simple copy-paste, which may distort occlusion statistics, scale coherence, and contextual cues that real crowd datasets exhibit. Please quantify domain gap vs. real multi-person corpora (e.g., CrowdHuman) with a pretraining ablation without any synthetic composites and with only real multi-person images, holding compute constant.
- Unclear computational overhead & stability of D-PaDe. D-PaDe

Reviewer 02 · Rating 4 · Confidence 3

Strengths

- **Practical Research Goal:** The method tackles the important and practical challenge of creating generalizable, lightweight Human-centric Vision Models (HVMs) without relying on massive, private pretraining datasets.
- **Novel Distillation Framework:** The paper proposes a novel approach (DPAL) to distill three distinct visual patterns (global, local, and interaction). The core component, a dynamic pattern decoder (D-PaDe), is designed as a dynamic Mixture-of-Experts to mitigate optimization

Weaknesses

- **Ambiguous 'Generalization' Claim:** The claim of distilling "generalization capability" is potentially overstated. The framework's design appears to be a form of multi-task distillation with objectives that are highly tailored to the downstream evaluation categories, rather than a method for learning a truly general representation.
  - For instance, the global-level objective (aligning multiple student representations from multi-view images to a single teacher representation) and the local-le

Reviewer 03 · Rating 4 · Confidence 5

Strengths

1. The approach of decomposing human-centric vision into three heterogeneous patterns represents a reasonable analysis, while the D-PaDe architecture that dynamically generates experts based on pattern queries is an innovative application.
2. Ablation experiments thoroughly validate the necessity of each component. The evaluation covers 15 challenging datasets, encompassing diverse tasks such as single-person discrimination, dense prediction, multi-person understanding, and cross-domain generali

Weaknesses

1. The description of D-PaDe's specific implementation mechanism lacks precision. Particularly, the process of how pattern queries interact with input features to dynamically generate expert parameters lacks rigorous mathematical formulation, and how the routing module decides which expert to activate is not sufficiently explained.
2. While the authors emphasize "lightweight", relevant analysis is missing. It is well-known that distillation to small models has been extensively validated (e.g.,

Reviewer 04 · Rating 6 · Confidence 3

Strengths

1. High Practical Relevance and Strong Motivation: The paper tackles a highly significant and practical problem. By enabling the creation of powerful lightweight models without access to proprietary large-scale datasets, it significantly lowers the barrier to entry for both research and real-world deployment.
2. Extremely Comprehensive and Compelling Experimental Results: The empirical evaluation is a major strength of this paper. The authors test their method on an impressive 15 datasets, cover

Weaknesses

See the questions below.

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning