Ask, Pose, Unite: Scaling Data Acquisition for Close Interactions with Vision Language Models
Laura Bravo-Sánchez, Jaewoo Heo, Zhenzhen Weng, Kuan-Chieh Wang, and Serena Yeung-Levy

TL;DR
This paper introduces a novel data generation approach using Large Vision Language Models to create a comprehensive dataset for human mesh estimation in close interactions, improving model performance in complex social scenarios.
Contribution
We propose a new data annotation method leveraging LVLMs and introduce the APU dataset, significantly advancing data availability for close human interactions in HME.
Findings
Using our dataset improves mesh estimation accuracy on unseen interactions.
The diffusion-based contact prior enhances test-time optimization.
Our approach reduces annotation effort and addresses data scarcity.
Abstract
Social dynamics in close human interactions pose significant challenges for Human Mesh Estimation (HME), particularly due to the complexity of physical contacts and the scarcity of training data. Addressing these challenges, we introduce a novel data generation method that utilizes Large Vision Language Models (LVLMs) to annotate contact maps, which guide test-time optimization to produce paired images and pseudo-ground-truth meshes. This methodology not only alleviates the annotation burden but also enables the assembly of a comprehensive dataset specifically tailored for close interactions in HME. Our Ask Pose Unite (APU) dataset, comprising over 6.2k human mesh pairs in contact covering diverse interaction types, is curated from images depicting naturalistic person-to-person scenes. We empirically show that using our dataset to train a diffusion-based contact prior, used as guidance…
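To make the idea of a contact map guiding test-time optimization more concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the region names, the dictionary layout, and the functions `min_sq_distance` and `contact_loss` are all hypothetical. It assumes a contact map can be expressed as a list of (region on person 1, region on person 2) pairs, and it uses a generic closest-point distance penalty, which an optimizer would try to drive toward zero.

```python
from itertools import product


def min_sq_distance(points_a, points_b):
    """Smallest squared Euclidean distance between any point in A and any point in B."""
    return min(
        sum((pa - pb) ** 2 for pa, pb in zip(a, b))
        for a, b in product(points_a, points_b)
    )


def contact_loss(regions_p1, regions_p2, contact_pairs):
    """Sum of closest-point squared distances over region pairs flagged as touching.

    regions_p1 / regions_p2: dict mapping a (hypothetical) body-region name
    to a list of 3D vertex coordinates for that person.
    contact_pairs: list of (region_on_p1, region_on_p2) tuples, standing in
    for an annotated contact map.
    """
    return sum(
        min_sq_distance(regions_p1[r1], regions_p2[r2])
        for r1, r2 in contact_pairs
    )


# Toy data: two vertices per region, coordinates in metres (made up for illustration).
person1 = {"right_hand": [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0)]}
person2 = {"left_shoulder": [(0.4, 0.0, 0.0), (0.5, 0.2, 0.0)]}
contacts = [("right_hand", "left_shoulder")]

loss = contact_loss(person1, person2, contacts)
print(f"contact loss: {loss:.4f}")
```

In this toy setup the penalty shrinks as the hand vertices move toward the shoulder vertices, which is the kind of signal a test-time optimizer could minimize alongside its other fitting terms.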
Taxonomy
Topics: Speech and dialogue systems
