Ask, Pose, Unite: Scaling Data Acquisition for Close Interactions with Vision Language Models

Laura Bravo-Sánchez; Jaewoo Heo; Zhenzhen Weng; Kuan-Chieh Wang; Serena Yeung-Levy

arXiv:2410.00309·cs.CV·October 2, 2024

TL;DR

This paper introduces a novel data generation approach using Large Vision Language Models to create a comprehensive dataset for human mesh estimation in close interactions, improving model performance in complex social scenarios.

Contribution

We propose a new data annotation method leveraging Large Vision Language Models (LVLMs) and introduce the APU dataset, significantly advancing data availability for close human interactions in Human Mesh Estimation (HME).

Findings

1. Using our dataset improves mesh estimation accuracy on unseen interactions.
2. The diffusion-based contact prior enhances test-time optimization.
3. Our approach reduces annotation effort and addresses data scarcity.

Abstract

Social dynamics in close human interactions pose significant challenges for Human Mesh Estimation (HME), particularly due to the complexity of physical contacts and the scarcity of training data. Addressing these challenges, we introduce a novel data generation method that utilizes Large Vision Language Models (LVLMs) to annotate contact maps, which guide test-time optimization to produce paired images and pseudo-ground-truth meshes. This methodology not only alleviates the annotation burden but also enables the assembly of a comprehensive dataset specifically tailored for close interactions in HME. Our Ask Pose Unite (APU) dataset, comprising over 6.2k human mesh pairs in contact covering diverse interaction types, is curated from images depicting naturalistic person-to-person scenes. We empirically show that using our dataset to train a diffusion-based contact prior, used as guidance…


Taxonomy

Topics: Speech and dialogue systems