Pose Priors from Language Models

Sanjay Subramanian, Evonne Ng, Lea Müller, Dan Klein, Shiry Ginosar, and Trevor Darrell

arXiv:2405.03689·cs.CV·May 16, 2025

Open Access · 4 Reviews

TL;DR

This paper introduces a novel approach that uses large multimodal models as priors to improve 3D human pose estimation by capturing contact semantics, reducing reliance on manual annotations or motion capture data.

Contribution

It presents a scalable method leveraging LMMs to extract contact-relevant descriptors for constraining 3D human pose optimization, enabling more accurate contact pose reconstructions.

Findings

1. Effective in two-person interaction scenarios
2. Accurately captures physical and social contact semantics
3. Offers a scalable alternative to manual annotations

Abstract

Language is often used to describe physical interaction, yet most 3D human pose estimation methods overlook this rich source of information. We bridge this gap by leveraging large multimodal models (LMMs) as priors for reconstructing contact poses, offering a scalable alternative to traditional methods that rely on human annotations or motion capture data. Our approach extracts contact-relevant descriptors from an LMM and translates them into tractable losses to constrain 3D human pose optimization. Despite its simplicity, our method produces compelling reconstructions for both two-person interactions and self-contact scenarios, accurately capturing the semantics of physical and social interactions. Our results demonstrate that LMMs can serve as powerful tools for contact prediction and pose estimation, offering an alternative to costly manual human annotations or motion capture data.…
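The abstract's step of turning contact descriptors into "tractable losses" can be illustrated with a minimal sketch. It assumes, hypothetically, that the LMM output has already been parsed into pairs of coarse body-region names and that each region maps to a set of 3D mesh vertices. The region names, the `contact_pairs` format, and the minimum-distance form of the loss are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def region_distance(verts_a, verts_b):
    """Minimum Euclidean distance between two vertex sets of shape (N, 3) and (M, 3)."""
    diff = verts_a[:, None, :] - verts_b[None, :, :]   # (N, M, 3) pairwise differences
    return np.sqrt((diff ** 2).sum(axis=-1)).min()

def contact_loss(regions_a, regions_b, contact_pairs):
    """Sum of minimum distances over region pairs predicted to be in contact.

    regions_a, regions_b: dicts mapping a (hypothetical) region name to (N, 3) vertices.
    contact_pairs: list of (region_on_a, region_on_b) tuples, assumed to come from the LMM.
    """
    return sum(region_distance(regions_a[a], regions_b[b]) for a, b in contact_pairs)

# Toy example: person A's hand should touch person B's shoulder.
person_a = {"hand": np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])}
person_b = {"shoulder": np.array([[1.0, 0.0, 2.0], [5.0, 5.0, 5.0]])}
loss = contact_loss(person_a, person_b, [("hand", "shoulder")])
```

The loss reaches zero when every listed region pair touches, so minimizing it alongside the usual image-based terms would pull the predicted contacts together. In a real optimizer the same computation would be written in a differentiable framework.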

Peer Reviews

Decision·ICLR 2025 Conference Withdrawn Submission

Reviewer 01 · Rating 5 · Confidence 4

Strengths

- The paper is well written and flows well. The terminology and methodology are explained well.
- Interesting use of LLMs: with the rise of LLMs, it is interesting to see vision problems being solved via language, especially in 3D computer vision, and surpassing traditional methods. The authors have used LLMs to generate constraints for pose estimation.
- The proposed method performs well compared to traditional techniques.

Weaknesses

My main concerns with the paper are:
- It is not clear why solving touch constraints helps with pose optimization. Are there failure cases in pose optimization other than touch constraints? Put differently: if I want to use LLMs for better pose estimation, what is the first thing I would ask the LLM? Why is it contact information in this case? Why not some other information, such as the visibility of regions or the positioning of regions relative to others?
- What is the practical comparison

Reviewer 02 · Rating 5 · Confidence 5

Strengths

1. The idea of using semantic guidance from an LLM to refine the contact states of 3D human regression makes sense.
2. The experimental results on Hi4D, FlickrCHI3D, and MOYO look promising. The qualitative results look very encouraging.
3. The writing is clear and easy to understand.

Weaknesses

1. In Tab. 1, the performance degradation on CHI3D is confusing, because the conflicting results could support conflicting conclusions. To resolve this, further discussion is needed:
   a. A brief explanation of why introducing ProsePose makes the PA-MPJPE on CHI3D worse.
   b. If dataset-specific characteristics might cause this degradation, examining them would give a deeper understanding of the proposed method.
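For readers unfamiliar with the metric named in this review, PA-MPJPE is the mean per-joint position error computed after Procrustes alignment, which removes global scale, rotation, and translation from the prediction before comparing it to ground truth. The sketch below is a generic NumPy version of that standard definition, not the paper's evaluation code; the function name and the `(J, 3)` array layout are assumptions.

```python
import numpy as np

def pa_mpjpe(pred, gt):
    """Mean per-joint error after similarity (Procrustes) alignment of pred to gt.

    pred, gt: arrays of shape (J, 3) holding 3D joint positions.
    """
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    p, g = pred - mu_p, gt - mu_g                 # center both joint sets
    U, S, Vt = np.linalg.svd(p.T @ g)             # SVD of the 3x3 cross-covariance
    # Flip the last axis if needed so the result is a rotation, not a reflection.
    d = -1.0 if np.linalg.det(Vt.T @ U.T) < 0 else 1.0
    D = np.diag([1.0, 1.0, d])
    R = Vt.T @ D @ U.T                            # optimal rotation (Kabsch)
    scale = (S * np.diag(D)).sum() / (p ** 2).sum()  # optimal isotropic scale (Umeyama)
    aligned = scale * p @ R.T + mu_g
    return np.linalg.norm(aligned - gt, axis=1).mean()

# A prediction that differs from ground truth only by scale, rotation, and shift
# should score (numerically) zero.
rng = np.random.default_rng(0)
gt_joints = rng.normal(size=(17, 3))
theta = 0.7
Rz = np.array([[np.cos(theta), -np.sin(theta), 0.0],
               [np.sin(theta),  np.cos(theta), 0.0],
               [0.0, 0.0, 1.0]])
pred_joints = 2.0 * gt_joints @ Rz.T + np.array([0.5, -1.0, 3.0])
err_similar = pa_mpjpe(pred_joints, gt_joints)
err_noisy = pa_mpjpe(pred_joints + rng.normal(scale=0.1, size=(17, 3)), gt_joints)
```

Because the alignment discards global pose, PA-MPJPE reflects only articulated shape error. This is one reason it can move differently from metrics that are sensitive to the relative placement of two people.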

Reviewer 03 · Rating 5 · Confidence 3

Strengths

This is a well-motivated and well-written systems paper. The inclusion of multiple components and stages makes for a challenging setup, yet the paper successfully scales the approach across multiple datasets, achieving performance that is either better than or on par with existing methods. Leveraging LLMs in a zero-shot setting for contact inference is a valuable contribution, automating the process and adding guidance during optimization. This approach appears especially effective in the two-pe

Weaknesses

1- Although the proposed zero-shot setting is intriguing at first, its practical applications are challenging to foresee. Despite the claims (line 553), the approach falls short of achieving accurate contact modeling in 3D reconstructions. It’s unclear whether language is a suitable modality to provide sufficient information for this task; in fact, its limitations are inherent. For instance, body segmentations must remain coarse due to natural language constraints, and when coupled with the prop

Reviewer 04 · Rating 5 · Confidence 3

Strengths

1. The idea of extracting a rich semantic pose prior from an LLM to aid pose estimation is interesting and has great potential for future applications, including VR/AR and the metaverse.
2. Experimental results demonstrate the effectiveness of the proposed method.

Weaknesses

1. The organization of the introduction makes it hard to follow. It reads as piecemeal: the connections and divisions between the problem to be solved, the motivation, and the main insight are stiff. For example, in L 56-57, existing methods struggle with the proposed problem, but what is the underlying reason, and why does it inspire the authors to use an LLM?
2. The ablation in Table 3 shows that the interpenetration loss (Eq.6) has a negative impact on the estimation results. Is there any possible

Taxonomy

Topics: Natural Language Processing Techniques