Pose Priors from Language Models
Sanjay Subramanian, Evonne Ng, Lea Müller, Dan Klein, Shiry Ginosar, and Trevor Darrell

TL;DR
This paper introduces a novel approach that uses large multimodal models as priors to improve 3D human pose estimation by capturing contact semantics, reducing reliance on manual annotations or motion capture data.
Contribution
It presents a scalable method leveraging LMMs to extract contact-relevant descriptors for constraining 3D human pose optimization, enabling more accurate contact pose reconstructions.
Findings
Effective in two-person interaction scenarios
Accurately captures physical and social contact semantics
Offers a scalable alternative to manual annotations
Abstract
Language is often used to describe physical interaction, yet most 3D human pose estimation methods overlook this rich source of information. We bridge this gap by leveraging large multimodal models (LMMs) as priors for reconstructing contact poses, offering a scalable alternative to traditional methods that rely on human annotations or motion capture data. Our approach extracts contact-relevant descriptors from an LMM and translates them into tractable losses to constrain 3D human pose optimization. Despite its simplicity, our method produces compelling reconstructions for both two-person interactions and self-contact scenarios, accurately capturing the semantics of physical and social interactions. Our results demonstrate that LMMs can serve as powerful tools for contact prediction and pose estimation, offering an alternative to costly manual human annotations or motion capture data.…
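To make the idea of turning contact descriptors into "tractable losses" concrete, here is a minimal sketch, not the authors' implementation. It assumes the LMM output has already been parsed into pairs of body-region names, and that each person's mesh vertices are grouped by region. The names `min_region_distance`, `contact_loss`, the region labels, and the toy coordinates are all hypothetical.

```python
import math


def min_region_distance(points_a, points_b):
    """Smallest Euclidean distance between any point in A and any point in B."""
    return min(math.dist(p, q) for p in points_a for q in points_b)


def contact_loss(regions_a, regions_b, contact_pairs):
    """Sum of minimum distances over region pairs predicted to be in contact.

    regions_a / regions_b: dict mapping a region name to a list of 3D points
    (assumed layout). contact_pairs: list of (region_on_a, region_on_b) names,
    assumed to come from parsed LMM output.
    """
    return sum(
        min_region_distance(regions_a[name_a], regions_b[name_b])
        for name_a, name_b in contact_pairs
    )


# Hypothetical toy data: two vertices per region, coordinates in meters.
person_a = {"right hand": [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0)]}
person_b = {"left shoulder": [(0.1, 0.3, 0.4), (1.0, 1.0, 1.0)]}
pairs = [("right hand", "left shoulder")]

loss = contact_loss(person_a, person_b, pairs)
print(loss)  # approximately 0.5 for this toy data
```

In a real optimizer this term would be written in a differentiable framework and minimized over pose parameters together with the other objectives. The pure-Python version only illustrates how a loss of this shape reaches zero when the named regions touch.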
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
- The paper is well written and flows well. The terminology and methodology are explained well.
- Interesting use of LLMs: with the rise of LLMs, it is interesting to see vision problems being solved via language, especially for 3D computer vision, and surpassing traditional methods. The authors have used LLMs to generate constraints for pose estimation.
- The proposed method performs well compared to traditional techniques.
My main concerns with the paper are:
- It is not clear why solving touch constraints helps with pose optimization. Are there failure cases in pose optimization other than touch constraints? I guess what I am not clear on is this: if I want to use LLMs for better pose estimation, what is the first thing I'll ask the LLM? Why is it contact information in this case? Why not some other information, such as the visibility of regions or the positioning of regions relative to others?
- What is the practical comparison
1. The idea of using semantic guidance from LLM to refine the contact states of 3D human regression makes sense.
2. The experiment results on Hi4D, FlickrCHI3D, and MOYO look promising. The qualitative results look very encouraging.
3. The writing is clear and easy to understand.
1. In Tab. 1, the performance degradation on CHI3D is confusing, because the conflicting results could support conflicting conclusions. To resolve this, the paper needs further discussion of:
   a. why introducing ProsePose makes the PA-MPJPE on CHI3D worse (a brief explanation would suffice);
   b. whether dataset-specific characteristics might cause this degradation; digging into this point may provide a more in-depth understanding of the proposed method.
This is a well-motivated and well-written systems paper. The inclusion of multiple components and stages makes for a challenging setup, yet the paper successfully scales the approach across multiple datasets, achieving performance that is either better than or on par with existing methods. Leveraging LLMs in a zero-shot setting for contact inference is a valuable contribution, automating the process and adding guidance during optimization. This approach appears especially effective in the two-pe
1- Although the proposed zero-shot setting is intriguing at first, its practical applications are challenging to foresee. Despite the claims (line 553), the approach falls short of achieving accurate contact modeling in 3D reconstructions. It’s unclear whether language is a suitable modality to provide sufficient information for this task; in fact, its limitations are inherent. For instance, body segmentations must remain coarse due to natural language constraints, and when coupled with the prop
1. The idea of extracting a rich semantic pose prior from an LLM to aid pose estimation is interesting and has great potential for future applications, including VR/AR and the metaverse.
2. Experimental results demonstrate the effectiveness of the proposed method.
1. The organization of the introduction makes it hard to follow. It reads as piecemeal, because the transitions between the problem to be solved, the motivation, and the main insight are stiff. For example, L 56-57 state that existing methods struggle with the proposed problem, but what is the underlying reason, and why does it motivate the authors to use an LLM?
2. The ablation in Table 3 shows that the interpenetration loss (Eq.6) has a negative impact on the estimation results. Is there any possible
Taxonomy
Topics: Natural Language Processing Techniques
