Training-Free Dense Hand Contact Estimation with Multi-Modal Large Language Models
Daniel Sungho Jung, Kyoung Mu Lee

TL;DR
This paper introduces ContactPrompt, a training-free, zero-shot method leveraging multi-modal large language models for dense hand contact estimation, combining semantic understanding with geometric reasoning.
Contribution
The paper proposes a structured approach that encodes 3D hand geometry in a form an MLLM can consume and performs multi-stage contact reasoning without any training. The authors report that it outperforms previous supervised dense contact estimation methods.
Findings
Outperforms previous supervised dense contact estimation methods
Uses structured hand-part segmentation and vertex-grid representation
Enables precise dense contact prediction without training
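To make the idea of a vertex-grid representation concrete, here is a minimal sketch of how per-part vertex indices could be laid out as text for a language model and how a reply could be mapped back to a dense per-vertex mask. This is an illustration only, not the authors' implementation: the function names `build_vertex_grid` and `parse_contact_reply`, the part names, the grid width, and the comma-separated reply format are all hypothetical, and the summary above does not specify how ContactPrompt actually encodes the grid or queries the model.

```python
from collections import defaultdict


def build_vertex_grid(part_labels, cols=4):
    """Group vertex ids by hand part and lay each group out in rows of `cols` ids.

    `part_labels[i]` is the (hypothetical) part name of vertex i.
    """
    groups = defaultdict(list)
    for vertex_id, part in enumerate(part_labels):
        groups[part].append(vertex_id)

    lines = []
    for part in sorted(groups):
        lines.append(f"[{part}]")
        ids = groups[part]
        for start in range(0, len(ids), cols):
            lines.append(" ".join(str(i) for i in ids[start:start + cols]))
    return "\n".join(lines)


def parse_contact_reply(reply, num_vertices):
    """Convert a comma-separated list of vertex ids into a dense 0/1 mask.

    Tokens that are not valid vertex ids are ignored (assumed reply format).
    """
    mask = [0] * num_vertices
    for token in reply.split(","):
        token = token.strip()
        if token.isdecimal() and int(token) < num_vertices:
            mask[int(token)] = 1
    return mask


# Toy hand with six vertices; a real hand mesh has far more.
part_labels = ["palm", "palm", "thumb", "index", "thumb", "palm"]
grid_text = build_vertex_grid(part_labels, cols=2)
mask = parse_contact_reply("2, 4, 99, x", num_vertices=6)
```

In this toy setup, `grid_text` would be embedded in the text sent to the model, and `mask` holds one binary contact label per vertex, with out-of-range or malformed tokens dropped rather than raising an error.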
Abstract
Dense hand contact estimation requires both high-level semantic understanding and fine-grained geometric reasoning of human interaction to accurately localize contact regions. Recently, multi-modal large language models (MLLMs) have demonstrated strong capabilities in understanding visual semantics, enabled by vision-language priors learned from large-scale data. However, leveraging MLLMs for dense hand contact estimation remains underexplored. There are two major challenges in applying MLLMs to dense hand contact estimation. First, encoding explicit 3D hand geometry is difficult, as MLLMs primarily operate on vision and language modalities. Second, capturing fine-grained vertex-level contact remains challenging, as MLLMs tend to focus on high-level semantics rather than detailed geometric reasoning. To address these challenges, we propose ContactPrompt, a training-free and zero-shot…
