EgoHandICL: Egocentric 3D Hand Reconstruction with In-Context Learning
Binzhu Xie, Shi Qiu, Sicheng Zhang, Yinqiao Wang, Hao Xu, Muzammal Naseer, Chi-Wing Fu, Pheng-Ann Heng

TL;DR
EgoHandICL introduces an in-context learning framework for egocentric 3D hand reconstruction, leveraging multimodal retrieval and masked autoencoders to improve robustness and generalization in challenging scenarios.
Contribution
It is the first to apply in-context learning to egocentric 3D hand reconstruction, integrating vision-language models and novel architectures for enhanced performance.
Findings
Outperforms state-of-the-art methods on ARCTIC and EgoExo4D datasets.
Demonstrates strong generalization to real-world scenarios.
Enhances hand-object interaction reasoning using reconstructed hands as prompts.
Abstract
Robust 3D hand reconstruction in egocentric vision is challenging due to depth ambiguity, self-occlusion, and complex hand-object interactions. Prior methods mitigate these issues by scaling training data or adding auxiliary cues, but they often struggle in unseen contexts. We present EgoHandICL, the first in-context learning (ICL) framework for 3D hand reconstruction that improves semantic alignment, visual consistency, and robustness under challenging egocentric conditions. EgoHandICL introduces complementary exemplar retrieval guided by vision-language models (VLMs), an ICL-tailored tokenizer for multimodal context, and a masked autoencoder (MAE)-based architecture trained with hand-guided geometric and perceptual objectives. Experiments on ARCTIC and EgoExo4D show consistent gains over state-of-the-art methods. We also demonstrate real-world generalization and improve EgoVLM…
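The exemplar retrieval step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the paper describes VLM-guided retrieval, whereas the sketch below swaps in plain cosine similarity over precomputed embedding vectors. The names `cosine`, `retrieve_exemplars`, `bank`, and `query`, and the toy 3-dimensional embeddings, are all hypothetical and chosen only for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve_exemplars(query_emb, bank_embs, k=2):
    """Return indices of the k bank entries most similar to the query.

    Assumption: embeddings are precomputed by some encoder; here they are
    plain lists of floats. This stands in for the paper's VLM-guided
    retrieval and is not a description of it.
    """
    scored = [(cosine(query_emb, emb), idx) for idx, emb in enumerate(bank_embs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # most similar first
    return [idx for _, idx in scored[:k]]

# Toy template bank (hypothetical embeddings, one row per exemplar).
bank = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 0.0, 1.0],
]
query = [1.0, 0.05, 0.0]

top_indices = retrieve_exemplars(query, bank, k=2)
print(top_indices)
```

In an in-context setup along these lines, the retrieved indices would select exemplar image and hand-parameter pairs to place in the context alongside the query; how EgoHandICL tokenizes and consumes them is not specified in the text above.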
Peer Reviews
Decision · ICLR 2026 Poster
- The proposed method effectively applies in-context learning (ICL) to 3D hand reconstruction. The retrieve-then-learn multimodal strategy is sound and offers an interesting direction for contextual adaptation when interpreting complex egocentric scenes.
- The paper is well structured and easy to follow.
- Quantitative results on ARCTIC and EgoExo4D clearly demonstrate the advantage of EgoHandICL, including bimanual and occlusion-heavy cases. Ablations cover mask ratios,
1. The retrieval step is unclear, especially the definition of template/visual images. Based on "few shot" (L.154) and Fig. 2, the template images appear to come from the same dataset, or even from frames of the same video clip. In that case, retrieval seems unnecessary and using the template images directly would be easier. The paper should also discuss what happens when the retrieval or prompt is of poor quality.
2. The proposed method relies on many pretrained foundation models. Eq. 2
Clear motivation: egocentric hand reconstruction is challenging. The paper identifies the challenges in the problem and provides sound solutions. Novel and sound idea: The retrieval and in-context learning are novel and sound approaches. Instead of simply training a bigger network or adding more auxiliary cues, the authors explicitly retrieve semantically similar examples and feed them as context to guide inference. In particular, instead of simply retrieving raw RGB crops,
The method may depend on the quality of retrieval: Although overall performance likely depends on the quality of the templates, the paper does not quantify the success or failure of the retrieval stage. Including such results would help. More careful analysis required to validate the effectiveness: Deep learning models frequently suffer on out-of-distribution (OOD) samples, i.e., test samples that lie far from the samples seen during training. Using the retri
- Novelty: The application of in-context learning to 3D hand reconstruction is highly novel. On top of state-of-the-art pose estimators, it provides dynamic, example-based reasoning at inference time. Using a VLM for semantic retrieval (e.g., finding similar interactions or occlusion types) rather than just visual similarity is a powerful and unique idea for this problem.
- Technical soundness: The method is well formulated. Refining a coarse prediction with exemplar MANO pairs is a logical a
Further clarifications on the following points would be appreciated.
- Inference cost trade-off: The proposed method introduces significant computational overhead (VLM text generation, template retrieval, multimodal tokenization, and MAE transformer inference) beyond the base regressor. An estimate of this additional burden, using metrics like FPS or FLOPs, is needed to provide a comprehensive view of the test-time bottleneck for potential real-time applications.
- Impact of template size (N)
Taxonomy
Topics: Human Pose and Action Recognition · Robot Manipulation and Learning · Multimodal Machine Learning Applications
