SesaHand: Enhancing 3D Hand Reconstruction via Controllable Generation with Semantic and Structural Alignment
Zhuoran Zhao, Xianghao Kong, Linlin Yang, Zheng Wei, Pan Hui, Anyi Rao

TL;DR
SesaHand introduces a novel approach for 3D hand reconstruction by generating controllable, semantically and structurally aligned hand images, improving both image quality and reconstruction accuracy through hierarchical fusion and attention mechanisms.
Contribution
The paper proposes a new framework combining semantic and structural alignment for controllable hand image generation, enhancing 3D reconstruction performance.
Findings
Outperforms prior methods in hand image generation quality
Improves 3D hand reconstruction accuracy
Effective semantic and structural alignment techniques
Abstract
Recent studies on 3D hand reconstruction have demonstrated the effectiveness of synthetic training data to improve estimation performance. However, most methods rely on game engines to synthesize hand images, which often lack diversity in textures and environments, and fail to include crucial components like arms or interacting objects. Generative models are promising alternatives to generate diverse hand images, but still suffer from misalignment issues. In this paper, we present SesaHand, which enhances controllable hand image generation from both semantic and structural alignment perspectives for 3D hand reconstruction. Specifically, for semantic alignment, we propose a pipeline with Chain-of-Thought inference to extract human behavior semantics from image captions generated by the Vision-Language Model. This semantics suppresses human-irrelevant environmental details and ensures…
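To make the idea of suppressing human-irrelevant environmental details concrete, here is a toy sketch only. It is not the paper's pipeline, which relies on a Vision-Language Model with Chain-of-Thought inference; the keyword list `HUMAN_TERMS` and the function `extract_behavior_semantics` are hypothetical stand-ins chosen purely for illustration.

```python
import re

# Hypothetical keyword list, assumed for illustration only.
# The paper uses a VLM with Chain-of-Thought inference, not keyword matching.
HUMAN_TERMS = {
    "hand", "hands", "finger", "fingers", "arm", "arms",
    "person", "man", "woman", "holding", "grasping", "gripping",
}


def extract_behavior_semantics(caption: str) -> str:
    """Keep only the caption clauses that mention a human-related term."""
    # Split the caption into clauses on common punctuation.
    clauses = [c.strip() for c in re.split(r"[,.;]", caption) if c.strip()]
    kept = []
    for clause in clauses:
        words = set(re.findall(r"[a-z]+", clause.lower()))
        # A clause survives only if it shares a word with HUMAN_TERMS.
        if HUMAN_TERMS & words:
            kept.append(clause)
    return ", ".join(kept)


caption = (
    "A man is holding a red mug, sunlight streams through a large window, "
    "a bookshelf stands in the background"
)
print(extract_behavior_semantics(caption))
```

In this toy version the two environment clauses are dropped and only the clause describing the person and the held object is kept, which is the kind of filtered text one might then pass to an image generator as a prompt.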
Peer Reviews
Decision · ICLR 2026 Poster
1. The problem statement of improving hand image generation quality to benefit the hand estimation task is sound. Data scarcity and fidelity are key problems, and current SOTA image generation models still occasionally fail to generate good hands. 2. This work shows that using CoT to improve the caption alignment of the VLM, together with an attention bias, improves hand image generation compared with some prior works, thereby improving 3D hand reconstruction from synthetic images. 3. Writing and presentation clarity are
1. Though the paper focuses on improving semantic and hand structural alignment, object alignment in HOI seems to be neglected; most examples only show simple cases of hands holding objects. This work would be stronger if a more challenging and comprehensive study of HOI generation were demonstrated. 2. The examples lack diversity in style and viewpoint, so it is difficult to judge how robust the proposed method is. 3. Comparison with the latest works, like FoundHand, is under investigati
1. The CoT-driven semantic extraction and the hierarchical structural fusion are well motivated and technically coherent; together they target complementary failure modes (semantic drift vs. structural misalignment). 2. The supplemental comparisons against commercial models indicate task-specific controllability and suggest that the proposed design adapts well to hand-centric scenarios where generic generators struggle.
1. The perceptual quality of generated images appears limited: several qualitative examples show unnatural finger articulation and color artifacts. Relative to SOTA hand rendering methods [i-iii] (acknowledging different goals and toolchains), the realism gap remains noticeable. 2. Section 4.2 does not provide sufficient detail for the reconstruction setup. The paper does not specify the number of synthetic images used, training schedule, the proportion of generated vs. original data, or whether
**(1) Good presentation quality** The paper is well written overall, and the text is easy to follow. **(2) Good experimental results** The proposed method achieves strong results in both (1) controlled hand image generation and (2) hand pose estimation (when training an estimation model on the generated hand images). However, I have a few questions regarding the comparison settings (see the weaknesses section below). **(3) Good analysis to justify model design choices** Although the propose
**(1) Questionable omission of FoundHand in quantitative comparisons** I believe FoundHand is one of the most relevant baseline works to this paper. Although it is discussed in the related work section and qualitative comparisons are provided in the supplementary material, I wonder why this baseline is omitted from the main quantitative experiments for both hand image generation and hand pose estimation. Since the code is publicly available, was there any specific reason why this comparison cou
Taxonomy
Topics: Human Pose and Action Recognition · Hand Gesture Recognition Systems · Robot Manipulation and Learning
