SesaHand: Enhancing 3D Hand Reconstruction via Controllable Generation with Semantic and Structural Alignment

Zhuoran Zhao; Xianghao Kong; Linlin Yang; Zheng Wei; Pan Hui; Anyi Rao

arXiv:2603.00443 · cs.CV · March 3, 2026

Open Access · 3 Reviews

TL;DR

SesaHand improves 3D hand reconstruction by generating controllable hand images that are semantically and structurally aligned. Hierarchical fusion and attention mechanisms raise both the quality of the generated images and the accuracy of reconstruction.

Contribution

The paper proposes a new framework combining semantic and structural alignment for controllable hand image generation, enhancing 3D reconstruction performance.

Findings

01. Outperforms prior methods in hand image generation quality
02. Improves 3D hand reconstruction accuracy
03. Effective semantic and structural alignment techniques

Abstract

Recent studies on 3D hand reconstruction have demonstrated the effectiveness of synthetic training data to improve estimation performance. However, most methods rely on game engines to synthesize hand images, which often lack diversity in textures and environments, and fail to include crucial components like arms or interacting objects. Generative models are promising alternatives to generate diverse hand images, but still suffer from misalignment issues. In this paper, we present SesaHand, which enhances controllable hand image generation from both semantic and structural alignment perspectives for 3D hand reconstruction. Specifically, for semantic alignment, we propose a pipeline with Chain-of-Thought inference to extract human behavior semantics from image captions generated by the Vision-Language Model. This semantics suppresses human-irrelevant environmental details and ensures…

Peer Reviews

Decision · ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 5

Strengths

1. The problem statement, improving hand image generation quality to improve the hand estimation task, is sound. Data scarcity and fidelity are key problems, and current SOTA image generation models still occasionally fail to generate good hands.
2. This work shows that using CoT to improve VLM caption alignment, together with attention bias, improves hand image generation compared with some prior works, thereby improving 3D hand reconstruction from synthetic images.
3. Writing and presentation clarity are

Weaknesses

1. Though the paper focuses on improving semantic and hand structural alignment, object alignment in hand-object interaction (HOI) seems to be neglected; most examples show only simple cases of hands holding objects. This work would be stronger if a more challenging and comprehensive study of HOI generation were demonstrated.
2. The examples lack diversity in style and viewpoint, making it difficult to judge how robust the proposed method is.
3. Comparison with the latest works, like FoundHand, is under investigati

Reviewer 02 · Rating 4 · Confidence 4

Strengths

1. The CoT-driven semantic extraction and the hierarchical structural fusion are well motivated and technically coherent; together they target complementary failure modes (semantic drift vs. structural misalignment).
2. The supplemental comparisons against commercial models indicate task-specific controllability and suggest that the proposed design adapts well to hand-centric scenarios where generic generators struggle.

Weaknesses

1. The perceptual quality of generated images appears limited: several qualitative examples show unnatural finger articulation and color artifacts. Relative to SOTA hand rendering methods [i-iii] (acknowledging different goals and toolchains), the realism gap remains noticeable.
2. Section 4.2 does not provide sufficient detail for the reconstruction setup. The paper does not specify the number of synthetic images used, training schedule, the proportion of generated vs. original data, or whether

Reviewer 03 · Rating 6 · Confidence 2

Strengths

**(1) Good presentation quality** The paper is well written overall, and the text is easy to follow.

**(2) Good experimental results** The proposed method achieves strong results in both (1) controlled hand image generation and (2) hand pose estimation (when training an estimation model on the generated hand images). However, I have a few questions regarding the comparison settings (see the weaknesses section below).

**(3) Good analysis to justify model design choices** Although the propose

Weaknesses

**(1) Questionable omission of FoundHand in quantitative comparisons** I believe FoundHand is one of the most relevant baseline works to this paper. Although it is discussed in the related work section and qualitative comparisons are provided in the supplementary material, I wonder why this baseline is omitted from the main quantitative experiments for both hand image generation and hand pose estimation. Since the code is publicly available, was there any specific reason why this comparison cou

Taxonomy

Topics: Human Pose and Action Recognition · Hand Gesture Recognition Systems · Robot Manipulation and Learning