LHM++: An Efficient Large Human Reconstruction Model for Pose-free Images to 3D

Lingteng Qiu; Peihao Li; Heyuan Li; Qi Zuo; Xiaodong Gu; Yuan Dong; Weihao Yuan; Rui Peng; Siyu Zhu; Xiaoguang Han; Guanying Chen; Zilong Dong

arXiv:2506.13766·cs.CV·March 17, 2026



Open Access 3 Reviews

TL;DR

LHM++ is an efficient model that reconstructs high-quality, animatable 3D human avatars within seconds from one or more casually captured, pose-free images, using an encoder-decoder point-image transformer and a lightweight 3D-aware neural animation renderer.

Contribution

The paper introduces LHM++, a large-scale human reconstruction model that operates without camera or pose data, utilizing a transformer-based architecture for detailed 3D avatar generation.

Findings

01. Produces high-fidelity 3D humans from pose-free images

02. Operates in seconds with real-time rendering

03. Outperforms existing methods in visual quality and efficiency

Abstract

Reconstructing animatable 3D humans from casually captured images of articulated subjects without camera or pose information is highly practical but remains challenging due to view misalignment, occlusions, and the absence of structural priors. In this work, we present LHM++, an efficient large-scale human reconstruction model that generates high-quality, animatable 3D avatars within seconds from one or multiple pose-free images. At its core is an Encoder-Decoder Point-Image Transformer architecture that progressively encodes and decodes 3D geometric point features to improve efficiency, while fusing hierarchical 3D point features with image features through multimodal attention. The fused features are decoded into 3D Gaussian splats to recover detailed geometry and appearance. To further enhance visual fidelity, we introduce a lightweight 3D-aware neural animation renderer that refines…

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 4

Strengths

The paper is easy to follow. Synthesizing 3D/4D humans from images is an interesting task with practical applications. The method is technically sound, leveraging a multimodal transformer architecture to fuse 3D and 2D features for 3D Gaussian generation.

Weaknesses

Limited technical contribution. This paper is an extension of LHM, and the main difference is that LHM++ replaces MBHT with PIT. However, both MBHT and PIT fuse 3D and 2D features for 3D Gaussian prediction. Can LHM process multiple images by fusing them with the MBHT architecture? Why is PIT required, and how does it outperform MBHT? The paper proposes that the PIT architecture improves the results, whereas the results in Tab. 10 suggest that the number of

Reviewer 02 · Rating 6 · Confidence 3

Strengths

• The paper is well-written with a logical structure that makes the technical contributions easy to follow.
• The proposed framework is reasonable and well-justified. The experimental results convincingly demonstrate the effectiveness of the approach across different scenarios.
• The demo videos are excellent supplementary materials.

Weaknesses

Could you please discuss the difference from 3D generation models such as CLAY (Rodin)? I wonder whether Rodin could be used to generate a 3D avatar, followed by auto-rigging with a tool such as Mixamo.

Reviewer 03 · Rating 6 · Confidence 4

Strengths

* The paper demonstrates the ability to animate loose clothes and to generalize, both of which are challenging in human rendering.
* With "merge" and "unmerge", the model runs much faster than LHM with a lower memory cost.
* The paper is clearly written and highlights the contributions.

Weaknesses

* The paper claims that in LHM, the time complexity of self-attention operations scales quadratically with the number of image tokens (and thus with the number of input images). Meanwhile, as the number of input images increases, image tokens begin to dominate the attention computation. Although the proposed “merge” and “unmerge” operations help reduce memory and computational overhead, the overall self-attention complexity remains quadratic with respect to the number of images. These operations
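The scaling concern in this review can be illustrated with a small counting sketch. It assumes joint self-attention over a fixed set of point tokens plus the image tokens from all input views, and it models "merge" simply as dividing the image token count by a constant factor. The token counts and the merge ratio are hypothetical values chosen for illustration, not figures from the paper, and the function name is not from the authors' code.

```python
def attention_pairs(num_images, tokens_per_image=1024,
                    num_point_tokens=2048, merge_ratio=1):
    """Count query-key pairs in one joint self-attention layer.

    All defaults are hypothetical; merge_ratio models token merging as a
    constant-factor reduction of the image tokens (an assumption).
    """
    image_tokens = num_images * tokens_per_image // merge_ratio
    total_tokens = num_point_tokens + image_tokens
    return total_tokens * total_tokens


if __name__ == "__main__":
    print("images  unmerged_pairs  merged_pairs")
    for n in (1, 2, 4, 8, 16):
        unmerged = attention_pairs(n)
        merged = attention_pairs(n, merge_ratio=4)
        print(f"{n:6d}  {unmerged:14d}  {merged:12d}")
```

Under these assumptions, merging lowers the cost by a constant factor at every image count, but doubling the number of images still roughly quadruples the pair count once image tokens dominate, which is the point the reviewer raises.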


Taxonomy

Topics: Human Pose and Action Recognition · Advanced Vision and Imaging · Human Motion and Animation

Methods: Dropout · Dense Connections · Absolute Position Encodings · Layer Normalization · Byte Pair Encoding · Label Smoothing · Softmax · Transformer