OmniFit: Multi-modal 3D Body Fitting via Scale-agnostic Dense Landmark Prediction
Zeyu Cai, Yuliang Xiu, Renke Wang, Zhijing Shao, Xiaoben Li, Siyuan Yu, Chao Xu, Yang Liu, Baigui Sun, Jian Yang, Zhenyu Zhang

TL;DR
OmniFit is a versatile, scale-agnostic 3D body fitting method that effectively handles multi-modal inputs and outperforms existing approaches in accuracy and robustness.
Contribution
OmniFit introduces a scale-agnostic, multi-modal body fitting approach that combines a conditional transformer decoder with a scale predictor, and it surpasses state-of-the-art methods.
Findings
Outperforms state-of-the-art methods by 57.1 to 80.9 percent.
Achieves millimeter-level accuracy on CAPE and 4D-DRESS benchmarks.
Handles diverse inputs including scans, partial observations, and images.
Abstract
Fitting an underlying body model to 3D clothed human assets has been extensively studied, yet most approaches focus on either single-modal inputs such as point clouds or multi-view images alone, often requiring a known metric scale. This constraint is frequently impractical, especially for AI-generated assets where scale distortion is common. We propose OmniFit, a method that can seamlessly handle diverse multi-modal inputs, including full scans, partial depth observations, and image captures, while remaining scale-agnostic for both real and synthetic assets. Our key innovation is a simple yet effective conditional transformer decoder that directly maps surface points to dense body landmarks, which are then used for SMPL-X parameter fitting. In addition, an optional plug-and-play image adapter incorporates visual cues to compensate for missing geometric information. We further introduce…
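To make the scale-agnostic idea concrete, here is a minimal sketch of how predicted dense landmarks could be brought into a body model's metric frame when the input scale is unknown. It uses the classical Umeyama closed-form similarity alignment, which is a swapped-in textbook technique and not the scale predictor or SMPL-X fitting procedure described in the paper. The function name `umeyama_similarity` and the synthetic landmark arrays are hypothetical and exist only for illustration.

```python
import numpy as np


def umeyama_similarity(src, dst):
    """Least-squares similarity transform (s, R, t) with dst ~= s * R @ src + t.

    src, dst: (N, 3) arrays of corresponding 3D landmarks.
    Illustrative stand-in only; not the method from the paper.
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    n = src.shape[0]

    # Center both landmark sets.
    mu_src = src.mean(axis=0)
    mu_dst = dst.mean(axis=0)
    src_c = src - mu_src
    dst_c = dst - mu_dst

    # Cross-covariance between target and source landmarks.
    cov = dst_c.T @ src_c / n
    U, D, Vt = np.linalg.svd(cov)

    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0

    R = U @ S @ Vt
    var_src = (src_c ** 2).sum() / n
    s = float(np.trace(np.diag(D) @ S) / var_src)
    t = mu_dst - s * (R @ mu_src)
    return s, R, t


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical landmarks on an asset whose scale is arbitrary.
    asset_landmarks = rng.normal(size=(512, 3))
    # Hypothetical corresponding landmarks in a metric body-model frame.
    angle = 0.3
    R_true = np.array([
        [np.cos(angle), -np.sin(angle), 0.0],
        [np.sin(angle), np.cos(angle), 0.0],
        [0.0, 0.0, 1.0],
    ])
    model_landmarks = 1.7 * asset_landmarks @ R_true.T + np.array([0.1, -0.2, 0.05])

    s, R, t = umeyama_similarity(asset_landmarks, model_landmarks)
    aligned = s * asset_landmarks @ R.T + t
    print("estimated scale:", s)
    print("max alignment error:", np.abs(aligned - model_landmarks).max())
```

Once a global scale, rotation, and translation are recovered this way, the remaining landmark residuals are what a pose and shape optimizer would have to explain. That separation is one plausible reason a dense-landmark intermediate representation is convenient for inputs with distorted scale.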
