Multi-HMR: Multi-Person Whole-Body Human Mesh Recovery in a Single Shot
Fabien Baradel, Matthieu Armando, Salma Galaaoui, Romain Brégier, Philippe Weinzaepfel, Grégory Rogez, Thomas Lucas

TL;DR
Multi-HMR is a single-shot model that recovers whole-body 3D human meshes, including hands and facial expressions, for multiple people from a single RGB image, using a transformer-based architecture and a new dataset for training.
Contribution
The paper introduces Multi-HMR, a transformer-based approach for multi-person whole-body mesh recovery, and the CUFFS dataset, which improves hand pose estimation.
Findings
Multi-HMR achieves state-of-the-art results on whole-body benchmarks.
Adding the CUFFS dataset to training improves hand pose predictions.
With a ViT-S backbone and 448x448 inputs, the model is fast while remaining competitive.
Abstract
We present Multi-HMR, a strong single-shot model for multi-person 3D human mesh recovery from a single RGB image. Predictions encompass the whole body, i.e., including hands and facial expressions, using the SMPL-X parametric model and 3D location in the camera coordinate system. Our model detects people by predicting coarse 2D heatmaps of person locations, using features produced by a standard Vision Transformer (ViT) backbone. It then predicts their whole-body pose, shape and 3D location using a new cross-attention module called the Human Prediction Head (HPH), with one query attending to the entire set of features for each detected person. As direct prediction of fine-grained hands and facial poses in a single shot, i.e., without relying on explicit crops around body parts, is hard to learn from existing data, we introduce CUFFS, the Close-Up Frames of Full-Body Subjects dataset,…
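The two stages described in the abstract (coarse heatmap detection, then one query per detected person attending to all backbone features) can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the 0.5 threshold, the use of the detected patch's own feature as the query, and the single-head attention without learned projections are all assumptions made for illustration. The sketch also stops at the attended feature vector, whereas the real head goes on to regress SMPL-X pose, shape and 3D location.

```python
import math

def detect_people(heatmap, threshold=0.5):
    """Return (row, col) of each patch whose person score exceeds the threshold.
    The threshold value is a hypothetical choice, not taken from the paper."""
    return [(r, c)
            for r, row in enumerate(heatmap)
            for c, score in enumerate(row)
            if score > threshold]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(query, features):
    """Single-head scaled dot-product attention of one query over all features.
    Keys and values are the raw features (no learned projections, an assumption)."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, f)) / math.sqrt(d) for f in features]
    weights = softmax(scores)
    return [sum(w * f[i] for w, f in zip(weights, features)) for i in range(d)]

def predict_people(heatmap, feature_grid, threshold=0.5):
    """For each detected patch, build one query and attend to every patch feature."""
    flat = [f for row in feature_grid for f in row]
    outputs = []
    for r, c in detect_people(heatmap, threshold):
        query = feature_grid[r][c]  # assumed query initialisation
        outputs.append(((r, c), cross_attend(query, flat)))
    return outputs

# Toy 2x2 patch grid with 2-dimensional features per patch.
heatmap = [[0.9, 0.1],
           [0.2, 0.7]]
feature_grid = [[[1.0, 0.0], [0.0, 1.0]],
                [[0.5, 0.5], [1.0, 1.0]]]
people = predict_people(heatmap, feature_grid)
for location, attended in people:
    print(location, attended)
```

Because every detected person gets its own query over the same shared feature set, the backbone runs once per image regardless of how many people are present, which is what makes the approach single-shot.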
Taxonomy
Topics: Advanced X-ray and CT Imaging
Methods: Attention Is All You Need · Sparse Evolutionary Training · Linear Layer · Concatenated Skip Connection · Dense Connections · Label Smoothing · Adam · Vision Transformer · Softmax · Multi-Head Attention
