SmartAvatar: Text- and Image-Guided Human Avatar Generation with VLM AI Agents

Alexander Huang-Menders; Xinhang Liu; Andy Xu; Yuyao Zhang; Chi-Keung Tang; Yu-Wing Tai

arXiv:2506.04606·cs.CV·June 6, 2025

Open Access

TL;DR

SmartAvatar introduces an AI-driven framework that uses vision-language models and an iterative verification process to generate high-quality, customizable 3D human avatars from images or text prompts, with fine control and animation capabilities.

Contribution

It presents a novel autonomous verification loop leveraging VLMs for precise, iterative refinement of avatars, enhancing control, quality, and customization over existing diffusion-based methods.

Findings

01. Outperforms recent systems in mesh quality and identity fidelity.

02. Enables fine-grained, iterative avatar customization via natural language.

03. Supports pose manipulation with consistent identity and appearance.

Abstract

SmartAvatar is a vision-language-agent-driven framework for generating fully rigged, animation-ready 3D human avatars from a single photo or textual prompt. While diffusion-based methods have made progress in general 3D object generation, they continue to struggle with precise control over human identity, body shape, and animation readiness. In contrast, SmartAvatar leverages the commonsense reasoning capabilities of large vision-language models (VLMs) in combination with off-the-shelf parametric human generators to deliver high-quality, customizable avatars. A key innovation is an autonomous verification loop, where the agent renders draft avatars, evaluates facial similarity, anatomical plausibility, and prompt alignment, and iteratively adjusts generation parameters for convergence. This interactive, AI-guided refinement process promotes fine-grained control over both facial and body…
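
The verification loop described in the abstract (render a draft, score it on facial similarity, anatomical plausibility, and prompt alignment, then adjust the generation parameters) can be illustrated with a minimal Python sketch. This is not the authors' implementation: every name below (`Scores`, `refine`, `toy_evaluate`, `toy_adjust`, `TARGET`, and the parameter names) is hypothetical, and the VLM and renderer are replaced by toy functions that compare parameters against a known target so the loop can run on its own.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

Params = Dict[str, float]


@dataclass
class Scores:
    """Three criteria named in the abstract, each assumed to lie in [0, 1]."""
    facial_similarity: float
    anatomical_plausibility: float
    prompt_alignment: float

    def passed(self, threshold: float) -> bool:
        # Accept a draft only when every criterion clears the threshold.
        worst = min(self.facial_similarity,
                    self.anatomical_plausibility,
                    self.prompt_alignment)
        return worst >= threshold


def refine(params: Params,
           evaluate: Callable[[Params], Scores],
           adjust: Callable[[Params, Scores], Params],
           threshold: float = 0.85,
           max_iters: int = 20) -> Tuple[Params, int]:
    """Evaluate, then adjust, until the draft passes or the budget runs out.

    Returns the final parameters and the number of adjustments applied.
    """
    current = dict(params)
    for step in range(max_iters):
        scores = evaluate(current)  # stand-in for "render draft + ask VLM"
        if scores.passed(threshold):
            return current, step
        current = adjust(current, scores)
    return current, max_iters


# Toy stand-ins (assumptions): a real system would render the avatar and
# query a VLM; here the "judge" simply measures distance to a known target.
TARGET: Params = {"jaw_width": 0.8, "height": 0.3, "hair_length": 0.6}


def toy_evaluate(p: Params) -> Scores:
    return Scores(
        facial_similarity=1.0 - abs(p["jaw_width"] - TARGET["jaw_width"]),
        anatomical_plausibility=1.0 - abs(p["height"] - TARGET["height"]),
        prompt_alignment=1.0 - abs(p["hair_length"] - TARGET["hair_length"]),
    )


def toy_adjust(p: Params, scores: Scores) -> Params:
    # Move every parameter halfway toward the target (scores unused here).
    return {k: v + 0.5 * (TARGET[k] - v) for k, v in p.items()}


if __name__ == "__main__":
    start = {"jaw_width": 0.0, "height": 0.0, "hair_length": 0.0}
    final_params, adjustments = refine(start, toy_evaluate, toy_adjust)
    print(final_params, adjustments)
```

Taking the minimum over the three scores means a draft is rejected if any single criterion fails, which is one plausible reading of "evaluates ... and iteratively adjusts generation parameters for convergence"; the source does not specify how the criteria are combined.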

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Face recognition and analysis · Human Motion and Animation

Methods: Diffusion