Say It, See It: A Systematic Evaluation on Speech-Based 3D Content Generation Methods in Augmented Reality
Yanming Xiu, Joshua Chilukuri, Shunav Sen, Maria Gorlatova

TL;DR
This paper systematically evaluates speech-based 3D content generation methods for augmented reality, comparing direct text-to-3D and text-image-to-3D pipelines in terms of quality, speed, and user satisfaction.
Contribution
It introduces a modular architecture for AR 3D content generation, enabling systematic comparison of different pipelines and insights into their performance trade-offs.
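A minimal sketch of what "modular" could mean here, assuming each stage is a swappable callable. The `Pipeline` class, the stage signatures, and the stub functions are hypothetical illustrations, not the authors' implementation or any real model API. An optional image stage is the only difference between the direct text-to-3D pathway and the text-image-to-3D pathway.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical stage signatures; the paper's actual interfaces are not given here.
SpeechToText = Callable[[bytes], str]
TextToImage = Callable[[str], bytes]
To3D = Callable[[str, Optional[bytes]], bytes]  # (prompt, optional image) -> mesh bytes


@dataclass
class Pipeline:
    """One speech-to-3D pipeline assembled from interchangeable stages."""
    name: str
    speech_to_text: SpeechToText
    to_3d: To3D
    text_to_image: Optional[TextToImage] = None  # None means direct text-to-3D

    def generate(self, audio: bytes) -> bytes:
        prompt = self.speech_to_text(audio)
        # The image stage runs only in the text-image-to-3D pathway.
        image = self.text_to_image(prompt) if self.text_to_image else None
        return self.to_3d(prompt, image)


# Stub stages standing in for real speech, image, and 3D models (assumptions).
def stub_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def stub_image(prompt: str) -> bytes:
    return b"IMG:" + prompt.encode("utf-8")


def stub_3d(prompt: str, image: Optional[bytes]) -> bytes:
    source = image if image is not None else prompt.encode("utf-8")
    return b"MESH<" + source + b">"


direct = Pipeline("direct", stub_stt, stub_3d)
two_stage = Pipeline("text-image-to-3D", stub_stt, stub_3d, stub_image)
```

Because both pathways share one `generate` entry point, swapping a component only requires passing a different callable, which is what makes a like-for-like comparison across pipelines straightforward in a design of this kind.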
Findings
Text-image-to-3D pipelines produce higher-quality outputs than direct text-to-3D pipelines.
Direct text-to-3D pipelines are faster, with Shap-E completing generation in about 20 seconds.
Perceptual quality influences user satisfaction more than generation speed.
Abstract
As augmented reality (AR) applications increasingly require 3D content, generative pipelines driven by natural input such as speech offer an alternative to manual asset creation. In this work, we design a modular, edge-assisted architecture that supports both direct text-to-3D and text-image-to-3D pathways, enabling interchangeable integration of state-of-the-art components and systematic comparison of their performance in AR settings. Using this architecture, we implement and evaluate four representative pipelines through an IRB-approved user study with 11 participants, assessing six perceptual and usability metrics across three object prompts. Overall, text-image-to-3D pipelines deliver higher generation quality: the best-performing pipeline, which used FLUX for image generation and Trellis for 3D generation, achieved an average satisfaction score of 4.55 out of 5 and an intent…
Taxonomy
Topics: Augmented Reality Applications · Virtual Reality Applications and Impacts · Human Motion and Animation
