Say It, See It: A Systematic Evaluation on Speech-Based 3D Content Generation Methods in Augmented Reality

Yanming Xiu; Joshua Chilukuri; Shunav Sen; Maria Gorlatova

arXiv:2508.12498·cs.HC·August 19, 2025

Open Access

TL;DR

This paper systematically evaluates speech-based 3D content generation methods for augmented reality, comparing direct text-to-3D and text-image-to-3D pipelines in terms of quality, speed, and user satisfaction.

Contribution

The paper introduces a modular, edge-assisted architecture for AR 3D content generation. Because its components are interchangeable, different pipelines can be compared systematically and their performance trade-offs examined.
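To make the idea of interchangeable stages concrete, here is a minimal sketch of how the two pathways could share one interface. It is not the authors' implementation: the `Pipeline` class, the stage type aliases, and every `stub_*` function are hypothetical names chosen for illustration, and the stubs only stand in for real speech recognition, image generation, and 3D generation models.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical stage signatures (assumptions, not taken from the paper).
SpeechToText = Callable[[bytes], str]
TextToImage = Callable[[str], bytes]
ImageTo3D = Callable[[bytes], bytes]
TextTo3D = Callable[[str], bytes]


@dataclass
class Pipeline:
    """One speech-to-3D pipeline assembled from swappable stages."""
    name: str
    transcribe: SpeechToText
    text_to_3d: Optional[TextTo3D] = None
    text_to_image: Optional[TextToImage] = None
    image_to_3d: Optional[ImageTo3D] = None

    def generate(self, audio: bytes) -> bytes:
        prompt = self.transcribe(audio)
        # Direct pathway: text goes straight to a 3D generator.
        if self.text_to_3d is not None:
            return self.text_to_3d(prompt)
        # Two-stage pathway: text -> image -> 3D.
        if self.text_to_image is None or self.image_to_3d is None:
            raise ValueError(f"pipeline {self.name!r} is missing a stage")
        return self.image_to_3d(self.text_to_image(prompt))


# Stub stages so the sketch runs without any model; real components
# would replace these one at a time.
def stub_transcribe(audio: bytes) -> str:
    return audio.decode("utf-8")


def stub_text_to_3d(prompt: str) -> bytes:
    return f"mesh<{prompt}>".encode("utf-8")


def stub_text_to_image(prompt: str) -> bytes:
    return f"image<{prompt}>".encode("utf-8")


def stub_image_to_3d(image: bytes) -> bytes:
    return b"mesh<" + image + b">"


direct = Pipeline("direct", stub_transcribe, text_to_3d=stub_text_to_3d)
staged = Pipeline(
    "staged",
    stub_transcribe,
    text_to_image=stub_text_to_image,
    image_to_3d=stub_image_to_3d,
)
```

Since both pipelines expose the same `generate` method, a comparison harness can time and rate them under identical prompts, and swapping one stage for another leaves the rest of the pipeline unchanged.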

Findings

01 Text-image-to-3D pipelines produce higher-quality outputs than direct text-to-3D pipelines.

02 Direct text-to-3D pipelines are faster, with Shap-E completing generation in about 20 seconds.

03 Perceptual quality influences user satisfaction more than generation speed does.

Abstract

As augmented reality (AR) applications increasingly require 3D content, generative pipelines driven by natural input such as speech offer an alternative to manual asset creation. In this work, we design a modular, edge-assisted architecture that supports both direct text-to-3D and text-image-to-3D pathways, enabling interchangeable integration of state-of-the-art components and systematic comparison of their performance in AR settings. Using this architecture, we implement and evaluate four representative pipelines through an IRB-approved user study with 11 participants, assessing six perceptual and usability metrics across three object prompts. Overall, text-image-to-3D pipelines deliver higher generation quality: the best-performing pipeline, which used FLUX for image generation and Trellis for 3D generation, achieved an average satisfaction score of 4.55 out of 5 and an intent…

Taxonomy

Topics: Augmented Reality Applications · Virtual Reality Applications and Impacts · Human Motion and Animation