Scalable 3D Captioning with Pretrained Models
Tiange Luo, Chris Rockwell, Honglak Lee, Justin Johnson

TL;DR
Cap3D is a scalable, automated method that uses pretrained models to generate descriptions of 3D objects. On Objaverse, its captions surpass human-authored descriptions in quality, cost, and speed.
Contribution
This work introduces Cap3D, a novel approach that automates 3D captioning using pretrained models, eliminating manual annotation and achieving state-of-the-art results.
Findings
Cap3D captions are rated higher in quality than human-authored descriptions, and are cheaper and faster to produce.
Text-to-3D models finetuned on Cap3D captions outperform those finetuned on human captions.
The paper benchmarks state-of-the-art Text-to-3D methods, including Point-E, Shap-E, and DreamFusion.
Abstract
We introduce Cap3D, an automatic approach for generating descriptive text for 3D objects. This approach utilizes pretrained models from image captioning, image-text alignment, and LLM to consolidate captions from multiple views of a 3D asset, completely side-stepping the time-consuming and costly process of manual annotation. We apply Cap3D to the recently introduced large-scale 3D dataset, Objaverse, resulting in 660k 3D-text pairs. Our evaluation, conducted using 41k human annotations from the same dataset, demonstrates that Cap3D surpasses human-authored descriptions in terms of quality, cost, and speed. Through effective prompt engineering, Cap3D rivals human performance in generating geometric descriptions on 17k collected annotations from the ABO dataset. Finally, we finetune Text-to-3D models on Cap3D and human captions, and show Cap3D outperforms; and benchmark the SOTA…
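The pipeline described in the abstract (caption several views, use image-text alignment, consolidate with an LLM) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `caption_3d_object`, its parameters, and the idea of keeping the single best-aligned caption per view are assumptions made for the example, and the three models are passed in as plain callables so that no particular library API is implied.

```python
from typing import Callable, List, Sequence


def caption_3d_object(
    views: Sequence[object],
    propose_captions: Callable[[object], List[str]],
    alignment_score: Callable[[object, str], float],
    consolidate: Callable[[List[str]], str],
) -> str:
    """Hypothetical sketch of a multi-view captioning pipeline.

    views:            rendered images of one 3D asset (any type here)
    propose_captions: stand-in for an image captioning model
    alignment_score:  stand-in for an image-text alignment model
    consolidate:      stand-in for an LLM that merges per-view captions
    """
    per_view_best: List[str] = []
    for view in views:
        candidates = propose_captions(view)
        if not candidates:
            continue  # skip views for which no caption was produced
        # Assumption: keep the candidate that best matches this view.
        best = max(candidates, key=lambda c: alignment_score(view, c))
        per_view_best.append(best)
    return consolidate(per_view_best)


# Toy stand-ins so the sketch runs without any model weights.
toy_views = ["front", "back"]
toy_propose = lambda v: [f"a chair seen from the {v}", "an object"]
toy_score = lambda v, c: float(len(c))  # placeholder: longer caption wins
toy_consolidate = lambda caps: " | ".join(caps)

result = caption_3d_object(toy_views, toy_propose, toy_score, toy_consolidate)
print(result)
```

In a real setting each callable would wrap a pretrained model; passing them in as arguments keeps the control flow independent of any specific captioner, alignment model, or LLM.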
Taxonomy
Topics: Multimodal Machine Learning Applications · Handwritten Text Recognition Techniques · Human Motion and Animation
