MARVEL-40M+: Multi-Level Visual Elaboration for High-Fidelity Text-to-3D Content Creation
Sankalp Sinha, Mohammad Sadil Khan, Muhammad Usama, Shino Sam, Didier Stricker, Sk Aziz Ali, Muhammad Zeshan Afzal

TL;DR
This paper introduces MARVEL-40M+, a large-scale dataset with multi-level annotations for 3D assets, and a two-stage text-to-3D pipeline that enhances high-fidelity content creation from text prompts.
Contribution
It presents a novel multi-stage annotation pipeline combining VLMs and LLMs, and develops MARVEL-FX3D, a fast text-to-3D generation method, advancing dataset quality and generation speed.
Findings
MARVEL-40M+ outperforms existing datasets in annotation quality and diversity.
The annotation pipeline achieves a 72.41% win rate as judged by GPT-4 and 73.40% as judged by human evaluators.
MARVEL-FX3D generates textured 3D meshes from text within 15 seconds.
Abstract
Generating high-fidelity 3D content from text prompts remains a significant challenge in computer vision due to the limited size, diversity, and annotation depth of the existing datasets. To address this, we introduce MARVEL-40M+, an extensive dataset with 40 million text annotations for over 8.9 million 3D assets aggregated from seven major 3D datasets. Our contribution is a novel multi-stage annotation pipeline that integrates open-source pretrained multi-view VLMs and LLMs to automatically produce multi-level descriptions, ranging from detailed (150-200 words) to concise semantic tags (10-20 words). This structure supports both fine-grained 3D reconstruction and rapid prototyping. Furthermore, we incorporate human metadata from source datasets into our annotation pipeline to add domain-specific information in our annotation and reduce VLM hallucinations. Additionally, we develop…
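The multi-level structure described in the abstract can be illustrated with a minimal sketch. It assumes only the two word-count ranges the abstract states (150-200 words for detailed descriptions, 10-20 words for concise semantic tags); the class name, field names, and level labels are hypothetical and are not taken from the MARVEL-40M+ release.

```python
from dataclasses import dataclass

# Word-count ranges stated in the abstract for the two ends of the hierarchy.
# The level names are hypothetical labels chosen for this sketch.
WORD_RANGES = {
    "detailed": (150, 200),
    "semantic_tags": (10, 20),
}


@dataclass
class Annotation:
    """One text annotation attached to a 3D asset at a given level."""
    asset_id: str  # hypothetical identifier for the 3D asset
    level: str     # key into WORD_RANGES
    text: str

    def word_count(self) -> int:
        # Whitespace tokenization is an assumption; the paper may count differently.
        return len(self.text.split())

    def in_range(self) -> bool:
        # True when the text length falls inside the range for its level.
        low, high = WORD_RANGES[self.level]
        return low <= self.word_count() <= high
```

A check like `in_range()` could serve as a simple sanity filter over generated annotations, flagging descriptions that drift outside the length intended for their level.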
Taxonomy
Topics: Human Motion and Animation · Computer Graphics and Visualization Techniques
Methods: Attention Is All You Need · Dense Connections · Label Smoothing · Dropout · Linear Layer · Layer Normalization · Byte Pair Encoding · Adam · Residual Connection · Softmax
