A Three-Level Alignment Framework for Large-Scale 3D Retrieval and Controlled 4D Generation
Philip Xu

TL;DR
Uni4D is a comprehensive framework that enhances large-scale 3D retrieval and 4D generation by aligning text, 3D models, and images through a structured three-level approach, enabling improved semantic understanding and temporal consistency.
Contribution
The paper introduces Uni4D, a unified framework that aligns text, 3D models, and images at three levels to support open-vocabulary 3D retrieval and controlled 4D generation.
Findings
Achieves high-quality 3D retrieval results.
Enables controllable 4D generation with temporal consistency.
Demonstrates superior performance on the Align3D 130 dataset.
Abstract
We introduce Uni4D, a unified framework for large-scale, open-vocabulary 3D retrieval and controlled 4D generation based on structured three-level alignment across text, 3D model, and image modalities. Built upon the Align3D 130 dataset, Uni4D employs a 3D-text multi-head attention and search model to optimize text-to-3D retrieval through improved semantic alignment. The framework further strengthens cross-modal alignment through three components: precise text-to-3D retrieval, multi-view 3D-to-image alignment, and image-to-text alignment for generating temporally consistent 4D assets. Experimental results demonstrate that Uni4D achieves high-quality 3D retrieval and controllable 4D generation, advancing dynamic multimodal understanding and practical applications.
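The text-to-3D retrieval step named in the abstract can be illustrated with a generic embedding search. The sketch below is not the paper's 3D-text multi-head attention and search model. It assumes only that text queries and 3D models have already been mapped into a shared embedding space, and it ranks models by cosine similarity. The function names (`cosine`, `retrieve`) and the toy three-dimensional embeddings are hypothetical and exist only for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def retrieve(text_vec, shape_vecs, k=2):
    """Return the ids of the k shapes whose embeddings best match text_vec."""
    scored = [(cosine(text_vec, vec), sid) for sid, vec in shape_vecs.items()]
    # Highest similarity first; Python's sort is stable, so ties keep insertion order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [sid for _, sid in scored[:k]]

# Hypothetical toy embeddings standing in for encoder outputs (assumption).
shapes = {
    "chair": [0.9, 0.1, 0.0],
    "table": [0.7, 0.3, 0.1],
    "car": [0.0, 0.2, 0.9],
}
query = [1.0, 0.0, 0.0]  # stand-in for an encoded text prompt
top = retrieve(query, shapes, k=2)
print(top)
```

In a real system the embeddings would come from trained text and 3D encoders, and the exhaustive scan would typically be replaced by an approximate nearest-neighbor index once the collection grows large.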
Taxonomy
Topics: 3D Shape Modeling and Analysis · Multimodal Machine Learning Applications · Robotics and Sensor-Based Localization
