A3D: Does Diffusion Dream about 3D Alignment?
Savva Ignatyev, Nina Konovalova, Daniil Selikhanovych, Oleg Voynov, Nikolay Patakin, Ilya Olkov, Dmitry Senushkin, Alexey Artemov, Anton Konushin, Alexander Filippov, Peter Wonka, Evgeny Burnaev

TL;DR
This paper introduces A3D, a method for generating multiple semantically aligned 3D objects from text prompts by embedding them into a shared latent space and optimizing smooth, plausible transitions, improving applications like 3D asset design.
Contribution
A3D is the first approach to achieve aligned 3D object sets from text prompts by optimizing transitions in a shared latent space, ensuring smoothness and plausibility.
Findings
Effective alignment of 3D objects from text prompts.
Improved 3D editing and hybridization applications.
Demonstrated superiority over non-aligned generation methods.
Abstract
We tackle the problem of text-driven 3D generation from a geometry alignment perspective. Given a set of text prompts, we aim to generate a collection of objects with semantically corresponding parts aligned across them. Recent methods based on Score Distillation have succeeded in distilling the knowledge from 2D diffusion models to high-quality representations of the 3D objects. These methods handle multiple text queries separately, and therefore the resulting objects have a high variability in object pose and structure. However, in some applications, such as 3D asset design, it may be desirable to obtain a set of objects aligned with each other. In order to achieve the alignment of the corresponding parts of the generated objects, we propose to embed these objects into a common latent space and optimize the continuous transitions between these objects. We enforce two kinds of…
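The shared-latent-space idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the unit-norm random anchor codes, the linear interpolation between them, the concatenation of position and code as conditioning, and every function name below are assumptions made purely for illustration. Each prompt gets one anchor code, and a "transition" is a code sampled on the segment between two anchors, which a single shared 3D field would then receive alongside the query position.

```python
import math
import random


def make_anchor_codes(num_prompts, dim, rng):
    """One latent code per prompt (assumption: random unit-norm vectors)."""
    codes = []
    for _ in range(num_prompts):
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(c * c for c in v))
        codes.append([c / norm for c in v])
    return codes


def lerp(z_a, z_b, t):
    """Linear interpolation between two codes (assumed transition path)."""
    return [(1.0 - t) * a + t * b for a, b in zip(z_a, z_b)]


def sample_transition_code(codes, rng):
    """Pick two distinct anchors and a random point on the segment between them."""
    i, j = rng.sample(range(len(codes)), 2)
    t = rng.random()
    return i, j, t, lerp(codes[i], codes[j], t)


def field_input(point_xyz, latent):
    """Input to a shared latent-conditioned field: 3D position followed by the code."""
    return list(point_xyz) + list(latent)


rng = random.Random(0)
codes = make_anchor_codes(num_prompts=3, dim=4, rng=rng)
i, j, t, z = sample_transition_code(codes, rng)
x = field_input((0.1, -0.2, 0.3), z)
```

In such a setup, the anchor codes would be supervised with their own text prompts, while the intermediate codes are where smoothness and plausibility of the transition would be enforced; how the paper does this is not recoverable from the truncated abstract.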
Peer Reviews
Decision: ICLR 2025 Poster
1. The motivation for optimizing in a shared latent space to obtain smooth transitions is reasonable. 2. The writing style is easy to follow.
1. Fairness issues. This work is built on MVDream, which relies heavily on the pre-trained knowledge from SD2.1. This raises the question: compared with other approaches, does the effectiveness of A3D come from the method itself or from the pre-trained knowledge of the SD model? Relevant experiments are needed to demonstrate it. Besides, the authors are encouraged to compare different SD versions, like SD1.5 and SDXL. 2. In Tab. 2, MVEdit outperforms A3D on CLIP score and DIFT distance. The authors cla
- This paper is well-written and easy to follow. - The motivation of the paper is sound, tackling the task of 3D shape editing through addressing hybridization and alignment of different 3D scenes corresponding to the given shape, which I believe is an innovative and novel solution to the task at hand. - The technical solution this paper offers to the problem is simple and effective: training a single NeRF model to handle multiple scenes from different prompts during the optimization phase, and the
- This paper uses NeRF for seamless representation of different scenes, and I understand the design choice was to enable seamless transition between different optimized scenes within the representation space of NeRF. However, more explicit 3D representations such as Instant-NGP or 3DGS offer their own advantages, such as speed and controllability, and I believe they are viable candidates for comparison. Can this method be applied to such different forms of 3D representation? If so, how do their perfor
1. The overall idea is interesting. The authors observe that previous 3D editing methods fail to generate structurally aligned, high-quality 3D objects (e.g., they lack either structural alignment or detailed texture). A3D provides an effective way to generate aligned 3D objects by introducing additional latent code conditions. 2. The results in Figure 3 demonstrate that A3D indeed learns a smooth transition between different latent codes and can be decoupled with 3D positions by setting anchor
1. I noticed that some generated textures from A3D are still poor. Is this caused by the limitation of the SDS loss or by the framework of A3D? 2. I am wondering whether A3D is sensitive to text prompts. It would be better if the authors could evaluate this point. Is A3D robust to various text prompts? 3. In Section 4.1, A3D uses some regularization strategies to supervise the network, including limiting the depth of the neural network and enforcing the smoothness of the rendered results, which will cau
Taxonomy
Topics: Manufacturing Process and Optimization
Methods: Sparse Evolutionary Training · Diffusion
