HyperFields: Towards Zero-Shot Generation of NeRFs from Text
Sudarshan Babu, Richard Liu, Avery Zhou, Michael Maire, Greg Shakhnarovich, Rana Hanocka

TL;DR
HyperFields is a novel method that enables zero-shot and few-shot generation of diverse NeRF scenes from text prompts using a dynamic hypernetwork, significantly speeding up scene synthesis.
Contribution
We propose HyperFields, a dynamic hypernetwork approach that learns a general text-to-NeRF mapping, allowing rapid zero-shot and few-shot scene generation from text.
Findings
Enables single-pass generation of over 100 scenes from text.
Capable of zero-shot scene synthesis, including out-of-distribution scenes.
Faster scene synthesis, 5 to 10 times quicker than existing methods.
Abstract
We introduce HyperFields, a method for generating text-conditioned Neural Radiance Fields (NeRFs) with a single forward pass and (optionally) some fine-tuning. Key to our approach are: (i) a dynamic hypernetwork, which learns a smooth mapping from text token embeddings to the space of NeRFs; (ii) NeRF distillation training, which distills scenes encoded in individual NeRFs into one dynamic hypernetwork. These techniques enable a single network to fit over a hundred unique scenes. We further demonstrate that HyperFields learns a more general map between text and NeRFs, and consequently is capable of predicting novel in-distribution and out-of-distribution scenes -- either zero-shot or with a few finetuning steps. Finetuning HyperFields benefits from accelerated convergence thanks to the learned general map, and is capable of synthesizing novel scenes 5 to 10 times faster than existing…
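To make the two ingredients named in the abstract more concrete, the following is a minimal sketch, not the authors' implementation. It assumes that "dynamic" means each layer's weights are predicted from the text embedding together with the activations entering that layer, and it uses a plain per-point squared error as a stand-in for the distillation objective. All names (`HyperLayer`, `dynamic_forward`, `distillation_loss`, `build_layers`) and the layer sizes are hypothetical.

```python
import random


def linear(weights, bias, x):
    """Apply y = W x + b, with W stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(values):
    return [max(0.0, v) for v in values]


class HyperLayer:
    """Hypothetical hypernetwork head: maps a conditioning vector to the
    weights and biases of one layer of the target (NeRF-like) MLP."""

    def __init__(self, cond_dim, in_dim, out_dim, rng):
        self.in_dim, self.out_dim = in_dim, out_dim
        n_params = out_dim * in_dim + out_dim  # weight entries + biases
        self.W = [[rng.gauss(0.0, 0.1) for _ in range(cond_dim)]
                  for _ in range(n_params)]
        self.b = [0.0] * n_params

    def predict(self, cond):
        flat = linear(self.W, self.b, cond)
        split = self.out_dim * self.in_dim
        weights = [flat[r * self.in_dim:(r + 1) * self.in_dim]
                   for r in range(self.out_dim)]
        return weights, flat[split:]


def build_layers(text_dim, sizes, seed=0):
    """One HyperLayer per target-MLP layer; sizes = [in, hidden..., out]."""
    rng = random.Random(seed)
    return [HyperLayer(text_dim + sizes[i], sizes[i], sizes[i + 1], rng)
            for i in range(len(sizes) - 1)]


def dynamic_forward(layers, text_emb, point):
    """Evaluate the target MLP at one 3D point, generating each layer's
    weights on the fly from the text embedding and current activations
    (an assumed reading of "dynamic hypernetwork")."""
    h = list(point)
    for i, layer in enumerate(layers):
        cond = list(text_emb) + h
        weights, bias = layer.predict(cond)
        h = linear(weights, bias, h)
        if i < len(layers) - 1:
            h = relu(h)
    return h  # four values, read here as (r, g, b, density)


def distillation_loss(student_out, teacher_out):
    """Mean squared error between student and teacher outputs; a
    simplification that compares raw outputs, not rendered images."""
    return sum((s - t) ** 2
               for s, t in zip(student_out, teacher_out)) / len(student_out)


if __name__ == "__main__":
    layers = build_layers(text_dim=4, sizes=[3, 8, 4])
    student = dynamic_forward(layers, [0.1, -0.2, 0.3, 0.0], [0.5, 0.5, 0.5])
    teacher = [0.2, 0.4, 0.6, 1.0]  # made-up output of a per-scene NeRF
    print(distillation_loss(student, teacher))
```

In this sketch a single set of hypernetwork parameters serves every prompt: changing `text_emb` changes the generated MLP weights, which is what would let one network stand in for many per-scene NeRFs once trained against their outputs.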
Peer Reviews
Decision: ICML 2024 Poster
1. This paper is clearly written and easy to follow. 2. The proposed pipeline intuitively makes sense and it is interesting to see the disentanglement of different attributes learned by the proposed hypernetwork.
1. My primary concern is the technical contributions of this work in comparison to the referenced ATT3D study. Specifically, while this paper clarifies the connections with ATT3D, it remains unclear what novel techniques or insights are newly introduced by this work. A more compelling justification is highly desirable. 2. Furthermore, there is a lack of benchmarking against ATT3D and the reported results indicate that ATT3D may achieve much better visualization effects. This discrepancy arises
(1) The motivation of this work is clear and its technical impact is significant. Limited by the intrinsic mapping relationship between 3D coordinates and the color field of a NeRF, current work struggles to achieve generalizable 3D shapes with various conditional inputs in a unified framework. This work proposes a hypernetwork architecture, which is justified as the key innovation to learn a generalized text-to-3D mapping across different inputs. (2) Authors conduct extensive experiment
(1) A quantitative comparison between the proposed method and the baseline (DreamFusion) on CLIP retrieval scores or a user study seems to be missing; adding one would make the work stronger and more convincing. (2) Another ablation study worth conducting would verify the effectiveness of training across multiple shapes and then fine-tuning on a single shape, compared with a baseline that trains the model on this single shape from scratch. This will demonstrate the advantage of training across a wider range
- The hypernetwork is fast since it is free of optimization. Such a feedforward approach can have computational advantages, as the training cost can potentially be amortized.
- The experiments show that the model can compose concepts in a certain fashion. This helps illustrate the benefit that such a model can amortize training compute over many inference uses.
- The current method is trained from scratch, which might be computationally expensive.
- The key architecture of this paper is a hypernetwork that predicts the weights for the MLP.
- A main concern is that the results are very limited. Most of the results are shown on simple objects and compositions.
- Another small concern is the need to create a small dataset of NeRF scenes. Each NeRF can take minutes, and this prevents the method from scaling to larger datasets.
- If the model weight depend
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Model Reduction and Neural Networks · Human Pose and Action Recognition
