Positional Encoding Field
Yunpeng Bai, Haoxiang Li, Qixing Huang

TL;DR
This paper introduces the Positional Encoding Field (PE-Field), a 3D extension of positional encodings that enhances diffusion transformers' ability to model geometry, leading to improved 3D reasoning and image editing capabilities.
Contribution
We propose PE-Field, a novel 3D positional encoding method that improves diffusion transformers' spatial understanding and performance in 3D tasks.
Findings
PE-Field enables volumetric reasoning in DiTs.
It achieves state-of-the-art results in novel view synthesis.
PE-Field improves controllable spatial image editing.
Abstract
Diffusion Transformers (DiTs) have emerged as the dominant architecture for visual generation, powering state-of-the-art image and video models. By representing images as patch tokens with positional encodings (PEs), DiTs combine Transformer scalability with spatial and temporal inductive biases. In this work, we revisit how DiTs organize visual content and discover that patch tokens exhibit a surprising degree of independence: even when PEs are perturbed, DiTs still produce globally coherent outputs, indicating that spatial coherence is primarily governed by PEs. Motivated by this finding, we introduce the Positional Encoding Field (PE-Field), which extends positional encodings from the 2D plane to a structured 3D field. PE-Field incorporates depth-aware encodings for volumetric reasoning and hierarchical encodings for fine-grained sub-patch control, enabling DiTs to model geometry…
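To make the idea of extending positional encodings from the 2D plane with a depth axis more concrete, here is a minimal sketch. It is not the paper's implementation: this summary does not specify the exact formulation, so the sketch assumes plain sinusoidal encodings concatenated per axis. The names `sinusoidal_1d` and `pe_3d`, the `dim_per_axis` parameter, and the use of fractional coordinates to address sub-patch positions are all hypothetical choices made for illustration.

```python
import math


def sinusoidal_1d(pos, dim, base=10000.0):
    """Sinusoidal encoding of a scalar position into `dim` values (dim must be even)."""
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    enc = []
    for i in range(dim // 2):
        freq = 1.0 / (base ** (2 * i / dim))
        enc.append(math.sin(pos * freq))
        enc.append(math.cos(pos * freq))
    return enc


def pe_3d(x, y, z, dim_per_axis=8):
    """Hypothetical 3D encoding: concatenate per-axis encodings for (x, y, depth)."""
    return (
        sinusoidal_1d(x, dim_per_axis)
        + sinusoidal_1d(y, dim_per_axis)
        + sinusoidal_1d(z, dim_per_axis)
    )


# Two tokens at the same image-plane patch (3, 5) but different assumed depths.
near = pe_3d(3, 5, 1.0)
far = pe_3d(3, 5, 4.0)

# Assumption: fractional coordinates address a location inside patch (3, 5).
sub = pe_3d(3.5, 5.25, 1.0)

# The image-plane part of the encoding is shared; only the depth part differs.
same_plane = near[:16] == far[:16]
different_depth = near[16:] != far[16:]
```

In this sketch, `near` and `far` agree on their first 16 values (the x and y parts) and differ only in the depth part, which is the property a depth-aware encoding needs in order to distinguish tokens that project to the same patch.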
Peer Reviews
Decision: ICLR 2026 Poster
1. The idea is quite neat, with a pilot study to build intuition, interesting designs, and good experiment design to validate the idea.
2. The results are quite impressive! The visual quality is quite high. I understand that a single forward pass of the model can only produce a small view shift. To address this, the authors also show recursively applied results, which is also quite cool.
3. The overall presentation is quite clear, and I can easily follow the authors' logic.
I lean towards accepting the paper. There are only several minor points, which I don't think are real weaknesses.
1. This method is hard to apply to multi-view novel view synthesis. Maybe it is still possible to build a single reference image by merging patches from multiple input views?
2. The demo would be even cooler with videos attached (novel-view-synthesis videos with smoothly moving cameras).
1. [effectiveness] The proposed method consistently outperforms previous methods both qualitatively and quantitatively.
2. [motivation] The proposed method is well motivated. It starts from an observation made by shuffling the positional encodings, which helps readers better understand the motivation and design of the proposed method.
3. [ablations] Ablation studies have been carried out in Section 4.3 to demonstrate the effectiveness of the individual components of the proposed method.
4. [extens
1. [clarity] As described on L033-041 and in Figure 1, the proposed method is motivated by a positional encoding shuffling experiment. What would be the worst case for a shuffled or re-ordered positional encoding? How does that impact the model? If there is a threshold beyond which the generated output is completely messed up, the authors might have to rethink the connection between the positional encoding shuffling experiment and the proposed method.
2. [typesetting] The macro "\citep" or "\citet
1. Interesting Concept Finding. The authors find that PEs control the coherence of spatial structure during image generation or reconstruction, motivating a potential new approach to visual editing or synthesis.
2. Novel and Effective Positional Encoding. The authors take into account depth information and finer-grained semantics within each patch, then propose a new positional encoding that captures both.
3. Good Synthesis Result. With the proposed positional encodin
1. Over-dependence on Monocular Reconstruction. As shown in the main framework, monocular reconstruction provides the essential 3D positional information for new view generation. How robust is the model to poor or noisy 3D position information?
2. Concerns about Multi-Level Positional Encoding. The head-layer matching in the fine-grained positional encoding is handcrafted in a fixed order. Other matching schemes should be tested to verify the robustness of the fine-grained positional encoding.
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Advanced Vision and Imaging · 3D Shape Modeling and Analysis
