PixNerd: Pixel Neural Field Diffusion

Shuai Wang; Ziteng Gao; Chenhui Zhu; Weilin Huang; Limin Wang

arXiv:2507.23268·cs.CV·August 5, 2025

TL;DR

PixNerd introduces a single-stage, pixel-space neural field diffusion model that achieves high-quality image generation without complex pipelines or VAEs, demonstrating competitive results on standard benchmarks.

Contribution

It presents a novel end-to-end pixel neural field diffusion approach that simplifies image generation and improves quality without relying on cascades or variational autoencoders.

Findings

01. Achieved 2.15 FID on ImageNet 256×256

02. Achieved 2.84 FID on ImageNet 512×512

03. Reached competitive scores on the GenEval (0.73 overall) and DPG (80.9 overall) benchmarks

Abstract

The current success of diffusion transformers heavily depends on the compressed latent space shaped by the pre-trained variational autoencoder (VAE). However, this two-stage training paradigm inevitably introduces accumulated errors and decoding artifacts. To address these problems, researchers have returned to pixel space at the cost of complicated cascade pipelines and increased token complexity. In contrast to their efforts, we propose to model the patch-wise decoding with a neural field and present a single-scale, single-stage, efficient, end-to-end solution, coined as pixel neural field diffusion (PixNerd). Thanks to the efficient neural field representation in PixNerd, we directly achieved 2.15 FID on ImageNet $256 \times 256$ and 2.84 FID on ImageNet $512 \times 512$ without any complex cascade pipeline or VAE. We also extend our PixNerd framework to text-to-image applications.…
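The idea of "patch-wise decoding with a neural field" can be illustrated with a small NumPy sketch. This is not the authors' implementation: it assumes that each transformer token is mapped by a linear layer (the hypothetical `w_hyper`) to the weights of a tiny two-layer MLP, and that this MLP is evaluated at Fourier-encoded pixel coordinates inside the patch. The patch size of 16, the hidden width, the number of frequencies, and all function names are illustrative choices, and random arrays stand in for the transformer output.

```python
import numpy as np

def patch_coords(p):
    """Normalized (x, y) coordinates in [-1, 1] for a p x p patch, shape (p*p, 2)."""
    axis = np.linspace(-1.0, 1.0, p)
    ys, xs = np.meshgrid(axis, axis, indexing="ij")
    return np.stack([xs.ravel(), ys.ravel()], axis=-1)

def fourier_features(coords, num_freqs):
    """Sin/cos encoding of the coordinates, shape (n, 4 * num_freqs)."""
    freqs = (2.0 ** np.arange(num_freqs)) * np.pi       # (F,)
    angles = coords[:, :, None] * freqs                 # (n, 2, F)
    feats = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return feats.reshape(coords.shape[0], -1)

def decode_patches(tokens, w_hyper, p, hidden, num_freqs):
    """Decode every token into a p x p x 3 patch with a token-specific MLP.

    tokens:  (N, D) array standing in for transformer outputs (assumption).
    w_hyper: (D, in_dim * hidden + hidden * 3) hypothetical linear layer that
             turns a token into the weights of its per-patch neural field.
    """
    feats = fourier_features(patch_coords(p), num_freqs)      # (p*p, in_dim)
    in_dim = feats.shape[1]
    params = tokens @ w_hyper                                 # (N, n_params)
    w1 = params[:, : in_dim * hidden].reshape(-1, in_dim, hidden)
    w2 = params[:, in_dim * hidden:].reshape(-1, hidden, 3)
    h = np.maximum(np.einsum("ki,nih->nkh", feats, w1), 0.0)  # ReLU layer
    out = np.einsum("nkh,nhc->nkc", h, w2)                    # (N, p*p, 3)
    return out.reshape(-1, p, p, 3)

def unpatchify(patches, grid):
    """Tile (grid*grid, p, p, 3) patches into a (grid*p, grid*p, 3) image."""
    p = patches.shape[1]
    x = patches.reshape(grid, grid, p, p, 3).transpose(0, 2, 1, 3, 4)
    return x.reshape(grid * p, grid * p, 3)

rng = np.random.default_rng(0)
p, grid, dim, hidden, num_freqs = 16, 16, 64, 8, 4   # illustrative sizes only
in_dim = 4 * num_freqs
tokens = rng.standard_normal((grid * grid, dim))
w_hyper = rng.standard_normal((dim, in_dim * hidden + hidden * 3)) * 0.02
image = unpatchify(decode_patches(tokens, w_hyper, p, hidden, num_freqs), grid)
print(image.shape)  # (256, 256, 3)
```

Under these assumptions a 256×256 image needs only 16 × 16 = 256 tokens, while the per-pixel detail inside each 16×16 patch comes from evaluating the small coordinate MLP rather than from additional tokens. In a diffusion model the decoded values would be the per-pixel denoising target, not a final RGB image.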

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 8 · Confidence 3

Strengths

- The paper is well-written with clear structure.
- The paper presents comprehensive comparisons against current SOTA models both qualitatively and quantitatively. PixNerd matches or exceeds the performance of comparable methods on ImageNet and text-to-image tasks.
- The ablation studies are conducted systematically to evaluate each design choice.

Weaknesses

- The training memory usage is almost double that of the latent diffusion counterpart.
- PixNerd's performance at higher resolution (512×512) does not scale as strongly as at 256×256. For example, PixNerd is better than SiT-XL on ImageNet256 but falls behind on ImageNet512. Does this imply that the gains from the neural field head diminish at higher resolutions?
- Minors:
  - The citation of Rectified flow seems missing.
  - Table 4 should specify that the comparison is reported on Im

Reviewer 02 · Rating 4 · Confidence 4

Strengths

S1. Originality and Significance - I highly commend the paper’s central motivation and direction. While most recent efforts focus on reducing computational cost by operating entirely in latent space, this work takes the opposite yet equally important perspective to explore how to lower cost directly in pixel space. This inversion of the conventional design philosophy is both insightful and original, addressing a long-standing challenge in diffusion modeling.

S2. Experimental Quality and Clarit

Weaknesses

W1. Justification for architectural necessity. - Given that recent advances in diffusion distillation (e.g., one-step or few-step distilled diffusion models) can achieve nearly identical performance to full diffusion models with drastically reduced sampling steps, it is unclear why PixNerd needs to exist as a separate model. Could the authors clarify whether PixNerd itself can serve as a teacher model for distillation? If not, then PixNerd would likely exhibit much higher latency compared to di

Reviewer 03 · Rating 8 · Confidence 3

Strengths

- Novel method. The proposed PixNerd architecture, which combines a large-patch diffusion transformer with a neural field decoder, is a novel and interesting approach to pixel-space diffusion modeling.
- Strong results. This paper reported a competitive FID score of 1.93 on ImageNet. Moreover, the proposed framework can be applied to text-to-image generation and achieves a competitive 0.73 overall score on the GenEval benchmark and 80.9 overall score on the DPG benchmark.
- Efficiency. This p

Weaknesses

- While the results on ImageNet at 256x256 resolution are quite competitive (see Table 1), the results at 512 resolution are not so convincing (see Table 6). The authors are encouraged to explain the performance degradation. The authors are also encouraged to provide comparisons at an even higher resolution like 1024 or 768.
- Unclear latency comparison. The abstract mentions an 8x latency improvement but does not specify which models were used for comparison. According to Table 1, PixNerd did no

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Advanced Neuroimaging Techniques and Applications · Stochastic Gradient Optimization Techniques