Resolving Representation Ambiguity in Feedforward Novel View Synthesis Transformer via Semantic-Spatial Decoupling
Yihang Wu, Yihang Sun, Shaofeng Zhang, Zuxuan Wu, Junchi Yan, Xiaosong Jia, and Yu-gang Jiang

TL;DR
This paper introduces a decoupled transformer architecture for feedforward novel view synthesis that separates semantic and spatial representations, enhancing rendering fidelity without increasing inference latency.
Contribution
It proposes a novel decoupled design with semantic and spatial tokens, plus optional supervision and modulation, improving NVS performance over prior mixed-representation models.
Findings
Achieves consistent rendering-fidelity improvements across the tested models.
Maintains zero additional inference latency.
Effective in both decoder-only and encoder-decoder architectures.
Abstract
Transformer-based models have advanced feedforward novel view synthesis (NVS). Current architectures such as GS-LRM and LVSM mix semantic information (e.g., RGB) and spatial information (e.g., Plücker rays) into a shared feature space. Since Plücker rays naturally carry lattice-like spatial structure, these designs can make the spatial bias interfere with appearance representation and degrade rendering fidelity. To this end, we propose to decouple the representation of feedforward NVS transformers into separate semantic and spatial tokens. The decoupled design keeps semantic and spatial information explicit in their branches while preserving cross-branch interaction through shared attention routing. Built on this design, we introduce optional categorized supervision and bidirectional modulation: the former provides branch-specific training signals, while the latter improves…
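To make the "spatial information" in the abstract concrete, the sketch below computes a per-pixel Plücker ray map for a pinhole camera. It is only an illustration under stated assumptions: the abstract does not give the paper's exact ray parameterization, so the standard 6-D form (unit direction d, moment o × d), the ordering of the two halves, the +z-forward pixel-centre convention, and the names `plucker_ray` and `plucker_map` are all hypothetical choices made here, not the authors' implementation.

```python
import math


def cross(a, b):
    """Cross product of two 3-vectors given as tuples."""
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def plucker_ray(u, v, fx, fy, cx, cy, cam_to_world_rot, cam_origin):
    """Return the 6-D Plücker coordinates (d, o x d) of the ray through pixel (u, v).

    Assumptions (not from the source): pinhole intrinsics, +z forward,
    rays cast through pixel centres, rotation given as a 3x3 nested list.
    """
    # Ray direction in camera coordinates, through the pixel centre.
    d_cam = ((u + 0.5 - cx) / fx, (v + 0.5 - cy) / fy, 1.0)
    # Rotate the direction into world coordinates.
    d = tuple(
        sum(cam_to_world_rot[i][j] * d_cam[j] for j in range(3)) for i in range(3)
    )
    # Normalize so the direction half has unit length.
    norm = math.sqrt(sum(c * c for c in d))
    d = tuple(c / norm for c in d)
    # Moment vector: camera origin crossed with the direction.
    m = cross(cam_origin, d)
    return d + m


def plucker_map(height, width, fx, fy, cx, cy, cam_to_world_rot, cam_origin):
    """Build a height x width grid of 6-D Plücker rays (nested lists of tuples)."""
    return [
        [
            plucker_ray(u, v, fx, fy, cx, cy, cam_to_world_rot, cam_origin)
            for u in range(width)
        ]
        for v in range(height)
    ]
```

Every entry of this map depends only on the camera geometry and the pixel's grid position, never on image content, which is one way to read the "lattice-like spatial structure" the abstract attributes to Plücker rays. Each ray also satisfies the Plücker constraint that the direction and moment halves are orthogonal.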
