CAMEO: Correspondence-Attention Alignment for Multi-View Diffusion Models
Minkyung Kwon, Jinhyeok Choi, Jiho Park, Seonghu Jeon, Jinhyuk Jang, Junyoung Seo, Minseop Kwak, Jin-Hwa Kim, Seungryong Kim

TL;DR
CAMEO introduces a supervision technique for attention maps in multi-view diffusion models, significantly improving view consistency, training efficiency, and synthesis quality by leveraging geometric correspondence during training.
Contribution
This work presents CAMEO, a novel supervision method for attention maps that enhances multi-view diffusion models' performance and training efficiency, applicable across different models.
Findings
Supervising attention maps improves view consistency.
CAMEO halves the number of training iterations needed for convergence.
Geometric supervision enhances synthesis quality.
Abstract
Multi-view diffusion models have recently emerged as a powerful paradigm for novel view synthesis, yet the underlying mechanism that enables their view-consistency remains unclear. In this work, we first verify that the attention maps of these models acquire geometric correspondence throughout training, attending to the geometrically corresponding regions across reference and target views for view-consistent generation. However, this correspondence signal remains incomplete, with its accuracy degrading under large viewpoint changes. Building on these findings, we introduce CAMEO, a simple yet effective training technique that directly supervises attention maps using geometric correspondence to enhance both the training efficiency and generation quality of multi-view diffusion models. Notably, supervising a single attention layer is sufficient to guide the model toward learning precise…
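The supervision idea in the abstract can be illustrated with a small sketch. One reviewer below describes the method as adding an auxiliary cross-entropy loss, so the example treats each target-view query's attention over reference-view tokens as a classification over those tokens, with the geometrically corresponding token as the label. The function name `attention_alignment_loss`, the argument layout, and the validity mask are assumptions made for illustration and are not taken from the paper; how the correspondences are obtained and which layer is supervised are left out.

```python
import math

def attention_alignment_loss(logits, target_idx, valid):
    """Mean cross-entropy between attention rows and correspondence labels.

    Hypothetical helper for illustration only (not the authors' code).
    logits:     logits[q][k] is the pre-softmax attention score of
                target-view query q for reference-view key k.
    target_idx: target_idx[q] is the reference token assumed to correspond
                geometrically to query q (precomputed elsewhere).
    valid:      valid[q] is False when query q has no correspondence,
                for example because the point is occluded or out of frame.
    """
    total, count = 0.0, 0
    for row, k_star, ok in zip(logits, target_idx, valid):
        if not ok:
            continue  # skip queries without a ground-truth match
        m = max(row)  # subtract the max for a stable log-sum-exp
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[k_star]  # equals -log softmax(row)[k_star]
        count += 1
    return total / count if count else 0.0

# Three target queries attending over three reference tokens.
logits = [[4.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 5.0, 0.0]]
loss = attention_alignment_loss(logits, [0, 2, 1], [True, False, True])
print(loss)
```

In training, a term like this would presumably be added to the usual diffusion objective with a weighting hyperparameter (the reviews mention a λ), so that attention is pulled toward the corresponding reference region while the denoising loss remains the main signal.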
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
The work examines how multi-view diffusion models handle geometry without built-in 3D priors by analyzing their attention behavior and its link to NVS performance, which is insightful. The method is straightforward, and according to the authors' claim it addresses a real issue in training efficiency. Overall, the paper is clearly presented.
The major weakness of this paper is insufficient experimental validation. 1. All experiments rely heavily on CAT3D, which may limit the evidence of broader applicability. Evaluating on other models (especially other architectures such as DiT) would help. 2. While the correlations are shown well, establishing causality beyond the proposed fix calls for more ablations, such as perturbing attention without alignment. 3. Hyperparameter (e.g., λ) is mentioned but seems to be ch
1. The paper is clearly written. 2. The architecture of CAMEO is simple: it just adds an auxiliary cross-entropy loss.
1. The claim that inflated 3D attention "emerges" as geometric correspondence feels incremental. Prior work has already observed that the attention map encodes point-to-point or identity-consistent correspondences in diffusion models [1,2]. Extending this observation to multi-view NVS is interesting but not obviously surprising. 2. The paper claims that CAMEO achieves 2x faster convergence. However, this is not clearly supported by the results in Figure 1: the LPIPS curves do not exhibit such a
Clear observation that attention maps naturally encode geometric correspondences. Demonstrates improvement and faster convergence over CAT3D.
The correspondence-alignment idea is simple and not very novel. It depends on external geometry estimation and requires preprocessing the training data, which is not scalable. It is unclear how much computation is needed to obtain the correspondences. Only marginal final quality improvement over CAT3D, and the claimed 2x training speedup is measured by PSNR, which is not really reliable when it is below 20. Limited evaluation on one backbone; lacks broader comparison with SoTA. Reliability and scalability of
* The paper is well-written and well-structured, with clear language and well-designed figures that make the methodology and results easy to follow and understand. Overall, the editorial quality of this work is very good, and helpful. * The authors investigate an important problem: the geometric consistency in novel view synthesis (NVS) task. The proposed idea (supervising attention maps using geometric correspondence signals) is simple yet effective. * The analysis of the validity of the prop
* The novelty of the paper is limited, as the core method is primarily based on a combination and refinement of existing techniques, such as selecting cross-attention layers that best capture the target semantics, the use of an off-the-shelf pretrained model to obtain ground-truth semantic maps, and taking the discrepancy with the ground truth as a regularization loss to refine attention units and improve semantic consistency. The paper uses and adapts these existing ideas to the novel view synt
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning · Image Enhancement Techniques
