Retrieval-guided Cross-view Image Synthesis
Hongji Yang, Yiru Li, Yingying Zhu

TL;DR
This paper introduces a retrieval-guided framework for cross-view image synthesis that leverages semantic similarity embeddings to improve synthesis quality across drastically different viewpoints, supported by a new urban dataset.
Contribution
It proposes a novel retrieval-guided approach with contrastive learning and a fusion mechanism, and introduces the VIGOR-GEN dataset for urban cross-view synthesis.
Findings
Outperforms existing methods on CVUSA, CVACT, and VIGOR-GEN datasets.
Achieves higher retrieval accuracy (R@1) and better synthesis quality (lower FID).
Bridges information retrieval and image synthesis for complex viewpoint variations.
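The R@1 figure cited above is the usual top-1 recall for retrieval. As a rough illustration only, the sketch below computes recall@k from a query-by-gallery similarity matrix. It assumes that the correct match for query i is gallery item i; the function name `recall_at_k` and the toy scores are hypothetical and do not come from the paper.

```python
def recall_at_k(similarity, k=1):
    """Fraction of queries whose true match is among the k best-scoring gallery items.

    similarity[i][j] is the score between query i and gallery item j.
    Assumption for this sketch: the true match of query i is gallery item i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Gallery indices sorted by descending score, truncated to the top k.
        top_k = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
        if i in top_k:
            hits += 1
    return hits / len(similarity)


# Toy scores (made up): queries 0 and 2 rank their own match first, query 1 does not.
sim = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.1, 0.2, 0.7],
]
r1 = recall_at_k(sim, k=1)
r2 = recall_at_k(sim, k=2)
```

With these toy scores, two of the three queries retrieve their match at rank 1, and all three do so within the top 2.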
Abstract
Information retrieval techniques have demonstrated exceptional capabilities in identifying semantic similarities across diverse domains through robust feature representations. However, their potential in guiding synthesis tasks, particularly cross-view image synthesis, remains underexplored. Cross-view image synthesis presents significant challenges in establishing reliable correspondences between drastically different viewpoints. To address this, we propose a novel retrieval-guided framework that reimagines how retrieval techniques can facilitate effective cross-view image synthesis. Unlike existing methods that rely on auxiliary information, such as semantic segmentation maps or preprocessing modules, our retrieval-guided framework captures semantic similarities across different viewpoints, trained through contrastive learning to create a smooth embedding space. Furthermore, a novel…
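The abstract states that the retrieval embedder is trained with contrastive learning but, as excerpted here, does not name the loss. The sketch below uses InfoNCE, a common contrastive objective, purely as an assumed stand-in. The names `info_nce`, `ground_embs`, and `aerial_embs`, the temperature value, and the toy vectors are all hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce(ground_embs, aerial_embs, temperature=0.1):
    """One-directional InfoNCE loss (ground -> aerial) over paired embeddings.

    Assumption for this sketch: ground_embs[i] and aerial_embs[i] are a matching
    pair, and every other aerial embedding in the batch is a negative.
    """
    g = [l2_normalize(v) for v in ground_embs]
    a = [l2_normalize(v) for v in aerial_embs]
    total = 0.0
    for i, gi in enumerate(g):
        # Temperature-scaled cosine similarity to every aerial embedding.
        logits = [sum(x * y for x, y in zip(gi, aj)) / temperature for aj in a]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy with the matching pair as the target class.
        total += log_denom - logits[i]
    return total / len(g)


# Toy 2-D embeddings (made up): aligned pairs versus deliberately swapped pairs.
ground = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = info_nce(ground, [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = info_nce(ground, [[0.0, 1.0], [1.0, 0.0]])
```

The loss is close to zero when each ground embedding sits next to its paired aerial embedding and grows when the pairs are swapped, which is the pressure that pulls matching views together in the embedding space.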
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
- The proposed method demonstrates superior performance compared to previous approaches, and the overall writing is commendable.
- The paper structure is well organized and easy to understand.
- This work is akin to an incremental advancement, merely substituting one embedder for another, which limits its overall innovativeness.
- Could you provide a visual comparison with CROSSVIEWDIFF [1]?
- Could you present some visual examples to better illustrate the advantages of selecting a retrieval model as an embedder?
- Some citations of methods in the paper need correction, such as the one for CROSSVIEWDIFF.
[1] CROSSVIEWDIFF: A CROSS-VIEW DIFFUSION MODEL FOR
1. This paper achieves cross-view image synthesis without relying on additional semantic segmentation maps.
2. The proposed method demonstrates state-of-the-art performance.
3. The ablation studies are comprehensive.
1. Figure 3 contains question marks (mojibake), making it difficult to interpret. Additionally, the textual description of the overall architecture lacks clarity, which hinders understanding of the details.
2. The motivation for not using preprocessing is unconvincing. The authors claim that polar transformation or geographic projection is computationally burdensome and complex, but these methods are not computationally intensive compared to a network. Additionally, if the decision to avoid thes
This paper proposes a retrieval-guided framework for cross-view image synthesis, which utilizes a retrieval network as an embedder to effectively address the domain gap between different views. This approach preserves both shared and view-specific semantic information while optimizing the generation process, thereby enhancing the quality and practical utility of cross-view image synthesis. The newly introduced VIGOR-GEN dataset enriches urban cross-view image synthesis, offering realistic center
Regarding quantitative evaluation, the a2g task is compared with diffusion-based methods and achieves promising results. However, it appears that the g2a task is only compared with GAN-based methods. Could this imply that the proposed approach may have some limitations in g2a performance relative to diffusion-based models? To the best of my knowledge, there appear to be the following similar works: AerialDiffusion [1], SkyDiffusion [3], and Cross-View Meets Diffusion [2]. [1] Kothandaraman D, Zhou
+ Leveraging viewpoint-invariant retrieval features for cross-view synthesis is interesting.
+ The introduction of random noise to enhance diversity makes sense, since satellite-to-ground cross-view image synthesis is inherently a one-to-many task due to severe occlusions and different illumination/weather conditions.
+ The proposed method achieves state-of-the-art performance on three benchmarks.
1. While this paper achieves state-of-the-art performance, I am concerned about the technical contributions' novelty and soundness. (a). My overall understanding of this paper's contribution is that this paper leverages retrieval features for cross-view synthesis, which is nice but somewhat incremental. Other modifications, such as attentional AdaIN block and noise injection, are not new in the GAN community. Furthermore, it seems this paper has incorporated many engineering network architectu
- The technique uses existing methods for mapping, retrieval, and GANs to compose an effective solution for cross-view synthesis.
- The dataset introduced fills a gap by providing urban setting data.
- Improved correspondence was seen between cross-view pairs relative to reported methods.
- W1: Motivation of the work is not well-grounded. The manuscript lists application areas (L48) without specifying in sufficient detail how cross-view image synthesis as presented in the current work would fit into those applications. One way to address this weakness would be to demonstrate empirically how this work would be used down-stream. For instance, if this is meant to be useful for cross-view localization which also uses common datasets employed in this work, then down-st
Taxonomy
Topics: Image Retrieval and Classification Techniques · Advanced Image and Video Retrieval Techniques · Advanced Vision and Imaging
Methods: Contrastive Learning · Focus
