Pix2Next: Leveraging Vision Foundation Models for RGB to NIR Image Translation
Youngwan Jin, Incheol Park, Hanbin Song, Hyeongjin Ju, Yagiz Nalcakan, and Shiho Kim

TL;DR
Pix2Next introduces a novel framework that leverages vision foundation models and cross-attention mechanisms to generate high-quality NIR images from RGB inputs, improving realism and utility for computer vision tasks.
Contribution
The paper presents a new RGB-to-NIR translation method that combines a vision foundation model with cross-attention, a multi-scale PatchGAN discriminator, and specialized loss functions, and reports outperforming existing approaches.
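To make the cross-attention idea concrete, here is a minimal single-head sketch in plain Python. It is not the paper's implementation: the summary does not say which features act as queries, keys, or values, so the names `decoder_tokens` and `vfm_tokens` are hypothetical, and the learned projection matrices used in practice are omitted.

```python
import math

def softmax(row):
    # Numerically stable softmax over a list of scores.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    # Scaled dot-product attention: each query attends over all keys
    # and returns a weighted average of the corresponding values.
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

# Hypothetical toy inputs: two decoder tokens query two foundation-model tokens.
decoder_tokens = [[1.0, 0.0], [0.0, 1.0]]   # assumed role: queries
vfm_tokens = [[1.0, 0.0], [0.0, 1.0]]       # assumed role: keys and values
attended = cross_attention(decoder_tokens, vfm_tokens, vfm_tokens)
```

Each output row is a convex combination of the value vectors, weighted towards the keys most similar to that query.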
Findings
FID score improved by 34.81% over existing methods (lower FID is better)
Enhanced NIR image quality demonstrated on RANUS dataset
Improved downstream object detection performance using generated NIR images
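Because a lower FID is better, a percentage improvement of this kind is usually read as a relative reduction from a baseline score. The sketch below shows that arithmetic under that assumption. The input values are hypothetical placeholders, not numbers from the paper.

```python
def relative_fid_improvement(baseline_fid, new_fid):
    # Percentage reduction in FID relative to the baseline (lower FID is better).
    if baseline_fid <= 0:
        raise ValueError("baseline_fid must be positive")
    return 100.0 * (baseline_fid - new_fid) / baseline_fid

# Hypothetical scores chosen only to illustrate the formula.
hypothetical_baseline = 50.0
hypothetical_new = 40.0
improvement = relative_fid_improvement(hypothetical_baseline, hypothetical_new)
```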
Abstract
This paper proposes Pix2Next, a novel image-to-image translation framework designed to address the challenge of generating high-quality Near-Infrared (NIR) images from RGB inputs. Our approach leverages a state-of-the-art Vision Foundation Model (VFM) within an encoder-decoder architecture, incorporating cross-attention mechanisms to enhance feature integration. This design captures detailed global representations and preserves essential spectral characteristics, treating RGB-to-NIR translation as more than a simple domain transfer problem. A multi-scale PatchGAN discriminator ensures realistic image generation at various detail levels, while carefully designed loss functions couple global context understanding with local feature preservation. We performed experiments on the RANUS dataset to demonstrate Pix2Next's advantages in quantitative metrics and visual quality, improving the FID…
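The sketch below illustrates how a multi-scale PatchGAN discriminator can judge realism at several detail levels. It assumes the standard 70x70 PatchGAN layer layout from pix2pix (4x4 convolutions with strides 2, 2, 2, 1, 1) and a 2x downsampling between scales. The summary does not state Pix2Next's actual discriminator configuration, so `LAYERS`, `num_scales`, and the 256-pixel input are illustrative assumptions.

```python
# Assumed (kernel, stride, padding) per conv layer: the common 70x70 PatchGAN layout.
LAYERS = [(4, 2, 1), (4, 2, 1), (4, 2, 1), (4, 1, 1), (4, 1, 1)]

def receptive_field(layers):
    # Walk backwards from one output unit to the input pixels it covers.
    rf = 1
    for kernel, stride, _ in reversed(layers):
        rf = (rf - 1) * stride + kernel
    return rf

def output_size(size, layers):
    # Spatial size of the real/fake score map for a square input.
    for kernel, stride, pad in layers:
        size = (size + 2 * pad - kernel) // stride + 1
    return size

def multiscale_summary(image_size, num_scales=3):
    # Each scale sees the image downsampled by 2x, so one score covers
    # roughly twice as many original pixels per side as the previous scale.
    rf = receptive_field(LAYERS)
    summary = []
    for s in range(num_scales):
        factor = 2 ** s
        scaled = image_size // factor
        summary.append({
            "input": scaled,
            "patch_grid": output_size(scaled, LAYERS),
            "approx_rf_in_original_pixels": rf * factor,
        })
    return summary

summary = multiscale_summary(256)
```

Under these assumptions, the finest scale scores small local patches (texture-level detail) while the coarsest scale scores regions covering most of the image (global structure).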
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques
Methods: PatchGAN
