Open-Source Image Editing Models Are Zero-Shot Vision Learners
Wei Liu, Jiaxin Lin, Rui Chen

TL;DR
This paper systematically evaluates three open-source image-editing models and finds that, without any fine-tuning, they show non-trivial zero-shot ability on dense visual prediction tasks: monocular depth estimation, surface normal estimation, and semantic segmentation.
Contribution
It provides the first comprehensive benchmark demonstrating zero-shot visual understanding in open-source image-editing models, highlighting their emergent capabilities.
Findings
FireRed-Image-Edit surpasses fine-tuned models on surface normals
LongCat-Image-Edit achieves high accuracy in depth estimation
Qwen-Image-Edit performs well in semantic segmentation
Abstract
Recent studies have shown that large generative models can solve vision tasks they were not explicitly trained for. However, existing evidence relies on closed-source models (Veo 3, Nano Banana Pro) or requires task-specific instruction tuning, leaving open whether publicly available image-editing models possess zero-shot vision abilities out of the box. We conduct a systematic evaluation of three open-source image-editing models (Qwen-Image-Edit, FireRed-Image-Edit, and LongCat-Image-Edit) on dense visual prediction tasks without any fine-tuning. We benchmark monocular depth estimation on NYUv2 and DIODE, surface normal estimation on NYUv2, and semantic segmentation on Cityscapes, covering both geometric and semantic scene understanding. Results show that open-source image-editing models exhibit non-trivial zero-shot visual understanding. On NYUv2 surface normals,…
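The abstract names three benchmark tasks but, as excerpted here, not the metrics used to score them. As a rough illustration only, the sketch below implements metrics that are commonly reported for these tasks: absolute relative error and δ < 1.25 accuracy for depth, mean angular error for surface normals, and mean IoU for segmentation. The function names are hypothetical. The choice of metrics, and the assumption that model outputs have already been decoded from edited images into numeric arrays (with any scale or shift alignment applied beforehand), are assumptions rather than the paper's stated protocol.

```python
import numpy as np


def depth_metrics(pred, gt, eps=1e-6):
    """AbsRel and delta<1.25 accuracy over pixels with valid (gt > 0) depth.

    Assumption: pred is already in the same units/scale as gt.
    """
    mask = gt > 0
    p = np.maximum(pred[mask], eps)  # guard against zero predictions
    g = gt[mask]
    abs_rel = float(np.mean(np.abs(p - g) / g))
    ratio = np.maximum(p / g, g / p)
    delta1 = float(np.mean(ratio < 1.25))
    return abs_rel, delta1


def mean_angular_error(pred, gt):
    """Mean angle in degrees between two normal maps of shape (H, W, 3)."""
    p = pred / np.linalg.norm(pred, axis=-1, keepdims=True)
    g = gt / np.linalg.norm(gt, axis=-1, keepdims=True)
    cos = np.clip(np.sum(p * g, axis=-1), -1.0, 1.0)
    return float(np.degrees(np.arccos(cos)).mean())


def mean_iou(pred, gt, num_classes):
    """Mean intersection-over-union, skipping classes absent from both maps."""
    ious = []
    for c in range(num_classes):
        inter = np.sum((pred == c) & (gt == c))
        union = np.sum((pred == c) | (gt == c))
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious))
```

Each function takes a single prediction and reference pair. A benchmark score would typically average these per-image values over a dataset split, though segmentation benchmarks often accumulate intersections and unions across the whole split before dividing.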
