Open-Source Image Editing Models Are Zero-Shot Vision Learners

Wei Liu; Jiaxin Lin; Rui Chen

arXiv:2605.04566 · cs.CV · May 7, 2026

TL;DR

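This paper systematically evaluates three open-source image-editing models (Qwen-Image-Edit, FireRed-Image-Edit, and LongCat-Image-Edit) and finds that they show non-trivial zero-shot ability on dense visual prediction tasks (monocular depth estimation, surface normal estimation, and semantic segmentation) without any fine-tuning.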

Contribution

It provides the first comprehensive benchmark of zero-shot visual understanding in open-source image-editing models, covering both geometric tasks (depth and surface normals) and semantic segmentation.

Findings

01  FireRed-Image-Edit surpasses fine-tuned models on surface normals

02  LongCat-Image-Edit achieves high accuracy in depth estimation

03  Qwen-Image-Edit performs well in semantic segmentation

Abstract

Recent studies have shown that large generative models can solve vision tasks they were not explicitly trained for. However, existing evidence relies on closed-source models (Veo 3, Nano Banana Pro) or requires task-specific instruction tuning, leaving open whether publicly available image-editing models possess zero-shot vision abilities out of the box. We conduct a systematic evaluation of three open-source image-editing models (Qwen-Image-Edit, FireRed-Image-Edit, and LongCat-Image-Edit) on dense visual prediction tasks without any fine-tuning. We benchmark monocular depth estimation on NYUv2 and DIODE, surface normal estimation on NYUv2, and semantic segmentation on Cityscapes, covering both geometric and semantic scene understanding. Results show that open-source image-editing models exhibit non-trivial zero-shot visual understanding. On NYUv2 surface normals,…
