HYPE-EDIT-1: Benchmark for Measuring Reliability in Frontier Image Editing Models

Wing Chan; Richard Allen

arXiv:2602.00105·cs.CV·February 3, 2026

TL;DR

HYPE-EDIT-1 is a 100-task benchmark that measures the reliability and effective cost of image editing models in real-world workflows, accounting for retries and human review time.

Contribution

The paper introduces HYPE-EDIT-1, a new benchmark with a standardized evaluation framework for measuring success rates and costs of image editing models in practical scenarios.

Findings

01. Per-attempt pass rates range from 34% to 83%.

02. Effective cost per successful edit varies from USD 0.66 to 1.42.

03. Models with lower per-image prices may incur higher total costs due to retries.

Abstract

Public demos of image editing models are typically best-case samples; real workflows pay for retries and review time. We introduce HYPE-EDIT-1, a 100-task benchmark of reference-based marketing/design edits with binary pass/fail judging. For each task we generate 10 independent outputs to estimate per-attempt pass rate, pass@10, expected attempts under a retry cap, and an effective cost per successful edit that combines model price with human review time. We release 50 public tasks and maintain a 50-task held-out private split for server-side evaluation, plus a standardized JSON schema and tooling for VLM and human-based judging. Across the evaluated models, per-attempt pass rates span 34-83 percent and effective cost per success spans USD 0.66-1.42. Models that have low per-image pricing are more expensive when you consider the total effective cost of retries and human reviews.
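
The metrics named in the abstract can be illustrated with a minimal sketch. The formulas below are assumptions, not the paper's definitions: attempts are treated as independent with a fixed pass probability p, a user retries until the first pass or until a retry cap is reached, and every attempt incurs both the model price and a fixed human review cost. All function and parameter names are hypothetical.

```python
from typing import Sequence


def per_attempt_pass_rate(results: Sequence[Sequence[bool]]) -> float:
    """Fraction of all attempts, pooled across tasks, that passed."""
    total = sum(len(task) for task in results)
    passed = sum(sum(task) for task in results)
    return passed / total


def pass_at_n(results: Sequence[Sequence[bool]]) -> float:
    """Fraction of tasks with at least one passing attempt among its n outputs."""
    return sum(any(task) for task in results) / len(results)


def expected_attempts(p: float, cap: int) -> float:
    """Expected number of attempts when retrying until the first pass or the cap.

    Assumes independent attempts with pass probability p (an assumption of this sketch).
    Attempt k+1 happens only if the first k attempts all failed.
    """
    return sum((1 - p) ** k for k in range(cap))


def success_within_cap(p: float, cap: int) -> float:
    """Probability that at least one of up to `cap` attempts passes."""
    return 1 - (1 - p) ** cap


def effective_cost_per_success(
    p: float, cap: int, price_per_image: float, review_cost_per_attempt: float
) -> float:
    """Expected spend divided by the probability of ending with a passing edit."""
    if p <= 0:
        return float("inf")
    spend = expected_attempts(p, cap) * (price_per_image + review_cost_per_attempt)
    return spend / success_within_cap(p, cap)


if __name__ == "__main__":
    # Made-up numbers for illustration only; they are not taken from the paper.
    cheap = effective_cost_per_success(0.4, 10, price_per_image=0.05, review_cost_per_attempt=0.40)
    pricey = effective_cost_per_success(0.8, 10, price_per_image=0.20, review_cost_per_attempt=0.40)
    print(f"cheap model:  USD {cheap:.2f} per successful edit")
    print(f"pricey model: USD {pricey:.2f} per successful edit")
```

Under these assumptions the retry cap cancels out and the effective cost reduces to (price + review cost) / p, so a model with a low per-image price but a low pass rate can cost more per successful edit than a pricier, more reliable one. The benchmark's own cost model may differ, for example in how review time is priced.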

Taxonomy

Topics: Multimodal Machine Learning Applications · Scientific Computing and Data Management · Mobile Crowdsensing and Crowdsourcing