HYPE-EDIT-1: Benchmark for Measuring Reliability in Frontier Image Editing Models
Wing Chan, Richard Allen

TL;DR
HYPE-EDIT-1 is a 100-task benchmark for evaluating the reliability and cost-effectiveness of image editing models in real-world workflows, accounting for the cost of retries and human review.
Contribution
The paper introduces HYPE-EDIT-1, a new benchmark with a standardized evaluation framework for measuring success rates and costs of image editing models in practical scenarios.
Findings
Per-attempt pass rates range from 34% to 83%.
Effective cost per successful edit varies from USD 0.66 to 1.42.
Models with lower per-image prices may incur higher total costs due to retries.
Abstract
Public demos of image editing models are typically best-case samples; real workflows pay for retries and review time. We introduce HYPE-EDIT-1, a 100-task benchmark of reference-based marketing/design edits with binary pass/fail judging. For each task we generate 10 independent outputs to estimate per-attempt pass rate, pass@10, expected attempts under a retry cap, and an effective cost per successful edit that combines model price with human review time. We release 50 public tasks and maintain a 50-task held-out private split for server-side evaluation, plus a standardized JSON schema and tooling for VLM and human-based judging. Across the evaluated models, per-attempt pass rates span 34-83 percent and effective cost per success spans USD 0.66-1.42. Models that have low per-image pricing are more expensive when you consider the total effective cost of retries and human reviews.
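The abstract names four metrics but this page does not give their formulas, so the following is only a minimal sketch of how they could be computed. It assumes that attempts are independent with a constant per-attempt pass probability, that every attempt is paid for and reviewed by a human, and that retries stop at the first pass or at the cap. All function names, prices, and review costs here are hypothetical and are not taken from the paper.

```python
import math


def pass_rate(outcomes):
    """Fraction of attempts judged as pass; outcomes is a list of booleans."""
    if not outcomes:
        raise ValueError("need at least one judged attempt")
    return sum(outcomes) / len(outcomes)


def passed_within_samples(outcomes):
    """Per-task pass@k for k = len(outcomes): did any attempt pass?"""
    return any(outcomes)


def expected_attempts(p, cap):
    """Expected attempts when retrying until the first pass or until cap.

    Attempt k+1 is only made if the first k attempts all failed,
    which happens with probability (1 - p) ** k.
    """
    return sum((1 - p) ** k for k in range(cap))


def success_probability(p, cap):
    """Probability of at least one pass within cap attempts."""
    return 1 - (1 - p) ** cap


def effective_cost_per_success(p, cap, price_per_image, review_cost):
    """Expected spend per task divided by the probability the task succeeds.

    Assumption: each attempt costs price_per_image plus review_cost.
    """
    s = success_probability(p, cap)
    if s == 0:
        return math.inf
    return expected_attempts(p, cap) * (price_per_image + review_cost) / s


# Hypothetical inputs: only the two pass rates echo the range reported above.
REVIEW_COST = 0.40  # assumed human review cost per attempt, in USD
cheap = effective_cost_per_success(0.34, 4, 0.02, REVIEW_COST)
pricey = effective_cost_per_success(0.83, 4, 0.15, REVIEW_COST)
print(f"cheap model: {cheap:.2f} USD, pricier model: {pricey:.2f} USD")
```

Under these assumptions the retry cap cancels out and the effective cost reduces to (price per image + review cost) / pass rate, which makes clear why a low per-image price does not help when the pass rate is low and review time dominates. A benchmark-level pass@10 would be the mean of `passed_within_samples` over all tasks.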
Taxonomy
Topics: Multimodal Machine Learning Applications · Scientific Computing and Data Management · Mobile Crowdsensing and Crowdsourcing
