Scaling Image and Video Generation via Test-Time Evolutionary Search

Haoran He; Jiajun Liang; Xintao Wang; Pengfei Wan; Di Zhang; Kun Gai; Ling Pan

arXiv:2505.17618·cs.CV·May 26, 2025

TL;DR

EvoSearch is a general, efficient test-time scaling method for image and video generative models. It improves sample quality and diversity without additional training by framing test-time scaling as an evolutionary search.

Contribution

The paper presents EvoSearch, a novel test-time scaling approach using evolutionary search principles to enhance diffusion and flow models for image and video generation.
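
To make "evolutionary search principles" concrete, here is a minimal, generic sketch of a selection-and-mutation loop. It is not the paper's algorithm: EvoSearch operates on the denoising trajectories of diffusion and flow models and scores candidates with learned reward models, whereas this toy searches over plain vectors with a hand-written reward. The names `toy_reward` and `evolutionary_search`, and every hyperparameter value, are hypothetical and chosen only for illustration.

```python
import random


def toy_reward(x):
    # Hypothetical stand-in for a reward model: higher is better,
    # with the maximum (0.0) reached when every coordinate equals 3.0.
    return -sum((xi - 3.0) ** 2 for xi in x)


def evolutionary_search(reward_fn, dim=4, pop_size=16, n_elite=4,
                        generations=20, sigma=0.5, seed=0):
    rng = random.Random(seed)
    # Initial population: Gaussian samples, loosely analogous to
    # drawing several initial noise candidates for a generative model.
    population = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
                  for _ in range(pop_size)]
    best_history = []
    for _ in range(generations):
        # Selection: rank by reward and keep the top candidates unchanged.
        ranked = sorted(population, key=reward_fn, reverse=True)
        elites = ranked[:n_elite]
        best_history.append(reward_fn(elites[0]))
        # Mutation: perturb randomly chosen elites with Gaussian noise
        # to explore their neighborhood and refill the population.
        children = []
        while len(elites) + len(children) < pop_size:
            parent = rng.choice(elites)
            children.append([p + rng.gauss(0.0, sigma) for p in parent])
        population = elites + children
    best = max(population, key=reward_fn)
    return best, best_history


if __name__ == "__main__":
    best, history = evolutionary_search(toy_reward)
    print("best reward per generation:", [round(h, 3) for h in history])
    print("final candidate:", [round(b, 3) for b in best])
```

Because the elites are carried over unchanged, the best reward in this sketch never decreases from one generation to the next. The mutation scale `sigma` trades exploration against refinement; the sketch makes no claim about how the paper sets or schedules such quantities.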

Findings

01. Outperforms existing TTS methods in quality and diversity

02. Effective across diffusion and flow architectures for images and videos

03. Demonstrates strong generalizability to unseen metrics

Abstract

As the marginal cost of scaling computation (data and parameters) during model pre-training continues to increase substantially, test-time scaling (TTS) has emerged as a promising direction for improving generative model performance by allocating additional computation at inference time. While TTS has demonstrated significant success across multiple language tasks, there remains a notable gap in understanding the test-time scaling behaviors of image and video generative models (diffusion-based or flow-based models). Although recent works have initiated exploration into inference-time strategies for vision tasks, these approaches face critical limitations: being constrained to task-specific domains, exhibiting poor scalability, or falling into reward over-optimization that sacrifices sample diversity. In this paper, we propose Evolutionary Search (EvoSearch), a novel,…

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01·Rating 4·Confidence 3

Strengths

EvoSearch introduces a fresh perspective by framing inference-time optimization as an evolutionary process. It leverages selection and mutation operators tailored to the diffusion denoising trajectory, which is an original way to maintain diversity and avoid the collapse seen in earlier methods. When enough computation is allocated, EvoSearch does improve generation quality. The experiments show notable gains in output metrics as the number of inference steps increases. The experimental results

Weaknesses

While EvoSearch is novel in framing the problem as an evolutionary algorithm, the practical advantages over existing particle sampling methods are not very convincing. In the Stable Diffusion 2.1 image experiments, EvoSearch’s improvement over a standard particle filtering baseline is quite small. One could argue EvoSearch is a variant of particle sampling with an evolutionary selection twist, yielding comparable results to known methods on images. This weakens the claim of a significant contrib

Reviewer 02·Rating 6·Confidence 2

Strengths

- Novelty: Using evolutionary algorithms to optimize denoising trajectories during inference is original and conceptually appealing.
- Strong empirical evaluation: Extensive experiments, ablations, and analysis support the method's effectiveness.
- General applicability: Works across diffusion and flow-based models, and both images and videos.
- No training overhead: Fits well within the TTS paradigm, improving sampling quality without retraining.

Weaknesses

- Clarity of algorithm description: The current presentation of EvoSearch's workflow, particularly in the overview figure and pseudocode, lacks clarity and makes it difficult to precisely understand the evolutionary operators.
- Ambiguous visualization (Figure 3): The diagram includes unexplained symbols (e.g., check marks, arrows, population filtering meaning, mutation signs), making the process non-transparent. It would strongly benefit from a redesign with clearer semantics and annotations ex

Reviewer 03·Rating 6·Confidence 3

Strengths

- Performance is promising.
- The paper provides extensive analysis, including a toy example and an ablation study over hyperparameters.
- The proposed idea seems to be novel.

Weaknesses

- The writing could be improved. Although the paper focuses on reinterpreting TTS from the perspective of evolutionary algorithms, the terminology used is unconventional. Consequently, the roles of hyperparameters such as the population scheduler $K$ and evolution schedule $T$ are unclear and difficult to interpret.
- Line 265: Chung et al., 2023 should be corrected to [1].
- Computing the reward on the fully denoised $x_0$ (lines 264–268) is not new. [2], which is missing from the related wor
