TL;DR
EvoSearch introduces a general, efficient test-time scaling method for image and video generative models that improves quality and diversity without additional training, by framing the process as an evolutionary search.
Contribution
The paper presents EvoSearch, a novel test-time scaling approach using evolutionary search principles to enhance diffusion and flow models for image and video generation.
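To make the idea concrete, here is a minimal toy sketch of evolutionary search over initial noise at inference time. It is not the authors' algorithm: `toy_generate`, `toy_reward`, the elite count, and the variance-preserving mutation rule are all hypothetical stand-ins chosen for illustration, and a real setup would call a frozen diffusion or flow sampler and a learned reward model instead.

```python
import math
import random


def toy_generate(noise):
    # Hypothetical stand-in for a frozen sampler: maps noise to a "sample".
    return [0.5 * z for z in noise]


def toy_reward(sample):
    # Hypothetical reward: higher when the sample is close to a vector of ones.
    return -sum((s - 1.0) ** 2 for s in sample)


def score(noise):
    return toy_reward(toy_generate(noise))


def mutate(noise, strength, rng):
    # Assumed mutation rule: blend the parent with fresh Gaussian noise so the
    # child stays roughly standard-normal (variance-preserving mix).
    keep = math.sqrt(1.0 - strength ** 2)
    return [keep * z + strength * rng.gauss(0.0, 1.0) for z in noise]


def evolutionary_search(dim=4, population=16, elites=4,
                        generations=10, strength=0.3, seed=0):
    rng = random.Random(seed)
    pop = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(population)]
    history = []  # best reward seen in each generation
    for _ in range(generations):
        # Selection: keep the highest-reward candidates unchanged.
        ranked = sorted(pop, key=score, reverse=True)
        parents = ranked[:elites]
        history.append(score(parents[0]))
        # Mutation: refill the population with perturbed copies of the parents.
        children = [mutate(rng.choice(parents), strength, rng)
                    for _ in range(population - elites)]
        pop = parents + children
    best = max(pop, key=score)
    history.append(score(best))
    return best, history


if __name__ == "__main__":
    best, history = evolutionary_search()
    print("best reward per generation:", [round(h, 3) for h in history])
```

Because the elites are carried over unchanged, the best reward in this sketch never decreases from one generation to the next; the only model-side cost is extra sampler and reward evaluations, with no training step.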
Findings
Outperforms existing TTS methods in quality and diversity
Effective across diffusion and flow architectures for images and videos
Demonstrates strong generalizability to unseen metrics
Abstract
As the marginal cost of scaling computation (data and parameters) during model pre-training continues to increase substantially, test-time scaling (TTS) has emerged as a promising direction for improving generative model performance by allocating additional computation at inference time. While TTS has demonstrated significant success across multiple language tasks, there remains a notable gap in understanding the test-time scaling behaviors of image and video generative models (diffusion-based or flow-based models). Although recent works have initiated exploration into inference-time strategies for vision tasks, these approaches face critical limitations: being constrained to task-specific domains, exhibiting poor scalability, or falling into reward over-optimization that sacrifices sample diversity. In this paper, we propose Evolutionary Search (EvoSearch), a novel,…
Peer Reviews
Decision: Submitted to ICLR 2026
EvoSearch introduces a fresh perspective by framing inference-time optimization as an evolutionary process. It leverages selection and mutation operators tailored to the diffusion denoising trajectory, which is an original way to maintain diversity and avoid the collapse seen in earlier methods. When enough computation is allocated, EvoSearch does improve generation quality. The experiments show notable gains in output metrics as the number of inference steps increases. The experimental results
While EvoSearch is novel in framing the problem as an evolutionary algorithm, the practical advantages over existing particle sampling methods are not very convincing. In the Stable Diffusion 2.1 image experiments, EvoSearch’s improvement over a standard particle filtering baseline is quite small. One could argue EvoSearch is a variant of particle sampling with an evolutionary selection twist, yielding comparable results to known methods on images. This weakens the claim of a significant contrib
- Novelty: Using evolutionary algorithms to optimize denoising trajectories during inference is original and conceptually appealing.
- Strong empirical evaluation: Extensive experiments, ablations, and analysis support the method's effectiveness.
- General applicability: Works across diffusion and flow-based models, and both images and videos.
- No training overhead: Fits well within the TTS paradigm, improving sampling quality without retraining.
- Clarity of algorithm description: The current presentation of EvoSearch's workflow, particularly in the overview figure and pseudocode, lacks clarity and makes it difficult to precisely understand the evolutionary operators.
- Ambiguous visualization (Figure 3): The diagram includes unexplained symbols (e.g., check marks, arrows, population filtering meaning, mutation signs), making the process non-transparent. It would strongly benefit from a redesign with clearer semantics and annotations ex
- Performance is promising.
- The paper provides extensive analysis, including a toy example and an ablation study over hyperparameters.
- The proposed idea seems to be novel.
- The writing could be improved. Although the paper focuses on reinterpreting TTS from the perspective of evolutionary algorithms, the terminology used is unconventional. Consequently, the roles of hyperparameters such as the population scheduler $K$ and evolution schedule $T$ are unclear and difficult to interpret.
- Line 265: Chung et al., 2023 should be corrected to [1].
- Computing the reward on the fully denoised $x_0$ (lines 264–268) is not new. [2], which is missing from the related wor
