Generating metamers of human scene understanding
Ritik Raina, Abe Leite, Alexandros Graikos, Seoyoung Ahn, Dimitris Samaras, Gregory J. Zelinsky

TL;DR
MetamerGen is a novel latent diffusion model that generates scene metamers by combining peripheral gist and fixation-based detailed information, aiding understanding of human scene perception.
Contribution
Introduces MetamerGen, a new method for generating scene metamers using dual-stream representations of foveated images, advancing the study of human scene understanding.
Findings
MetamerGen can generate scene metamers aligned with human perception.
High-level semantic features strongly influence metamerism.
Fixation-based conditioning improves the perceptual match of generated scenes.
Abstract
Human vision combines low-resolution "gist" information from the visual periphery with sparse but high-resolution information from fixated locations to construct a coherent understanding of a visual scene. In this paper, we introduce MetamerGen, a tool for generating scenes that are aligned with latent human scene representations. MetamerGen is a latent diffusion model that combines peripherally obtained scene gist information with information obtained from scene-viewing fixations to generate image metamers for what humans understand after viewing a scene. Generating images from both high and low resolution (i.e. "foveated") inputs constitutes a novel image-to-image synthesis problem, which we tackle by introducing a dual-stream representation of the foveated scenes consisting of DINOv2 tokens that fuse detailed features from fixated areas with peripherally degraded features capturing…
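The dual-stream idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the random arrays stand in for DINOv2 patch tokens, and the function names (`fixation_patch_mask`, `dual_stream_tokens`), the 224-pixel image, the 28-pixel foveal radius, and the zero-masking of non-fixated patches are all assumptions chosen for illustration. It shows one way a foveal stream (sharp tokens kept only near fixations) and a peripheral stream (tokens from a degraded image) might be assembled.

```python
import numpy as np

def fixation_patch_mask(fixations, image_size, patch_size, radius):
    """Boolean (G, G) grid: True where a patch centre lies within
    `radius` pixels of any fixation given as (x, y) pixel coordinates."""
    grid = image_size // patch_size
    centres = (np.arange(grid) + 0.5) * patch_size
    ys, xs = np.meshgrid(centres, centres, indexing="ij")
    mask = np.zeros((grid, grid), dtype=bool)
    for fx, fy in fixations:
        mask |= (xs - fx) ** 2 + (ys - fy) ** 2 <= radius ** 2
    return mask

def dual_stream_tokens(sharp_tokens, blurred_tokens, mask):
    """Return (foveal_stream, peripheral_stream, keep).
    The foveal stream keeps sharp tokens at fixated patches and zeros
    elsewhere (an assumed masking scheme); the peripheral stream is the
    full set of tokens from the degraded image."""
    keep = mask.reshape(-1)
    foveal_stream = np.where(keep[:, None], sharp_tokens, 0.0)
    return foveal_stream, blurred_tokens, keep

# Placeholder features: random vectors stand in for real patch embeddings.
rng = np.random.default_rng(0)
image_size, patch_size, dim = 224, 14, 8   # hypothetical sizes
grid = image_size // patch_size
sharp = rng.normal(size=(grid * grid, dim))
blurred = rng.normal(size=(grid * grid, dim))

fixations = [(56.0, 56.0), (170.0, 120.0)]  # hypothetical gaze points
mask = fixation_patch_mask(fixations, image_size, patch_size, radius=28.0)
foveal_stream, peripheral_stream, keep = dual_stream_tokens(sharp, blurred, mask)
print(int(mask.sum()), "of", mask.size, "patches fall inside the foveal regions")
```

In a real pipeline the two streams would condition the diffusion model; how they are fused is specified in the paper itself, not in this sketch.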
Peer Reviews
Decision · ICLR 2026 Oral
This paper introduces a method for understanding how humans process visual information and which kinds of features matter in visual perception. It designs a reasonable experimental setting for interpreting human scene understanding: eye movements are tracked and used to generate "fake" images with a latent diffusion model (LDM), and participants are asked to judge the fake images. The findings provide strong evidence that human memory and understanding of a scene rely heavily on high-level semant
This paper only considers a fixed-resolution scenario. It would be interesting to investigate whether varying the aspect ratio affects human scene understanding. For example, show a higher-resolution version of the original image, or a 9:16 image (the original is 1:1) that adds or removes some side content while keeping the center content.
Overall, this is a solid study with rigorous human subject experiments and valid deep learning algorithms. It provides a novel paradigm to study human visual scene perception. The analysis provides insights into the similarity between DNN and human scene perception, and which level of features is critical.
**Insufficient background on human visual scene perception** Considering that the paper is submitted to a machine learning venue, neuroscience/psychology jargon should be explained. Terms like "metamer" should be defined clearly early on, and the traditional study paradigms for them should be introduced. **Figure 1** Inconsistent color in Figure 1: in the top part the fixation is colored blue, but in the bottom part it is red. **Structure** Important results, such as lines 478-479 "While MetamerGen
Formulation of a foveated image-to-image synthesis task that fuses sparse foveal tokens and blurry peripheral tokens into a single latent diffusion generator. A real-time gaze-contingent same/different behavioral paradigm (45 participants) showing that many generated images are judged “same” (i.e., metameric) and that semantic/high-level alignment best predicts metamerism for human-fixation conditioning. The paper repurposes the concept of metamers (historically a low-level vision/color co
Role / necessity of foveation is not fully ablated or isolated: The paper reports that MetamerGen can generate metamers conditioned on random fixations as well as human fixations and reports similar overall fooling rates (27.7% vs 29.4%) but then emphasizes the scientific value of human fixations because they produce stronger correlations with feature hierarchies. This raises two concerns. If random fixations can produce similar fool rates, how much unique value does foveation / fixation cond
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Visual Attention and Saliency Detection · Multimodal Machine Learning Applications
