Synthetic History: Evaluating Visual Representations of the Past in Diffusion Models

Maria-Teresa De Rosa Palmini; Eva Cetinic

arXiv:2505.17064·cs.CV·February 23, 2026

3 Reviews

TL;DR

This paper introduces a benchmark to evaluate how well text-to-image diffusion models depict historical contexts, revealing systematic inaccuracies and stereotypes in generated images across different eras.

Contribution

It presents a new benchmark dataset and evaluation protocol for assessing historical accuracy in diffusion model imagery, addressing a previously underexplored area.

Findings

01

Models often stereotype historical eras with unstated stylistic cues.

02

Generated images frequently contain anachronisms like modern artifacts.

03

Demographic representations in images often do not match historical plausibility.

Abstract

As Text-to-Image (TTI) diffusion models become increasingly influential in content creation, growing attention is being directed toward their societal and cultural implications. While prior research has primarily examined demographic and cultural biases, the ability of these models to accurately represent historical contexts remains largely underexplored. To address this gap, we introduce a benchmark for evaluating how TTI models depict historical contexts. The benchmark combines HistVis, a dataset of 30,000 synthetic images generated by three state-of-the-art diffusion models from carefully designed prompts covering universal human activities across multiple historical periods, with a reproducible evaluation protocol. We evaluate generated imagery across three key aspects: (1) Implicit Stylistic Associations: examining default visual styles associated with specific eras; (2) Historical…

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 4

Strengths

1. The paper explores the world knowledge embedded in text-to-image (TTI) models, knowledge that parallels that of modern Large Language Models (LLMs). By examining how these models represent historical contexts, the authors go beyond the typical focus on creativity or imagination to probe their practical understanding of reality. This represents an important and relatively underexplored research direction.
2. The findings reveal previously unexamined layers of bias within TTI models. For instanc

Weaknesses

1. The paper’s positioning could be clearer. The provided dataset, being composed of the outputs from T2I models applied to a set of prompts, offers limited standalone value to the community—apart from ensuring reproducibility. The true contribution appears to lie in the methodological framework for analyzing the historical biases in generative models. It would therefore strengthen the paper to explicitly present the work as proposing a benchmark for estimating VLM biases in representing histori

Reviewer 02 · Rating 8 · Confidence 3

Strengths

The paper is a great read, easy to follow, with interesting findings and extensive evaluations. The authors have designed careful and sound evaluation schemes for each of the three aspects that they study in the paper. I especially liked that they used multiple VLMs for evaluation rather than relying on a single judge. All details of the study have been laid out with complete transparency. Multiple qualitative examples were very helpful in getting the point across for each aspect

Weaknesses

A comparison with related studies in this direction, covering the number of samples and the evaluation strategies used, would help to better position the paper.

Reviewer 03 · Rating 6 · Confidence 4

Strengths

- The paper highlights an important problem of evaluating historical representation in text-to-image models when depicting generic, everyday activities and provides a clear motivation for addressing it.
- The paper provides and evaluates an interesting dataset consisting of images that depict a comprehensive set of timeless activities spanning approximately five and a half centuries, offering strong coverage across diverse historical periods.
- The findings, particularly the observation of anac

Weaknesses

- Although the VSG score is supported by a robust methodology, the reason for evaluating biases in style associations and the explanation of the distinct style classes are not sufficiently motivated.
- The anachronism detection uses an LLM to get a list of possible objects that could be anachronistic in a given activity. How can we ensure that this list is exhaustive? Were other methods, like object detection, considered for this task?
- Evaluating demographic representation in the generated s
