TL;DR
This paper introduces a benchmark to evaluate how well text-to-image diffusion models depict historical contexts, revealing systematic inaccuracies and stereotypes in generated images across different eras.
Contribution
It presents a new benchmark dataset and evaluation protocol for assessing historical accuracy in diffusion model imagery, addressing a previously underexplored area.
Findings
Models often stereotype historical eras with unstated stylistic cues.
Generated images frequently contain anachronisms like modern artifacts.
Demographic representations in images often do not match historical plausibility.
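As a rough illustration of how the anachronism finding could be quantified, the sketch below computes, per era, the fraction of images with at least one flagged anachronistic object. It is a minimal sketch under stated assumptions, not the paper's protocol: the function name `anachronism_rate_by_era`, the `(era, flagged_objects)` record format, and the sample data are all hypothetical.

```python
from collections import defaultdict


def anachronism_rate_by_era(records):
    """Return {era: fraction of images with >= 1 flagged object}.

    `records` is an iterable of (era, flagged_objects) pairs, one per image.
    This record format is an assumption made for illustration only.
    """
    totals = defaultdict(int)   # images seen per era
    flagged = defaultdict(int)  # images with at least one flagged object
    for era, objects in records:
        totals[era] += 1
        if objects:
            flagged[era] += 1
    return {era: flagged[era] / totals[era] for era in totals}


# Hypothetical sample data, not taken from the paper.
sample = [
    ("17th century", ["smartphone"]),
    ("17th century", []),
    ("1930s", []),
    ("1930s", []),
]
rates = anachronism_rate_by_era(sample)
```

A per-image binary flag keeps the rate comparable across eras regardless of how many candidate objects are checked per prompt; counting flagged objects instead would weight busy scenes more heavily.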
Abstract
As Text-to-Image (TTI) diffusion models become increasingly influential in content creation, growing attention is being directed toward their societal and cultural implications. While prior research has primarily examined demographic and cultural biases, the ability of these models to accurately represent historical contexts remains largely underexplored. To address this gap, we introduce a benchmark for evaluating how TTI models depict historical contexts. The benchmark combines HistVis, a dataset of 30,000 synthetic images generated by three state-of-the-art diffusion models from carefully designed prompts covering universal human activities across multiple historical periods, with a reproducible evaluation protocol. We evaluate generated imagery across three key aspects: (1) Implicit Stylistic Associations: examining default visual styles associated with specific eras; (2) Historical…
Peer Reviews
Decision: ICLR 2026 Poster
1. The paper explores the world knowledge embedded in text-to-image (TTI) models, knowledge that parallels that of modern Large Language Models (LLMs). By examining how these models represent historical contexts, the authors go beyond the typical focus on creativity or imagination to probe their practical understanding of reality. This represents an important and relatively underexplored research direction.
2. The findings reveal previously unexamined layers of bias within TTI models. For instanc
1. The paper's positioning could be clearer. The provided dataset, being composed of the outputs of T2I models applied to a set of prompts, offers limited standalone value to the community beyond ensuring reproducibility. The true contribution appears to lie in the methodological framework for analyzing historical biases in generative models. It would therefore strengthen the paper to explicitly present the work as proposing a benchmark for estimating VLM biases in representing histori
The paper is a great read, easy to follow, with interesting findings and extensive evaluations. The authors have designed careful and sound evaluation schemes for each of the three aspects studied in the paper. I especially liked that they used multiple VLMs for evaluation rather than relying on a single judge. All details of the study have been laid out with complete transparency. Multiple qualitative examples were very helpful in getting the point across for each aspect
A comparison with related studies in this direction, covering the number of samples and the evaluation strategies used, would help to better position the paper.
- The paper highlights an important problem of evaluating historical representation in text-to-image models when depicting generic, everyday activities and provides a clear motivation for addressing it.
- The paper provides and evaluates an interesting dataset consisting of images that depict a comprehensive set of timeless activities spanning approximately five and a half centuries, offering strong coverage across diverse historical periods.
- The findings, particularly the observation of anac
- Although the VSG score is supported by a robust methodology, the reason for evaluating biases in style associations and the explanation of the distinct style classes are not sufficiently motivated.
- The anachronism detection uses an LLM to get a list of possible objects that could be anachronistic in a given activity. How can we ensure that this list is exhaustive? Were other methods, like object detection, considered for this task?
- Evaluating demographic representation in the generated s
