Quantifying the human visual exposome with vision language models

Christian Rominger (1); Andreas R. Schwerdtfeger (1); Malay Gaherwar Singh (2); Dimitri Khudyakow (2); Elizabeth A. M. Michels (2); Fabian Wolf (2); Jakob Nikolas Kather (2,3,4); Magdalena Katharina Wekenborg (2) ((1) University of Graz; (2) TU Dresden; (3) University Hospital Carl Gustav Carus Dresden; (4) National Center for Tumor Diseases Heidelberg)

arXiv:2605.03863·cs.AI·May 6, 2026

TL;DR

This paper introduces a scalable method that combines vision language models and large language models to quantify first-person visual context and relate it to mental health, going beyond coarse geospatial proxies and self-reports.

Contribution

It couples ecological momentary assessment with VLM ratings of participant-generated photographs, and adds an LLM-based pipeline that mines the scientific literature for environmental features linked to mental health.

Findings

01

VLM-derived greenness estimates predict momentary affect and chronic stress.

02

Up to 33% of visual context ratings correlate with mental health measures.

03

The pipeline enables high-throughput decoding of visual environment effects.

Abstract

The visual environment is a fundamental yet unquantified determinant of mental health. While the concept of the environmental exposome is well established, current methods rely on coarse geospatial proxies or biased self-reports, failing to capture the first-person visual context of daily life. We addressed this gap by coupling ecological momentary assessment with vision language models (VLMs) to quantify the semantic richness of human visual experience. Across 2674 participant-generated photographs, VLM-derived estimates of greenness robustly predicted momentary affect and chronic stress, consistent with established benchmarks. We then developed a semi-autonomous, large language model (LLM)-based pipeline that mined over seven million scientific publications to extract nearly 1000 environmental features empirically linked to mental health. When applied to real-world imagery, up to 33…
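The abstract does not say which statistical model links VLM-derived greenness to momentary affect. As a minimal sketch of one common way to analyze repeated measures of this kind, the example below person-mean-centers both variables and computes a pooled within-person slope. The function name `within_person_slope`, the record layout, and the toy values are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def within_person_slope(records):
    """Pooled within-person slope of affect on greenness.

    records: iterable of (participant_id, greenness, affect) tuples,
    where greenness stands in for a numeric VLM rating of one photo
    (an assumption; the paper's rating scale is not given here).
    """
    by_person = defaultdict(list)
    for pid, green, affect in records:
        by_person[pid].append((green, affect))

    sxy = 0.0  # sum of centered cross-products
    sxx = 0.0  # sum of centered squared greenness
    for obs in by_person.values():
        mean_g = sum(g for g, _ in obs) / len(obs)
        mean_a = sum(a for _, a in obs) / len(obs)
        for g, a in obs:
            # Centering removes stable between-person differences,
            # so only moment-to-moment variation contributes.
            sxy += (g - mean_g) * (a - mean_a)
            sxx += (g - mean_g) ** 2

    if sxx == 0:
        raise ValueError("greenness does not vary within any participant")
    return sxy / sxx


# Synthetic toy data for illustration only (not study data):
# each participant has a different baseline affect.
toy = [
    ("p1", 1.0, 3.5), ("p1", 3.0, 4.5), ("p1", 5.0, 5.5),
    ("p2", 2.0, 2.0), ("p2", 4.0, 3.0),
]
slope = within_person_slope(toy)
print(f"within-person slope: {slope:.2f}")
```

Centering within participants separates momentary associations from stable differences between people. A full analysis would more likely use a multilevel model with random effects, which this sketch does not attempt.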
