Personalized Image Descriptions from Attention Sequences

Ruoyu Xue; Hieu Le; Jingyi Xu; Sounak Mondal; Abe Leite; Gregory Zelinsky; Minh Hoai; Dimitris Samaras

arXiv:2512.06662 · cs.CV · December 9, 2025

Open Access

TL;DR

This paper introduces DEPER, a model that personalizes image descriptions by incorporating individual viewing behaviors and linguistic styles, significantly improving description quality and human alignment across diverse datasets.

Contribution

The paper explicitly models personalized viewing patterns in image description generation, enabling few-shot personalization without retraining the underlying vision-language model.

Findings

01. Achieves 24% average improvement across datasets.

02. Models personalized attention for more human-aligned descriptions.

03. Enables few-shot personalization with a lightweight adapter.

Abstract

People can view the same image differently: they focus on different regions, objects, and details in varying orders and describe them in distinct linguistic styles. This leads to substantial variability in image descriptions. However, existing models for personalized image description focus on linguistic style alone, with no prior work leveraging individual viewing patterns. We address this gap by explicitly modeling personalized viewing behavior as a core factor in description generation. Our method, DEPER (DEscription-PERception persona encoder), learns a subject embedding that captures both linguistic style and viewing behavior, guided by an auxiliary attention-prediction task. A lightweight adapter aligns these embeddings with a frozen vision-language model, enabling few-shot personalization without retraining. Across four datasets spanning diverse viewing tasks and both short and…
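
The adapter idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract only states that a lightweight adapter aligns subject embeddings with a frozen vision-language model. The prefix-token conditioning, the purely linear projection, and the names `make_adapter`, `apply_adapter`, and `personalize_inputs` are all assumptions made for illustration.

```python
import random


def make_adapter(embed_dim, model_dim, num_prefix_tokens, seed=0):
    """Build a hypothetical linear adapter: one weight matrix per prefix token.

    Assumption: the adapter is a plain linear projection. The source does
    not specify the adapter's architecture.
    """
    rng = random.Random(seed)
    return [
        [[rng.gauss(0.0, 0.02) for _ in range(embed_dim)] for _ in range(model_dim)]
        for _ in range(num_prefix_tokens)
    ]


def apply_adapter(adapter, subject_embedding):
    """Project a subject embedding into vectors in the frozen model's input space."""
    return [
        [sum(w * x for w, x in zip(row, subject_embedding)) for row in matrix]
        for matrix in adapter
    ]


def personalize_inputs(prefix_tokens, image_tokens):
    """Prepend persona prefix tokens to the image tokens.

    Assumption: conditioning happens by prepending tokens. Only the adapter
    weights would be trained; the vision-language model itself stays frozen.
    """
    return prefix_tokens + image_tokens


# Toy usage: a 4-dim subject embedding mapped to 2 prefix tokens of width 6.
adapter = make_adapter(embed_dim=4, model_dim=6, num_prefix_tokens=2)
prefix = apply_adapter(adapter, [0.1, -0.2, 0.3, 0.4])
image_tokens = [[0.0] * 6 for _ in range(3)]
sequence = personalize_inputs(prefix, image_tokens)
```

In this sketch the frozen model would consume `sequence` unchanged, so personalizing for a new subject only requires a new subject embedding, not new model weights.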

Taxonomy

Topics: Multimodal Machine Learning Applications · Visual Attention and Saliency Detection · Data Visualization and Analytics