PoSh: Using Scene Graphs To Guide LLMs-as-a-Judge For Detailed Image Descriptions
Amith Ananthram, Elias Stengel-Eskin, Lorena A. Bradford, Julia Demarest, Adam Purvis, Keith Krut, Robert Stein, Rina Elster Pantalony, Mohit Bansal, Kathleen McKeown

TL;DR
PoSh introduces a scene graph-guided metric for evaluating detailed image descriptions, aligning better with human judgments and enabling nuanced assessment of model performance across diverse image types.
Contribution
The paper presents PoSh, a novel evaluation metric using scene graphs to guide LLMs as judges, improving correlation with human judgments and robustness over existing metrics.
Findings
PoSh outperforms existing metrics in correlating with human judgments.
PoSh is robust across different image types and domains.
Models struggle with detailed scene understanding in complex images.
Abstract
While vision-language models (VLMs) have advanced into detailed image description, evaluation remains a challenge. Standard metrics (e.g. CIDEr, SPICE) were designed for short texts and tuned to recognize errors that are now uncommon, such as object misidentification. In contrast, long texts require sensitivity to attribute and relation attachments and scores that localize errors to particular text spans. In this work, we introduce PoSh, a metric for detailed image description that uses scene graphs as structured rubrics to guide LLMs-as-a-Judge, producing aggregate scores grounded in fine-grained errors (e.g. mistakes in compositional understanding). PoSh is replicable, interpretable and a better proxy for human raters than existing metrics (including GPT4o-as-a-Judge). To validate PoSh, we introduce a challenging new dataset, DOCENT. This novel benchmark contains artwork, paired with…
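To make the idea of a scene graph acting as a rubric more concrete, here is a minimal Python sketch. It is not the authors' implementation. It assumes each description has already been parsed into a flat list of scene graph components (objects, attributes, relations) and that a judge scores each component against the other text. The names `Component`, `coverage`, `flagged`, and `toy_judge` are hypothetical, and `toy_judge` is a simple word-overlap stand-in for an LLM judge. Averaging per-component scores into a precision-like and a recall-like number is likewise an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Component:
    """One scene graph element (hypothetical structure, for illustration)."""
    kind: str  # "object", "attribute", or "relation"
    text: str  # e.g. "woman", "fan red", "woman holds fan"


# A judge maps (component, description) to a score in [0, 1].
Judge = Callable[[Component, str], float]


def toy_judge(component: Component, description: str) -> float:
    """Stand-in for an LLM judge: 1.0 if every word of the component
    appears in the description, else 0.0."""
    cleaned = description.lower().replace(".", " ").replace(",", " ")
    desc_words = set(cleaned.split())
    return 1.0 if all(w in desc_words for w in component.text.lower().split()) else 0.0


def coverage(components: List[Component], description: str, judge: Judge) -> float:
    """Mean judge score of the components against the description."""
    if not components:
        return 0.0
    return sum(judge(c, description) for c in components) / len(components)


def flagged(components: List[Component], description: str, judge: Judge,
            threshold: float = 0.5) -> List[Component]:
    """Components the judge scores below the threshold (localized errors)."""
    return [c for c in components if judge(c, description) < threshold]


reference = "A seated woman holds a red fan."
generation = "A seated woman holds a blue fan."

ref_graph = [
    Component("object", "woman"),
    Component("object", "fan"),
    Component("attribute", "woman seated"),
    Component("attribute", "fan red"),
    Component("relation", "woman holds fan"),
]
gen_graph = [
    Component("object", "woman"),
    Component("object", "fan"),
    Component("attribute", "woman seated"),
    Component("attribute", "fan blue"),
    Component("relation", "woman holds fan"),
]

# Precision-like: are the generation's components supported by the reference?
precision = coverage(gen_graph, reference, toy_judge)
# Recall-like: are the reference's components covered by the generation?
recall = coverage(ref_graph, generation, toy_judge)
# The mis-attached attribute is what gets flagged.
errors = flagged(gen_graph, reference, toy_judge)
```

In this toy case the only disagreement is the colour attached to the fan, so both scores drop because of a single attribute component and the flagged list points at it. That mirrors the kind of attribute-attachment error the abstract says long descriptions require sensitivity to.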
Peer Reviews
Decision: ICLR 2026 Poster
* The paper is well-written and easy to follow.
* Evaluating image descriptions is indeed a non-trivial task, and the proposed new metric for evaluating detailed descriptions is important for the field of image captioning.
* I agree that a good metric for image description should be grounded in fine-grained cues and localized to text spans.
* The paper introduces the DOCENT dataset, which includes expert-written descriptions and annotations, and its quality is well controlled.
* POSH's reliance on a model to generate the scene graph introduces inaccuracies and errors, which could be a bottleneck for its effectiveness.
* The use of scene graphs to evaluate image-text alignment has been discussed in previous papers like [1]; the authors need to clarify the uniqueness of POSH.
* The proposed dataset covers artworks, but in practical applications, images of natural scenes are more common.
[1] Davidsonian Scene Graph: Improving Reliability in Fine-grained Evalu
The paper has numerous strengths:
- Firstly, the task of long-form image captioning evaluation is an important one as models continuously improve in capabilities. PoSh is an important contribution within this space: it proposes a straightforward reference-based metric that converts the references and generations into scene graphs and uses these to assess precision and recall. Particularly, the use of questions to assess both precision and recall lends the metric interpretability and granul
The main weaknesses I can see are:
- PoSh is going to be sensitive to the accuracy of the extracted scene graphs, where there could be errors either during the dependency parsing process or during coreference resolution. Figure 3 marking "painting" as a mistake acts as one example of this.
- While I think PoSh could act as a strong reward model, it does presume access to detailed reference captions, which is expensive to curate on a large scale. PoSh being reference-based similarly restricts it
- They focus on an important aspect of VLM understanding, detailed description, and the metric is interpretable.
- They propose a benchmark with expert-written descriptions and 900 granular and coarse judgments from raters. The manual effort is massive.
- They open-sourced the benchmark and metric, which will benefit the community.
- They evaluate multiple open-source and closed-source models.
- The writing could be better. For example, in the table, they use POSH to denote the finetuned Qwen model trained with the POSH reward, while POSH is also the name of the metric. This is a bit confusing.
- For the findings on POSH as a reward function, they only experiment with the Qwen2.5-VL-7B model. The findings may not be model-agnostic.
- It is concerning that POSH works better on their proposed DOCENT benchmark but is only adequate on other benchmarks like CapArena.
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Aesthetic Perception and Analysis
