Beyond the Pixels: VLM-based Evaluation of Identity Preservation in Reference-Guided Synthesis

Aditi Singhania; Krutik Malani; Riddhi Dhawan; Arushi Jain; Garv Tandon; Nippun Sharma; Souymodip Chakraborty; Vineet Batra; Ankit Phogat

arXiv:2511.08087·cs.CV·November 12, 2025

TL;DR

This paper presents Beyond the Pixels, a hierarchical VLM-based evaluation framework that improves identity preservation assessment in generative models by decomposing features and guiding structured reasoning, validated against human judgments.

Contribution

It introduces a novel hierarchical evaluation method that enhances fine-grained identity assessment and reduces hallucinations in VLM-based metrics for generative models.

Findings

01 Strong alignment with human judgments on identity consistency

02 Effective decomposition of identity features improves evaluation accuracy

03 New benchmark with diverse and challenging image-prompt pairs

Abstract

Evaluating identity preservation in generative models remains a critical yet unresolved challenge. Existing metrics rely on global embeddings or coarse VLM prompting; they fail to capture fine-grained identity changes and provide limited diagnostic insight. We introduce Beyond the Pixels, a hierarchical evaluation framework that decomposes identity assessment into feature-level transformations. Our approach guides VLMs through structured reasoning by (1) hierarchically decomposing subjects into a (type, style) -> attribute -> feature decision tree, and (2) prompting for concrete transformations rather than abstract similarity scores. This decomposition grounds VLM analysis in verifiable visual evidence, reducing hallucinations and improving consistency. We validate our framework across four state-of-the-art generative models, demonstrating strong alignment with human judgments in…
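To make the two steps in the abstract concrete, the following is a minimal sketch of how a (type, style) -> attribute -> feature tree could be walked to produce one transformation-oriented prompt per feature. The taxonomy entries, the `FeatureQuery` class, the `build_queries` function, and the prompt wording are all hypothetical placeholders chosen for illustration; the paper's actual tree and prompts are not given in this excerpt, and no VLM is called here.

```python
from dataclasses import dataclass

# Hypothetical taxonomy: (type, style) -> attribute -> features.
# These entries are illustrative only, not the paper's actual decision tree.
TAXONOMY = {
    ("human", "photorealistic"): {
        "face": ["eye shape", "nose shape", "jawline"],
        "hair": ["color", "length"],
    },
    ("animal", "cartoon"): {
        "body": ["fur color", "proportions"],
    },
}


@dataclass(frozen=True)
class FeatureQuery:
    """One leaf of the tree: a single feature to compare across two images."""
    subject_type: str
    style: str
    attribute: str
    feature: str

    def prompt(self) -> str:
        # Ask for a concrete transformation rather than a similarity score.
        return (
            f"Compare the {self.attribute} of the {self.style} "
            f"{self.subject_type} in the reference and generated images. "
            f"Describe the concrete change, if any, to its {self.feature}."
        )


def build_queries(subject_type: str, style: str) -> list[FeatureQuery]:
    """Expand a (type, style) root into one query per feature leaf."""
    attributes = TAXONOMY.get((subject_type, style), {})
    return [
        FeatureQuery(subject_type, style, attribute, feature)
        for attribute, features in attributes.items()
        for feature in features
    ]


if __name__ == "__main__":
    for query in build_queries("human", "photorealistic"):
        print(query.prompt())
```

Under these assumptions, each prompt targets a single visible feature, so any answer a VLM gives could be checked against the two images directly. An unknown (type, style) pair yields an empty list rather than a guessed set of features.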

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Face recognition and analysis · Face Recognition and Perception