MaSC: A Masked Similarity Metric for Evaluating Concept-Driven Generation
Patryk Bartkowiak, Lennart Petersen, Bartosz Kotrys, Dominik Michels, Soren Pirk, and Wojtek Palubicki

TL;DR
MaSC is a masked similarity metric for text-to-image generation. It improves the evaluation of concept preservation and prompt following by spatially decomposing image features into subject and background regions.
Contribution
The paper proposes MaSC, an evaluation metric that combines externally provided foreground concept masks with frozen SigLIP2 features. It aligns better with human perception than existing global-embedding metrics such as CLIP-I, DINO, and CLIP-T.
Findings
MaSC achieves higher correlation with human ratings than existing metrics.
MaSC distinguishes same-subject from cross-subject pairs almost perfectly on real-world benchmarks.
Spatially decomposed aggregation enhances evaluation accuracy for concept-driven generation.
Abstract
Evaluating single-concept personalization in text-to-image diffusion requires measuring both concept preservation, which captures identity fidelity to a reference, and prompt following, which captures whether the generated scene matches the prompt. Existing metrics commonly compute these signals using global image or text-image embeddings, such as CLIP-I, DINO, and CLIP-T. We show that such metrics correlate poorly with human perception because they attend to the image as a whole instead of separating the concept subject from the background. We introduce MaSC, a masked similarity metric that uses externally provided foreground concept masks to decompose evaluation into subject-specific concept preservation and background-based prompt following. MaSC computes both scores from frozen SigLIP2 SO400M-NaFlex features: concept preservation is measured by masked max-cosine matching between…
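The abstract is cut off before it states exactly which features are matched and how the matches are aggregated, so the following is only a minimal sketch of what masked max-cosine matching could look like, not the authors' implementation. It assumes patch-level features have already been extracted (here plain NumPy arrays stand in for frozen SigLIP2 features), that matching runs in one direction from reference to generated image, and that per-patch maxima are averaged. The function name `masked_max_cosine` and all of its parameters are hypothetical.

```python
import numpy as np


def masked_max_cosine(ref_feats, ref_mask, gen_feats, gen_mask, eps=1e-8):
    """Hypothetical masked max-cosine score (assumed formulation).

    ref_feats: (N_ref, D) patch features of the reference image.
    ref_mask:  (N_ref,) booleans, True where a patch lies on the concept.
    gen_feats: (N_gen, D) patch features of the generated image.
    gen_mask:  (N_gen,) booleans, True where a patch lies on the concept.
    """
    ref_mask = np.asarray(ref_mask, dtype=bool)
    gen_mask = np.asarray(gen_mask, dtype=bool)

    # Keep only the patches that fall inside the foreground masks.
    ref = np.asarray(ref_feats, dtype=float)[ref_mask]
    gen = np.asarray(gen_feats, dtype=float)[gen_mask]

    # Assumption: an empty mask on either side yields a score of 0.
    if ref.shape[0] == 0 or gen.shape[0] == 0:
        return 0.0

    # L2-normalize rows so that dot products are cosine similarities.
    ref = ref / (np.linalg.norm(ref, axis=1, keepdims=True) + eps)
    gen = gen / (np.linalg.norm(gen, axis=1, keepdims=True) + eps)

    # Pairwise cosine similarity, shape (N_ref_fg, N_gen_fg).
    sim = ref @ gen.T

    # For each reference patch take its best match, then average (assumed).
    return float(sim.max(axis=1).mean())
```

Restricting both sides to foreground patches is what keeps the background from influencing the identity score, which is the failure mode the abstract attributes to global embeddings. A symmetric variant or a different aggregation would be equally plausible readings of the truncated text.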
