TL;DR
This paper examines how reliable MLLM-generated textual explanations are for face verification. It finds that the explanations are often unfaithful to the visual evidence and proposes a likelihood-ratio-based framework for evaluating them. Code is available online.
Contribution
The paper introduces a likelihood-ratio-based framework for assessing the evidential strength of explanations, and highlights where current MLLMs fall short of providing trustworthy explanations for face recognition.
Findings
MLLM explanations often rely on non-verifiable facial attributes.
Incorporating face recognition scores improves verification performance but does not make the explanations more faithful.
The proposed framework quantifies the evidential strength of textual explanations.
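As a rough illustration of what a likelihood-ratio measure of evidential strength looks like in general, the sketch below compares how probable a score is under a same-identity (genuine) model versus a different-identity (impostor) model. It is a generic sketch, not the paper's method: the idea of reducing an explanation to one scalar score, the Gaussian models, the function names, and the sample values are all assumptions made for illustration.

```python
import math
from statistics import NormalDist


def fit_gaussian(scores):
    """Fit a Gaussian to a list of scores (assumed model, not from the paper)."""
    return NormalDist.from_samples(scores)


def log_pdf(dist, x):
    """Log-density of a Gaussian, computed directly to avoid underflow."""
    z = (x - dist.mean) / dist.stdev
    return -0.5 * z * z - math.log(dist.stdev) - 0.5 * math.log(2 * math.pi)


def log10_likelihood_ratio(score, genuine, impostor):
    """log10 of p(score | same identity) / p(score | different identity).

    Positive values support the same-identity hypothesis, negative values
    support the different-identity hypothesis, and values near zero carry
    little evidential weight either way.
    """
    return (log_pdf(genuine, score) - log_pdf(impostor, score)) / math.log(10)


# Hypothetical calibration scores, invented purely for illustration.
genuine_scores = [0.80, 0.85, 0.90, 0.75, 0.95]
impostor_scores = [0.20, 0.30, 0.25, 0.35, 0.15]

genuine_model = fit_gaussian(genuine_scores)
impostor_model = fit_gaussian(impostor_scores)

for s in (0.90, 0.55, 0.20):
    llr = log10_likelihood_ratio(s, genuine_model, impostor_model)
    print(f"score={s:.2f}  log10 LR={llr:+.2f}")
```

The log-scale ratio is used because it is symmetric around zero, which makes support for either hypothesis directly comparable. In practice the two score models would be fitted on a held-out calibration set rather than a handful of values.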
Abstract
Multimodal Large Language Models (MLLMs) have recently been proposed as a means to generate natural-language explanations for face recognition decisions. While such explanations facilitate human interpretability, their reliability on unconstrained face images remains underexplored. In this work, we systematically analyze MLLM-generated explanations for the unconstrained face verification task on the challenging IJB-S dataset, with a particular focus on extreme pose variation and surveillance imagery. Our results show that even when MLLMs produce correct verification decisions, the accompanying explanations frequently rely on non-verifiable or hallucinated facial attributes that are not supported by visual evidence. We further study the effect of incorporating information from traditional face recognition systems, viz., scores and decisions, alongside the input images. Although such…
