TL;DR
Omni-Persona introduces a comprehensive benchmark for omnimodal personalization, diagnosing grounding behaviors and evaluating models across text, image, and audio modalities with a focus on absent-persona scenarios.
Contribution
It formalizes a new cross-modal routing task, proposes Calibrated Accuracy for better grounding evaluation, and provides diagnostic insights into model behaviors across modalities.
Findings
Open-source models show an audio-visual grounding gap.
Calibration exposes limitations of answerable recall and model size.
RLVR generalizes well but tends to produce conservative, lower-quality responses.
Abstract
While multimodal large language models have advanced across text, image, and audio, personalization research has remained primarily vision-language, with unified omnimodal benchmarking that jointly covers text, image, and audio still limited, and lacking the methodological rigor to account for absent-persona scenarios or systematic grounding studies. We introduce Omni-Persona, the first comprehensive benchmark for omnimodal personalization. We formalize the task as cross-modal routing over the Persona Modality Graph, encompassing 4 task groups and 18 fine-grained tasks across items. To rigorously diagnose grounding behavior, we propose Calibrated Accuracy, which jointly rewards correct grounding and appropriate abstention, incorporating absent-persona queries within a unified evaluation framework. On our dedicated experiments, three diagnostic…
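The abstract does not give the formula for Calibrated Accuracy, so the following is only an illustrative sketch of the general idea it describes: crediting a correct answer when the persona is present and crediting abstention when the persona is absent. The names `Query`, `calibrated_accuracy_sketch`, and `answerable_recall`, the use of `None` to encode both absent-persona queries and abstentions, and the equal weighting of all queries are assumptions made for illustration, not the paper's definition.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Query:
    # Assumed encoding: gold is None for an absent-persona query.
    gold: Optional[str]
    # Assumed encoding: prediction is None when the model abstains.
    prediction: Optional[str]


def is_credited(q: Query) -> bool:
    """Credit abstention on absent-persona queries, exact match otherwise."""
    if q.gold is None:
        return q.prediction is None
    return q.prediction == q.gold


def calibrated_accuracy_sketch(queries: Sequence[Query]) -> float:
    """Hypothetical metric: share of all queries handled appropriately."""
    if not queries:
        raise ValueError("need at least one query")
    return sum(is_credited(q) for q in queries) / len(queries)


def answerable_recall(queries: Sequence[Query]) -> float:
    """Accuracy restricted to queries whose persona is present."""
    answerable = [q for q in queries if q.gold is not None]
    if not answerable:
        raise ValueError("need at least one answerable query")
    return sum(q.prediction == q.gold for q in answerable) / len(answerable)


# A model that never abstains: perfect on answerable queries,
# but it also answers the two absent-persona queries.
demo = [
    Query(gold="alice", prediction="alice"),
    Query(gold="bob", prediction="bob"),
    Query(gold=None, prediction="carol"),
    Query(gold=None, prediction="dave"),
]
print(answerable_recall(demo))           # 1.0
print(calibrated_accuracy_sketch(demo))  # 0.5
```

Under these assumptions, a model that always answers can score perfectly on answerable recall while losing half its credit once absent-persona queries are counted, which is the kind of gap a calibration-aware metric is meant to surface.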
