MM-tau-p$^2$: Persona-Adaptive Prompting for Robust Multi-Modal Agent Evaluation in Dual-Control Settings
Anupam Purwar, Aditya Choudhary

TL;DR
This paper introduces MM-tau-p$^2$, a comprehensive benchmark with 12 metrics for evaluating the robustness of multi-modal, persona-adaptive agents in dual-control settings, considering user input and evolving behaviors.
Contribution
It presents a novel evaluation framework for multi-modal agents that incorporates persona adaptation and dual-control metrics, extending prior work with new metrics and domain-specific assessments.
Findings
Even state-of-the-art LLMs such as GPT-5 and GPT 4.1 face additional robustness challenges in this setting.
The benchmark provides a holistic, automated evaluation method for multi-modal agents.
Domain-specific estimates demonstrate the framework's applicability in telecom and retail.
Abstract
Current evaluation frameworks and benchmarks for LLM-powered agents focus on text-chat-driven agents. These frameworks do not expose the user's persona to the agent and thus operate in a user-agnostic environment. Importantly, in the customer experience management domain, the agent's behaviour evolves as the agent learns about the user's personality. With the proliferation of real-time TTS and multi-modal language models, LLM-based agents are gradually going to become multi-modal. To this end, we propose the MM-tau-p$^2$ benchmark, with metrics for evaluating the robustness of multi-modal agents in a dual-control setting, with and without persona adaptation to the user, while also taking user inputs into account in the planning process for resolving a user query. In particular, our work shows that even with state-of-the-art frontier LLMs such as GPT-5 and GPT 4.1, there are additional considerations measured using metrics viz.…
