TL;DR
This paper introduces a diagnostic pipeline to evaluate LLM performance on addressee recognition and response selection in multi-party conversations, emphasizing structural and linguistic factors.
Contribution
It proposes a new methodology for analyzing model weaknesses across conversation structures and introduces diagnostic datasets for targeted evaluation.
Findings
Response selection depends more on the textual content of the conversation than on its structure.
Addressee recognition requires understanding conversation structure.
LLMs show task-dependent sensitivity to prompt variations.
Abstract
Assessing the performance of systems to classify Multi-Party Conversations (MPC) is challenging due to the interconnection between linguistic and structural characteristics of conversations. Conventional evaluation methods often overlook variances in model behavior across different levels of structural complexity on interaction graphs. In this work, we propose a methodological pipeline to investigate model performance across specific structural attributes of conversations. As a proof of concept, we focus on the Response Selection and Addressee Recognition tasks to diagnose model weaknesses. To this end, we extract representative diagnostic subdatasets with a fixed number of users and a good structural variety from a large and open corpus of online MPCs. We further frame our work in terms of data minimization, avoiding the use of original usernames to preserve privacy, and propose…
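To make the idea of structure-aware diagnostic subsets more concrete, here is a minimal sketch in Python. It is not the authors' pipeline. It assumes, purely for illustration, that a conversation is a list of `(speaker, addressee, text)` tuples, and the function names (`pseudonymize`, `structure_summary`, `select_by_user_count`) are hypothetical. It shows one plausible way to replace usernames with placeholders, summarize the interaction graph, and keep only conversations with a fixed number of users.

```python
from collections import Counter


def pseudonymize(conversation):
    """Replace usernames with placeholders in order of first appearance.

    Assumption: each turn is a (speaker, addressee, text) tuple and the
    addressee may be None. Mentions inside the text are NOT rewritten here.
    """
    mapping = {}

    def alias(name):
        if name is None:
            return None
        if name not in mapping:
            mapping[name] = f"USER_{len(mapping) + 1}"
        return mapping[name]

    result = []
    for speaker, addressee, text in conversation:
        speaker_alias = alias(speaker)      # speaker is aliased first
        addressee_alias = alias(addressee)  # then the addressee
        result.append((speaker_alias, addressee_alias, text))
    return result


def structure_summary(conversation):
    """Summarize the interaction graph (directed speaker -> addressee edges)."""
    users = set()
    edges = set()
    for speaker, addressee, _ in conversation:
        users.add(speaker)
        if addressee is not None:
            users.add(addressee)
            edges.add((speaker, addressee))
    # Count distinct speakers addressing each user.
    in_degree = Counter(target for _, target in edges)
    return {
        "num_users": len(users),
        "num_edges": len(edges),
        "max_in_degree": max(in_degree.values(), default=0),
    }


def select_by_user_count(conversations, n_users):
    """Keep only conversations with exactly n_users distinct participants."""
    return [c for c in conversations
            if structure_summary(c)["num_users"] == n_users]


example = [
    ("alice", None, "hi all"),
    ("bob", "alice", "hello"),
    ("carol", "alice", "hey"),
    ("alice", "bob", "how are you"),
]
anonymized = pseudonymize(example)
summary = structure_summary(anonymized)
```

In this toy example, `summary` reports three users, three distinct directed edges, and a maximum in-degree of two. Grouping evaluation results by attributes like these is one way to compare model behavior across levels of structural complexity. The attributes actually used in the paper may differ.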