VOGUE: A Multimodal Dataset for Conversational Recommendation in Fashion
David Guo, Minqi Sun, Yilun Jiang, Jiazhou Liang, Scott Sanner

TL;DR
VOGUE is a multimodal dataset for conversational recommendation in fashion. It supports detailed evaluation of conversational inference, preference alignment, and user satisfaction, and it exposes limitations of current models.
Contribution
Introduces VOGUE, a dataset of 60 human-human dialogues paired with a shared visual catalogue, item metadata, user fashion profiles and histories, and post-conversation ratings. It addresses the limitations of existing resources for multimodal conversational recommendation research.
Findings
Reveals distinctive visually grounded dialogue dynamics.
Shows multimodal models approach human alignment but struggle with preference generalization.
Identifies systematic rating distribution errors in current models.
Abstract
Multimodal conversational recommendation has emerged as a promising paradigm for delivering personalized experiences through natural dialogue enriched by visual and contextual grounding. Yet current multimodal conversational recommendation datasets remain limited: existing resources either simulate conversations, omit user history, or fail to collect sufficiently detailed feedback, all of which constrain the types of research and evaluation they support. To address these gaps, we introduce VOGUE, a novel dataset of 60 human-human dialogues in realistic fashion shopping scenarios. Each dialogue is paired with a shared visual catalogue, item metadata, user fashion profiles and histories, and post-conversation ratings from both Seekers and Assistants. This design enables rigorous evaluation of conversational inference, including not only alignment between predicted and ground-truth…
