TL;DR
FashionLens introduces a versatile, task-adaptive framework for fashion image retrieval that unifies multiple retrieval scenarios and demonstrates state-of-the-art performance on a comprehensive new benchmark.
Contribution
The paper presents FashionLens, a framework built on Multimodal Large Language Models that uses task-specific calibrators and adaptive sampling, along with U-FIRE, a benchmark covering diverse fashion retrieval tasks.
Findings
FashionLens outperforms existing methods on the U-FIRE benchmark.
It generalizes well to unseen retrieval tasks.
The framework effectively handles diverse query formats and search intentions.
Abstract
Fashion image retrieval is a cornerstone of modern e-commerce systems. A unified framework that supports diverse query formats and search intentions is highly desired in practice. However, existing approaches focus on narrow retrieval tasks and do not fully capture such diversity. Therefore, in this work, we aim to develop a unified framework capable of handling diverse realistic fashion retrieval scenarios, achieving truly versatile fashion image retrieval. To establish a data foundation, we first introduce U-FIRE, a comprehensive benchmark that consolidates fragmented fashion datasets into a unified collection, supplemented by two manually curated datasets for testing generalization. Building upon this, we propose FashionLens, a unified framework based on Multimodal Large Language Models. To handle divergent matching objectives, we design a Proposal-Guided Spherical Query Calibrator…
