Leveraging Large Language Models for Multimodal Search
Oriol Barbany, Michael Huang, Xinliang Zhu, Arnab Dhua

TL;DR
This paper presents a new multimodal search model that combines large language models with visual data to improve search accuracy and user interaction, achieving state-of-the-art results on Fashion200K.
Contribution
It introduces a novel multimodal search framework integrated with an LLM-based conversational interface, enhancing natural language understanding and user engagement in search tasks.
Findings
Achieved a new performance milestone on the Fashion200K dataset.
Developed an LLM-powered search interface for natural language interaction.
Enhanced user experience with human-like search assistance.
Abstract
Multimodal search has become increasingly important in providing users with a natural and effective way to express their search intentions. Images offer fine-grained details of the desired products, while text allows for easily incorporating search modifications. However, some existing multimodal search systems are unreliable and fail to address simple queries. The problem becomes harder with the large variability of natural language text queries, which may contain ambiguous, implicit, and irrelevant information. Addressing these issues may require systems with enhanced matching capabilities, reasoning abilities, and context-aware query parsing and rewriting. This paper introduces a novel multimodal search model that achieves a new performance milestone on the Fashion200K dataset. Additionally, we propose a novel search interface integrating Large Language Models (LLMs) to facilitate…
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems
