Rec-GPT4V: Multimodal Recommendation with Large Vision-Language Models
Yuqing Liu, Yu Wang, Lichao Sun, Philip S. Yu

TL;DR
Rec-GPT4V introduces a reasoning scheme that applies large vision-language models (LVLMs) to multimodal recommendation. It uses user history as in-context preferences and LVLM-generated item image summaries to address two limitations: LVLMs lack user preference knowledge, and they struggle with sequences of multiple images.
Contribution
The paper proposes Rec-GPT4V: Visual-Summary Thought (VST), a reasoning scheme that improves multimodal recommendation by supplying user history as in-context preferences and by prompting LVLMs to summarize item images.
Findings
VST improves recommendation accuracy across datasets
LVLMs effectively generate item image summaries
Rec-GPT4V outperforms baseline models in experiments
Abstract
The development of large vision-language models (LVLMs) offers the potential to address challenges faced by traditional multimodal recommendations thanks to their proficient understanding of static images and textual dynamics. However, the application of LVLMs in this field is still limited due to the following complexities: First, LVLMs lack user preference knowledge as they are trained from vast general datasets. Second, LVLMs suffer setbacks in addressing multiple image dynamics in scenarios involving discrete, noisy, and redundant image sequences. To overcome these issues, we propose the novel reasoning scheme named Rec-GPT4V: Visual-Summary Thought (VST) of leveraging large vision-language models for multimodal recommendation. We utilize user history as in-context user preferences to address the first challenge. Next, we prompt LVLMs to generate item image summaries and utilize…
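The two steps named in the abstract (user history as in-context preferences, then LVLM-generated image summaries) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not the authors' implementation: the `Item` fields, the function names, the prompt wording, and the stub summarizer that stands in for a real LVLM call. The abstract is cut off before it says how the summaries are used, so the sketch simply assumes they are placed in a text prompt next to the item titles.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Item:
    title: str
    image_path: str  # hypothetical field: where the item image lives


def summarize_images(items: List[Item], summarizer: Callable[[str], str]) -> List[str]:
    """Produce one text summary per item image.

    `summarizer` is a placeholder for an LVLM call; summarizing each image
    separately avoids feeding the model a long multi-image sequence.
    """
    return [summarizer(item.image_path) for item in items]


def build_prompt(history: List[Item], summaries: List[str], candidates: List[Item]) -> str:
    """Assemble a text prompt: user history as in-context preferences,
    followed by the candidate items to rank (assumed prompt layout)."""
    lines = ["The user interacted with these items, in order:"]
    for i, (item, summary) in enumerate(zip(history, summaries), start=1):
        lines.append(f"{i}. {item.title} (image summary: {summary})")
    lines.append("Rank these candidate items for the user:")
    for cand in candidates:
        lines.append(f"- {cand.title}")
    return "\n".join(lines)


if __name__ == "__main__":
    history = [Item("Red Mug", "mug.jpg"), Item("Blue Kettle", "kettle.jpg")]
    candidates = [Item("Green Teapot", "teapot.jpg")]
    # Stub summarizer; a real system would query an LVLM here.
    summaries = summarize_images(history, lambda path: f"summary of {path}")
    print(build_prompt(history, summaries, candidates))
```

Passing the summarizer in as a callable keeps the sketch runnable without any model access; swapping in an actual LVLM client would only change that one argument.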
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques
