Multi-modal Generative Models in Recommendation System
Arnau Ramisa, Rene Vidal, Yashar Deldjoo, Zhankui He, Julian McAuley, Anton Korikov, Scott Sanner, Mahesh Sathiamoorthy, Atoosa Kasrizadeh, Silvia Milano, and Francesco Ricci

TL;DR
This paper discusses the development of multi-modal generative models for recommendation systems, enabling richer user interactions and improved understanding by integrating multiple data modalities like text and images.
Contribution
The paper reviews approaches that use multiple data modalities simultaneously to give recommendation systems richer interactions and better product understanding.
Findings
Multi-modal models improve recommendation relevance.
Visual and textual data integration enhances user experience.
Existing systems often treat modalities independently.
Abstract
Many recommendation systems limit user inputs to text strings or behavior signals such as clicks and purchases, and system outputs to a list of products sorted by relevance. With the advent of generative AI, users have come to expect richer levels of interactions. In visual search, for example, a user may provide a picture of their desired product along with a natural language modification of the content of the picture (e.g., a dress like the one shown in the picture but in red color). Moreover, users may want to better understand the recommendations they receive by visualizing how the product fits their use case, e.g., with a representation of how a garment might look on them, or how a furniture item might look in their room. Such advanced levels of interaction require recommendation systems that are able to discover both shared and complementary information about the product across…
