From Raw Features to Effective Embeddings: A Three-Stage Approach for Multimodal Recipe Recommendation
Jeeho Shin, Kyungho Kim, Kijung Shin

TL;DR
This paper introduces TESMR, a three-stage framework that progressively refines raw multimodal features into recipe embeddings, achieving 7-15% higher Recall@10 than existing methods on two real-world datasets.
Contribution
The paper presents TESMR, a three-stage method that turns raw multimodal features into recipe embeddings through content-based, relation-based, and learning-based enhancement, improving recipe recommendation accuracy.
Findings
TESMR achieves 7-15% higher Recall@10 than existing methods on two real-world datasets.
Even simple uses of multimodal signals yield competitive recommendation performance.
This suggests that systematically enhancing multimodal features is a promising direction.
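For readers unfamiliar with the headline metric, the following is a minimal sketch of how Recall@K is commonly computed for a top-K recommender. It uses the standard definition (hits in the top K divided by the number of relevant items per user, averaged over users). The summary does not describe the paper's exact evaluation protocol, so the function names and the handling of users with no relevant items are assumptions made for illustration.

```python
def recall_at_k(ranked_items, relevant_items, k=10):
    """Recall@K for one user: share of relevant items found in the top K.

    ranked_items: list of item ids, best first (model output).
    relevant_items: set of held-out item ids the user interacted with.
    """
    if not relevant_items:
        # Assumption: users with no held-out items contribute 0.0.
        return 0.0
    top_k = ranked_items[:k]
    hits = sum(1 for item in top_k if item in relevant_items)
    return hits / len(relevant_items)


def mean_recall_at_k(rankings, ground_truth, k=10):
    """Average Recall@K over all users present in ground_truth.

    rankings: dict user_id -> ranked list of item ids.
    ground_truth: dict user_id -> set of relevant item ids.
    """
    if not ground_truth:
        return 0.0
    scores = [
        recall_at_k(rankings.get(user, []), items, k)
        for user, items in ground_truth.items()
    ]
    return sum(scores) / len(scores)
```

Under this definition, a relative gain of 7-15% means the averaged score rises by that proportion over the baseline's score, not by that many percentage points.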
Abstract
Recipe recommendation has become an essential task in web-based food platforms. A central challenge is effectively leveraging rich multimodal features beyond user-recipe interactions. Our analysis shows that even simple uses of multimodal signals yield competitive performance, suggesting that systematic enhancement of these signals is highly promising. We propose TESMR, a 3-stage framework for recipe recommendation that progressively refines raw multimodal features into effective embeddings through: (1) content-based enhancement using foundation models with multimodal comprehension, (2) relation-based enhancement via message propagation over user-recipe interactions, and (3) learning-based enhancement through contrastive learning with learnable embeddings. Experiments on two real-world datasets show that TESMR outperforms existing methods, achieving 7-15% higher Recall@10.
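To make stages (2) and (3) of the abstract more concrete, here is a minimal NumPy sketch under stated assumptions. It is not the authors' implementation. The abstract does not specify the propagation rule or the contrastive objective, so the sketch substitutes two widely used choices: symmetric-normalized, parameter-free propagation over the user-recipe bipartite graph (in the style of LightGCN) and an InfoNCE loss with in-batch negatives. Stage (1) is assumed to have already produced one feature vector per recipe from a foundation model. All function names, the layer count, and the temperature are hypothetical.

```python
import numpy as np


def propagate(recipe_feats, interactions, num_layers=2):
    """Relation-based enhancement (assumed LightGCN-style propagation).

    recipe_feats: (num_recipes, dim) content features from stage 1.
    interactions: (num_users, num_recipes) binary interaction matrix.
    Returns recipe embeddings averaged over propagation layers.
    """
    deg_u = np.maximum(interactions.sum(axis=1, keepdims=True), 1)
    deg_r = np.maximum(interactions.sum(axis=0, keepdims=True), 1)
    # Symmetric normalization: A_ij / sqrt(deg(user_i) * deg(recipe_j)).
    norm_adj = interactions / np.sqrt(deg_u * deg_r)

    layers = [recipe_feats]
    current = recipe_feats
    for _ in range(num_layers):
        user_emb = norm_adj @ current    # users aggregate their recipes
        current = norm_adj.T @ user_emb  # recipes aggregate their users
        layers.append(current)
    return np.mean(layers, axis=0)


def info_nce(anchor, positive, temperature=0.2):
    """Contrastive loss (assumed InfoNCE with in-batch negatives).

    Row i of `anchor` should match row i of `positive`; every other
    row in `positive` acts as a negative for row i.
    """
    eps = 1e-12
    a = anchor / (np.linalg.norm(anchor, axis=1, keepdims=True) + eps)
    p = positive / (np.linalg.norm(positive, axis=1, keepdims=True) + eps)
    logits = (a @ p.T) / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))
```

In a setup like this, one plausible use is to treat the propagated feature embedding of a recipe as the positive for that recipe's learnable ID embedding, so that training pulls the two views together. Whether TESMR pairs its views this way is not stated in the abstract.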
