Multimodal Generative Retrieval Model with Staged Pretraining for Food Delivery on Meituan
Boyu Chen, Tai Guo, Weiyu Cui, Yuqing Li, Xingxing Wang, Chuan Shi, Cheng Yang

TL;DR
This paper introduces a staged pretraining approach for multimodal retrieval in food delivery, improving feature utilization and training stability, leading to better retrieval accuracy and increased platform revenue.
Contribution
The paper proposes a novel staged pretraining strategy and semantic ID tasks to enhance multimodal feature learning and address training challenges in retrieval models.
Findings
Achieved significant improvements in retrieval metrics (R@5, R@10, R@20, N@5, N@10, N@20).
Demonstrated a 1.12% revenue increase in real-world A/B testing.
Validated the effectiveness of staged pretraining in practical food delivery scenarios.
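The metric abbreviations above are not expanded on this page. Assuming R@K means Recall@K and N@K means NDCG@K with binary relevance (the usual reading in retrieval papers, not confirmed by the text here), they could be computed roughly as in the sketch below. The function names and the toy item IDs are hypothetical and are not taken from the paper.

```python
import math
from typing import Sequence, Set


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant items that appear in the top-k of the ranking."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)


def ndcg_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """NDCG@k with binary gains: each relevant hit at 0-based rank r adds 1/log2(r + 2)."""
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, item in enumerate(ranked[:k])
        if item in relevant
    )
    # Ideal ordering places every relevant item (up to k of them) at the top.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


# Toy example (hypothetical item IDs): two of three relevant items are retrieved.
ranked_items = ["a", "b", "c", "d", "e"]
relevant_items = {"b", "e", "z"}
for k in (1, 3, 5):
    print(k, recall_at_k(ranked_items, relevant_items, k), ndcg_at_k(ranked_items, relevant_items, k))
```

Recall@K only counts how many relevant items are retrieved, while NDCG@K also rewards placing them near the top, which is presumably why both families of metrics are reported.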
Abstract
Multimodal retrieval models are becoming increasingly important in scenarios such as food delivery, where rich multimodal features can meet diverse user needs and enable precise retrieval. Mainstream approaches typically employ a dual-tower architecture between queries and items, and perform joint optimization of intra-tower and inter-tower tasks. However, we observe that joint optimization often leads to certain modalities dominating the training process, while other modalities are neglected. In addition, inconsistent training speeds across modalities can easily result in the one-epoch problem. To address these challenges, we propose a staged pretraining strategy, which guides the model to focus on specialized tasks at each stage, enabling it to effectively attend to and utilize multimodal features, and allowing flexible control over the training process at each stage to avoid the…
Taxonomy
Topics: Information Retrieval and Search Behavior · Topic Modeling · Multimodal Machine Learning Applications
