TL;DR
F4-ITS introduces a novel vision-language framework that enhances food image-text retrieval through multi-modal feature fusion and ingredient-based re-ranking, significantly improving accuracy in food-related applications.
Contribution
The paper presents a training-free, multi-modal fusion strategy and a feature-based re-ranking mechanism that together improve food image-to-text retrieval performance.
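As a rough illustration of what a training-free weighted fusion of an image embedding with a caption embedding can look like, here is a minimal sketch in plain Python. It is not the paper's implementation: the function names (`fuse`, `retrieve`), the single weight `w_image`, and the toy 3-dimensional vectors are all assumptions made for the example. In practice the embeddings would come from pretrained image and text encoders.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def fuse(image_emb, text_emb, w_image=0.5):
    """Hypothetical weighted-sum fusion: normalize both inputs, mix, re-normalize."""
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)
    mixed = [w_image * a + (1.0 - w_image) * b for a, b in zip(img, txt)]
    return l2_normalize(mixed)


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))


def retrieve(query_emb, candidate_embs, k=1):
    """Return the indices of the k candidates most similar to the query."""
    scores = [cosine(query_emb, c) for c in candidate_embs]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


# Toy data (assumed): one query image, one generated caption, three candidate texts.
image_emb = [1.0, 0.0, 0.0]
caption_emb = [0.0, 1.0, 0.0]
candidates = [[1.0, 0.1, 0.0], [0.6, 0.8, 0.0], [0.0, 0.0, 1.0]]

query = fuse(image_emb, caption_emb, w_image=0.4)
top = retrieve(query, candidates, k=2)
image_only_top = retrieve(image_emb, candidates, k=1)
```

In this toy setup the image-only query ranks candidate 0 first, while the fused query ranks candidate 1 first, which is the kind of shift in top-1 results that fusion is meant to produce. No model is trained at any point; only the mixing weight is chosen.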
Findings
Achieved ~10% top-1 retrieval improvement in dense caption scenarios.
Realized ~7.7% top-1 retrieval improvement in sparse caption scenarios.
Attained ~28.6% gain in top-k ingredient-level retrieval.
Abstract
The proliferation of digital food content has intensified the need for robust and accurate systems capable of fine-grained visual understanding and retrieval. In this work, we address the challenging task of food image-to-text matching, a critical component in applications such as dietary monitoring, smart kitchens, and restaurant automation. We propose F4-ITS: Fine-grained Feature Fusion for Food Image-Text Search, a training-free, vision-language model (VLM)-guided framework that significantly improves retrieval performance through enhanced multi-modal feature representations. Our approach introduces two key contributions: (1) a uni-directional (and bi-directional) multi-modal fusion strategy that combines image embeddings with VLM-generated textual descriptions to improve query expressiveness, and (2) a novel feature-based re-ranking mechanism for top-k retrieval, leveraging predicted…
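The abstract is cut off before it says what the re-ranking step leverages, but the TL;DR describes it as ingredient-based. Under that assumption, one simple way to re-rank a top-k list is to blend each candidate's original similarity with the overlap between ingredients predicted for the query image and ingredients attached to the candidate. The sketch below is hypothetical throughout: the Jaccard overlap, the blending weight `alpha`, and the data layout are illustrative choices, not details taken from the paper.

```python
def jaccard(a, b):
    """Jaccard overlap between two ingredient lists (0.0 if both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rerank(top_k, predicted_ingredients, alpha=0.5):
    """Re-score (candidate_id, similarity, ingredients) triples and sort descending.

    alpha is an assumed blending weight: 0 keeps the original ranking,
    1 ranks purely by ingredient overlap.
    """
    rescored = [
        (cid, (1.0 - alpha) * sim + alpha * jaccard(predicted_ingredients, ings))
        for cid, sim, ings in top_k
    ]
    return sorted(rescored, key=lambda item: item[1], reverse=True)


# Toy top-3 list from a first-stage retriever (assumed values).
top_k = [
    ("caption_a", 0.82, ["beef", "onion", "rice"]),
    ("caption_b", 0.80, ["tofu", "rice", "scallion"]),
    ("caption_c", 0.60, ["pasta", "tomato"]),
]
# Ingredients a VLM might predict for the query image (assumed).
predicted = ["tofu", "scallion", "rice", "soy sauce"]

reranked = rerank(top_k, predicted, alpha=0.5)
reranked_ids = [cid for cid, _ in reranked]
```

Here `caption_b` overtakes `caption_a` after re-ranking because it shares three of the four predicted ingredients, even though its first-stage similarity was slightly lower.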
Peer Reviews
Decision·Submitted to ICLR 2026
S1: The training-free nature of the approach gives the method a clear practical edge, avoiding the high cost of fine-tuning large models on domain-specific data.
S2: The finding that lightweight fused models can rival their heavyweight counterparts is valuable for resource-limited scenarios.
W1: The key techniques, weighted fusion of image-text embeddings and VLM-based caption generation, are not novel. The contribution is primarily an engineering combination of existing methods applied to the food domain.
W2: The evaluation uses only small subsets (13K and 15K samples) from the MetaFood Challenge datasets, which raises questions about generalization. No evaluation on other well-known food datasets (e.g., Food-101, Recipe1M) is provided.
W3: In Equation 8, the weights (w
This paper tackles an important task, food image-text retrieval, which matters for downstream applications such as dietary monitoring and nutritional analysis. The framework is easy to understand.
There is no technical innovation in the framework. Extracting image and text features with CLIP encoders is common practice, and fusing them with weights is well known. The CLIP model is also relatively small; larger models, such as BLIPv3 or Qwen2.5-VL, should be used to test the performance of the proposed framework.
- The results are promising. The authors show that the weighted-sum fusion methods improve top-k ingredient retrieval by ~28.6% and dense caption retrieval by ~10%.
- Lack of novelty: The paper's main weakness is its limited novelty. The first key contribution is a weighted-sum fusion method, which is a widely known, basic ensemble technique [1]. The paper attempts to differentiate itself by focusing on the food domain, but this does not constitute a novel algorithmic contribution.
- Unclear results: Tables 1, 2, and subsequent tables do not specify which type of fusion (uni- or bi-directional) was used to obtain the results.
- Missing details: W
