ContextRefine-CLIP for EPIC-KITCHENS-100 Multi-Instance Retrieval Challenge 2025
Jing He, Yiqing Wang, Lingling Li, Kexin Zhang, Puhua Chen

TL;DR
This paper introduces ContextRefine-CLIP, a novel model for multi-instance visual-textual retrieval that enhances feature interaction and achieves state-of-the-art results on the EPIC-KITCHENS-100 challenge.
Contribution
It proposes a cross-modal attention flow module for bidirectional feature refinement within a dual-encoder framework, improving retrieval accuracy without ensemble methods.
Findings
Achieves 66.78 mAP and 82.08 nDCG on the EPIC-KITCHENS-100 public leaderboard
Significantly outperforms the baseline model without using ensemble learning
Validates effectiveness of cross-modal refinement in retrieval tasks
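
For readers unfamiliar with the two reported metrics, the following is a minimal sketch of average precision (averaged over queries to give mAP) and nDCG for a single query, using their standard textbook definitions. The function names and toy inputs are illustrative assumptions; the challenge's official evaluation code may differ in details such as how relevance values are derived or thresholded.

```python
import math

def average_precision(scores, relevant):
    """AP for one query. scores[i] is the similarity to item i,
    relevant[i] is 1 if item i is a correct match, else 0."""
    # Rank items from most to least similar.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits, precision_sum = 0, 0.0
    for rank, idx in enumerate(order, start=1):
        if relevant[idx]:
            hits += 1
            precision_sum += hits / rank  # precision at this hit
    return precision_sum / hits if hits else 0.0

def ndcg(scores, relevance):
    """nDCG for one query with graded (soft) relevance values in [0, 1]."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    dcg = sum(relevance[idx] / math.log2(rank + 1)
              for rank, idx in enumerate(order, start=1))
    # Ideal DCG: the same relevance values sorted best-first.
    ideal = sorted(relevance, reverse=True)
    idcg = sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0

# Toy example (hypothetical values): three candidate items for one query.
toy_scores = [0.9, 0.1, 0.8]
print(average_precision(toy_scores, [1, 0, 1]))
print(ndcg(toy_scores, [1.0, 0.2, 0.5]))
```

mAP rewards placing binary-relevant items early in the ranking, while nDCG uses graded relevance, which is why both are reported for a task with a soft-label relevance matrix.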
Abstract
This report presents ContextRefine-CLIP (CR-CLIP), an efficient model for visual-textual multi-instance retrieval. The approach builds on the dual-encoder AVION, to which we add a cross-modal attention flow module that enables bidirectional, dynamic interaction and refinement between visual and textual features, producing more context-aware joint representations. For tasks that provide soft-label relevance matrices, such as EPIC-KITCHENS-100, CR-CLIP can be combined with Symmetric Multi-Similarity Loss, using the refined features to achieve more accurate semantic alignment and optimization. Without ensemble learning, CR-CLIP achieves 66.78 mAP and 82.08 nDCG on the EPIC-KITCHENS-100 public leaderboard, significantly outperforming the baseline model and validating its effectiveness in cross-modal retrieval. The code will be released open-source on…
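
The bidirectional refinement idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' module: the abstract does not specify its layers, so the sketch assumes single-head scaled dot-product attention with no learned projections and a simple residual update, and all names (`cross_attend`, `bidirectional_refine`) and shapes are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(queries, context):
    """Each row of `queries` attends over the rows of `context`
    (single head, no learned projections: an assumption of this sketch)."""
    d = queries.shape[-1]
    weights = softmax(queries @ context.T / np.sqrt(d), axis=-1)
    return weights @ context, weights

def bidirectional_refine(visual, text):
    """Refine each modality with context gathered from the other,
    using a residual connection (also an assumption)."""
    v_ctx, v_weights = cross_attend(visual, text)   # visual queries, text context
    t_ctx, t_weights = cross_attend(text, visual)   # text queries, visual context
    return visual + v_ctx, text + t_ctx, v_weights, t_weights

# Toy features: 4 visual vectors and 3 text vectors of dimension 8.
rng = np.random.default_rng(0)
visual = rng.standard_normal((4, 8))
text = rng.standard_normal((3, 8))

v_ref, t_ref, v_w, t_w = bidirectional_refine(visual, text)
print(v_ref.shape, t_ref.shape)
```

Both directions read from the original, unrefined features, so neither modality's update depends on the order in which the two attention passes are computed.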
Taxonomy
Topics: Multimodal Machine Learning Applications · Image Retrieval and Classification Techniques · Advanced Image and Video Retrieval Techniques
Methods: Softmax · Attention Is All You Need
