ContextRefine-CLIP for EPIC-KITCHENS-100 Multi-Instance Retrieval Challenge 2025

Jing He; Yiqing Wang; Lingling Li; Kexin Zhang; Puhua Chen

arXiv:2506.10550 · cs.CV · June 13, 2025



TL;DR

This paper introduces ContextRefine-CLIP, a novel model for multi-instance visual-textual retrieval that enhances feature interaction and achieves state-of-the-art results on the EPIC-KITCHENS-100 challenge.

Contribution

It proposes a cross-modal attention flow module for bidirectional feature refinement within a dual-encoder framework, improving retrieval accuracy without ensemble methods.

Findings

01

Achieves 66.78 mAP and 82.08 nDCG on the EPIC-KITCHENS-100 public leaderboard

02

Significantly outperforms the baseline model without using ensemble learning

03

Validates effectiveness of cross-modal refinement in retrieval tasks
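
For readers unfamiliar with the two metrics reported above, the following is a minimal sketch of how average precision and nDCG can be computed for a single query. It is a generic textbook formulation, not the challenge's official evaluation code, which may differ in averaging and tie handling. The function names (dcg, ndcg, average_precision) and the toy inputs are hypothetical. The functions return values in [0, 1]; the leaderboard figures appear to be reported on a 0 to 100 scale.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain of relevance values listed in ranked order."""
    return sum(r / math.log2(i + 2) for i, r in enumerate(relevances))


def ndcg(scores, relevance):
    """nDCG for one query: rank items by score, normalise by the ideal ranking.

    `relevance` holds soft (graded) labels, one per candidate item.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranked = [relevance[i] for i in order]
    ideal = dcg(sorted(relevance, reverse=True))
    return dcg(ranked) / ideal if ideal > 0 else 0.0


def average_precision(scores, relevant):
    """Average precision for one query with binary relevance labels."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if relevant[i]:
            hits += 1
            total += hits / rank  # precision at each relevant position
    return total / hits if hits else 0.0


# Toy example (made-up scores and labels, for illustration only).
toy_scores = [0.9, 0.8, 0.1]
toy_ap = average_precision(toy_scores, [True, False, True])
toy_ndcg = ndcg(toy_scores, [1.0, 0.5, 0.0])
```

Mean average precision (mAP) is then the mean of the per-query average precision values; a mean nDCG is obtained the same way.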

Abstract

This report presents ContextRefine-CLIP (CR-CLIP), an efficient model for visual-textual multi-instance retrieval. The approach builds on the dual-encoder AVION and adds a cross-modal attention flow module that lets visual and textual features interact with and refine each other bidirectionally and dynamically, producing more context-aware joint representations. For tasks that provide soft-label relevance matrices, such as EPIC-KITCHENS-100, CR-CLIP can be combined with Symmetric Multi-Similarity Loss, using the refined features for more accurate semantic alignment and optimization. Without ensemble learning, CR-CLIP achieves 66.78 mAP and 82.08 nDCG on the EPIC-KITCHENS-100 public leaderboard, significantly outperforming the baseline model and validating its effectiveness in cross-modal retrieval. The code will be released open-source on…
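
The abstract does not spell out the cross-modal attention flow module, so the following is only a rough sketch of the general idea of bidirectional refinement, under stated assumptions: plain scaled dot-product attention with no learned projections, a fixed residual weight alpha, and one embedding vector per clip and per caption. The names cross_attend and bidirectional_refine, the value of alpha, and the random toy features are all hypothetical and are not taken from the paper or its repository.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attend(queries, context):
    """Scaled dot-product attention: each query row attends over context rows."""
    d = queries.shape[-1]
    weights = softmax(queries @ context.T / np.sqrt(d), axis=-1)
    return weights @ context


def bidirectional_refine(video, text, alpha=0.5):
    """Assumed refinement step: each modality adds context attended from the other.

    alpha is a hypothetical residual weight; a trained module would normally
    use learned query/key/value projections instead of raw features.
    """
    video_ref = video + alpha * cross_attend(video, text)
    text_ref = text + alpha * cross_attend(text, video)
    # L2-normalise so that dot products are cosine similarities.
    video_ref /= np.linalg.norm(video_ref, axis=-1, keepdims=True)
    text_ref /= np.linalg.norm(text_ref, axis=-1, keepdims=True)
    return video_ref, text_ref


# Toy features standing in for dual-encoder outputs (random, for illustration).
rng = np.random.default_rng(0)
video_feats = rng.normal(size=(4, 8))  # 4 clips, 8-dim embeddings
text_feats = rng.normal(size=(6, 8))   # 6 captions, 8-dim embeddings

v_ref, t_ref = bidirectional_refine(video_feats, text_feats)
similarity = v_ref @ t_ref.T           # (4, 6) video-to-text similarity matrix
```

In a retrieval setting, a similarity matrix of this kind would be compared against the soft-label relevance matrix by the training loss; the loss itself is omitted here because its exact form is not given in the abstract.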


Code & Models

Repositories

delcayr/contextrefine-clip (Official)


Taxonomy

Topics: Multimodal Machine Learning Applications · Image Retrieval and Classification Techniques · Advanced Image and Video Retrieval Techniques

Methods: Softmax · Attention Is All You Need