ZSE-Cap: A Zero-Shot Ensemble for Image Retrieval and Prompt-Guided Captioning
Duc-Tai Dinh, Duc Anh Khoa Dinh

TL;DR
ZSE-Cap is a zero-shot ensemble system for image retrieval and captioning that combines foundation models and prompting, achieving top-4 results in the EVENTA challenge without fine-tuning.
Contribution
It introduces a zero-shot approach that ensembles multiple foundation models and uses prompt-guided captioning, avoiding the need for task-specific fine-tuning.
Findings
Achieved a final score of 0.42002, placing in the top 4 on the private test set.
Effectively combined CLIP, SigLIP, and DINOv2 for retrieval.
Utilized prompt engineering to guide captioning with Gemma 3.
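As a rough illustration of prompt-guided captioning, the sketch below assembles a text prompt that asks a vision-language model to tie the article's event to what is visible in the image. The function name `build_caption_prompt`, the `max_article_chars` limit, and the prompt wording are all hypothetical; the authors' actual prompt is in their repository and is not reproduced here. The model call itself is left out, since the summary does not say how Gemma 3 is invoked.

```python
def build_caption_prompt(article_title, article_text, max_article_chars=4000):
    """Build an illustrative captioning prompt (hypothetical wording)."""
    # Truncate long articles so the prompt stays within a context budget.
    # The 4000-character default is an assumption, not a value from the paper.
    snippet = article_text.strip()
    if len(snippet) > max_article_chars:
        snippet = snippet[:max_article_chars].rstrip() + " [truncated]"

    # The instructions ask for a caption grounded in both sources:
    # visible content from the image, event context from the article.
    return (
        "You are given a news article and an image associated with it.\n"
        "Write one detailed caption for the image.\n"
        "1. Describe what is visibly present in the image.\n"
        "2. Connect it to the event, people, places, and dates in the article.\n"
        "3. Do not state facts that are in neither the image nor the article.\n\n"
        f"Article title: {article_title}\n"
        f"Article text: {snippet}\n\n"
        "Caption:"
    )
```

The resulting string would be passed to the captioning model together with the image; keeping prompt construction in a separate function makes it easy to iterate on the wording without touching the inference code.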
Abstract
We present ZSE-Cap (Zero-Shot Ensemble for Captioning), our 4th-place system in the Event-Enriched Image Analysis (EVENTA) shared task on article-grounded image retrieval and captioning. Our zero-shot approach requires no fine-tuning on the competition's data. For retrieval, we ensemble similarity scores from CLIP, SigLIP, and DINOv2. For captioning, we leverage a carefully engineered prompt to guide the Gemma 3 model, enabling it to link high-level events from the article to the visual content in the image. Our system achieved a final score of 0.42002, securing a top-4 position on the private test set, demonstrating the effectiveness of combining foundation models through ensembling and prompting. Our code is available at https://github.com/ductai05/ZSE-Cap.
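The retrieval step described in the abstract, ensembling similarity scores from several encoders, can be sketched as follows. This is a minimal sketch under stated assumptions: embeddings from each model are taken as precomputed NumPy arrays, cosine similarity is used as the per-model score, each model's scores are min-max normalized per query before fusion, and the models are weighted equally by default. None of these choices is confirmed by the summary; the function names are hypothetical.

```python
import numpy as np


def cosine_scores(query_emb, cand_emb, eps=1e-12):
    """Cosine similarity matrix of shape (num_queries, num_candidates)."""
    q = query_emb / np.maximum(np.linalg.norm(query_emb, axis=1, keepdims=True), eps)
    c = cand_emb / np.maximum(np.linalg.norm(cand_emb, axis=1, keepdims=True), eps)
    return q @ c.T


def ensemble_scores(score_mats, weights=None, eps=1e-12):
    """Fuse per-model score matrices into one matrix.

    Assumption: per-query min-max normalization puts every model's scores
    on a common [0, 1] scale before a weighted sum.
    """
    if weights is None:
        weights = [1.0] * len(score_mats)  # equal weighting (assumed)
    if len(weights) != len(score_mats):
        raise ValueError("need exactly one weight per score matrix")
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()

    fused = np.zeros_like(score_mats[0], dtype=float)
    for weight, scores in zip(w, score_mats):
        lo = scores.min(axis=1, keepdims=True)
        hi = scores.max(axis=1, keepdims=True)
        fused += weight * (scores - lo) / (hi - lo + eps)
    return fused


def top_k(fused, k):
    """Indices of the k highest-scoring candidates for each query."""
    return np.argsort(-fused, axis=1)[:, :k]
```

In use, one score matrix would be computed per encoder (for example, one each from CLIP, SigLIP, and DINOv2 embeddings of the same queries and candidates), then passed as a list to `ensemble_scores`. Normalizing before summing keeps a model with a wider raw score range from dominating the ranking.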
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Generative Adversarial Networks and Image Synthesis
