ZSE-Cap: A Zero-Shot Ensemble for Image Retrieval and Prompt-Guided Captioning

Duc-Tai Dinh; Duc Anh Khoa Dinh

arXiv:2507.20564·cs.CL·July 29, 2025

TL;DR

ZSE-Cap is a zero-shot ensemble system for image retrieval and captioning that combines foundation models and prompting, achieving top-4 results in the EVENTA challenge without fine-tuning.

Contribution

It introduces a zero-shot approach that ensembles multiple foundation models and uses prompt-guided captioning, avoiding the need for task-specific fine-tuning.

Findings

01

Achieved a final score of 0.42002, placing in the top 4 on the private test set.

02

Effectively combined CLIP, SigLIP, and DINOv2 for retrieval.

03

Utilized prompt engineering to guide captioning with Gemma 3.
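The retrieval ensemble in finding 02 can be illustrated with a minimal sketch. It assumes embeddings from each model have already been computed for the query image and for every candidate image; the z-score normalization, the equal default weights, and the helper names `cosine_scores`, `ensemble_scores`, and `top_k` are illustrative assumptions, not details taken from the paper.

```python
import numpy as np


def cosine_scores(query, candidates):
    """Cosine similarity between one query vector and each candidate row."""
    q = query / np.linalg.norm(query)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    return c @ q


def ensemble_scores(per_model_scores, weights=None):
    """Fuse per-model similarity scores into one score per candidate.

    Assumption: each model's scores are z-normalized so that models with
    different score ranges contribute comparably, then combined as a
    weighted sum. The paper's actual fusion rule may differ.
    """
    names = sorted(per_model_scores)
    if weights is None:
        weights = {name: 1.0 for name in names}  # hypothetical equal weights
    total = np.zeros(len(per_model_scores[names[0]]), dtype=float)
    for name in names:
        s = np.asarray(per_model_scores[name], dtype=float)
        std = s.std()
        z = (s - s.mean()) / std if std > 0 else np.zeros_like(s)
        total += weights[name] * z
    return total


def top_k(scores, k):
    """Indices of the k highest-scoring candidates, best first."""
    return np.argsort(-scores)[:k]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    scores = {}
    # Embedding sizes below are placeholders, not the real model dimensions.
    for name, dim in [("clip", 8), ("siglip", 6), ("dinov2", 4)]:
        candidates = rng.normal(size=(5, dim))
        query = rng.normal(size=dim)
        scores[name] = cosine_scores(query, candidates)
    print(top_k(ensemble_scores(scores), k=3))
```

Normalizing before summing is one common way to keep a single model with a wider score range from dominating the ranking; a rank-based fusion would be an equally plausible alternative.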

Abstract

We present ZSE-Cap (Zero-Shot Ensemble for Captioning), our 4th-place system in the Event-Enriched Image Analysis (EVENTA) shared task on article-grounded image retrieval and captioning. Our zero-shot approach requires no fine-tuning on the competition's data. For retrieval, we ensemble similarity scores from CLIP, SigLIP, and DINOv2. For captioning, we leverage a carefully engineered prompt to guide the Gemma 3 model, enabling it to link high-level events from the article to the visual content in the image. Our system achieved a final score of 0.42002, securing a top-4 position on the private test set, demonstrating the effectiveness of combining foundation models through ensembling and prompting. Our code is available at https://github.com/ductai05/ZSE-Cap.
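To make the prompt-guided captioning step more concrete, here is a rough sketch of how an article-grounded prompt might be assembled before it is sent, together with the image, to a vision-language model such as Gemma 3. The function name `build_caption_prompt`, the instruction wording, and the `max_chars` truncation limit are all hypothetical; the authors' actual prompt is in their repository and is not reproduced here.

```python
def build_caption_prompt(article_title, article_text, max_chars=4000):
    """Assemble a captioning prompt grounded in the retrieved article.

    Assumption: the article is truncated to `max_chars` characters so the
    prompt stays within the model's context budget. The instruction text
    is illustrative only.
    """
    excerpt = article_text.strip()[:max_chars]
    return (
        "You are writing a caption for a news image.\n"
        "Use the article below to identify the event, people, places, "
        "and dates, and connect them to what is visible in the image.\n"
        "Do not state details that neither the article nor the image supports.\n\n"
        f"Article title: {article_title}\n"
        f"Article text: {excerpt}\n\n"
        "Caption:"
    )


if __name__ == "__main__":
    prompt = build_caption_prompt(
        "Example headline",
        "Example article body describing an event.",
    )
    print(prompt)
```

Keeping prompt construction in a plain function makes it easy to vary the instructions and compare the resulting captions without touching the model-calling code.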

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Generative Adversarial Networks and Image Synthesis