Revisiting Human-in-the-Loop Object Retrieval with Pre-Trained Vision Transformers

Kawtar Zaher; Olivier Buisson; Alexis Joly

arXiv:2604.00809·cs.CV·April 30, 2026

TL;DR

This paper revisits human-in-the-loop object retrieval, combining pre-trained Vision Transformer (ViT) representations with an active learning loop to find diverse instances of an object category in complex, multi-object images.

Contribution

The paper revisits the task using ViT representations, examines the key design choices involved, and compares local and global feature extraction strategies on multi-object datasets.

Findings

01. Pre-trained ViT representations enhance retrieval performance.

02. Local descriptors improve detection of small, cluttered objects.

03. Active learning strategies effectively reduce annotation effort.

Abstract

Building on existing approaches, we revisit Human-in-the-Loop Object Retrieval, a task that consists of iteratively retrieving images containing objects of a class-of-interest, specified by a user-provided query. Starting from a large unlabeled image collection, the aim is to rapidly identify diverse instances of an object category relying solely on the initial query and the user's Relevance Feedback, with no prior labels. The retrieval process is formulated as a binary classification task, where the system continuously learns, through iterative user interaction, to distinguish images that are relevant to the query from those that are not. This interaction is guided by an Active Learning loop: at each iteration, the system selects informative samples for user annotation, thereby refining the retrieval performance. This task is particularly challenging in multi-object datasets, where the object…
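The loop described in the abstract (one query, binary relevance classification, samples selected for annotation at each iteration) can be illustrated with a minimal sketch. This is not the paper's implementation. It assumes image features have already been extracted (for example, one global embedding per image from a pre-trained ViT), and it substitutes a plain logistic regression and uncertainty sampling for whatever classifier and selection strategy the paper uses. The names `fit_logreg`, `predict_proba`, `feedback_loop`, and `oracle`, and the cold-start rule, are hypothetical and chosen for illustration only.

```python
import numpy as np


def fit_logreg(X, y, steps=500, lr=0.1, l2=1e-2):
    """Fit an L2-regularised logistic regression with plain gradient descent."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-np.clip(X @ w + b, -30, 30)))
        err = p - y  # gradient of the log-loss with respect to the logits
        w -= lr * (X.T @ err / len(y) + l2 * w)
        b -= lr * err.mean()
    return w, b


def predict_proba(X, w, b):
    """Return the predicted probability that each row of X is relevant."""
    return 1.0 / (1.0 + np.exp(-np.clip(X @ w + b, -30, 30)))


def feedback_loop(features, query_idx, oracle, rounds=5, batch=10):
    """Sketch of a relevance-feedback loop driven by uncertainty sampling.

    features : (n, d) array of precomputed image embeddings (assumed given).
    oracle   : callable index -> 0/1, standing in for the user's feedback.
    Assumes the collection holds more than rounds * batch images.
    """
    labels = {query_idx: 1}  # the query is the only label at the start
    for _ in range(rounds):
        pool = np.array([i for i in range(len(features)) if i not in labels])
        if len(set(labels.values())) < 2:
            # Cold start (illustrative assumption): a classifier needs both
            # classes, so ask about the images most and least similar to the query.
            order = np.argsort(features[pool] @ features[query_idx])
            half = batch // 2
            picked = np.concatenate(
                [pool[order[len(order) - half:]], pool[order[:batch - half]]]
            )
        else:
            keys = list(labels)
            y = np.array([labels[k] for k in keys], dtype=float)
            w, b = fit_logreg(features[np.array(keys)], y)
            p = predict_proba(features[pool], w, b)
            # Uncertainty sampling: annotate the images closest to p = 0.5.
            picked = pool[np.argsort(np.abs(p - 0.5))[:batch]]
        for i in picked:
            labels[int(i)] = int(oracle(int(i)))
    # Final model trained on everything the user has labeled so far.
    keys = list(labels)
    y = np.array([labels[k] for k in keys], dtype=float)
    w, b = fit_logreg(features[np.array(keys)], y)
    return predict_proba(features, w, b), labels
```

Calling `feedback_loop(features, query_idx, oracle)` returns a relevance score for every image in the collection together with the labels gathered along the way. Sorting the scores in descending order gives the retrieval ranking after the final round. In an interactive setting, `oracle` would be replaced by the user's actual answers.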
