TL;DR
This paper introduces FiCo-ITR, a standardized evaluation framework for comparing fine-grained and coarse-grained image-text retrieval models, providing insights into their performance and efficiency trade-offs.
Contribution
It presents the FiCo-ITR library, which standardises evaluation so that fine-grained (FG) and coarse-grained (CG) image-text retrieval models can be compared directly.
Findings
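FG models (instance-level retrieval built on large-scale vision-language pretraining) achieve higher accuracy but are more computationally intensive.
CG models (category-level retrieval via cross-modal hashing) are more efficient but deliver lower retrieval performance.
The empirical analysis quantifies these performance-efficiency trade-offs to guide model choice.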
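To make the trade-off concrete, here is a minimal sketch of the two retrieval styles on toy data. It is not the FiCo-ITR API: every function name below is hypothetical, the vectors are made up, and the sign-based hashing is only a stand-in for a learned cross-modal hashing model. It contrasts ranking by cosine similarity over real-valued embeddings (FG-style) with ranking by Hamming distance over binary codes (CG-style), scored with an instance-level metric (Recall@K) and a category-level metric (average precision).

```python
import math

def cosine_rank(query, gallery):
    """Rank gallery indices by descending cosine similarity (real-valued embeddings)."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm
    return sorted(range(len(gallery)), key=lambda i: -cos(query, gallery[i]))

def to_hash_code(vec):
    """Toy hashing (assumption): one bit per dimension from its sign, packed into an int."""
    code = 0
    for x in vec:
        code = (code << 1) | (1 if x > 0 else 0)
    return code

def hamming_rank(query_code, gallery_codes):
    """Rank gallery indices by ascending Hamming distance (XOR, then count set bits)."""
    return sorted(
        range(len(gallery_codes)),
        key=lambda i: bin(query_code ^ gallery_codes[i]).count("1"),
    )

def recall_at_k(ranking, true_index, k):
    """Instance-level metric: is the single ground-truth match in the top k?"""
    return 1.0 if true_index in ranking[:k] else 0.0

def average_precision(ranking, relevant):
    """Category-level metric: mean precision at the rank of each relevant item."""
    hits, total = 0, 0.0
    for rank, idx in enumerate(ranking, start=1):
        if idx in relevant:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

# Made-up data: item 0 is the exact match; items 0 and 1 share the query's category.
query = [1.0, 0.2, -0.3]
gallery = [
    [0.9, 0.1, -0.2],
    [0.8, 0.5, -0.9],
    [-1.0, 0.3, 0.4],
    [0.1, -0.9, 0.7],
]

fg_ranking = cosine_rank(query, gallery)  # [0, 1, 3, 2]
gallery_codes = [to_hash_code(v) for v in gallery]
cg_ranking = hamming_rank(to_hash_code(query), gallery_codes)  # [0, 1, 2, 3]

fg_recall_at_1 = recall_at_k(fg_ranking, true_index=0, k=1)
cg_average_precision = average_precision(cg_ranking, relevant={0, 1})
```

In this toy case items 0 and 1 receive the same binary code as the query, so the Hamming ranking cannot tell the exact match from a same-category neighbour (item 0 comes first only because Python's sort is stable). The XOR-and-count comparison is much cheaper than a floating-point similarity, which is the efficiency side of the trade-off described above.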
Abstract
In the field of Image-Text Retrieval (ITR), recent advancements have leveraged large-scale Vision-Language Pretraining (VLP) for Fine-Grained (FG) instance-level retrieval, achieving high accuracy at the cost of increased computational complexity. For Coarse-Grained (CG) category-level retrieval, prominent approaches employ Cross-Modal Hashing (CMH) to prioritise efficiency, albeit at the cost of retrieval performance. Due to differences in methodologies, FG and CG models are rarely compared directly within evaluations in the literature, resulting in a lack of empirical data quantifying the retrieval performance-efficiency tradeoffs between the two. This paper addresses this gap by introducing the FiCo-ITR library, which standardises evaluation methodologies for both FG and CG models, facilitating direct comparisons. We conduct empirical evaluations of representative models from both…
