Web-Scale Visual Entity Recognition: An LLM-Driven Data Approach

Mathilde Caron; Alireza Fathi; Cordelia Schmid; Ahmet Iscen

arXiv:2410.23676 · cs.CV · November 1, 2024

TL;DR

This paper introduces a data curation method for web-scale visual entity recognition that uses a multimodal large language model to verify candidate entity labels and generate supporting annotations, which improves the performance of models trained on the resulting data.

Contribution

It presents a new approach to automatically curate high-quality training data by leveraging LLMs for reasoning, verification, and explanation, enhancing visual entity recognition.

Findings

01. Achieved a +6.9% improvement on the OVEN entity task

02. Generated high-quality annotations with LLM reasoning

03. Demonstrated the importance of curated data for model performance

Abstract

Web-scale visual entity recognition, the task of associating images with their corresponding entities within vast knowledge bases like Wikipedia, presents significant challenges due to the lack of clean, large-scale training data. In this paper, we propose a novel methodology to curate such a dataset, leveraging a multimodal large language model (LLM) for label verification, metadata generation, and rationale explanation. Instead of relying on the multimodal LLM to directly annotate data, which we found to be suboptimal, we prompt it to reason about potential candidate entity labels by accessing additional contextually relevant information (such as Wikipedia), resulting in more accurate annotations. We further use the multimodal LLM to enrich the dataset by generating question-answer pairs and a grounded fine-grained textual description (referred to as "rationale") that explains the…
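The label-verification idea in the abstract (asking the model to reason about candidate entities with supporting context, rather than annotating from scratch) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and not the authors' implementation: the `Candidate` type, the prompt wording, the `select_label` helper, and the `toy_model` stand-in are all hypothetical, and the image is represented only by its caption, whereas a real multimodal call would also pass the image itself.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    """A candidate entity label plus the context shown to the model (hypothetical type)."""
    entity: str
    context: str  # e.g. a Wikipedia summary for the entity (assumed input)


def build_prompt(caption: str, candidate: Candidate) -> str:
    """Format a yes/no verification question for one candidate (assumed wording)."""
    return (
        f"Image caption: {caption}\n"
        f"Candidate entity: {candidate.entity}\n"
        f"Reference text: {candidate.context}\n"
        "Does the image show this entity? Answer yes or no."
    )


def select_label(
    caption: str,
    candidates: Sequence[Candidate],
    ask_model: Callable[[str], str],
) -> Optional[str]:
    """Return the first candidate the model confirms, or None if none pass."""
    for candidate in candidates:
        answer = ask_model(build_prompt(caption, candidate))
        if answer.strip().lower().startswith("yes"):
            return candidate.entity
    return None


def toy_model(prompt: str) -> str:
    """Stand-in for a multimodal LLM: says yes when the caption mentions the entity."""
    fields = dict(
        line.split(": ", 1) for line in prompt.splitlines() if ": " in line
    )
    entity = fields["Candidate entity"].lower()
    return "yes" if entity in fields["Image caption"].lower() else "no"


if __name__ == "__main__":
    candidates = [
        Candidate("Golden Gate Bridge", "Suspension bridge in San Francisco."),
        Candidate("Eiffel Tower", "Wrought-iron lattice tower in Paris."),
    ]
    print(select_label("The Eiffel Tower at night", candidates, toy_model))
```

Passing the model as a plain callable keeps the sketch independent of any particular LLM API; swapping `toy_model` for a real client would only require a function that maps a prompt to a text answer.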

Videos

Web-Scale Visual Entity Recognition: An LLM-Driven Data Approach · SlidesLive

Taxonomy

Topics: Web Data Mining and Analysis · Text and Document Classification Technologies · Biomedical Text Mining and Ontologies