Reverse Region-to-Entity Annotation for Pixel-Level Visual Entity Linking

Zhengfei Xu, Sijia Zhao, Yanchao Hao, Xiaolong Liu, Lili Li, Yuyang Yin, Bo Li, Xi Chen, Xin Xin

arXiv:2412.13614 · cs.CV · December 19, 2024

Open Access · 1 Repo · 1 Dataset · 1 Video

TL;DR

This paper introduces Pixel-Level Visual Entity Linking (PL-VEL), a new task that uses pixel masks for fine-grained visual understanding, supported by a large-scale dataset and a semantic tokenization method that improves accuracy.

Contribution

The paper proposes PL-VEL, a novel pixel-level entity linking task, and constructs the MaskOVEN-Wiki dataset with over 5 million annotations, along with a semantic tokenization approach for enhanced performance.

Findings

01. The reverse annotation framework achieved a 94.8% success rate.

02. Models trained on the dataset improved accuracy by 18 points.

03. Semantic tokenization improved accuracy by 5 points.

Abstract

Visual Entity Linking (VEL) is a crucial task for achieving fine-grained visual understanding, matching objects within images (visual mentions) to entities in a knowledge base. Previous VEL tasks rely on textual inputs, but writing queries for complex scenes can be challenging. Visual inputs like clicks or bounding boxes offer a more convenient alternative. Therefore, we propose a new task, Pixel-Level Visual Entity Linking (PL-VEL), which uses pixel masks from visual inputs to refer to objects, supplementing reference methods for VEL. To facilitate research on this task, we have constructed the MaskOVEN-Wiki dataset through an entirely automatic reverse region-entity annotation framework. This dataset contains over 5 million annotations aligning pixel-level regions with entity-level labels, which will advance visual understanding towards fine-grained. Moreover, as pixel masks…

Code & Models

Repositories

np-net-research/pl-vel · Official

Datasets

NP-NET/mask-oven-wiki
dataset · 17 downloads

Videos

Reverse Region-to-Entity Annotation for Pixel-Level Visual Entity Linking

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques · Web Data Mining and Analysis

Methods: Softmax · Attention Is All You Need