Label Anything: An Interpretable, High-Fidelity and Prompt-Free Annotator

Wei-Bin Kou, Guangxu Zhu, Rongguang Ye, Shuai Wang, Ming Tang, and Yik-Chung Wu

arXiv:2502.02972 · cs.RO · February 6, 2025

Open Access

TL;DR

The paper introduces LAM, a prompt-free, interpretable model that leverages a pretrained Vision Transformer and minimal training data to generate high-fidelity annotations for street scene datasets, reducing manual labeling costs.

Contribution

LAM combines a pretrained Vision Transformer (ViT), a semantic class adapter (SCA), and an optimization-oriented unrolling algorithm (OptOU) to produce accurate annotations from minimal training data while remaining interpretable.
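The adapter idea can be illustrated with a small sketch. The text here does not specify how SCA fuses ViT features, so the following is a hypothetical design, not the authors' method: it assumes SCA takes token features from several layers of a frozen ViT, mixes them with softmax-normalised per-layer weights, and maps the result to per-class logits with a single linear projection. The function name `fuse_vit_features`, the shapes (196 tokens, 768 channels, 19 classes), and the fusion rule are all assumptions.

```python
import numpy as np

def fuse_vit_features(layer_feats, layer_weights, class_proj):
    """Hypothetical SCA-style fusion (an assumption, not the paper's SCA).

    layer_feats:   list of L arrays, each (num_tokens, dim), taken from
                   several layers of a frozen, pretrained ViT.
    layer_weights: (L,) trainable scalars, softmax-normalised below.
    class_proj:    (dim, num_classes) trainable linear projection.
    Returns per-token class logits of shape (num_tokens, num_classes).
    """
    w = np.exp(layer_weights - layer_weights.max())
    w = w / w.sum()                            # softmax over layers
    stacked = np.stack(layer_feats)            # (L, num_tokens, dim)
    fused = np.tensordot(w, stacked, axes=1)   # weighted sum -> (num_tokens, dim)
    return fused @ class_proj

rng = np.random.default_rng(0)
L, T, D, C = 4, 196, 768, 19                   # assumed sizes for illustration
feats = [rng.standard_normal((T, D)) for _ in range(L)]   # stand-in ViT features
logits = fuse_vit_features(feats, np.zeros(L), rng.standard_normal((D, C)) * 0.01)
labels = logits.argmax(axis=1)                 # per-token class prediction
print(logits.shape)  # (196, 19)
```

Under these assumed shapes only `layer_weights` and `class_proj` would be trained (4 + 768 × 19 = 14,596 parameters) while the ViT stays frozen, which is one way such an adapter could keep its trainable parameter count small.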

Findings

1. Achieves nearly 100% mIoU on multiple datasets
2. Requires only a single seed image for training
3. Demonstrates high-fidelity annotations across real-world and simulated datasets
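The headline finding is stated in mean intersection-over-union (mIoU). For reference, here is a minimal sketch of the standard metric computed from a confusion matrix: per class, IoU = TP / (TP + FP + FN), and mIoU averages the per-class values. The `ignore_index=255` convention and the choice to average only over classes that occur in either map are assumptions; evaluation toolkits differ in these details.

```python
import numpy as np

def mean_iou(pred, target, num_classes, ignore_index=255):
    """mIoU over classes that appear in the prediction or the ground truth.

    pred, target: integer label maps of the same shape.
    Pixels whose target equals ignore_index (an assumed convention) are skipped.
    """
    mask = target != ignore_index
    p = pred[mask].astype(np.int64)
    t = target[mask].astype(np.int64)
    # Confusion matrix: rows = ground-truth class, columns = predicted class.
    cm = np.bincount(num_classes * t + p, minlength=num_classes ** 2)
    cm = cm.reshape(num_classes, num_classes)
    tp = np.diag(cm)
    union = cm.sum(axis=0) + cm.sum(axis=1) - tp   # TP + FP + FN
    present = union > 0
    return float(np.mean(tp[present] / union[present]))

target = np.array([[0, 0, 1], [1, 2, 2]])
pred = np.array([[0, 1, 1], [1, 2, 2]])
print(mean_iou(pred, target, num_classes=3))  # (1/2 + 2/3 + 1) / 3 ≈ 0.722
```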

Abstract

Learning-based street scene semantic understanding in autonomous driving (AD) has advanced significantly in recent years, but the performance of AD models depends heavily on the quantity and quality of annotated training data. However, traditional manual labeling is costly given the vast amount of data required to train robust models. To mitigate this cost, we propose a Label Anything Model (denoted as LAM), serving as an interpretable, high-fidelity, and prompt-free data annotator. Specifically, we first incorporate a pretrained Vision Transformer (ViT) to extract latent features. On top of the ViT, we propose a semantic class adapter (SCA) and an optimization-oriented unrolling algorithm (OptOU), both with very few trainable parameters. SCA is proposed to fuse ViT-extracted features to consolidate the basis of the subsequent…
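Algorithm unrolling turns a fixed number of iterations of an optimization solver into network layers, each with its own (learnable) parameters. The abstract excerpt does not give OptOU's objective, so the sketch below is a generic, hypothetical illustration of unrolling, not the paper's algorithm: it assumes a per-pixel score map `S` is refined by K unrolled gradient-descent steps on f(Z) = 1/2 ‖Z − S‖² + λ·E(Z), where E is an assumed 4-neighbour smoothness term. The step sizes and λ, which an unrolled network would learn, are fixed constants here, and all function names are illustrative.

```python
import numpy as np

def smoothness_grad(Z):
    """Gradient of E(Z) = 0.5 * sum of squared differences between
    4-neighbouring pixels (an assumed smoothness prior). Z: (H, W, C)."""
    g = np.zeros_like(Z)
    dv = Z[1:] - Z[:-1]          # vertical neighbour differences
    dh = Z[:, 1:] - Z[:, :-1]    # horizontal neighbour differences
    g[1:] += dv
    g[:-1] -= dv
    g[:, 1:] += dh
    g[:, :-1] -= dh
    return g

def objective(Z, S, lam):
    """f(Z) = 0.5 * ||Z - S||^2 + lam * E(Z)."""
    dv = Z[1:] - Z[:-1]
    dh = Z[:, 1:] - Z[:, :-1]
    return 0.5 * np.sum((Z - S) ** 2) + lam * 0.5 * (np.sum(dv ** 2) + np.sum(dh ** 2))

def unrolled_refine(S, step_sizes, lam):
    """One gradient step per entry of step_sizes (one unrolled 'layer' each)."""
    Z = S.copy()
    for alpha in step_sizes:
        grad = (Z - S) + lam * smoothness_grad(Z)
        Z = Z - alpha * grad
    return Z

rng = np.random.default_rng(0)
S = rng.standard_normal((8, 8, 3))       # hypothetical noisy class scores
lam = 0.5
Z = unrolled_refine(S, step_sizes=[0.2] * 5, lam=lam)
labels = Z.argmax(axis=-1)               # refined per-pixel labels
print(objective(S, S, lam), objective(Z, S, lam))
```

The step size 0.2 satisfies α ≤ 1/(1 + 8λ) for λ = 0.5 (the 4-neighbour grid Laplacian has eigenvalues at most 8), so each unrolled step is guaranteed not to increase f. Because every layer is an explicit step on a stated objective, each update can be read as "pull toward the input scores, then smooth", which is one reason unrolled models are often described as interpretable.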

Taxonomy

Topics: Speech and dialogue systems · Mobile Crowdsensing and Crowdsourcing · Text and Document Classification Technologies

Methods: Attention Is All You Need · Label Smoothing · Byte Pair Encoding · Residual Connection · Dense Connections · Linear Layer · Entropy Regularization · Multi-Head Attention · Position-Wise Feed-Forward Layer · Adam