Label Anything: An Interpretable, High-Fidelity and Prompt-Free Annotator
Wei-Bin Kou, Guangxu Zhu, Rongguang Ye, Shuai Wang, Ming Tang, and Yik-Chung Wu

TL;DR
The paper introduces LAM, a prompt-free, interpretable model that leverages a pretrained Vision Transformer and minimal training data to generate high-fidelity annotations for street scene datasets, reducing manual labeling costs.
Contribution
The novel LAM framework combines a Vision Transformer, a semantic class adapter, and an optimization-based unrolling algorithm to produce accurate annotations with minimal training data and high interpretability.
Findings
Achieves nearly 100% mIoU on multiple datasets
Requires only a single seed image for training
Demonstrates high-fidelity annotations across real-world and simulated datasets
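For readers unfamiliar with the mIoU metric cited in the findings, the following is a minimal sketch of how mean intersection-over-union is commonly computed for semantic segmentation. It is not taken from the paper: the function name `mean_iou`, the toy label lists, and the choice to skip classes absent from both prediction and ground truth are assumptions made for illustration, and the authors' evaluation protocol may differ (for example, by accumulating counts over a whole dataset or ignoring a void label).

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over classes that occur in the prediction or the ground truth.

    pred and target are flattened per-pixel class ids of equal length.
    Classes absent from both are skipped (an assumption of this sketch).
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    ious = []
    for c in range(num_classes):
        # Pixels where both maps agree on class c.
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        # Pixels where either map says class c.
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else float("nan")


# Hypothetical toy example: six pixels, three classes.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 0, 1, 2, 2, 2]
score = mean_iou(pred, target, num_classes=3)
print(f"mIoU = {score:.4f}")
```

In this toy case the per-class IoUs are 1, 1/2, and 2/3, so the mean is 13/18. An mIoU near 100% would mean the predicted label map matches the reference almost pixel for pixel in every class.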
Abstract
Learning-based street scene semantic understanding in autonomous driving (AD) has advanced significantly in recent years, but the performance of an AD model depends heavily on the quantity and quality of its annotated training data. Traditional manual labeling, however, is costly given the vast amount of data required to train a robust model. To mitigate this cost, we propose a Label Anything Model (denoted as LAM), serving as an interpretable, high-fidelity, and prompt-free data annotator. Specifically, we first incorporate a pretrained Vision Transformer (ViT) to extract the latent features. On top of ViT, we propose a semantic class adapter (SCA) and an optimization-oriented unrolling algorithm (OptOU), both with a very small number of trainable parameters. SCA is proposed to fuse ViT-extracted features to consolidate the basis of the subsequent…
Taxonomy
Topics: Speech and dialogue systems · Mobile Crowdsensing and Crowdsourcing · Text and Document Classification Technologies
Methods: Attention Is All You Need · Label Smoothing · Byte Pair Encoding · Residual Connection · Dense Connections · Linear Layer · Entropy Regularization · Multi-Head Attention · Position-Wise Feed-Forward Layer · Adam
