SAM2Auto: Auto Annotation Using FLASH
Arash Rocky, Q.M. Jonathan Wu

TL;DR
SAM2Auto is an automated video annotation pipeline that combines robust object detection and real-time segmentation, significantly reducing manual effort and costs while maintaining high accuracy across diverse datasets.
Contribution
The paper introduces SAM2Auto, which the authors describe as the first fully automated, dataset-agnostic video annotation system, requiring neither human intervention nor dataset-specific training.
Findings
Achieves annotation accuracy comparable to manual methods
Reduces annotation time and labor costs dramatically
Handles diverse datasets without retraining or extensive tuning
Abstract
Vision-Language Models (VLMs) lag behind Large Language Models due to the scarcity of annotated datasets, as creating paired visual-textual annotations is labor-intensive and expensive. To address this bottleneck, we introduce SAM2Auto, the first fully automated annotation pipeline for video datasets requiring no human intervention or dataset-specific training. Our approach consists of two key components: SMART-OD, a robust object detection system that combines automatic mask generation with open-world object detection capabilities, and FLASH (Frame-Level Annotation and Segmentation Handler), a multi-object real-time video instance segmentation (VIS) system that maintains consistent object identification across video frames even with intermittent detection gaps. Unlike existing open-world detection methods that require frame-specific hyperparameter tuning and suffer from numerous false…
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning
