Swiss DINO: Efficient and Versatile Vision Framework for On-device Personal Object Search
Kirill Paramonov, Jia-Xing Zhong, Umberto Michieli, Jijoong Moon, Mete Ozay

TL;DR
Swiss DINO is a lightweight, transformer-based framework enabling efficient on-device personal object search without additional training, significantly improving accuracy and resource efficiency for robotic home appliances.
Contribution
The paper introduces Swiss DINO, a novel one-shot personal object search framework that builds on DINOv2 and achieves high accuracy and efficiency without adaptation training.
Findings
Up to 55% improvement in segmentation and recognition accuracy.
Up to 100x reduction in backbone inference time.
Up to 10x reduction in GPU consumption.
Abstract
In this paper, we address a recent trend in robotic home appliances to include vision systems on personal devices, capable of personalizing the appliances on the fly. In particular, we formulate and address an important technical task of personal object search, which involves localization and identification of personal items of interest on images captured by robotic appliances, with each item referenced only by a few annotated images. The task is crucial for robotic home appliances and mobile systems, which need to process personal visual scenes or to operate with particular personal objects (e.g., for grasping or navigation). In practice, personal object search presents two main technical challenges. First, a robot vision system needs to be able to distinguish between many fine-grained classes, in the presence of occlusions and clutter. Second, the strict resource requirements for the…
Taxonomy
Topics: Visual Attention and Saliency Detection · Advanced Image and Video Retrieval Techniques · Advanced Neural Network Applications
Methods: Attention Is All You Need · Softmax · Residual Connection · Layer Normalization · Linear Layer · Dense Connections · Multi-Head Attention · Vision Transformer · self-DIstillation with NO labels
