TL;DR
Paza is a cost-effective, model-agnostic zero-shot retail theft detection system that orchestrates multiple vision models to detect theft behaviors without training new models.
Contribution
It introduces a layered, multi-signal pipeline that cuts expensive VLM calls by 240x relative to per-frame analysis and lets the vision-language model be swapped without retraining, improving scalability and adaptability.
Findings
In the zero-shot setting, achieves 89.5% precision and 92.8% specificity at 59.3% recall.
Reduces VLM invocations by 240x compared to per-frame analysis.
Operates at a cost of $50-100/month per store, versus the $200-500/month per store charged by existing commercial systems.
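For readers less familiar with how these three metrics relate, the following minimal sketch computes them from a confusion matrix. The function name and the counts are hypothetical illustrations, not the paper's evaluation data.

```python
def detection_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute precision, specificity, and recall from confusion-matrix counts.

    tp: theft clips flagged as theft      fp: normal clips flagged as theft
    tn: normal clips left unflagged       fn: theft clips missed
    """
    return {
        # Of all alerts raised, the fraction that were real theft.
        "precision": tp / (tp + fp),
        # Of all normal clips, the fraction correctly left alone.
        "specificity": tn / (tn + fp),
        # Of all theft clips, the fraction that were caught.
        "recall": tp / (tp + fn),
    }


# Hypothetical counts chosen only to illustrate the arithmetic.
metrics = detection_metrics(tp=6, fp=1, tn=9, fn=4)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

High precision and specificity alongside moderate recall, as in the reported figures, describe a system that rarely raises false alarms but misses a share of real incidents.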
Abstract
Retail theft costs the global economy over $100 billion annually, yet existing AI-based detection systems require expensive custom model training on proprietary datasets and charge $200-500/month per store. We present Paza, a zero-shot retail theft detection framework that achieves practical concealment detection without training any model. Our approach orchestrates multiple existing models in a layered pipeline: cheap object detection and pose estimation run continuously, while an expensive vision-language model (VLM) is invoked only when behavioral pre-filters trigger. A multi-signal suspicion pre-filter (requiring dwell time plus at least one behavioral signal) reduces VLM invocations by 240x compared to per-frame analysis, bounding calls to at most 10 per minute and enabling a single GPU to serve 10-20 stores. The architecture is model-agnostic: the VLM component accepts any…
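The gating rule described in the abstract (dwell time plus at least one behavioral signal, with VLM calls capped per minute) can be sketched as below. This is an illustrative sketch under stated assumptions, not the authors' implementation: the class and field names, the 10-second dwell threshold, the signal label `"hand_to_pocket"`, and the sliding-window form of the rate limit are all hypothetical. Only the cap of 10 calls per minute comes from the abstract.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class TrackState:
    """Cheap per-person signals from detection and pose estimation (hypothetical fields)."""
    dwell_seconds: float
    behavioral_signals: set = field(default_factory=set)


class SuspicionPreFilter:
    """Decides whether a tracked person warrants an expensive VLM call."""

    def __init__(self, min_dwell_seconds: float = 10.0, max_calls_per_minute: int = 10):
        self.min_dwell_seconds = min_dwell_seconds        # assumed threshold
        self.max_calls_per_minute = max_calls_per_minute  # cap stated in the abstract
        self._call_times = deque()                        # timestamps of recent VLM calls

    def is_suspicious(self, track: TrackState) -> bool:
        # Multi-signal rule: dwell time AND at least one behavioral signal.
        return (track.dwell_seconds >= self.min_dwell_seconds
                and len(track.behavioral_signals) >= 1)

    def should_invoke_vlm(self, track: TrackState, now: float) -> bool:
        # Forget calls that fell out of the 60-second sliding window.
        while self._call_times and now - self._call_times[0] >= 60.0:
            self._call_times.popleft()
        if not self.is_suspicious(track):
            return False
        if len(self._call_times) >= self.max_calls_per_minute:
            return False  # budget exhausted; skip this trigger
        self._call_times.append(now)
        return True


prefilter = SuspicionPreFilter()
loitering_only = TrackState(dwell_seconds=30.0)
loitering_and_reaching = TrackState(dwell_seconds=30.0,
                                    behavioral_signals={"hand_to_pocket"})
print(prefilter.should_invoke_vlm(loitering_only, now=0.0))
print(prefilter.should_invoke_vlm(loitering_and_reaching, now=1.0))
```

Requiring both conditions means that a shopper who merely lingers, or who makes a single brief gesture while walking past, never reaches the VLM, which is where most of the call reduction would come from in a design like this.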
