Storing Less, Finding More: How Novelty Filtering Improves Cross-Modal Retrieval on Edge Cameras
Sherif Abdelwahab

TL;DR
This paper introduces a streaming retrieval system for edge cameras that keeps only semantically novel frames on-device. Dropping redundant frames improves cross-modal retrieval accuracy while the on-device encoder runs at an estimated 2.7 mW.
Contribution
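It proposes an on-device epsilon-net filter that retains only semantically novel frames in a single streaming pass. The filter outperforms offline selection methods (k-means, farthest-point, uniform, random), and a cross-modal adapter plus cloud re-ranker compensate for the compact encoder's weak alignment.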
Findings
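The single-pass streaming filter outperforms offline alternatives (k-means, farthest-point, uniform, random) in retrieval accuracy.
The combined architecture reaches 45.6% Hit@5 on held-out data using an 8M on-device encoder at an estimated 2.7 mW.
The result holds across eight vision-language models (8M-632M) on two egocentric datasets (AEA, EPIC-KITCHENS).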
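For readers unfamiliar with the metric, Hit@5 is commonly defined as the fraction of queries for which at least one correct item appears among the top five retrieved results. The sketch below follows that common definition; the paper's exact evaluation protocol is not given here, and the function names and frame identifiers are hypothetical.

```python
def hit_at_k(ranked_ids, relevant_ids, k=5):
    """Return True if any relevant item appears in the top-k results."""
    return any(item in relevant_ids for item in ranked_ids[:k])


def mean_hit_at_k(results, k=5):
    """Average Hit@k over (ranked_ids, relevant_ids) pairs, one per query."""
    hits = [hit_at_k(ranked, relevant, k) for ranked, relevant in results]
    return sum(hits) / len(hits)


# Two illustrative queries (identifiers are made up for this example).
results = [
    (["f3", "f9", "f1"], {"f1"}),                    # relevant frame at rank 3
    (["f2", "f4", "f6", "f8", "f5", "f7"], {"f7"}),  # relevant frame at rank 6
]
score = mean_hit_at_k(results, k=5)
```

In the second query the correct frame sits just outside the top five, which is the crowding-out effect that redundant frames cause in top-k search.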
Abstract
Always-on edge cameras generate continuous video streams where redundant frames degrade cross-modal retrieval by crowding correct results out of top-k search. This paper presents a streaming retrieval architecture: an on-device epsilon-net filter retains only semantically novel frames, building a denoised embedding index; a cross-modal adapter and cloud re-ranker compensate for the compact encoder's weak alignment. A single-pass streaming filter outperforms offline alternatives (k-means, farthest-point, uniform, random) across eight vision-language models (8M-632M) on two egocentric datasets (AEA, EPIC-KITCHENS). Combined, the architecture reaches 45.6% Hit@5 on held-out data using an 8M on-device encoder at an estimated 2.7 mW.
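The novelty filter described in the abstract can be illustrated with a minimal sketch. It assumes cosine distance between non-zero frame embeddings and a fixed threshold `epsilon`; the abstract specifies neither the distance measure nor the threshold, so both are assumptions, as are the class name and the toy two-dimensional embeddings.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


class EpsilonNetFilter:
    """Single-pass filter: keep a frame only if it lies farther than
    epsilon from every embedding retained so far (hypothetical sketch)."""

    def __init__(self, epsilon):
        self.epsilon = epsilon
        self.kept = []  # retained embeddings, i.e. the compact index

    def offer(self, embedding):
        """Store the embedding and return True if it is novel."""
        for ref in self.kept:
            if cosine_distance(embedding, ref) <= self.epsilon:
                return False  # near-duplicate of a retained frame
        self.kept.append(embedding)
        return True


# Toy stream: frames 2 and 4 are near-duplicates of frames 1 and 3.
stream = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.02, 1.0], [-1.0, 0.0]]
f = EpsilonNetFilter(epsilon=0.1)
kept_flags = [f.offer(e) for e in stream]
```

Each incoming frame is compared only against the retained set, so the filter needs no look-ahead and no second pass over the stream. The linear scan over `kept` is for clarity; a real deployment would likely bound or index that set.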
