Storing Less, Finding More: How Novelty Filtering Improves Cross-Modal Retrieval on Edge Cameras

Sherif Abdelwahab

arXiv:2603.29631 · cs.CV · April 1, 2026

TL;DR

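This paper presents a streaming retrieval system for always-on edge cameras that keeps only semantically novel frames on-device, so that redundant frames no longer crowd correct results out of top-k search. The full system reaches 45.6% Hit@5 on held-out data using an 8M on-device encoder at an estimated 2.7 mW.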

Contribution

It proposes an on-device epsilon-net filter that retains only semantically novel frames to build a denoised embedding index, paired with a cross-modal adapter and a cloud re-ranker that compensate for the compact encoder's weak alignment.

Findings

01

The single-pass streaming filter outperforms offline alternatives (k-means, farthest-point, uniform, random) in retrieval accuracy.

02

Reaches 45.6% Hit@5 on held-out data using an 8M on-device encoder at an estimated 2.7 mW.

03

Holds across eight vision-language models (8M-632M) on two egocentric datasets (AEA, EPIC-KITCHENS).
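
For readers unfamiliar with the Hit@5 metric cited in the findings, the sketch below shows one common way to compute Hit@k: a query counts as a hit if at least one relevant frame appears among the top k retrieved results. The function names, the frame identifiers, and the toy queries are hypothetical and for illustration only; the paper's exact evaluation protocol is not described on this page.

```python
def hit_at_k(ranked_ids, relevant_ids, k=5):
    """Return True if any relevant item appears in the top-k ranked results."""
    return any(item in relevant_ids for item in ranked_ids[:k])


def mean_hit_at_k(queries, k=5):
    """Fraction of queries with at least one relevant item in the top k.

    `queries` is a list of (ranked_ids, relevant_ids) pairs.
    """
    if not queries:
        raise ValueError("need at least one query")
    hits = sum(hit_at_k(ranked, relevant, k) for ranked, relevant in queries)
    return hits / len(queries)


# Toy data (hypothetical): the second query's relevant frame sits at rank 6,
# so it is crowded out of the top 5 and counts as a miss.
toy_queries = [
    (["f3", "f9", "f1"], {"f1"}),
    (["f2", "f4", "f6", "f8", "f10", "f7"], {"f7"}),
]
score = mean_hit_at_k(toy_queries, k=5)
```

The second toy query illustrates the failure mode the paper targets: when near-duplicate frames fill the top ranks, a correct result just below the cutoff contributes nothing to Hit@k.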

Abstract

Always-on edge cameras generate continuous video streams where redundant frames degrade cross-modal retrieval by crowding correct results out of top-k search. This paper presents a streaming retrieval architecture: an on-device epsilon-net filter retains only semantically novel frames, building a denoised embedding index; a cross-modal adapter and cloud re-ranker compensate for the compact encoder's weak alignment. A single-pass streaming filter outperforms offline alternatives (k-means, farthest-point, uniform, random) across eight vision-language models (8M-632M) on two egocentric datasets (AEA, EPIC-KITCHENS). Combined, the architecture reaches 45.6% Hit@5 on held-out data using an 8M on-device encoder at an estimated 2.7 mW.
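
The filtering idea in the abstract can be sketched as a single-pass epsilon-net: a frame's embedding is retained only if it lies farther than some threshold from every embedding already kept. This is a minimal illustration, not the paper's implementation. It assumes cosine distance and a hand-picked threshold, and the class name `EpsilonNetFilter`, the method `offer`, and the 2-D toy embeddings are all hypothetical.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_distance(a, b):
    """Cosine distance between two unit-length vectors."""
    return 1.0 - sum(x * y for x, y in zip(a, b))


class EpsilonNetFilter:
    """Single-pass novelty filter (assumed form, not the paper's code)."""

    def __init__(self, epsilon):
        self.epsilon = epsilon  # assumed novelty threshold on cosine distance
        self.kept = []          # retained (frame_id, unit embedding) pairs

    def offer(self, frame_id, embedding):
        """Keep the frame only if it is farther than epsilon from all kept frames."""
        unit = normalize(embedding)
        for _, kept_unit in self.kept:
            if cosine_distance(unit, kept_unit) <= self.epsilon:
                return False  # redundant: too close to something already stored
        self.kept.append((frame_id, unit))
        return True


# Toy stream (hypothetical): frames 1 and 3 nearly duplicate frames 0 and 2.
stream = [(0, [1.0, 0.0]), (1, [0.99, 0.05]), (2, [0.0, 1.0]), (3, [0.02, 1.0])]
novelty_filter = EpsilonNetFilter(epsilon=0.1)
decisions = [novelty_filter.offer(fid, emb) for fid, emb in stream]
kept_ids = [fid for fid, _ in novelty_filter.kept]
```

Each frame is compared against the retained set exactly once as it arrives, which is what makes the filter streaming. The linear scan here is the simplest choice for a sketch; an on-device system with a large index would likely need a cheaper nearest-neighbor check.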
